
Create A Column In A Pyspark Dataframe Using A List Whose Indices Are Present In One Column Of The Dataframe

I'm new to Python and PySpark. I have a DataFrame in PySpark like the following:

## +---+---+-----+
## | x1| x2|   x3|
## +---+---+-----+
## |  0|  a| 13.0|
## |  2|  B|-33.0|

I want to create a new column x4 whose values come from a list, using the values in x1 as indices into that list.

Solution 1:

Since you're joining the indices of your array to your original DataFrame, one approach is to convert the array into a DataFrame, generate row_number() - 1 (which becomes your indices), and then join the two DataFrames together.

from pyspark.sql import Row

# Create the original DataFrame `df`
df = spark.createDataFrame(
    [(0, "a", 13.0), (2, "B", -33.0), (1, "B", -63.0)], ("x1", "x2", "x3"))
df.createOrReplaceTempView("df")

# Create a Row wrapper for the new column "x4"
row = Row("x4")

# Take the array
arr = [10, 12, 13]

# Convert the array to an RDD, then create a DataFrame from it
rdd = spark.sparkContext.parallelize(arr)
df2 = rdd.map(row).toDF()
df2.createOrReplaceTempView("df2")

# Create indices via row number
df3 = spark.sql("SELECT (row_number() OVER (ORDER BY x4)) - 1 AS indices, * FROM df2")
df3.createOrReplaceTempView("df3")
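One caveat: row_number() OVER (ORDER BY x4) only reproduces the list's original positions here because [10, 12, 13] happens to be sorted. If your list isn't sorted, here is a sketch using the RDD's zipWithIndex instead (the names rdd_indexed and df3_alt are my own):

# zipWithIndex pairs each element with its position in the RDD,
# so the indices follow the list order even for unsorted values
rdd_indexed = spark.sparkContext.parallelize(arr).zipWithIndex()
df3_alt = rdd_indexed.toDF(["x4", "indices"])
df3_alt.createOrReplaceTempView("df3_alt")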

Now that you have the two DataFrames, df and df3, you can run the SQL query below to join them.

SELECT a.x1, a.x2, a.x3, b.x4 FROM df a JOIN df3 b ON b.indices = a.x1
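For completeness, here is a minimal sketch of running that query from Python against the sample data (row order in the output may vary):

result = spark.sql("""
    SELECT a.x1, a.x2, a.x3, b.x4
    FROM df a
    JOIN df3 b ON b.indices = a.x1
""")
result.show()
## +---+---+-----+---+
## | x1| x2|   x3| x4|
## +---+---+-----+---+
## |  0|  a| 13.0| 10|
## |  1|  B|-63.0| 12|
## |  2|  B|-33.0| 13|
## +---+---+-----+---+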

Note: there is also a good reference answer on adding columns to DataFrames.
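If you'd rather avoid SQL strings, the same join can be sketched with the DataFrame API (the variable name joined is my own):

# Equivalent join expressed with the DataFrame API instead of SQL
joined = df.join(df3, df.x1 == df3.indices).select("x1", "x2", "x3", "x4")
joined.show()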
