
What Type Should The Dense Vector Be, When Using UDF Function In Pyspark?

I want to convert a list column to a vector column in PySpark, and then use that column to train a machine learning model. But my Spark version is 1.6.0, which does not have VectorUDT(). So what

Solution 1:

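You can use Vectors and VectorUDT with a UDF. Note that pyspark.ml.linalg, imported below, exists only in Spark 2.0 and later; on Spark 1.6 the same class names are in pyspark.mllib.linalg.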

from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.sql import functions as F

# Wrap Vectors.dense in a UDF whose declared return type is VectorUDT
ud_f = F.udf(lambda r: Vectors.dense(r), VectorUDT())
df = df.withColumn('b', ud_f('a'))
df.show(truncate=False)
+-------------------------+---------------------+
|a                        |b                    |
+-------------------------+---------------------+
|[0.1, 0.2, 0.3, 0.4, 0.5]|[0.1,0.2,0.3,0.4,0.5]|
+-------------------------+---------------------+

df.printSchema()
root
  |-- a: array (nullable = true)
  |    |-- element: double (containsNull = true)
  |-- b: vector (nullable = true)

For more on VectorUDT, see the module source: http://spark.apache.org/docs/2.2.0/api/python/_modules/pyspark/ml/linalg.html

