What Type Should The Dense Vector Be, When Using UDF Function In Pyspark?
I want to convert a list column to a Vector in PySpark, and then feed this column to a machine learning model for training. But my Spark version is 1.6.0, which does not have VectorUDT(). So what type should the dense vector be when using a UDF?
Solution 1:
You can use Vectors and VectorUDT with a UDF:
from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.sql import functions as F

# Wrap the list-to-vector conversion in a UDF whose return type is VectorUDT()
ud_f = F.udf(lambda r: Vectors.dense(r), VectorUDT())
df = df.withColumn('b', ud_f('a'))
df.show()
+-------------------------+---------------------+
|a |b |
+-------------------------+---------------------+
|[0.1, 0.2, 0.3, 0.4, 0.5]|[0.1,0.2,0.3,0.4,0.5]|
+-------------------------+---------------------+
df.printSchema()
root
|-- a: array (nullable = true)
| |-- element: double (containsNull = true)
|-- b: vector (nullable = true)
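Once the column is a proper vector, it can be fed to an ML estimator via featuresCol. The snippet below is a minimal sketch, not part of the original answer: it assumes a Spark 2.x ML estimator (KMeans here, chosen only for illustration) and the vector column 'b' produced above.

from pyspark.ml.clustering import KMeans

# Hypothetical follow-up: train a clustering model directly on the vector column 'b'
kmeans = KMeans(featuresCol='b', k=2, seed=1)
model = kmeans.fit(df)
model.clusterCenters()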
For more about VectorUDT, see http://spark.apache.org/docs/2.2.0/api/python/_modules/pyspark/ml/linalg.html
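Since the question mentions Spark 1.6.0: pyspark.ml.linalg was only introduced in Spark 2.0, so on 1.6 the vector types live in pyspark.mllib.linalg instead. The following is a sketch under that assumption (same idea, different import); it has not been verified on a 1.6 cluster.

from pyspark.mllib.linalg import Vectors, VectorUDT  # Spark 1.x location of the vector types
from pyspark.sql import functions as F

# Same conversion as above, using the mllib types that Spark 1.6 expects
list_to_vector = F.udf(lambda r: Vectors.dense(r), VectorUDT())
df = df.withColumn('b', list_to_vector('a'))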