
Can't Apply A Pandas_udf In Pyspark

I'm trying out some PySpark-related experiments in a Jupyter notebook attached to an AWS EMR instance. I have a Spark DataFrame which reads data from S3 and then filters out some stuff…

Solution 1:

I think pandas_udf doesn't support all Spark types yet, and it seems to be having trouble with your date_time column.
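If the UDF doesn't actually need date_time, the cleanest fix is to leave it out entirely (see the select below). If it does, one possible workaround, assuming date_time is a timestamp column, is to cast it to a type pandas_udf can serialize, such as a string, before grouping:

# cast the problematic timestamp column to a plain string (workaround sketch)
df1 = df1.withColumn('date_time', df1['date_time'].cast('string'))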

One issue with any UDF is that all of the data has to be materialized for it, even if the UDF ignores some of the values, which can cause errors like this or, at a minimum, degrade performance. All else being equal, you should reduce the number of columns you pass into your UDF, for example by adding a select before your groupBy:

df2 = df1.select('idvalue', 'hour').groupBy('idvalue').apply(normalize)
df2.show()
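For context, a grouped-map pandas_udf along these lines would work with the call above. This is a minimal sketch, assuming idvalue is a string, hour is numeric, and that normalize standard-scores the hour column within each group; none of that is confirmed by the question:

from pyspark.sql.functions import pandas_udf, PandasUDFType

# the schema string must describe the DataFrame the UDF returns
@pandas_udf("idvalue string, hour double", PandasUDFType.GROUPED_MAP)
def normalize(pdf):
    # pdf is a pandas DataFrame holding every row of one idvalue group
    hour = pdf.hour
    return pdf.assign(hour=(hour - hour.mean()) / hour.std())

Because the select passes only idvalue and hour into the UDF, the date_time column is never materialized, which sidesteps the type-support problem entirely.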
