
Can't Apply A Pandas_udf In Pyspark

I'm trying out some PySpark-related experiments in a Jupyter notebook attached to an AWS EMR instance. I have a Spark DataFrame which reads data from S3 and then filters out some stuff…

Solution 1:

I think pandas_udf doesn't support all Spark types yet, and it seems to be having trouble with your date_time column.
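If the UDF doesn't actually need date_time, the cleanest fix is to leave it out entirely (see the select below). If it does, one possible workaround, assuming date_time is a timestamp column, is to cast it to a type pandas_udf can serialize, such as a string, before grouping:

# cast the problematic timestamp column to a plain string (workaround sketch)
df1 = df1.withColumn('date_time', df1['date_time'].cast('string'))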

One issue with any UDF is that all of the data has to be materialized for it, even if the UDF ignores some of the values, which can cause errors like this or, at a minimum, degrade performance. All else being equal, you should reduce the number of columns you pass into your UDF, for example by adding a select before your groupBy:

df2 = df1.select('idvalue', 'hour').groupBy('idvalue').apply(normalize)
df2.show()
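For context, a grouped-map pandas_udf along these lines would work with the call above. This is a minimal sketch, assuming idvalue is a string, hour is numeric, and that normalize standard-scores the hour column within each group; none of that is confirmed by the question:

from pyspark.sql.functions import pandas_udf, PandasUDFType

# the schema string must describe the DataFrame the UDF returns
@pandas_udf("idvalue string, hour double", PandasUDFType.GROUPED_MAP)
def normalize(pdf):
    # pdf is a pandas DataFrame holding every row of one idvalue group
    hour = pdf.hour
    return pdf.assign(hour=(hour - hour.mean()) / hour.std())

Because the select passes only idvalue and hour into the UDF, the date_time column is never materialized, which sidesteps the type-support problem entirely.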
