
Get 20th To 80th Percentile Of Each Group - Pyspark

I have three columns in a PySpark DataFrame (sample data given below). I want to remove the outliers from each orderType. In order to do that I am removing the top Nth

Solution 1:

You can compute approx_percentile over a window partitioned by orderType, then filter on the resulting bounds:

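import pyspark.sql.functions as F

# Attach the 20th and 80th percentiles of `amount` for each orderType
# as an array column, then keep rows whose amount lies between them (inclusive).
df2 = df.withColumn(
    'percentile',
    F.expr("approx_percentile(amount, array(0.2, 0.8), 100) over (partition by orderType)")
).filter(
    'amount between percentile[0] and percentile[1]'
)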

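The third argument (100 here) is the accuracy parameter: larger values give a closer approximation at the cost of more memory. Usage of approx_percentile is documented in the Spark SQL built-in functions reference.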
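To make the per-group logic concrete, here is a minimal plain-Python sketch of the same idea: compute a lower and upper percentile for each group, then keep only the rows inside those bounds. It assumes a simple nearest-rank percentile, which is not the approximate algorithm Spark uses, so results can differ slightly from approx_percentile. The helper names (`nearest_rank`, `filter_between_percentiles`) and the sample rows are hypothetical and exist only for illustration.

```python
import math
from collections import defaultdict


def nearest_rank(sorted_values, p):
    """Nearest-rank percentile (an assumption, not Spark's algorithm):
    the smallest value with at least a fraction p of the data at or below it."""
    rank = max(1, math.ceil(p * len(sorted_values)))
    return sorted_values[rank - 1]


def filter_between_percentiles(rows, low=0.2, high=0.8):
    """Keep (order_type, amount) rows whose amount lies between the
    low and high percentiles of its own order_type, bounds inclusive."""
    groups = defaultdict(list)
    for order_type, amount in rows:
        groups[order_type].append(amount)

    # One (lower, upper) pair per group, like the `percentile` array column.
    bounds = {}
    for order_type, amounts in groups.items():
        ordered = sorted(amounts)
        bounds[order_type] = (nearest_rank(ordered, low), nearest_rank(ordered, high))

    return [
        (order_type, amount)
        for order_type, amount in rows
        if bounds[order_type][0] <= amount <= bounds[order_type][1]
    ]


# Made-up sample data: two order types with ten amounts each.
rows = [("A", v) for v in range(1, 11)]
rows += [("B", v) for v in (1, 50, 52, 55, 58, 60, 61, 63, 65, 1000)]

kept = filter_between_percentiles(rows)
```

Because the bounds are computed per group, the extreme amounts in group B (1 and 1000) are dropped without affecting which rows survive in group A.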
