Get 20th To 80th Percentile Of Each Group - Pyspark
I have three columns in a PySpark data frame (sample data given below). I want to remove the outliers from each orderType. In order to do that I am removing the top Nth
Solution 1:
You can compute approx_percentile over a window partitioned by orderType, then filter on the result:
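import pyspark.sql.functions as F
df2 = df.withColumn(
    'percentile',
    # Per-orderType [20th, 80th] percentiles of amount, attached to every row.
    # The third argument (100) is the accuracy; higher values are more precise.
    F.expr("approx_percentile(amount, array(0.2, 0.8), 100) over (partition by orderType)")
).filter(
    # Keep only rows whose amount lies within the group's two bounds (inclusive).
    'amount between percentile[0] and percentile[1]'
)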
Usage of the function is documented in the Spark SQL built-in functions reference, under approx_percentile (an alias of percentile_approx).
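To see what the filter keeps, here is a minimal pure-Python sketch of the same idea on a tiny data set, without Spark. It assumes a simple nearest-rank percentile, which is not the approximation algorithm Spark uses, so the bounds on real data may differ slightly. The helper names (nearest_rank, keep_middle) and the sample rows are hypothetical and only serve the illustration.

```python
import math
from collections import defaultdict


def nearest_rank(sorted_vals, p):
    """Nearest-rank percentile: the value at rank ceil(p * n) in the sorted list."""
    k = max(1, math.ceil(p * len(sorted_vals)))
    return sorted_vals[k - 1]


def keep_middle(rows, low=0.2, high=0.8):
    """Keep (order_type, amount) rows whose amount lies between the
    low and high percentile of their own order_type (bounds inclusive)."""
    groups = defaultdict(list)
    for order_type, amount in rows:
        groups[order_type].append(amount)

    # One (lower, upper) pair per group, like the 'percentile' array column.
    bounds = {}
    for order_type, amounts in groups.items():
        amounts.sort()
        bounds[order_type] = (nearest_rank(amounts, low), nearest_rank(amounts, high))

    return [(t, a) for t, a in rows if bounds[t][0] <= a <= bounds[t][1]]


# Hypothetical sample data: two order types with different value ranges.
rows = [("A", v) for v in range(1, 11)] + [("B", v) for v in (5, 50, 500, 5000, 50000)]
result = keep_middle(rows)
```

Each group gets its own bounds, so a large amount that is normal for one orderType does not cause rows of another orderType to be dropped. This mirrors the role of partition by orderType in the Spark expression.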