Get 20th To 80th Percentile Of Each Group - Pyspark
I have three columns in a PySpark DataFrame (sample data given below). I want to remove the outliers from each orderType. In order to do that I am removing the top Nth
Solution 1:
You can use approx_percentile as a window function over each orderType, then filter:
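import pyspark.sql.functions as F

# Attach a two-element array [20th percentile, 80th percentile] of amount,
# computed per orderType, to every row.
df2 = df.withColumn(
    'percentile',
    F.expr("approx_percentile(amount, array(0.2, 0.8), 100) over (partition by orderType)")
).filter(
    # BETWEEN is inclusive, so rows equal to either bound are kept.
    'amount between percentile[0] and percentile[1]'
)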
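Usage of the function is documented in the Spark SQL built-in functions reference. The third argument (100 here) is the accuracy: higher values give a closer approximation at the cost of more memory.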
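To sanity-check what the filter keeps, here is a small plain-Python sketch of the same idea on toy data. It is only an illustration: the rows are made up, the helper names nearest_rank and keep_middle are hypothetical, and it uses an exact nearest-rank percentile, which is an assumption and not the approximate algorithm Spark uses, so the bounds may differ slightly on real data.

```python
import math
from collections import defaultdict


def nearest_rank(sorted_values, p):
    """Exact nearest-rank percentile (an assumption, not Spark's algorithm)."""
    rank = max(1, math.ceil(p * len(sorted_values)))
    return sorted_values[rank - 1]


def keep_middle(rows, low=0.2, high=0.8):
    """Keep (orderType, amount) rows whose amount lies within the
    [low, high] percentile bounds of its own orderType, inclusive."""
    groups = defaultdict(list)
    for order_type, amount in rows:
        groups[order_type].append(amount)

    # Per-group (lower, upper) bounds, like percentile[0] and percentile[1].
    bounds = {}
    for order_type, amounts in groups.items():
        ordered = sorted(amounts)
        bounds[order_type] = (nearest_rank(ordered, low), nearest_rank(ordered, high))

    return [
        (order_type, amount)
        for order_type, amount in rows
        if bounds[order_type][0] <= amount <= bounds[order_type][1]
    ]


# Hypothetical toy data: group "A" contains an obvious outlier (100).
rows = [("A", a) for a in [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]]
rows += [("B", a) for a in [10, 20, 30, 40, 50]]

result = keep_middle(rows)
print(result)
```

Each group gets its own bounds, so the outlier in group "A" has no effect on which rows of group "B" survive. That is the role of partition by orderType in the Spark expression.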