Pandas - Non Overlapping Group Members
I have the following dataframe:

id  start  end  score
C1      2  592    157
C1    179  592     87
C1    113  553     82
C2    152  219    350
C2     13   70    319
C2
Solution 1:
Here is a shorter version that:
- Adds a column holding each row's range (end - start), relying on the fact that a wider range can contain a narrower one but never the reverse
- Sorts on the range column, widest first, to exploit this property
- Removes every nested row in each pass, so that rows aren't compared multiple times.
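As a rough illustration of these three steps outside pandas, here is a minimal sketch on plain tuples. The helper name `pick_non_nested` and the `(start, end, score)` tuple layout are assumptions made for this example, not part of the answer's code; the rows are the C3 rows from the sample data.

```python
def pick_non_nested(rows):
    """rows: list of (start, end, score) tuples for a single id (hypothetical helper)."""
    # Widest interval first, mirroring the sort on the range column.
    remaining = sorted(rows, key=lambda r: r[1] - r[0], reverse=True)
    keep = []
    while remaining:
        w_start, w_end, _ = remaining[0]
        # Everything contained in the widest remaining interval, itself included.
        nested = [r for r in remaining if r[0] >= w_start and r[1] <= w_end]
        # Of that nested set, keep only the highest-scoring row.
        keep.append(max(nested, key=lambda r: r[2]))
        # Drop the whole nested set so it is never compared again.
        remaining = [r for r in remaining if r not in nested]
    return keep


c3 = [(18, 38, 348), (20, 35, 320), (31, 57, 310)]
print(pick_non_nested(c3))  # [(31, 57, 310), (18, 38, 348)]
```

Because the list is sorted widest first, the rows come back in that order rather than in input order.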
This is just setup to make the example easy to run.
import pandas as pd
import numpy as np
import io as sio  # Python 3: StringIO lives in the io module
data = """
id,start,end,score
C1,2,592,157
C1,179,592,87
C1,113,553,82
C2,152,219,350
C2,13,70,319
C2,13,70,188
C2,15,70,156
C2,87,139,130
C2,92,140,102
C3,18,38,348
C3,20,35,320
C3,31,57,310
C4,347,51,514"""
data = pd.read_csv(sio.StringIO(data))
The next block does the work.
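data['range'] = data.end - data.start
# sort_values returns a new frame, so assign it; widest range first within each id
data = data.sort_values(['id', 'range'], ascending=[True, False])
g = data.groupby('id')

def f(df):
    keep = []
    while df.shape[0] > 0:
        widest = df.iloc[0]
        # rows contained in the widest remaining interval (including itself)
        nested = (df.start >= widest.start) & (df.end <= widest.end)
        retain = df.loc[nested]
        # of the nested rows, keep only the one with the highest score
        loc = retain.score.values.argmax()
        keep.append(retain.iloc[[loc]])
        df = df.loc[~nested]
    return pd.concat(keep)

# iterate over the groups so the id column is kept regardless of pandas version
out = pd.concat([f(group) for _, group in g]).drop(columns='range')
out = out.sort_index()  # back to the input row order
out.index = np.arange(out.shape[0])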
Using the data above, out is:
In [3]: out
Out[3]:
   id  start  end  score
0  C1      2  592    157
1  C2    152  219    350
2  C2     13   70    319
3  C2     87  139    130
4  C2     92  140    102
5  C3     18   38    348
6  C3     31   57    310
7  C4    347   51    514
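One way to sanity-check this result is to test every pair of kept rows with the same id. The sketch below is an assumption-laden illustration, not part of the answer: the rows are typed in by hand from the output above (the C4 row follows the CSV setup data), and `is_nested` and `overlaps` are hypothetical helper names.

```python
# Rows copied by hand from the output above: (id, start, end, score).
kept = [
    ("C1", 2, 592, 157),
    ("C2", 152, 219, 350),
    ("C2", 13, 70, 319),
    ("C2", 87, 139, 130),
    ("C2", 92, 140, 102),
    ("C3", 18, 38, 348),
    ("C3", 31, 57, 310),
    ("C4", 347, 51, 514),
]


def is_nested(inner, outer):
    # True when inner lies entirely within outer.
    return inner[1] >= outer[1] and inner[2] <= outer[2]


def overlaps(a, b):
    # True when the two closed intervals share at least one point.
    return a[2] >= b[1] and b[2] >= a[1]


# Every unordered pair of kept rows that share an id.
pairs = [(a, b) for i, a in enumerate(kept) for b in kept[i + 1:] if a[0] == b[0]]
nested_pairs = [(a, b) for a, b in pairs if is_nested(a, b) or is_nested(b, a)]
overlapping_pairs = [(a, b) for a, b in pairs if overlaps(a, b)]

print(nested_pairs)
print(overlapping_pairs)
```

On these rows no kept interval is nested inside another, but two pairs still partially overlap (87 to 139 with 92 to 140 in C2, and 18 to 38 with 31 to 57 in C3). That suggests this solution filters nested intervals rather than every overlap.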
Solution 2:
This is shorter and meets all requirements. You need:
- A way to check overlap
- A way to group your data by ID
- A way to grab the best from each group, after checking overlap.
This does all of those, cheating by using logic and groupby
# from Ned Batchelder
# http://nedbatchelder.com/blog/201310/range_overlap_in_two_compares.html
def overlap(start1, end1, start2, end2):
    """
    Does the range (start1, end1) overlap with (start2, end2)?
    """
    return end1 >= start2 and end2 >= start1

def compare_rows(group):
    winners = []
    skip = []
    # a single-row group has nothing to compare against
    if len(group) == 1:
        return group[['start', 'end', 'score']]
    for i in group.index:
        if i in skip:
            continue
        for j in group.index:
            last = j == group.index[-1]
            istart = group.loc[i, 'start']
            iend = group.loc[i, 'end']
            jstart = group.loc[j, 'start']
            jend = group.loc[j, 'end']
            if overlap(istart, iend, jstart, jend):
                # of the two overlapping rows, the higher score wins
                winner = group.loc[[i, j], 'score'].idxmax()
                if winner == j:
                    winners.append(winner)
                    skip.append(i)
                    break
            if last:
                winners.append(i)
    return group.loc[winners, ['start', 'end', 'score']].drop_duplicates()

# df is the dataframe from the question
grouped = df.groupby('id')
print(grouped.apply(compare_rows))
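To see what the two-comparison overlap test accepts, here is a small standalone check. It repeats the `overlap` function so the snippet runs on its own. The first three pairs are intervals from the C2 rows of the sample data, and the last pair is a made-up example that only shares an endpoint.

```python
def overlap(start1, end1, start2, end2):
    # Two closed ranges overlap when each one starts before the other ends.
    return end1 >= start2 and end2 >= start1


checks = {
    "disjoint": overlap(152, 219, 13, 70),
    "partial": overlap(87, 139, 92, 140),
    "nested": overlap(13, 70, 15, 70),
    # Hypothetical pair, not in the data: the ranges touch only at 70.
    "shared endpoint": overlap(13, 70, 70, 80),
}
print(checks)
```

Because the comparisons use `>=`, ranges that merely touch at an endpoint count as overlapping. Switching to strict `>` would treat them as separate, if that is the behavior you want.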