
Pandas - Non Overlapping Group Members

I have the following dataframe:

id  start  end  score
C1      2  592    157
C1    179  592     87
C1    113  553     82
C2    152  219    350
C2     13   70    319
C2

Solution 1:

Here is a shorter version that:

  • Adds a column holding the width of each range (end - start), and uses the fact that a range can only be nested inside a range that is at least as wide
  • Sorts on the range column, widest first, to exploit this property
  • Removes every nested row in each pass, so that no row is compared more than once.
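
The three steps above can be sketched in plain Python for a single id. This is a minimal illustration, not the answer's own code: the helper name `best_per_nest` and the `(start, end, score)` tuple layout are assumptions made for the example, and the rows are the C3 rows of the sample data.

```python
# Hypothetical helper: rows are (start, end, score) tuples belonging to ONE id.
def best_per_nest(rows):
    # Widest range first, since only a wider range can contain a narrower one.
    remaining = sorted(rows, key=lambda r: r[1] - r[0], reverse=True)
    kept = []
    while remaining:
        w_start, w_end, _ = remaining[0]
        # Rows lying fully inside the widest remaining range (this includes itself).
        nested = [r for r in remaining if r[0] >= w_start and r[1] <= w_end]
        # Keep the best-scoring row of this nest.
        kept.append(max(nested, key=lambda r: r[2]))
        # Drop the whole nest so its rows are never compared again.
        remaining = [r for r in remaining if r not in nested]
    return kept


c3 = [(18, 38, 348), (20, 35, 320), (31, 57, 310)]
result = best_per_nest(c3)
print(result)  # [(31, 57, 310), (18, 38, 348)]
```

Here 20-35 is nested inside 18-38 and loses on score, while 31-57 is not nested in anything and is kept on its own.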

This is just setup to make the example easy to run.

import pandas as pd
import numpy as np
import io as sio  # Python 3: StringIO lives in the io module


data = """
id,start,end,score
C1,2,592,157
C1,179,592,87
C1,113,553,82
C2,152,219,350
C2,13,70,319
C2,13,70,188
C2,15,70,156
C2,87,139,130
C2,92,140,102
C3,18,38,348
C3,20,35,320
C3,31,57,310
C4,347,51,514"""

data = pd.read_csv(sio.StringIO(data))

The next block does the work.

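data['range'] = data.end - data.start
# Widest range first within each id; sort_values returns a new frame, so assign it
data = data.sort_values(['id', 'range'], ascending=[True, False])
g = data.groupby('id')

def f(df):
    keep = []
    while df.shape[0] > 0:
        # The widest remaining range; every row nested in it forms one nest
        widest = df.iloc[0]
        nested = (df.start >= widest.start) & (df.end <= widest.end)
        retain = df.loc[nested]
        # Keep the best-scoring row of the nest
        loc = retain.score.values.argmax()
        keep.append(retain.iloc[[loc]])
        # Drop the nest so its rows are not compared again
        df = df.loc[np.logical_not(nested)]
    return pd.concat(keep, axis=0)

# Iterate over the groups explicitly so the id column is kept on any pandas version,
# then restore the original row order
out = pd.concat([f(group) for _, group in g]).sort_index().drop(columns='range')
out.index = np.arange(out.shape[0])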

Using the data above, out is:

In [3]: out
Out[3]:
   id  start  end  score
0  C1      2  592    157
1  C2    152  219    350
2  C2     13   70    319
3  C2     87  139    130
4  C2     92  140    102
5  C3     18   38    348
6  C3     31   57    310
7  C4    347   51    514

Solution 2:

This is shorter and meets all requirements. You need:

  1. A way to check overlap
  2. A way to group your data by ID
  3. A way to grab the best from each group, after checking overlap.

This does all of those, cheating a little by using plain logic and groupby:

# from Ned Batchelder
# http://nedbatchelder.com/blog/201310/range_overlap_in_two_compares.html
def overlap(start1, end1, start2, end2):
    """
    Does the range (start1, end1) overlap with (start2, end2)?
    """
    return end1 >= start2 and end2 >= start1

def compare_rows(group):
    winners = []
    skip = []
    if len(group) == 1:
        return group[['start', 'end', 'score']]
    for i in group.index:
        if i in skip:
            continue
        for j in group.index:
            last = j == group.index[-1]
            istart = group.loc[i, 'start']
            iend = group.loc[i, 'end']
            jstart = group.loc[j, 'start']
            jend = group.loc[j, 'end']
            if overlap(istart, iend, jstart, jend):
                winner = group.loc[[i, j], 'score'].idxmax()
                if winner == j:
                    winners.append(winner)
                    skip.append(i)
                    break
            if last:
                winners.append(i)
    return group.loc[winners, ['start', 'end', 'score']].drop_duplicates()

# df is the dataframe from the question
grouped = df.groupby('id')
print(grouped.apply(compare_rows))
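
To see how the overlap test above behaves, here is a small self-contained sketch. It repeats the `overlap` function so that it runs on its own; the first two calls use ranges taken from the C2 rows of the sample data, and the last one uses made-up values to show that endpoints are treated as inclusive.

```python
def overlap(start1, end1, start2, end2):
    """Does the range (start1, end1) overlap with (start2, end2)?"""
    return end1 >= start2 and end2 >= start1


# C2 rows 87-139 and 92-140 overlap, although neither contains the other.
print(overlap(87, 139, 92, 140))  # True
# C2 rows 13-70 and 87-139 are disjoint.
print(overlap(13, 70, 87, 139))   # False
# Illustrative values: ranges that share a single endpoint count as overlapping.
print(overlap(1, 5, 5, 9))        # True
```

Overlap is a weaker condition than nesting. Solution 1 keeps both 87-139 and 92-140 because neither is nested inside the other, whereas an overlap-based test treats the two rows as competing.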
