Skip to content Skip to sidebar Skip to footer

Python Dataframe Rolling_sum With Numbers Not Date

Say part of my dataframe df[(df['person_num'] == 1) | (df['person_num'] == 2) ] looks like this: person_num Days IS_TRUE 1 1 1 1 4 1 1

Solution 1:

You could implicitly do the for loop through vectorization, which will in general be faster than explicitly writing a for loop. Here's a working example on the data you provided:

import pandas as pd
import numpy as np

df = pd.DataFrame({'Days': [1, 4, 5, 9, 1, 4, 5, 9],
                   'IS_TRUE': [1, 1, 0, 1, 1, 1, 0, 1],
                   'person_num': [1, 1, 1, 1, 2, 2, 2, 2]})

def window(group):
    diff = np.subtract.outer(group.Days, group.Days)
    group['IS_TRUE_7day_WINDOW'] = np.dot((diff > 0) & (diff <= 7),
                                          group['IS_TRUE'])
    returngroup

f.groupby('person_num').apply(window)

Output is this:

   Days  IS_TRUE  person_num  IS_TRUE_7day_WINDOW
0     1        1           1                    0
1     4        1           1                    1
2     5        0           1                    2
3     9        1           1                    1
4     1        1           2                    0
5     4        1           2                    1
6     5        0           2                    2
7     9        1           2                    1

Solution 2:

Since you mentioned data frame derives from a database, consider an SQL solution using a subquery which runs the calculation in its engine and not directly in Python.

Below assumes a MySQL database but adjust library and connection string according to your actual backend (SQLite, PostgreSQL, SQL Server, etc.). Below should be ANSI-syntax SQL, compliant in most RDMS.

SQL Solution

import pandas pd
importpymysqlconn= pymysql.connect(host="localhost" port=3306,
                     user="username", passwd="***", db="databasename") 

sql = "SELECT t1.Days, t1.person_num, t1.IS_TRUE, \
        (SELECT IFNULL(SUM(t2.IS_TRUE),0)  \
           FROM TableName t2 \
          WHERE t2.person_num= t1.person_num \
            AND t2.Days >= t1.Days - 7  \
            AND t2.Days < t1.Days) AS IS_TRUE_7DAY_WINDOW \
       FROM TableName t1"

df = pd.read_sql(sql, conn)

OUTPUT

Days    person_num  IS_TRUE IS_TRUE_7DAY_WINDOW
   1             1        1                   0
   4             1        1                   1
   5             1        0                   2
   9             1        1                   1
   1             2        1                   0
   4             2        1                   1
   5             2        0                   2
   9             2        1                   1                  

Solution 3:

The rolling_functions like rolling_sum use the index of the DataFrame or Series when seeing how far to go back. It doesn't have to be a datetime index. Below is some code to find the calculation for each user...

First use crosstab to make a DataFrame with a column for each person_num and a row for each day.

    >>> days_person = pd.crosstab(data['days'], 
                      data['person_num'], 
                      values=data['is_true'], 
                      aggfunc=pd.np.sum)
    >>> days_person
    person_num  12
    days
    111411500911

Next I'm going to fill in missing days with 0's, because you only have a few days of data.

    >>> empty_data = {n: [0]*10 for n in days_person.columns}
    >>> days_person = (days_person + pd.DataFrame(empty_data)).fillna(0)
    >>> days_person
    person_num  1  2
    days
    1           1  1
    2           0  0
    3           0  0
    4           1  1
    5           0  0
    6           0  0
    7           0  0
    8           0  0 
    9           1  1

Now use rolling_sum to get the table you're looking for. Note that days 1-6 will have NaN values, because there weren't enough previous days to do the calculation.

    >>> pd.rolling_sum(days_person, 7)

Post a Comment for "Python Dataframe Rolling_sum With Numbers Not Date"