
Pandas Join Datatable To SQL Table To Prevent Memory Errors

So I have about 4-5 million rows of data per table, and about 10-15 of these tables. I have created a table of about 30,000 rows that needs to be joined to some of these million-row tables based on an ID.
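
For context, here is a minimal sketch of the pattern that typically triggers this memory error: reading each multi-million-row table into pandas in full and only then joining. The names tablesToJoin, conn, df and the key columns MemberID/SnapshotDate mirror the objects referenced in the solution below and are assumptions about the original code.

import pandas as pd

conn = ...            # your DB connection / SQLAlchemy engine
tablesToJoin = [...]  # the 10-15 large table names
df = ...              # the ~30,000-row frame to join against

# Hypothetical reconstruction of the memory-hungry approach:
# every large table is loaded in full before pandas does the join.
for table in tablesToJoin:
    fullDf = pd.read_sql_query(f"SELECT * FROM {table}", conn)  # millions of rows in RAM
    joined = df.merge(fullDf, on=['MemberID', 'SnapshotDate'], how='inner')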

Solution 1:

I once faced the same problem as you. My solution was to filter as much as possible in the SQL layer. Since I don't have your code or your DB, the code below is untested and very possibly contains bugs; correct it as needed.

The idea is to read as little as possible from the DB. pandas is not designed to analyze frames of millions of rows (at least not on a typical computer). To achieve that, pass the filter criteria from df to your DB call:

from sqlalchemy import MetaData, and_, or_
import pandas as pd

engine = ...  # construct your SQLAlchemy engine. May correspond to your `conn` object
meta = MetaData()
meta.reflect(bind=engine, only=tablesToJoin)


for table in tablesToJoin:
    t = meta.tables[table]

    # Building the WHERE clause. This is equivalent to:
    #     WHERE     ((MemberID = <MemberID 1>) AND (SnapshotDate = date))
    #            OR ((MemberID = <MemberID 2>) AND (SnapshotDate = date))
    #            OR ((MemberID = <MemberID 3>) AND (SnapshotDate = date))
    cond = or_(*[and_(t.c['MemberID'] == member_id, t.c['SnapshotDate'] == date)
                 for member_id in df['MemberID']])

    # Be frugal here: only get the columns that you need, or you will blow your memory
    # If you specify None, it's equivalent to a `SELECT *`
    statement = t.select(None).where(cond)

    # Note that it's `read_sql`, not `read_sql_query` here
    loadedDf = pd.read_sql(statement, engine)

    # loadedDf should be much smaller now since you have already filtered it at the DB level
    # Now do your joins...
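    # As a sketch of that join step (the key columns MemberID/SnapshotDate are
    # assumed from the filter above; adjust the keys and join type to your schema):
    df = df.merge(loadedDf, on=['MemberID', 'SnapshotDate'], how='left',
                  suffixes=('', '_' + table))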
