Python - Reducing Import and Parse Time for Large CSV Files
My first post:
Before beginning, I should note that I'm relatively new to OOP, though I have done DB/stat work in SAS, R, etc., so my question may not be well posed: please let me know if I need to clarify anything.
My question:
I am attempting to import and parse large CSV files (~6MM rows, with larger to come). The two limitations I've run into repeatedly have been runtime and memory (I'm on a 32-bit implementation of Python). Below is a simplified version of my neophyte (nth) attempt at importing and parsing in a reasonable time. How can I speed up this process? I'm splitting the file on import and performing interim summaries due to the memory limitations, and I'm using pandas for the summarization:
Parsing and summarization:
    # Parsing and summarization helpers (imported as pse in the calling code below).
    import pandas as pd

    def parseints(instring):
        try:
            return int(instring)
        except (ValueError, TypeError):
            return None

    def texttoyearmo(instring):
        # Turn a date string like 'YYYY-MM-...' into an integer YYYYMM key.
        try:
            return 100 * int(instring[0:4]) + int(instring[5:7])
        except ValueError:
            return 100 * int(instring[0:4]) + int(instring[5:6])

    def parseallelements(elmvalue, elmpos):
        if elmpos in [0, 2, 5]:            # keep these positions as raw strings
            return elmvalue
        elif elmpos == 3:                  # date column -> YYYYMM integer
            return texttoyearmo(elmvalue)
        elif elmpos == 18:                 # last column carries the newline
            return parseints(elmvalue.strip('\n'))
        else:
            return parseints(elmvalue)

    def makeandsumlist(inlist):
        columns = ['x1', 'x2', 'x3', 'x4', 'x5',
                   'x6', 'x7', 'x8', 'x9', 'x10',
                   'x11', 'x12', 'x13', 'x14']
        df = pd.DataFrame(inlist, columns=columns)
        return df.groupby(['x1', 'x2', 'x3', 'x4', 'x5']).sum().reset_index()
Function calls:
    import math
    import pandas as pd

    import pse   # the parsing/summarization module shown above

    def parsedsummary(longstring, delimtr, rownum):
        keepcolumns = [0, 3, 2, 5, 10, 9, 11, 12, 13, 14, 15, 16, 17, 18]
        # do some other stuff that takes very little time
        return [pse.parseallelements(longstring.split(delimtr)[i], i) for i in keepcolumns]

    def csvtolist(filename, delimtr=','):
        columns = ['x1', 'x2', 'x3', 'x4', 'x5',
                   'x6', 'x7', 'x8', 'x9', 'x10',
                   'x11', 'x12', 'x13', 'x14']
        with open(filename) as f:
            # Materialize the whole file as (row number, line) pairs.
            listenumfile = set(enumerate(f))
        linecount = len(listenumfile)
        maxsplit = math.floor(linecount / 10) + 1

        summary = pd.DataFrame({}, columns=columns)
        for counter in range(0, 10):
            startrow = int(counter * maxsplit)
            endrow = int((counter + 1) * maxsplit)
            includedrows = set(range(startrow, endrow))
            # Parse only the rows that fall in this tenth of the file...
            listofrows = [parsedsummary(row, delimtr, rownum)
                          for rownum, row in listenumfile
                          if rownum in includedrows]
            # ...and fold the interim summary into the running total.
            summary = pd.concat([summary, pse.makeandsumlist(listofrows)])
        return summary
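For completeness, this is roughly how it gets called and timed (the filename here is just a stand-in):

    import time

    start = time.time()
    summary = csvtolist('bigfile.csv')   # stand-in filename
    print(summary.head())
    print('elapsed: %.1f seconds' % (time.time() - start))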
(Again, this is my first question - I apologize if I've over-simplified or, more likely, simplified too little; I'm at a loss for how to expedite this.)
For runtime comparison:
Using Access I can import, parse, summarize, and merge several files in this size range in <5 minutes (though I'm right at the 2 GB limit). I'd hope to get comparable results in Python - at present I'm estimating ~30 minutes of run time for one file. Note: I only threw this together in Access's miserable environment because I didn't have admin rights readily available to install anything else.
Edit: I've updated the parsing code. I was able to shave off 5 minutes (estimated runtime now at 25 minutes) by changing the conditional logic to try/except. Also, the runtime estimate doesn't include the pandas portion - I'd forgotten I'd commented that out while testing, but its impact seems negligible.
If you want to optimize performance, don't roll your own CSV reader in Python. There is a standard csv module. Perhaps pandas or numpy have faster CSV readers; I'm not sure.
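As a rough sketch of what that could look like for the code in the question (I'm reusing the parseallelements/makeandsumlist helpers from the question; the chunk size is an arbitrary guess to tune to your memory):

    import csv
    import pandas as pd

    KEEP_COLUMNS = [0, 3, 2, 5, 10, 9, 11, 12, 13, 14, 15, 16, 17, 18]
    CHUNK_ROWS = 500000   # arbitrary guess; tune to what fits in memory

    def csv_to_summary(filename, delimiter=','):
        summaries = []
        rows = []
        with open(filename) as f:
            for row in csv.reader(f, delimiter=delimiter):
                # parseallelements / makeandsumlist are the helpers from the question
                rows.append([parseallelements(row[i], i) for i in KEEP_COLUMNS])
                if len(rows) >= CHUNK_ROWS:
                    summaries.append(makeandsumlist(rows))
                    rows = []
        if rows:
            summaries.append(makeandsumlist(rows))
        # Collapse the per-chunk summaries into one overall summary.
        return (pd.concat(summaries)
                  .groupby(['x1', 'x2', 'x3', 'x4', 'x5'], as_index=False)
                  .sum())

Note that csv.reader splits each line only once (instead of calling longstring.split(delimtr) once per kept column) and handles delimiters and quoting for you, and reading sequentially in chunks avoids holding the whole file in memory, which matters on a 32-bit build.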
From https://softwarerecs.stackexchange.com/questions/7463/fastest-python-library-to-read-a-csv-file:
In short, pandas.io.parsers.read_csv beats everybody else; NumPy's loadtxt is impressively slow, and NumPy's from_file and load are impressively fast.
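To connect that back to the question, here is a sketch of the read_csv route (the column positions and the x1..x5-style grouping are taken from the question; the filename, the chunk size, and the assumption that column 3 holds an ISO-style date are mine):

    import pandas as pd

    usecols = [0, 2, 3, 5, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18]   # positions kept in the question
    groupkeys = [0, 2, 'yearmo', 5, 10]                             # rough analogue of x1..x5

    partials = []
    for chunk in pd.read_csv('bigfile.csv',            # stand-in filename
                             header=None, usecols=usecols, chunksize=500000):
        # Assumption: column 3 holds an ISO-style date; build the YYYYMM key from it.
        dates = pd.to_datetime(chunk[3])
        chunk['yearmo'] = dates.dt.year * 100 + dates.dt.month
        chunk = chunk.drop(columns=[3])
        partials.append(chunk.groupby(groupkeys, as_index=False).sum())

    # Fold the per-chunk summaries into the final one.
    summary = pd.concat(partials).groupby(groupkeys, as_index=False).sum()

chunksize keeps memory bounded on a 32-bit interpreter, and summarizing each chunk before the final concat/groupby mirrors the interim-summary approach already used in the question.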