Python - Reducing Import and Parse Time for Large CSV Files
My first post:
Before beginning, I should note that I'm relatively new to OOP, though I have done DB/stat work in SAS, R, etc., so my question may not be well posed: please let me know if I need to clarify anything.
My question:
I am attempting to import and parse large CSV files (~6MM rows, with larger to come). The two limitations I've run into repeatedly have been runtime and memory (I'm on a 32-bit implementation of Python). Below is a simplified version of my neophyte (nth) attempt at importing and parsing in a reasonable time. How can I speed up this process? I'm splitting the file on import and performing interim summaries due to the memory limitations, and I'm using pandas for the summarization:
Parsing and summarization:
    # Parsing and summarization helpers (imported as pse in the calling code below).
    import pandas as pd

    def parseints(instring):
        try:
            return int(instring)
        except (ValueError, TypeError):
            return None

    def texttoyearmo(instring):
        # Turn a date string like 'YYYY-MM-...' into an integer YYYYMM key.
        try:
            return 100 * int(instring[0:4]) + int(instring[5:7])
        except ValueError:
            return 100 * int(instring[0:4]) + int(instring[5:6])

    def parseallelements(elmvalue, elmpos):
        if elmpos in [0, 2, 5]:            # keep these positions as raw strings
            return elmvalue
        elif elmpos == 3:                  # date column -> YYYYMM integer
            return texttoyearmo(elmvalue)
        elif elmpos == 18:                 # last column carries the newline
            return parseints(elmvalue.strip('\n'))
        else:
            return parseints(elmvalue)

    def makeandsumlist(inlist):
        columns = ['x1', 'x2', 'x3', 'x4', 'x5',
                   'x6', 'x7', 'x8', 'x9', 'x10',
                   'x11', 'x12', 'x13', 'x14']
        df = pd.DataFrame(inlist, columns=columns)
        return df.groupby(['x1', 'x2', 'x3', 'x4', 'x5']).sum().reset_index()
Function calls:
    import math
    import pandas as pd

    import pse   # the parsing/summarization module shown above

    def parsedsummary(longstring, delimtr, rownum):
        keepcolumns = [0, 3, 2, 5, 10, 9, 11, 12, 13, 14, 15, 16, 17, 18]
        # do some other stuff that takes very little time
        return [pse.parseallelements(longstring.split(delimtr)[i], i) for i in keepcolumns]

    def csvtolist(filename, delimtr=','):
        columns = ['x1', 'x2', 'x3', 'x4', 'x5',
                   'x6', 'x7', 'x8', 'x9', 'x10',
                   'x11', 'x12', 'x13', 'x14']
        with open(filename) as f:
            # Materialize the whole file as (row number, line) pairs.
            listenumfile = set(enumerate(f))
        linecount = len(listenumfile)
        maxsplit = math.floor(linecount / 10) + 1

        summary = pd.DataFrame({}, columns=columns)
        for counter in range(0, 10):
            startrow = int(counter * maxsplit)
            endrow = int((counter + 1) * maxsplit)
            includedrows = set(range(startrow, endrow))
            # Parse only the rows that fall in this tenth of the file...
            listofrows = [parsedsummary(row, delimtr, rownum)
                          for rownum, row in listenumfile
                          if rownum in includedrows]
            # ...and fold the interim summary into the running total.
            summary = pd.concat([summary, pse.makeandsumlist(listofrows)])
        return summary
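For completeness, this is roughly how it gets called and timed (the filename here is just a stand-in):

    import time

    start = time.time()
    summary = csvtolist('bigfile.csv')   # stand-in filename
    print(summary.head())
    print('elapsed: %.1f seconds' % (time.time() - start))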
(Again, this is my first question - I apologize if I've over-simplified or, more likely, simplified too little; I'm at a loss for how to expedite this.)
For runtime comparison:
Using Access I can import, parse, summarize, and merge several files in this size range in <5 minutes (though I'm right at the 2 GB limit). I'd hope to get comparable results in Python - at present I'm estimating ~30 minutes of run time for one file. Note: I only threw this together in Access's miserable environment because I didn't have admin rights readily available to install anything else.
Edit: I've updated the parsing code. I was able to shave off 5 minutes (estimated runtime now at 25 minutes) by changing the conditional logic to try/except. Also, the runtime estimate doesn't include the pandas portion - I'd forgotten I'd commented that out while testing, but its impact seems negligible.
If you want to optimize performance, don't roll your own CSV reader in Python. There is a standard csv module. Perhaps pandas or numpy have faster CSV readers; I'm not sure.
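As a rough sketch of what that could look like for the code in the question (I'm reusing the parseallelements/makeandsumlist helpers from the question; the chunk size is an arbitrary guess to tune to your memory):

    import csv
    import pandas as pd

    KEEP_COLUMNS = [0, 3, 2, 5, 10, 9, 11, 12, 13, 14, 15, 16, 17, 18]
    CHUNK_ROWS = 500000   # arbitrary guess; tune to what fits in memory

    def csv_to_summary(filename, delimiter=','):
        summaries = []
        rows = []
        with open(filename) as f:
            for row in csv.reader(f, delimiter=delimiter):
                # parseallelements / makeandsumlist are the helpers from the question
                rows.append([parseallelements(row[i], i) for i in KEEP_COLUMNS])
                if len(rows) >= CHUNK_ROWS:
                    summaries.append(makeandsumlist(rows))
                    rows = []
        if rows:
            summaries.append(makeandsumlist(rows))
        # Collapse the per-chunk summaries into one overall summary.
        return (pd.concat(summaries)
                  .groupby(['x1', 'x2', 'x3', 'x4', 'x5'], as_index=False)
                  .sum())

Note that csv.reader splits each line only once (instead of calling longstring.split(delimtr) once per kept column) and handles delimiters and quoting for you, and reading sequentially in chunks avoids holding the whole file in memory, which matters on a 32-bit build.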
From https://softwarerecs.stackexchange.com/questions/7463/fastest-python-library-to-read-a-csv-file:
In short, pandas.io.parsers.read_csv beats everybody else; NumPy's loadtxt is impressively slow, and NumPy's from_file and load are impressively fast.
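To connect that back to the question, here is a sketch of the read_csv route (the column positions and the x1..x5-style grouping are taken from the question; the filename, the chunk size, and the assumption that column 3 holds an ISO-style date are mine):

    import pandas as pd

    usecols = [0, 2, 3, 5, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18]   # positions kept in the question
    groupkeys = [0, 2, 'yearmo', 5, 10]                             # rough analogue of x1..x5

    partials = []
    for chunk in pd.read_csv('bigfile.csv',            # stand-in filename
                             header=None, usecols=usecols, chunksize=500000):
        # Assumption: column 3 holds an ISO-style date; build the YYYYMM key from it.
        dates = pd.to_datetime(chunk[3])
        chunk['yearmo'] = dates.dt.year * 100 + dates.dt.month
        chunk = chunk.drop(columns=[3])
        partials.append(chunk.groupby(groupkeys, as_index=False).sum())

    # Fold the per-chunk summaries into the final one.
    summary = pd.concat(partials).groupby(groupkeys, as_index=False).sum()

chunksize keeps memory bounded on a 32-bit interpreter, and summarizing each chunk before the final concat/groupby mirrors the interim-summary approach already used in the question.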