Python regex: getting everything (except \n) between two characters in a multiline string
I have this file as input:

    >x0
    cuugacgauca
    cgcaucg
    >x55
    uacggcgg
    uucagc
    aucg
    >x300
    aaacccgggg

and I need the concatenation of the lines between the '>' characters:

    cuugacgaucacgcaucg
    uacggcgguucagcaucg
    aaacccgggg
My attempt was to use "re.match(r'^>.*\n(.*)>.*', a, re.DOTALL)" and then delete the '\n' from each match, but the regex is not returning anything. What is wrong?
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. - Jamie Zawinski

That being said, why not use more understandable string processing?
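    tmp = []   # lines of the sequence currently being read
    seqs = []  # finished, concatenated sequences
    with open('txtfile') as f:
        for line in f:
            if line.startswith('>'):
                # a new header closes the previous sequence
                seqs.append(''.join(tmp))
                tmp = []
            else:
                tmp.append(line.strip())
        else:
            # runs once the loop is exhausted: drop the empty entry
            # created by the first header, then store the last sequence
            seqs.pop(0)
            seqs.append(''.join(tmp))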
Alternatively, if you do want to use a regex, try first stripping the newlines and then splitting on the >x[digit] patterns:

    re.split(r'>x\d+', re.sub(r'\n', '', data))
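To illustrate, here is a small self-contained run of that one-liner on the sample input from the question. The `data` string is inlined rather than read from a file, and the helper name `split_sequences` is made up for this sketch. Because the text starts with a `>x0` header, `re.split` returns an empty string as its first element, so the sketch filters empty pieces out.

```python
import re

def split_sequences(data):
    """Strip newlines, then split on the '>x<digits>' headers."""
    joined = re.sub(r'\n', '', data)
    # re.split yields '' before the first header; drop empty pieces.
    return [seq for seq in re.split(r'>x\d+', joined) if seq]

# Sample input from the question, inlined for the example.
data = """>x0
cuugacgauca
cgcaucg
>x55
uacggcgg
uucagc
aucg
>x300
aaacccgggg
"""

for seq in split_sequences(data):
    print(seq)
```

This assumes every header has exactly the form `>x` followed by digits, as in the sample; other header formats would need a different pattern.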
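However, the regex approach has the downside that the entire text file has to be loaded into the variable data, which is not attractive for a large file (and large files are quite common in bioinformatics). In that case the line-by-line approach given first is more interesting memory-wise, since you can process each finished DNA/RNA sequence in turn.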
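One way to process each finished sequence in turn is a generator that yields a sequence as soon as the next header (or the end of the input) is reached, so only one record is held in memory at a time. This is a minimal sketch, not part of the original answer: the function name `iter_sequences` is hypothetical, and `io.StringIO` stands in for the real file.

```python
import io

def iter_sequences(lines):
    """Lazily yield one concatenated sequence per '>' header."""
    parts = []
    seen_header = False
    for line in lines:
        if line.startswith('>'):
            # A new header closes the previous record, if there was one.
            if seen_header:
                yield ''.join(parts)
            parts = []
            seen_header = True
        else:
            parts.append(line.strip())
    # Emit the last record once the input is exhausted.
    if seen_header:
        yield ''.join(parts)

# Stand-in for open('txtfile'); any iterable of lines works.
sample = io.StringIO(
    ">x0\ncuugacgauca\ncgcaucg\n"
    ">x55\nuacggcgg\nuucagc\naucg\n"
    ">x300\naaacccgggg\n"
)
for seq in iter_sequences(sample):
    print(seq)
```

With a real file, the same function could be called as `iter_sequences(f)` inside a `with open(...) as f:` block.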