Python regex: getting everything (except \n) between two characters in a multiline string


I have this file input:

>x0
cuugacgauca
cgcaucg
>x55
uacggcgg
uucagc
aucg
>x300
aaacccgggg

and I need the concatenation of the lines between the '>' header lines:

cuugacgaucacgcaucg
uacggcgguucagcaucg
aaacccgggg

My attempt was to use "re.match(r'^>.*\n(.*)>.*', a, re.DOTALL)" and then delete the '\n' from each match, but the regex is not returning anything. What is wrong?

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. - Jamie Zawinski

That being said, why not use more understandable string processing?

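tmp = []   # chunks of the sequence currently being read
seqs = []  # completed, concatenated sequences
with open('txtfile') as f:
    for line in f:
        if line.startswith('>'):
            # A header line closes the previous sequence, if there was one.
            if tmp:
                seqs.append(''.join(tmp))
            tmp = []
        else:
            tmp.append(line.strip())
# The last sequence has no header after it, so store it here.
if tmp:
    seqs.append(''.join(tmp))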

Alternatively, if you do want to use a regex, try first stripping the newlines and then splitting on the >x[digit] patterns:

re.split(r'>x\d+', re.sub(r'\n', '', data)) 
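As a quick check, here is a small sketch of that one-liner run on the sample input from the question. The contents of data are an assumption (one header or sequence chunk per line). Because the input starts with a header, re.split returns an empty string as its first element, so the sketch filters empty pieces out.

```python
import re

# Sample input from the question; the exact line layout is assumed.
data = (
    ">x0\ncuugacgauca\ncgcaucg\n"
    ">x55\nuacggcgg\nuucagc\naucg\n"
    ">x300\naaacccgggg\n"
)

# Strip the newlines first, then split on the '>x<digits>' headers.
parts = re.split(r'>x\d+', re.sub(r'\n', '', data))

# Nothing precedes the first header, so parts[0] is '' and is dropped here.
seqs = [p for p in parts if p]
print(seqs)
```

The regex route stays compact, at the cost of holding the whole input in memory at once.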

But this has the downside that the entire text file has to be loaded into the variable data, which is not attractive for a large file (and large files are quite common in bioinformatics). The approach given first is then more interesting memory-wise, because you can process each finished DNA/RNA sequence in turn.
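To illustrate processing each finished sequence in turn, the line-by-line idea can be wrapped in a generator. This is only a minimal sketch: the function name read_sequences is hypothetical, and io.StringIO stands in for a real file so the example is self-contained.

```python
import io

def read_sequences(lines):
    """Yield one concatenated sequence per '>' header (hypothetical helper)."""
    chunks = []
    for line in lines:
        if line.startswith('>'):
            # A header closes the previous sequence, if any was collected.
            if chunks:
                yield ''.join(chunks)
            chunks = []
        else:
            chunks.append(line.strip())
    # The final sequence has no header after it.
    if chunks:
        yield ''.join(chunks)

# Stand-in for open('txtfile'); the sample layout is assumed from the question.
sample = io.StringIO(
    ">x0\ncuugacgauca\ncgcaucg\n"
    ">x55\nuacggcgg\nuucagc\naucg\n"
    ">x300\naaacccgggg\n"
)

for seq in read_sequences(sample):
    print(seq)
```

Only the chunks of the current sequence are held in memory, so passing an open file object instead of the StringIO should work the same way for large inputs.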

