Program to process fastq files
My program evaluates fastq files.
The fastq file is just a huge text file (about 1.5 GB in my case), which looks like this:
file=
@SRR587217.1 1 length=490
NTCCGGATGATGTCGCTGTTGCTGACAATGGTAATACGTTGACGGGGCAATATGCAGTTCGCTGCATACCGGTCCGACCCCGTACTGCTCACGCAGCTTATCCAGCAGTGGCATCATTTTTTCCAGAGGCGGTCGAACTCCGCCTTCGCAAAAAAAAAGGGAGCCCGGCGGAGGAGAACGTTACTGCGGCGGAGGTTACGATTTTTCCGGTTCCGCTCCTTTAGAAGCCGGACGTCTACCCGGCTCTTTTTGCTGAACGTCAGCGTCTGAAAGAGCTGGAACGTGAAAATCGTGAACTGCGCCGCAGTAACGATATCCTTCGCCAGGCTTCCGCTTATTTTGCGAAGGCGGAGTTCGACCGCCTCTGGAAAAAATGATGCCACTGCTGGATAAGCTGCGTGAGCAGTACGGGGTCGGACCGGTATGCAGCGAACTGCATATTGCCCCGTCAACGTATTACCATTGTCAGCAACAGCGACATCATCCGGAT
+SRR587217.1 1 length=490
BT[^^^^^caccccacffffffhhhgghhhhefeggghhghghgfgggghfgggeghhhhhgggghhgggggggggggggggdgggggfffffffffffffddfffffbfffffffffffffffffffffffMFFFHOIIPHMWHOIIHHMM^IFTMFHOHIHMHWMWFMFFFTHHIIIHIOIHIIIHMFFFFFFOIIIHHIOH^IIIHMSFOIHMHMFPYIIIIIIIIFFMFMOWIIIPIHFFMOIYIII^^``^aaaccccccccffffffghghhggghghgghghhhggggefgfededbe`cgeghggegaefgghegggggggggccdggggceeggfT``dddfffdfffffac]Zbffdfdbdbdfffddd^^dfdfdfddffffdbdbdfffffffbdddf``c`bccQW`MSHYbdffff]cbbdbbbdYI^bd^[WdbbbfMXbbbf^b ^^b^bdbdddIX^d]W]QW^dd^^dcfW
@SRR587217.4 4 length=502
NTCCATGACTTCTGCACGCGTCGGCATCGGGTTAGTAATCATTGACTCCATCATCTGGGTCGCCGTGATTACCGCTCGGTTTAGCTGACGCGCACGACGGATCAACGCTTTCTGAATGCCGACCAGTTCCGGGTCGCCAATTTCCACACCGAGGTCGCCACGTGCAACCATTACCACGTCAGAGGCGAGGATGATGTCATCCATTGCATCCTGGCTGCAAACGGCTTCCGCACGTTCAACCTTGGCAACAAGTGATGCGAAAATTGTTGCCAAGGTTGAACGTGCGGAAGCCGTTTGCAGCCAGGATGCAATGGATGACATCATCCTCGCCTCTGACGTGGTAATGGTTGCACGTGGCGACCTCGGTGTGGAAATTGGCGACCCGGAACTGGTCGGCATTCAGAAAGCGTTGATCCGTCGTGCGCGTCAGCTAAACCGAGCGGTAATCACGGCGACCCAGATGATGGAGTCAATGATTACTAACCCGATGCCGACGCGTGCA
+SRR587217.4 4 length=502
BTT^^^aaccccccccffffeeggggghgggegghffghggggghhhhhhhhhhgghhhegegggdgghhgggggggggcgggffffffffffffffffffffffffffffffffffbffffffffdffffffb`fffffffffffffffffffbfffffffffffffffffffffffdfffffffffffWYbddfdddffffffffffffdfffbffffffffcW]ffff]cfddddffffffffffffd^^^^^aa`ccccccccffffffhhhghghghggggggggghgbegggfgghgeegggggghhhhhhhhhhgfghghfggbcdggggg^cceeecegeeffffeeedfffaaffdfbdbddfdfffddffffffcff[`^bfdfffffWbffddbdfffdb]dddbbfdfcfffffSWbdd^bP^db]QWcccOSWbbbddb`ffW`fff^bbI^YYbbYYYOYY^^b^^bbddbbWW]bdbQMMW]cS]dW
Each sequence is divided into 4 lines. The first line shows us the sequence name.
The second line shows us the sequence itself. The third line shows some comments (irrelevant for this program) and the fourth line shows us the quality of the bases as ASCII characters (Phred score).
First of all I split out the three necessary lines (line 1, line 2 and line 4).
Line one (the sequence names) is saved in one list. Line two is saved in another list as well. For line four, I convert the ASCII characters into numbers and save these numbers for each line in a separate list. Example:
file=
"abcdefghijkl\n
"abcdefghijkl"
list of numbers=[[1,2,3,4,5,6,7,8], [1,2,3,4,5,6,7,8]]
With those lists I calculate means, medians and variances for some plotting.
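For example, the conversion for a single quality string (just a toy sketch with a made-up string, assuming a Phred+33 offset) would look something like this:
qual = "IIIIHHGGFF"                       # made-up quality string, not from a real read
scores = [ord(ch) - 33 for ch in qual]    # ASCII character -> Phred score
mean = sum(scores) / len(scores)          # per-read mean quality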
The code works fine with smaller test files, but if I use the big one it overloads my PC.
Is there a way to compress the code?
Should I use fewer lists/dicts?
from argparse import ArgumentParser

def main():
    parser = ArgumentParser(prog='fastq', description=desc)
    parser.add_argument('dateien', metavar='Files', nargs='*',
                        default=['sys.stdin'], help='Input of the file')
    parser.add_argument('-p', '--phred', dest='phred', nargs='?', action='store',
                        choices=[33, 64], required=True, type=int)
    parser.add_argument('-c', '--cutoff', dest='cutoff', nargs='?',
                        action='store', default=1, type=int,
                        help='deletes whole reads with bad quality')
    parser.add_argument('-t', '--trim', dest='trim', nargs='?',
                        action='store', default=1, type=int,
                        help='trims read ends with bad quality')
    #parser.add_argument('-v', '--version', action='version', version="{prog}: {version}".format(prog=args.prog, version='0.1'))
    args = parser.parse_args()

    names = []; x = 0; names1 = ""
    seq_dict = {}; mean_dict = {}; qual_dict = {}
    qual_liste = []; i = 0; sequenz = []; seq_name = []
    mean1 = []; summe = 0
    quality_code = []
    gc_dict = {}
    qually = []
    quality = []

    with open('test.fastq', 'r') as seq:
        for line in seq:
            quality_code = []
            line = line.rstrip('\n')
            rest = i % 4                    # modulo to focus on specific lines
            i += 1
            #print(rest)
            if rest == 0:                   # line containing the sequence name
                name = line[:-13]           # some embellishment
                name1 = name.strip("@")
                name2 = name1.strip("length=" or "length" or "lengl" or "leng")
                name3 = name2.rstrip()
                seq_name.append(name3)      # list of the sequence names
            elif rest == 1:                 # line containing the sequence nucleotides
                seq = line[:-args.trim]     # trimming the sequence
                sequenz.append(seq)
            elif rest == 3:                 # line containing the quality scores for the sequence
                qual = line[:-args.trim]
                if args.phred == 33:
                    quality_code = [[(ord(ii) - 33) for ii in i] for i in qual.split('\n')]
                elif args.phred == 64:
                    quality_code = [[(ord(ii) - 64) for ii in i] for i in qual.split('\n')]
                for list in quality_code:
                    mean = sum(list) / len(qual)
                    mean1.append(int(mean))
                    if mean >= int(args.cutoff):   # filter/cutoff of reads with bad quality
                        seq_dict[name3] = seq
                        mean_dict[name3] = int(mean)
                        qual_dict[name3] = quality_code
The whole code includes some more lines about plotting to show the results.
python performance python-3.x parsing bioinformatics
asked Feb 14 at 20:38 by Julian; edited Feb 14 at 23:23 by 200_success
What do you use sequenz for? You are storing every second line of the file in a list; if the file is 1.5 GB, that means using hundreds of MBs of memory even if Python stored them very efficiently. It's also not very CPU friendly, since the list is constantly having to be expanded.
– André Paramés, Feb 15 at 9:13
You might want to look into this project: github.com/JohnLonginotto/SeQC
– Austin Hastings, Mar 17 at 12:37
1 Answer
This won't solve your problem, but here are some suggestions regarding the organization of the code:
You can use a function like this to fetch four lines at once, which I think makes your code easier to understand than using rest:
with open('test.fastq', 'r') as seq:
    for line1, line2, line3, line4 in group_it(4, seq):
        ...
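group_it itself is not shown here; one possible implementation, following the standard itertools grouper recipe, could be:
from itertools import zip_longest

def group_it(n, iterable):
    # Collect items from the iterable into fixed-size tuples,
    # e.g. group_it(4, file) yields the four lines of each fastq record.
    args = [iter(iterable)] * n
    return zip_longest(*args)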
Instead of doing if args.phred == 33: you can just use the variable in the calculation, avoiding the repetition:
quality_code = [[(ord(ii) - args.phred) for ii in i] for i in qual.split('\n')]
answered Feb 15 at 10:20 by André Paramés