Program to process fastq files











up vote
3
down vote

favorite












My program evaluates fastq files.
The fastq file is just a huge text file (about 1,5gb in my case), which looks like this:



file=@SRR587217.1 1 length=490
NTCCGGATGATGTCGCTGTTGCTGACAATGGTAATACGTTGACGGGGCAATATGCAGTTCGCTGCATACCGGTCCGACCCCGTACTGCTCACGCAGCTTATCCAGCAGTGGCATCATTTTTTCCAGAGGCGGTCGAACTCCGCCTTCGCAAAAAAAAAGGGAGCCCGGCGGAGGAGAACGTTACTGCGGCGGAGGTTACGATTTTTCCGGTTCCGCTCCTTTAGAAGCCGGACGTCTACCCGGCTCTTTTTGCTGAACGTCAGCGTCTGAAAGAGCTGGAACGTGAAAATCGTGAACTGCGCCGCAGTAACGATATCCTTCGCCAGGCTTCCGCTTATTTTGCGAAGGCGGAGTTCGACCGCCTCTGGAAAAAATGATGCCACTGCTGGATAAGCTGCGTGAGCAGTACGGGGTCGGACCGGTATGCAGCGAACTGCATATTGCCCCGTCAACGTATTACCATTGTCAGCAACAGCGACATCATCCGGAT
+SRR587217.1 1 length=490
BT[^^^^^caccccacffffffhhhgghhhhefeggghhghghgfgggghfgggeghhhhhgggghhgggggggggggggggdgggggfffffffffffffddfffffbfffffffffffffffffffffffMFFFHOIIPHMWHOIIHHMM^IFTMFHOHIHMHWMWFMFFFTHHIIIHIOIHIIIHMFFFFFFOIIIHHIOH^IIIHMSFOIHMHMFPYIIIIIIIIFFMFMOWIIIPIHFFMOIYIII^^``^aaaccccccccffffffghghhggghghgghghhhggggefgfededbe`cgeghggegaefgghegggggggggccdggggceeggfT``dddfffdfffffac]Zbffdfdbdbdfffddd^^dfdfdfddffffdbdbdfffffffbdddf``c`bccQW`MSHYbdffff]cbbdbbbdYI^bd^[WdbbbfMXbbbf^b ^^b^bdbdddIX^d]W]QW^dd^^dcfW
@SRR587217.4 4 length=502
NTCCATGACTTCTGCACGCGTCGGCATCGGGTTAGTAATCATTGACTCCATCATCTGGGTCGCCGTGATTACCGCTCGGTTTAGCTGACGCGCACGACGGATCAACGCTTTCTGAATGCCGACCAGTTCCGGGTCGCCAATTTCCACACCGAGGTCGCCACGTGCAACCATTACCACGTCAGAGGCGAGGATGATGTCATCCATTGCATCCTGGCTGCAAACGGCTTCCGCACGTTCAACCTTGGCAACAAGTGATGCGAAAATTGTTGCCAAGGTTGAACGTGCGGAAGCCGTTTGCAGCCAGGATGCAATGGATGACATCATCCTCGCCTCTGACGTGGTAATGGTTGCACGTGGCGACCTCGGTGTGGAAATTGGCGACCCGGAACTGGTCGGCATTCAGAAAGCGTTGATCCGTCGTGCGCGTCAGCTAAACCGAGCGGTAATCACGGCGACCCAGATGATGGAGTCAATGATTACTAACCCGATGCCGACGCGTGCA
+SRR587217.4 4 length=502
BTT^^^aaccccccccffffeeggggghgggegghffghggggghhhhhhhhhhgghhhegegggdgghhgggggggggcgggffffffffffffffffffffffffffffffffffbffffffffdffffffb`fffffffffffffffffffbfffffffffffffffffffffffdfffffffffffWYbddfdddffffffffffffdfffbffffffffcW]ffff]cfddddffffffffffffd^^^^^aa`ccccccccffffffhhhghghghggggggggghgbegggfgghgeegggggghhhhhhhhhhgfghghfggbcdggggg^cceeecegeeffffeeedfffaaffdfbdbddfdfffddffffffcff[`^bfdfffffWbffddbdfffdb]dddbbfdfcfffffSWbdd^bP^db]QWcccOSWbbbddb`ffW`fff^bbI^YYbbYYYOYY^^b^^bbddbbWW]bdbQMMW]cS]dW


Each sequence is divided into 4 lines. First line shows us the sequence name.
Second line shows us the sequence itself. Third line shows us some comments (irrelevant for this programm) and the fourth line shows as the quality of the bases in ascii character (Phred score)



First of all i subdivide the three necessary lines (line1, line2 and line4)



Line one (sequence names) are saved in one list. Line two are saved in another list as well. In line four, I convert the ASCII character into numbers and save these numbers for each line in a different list. Example:



file=
"abcdefghijkl n
"abcdefghijkl"
list of numbers=[[1,2,3,4,5,6,7,8][1,2,3,4,5,6,7,8]]


with those lists i calculate means, medians, variance for some plotting.





The code works fine with smaller "test.files" but if I use the big one it overloads my PC.
Is there a way to compress the code?
Should I use less lists/dicts?



def main():
parser = ArgumentParser(prog='fastq', description=desc)
parser.add_argument('dateien', metavar='Files', nargs='*',
default=['sys.stdin'], help='Input of the file')
parser.add_argument('-p', '--phred',dest='phred',nargs= '?',action='store',
choices= [33 ,64 ], required=True, type=int )
parser.add_argument('-c', '--cutoff',dest = "cutoff",nargs='?',
action='store', default=1, type=int,
help='deletes whole Reads with bad quality')
parser.add_argument('-t', '--trim',dest = 'trim',nargs='?',
action='store', default=1, type=int,
help='Trimming Readends with bad quality')
#parser.add_argument('-v','--version',action='version', version="{prog}: {version}".format(prog=args.prog, version='0.1'))

args = parser.parse_args()

names = ; x=0 ;names1 = ""
seq_dict={} ; mean_dict={} ; qual_dict={}
qual_liste= ; i=0 ; sequenz= ; seq_name=
mean1= ; summe=0
quality_code=
gc_dict={}
qually=
quality=
with open('test.fastq', 'r') as seq:
for line in seq:
quality_code=
line=line.rstrip('n')
rest=i%4 #modulo to focus on specific lines
i+=1
#print(rest)

if rest==0: #line including the sequenz name
name=line[:-13] #some embellishment
name1=name.strip("@")
name2=name1.strip("length=" or "length" or "lengl" or "leng")
name3=name2.rstrip()
seq_name.append(name3) #List which includes lists of the seq names
elif rest == 1: #Line including sequenz nucleotides
seq=line[:-args.trim] #trimming the sequenz
sequenz.append(seq)
elif rest ==3: #line including quality score for sequenz
qual=line[:-args.trim]
if args.phred == 33:
quality_code=[[(ord(ii)-33) for ii in i] for i in qual.split('n')]
elif args.phred == 64:
quality_code=[[(ord(ii)-64) for ii in i] for i in qual.split('n')]
for list in quality_code:
mean=sum(list)/len(qual)
mean1.append(int(mean))

if mean >= int(args.cutoff):
seq_dict[name3]=seq
mean_dict[name3]=int(mean) #filter/cutoff of reads with worse quality
qual_dict[name3]=quality_code


The whole code includes some more lines about plotting to show the results.










share|improve this question
















bumped to the homepage by Community 14 hours ago


This question has answers that may be good or bad; the system has marked it active so that they can be reviewed.















  • What do you use sequenz for? You are storing every second line of the file in a list; if the file has 1.5GB, that means using hundreds of MBs of memory even if Python stored them very efficiently. It's also not very CPU friendly, since the list is constantly having to be expanded.
    – André Paramés
    Feb 15 at 9:13










  • You might want to look into this project: github.com/JohnLonginotto/SeQC
    – Austin Hastings
    Mar 17 at 12:37















up vote
3
down vote

favorite












My program evaluates fastq files.
The fastq file is just a huge text file (about 1,5gb in my case), which looks like this:



file=@SRR587217.1 1 length=490
NTCCGGATGATGTCGCTGTTGCTGACAATGGTAATACGTTGACGGGGCAATATGCAGTTCGCTGCATACCGGTCCGACCCCGTACTGCTCACGCAGCTTATCCAGCAGTGGCATCATTTTTTCCAGAGGCGGTCGAACTCCGCCTTCGCAAAAAAAAAGGGAGCCCGGCGGAGGAGAACGTTACTGCGGCGGAGGTTACGATTTTTCCGGTTCCGCTCCTTTAGAAGCCGGACGTCTACCCGGCTCTTTTTGCTGAACGTCAGCGTCTGAAAGAGCTGGAACGTGAAAATCGTGAACTGCGCCGCAGTAACGATATCCTTCGCCAGGCTTCCGCTTATTTTGCGAAGGCGGAGTTCGACCGCCTCTGGAAAAAATGATGCCACTGCTGGATAAGCTGCGTGAGCAGTACGGGGTCGGACCGGTATGCAGCGAACTGCATATTGCCCCGTCAACGTATTACCATTGTCAGCAACAGCGACATCATCCGGAT
+SRR587217.1 1 length=490
BT[^^^^^caccccacffffffhhhgghhhhefeggghhghghgfgggghfgggeghhhhhgggghhgggggggggggggggdgggggfffffffffffffddfffffbfffffffffffffffffffffffMFFFHOIIPHMWHOIIHHMM^IFTMFHOHIHMHWMWFMFFFTHHIIIHIOIHIIIHMFFFFFFOIIIHHIOH^IIIHMSFOIHMHMFPYIIIIIIIIFFMFMOWIIIPIHFFMOIYIII^^``^aaaccccccccffffffghghhggghghgghghhhggggefgfededbe`cgeghggegaefgghegggggggggccdggggceeggfT``dddfffdfffffac]Zbffdfdbdbdfffddd^^dfdfdfddffffdbdbdfffffffbdddf``c`bccQW`MSHYbdffff]cbbdbbbdYI^bd^[WdbbbfMXbbbf^b ^^b^bdbdddIX^d]W]QW^dd^^dcfW
@SRR587217.4 4 length=502
NTCCATGACTTCTGCACGCGTCGGCATCGGGTTAGTAATCATTGACTCCATCATCTGGGTCGCCGTGATTACCGCTCGGTTTAGCTGACGCGCACGACGGATCAACGCTTTCTGAATGCCGACCAGTTCCGGGTCGCCAATTTCCACACCGAGGTCGCCACGTGCAACCATTACCACGTCAGAGGCGAGGATGATGTCATCCATTGCATCCTGGCTGCAAACGGCTTCCGCACGTTCAACCTTGGCAACAAGTGATGCGAAAATTGTTGCCAAGGTTGAACGTGCGGAAGCCGTTTGCAGCCAGGATGCAATGGATGACATCATCCTCGCCTCTGACGTGGTAATGGTTGCACGTGGCGACCTCGGTGTGGAAATTGGCGACCCGGAACTGGTCGGCATTCAGAAAGCGTTGATCCGTCGTGCGCGTCAGCTAAACCGAGCGGTAATCACGGCGACCCAGATGATGGAGTCAATGATTACTAACCCGATGCCGACGCGTGCA
+SRR587217.4 4 length=502
BTT^^^aaccccccccffffeeggggghgggegghffghggggghhhhhhhhhhgghhhegegggdgghhgggggggggcgggffffffffffffffffffffffffffffffffffbffffffffdffffffb`fffffffffffffffffffbfffffffffffffffffffffffdfffffffffffWYbddfdddffffffffffffdfffbffffffffcW]ffff]cfddddffffffffffffd^^^^^aa`ccccccccffffffhhhghghghggggggggghgbegggfgghgeegggggghhhhhhhhhhgfghghfggbcdggggg^cceeecegeeffffeeedfffaaffdfbdbddfdfffddffffffcff[`^bfdfffffWbffddbdfffdb]dddbbfdfcfffffSWbdd^bP^db]QWcccOSWbbbddb`ffW`fff^bbI^YYbbYYYOYY^^b^^bbddbbWW]bdbQMMW]cS]dW


Each sequence is divided into 4 lines. First line shows us the sequence name.
Second line shows us the sequence itself. Third line shows us some comments (irrelevant for this programm) and the fourth line shows as the quality of the bases in ascii character (Phred score)



First of all i subdivide the three necessary lines (line1, line2 and line4)



Line one (sequence names) are saved in one list. Line two are saved in another list as well. In line four, I convert the ASCII character into numbers and save these numbers for each line in a different list. Example:



file=
"abcdefghijkl n
"abcdefghijkl"
list of numbers=[[1,2,3,4,5,6,7,8][1,2,3,4,5,6,7,8]]


with those lists i calculate means, medians, variance for some plotting.





The code works fine with smaller "test.files" but if I use the big one it overloads my PC.
Is there a way to compress the code?
Should I use less lists/dicts?



def main():
parser = ArgumentParser(prog='fastq', description=desc)
parser.add_argument('dateien', metavar='Files', nargs='*',
default=['sys.stdin'], help='Input of the file')
parser.add_argument('-p', '--phred',dest='phred',nargs= '?',action='store',
choices= [33 ,64 ], required=True, type=int )
parser.add_argument('-c', '--cutoff',dest = "cutoff",nargs='?',
action='store', default=1, type=int,
help='deletes whole Reads with bad quality')
parser.add_argument('-t', '--trim',dest = 'trim',nargs='?',
action='store', default=1, type=int,
help='Trimming Readends with bad quality')
#parser.add_argument('-v','--version',action='version', version="{prog}: {version}".format(prog=args.prog, version='0.1'))

args = parser.parse_args()

names = ; x=0 ;names1 = ""
seq_dict={} ; mean_dict={} ; qual_dict={}
qual_liste= ; i=0 ; sequenz= ; seq_name=
mean1= ; summe=0
quality_code=
gc_dict={}
qually=
quality=
with open('test.fastq', 'r') as seq:
for line in seq:
quality_code=
line=line.rstrip('n')
rest=i%4 #modulo to focus on specific lines
i+=1
#print(rest)

if rest==0: #line including the sequenz name
name=line[:-13] #some embellishment
name1=name.strip("@")
name2=name1.strip("length=" or "length" or "lengl" or "leng")
name3=name2.rstrip()
seq_name.append(name3) #List which includes lists of the seq names
elif rest == 1: #Line including sequenz nucleotides
seq=line[:-args.trim] #trimming the sequenz
sequenz.append(seq)
elif rest ==3: #line including quality score for sequenz
qual=line[:-args.trim]
if args.phred == 33:
quality_code=[[(ord(ii)-33) for ii in i] for i in qual.split('n')]
elif args.phred == 64:
quality_code=[[(ord(ii)-64) for ii in i] for i in qual.split('n')]
for list in quality_code:
mean=sum(list)/len(qual)
mean1.append(int(mean))

if mean >= int(args.cutoff):
seq_dict[name3]=seq
mean_dict[name3]=int(mean) #filter/cutoff of reads with worse quality
qual_dict[name3]=quality_code


The whole code includes some more lines about plotting to show the results.










share|improve this question
















bumped to the homepage by Community 14 hours ago


This question has answers that may be good or bad; the system has marked it active so that they can be reviewed.















  • What do you use sequenz for? You are storing every second line of the file in a list; if the file has 1.5GB, that means using hundreds of MBs of memory even if Python stored them very efficiently. It's also not very CPU friendly, since the list is constantly having to be expanded.
    – André Paramés
    Feb 15 at 9:13










  • You might want to look into this project: github.com/JohnLonginotto/SeQC
    – Austin Hastings
    Mar 17 at 12:37













up vote
3
down vote

favorite









up vote
3
down vote

favorite











My program evaluates fastq files.
The fastq file is just a huge text file (about 1,5gb in my case), which looks like this:



file=@SRR587217.1 1 length=490
NTCCGGATGATGTCGCTGTTGCTGACAATGGTAATACGTTGACGGGGCAATATGCAGTTCGCTGCATACCGGTCCGACCCCGTACTGCTCACGCAGCTTATCCAGCAGTGGCATCATTTTTTCCAGAGGCGGTCGAACTCCGCCTTCGCAAAAAAAAAGGGAGCCCGGCGGAGGAGAACGTTACTGCGGCGGAGGTTACGATTTTTCCGGTTCCGCTCCTTTAGAAGCCGGACGTCTACCCGGCTCTTTTTGCTGAACGTCAGCGTCTGAAAGAGCTGGAACGTGAAAATCGTGAACTGCGCCGCAGTAACGATATCCTTCGCCAGGCTTCCGCTTATTTTGCGAAGGCGGAGTTCGACCGCCTCTGGAAAAAATGATGCCACTGCTGGATAAGCTGCGTGAGCAGTACGGGGTCGGACCGGTATGCAGCGAACTGCATATTGCCCCGTCAACGTATTACCATTGTCAGCAACAGCGACATCATCCGGAT
+SRR587217.1 1 length=490
BT[^^^^^caccccacffffffhhhgghhhhefeggghhghghgfgggghfgggeghhhhhgggghhgggggggggggggggdgggggfffffffffffffddfffffbfffffffffffffffffffffffMFFFHOIIPHMWHOIIHHMM^IFTMFHOHIHMHWMWFMFFFTHHIIIHIOIHIIIHMFFFFFFOIIIHHIOH^IIIHMSFOIHMHMFPYIIIIIIIIFFMFMOWIIIPIHFFMOIYIII^^``^aaaccccccccffffffghghhggghghgghghhhggggefgfededbe`cgeghggegaefgghegggggggggccdggggceeggfT``dddfffdfffffac]Zbffdfdbdbdfffddd^^dfdfdfddffffdbdbdfffffffbdddf``c`bccQW`MSHYbdffff]cbbdbbbdYI^bd^[WdbbbfMXbbbf^b ^^b^bdbdddIX^d]W]QW^dd^^dcfW
@SRR587217.4 4 length=502
NTCCATGACTTCTGCACGCGTCGGCATCGGGTTAGTAATCATTGACTCCATCATCTGGGTCGCCGTGATTACCGCTCGGTTTAGCTGACGCGCACGACGGATCAACGCTTTCTGAATGCCGACCAGTTCCGGGTCGCCAATTTCCACACCGAGGTCGCCACGTGCAACCATTACCACGTCAGAGGCGAGGATGATGTCATCCATTGCATCCTGGCTGCAAACGGCTTCCGCACGTTCAACCTTGGCAACAAGTGATGCGAAAATTGTTGCCAAGGTTGAACGTGCGGAAGCCGTTTGCAGCCAGGATGCAATGGATGACATCATCCTCGCCTCTGACGTGGTAATGGTTGCACGTGGCGACCTCGGTGTGGAAATTGGCGACCCGGAACTGGTCGGCATTCAGAAAGCGTTGATCCGTCGTGCGCGTCAGCTAAACCGAGCGGTAATCACGGCGACCCAGATGATGGAGTCAATGATTACTAACCCGATGCCGACGCGTGCA
+SRR587217.4 4 length=502
BTT^^^aaccccccccffffeeggggghgggegghffghggggghhhhhhhhhhgghhhegegggdgghhgggggggggcgggffffffffffffffffffffffffffffffffffbffffffffdffffffb`fffffffffffffffffffbfffffffffffffffffffffffdfffffffffffWYbddfdddffffffffffffdfffbffffffffcW]ffff]cfddddffffffffffffd^^^^^aa`ccccccccffffffhhhghghghggggggggghgbegggfgghgeegggggghhhhhhhhhhgfghghfggbcdggggg^cceeecegeeffffeeedfffaaffdfbdbddfdfffddffffffcff[`^bfdfffffWbffddbdfffdb]dddbbfdfcfffffSWbdd^bP^db]QWcccOSWbbbddb`ffW`fff^bbI^YYbbYYYOYY^^b^^bbddbbWW]bdbQMMW]cS]dW


Each sequence is divided into 4 lines. First line shows us the sequence name.
Second line shows us the sequence itself. Third line shows us some comments (irrelevant for this programm) and the fourth line shows as the quality of the bases in ascii character (Phred score)



First of all i subdivide the three necessary lines (line1, line2 and line4)



Line one (sequence names) are saved in one list. Line two are saved in another list as well. In line four, I convert the ASCII character into numbers and save these numbers for each line in a different list. Example:



file=
"abcdefghijkl n
"abcdefghijkl"
list of numbers=[[1,2,3,4,5,6,7,8][1,2,3,4,5,6,7,8]]


with those lists i calculate means, medians, variance for some plotting.





The code works fine with smaller "test.files" but if I use the big one it overloads my PC.
Is there a way to compress the code?
Should I use less lists/dicts?



def main():
parser = ArgumentParser(prog='fastq', description=desc)
parser.add_argument('dateien', metavar='Files', nargs='*',
default=['sys.stdin'], help='Input of the file')
parser.add_argument('-p', '--phred',dest='phred',nargs= '?',action='store',
choices= [33 ,64 ], required=True, type=int )
parser.add_argument('-c', '--cutoff',dest = "cutoff",nargs='?',
action='store', default=1, type=int,
help='deletes whole Reads with bad quality')
parser.add_argument('-t', '--trim',dest = 'trim',nargs='?',
action='store', default=1, type=int,
help='Trimming Readends with bad quality')
#parser.add_argument('-v','--version',action='version', version="{prog}: {version}".format(prog=args.prog, version='0.1'))

args = parser.parse_args()

names = ; x=0 ;names1 = ""
seq_dict={} ; mean_dict={} ; qual_dict={}
qual_liste= ; i=0 ; sequenz= ; seq_name=
mean1= ; summe=0
quality_code=
gc_dict={}
qually=
quality=
with open('test.fastq', 'r') as seq:
for line in seq:
quality_code=
line=line.rstrip('n')
rest=i%4 #modulo to focus on specific lines
i+=1
#print(rest)

if rest==0: #line including the sequenz name
name=line[:-13] #some embellishment
name1=name.strip("@")
name2=name1.strip("length=" or "length" or "lengl" or "leng")
name3=name2.rstrip()
seq_name.append(name3) #List which includes lists of the seq names
elif rest == 1: #Line including sequenz nucleotides
seq=line[:-args.trim] #trimming the sequenz
sequenz.append(seq)
elif rest ==3: #line including quality score for sequenz
qual=line[:-args.trim]
if args.phred == 33:
quality_code=[[(ord(ii)-33) for ii in i] for i in qual.split('n')]
elif args.phred == 64:
quality_code=[[(ord(ii)-64) for ii in i] for i in qual.split('n')]
for list in quality_code:
mean=sum(list)/len(qual)
mean1.append(int(mean))

if mean >= int(args.cutoff):
seq_dict[name3]=seq
mean_dict[name3]=int(mean) #filter/cutoff of reads with worse quality
qual_dict[name3]=quality_code


The whole code includes some more lines about plotting to show the results.










share|improve this question















My program evaluates fastq files.
The fastq file is just a huge text file (about 1,5gb in my case), which looks like this:



file=@SRR587217.1 1 length=490
NTCCGGATGATGTCGCTGTTGCTGACAATGGTAATACGTTGACGGGGCAATATGCAGTTCGCTGCATACCGGTCCGACCCCGTACTGCTCACGCAGCTTATCCAGCAGTGGCATCATTTTTTCCAGAGGCGGTCGAACTCCGCCTTCGCAAAAAAAAAGGGAGCCCGGCGGAGGAGAACGTTACTGCGGCGGAGGTTACGATTTTTCCGGTTCCGCTCCTTTAGAAGCCGGACGTCTACCCGGCTCTTTTTGCTGAACGTCAGCGTCTGAAAGAGCTGGAACGTGAAAATCGTGAACTGCGCCGCAGTAACGATATCCTTCGCCAGGCTTCCGCTTATTTTGCGAAGGCGGAGTTCGACCGCCTCTGGAAAAAATGATGCCACTGCTGGATAAGCTGCGTGAGCAGTACGGGGTCGGACCGGTATGCAGCGAACTGCATATTGCCCCGTCAACGTATTACCATTGTCAGCAACAGCGACATCATCCGGAT
+SRR587217.1 1 length=490
BT[^^^^^caccccacffffffhhhgghhhhefeggghhghghgfgggghfgggeghhhhhgggghhgggggggggggggggdgggggfffffffffffffddfffffbfffffffffffffffffffffffMFFFHOIIPHMWHOIIHHMM^IFTMFHOHIHMHWMWFMFFFTHHIIIHIOIHIIIHMFFFFFFOIIIHHIOH^IIIHMSFOIHMHMFPYIIIIIIIIFFMFMOWIIIPIHFFMOIYIII^^``^aaaccccccccffffffghghhggghghgghghhhggggefgfededbe`cgeghggegaefgghegggggggggccdggggceeggfT``dddfffdfffffac]Zbffdfdbdbdfffddd^^dfdfdfddffffdbdbdfffffffbdddf``c`bccQW`MSHYbdffff]cbbdbbbdYI^bd^[WdbbbfMXbbbf^b ^^b^bdbdddIX^d]W]QW^dd^^dcfW
@SRR587217.4 4 length=502
NTCCATGACTTCTGCACGCGTCGGCATCGGGTTAGTAATCATTGACTCCATCATCTGGGTCGCCGTGATTACCGCTCGGTTTAGCTGACGCGCACGACGGATCAACGCTTTCTGAATGCCGACCAGTTCCGGGTCGCCAATTTCCACACCGAGGTCGCCACGTGCAACCATTACCACGTCAGAGGCGAGGATGATGTCATCCATTGCATCCTGGCTGCAAACGGCTTCCGCACGTTCAACCTTGGCAACAAGTGATGCGAAAATTGTTGCCAAGGTTGAACGTGCGGAAGCCGTTTGCAGCCAGGATGCAATGGATGACATCATCCTCGCCTCTGACGTGGTAATGGTTGCACGTGGCGACCTCGGTGTGGAAATTGGCGACCCGGAACTGGTCGGCATTCAGAAAGCGTTGATCCGTCGTGCGCGTCAGCTAAACCGAGCGGTAATCACGGCGACCCAGATGATGGAGTCAATGATTACTAACCCGATGCCGACGCGTGCA
+SRR587217.4 4 length=502
BTT^^^aaccccccccffffeeggggghgggegghffghggggghhhhhhhhhhgghhhegegggdgghhgggggggggcgggffffffffffffffffffffffffffffffffffbffffffffdffffffb`fffffffffffffffffffbfffffffffffffffffffffffdfffffffffffWYbddfdddffffffffffffdfffbffffffffcW]ffff]cfddddffffffffffffd^^^^^aa`ccccccccffffffhhhghghghggggggggghgbegggfgghgeegggggghhhhhhhhhhgfghghfggbcdggggg^cceeecegeeffffeeedfffaaffdfbdbddfdfffddffffffcff[`^bfdfffffWbffddbdfffdb]dddbbfdfcfffffSWbdd^bP^db]QWcccOSWbbbddb`ffW`fff^bbI^YYbbYYYOYY^^b^^bbddbbWW]bdbQMMW]cS]dW


Each sequence is divided into 4 lines. First line shows us the sequence name.
Second line shows us the sequence itself. Third line shows us some comments (irrelevant for this programm) and the fourth line shows as the quality of the bases in ascii character (Phred score)



First of all i subdivide the three necessary lines (line1, line2 and line4)



Line one (sequence names) are saved in one list. Line two are saved in another list as well. In line four, I convert the ASCII character into numbers and save these numbers for each line in a different list. Example:



file=
"abcdefghijkl n
"abcdefghijkl"
list of numbers=[[1,2,3,4,5,6,7,8][1,2,3,4,5,6,7,8]]


with those lists i calculate means, medians, variance for some plotting.





The code works fine with smaller "test.files" but if I use the big one it overloads my PC.
Is there a way to compress the code?
Should I use less lists/dicts?



def main():
parser = ArgumentParser(prog='fastq', description=desc)
parser.add_argument('dateien', metavar='Files', nargs='*',
default=['sys.stdin'], help='Input of the file')
parser.add_argument('-p', '--phred',dest='phred',nargs= '?',action='store',
choices= [33 ,64 ], required=True, type=int )
parser.add_argument('-c', '--cutoff',dest = "cutoff",nargs='?',
action='store', default=1, type=int,
help='deletes whole Reads with bad quality')
parser.add_argument('-t', '--trim',dest = 'trim',nargs='?',
action='store', default=1, type=int,
help='Trimming Readends with bad quality')
#parser.add_argument('-v','--version',action='version', version="{prog}: {version}".format(prog=args.prog, version='0.1'))

args = parser.parse_args()

names = ; x=0 ;names1 = ""
seq_dict={} ; mean_dict={} ; qual_dict={}
qual_liste= ; i=0 ; sequenz= ; seq_name=
mean1= ; summe=0
quality_code=
gc_dict={}
qually=
quality=
with open('test.fastq', 'r') as seq:
for line in seq:
quality_code=
line=line.rstrip('n')
rest=i%4 #modulo to focus on specific lines
i+=1
#print(rest)

if rest==0: #line including the sequenz name
name=line[:-13] #some embellishment
name1=name.strip("@")
name2=name1.strip("length=" or "length" or "lengl" or "leng")
name3=name2.rstrip()
seq_name.append(name3) #List which includes lists of the seq names
elif rest == 1: #Line including sequenz nucleotides
seq=line[:-args.trim] #trimming the sequenz
sequenz.append(seq)
elif rest ==3: #line including quality score for sequenz
qual=line[:-args.trim]
if args.phred == 33:
quality_code=[[(ord(ii)-33) for ii in i] for i in qual.split('n')]
elif args.phred == 64:
quality_code=[[(ord(ii)-64) for ii in i] for i in qual.split('n')]
for list in quality_code:
mean=sum(list)/len(qual)
mean1.append(int(mean))

if mean >= int(args.cutoff):
seq_dict[name3]=seq
mean_dict[name3]=int(mean) #filter/cutoff of reads with worse quality
qual_dict[name3]=quality_code


The whole code includes some more lines about plotting to show the results.







python performance python-3.x parsing bioinformatics






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited Feb 14 at 23:23









200_success

128k15149412




128k15149412










asked Feb 14 at 20:38









Julian

161




161





bumped to the homepage by Community 14 hours ago


This question has answers that may be good or bad; the system has marked it active so that they can be reviewed.







bumped to the homepage by Community 14 hours ago


This question has answers that may be good or bad; the system has marked it active so that they can be reviewed.














  • What do you use sequenz for? You are storing every second line of the file in a list; if the file has 1.5GB, that means using hundreds of MBs of memory even if Python stored them very efficiently. It's also not very CPU friendly, since the list is constantly having to be expanded.
    – André Paramés
    Feb 15 at 9:13










  • You might want to look into this project: github.com/JohnLonginotto/SeQC
    – Austin Hastings
    Mar 17 at 12:37


















  • What do you use sequenz for? You are storing every second line of the file in a list; if the file has 1.5GB, that means using hundreds of MBs of memory even if Python stored them very efficiently. It's also not very CPU friendly, since the list is constantly having to be expanded.
    – André Paramés
    Feb 15 at 9:13










  • You might want to look into this project: github.com/JohnLonginotto/SeQC
    – Austin Hastings
    Mar 17 at 12:37
















What do you use sequenz for? You are storing every second line of the file in a list; if the file has 1.5GB, that means using hundreds of MBs of memory even if Python stored them very efficiently. It's also not very CPU friendly, since the list is constantly having to be expanded.
– André Paramés
Feb 15 at 9:13




What do you use sequenz for? You are storing every second line of the file in a list; if the file has 1.5GB, that means using hundreds of MBs of memory even if Python stored them very efficiently. It's also not very CPU friendly, since the list is constantly having to be expanded.
– André Paramés
Feb 15 at 9:13












You might want to look into this project: github.com/JohnLonginotto/SeQC
– Austin Hastings
Mar 17 at 12:37




You might want to look into this project: github.com/JohnLonginotto/SeQC
– Austin Hastings
Mar 17 at 12:37










1 Answer
1






active

oldest

votes

















up vote
0
down vote













This won't solve your problem, but here are some suggestions regarding the organization of the code:





  • You can use a function like this to fetch four lines at once, which I think makes your code easier to understand than using rest:



    with open('test.fastq', 'r') as seq:
    for line1, line2, line3, line4 in group_it(4, seq):
    ...



  • Instead of doing if args.phred == 33: you can just use the variable in the calculation, avoiding the repetition:



    quality_code=[[(ord(ii)-args.phred) for ii in i] for i in qual.split('n')]







share|improve this answer





















    Your Answer





    StackExchange.ifUsing("editor", function () {
    return StackExchange.using("mathjaxEditing", function () {
    StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
    StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["\$", "\$"]]);
    });
    });
    }, "mathjax-editing");

    StackExchange.ifUsing("editor", function () {
    StackExchange.using("externalEditor", function () {
    StackExchange.using("snippets", function () {
    StackExchange.snippets.init();
    });
    });
    }, "code-snippets");

    StackExchange.ready(function() {
    var channelOptions = {
    tags: "".split(" "),
    id: "196"
    };
    initTagRenderer("".split(" "), "".split(" "), channelOptions);

    StackExchange.using("externalEditor", function() {
    // Have to fire editor after snippets, if snippets enabled
    if (StackExchange.settings.snippets.snippetsEnabled) {
    StackExchange.using("snippets", function() {
    createEditor();
    });
    }
    else {
    createEditor();
    }
    });

    function createEditor() {
    StackExchange.prepareEditor({
    heartbeatType: 'answer',
    convertImagesToLinks: false,
    noModals: true,
    showLowRepImageUploadWarning: true,
    reputationToPostImages: null,
    bindNavPrevention: true,
    postfix: "",
    imageUploader: {
    brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
    contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
    allowUrls: true
    },
    onDemand: true,
    discardSelector: ".discard-answer"
    ,immediatelyShowMarkdownHelp:true
    });


    }
    });














    draft saved

    draft discarded


















    StackExchange.ready(
    function () {
    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f187589%2fprogram-to-process-fastq-files%23new-answer', 'question_page');
    }
    );

    Post as a guest















    Required, but never shown

























    1 Answer
    1






    active

    oldest

    votes








    1 Answer
    1






    active

    oldest

    votes









    active

    oldest

    votes






    active

    oldest

    votes








    up vote
    0
    down vote













    This won't solve your problem, but here are some suggestions regarding the organization of the code:





    • You can use a function like this to fetch four lines at once, which I think makes your code easier to understand than using rest:



      with open('test.fastq', 'r') as seq:
      for line1, line2, line3, line4 in group_it(4, seq):
      ...



    • Instead of doing if args.phred == 33: you can just use the variable in the calculation, avoiding the repetition:



      quality_code=[[(ord(ii)-args.phred) for ii in i] for i in qual.split('n')]







    share|improve this answer

























      up vote
      0
      down vote













      This won't solve your problem, but here are some suggestions regarding the organization of the code:





      • You can use a function like this to fetch four lines at once, which I think makes your code easier to understand than using rest:



        with open('test.fastq', 'r') as seq:
        for line1, line2, line3, line4 in group_it(4, seq):
        ...



      • Instead of doing if args.phred == 33: you can just use the variable in the calculation, avoiding the repetition:



        quality_code=[[(ord(ii)-args.phred) for ii in i] for i in qual.split('n')]







      share|improve this answer























        up vote
        0
        down vote










        up vote
        0
        down vote









        This won't solve your problem, but here are some suggestions regarding the organization of the code:





        • You can use a function like this to fetch four lines at once, which I think makes your code easier to understand than using rest:



          with open('test.fastq', 'r') as seq:
          for line1, line2, line3, line4 in group_it(4, seq):
          ...



        • Instead of doing if args.phred == 33: you can just use the variable in the calculation, avoiding the repetition:



          quality_code=[[(ord(ii)-args.phred) for ii in i] for i in qual.split('n')]







        share|improve this answer












        This won't solve your problem, but here are some suggestions regarding the organization of the code:





        • You can use a function like this to fetch four lines at once, which I think makes your code easier to understand than using rest:



          with open('test.fastq', 'r') as seq:
          for line1, line2, line3, line4 in group_it(4, seq):
          ...



        • Instead of doing if args.phred == 33: you can just use the variable in the calculation, avoiding the repetition:



          quality_code=[[(ord(ii)-args.phred) for ii in i] for i in qual.split('n')]








        share|improve this answer












        share|improve this answer



        share|improve this answer










        answered Feb 15 at 10:20









        André Paramés

        14315




        14315






























            draft saved

            draft discarded




















































            Thanks for contributing an answer to Code Review Stack Exchange!


            • Please be sure to answer the question. Provide details and share your research!

            But avoid



            • Asking for help, clarification, or responding to other answers.

            • Making statements based on opinion; back them up with references or personal experience.


            Use MathJax to format equations. MathJax reference.


            To learn more, see our tips on writing great answers.





            Some of your past answers have not been well-received, and you're in danger of being blocked from answering.


            Please pay close attention to the following guidance:


            • Please be sure to answer the question. Provide details and share your research!

            But avoid



            • Asking for help, clarification, or responding to other answers.

            • Making statements based on opinion; back them up with references or personal experience.


            To learn more, see our tips on writing great answers.




            draft saved


            draft discarded














            StackExchange.ready(
            function () {
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f187589%2fprogram-to-process-fastq-files%23new-answer', 'question_page');
            }
            );

            Post as a guest















            Required, but never shown





















































            Required, but never shown














            Required, but never shown












            Required, but never shown







            Required, but never shown

































            Required, but never shown














            Required, but never shown












            Required, but never shown







            Required, but never shown







            Popular posts from this blog

            Ellipse (mathématiques)

            Quarter-circle Tiles

            Mont Emei