Program to process fastq files

up vote
3
down vote

favorite

My program evaluates fastq files.
The fastq file is just a huge text file (about 1,5gb in my case), which looks like this:

file=@SRR587217.1 1 length=490

NTCCGGATGATGTCGCTGTTGCTGACAATGGTAATACGTTGACGGGGCAATATGCAGTTCGCTGCATACCGGTCCGACCCCGTACTGCTCACGCAGCTTATCCAGCAGTGGCATCATTTTTTCCAGAGGCGGTCGAACTCCGCCTTCGCAAAAAAAAAGGGAGCCCGGCGGAGGAGAACGTTACTGCGGCGGAGGTTACGATTTTTCCGGTTCCGCTCCTTTAGAAGCCGGACGTCTACCCGGCTCTTTTTGCTGAACGTCAGCGTCTGAAAGAGCTGGAACGTGAAAATCGTGAACTGCGCCGCAGTAACGATATCCTTCGCCAGGCTTCCGCTTATTTTGCGAAGGCGGAGTTCGACCGCCTCTGGAAAAAATGATGCCACTGCTGGATAAGCTGCGTGAGCAGTACGGGGTCGGACCGGTATGCAGCGAACTGCATATTGCCCCGTCAACGTATTACCATTGTCAGCAACAGCGACATCATCCGGAT

+SRR587217.1 1 length=490

BT[^^^^^caccccacffffffhhhgghhhhefeggghhghghgfgggghfgggeghhhhhgggghhgggggggggggggggdgggggfffffffffffffddfffffbfffffffffffffffffffffffMFFFHOIIPHMWHOIIHHMM^IFTMFHOHIHMHWMWFMFFFTHHIIIHIOIHIIIHMFFFFFFOIIIHHIOH^IIIHMSFOIHMHMFPYIIIIIIIIFFMFMOWIIIPIHFFMOIYIII^^``^aaaccccccccffffffghghhggghghgghghhhggggefgfededbe`cgeghggegaefgghegggggggggccdggggceeggfT``dddfffdfffffac]Zbffdfdbdbdfffddd^^dfdfdfddffffdbdbdfffffffbdddf``c`bccQW`MSHYbdffff]cbbdbbbdYI^bd^[WdbbbfMXbbbf^b   ^^b^bdbdddIX^d]W]QW^dd^^dcfW

@SRR587217.4 4 length=502

NTCCATGACTTCTGCACGCGTCGGCATCGGGTTAGTAATCATTGACTCCATCATCTGGGTCGCCGTGATTACCGCTCGGTTTAGCTGACGCGCACGACGGATCAACGCTTTCTGAATGCCGACCAGTTCCGGGTCGCCAATTTCCACACCGAGGTCGCCACGTGCAACCATTACCACGTCAGAGGCGAGGATGATGTCATCCATTGCATCCTGGCTGCAAACGGCTTCCGCACGTTCAACCTTGGCAACAAGTGATGCGAAAATTGTTGCCAAGGTTGAACGTGCGGAAGCCGTTTGCAGCCAGGATGCAATGGATGACATCATCCTCGCCTCTGACGTGGTAATGGTTGCACGTGGCGACCTCGGTGTGGAAATTGGCGACCCGGAACTGGTCGGCATTCAGAAAGCGTTGATCCGTCGTGCGCGTCAGCTAAACCGAGCGGTAATCACGGCGACCCAGATGATGGAGTCAATGATTACTAACCCGATGCCGACGCGTGCA

+SRR587217.4 4 length=502

BTT^^^aaccccccccffffeeggggghgggegghffghggggghhhhhhhhhhgghhhegegggdgghhgggggggggcgggffffffffffffffffffffffffffffffffffbffffffffdffffffb`fffffffffffffffffffbfffffffffffffffffffffffdfffffffffffWYbddfdddffffffffffffdfffbffffffffcW]ffff]cfddddffffffffffffd^^^^^aa`ccccccccffffffhhhghghghggggggggghgbegggfgghgeegggggghhhhhhhhhhgfghghfggbcdggggg^cceeecegeeffffeeedfffaaffdfbdbddfdfffddffffffcff[`^bfdfffffWbffddbdfffdb]dddbbfdfcfffffSWbdd^bP^db]QWcccOSWbbbddb`ffW`fff^bbI^YYbbYYYOYY^^b^^bbddbbWW]bdbQMMW]cS]dW

Each sequence is divided into 4 lines. First line shows us the sequence name.
Second line shows us the sequence itself. Third line shows us some comments (irrelevant for this programm) and the fourth line shows as the quality of the bases in ascii character (Phred score)

First of all i subdivide the three necessary lines (line1, line2 and line4)

Line one (sequence names) are saved in one list. Line two are saved in another list as well. In line four, I convert the ASCII character into numbers and save these numbers for each line in a different list. Example:

file=

"abcdefghijkl n

"abcdefghijkl"

list of numbers=[[1,2,3,4,5,6,7,8][1,2,3,4,5,6,7,8]]

with those lists i calculate means, medians, variance for some plotting.

The code works fine with smaller "test.files" but if I use the big one it overloads my PC.
Is there a way to compress the code?
Should I use less lists/dicts?

def main():

    parser = ArgumentParser(prog='fastq', description=desc)

    parser.add_argument('dateien', metavar='Files', nargs='*',

                    default=['sys.stdin'], help='Input of the file')

    parser.add_argument('-p', '--phred',dest='phred',nargs=  '?',action='store',

                    choices= [33 ,64 ], required=True, type=int ) 

    parser.add_argument('-c', '--cutoff',dest = "cutoff",nargs='?',

                    action='store', default=1, type=int,

                    help='deletes whole Reads with bad quality')

    parser.add_argument('-t', '--trim',dest = 'trim',nargs='?',

                    action='store', default=1, type=int,

                    help='Trimming Readends with bad quality')

    #parser.add_argument('-v','--version',action='version', version="{prog}: {version}".format(prog=args.prog, version='0.1'))



    args = parser.parse_args()



    names =  ; x=0 ;names1 = ""

    seq_dict={} ; mean_dict={} ; qual_dict={}

    qual_liste= ; i=0 ; sequenz= ; seq_name=

    mean1= ; summe=0

    quality_code=

    gc_dict={}

    qually=

    quality=

    with open('test.fastq', 'r') as seq:

      for line in seq:                

        quality_code=

        line=line.rstrip('n')   

        rest=i%4                   #modulo to focus on specific lines

        i+=1                        

        #print(rest)



        if rest==0:     #line including the sequenz name

            name=line[:-13]    #some embellishment

            name1=name.strip("@")

            name2=name1.strip("length=" or "length" or "lengl" or "leng")

            name3=name2.rstrip()

            seq_name.append(name3) #List which includes lists of the seq names

        elif rest == 1:  #Line including sequenz nucleotides

            seq=line[:-args.trim] #trimming the sequenz

            sequenz.append(seq)

        elif rest ==3:    #line including quality score for sequenz

            qual=line[:-args.trim]           

            if args.phred == 33:

                quality_code=[[(ord(ii)-33) for ii in i] for i in qual.split('n')]

            elif args.phred == 64:

                quality_code=[[(ord(ii)-64) for ii in i] for i in qual.split('n')]

            for list in quality_code:

                mean=sum(list)/len(qual)              

            mean1.append(int(mean))    



            if mean >= int(args.cutoff):

                seq_dict[name3]=seq

                mean_dict[name3]=int(mean)   #filter/cutoff of reads with worse quality

                qual_dict[name3]=quality_code

The whole code includes some more lines about plotting to show the results.

edited Feb 14 at 23:23

200_success

128k15149412

asked Feb 14 at 20:38

Julian

161

bumped to the homepage by Community♦ 14 hours ago

This question has answers that may be good or bad; the system has marked it active so that they can be reviewed.

What do you use sequenz for? You are storing every second line of the file in a list; if the file has 1.5GB, that means using hundreds of MBs of memory even if Python stored them very efficiently. It's also not very CPU friendly, since the list is constantly having to be expanded.
– André Paramés
Feb 15 at 9:13

You might want to look into this project: github.com/JohnLonginotto/SeQC
– Austin Hastings
Mar 17 at 12:37

add a comment |

up vote
3
down vote

favorite

My program evaluates fastq files.
The fastq file is just a huge text file (about 1,5gb in my case), which looks like this:

file=@SRR587217.1 1 length=490

NTCCGGATGATGTCGCTGTTGCTGACAATGGTAATACGTTGACGGGGCAATATGCAGTTCGCTGCATACCGGTCCGACCCCGTACTGCTCACGCAGCTTATCCAGCAGTGGCATCATTTTTTCCAGAGGCGGTCGAACTCCGCCTTCGCAAAAAAAAAGGGAGCCCGGCGGAGGAGAACGTTACTGCGGCGGAGGTTACGATTTTTCCGGTTCCGCTCCTTTAGAAGCCGGACGTCTACCCGGCTCTTTTTGCTGAACGTCAGCGTCTGAAAGAGCTGGAACGTGAAAATCGTGAACTGCGCCGCAGTAACGATATCCTTCGCCAGGCTTCCGCTTATTTTGCGAAGGCGGAGTTCGACCGCCTCTGGAAAAAATGATGCCACTGCTGGATAAGCTGCGTGAGCAGTACGGGGTCGGACCGGTATGCAGCGAACTGCATATTGCCCCGTCAACGTATTACCATTGTCAGCAACAGCGACATCATCCGGAT

+SRR587217.1 1 length=490

BT[^^^^^caccccacffffffhhhgghhhhefeggghhghghgfgggghfgggeghhhhhgggghhgggggggggggggggdgggggfffffffffffffddfffffbfffffffffffffffffffffffMFFFHOIIPHMWHOIIHHMM^IFTMFHOHIHMHWMWFMFFFTHHIIIHIOIHIIIHMFFFFFFOIIIHHIOH^IIIHMSFOIHMHMFPYIIIIIIIIFFMFMOWIIIPIHFFMOIYIII^^``^aaaccccccccffffffghghhggghghgghghhhggggefgfededbe`cgeghggegaefgghegggggggggccdggggceeggfT``dddfffdfffffac]Zbffdfdbdbdfffddd^^dfdfdfddffffdbdbdfffffffbdddf``c`bccQW`MSHYbdffff]cbbdbbbdYI^bd^[WdbbbfMXbbbf^b   ^^b^bdbdddIX^d]W]QW^dd^^dcfW

@SRR587217.4 4 length=502

NTCCATGACTTCTGCACGCGTCGGCATCGGGTTAGTAATCATTGACTCCATCATCTGGGTCGCCGTGATTACCGCTCGGTTTAGCTGACGCGCACGACGGATCAACGCTTTCTGAATGCCGACCAGTTCCGGGTCGCCAATTTCCACACCGAGGTCGCCACGTGCAACCATTACCACGTCAGAGGCGAGGATGATGTCATCCATTGCATCCTGGCTGCAAACGGCTTCCGCACGTTCAACCTTGGCAACAAGTGATGCGAAAATTGTTGCCAAGGTTGAACGTGCGGAAGCCGTTTGCAGCCAGGATGCAATGGATGACATCATCCTCGCCTCTGACGTGGTAATGGTTGCACGTGGCGACCTCGGTGTGGAAATTGGCGACCCGGAACTGGTCGGCATTCAGAAAGCGTTGATCCGTCGTGCGCGTCAGCTAAACCGAGCGGTAATCACGGCGACCCAGATGATGGAGTCAATGATTACTAACCCGATGCCGACGCGTGCA

+SRR587217.4 4 length=502

BTT^^^aaccccccccffffeeggggghgggegghffghggggghhhhhhhhhhgghhhegegggdgghhgggggggggcgggffffffffffffffffffffffffffffffffffbffffffffdffffffb`fffffffffffffffffffbfffffffffffffffffffffffdfffffffffffWYbddfdddffffffffffffdfffbffffffffcW]ffff]cfddddffffffffffffd^^^^^aa`ccccccccffffffhhhghghghggggggggghgbegggfgghgeegggggghhhhhhhhhhgfghghfggbcdggggg^cceeecegeeffffeeedfffaaffdfbdbddfdfffddffffffcff[`^bfdfffffWbffddbdfffdb]dddbbfdfcfffffSWbdd^bP^db]QWcccOSWbbbddb`ffW`fff^bbI^YYbbYYYOYY^^b^^bbddbbWW]bdbQMMW]cS]dW

First of all i subdivide the three necessary lines (line1, line2 and line4)

file=

"abcdefghijkl n

"abcdefghijkl"

list of numbers=[[1,2,3,4,5,6,7,8][1,2,3,4,5,6,7,8]]

with those lists i calculate means, medians, variance for some plotting.

The code works fine with smaller "test.files" but if I use the big one it overloads my PC.
Is there a way to compress the code?
Should I use less lists/dicts?

def main():

    parser = ArgumentParser(prog='fastq', description=desc)

    parser.add_argument('dateien', metavar='Files', nargs='*',

                    default=['sys.stdin'], help='Input of the file')

    parser.add_argument('-p', '--phred',dest='phred',nargs=  '?',action='store',

                    choices= [33 ,64 ], required=True, type=int ) 

    parser.add_argument('-c', '--cutoff',dest = "cutoff",nargs='?',

                    action='store', default=1, type=int,

                    help='deletes whole Reads with bad quality')

    parser.add_argument('-t', '--trim',dest = 'trim',nargs='?',

                    action='store', default=1, type=int,

                    help='Trimming Readends with bad quality')

    #parser.add_argument('-v','--version',action='version', version="{prog}: {version}".format(prog=args.prog, version='0.1'))



    args = parser.parse_args()



    names =  ; x=0 ;names1 = ""

    seq_dict={} ; mean_dict={} ; qual_dict={}

    qual_liste= ; i=0 ; sequenz= ; seq_name=

    mean1= ; summe=0

    quality_code=

    gc_dict={}

    qually=

    quality=

    with open('test.fastq', 'r') as seq:

      for line in seq:                

        quality_code=

        line=line.rstrip('n')   

        rest=i%4                   #modulo to focus on specific lines

        i+=1                        

        #print(rest)



        if rest==0:     #line including the sequenz name

            name=line[:-13]    #some embellishment

            name1=name.strip("@")

            name2=name1.strip("length=" or "length" or "lengl" or "leng")

            name3=name2.rstrip()

            seq_name.append(name3) #List which includes lists of the seq names

        elif rest == 1:  #Line including sequenz nucleotides

            seq=line[:-args.trim] #trimming the sequenz

            sequenz.append(seq)

        elif rest ==3:    #line including quality score for sequenz

            qual=line[:-args.trim]           

            if args.phred == 33:

                quality_code=[[(ord(ii)-33) for ii in i] for i in qual.split('n')]

            elif args.phred == 64:

                quality_code=[[(ord(ii)-64) for ii in i] for i in qual.split('n')]

            for list in quality_code:

                mean=sum(list)/len(qual)              

            mean1.append(int(mean))    



            if mean >= int(args.cutoff):

                seq_dict[name3]=seq

                mean_dict[name3]=int(mean)   #filter/cutoff of reads with worse quality

                qual_dict[name3]=quality_code

The whole code includes some more lines about plotting to show the results.

edited Feb 14 at 23:23

200_success

128k15149412

asked Feb 14 at 20:38

Julian

161

bumped to the homepage by Community♦ 14 hours ago

This question has answers that may be good or bad; the system has marked it active so that they can be reviewed.

What do you use sequenz for? You are storing every second line of the file in a list; if the file has 1.5GB, that means using hundreds of MBs of memory even if Python stored them very efficiently. It's also not very CPU friendly, since the list is constantly having to be expanded.
– André Paramés
Feb 15 at 9:13

You might want to look into this project: github.com/JohnLonginotto/SeQC
– Austin Hastings
Mar 17 at 12:37

add a comment |

up vote
3
down vote

favorite

My program evaluates fastq files.
The fastq file is just a huge text file (about 1,5gb in my case), which looks like this:

file=@SRR587217.1 1 length=490

NTCCGGATGATGTCGCTGTTGCTGACAATGGTAATACGTTGACGGGGCAATATGCAGTTCGCTGCATACCGGTCCGACCCCGTACTGCTCACGCAGCTTATCCAGCAGTGGCATCATTTTTTCCAGAGGCGGTCGAACTCCGCCTTCGCAAAAAAAAAGGGAGCCCGGCGGAGGAGAACGTTACTGCGGCGGAGGTTACGATTTTTCCGGTTCCGCTCCTTTAGAAGCCGGACGTCTACCCGGCTCTTTTTGCTGAACGTCAGCGTCTGAAAGAGCTGGAACGTGAAAATCGTGAACTGCGCCGCAGTAACGATATCCTTCGCCAGGCTTCCGCTTATTTTGCGAAGGCGGAGTTCGACCGCCTCTGGAAAAAATGATGCCACTGCTGGATAAGCTGCGTGAGCAGTACGGGGTCGGACCGGTATGCAGCGAACTGCATATTGCCCCGTCAACGTATTACCATTGTCAGCAACAGCGACATCATCCGGAT

+SRR587217.1 1 length=490

BT[^^^^^caccccacffffffhhhgghhhhefeggghhghghgfgggghfgggeghhhhhgggghhgggggggggggggggdgggggfffffffffffffddfffffbfffffffffffffffffffffffMFFFHOIIPHMWHOIIHHMM^IFTMFHOHIHMHWMWFMFFFTHHIIIHIOIHIIIHMFFFFFFOIIIHHIOH^IIIHMSFOIHMHMFPYIIIIIIIIFFMFMOWIIIPIHFFMOIYIII^^``^aaaccccccccffffffghghhggghghgghghhhggggefgfededbe`cgeghggegaefgghegggggggggccdggggceeggfT``dddfffdfffffac]Zbffdfdbdbdfffddd^^dfdfdfddffffdbdbdfffffffbdddf``c`bccQW`MSHYbdffff]cbbdbbbdYI^bd^[WdbbbfMXbbbf^b   ^^b^bdbdddIX^d]W]QW^dd^^dcfW

@SRR587217.4 4 length=502

NTCCATGACTTCTGCACGCGTCGGCATCGGGTTAGTAATCATTGACTCCATCATCTGGGTCGCCGTGATTACCGCTCGGTTTAGCTGACGCGCACGACGGATCAACGCTTTCTGAATGCCGACCAGTTCCGGGTCGCCAATTTCCACACCGAGGTCGCCACGTGCAACCATTACCACGTCAGAGGCGAGGATGATGTCATCCATTGCATCCTGGCTGCAAACGGCTTCCGCACGTTCAACCTTGGCAACAAGTGATGCGAAAATTGTTGCCAAGGTTGAACGTGCGGAAGCCGTTTGCAGCCAGGATGCAATGGATGACATCATCCTCGCCTCTGACGTGGTAATGGTTGCACGTGGCGACCTCGGTGTGGAAATTGGCGACCCGGAACTGGTCGGCATTCAGAAAGCGTTGATCCGTCGTGCGCGTCAGCTAAACCGAGCGGTAATCACGGCGACCCAGATGATGGAGTCAATGATTACTAACCCGATGCCGACGCGTGCA

+SRR587217.4 4 length=502

BTT^^^aaccccccccffffeeggggghgggegghffghggggghhhhhhhhhhgghhhegegggdgghhgggggggggcgggffffffffffffffffffffffffffffffffffbffffffffdffffffb`fffffffffffffffffffbfffffffffffffffffffffffdfffffffffffWYbddfdddffffffffffffdfffbffffffffcW]ffff]cfddddffffffffffffd^^^^^aa`ccccccccffffffhhhghghghggggggggghgbegggfgghgeegggggghhhhhhhhhhgfghghfggbcdggggg^cceeecegeeffffeeedfffaaffdfbdbddfdfffddffffffcff[`^bfdfffffWbffddbdfffdb]dddbbfdfcfffffSWbdd^bP^db]QWcccOSWbbbddb`ffW`fff^bbI^YYbbYYYOYY^^b^^bbddbbWW]bdbQMMW]cS]dW

First of all i subdivide the three necessary lines (line1, line2 and line4)

file=

"abcdefghijkl n

"abcdefghijkl"

list of numbers=[[1,2,3,4,5,6,7,8][1,2,3,4,5,6,7,8]]

with those lists i calculate means, medians, variance for some plotting.

The code works fine with smaller "test.files" but if I use the big one it overloads my PC.
Is there a way to compress the code?
Should I use less lists/dicts?

def main():

    parser = ArgumentParser(prog='fastq', description=desc)

    parser.add_argument('dateien', metavar='Files', nargs='*',

                    default=['sys.stdin'], help='Input of the file')

    parser.add_argument('-p', '--phred',dest='phred',nargs=  '?',action='store',

                    choices= [33 ,64 ], required=True, type=int ) 

    parser.add_argument('-c', '--cutoff',dest = "cutoff",nargs='?',

                    action='store', default=1, type=int,

                    help='deletes whole Reads with bad quality')

    parser.add_argument('-t', '--trim',dest = 'trim',nargs='?',

                    action='store', default=1, type=int,

                    help='Trimming Readends with bad quality')

    #parser.add_argument('-v','--version',action='version', version="{prog}: {version}".format(prog=args.prog, version='0.1'))



    args = parser.parse_args()



    names =  ; x=0 ;names1 = ""

    seq_dict={} ; mean_dict={} ; qual_dict={}

    qual_liste= ; i=0 ; sequenz= ; seq_name=

    mean1= ; summe=0

    quality_code=

    gc_dict={}

    qually=

    quality=

    with open('test.fastq', 'r') as seq:

      for line in seq:                

        quality_code=

        line=line.rstrip('n')   

        rest=i%4                   #modulo to focus on specific lines

        i+=1                        

        #print(rest)



        if rest==0:     #line including the sequenz name

            name=line[:-13]    #some embellishment

            name1=name.strip("@")

            name2=name1.strip("length=" or "length" or "lengl" or "leng")

            name3=name2.rstrip()

            seq_name.append(name3) #List which includes lists of the seq names

        elif rest == 1:  #Line including sequenz nucleotides

            seq=line[:-args.trim] #trimming the sequenz

            sequenz.append(seq)

        elif rest ==3:    #line including quality score for sequenz

            qual=line[:-args.trim]           

            if args.phred == 33:

                quality_code=[[(ord(ii)-33) for ii in i] for i in qual.split('n')]

            elif args.phred == 64:

                quality_code=[[(ord(ii)-64) for ii in i] for i in qual.split('n')]

            for list in quality_code:

                mean=sum(list)/len(qual)              

            mean1.append(int(mean))    



            if mean >= int(args.cutoff):

                seq_dict[name3]=seq

                mean_dict[name3]=int(mean)   #filter/cutoff of reads with worse quality

                qual_dict[name3]=quality_code

The whole code includes some more lines about plotting to show the results.

edited Feb 14 at 23:23

200_success

128k15149412

asked Feb 14 at 20:38

Julian

161

My program evaluates fastq files.
The fastq file is just a huge text file (about 1,5gb in my case), which looks like this:

file=@SRR587217.1 1 length=490

NTCCGGATGATGTCGCTGTTGCTGACAATGGTAATACGTTGACGGGGCAATATGCAGTTCGCTGCATACCGGTCCGACCCCGTACTGCTCACGCAGCTTATCCAGCAGTGGCATCATTTTTTCCAGAGGCGGTCGAACTCCGCCTTCGCAAAAAAAAAGGGAGCCCGGCGGAGGAGAACGTTACTGCGGCGGAGGTTACGATTTTTCCGGTTCCGCTCCTTTAGAAGCCGGACGTCTACCCGGCTCTTTTTGCTGAACGTCAGCGTCTGAAAGAGCTGGAACGTGAAAATCGTGAACTGCGCCGCAGTAACGATATCCTTCGCCAGGCTTCCGCTTATTTTGCGAAGGCGGAGTTCGACCGCCTCTGGAAAAAATGATGCCACTGCTGGATAAGCTGCGTGAGCAGTACGGGGTCGGACCGGTATGCAGCGAACTGCATATTGCCCCGTCAACGTATTACCATTGTCAGCAACAGCGACATCATCCGGAT

+SRR587217.1 1 length=490

BT[^^^^^caccccacffffffhhhgghhhhefeggghhghghgfgggghfgggeghhhhhgggghhgggggggggggggggdgggggfffffffffffffddfffffbfffffffffffffffffffffffMFFFHOIIPHMWHOIIHHMM^IFTMFHOHIHMHWMWFMFFFTHHIIIHIOIHIIIHMFFFFFFOIIIHHIOH^IIIHMSFOIHMHMFPYIIIIIIIIFFMFMOWIIIPIHFFMOIYIII^^``^aaaccccccccffffffghghhggghghgghghhhggggefgfededbe`cgeghggegaefgghegggggggggccdggggceeggfT``dddfffdfffffac]Zbffdfdbdbdfffddd^^dfdfdfddffffdbdbdfffffffbdddf``c`bccQW`MSHYbdffff]cbbdbbbdYI^bd^[WdbbbfMXbbbf^b   ^^b^bdbdddIX^d]W]QW^dd^^dcfW

@SRR587217.4 4 length=502

NTCCATGACTTCTGCACGCGTCGGCATCGGGTTAGTAATCATTGACTCCATCATCTGGGTCGCCGTGATTACCGCTCGGTTTAGCTGACGCGCACGACGGATCAACGCTTTCTGAATGCCGACCAGTTCCGGGTCGCCAATTTCCACACCGAGGTCGCCACGTGCAACCATTACCACGTCAGAGGCGAGGATGATGTCATCCATTGCATCCTGGCTGCAAACGGCTTCCGCACGTTCAACCTTGGCAACAAGTGATGCGAAAATTGTTGCCAAGGTTGAACGTGCGGAAGCCGTTTGCAGCCAGGATGCAATGGATGACATCATCCTCGCCTCTGACGTGGTAATGGTTGCACGTGGCGACCTCGGTGTGGAAATTGGCGACCCGGAACTGGTCGGCATTCAGAAAGCGTTGATCCGTCGTGCGCGTCAGCTAAACCGAGCGGTAATCACGGCGACCCAGATGATGGAGTCAATGATTACTAACCCGATGCCGACGCGTGCA

+SRR587217.4 4 length=502

BTT^^^aaccccccccffffeeggggghgggegghffghggggghhhhhhhhhhgghhhegegggdgghhgggggggggcgggffffffffffffffffffffffffffffffffffbffffffffdffffffb`fffffffffffffffffffbfffffffffffffffffffffffdfffffffffffWYbddfdddffffffffffffdfffbffffffffcW]ffff]cfddddffffffffffffd^^^^^aa`ccccccccffffffhhhghghghggggggggghgbegggfgghgeegggggghhhhhhhhhhgfghghfggbcdggggg^cceeecegeeffffeeedfffaaffdfbdbddfdfffddffffffcff[`^bfdfffffWbffddbdfffdb]dddbbfdfcfffffSWbdd^bP^db]QWcccOSWbbbddb`ffW`fff^bbI^YYbbYYYOYY^^b^^bbddbbWW]bdbQMMW]cS]dW

First of all i subdivide the three necessary lines (line1, line2 and line4)

file=

"abcdefghijkl n

"abcdefghijkl"

list of numbers=[[1,2,3,4,5,6,7,8][1,2,3,4,5,6,7,8]]

with those lists i calculate means, medians, variance for some plotting.

The code works fine with smaller "test.files" but if I use the big one it overloads my PC.
Is there a way to compress the code?
Should I use less lists/dicts?

def main():

    parser = ArgumentParser(prog='fastq', description=desc)

    parser.add_argument('dateien', metavar='Files', nargs='*',

                    default=['sys.stdin'], help='Input of the file')

    parser.add_argument('-p', '--phred',dest='phred',nargs=  '?',action='store',

                    choices= [33 ,64 ], required=True, type=int ) 

    parser.add_argument('-c', '--cutoff',dest = "cutoff",nargs='?',

                    action='store', default=1, type=int,

                    help='deletes whole Reads with bad quality')

    parser.add_argument('-t', '--trim',dest = 'trim',nargs='?',

                    action='store', default=1, type=int,

                    help='Trimming Readends with bad quality')

    #parser.add_argument('-v','--version',action='version', version="{prog}: {version}".format(prog=args.prog, version='0.1'))



    args = parser.parse_args()



    names =  ; x=0 ;names1 = ""

    seq_dict={} ; mean_dict={} ; qual_dict={}

    qual_liste= ; i=0 ; sequenz= ; seq_name=

    mean1= ; summe=0

    quality_code=

    gc_dict={}

    qually=

    quality=

    with open('test.fastq', 'r') as seq:

      for line in seq:                

        quality_code=

        line=line.rstrip('n')   

        rest=i%4                   #modulo to focus on specific lines

        i+=1                        

        #print(rest)



        if rest==0:     #line including the sequenz name

            name=line[:-13]    #some embellishment

            name1=name.strip("@")

            name2=name1.strip("length=" or "length" or "lengl" or "leng")

            name3=name2.rstrip()

            seq_name.append(name3) #List which includes lists of the seq names

        elif rest == 1:  #Line including sequenz nucleotides

            seq=line[:-args.trim] #trimming the sequenz

            sequenz.append(seq)

        elif rest ==3:    #line including quality score for sequenz

            qual=line[:-args.trim]           

            if args.phred == 33:

                quality_code=[[(ord(ii)-33) for ii in i] for i in qual.split('n')]

            elif args.phred == 64:

                quality_code=[[(ord(ii)-64) for ii in i] for i in qual.split('n')]

            for list in quality_code:

                mean=sum(list)/len(qual)              

            mean1.append(int(mean))    



            if mean >= int(args.cutoff):

                seq_dict[name3]=seq

                mean_dict[name3]=int(mean)   #filter/cutoff of reads with worse quality

                qual_dict[name3]=quality_code

The whole code includes some more lines about plotting to show the results.

python performance python-3.x parsing bioinformatics

edited Feb 14 at 23:23

200_success

128k15149412

asked Feb 14 at 20:38

Julian

161

edited Feb 14 at 23:23

200_success

128k15149412

asked Feb 14 at 20:38

Julian

161

edited Feb 14 at 23:23

200_success

128k15149412

edited Feb 14 at 23:23

200_success

128k15149412

edited Feb 14 at 23:23

200_success

128k15149412

asked Feb 14 at 20:38

Julian

161

asked Feb 14 at 20:38

Julian

161

asked Feb 14 at 20:38

Julian

161

bumped to the homepage by Community♦ 14 hours ago

This question has answers that may be good or bad; the system has marked it active so that they can be reviewed.

bumped to the homepage by Community♦ 14 hours ago

This question has answers that may be good or bad; the system has marked it active so that they can be reviewed.

What do you use sequenz for? You are storing every second line of the file in a list; if the file has 1.5GB, that means using hundreds of MBs of memory even if Python stored them very efficiently. It's also not very CPU friendly, since the list is constantly having to be expanded.
– André Paramés
Feb 15 at 9:13

You might want to look into this project: github.com/JohnLonginotto/SeQC
– Austin Hastings
Mar 17 at 12:37

add a comment |

What do you use sequenz for? You are storing every second line of the file in a list; if the file has 1.5GB, that means using hundreds of MBs of memory even if Python stored them very efficiently. It's also not very CPU friendly, since the list is constantly having to be expanded.
– André Paramés
Feb 15 at 9:13

You might want to look into this project: github.com/JohnLonginotto/SeQC
– Austin Hastings
Mar 17 at 12:37

What do you use sequenz for? You are storing every second line of the file in a list; if the file has 1.5GB, that means using hundreds of MBs of memory even if Python stored them very efficiently. It's also not very CPU friendly, since the list is constantly having to be expanded.
– André Paramés
Feb 15 at 9:13

You might want to look into this project: github.com/JohnLonginotto/SeQC
– Austin Hastings
Mar 17 at 12:37

add a comment |

1 Answer
1

active

oldest

votes

up vote
0
down vote

This won't solve your problem, but here are some suggestions regarding the organization of the code:

You can use a function like this to fetch four lines at once, which I think makes your code easier to understand than using rest:
```
with open('test.fastq', 'r') as seq:

    for line1, line2, line3, line4 in group_it(4, seq):

        ...
```

Instead of doing if args.phred == 33: you can just use the variable in the calculation, avoiding the repetition:
```
quality_code=[[(ord(ii)-args.phred) for ii in i] for i in qual.split('n')]
```

answered Feb 15 at 10:20

André Paramés

14315

add a comment |

Your Answer

StackExchange.ifUsing("editor", function () {
return StackExchange.using("mathjaxEditing", function () {
StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["\$", "\$"]]);
});
});
}, "mathjax-editing");

StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "196"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f187589%2fprogram-to-process-fastq-files%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

1 Answer
1

active

oldest

votes

1 Answer
1

active

oldest

votes

up vote
0
down vote

This won't solve your problem, but here are some suggestions regarding the organization of the code:

You can use a function like this to fetch four lines at once, which I think makes your code easier to understand than using rest:
```
with open('test.fastq', 'r') as seq:

    for line1, line2, line3, line4 in group_it(4, seq):

        ...
```

Instead of doing if args.phred == 33: you can just use the variable in the calculation, avoiding the repetition:
```
quality_code=[[(ord(ii)-args.phred) for ii in i] for i in qual.split('n')]
```

answered Feb 15 at 10:20

André Paramés

14315

add a comment |

up vote
0
down vote

This won't solve your problem, but here are some suggestions regarding the organization of the code:

You can use a function like this to fetch four lines at once, which I think makes your code easier to understand than using rest:
```
with open('test.fastq', 'r') as seq:

    for line1, line2, line3, line4 in group_it(4, seq):

        ...
```

Instead of doing if args.phred == 33: you can just use the variable in the calculation, avoiding the repetition:
```
quality_code=[[(ord(ii)-args.phred) for ii in i] for i in qual.split('n')]
```

answered Feb 15 at 10:20

André Paramés

14315

add a comment |

up vote
0
down vote

This won't solve your problem, but here are some suggestions regarding the organization of the code:

You can use a function like this to fetch four lines at once, which I think makes your code easier to understand than using rest:
```
with open('test.fastq', 'r') as seq:

    for line1, line2, line3, line4 in group_it(4, seq):

        ...
```

Instead of doing if args.phred == 33: you can just use the variable in the calculation, avoiding the repetition:
```
quality_code=[[(ord(ii)-args.phred) for ii in i] for i in qual.split('n')]
```

answered Feb 15 at 10:20

André Paramés

14315

This won't solve your problem, but here are some suggestions regarding the organization of the code:

You can use a function like this to fetch four lines at once, which I think makes your code easier to understand than using rest:
```
with open('test.fastq', 'r') as seq:

    for line1, line2, line3, line4 in group_it(4, seq):

        ...
```

Instead of doing if args.phred == 33: you can just use the variable in the calculation, avoiding the repetition:
```
quality_code=[[(ord(ii)-args.phred) for ii in i] for i in qual.split('n')]
```

answered Feb 15 at 10:20

André Paramés

14315

answered Feb 15 at 10:20

André Paramés

14315

answered Feb 15 at 10:20

André Paramés

14315

answered Feb 15 at 10:20

André Paramés

14315

add a comment |

draft saved

draft discarded

Thanks for contributing an answer to Code Review Stack Exchange!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

Use MathJax to format equations. MathJax reference.

To learn more, see our tips on writing great answers.

Some of your past answers have not been well-received, and you're in danger of being blocked from answering.

Please pay close attention to the following guidance:

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

V6oIs kxSOL,0JSn x8kVoE9TG,P1RX8y,YVQ,Fr9fYAGcYNANdKGpeN04Lw 2Il

搜尋此網誌

Krdytkyu