Extracting all reads from bam file which match read IDs in another file












2














I have a long list of read IDs of interest to me in a file called read_names.txt. it is simply in the format:



m54197_180831_211346/4981510/ccs
m54197_180831_211346/6226723/ccs
m54197_180831_211346/6619607/ccs
...


etc where these are the actual read names from a fastq file. I am then trying to find those in a bam file for which the fastq has been mapped, and I can accomplish this like:



for ID in `cat read_names.txt`
do
samtools view inbam.bam | grep $ID >> read_locs.sam
done


However, this method is obscenely slow because it is rerunning samtools view for every ID iteration (several hours now for 600 read IDs), and I was hoping to do this for several read_names.txt files.



I tried sort of flipping the script a bit and running samtools view first but it only returned the first read ID present in the file and stopped:



samtools view inbam.bam | for ID in `cat read_names.txt`; do grep $ID >> read_locs.txt; done


Am I missing something or is this the only way to accomplish this task?










share|improve this question


















  • 1




    Just as a general rule never do for var in $(cat file). Also known as bash pitfall number 1.
    – terdon
    Dec 6 at 21:45










  • thanks terdon, I will keep this in mind and find the better ways to handle this in the future. I use this method quite often so that is great for me to know!
    – d_kennetz
    Dec 6 at 21:51










  • Yeah, most of us use it very often until it bites us! :)
    – terdon
    Dec 6 at 22:03
















2














I have a long list of read IDs of interest to me in a file called read_names.txt. it is simply in the format:



m54197_180831_211346/4981510/ccs
m54197_180831_211346/6226723/ccs
m54197_180831_211346/6619607/ccs
...


etc where these are the actual read names from a fastq file. I am then trying to find those in a bam file for which the fastq has been mapped, and I can accomplish this like:



for ID in `cat read_names.txt`
do
samtools view inbam.bam | grep $ID >> read_locs.sam
done


However, this method is obscenely slow because it is rerunning samtools view for every ID iteration (several hours now for 600 read IDs), and I was hoping to do this for several read_names.txt files.



I tried sort of flipping the script a bit and running samtools view first but it only returned the first read ID present in the file and stopped:



samtools view inbam.bam | for ID in `cat read_names.txt`; do grep $ID >> read_locs.txt; done


Am I missing something or is this the only way to accomplish this task?










share|improve this question


















  • 1




    Just as a general rule never do for var in $(cat file). Also known as bash pitfall number 1.
    – terdon
    Dec 6 at 21:45










  • thanks terdon, I will keep this in mind and find the better ways to handle this in the future. I use this method quite often so that is great for me to know!
    – d_kennetz
    Dec 6 at 21:51










  • Yeah, most of us use it very often until it bites us! :)
    – terdon
    Dec 6 at 22:03














2












2








2







I have a long list of read IDs of interest to me in a file called read_names.txt. it is simply in the format:



m54197_180831_211346/4981510/ccs
m54197_180831_211346/6226723/ccs
m54197_180831_211346/6619607/ccs
...


etc where these are the actual read names from a fastq file. I am then trying to find those in a bam file for which the fastq has been mapped, and I can accomplish this like:



for ID in `cat read_names.txt`
do
samtools view inbam.bam | grep $ID >> read_locs.sam
done


However, this method is obscenely slow because it is rerunning samtools view for every ID iteration (several hours now for 600 read IDs), and I was hoping to do this for several read_names.txt files.



I tried sort of flipping the script a bit and running samtools view first but it only returned the first read ID present in the file and stopped:



samtools view inbam.bam | for ID in `cat read_names.txt`; do grep $ID >> read_locs.txt; done


Am I missing something or is this the only way to accomplish this task?










share|improve this question













I have a long list of read IDs of interest to me in a file called read_names.txt. it is simply in the format:



m54197_180831_211346/4981510/ccs
m54197_180831_211346/6226723/ccs
m54197_180831_211346/6619607/ccs
...


etc where these are the actual read names from a fastq file. I am then trying to find those in a bam file for which the fastq has been mapped, and I can accomplish this like:



for ID in `cat read_names.txt`
do
samtools view inbam.bam | grep $ID >> read_locs.sam
done


However, this method is obscenely slow because it is rerunning samtools view for every ID iteration (several hours now for 600 read IDs), and I was hoping to do this for several read_names.txt files.



I tried sort of flipping the script a bit and running samtools view first but it only returned the first read ID present in the file and stopped:



samtools view inbam.bam | for ID in `cat read_names.txt`; do grep $ID >> read_locs.txt; done


Am I missing something or is this the only way to accomplish this task?







bam samtools reads






share|improve this question













share|improve this question











share|improve this question




share|improve this question










asked Dec 6 at 20:01









d_kennetz

1929




1929








  • 1




    Just as a general rule never do for var in $(cat file). Also known as bash pitfall number 1.
    – terdon
    Dec 6 at 21:45










  • thanks terdon, I will keep this in mind and find the better ways to handle this in the future. I use this method quite often so that is great for me to know!
    – d_kennetz
    Dec 6 at 21:51










  • Yeah, most of us use it very often until it bites us! :)
    – terdon
    Dec 6 at 22:03














  • 1




    Just as a general rule never do for var in $(cat file). Also known as bash pitfall number 1.
    – terdon
    Dec 6 at 21:45










  • thanks terdon, I will keep this in mind and find the better ways to handle this in the future. I use this method quite often so that is great for me to know!
    – d_kennetz
    Dec 6 at 21:51










  • Yeah, most of us use it very often until it bites us! :)
    – terdon
    Dec 6 at 22:03








1




1




Just as a general rule never do for var in $(cat file). Also known as bash pitfall number 1.
– terdon
Dec 6 at 21:45




Just as a general rule never do for var in $(cat file). Also known as bash pitfall number 1.
– terdon
Dec 6 at 21:45












thanks terdon, I will keep this in mind and find the better ways to handle this in the future. I use this method quite often so that is great for me to know!
– d_kennetz
Dec 6 at 21:51




thanks terdon, I will keep this in mind and find the better ways to handle this in the future. I use this method quite often so that is great for me to know!
– d_kennetz
Dec 6 at 21:51












Yeah, most of us use it very often until it bites us! :)
– terdon
Dec 6 at 22:03




Yeah, most of us use it very often until it bites us! :)
– terdon
Dec 6 at 22:03










2 Answers
2






active

oldest

votes


















5














It is still slow but grep has a -f option to take in a file



samtools view inbam.bam | grep -f read_names.txt > read_locs.txt






share|improve this answer





















  • This took 30 seconds as opposed to several hours, so I'd say the overhead was definitely in samtools view. I feel silly because of how simple this was. Thank you!
    – d_kennetz
    Dec 6 at 20:18










  • just as a warning this can take much longer if the read name file is big.
    – Bioathlete
    Dec 6 at 20:24










  • I appreciate the heads up! Comparatively, I do not think it will ever take longer than the method I was using so it is still a great answer. (This is serving as a caveat to your warning).
    – d_kennetz
    Dec 6 at 20:26



















4














The reason your approach is slow is because you need to process the entire bam file from start to finish for every ID you are searching for. Here are some other solutions and a time comparison:



$ time bash -c 'for id in `cat names`; do 
samtools view foo.bam | grep "$
id" >> out;
done'

real 0m14.310s
user 0m14.170s
sys 0m2.540s


A very simple improvement, stop reading the file when you find the first match, will already speed things up a little:



$ time bash -c 'for id in `cat names`; do 
samtools view foo.bam | grep -m1 "$
id" >> out;
done'

real 0m14.091s
user 0m13.927s
sys 0m2.506s


So will a safer shell loop (you should never do for var in `command`; do) although, in any case, using shell loops for text parsing is inherently a bad idea):



$ time bash -c "while read -r id; do 
samtools view foo.bam | grep -m1 '$
id';
done < names > out"

real 0m0.302s
user 0m0.355s
sys 0m0.126s


I don't quite understand why that one is so much faster, but presumably the shel is doing some sort of optimization.



Now, Bioathlete's answer:



$ time bash -c "samtools view foo.bam | grep -f names > out"

real 0m0.176s
user 0m0.210s
sys 0m0.015s




Or, on a larger (336M) file so the times are worth measuring:



$ time bash -c "samtools view bar.bam  | grep -F names > out"

real 0m11.137s
user 0m10.653s
sys 0m1.972s


And we can probably improve on that slightly if we add -w so it will only match entire words (which you want anyway since one read's name might be a substring of another's) and -F so that grep doesn't try to interpret the name as regular expressions:



$ time bash -c "samtools view bar.bam  | grep -m 1 -wFf names > out"

real 0m4.972s
user 0m6.527s
sys 0m0.595s





share|improve this answer























  • oh good call on the -w forgot about that
    – Bioathlete
    Dec 6 at 22:43










  • @Bioathlete it's the -m that really speeds it up though. But yeah, the -w is needed for safety.
    – terdon
    Dec 6 at 22:49











Your Answer





StackExchange.ifUsing("editor", function () {
return StackExchange.using("mathjaxEditing", function () {
StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\\(","\\)"]]);
});
});
}, "mathjax-editing");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "676"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});


}
});














draft saved

draft discarded


















StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fbioinformatics.stackexchange.com%2fquestions%2f5600%2fextracting-all-reads-from-bam-file-which-match-read-ids-in-another-file%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown

























2 Answers
2






active

oldest

votes








2 Answers
2






active

oldest

votes









active

oldest

votes






active

oldest

votes









5














It is still slow but grep has a -f option to take in a file



samtools view inbam.bam | grep -f read_names.txt > read_locs.txt






share|improve this answer





















  • This took 30 seconds as opposed to several hours, so I'd say the overhead was definitely in samtools view. I feel silly because of how simple this was. Thank you!
    – d_kennetz
    Dec 6 at 20:18










  • just as a warning this can take much longer if the read name file is big.
    – Bioathlete
    Dec 6 at 20:24










  • I appreciate the heads up! Comparatively, I do not think it will ever take longer than the method I was using so it is still a great answer. (This is serving as a caveat to your warning).
    – d_kennetz
    Dec 6 at 20:26
















5














It is still slow but grep has a -f option to take in a file



samtools view inbam.bam | grep -f read_names.txt > read_locs.txt






share|improve this answer





















  • This took 30 seconds as opposed to several hours, so I'd say the overhead was definitely in samtools view. I feel silly because of how simple this was. Thank you!
    – d_kennetz
    Dec 6 at 20:18










  • just as a warning this can take much longer if the read name file is big.
    – Bioathlete
    Dec 6 at 20:24










  • I appreciate the heads up! Comparatively, I do not think it will ever take longer than the method I was using so it is still a great answer. (This is serving as a caveat to your warning).
    – d_kennetz
    Dec 6 at 20:26














5












5








5






It is still slow but grep has a -f option to take in a file



samtools view inbam.bam | grep -f read_names.txt > read_locs.txt






share|improve this answer












It is still slow but grep has a -f option to take in a file



samtools view inbam.bam | grep -f read_names.txt > read_locs.txt







share|improve this answer












share|improve this answer



share|improve this answer










answered Dec 6 at 20:04









Bioathlete

1,703215




1,703215












  • This took 30 seconds as opposed to several hours, so I'd say the overhead was definitely in samtools view. I feel silly because of how simple this was. Thank you!
    – d_kennetz
    Dec 6 at 20:18










  • just as a warning this can take much longer if the read name file is big.
    – Bioathlete
    Dec 6 at 20:24










  • I appreciate the heads up! Comparatively, I do not think it will ever take longer than the method I was using so it is still a great answer. (This is serving as a caveat to your warning).
    – d_kennetz
    Dec 6 at 20:26


















  • This took 30 seconds as opposed to several hours, so I'd say the overhead was definitely in samtools view. I feel silly because of how simple this was. Thank you!
    – d_kennetz
    Dec 6 at 20:18










  • just as a warning this can take much longer if the read name file is big.
    – Bioathlete
    Dec 6 at 20:24










  • I appreciate the heads up! Comparatively, I do not think it will ever take longer than the method I was using so it is still a great answer. (This is serving as a caveat to your warning).
    – d_kennetz
    Dec 6 at 20:26
















This took 30 seconds as opposed to several hours, so I'd say the overhead was definitely in samtools view. I feel silly because of how simple this was. Thank you!
– d_kennetz
Dec 6 at 20:18




This took 30 seconds as opposed to several hours, so I'd say the overhead was definitely in samtools view. I feel silly because of how simple this was. Thank you!
– d_kennetz
Dec 6 at 20:18












just as a warning this can take much longer if the read name file is big.
– Bioathlete
Dec 6 at 20:24




just as a warning this can take much longer if the read name file is big.
– Bioathlete
Dec 6 at 20:24












I appreciate the heads up! Comparatively, I do not think it will ever take longer than the method I was using so it is still a great answer. (This is serving as a caveat to your warning).
– d_kennetz
Dec 6 at 20:26




I appreciate the heads up! Comparatively, I do not think it will ever take longer than the method I was using so it is still a great answer. (This is serving as a caveat to your warning).
– d_kennetz
Dec 6 at 20:26











4














The reason your approach is slow is because you need to process the entire bam file from start to finish for every ID you are searching for. Here are some other solutions and a time comparison:



$ time bash -c 'for id in `cat names`; do 
samtools view foo.bam | grep "$
id" >> out;
done'

real 0m14.310s
user 0m14.170s
sys 0m2.540s


A very simple improvement, stop reading the file when you find the first match, will already speed things up a little:



$ time bash -c 'for id in `cat names`; do 
samtools view foo.bam | grep -m1 "$
id" >> out;
done'

real 0m14.091s
user 0m13.927s
sys 0m2.506s


So will a safer shell loop (you should never do for var in `command`; do) although, in any case, using shell loops for text parsing is inherently a bad idea):



$ time bash -c "while read -r id; do 
samtools view foo.bam | grep -m1 '$
id';
done < names > out"

real 0m0.302s
user 0m0.355s
sys 0m0.126s


I don't quite understand why that one is so much faster, but presumably the shel is doing some sort of optimization.



Now, Bioathlete's answer:



$ time bash -c "samtools view foo.bam | grep -f names > out"

real 0m0.176s
user 0m0.210s
sys 0m0.015s




Or, on a larger (336M) file so the times are worth measuring:



$ time bash -c "samtools view bar.bam  | grep -F names > out"

real 0m11.137s
user 0m10.653s
sys 0m1.972s


And we can probably improve on that slightly if we add -w so it will only match entire words (which you want anyway since one read's name might be a substring of another's) and -F so that grep doesn't try to interpret the name as regular expressions:



$ time bash -c "samtools view bar.bam  | grep -m 1 -wFf names > out"

real 0m4.972s
user 0m6.527s
sys 0m0.595s





share|improve this answer























  • oh good call on the -w forgot about that
    – Bioathlete
    Dec 6 at 22:43










  • @Bioathlete it's the -m that really speeds it up though. But yeah, the -w is needed for safety.
    – terdon
    Dec 6 at 22:49
















4














The reason your approach is slow is because you need to process the entire bam file from start to finish for every ID you are searching for. Here are some other solutions and a time comparison:



$ time bash -c 'for id in `cat names`; do 
samtools view foo.bam | grep "$
id" >> out;
done'

real 0m14.310s
user 0m14.170s
sys 0m2.540s


A very simple improvement, stop reading the file when you find the first match, will already speed things up a little:



$ time bash -c 'for id in `cat names`; do 
samtools view foo.bam | grep -m1 "$
id" >> out;
done'

real 0m14.091s
user 0m13.927s
sys 0m2.506s


So will a safer shell loop (you should never do for var in `command`; do) although, in any case, using shell loops for text parsing is inherently a bad idea):



$ time bash -c "while read -r id; do 
samtools view foo.bam | grep -m1 '$
id';
done < names > out"

real 0m0.302s
user 0m0.355s
sys 0m0.126s


I don't quite understand why that one is so much faster, but presumably the shel is doing some sort of optimization.



Now, Bioathlete's answer:



$ time bash -c "samtools view foo.bam | grep -f names > out"

real 0m0.176s
user 0m0.210s
sys 0m0.015s




Or, on a larger (336M) file so the times are worth measuring:



$ time bash -c "samtools view bar.bam  | grep -F names > out"

real 0m11.137s
user 0m10.653s
sys 0m1.972s


And we can probably improve on that slightly if we add -w so it will only match entire words (which you want anyway since one read's name might be a substring of another's) and -F so that grep doesn't try to interpret the name as regular expressions:



$ time bash -c "samtools view bar.bam  | grep -m 1 -wFf names > out"

real 0m4.972s
user 0m6.527s
sys 0m0.595s





share|improve this answer























  • oh good call on the -w forgot about that
    – Bioathlete
    Dec 6 at 22:43










  • @Bioathlete it's the -m that really speeds it up though. But yeah, the -w is needed for safety.
    – terdon
    Dec 6 at 22:49














4












4








4






The reason your approach is slow is because you need to process the entire bam file from start to finish for every ID you are searching for. Here are some other solutions and a time comparison:



$ time bash -c 'for id in `cat names`; do 
samtools view foo.bam | grep "$
id" >> out;
done'

real 0m14.310s
user 0m14.170s
sys 0m2.540s


A very simple improvement, stop reading the file when you find the first match, will already speed things up a little:



$ time bash -c 'for id in `cat names`; do 
samtools view foo.bam | grep -m1 "$
id" >> out;
done'

real 0m14.091s
user 0m13.927s
sys 0m2.506s


So will a safer shell loop (you should never do for var in `command`; do) although, in any case, using shell loops for text parsing is inherently a bad idea):



$ time bash -c "while read -r id; do 
samtools view foo.bam | grep -m1 '$
id';
done < names > out"

real 0m0.302s
user 0m0.355s
sys 0m0.126s


I don't quite understand why that one is so much faster, but presumably the shel is doing some sort of optimization.



Now, Bioathlete's answer:



$ time bash -c "samtools view foo.bam | grep -f names > out"

real 0m0.176s
user 0m0.210s
sys 0m0.015s




Or, on a larger (336M) file so the times are worth measuring:



$ time bash -c "samtools view bar.bam  | grep -F names > out"

real 0m11.137s
user 0m10.653s
sys 0m1.972s


And we can probably improve on that slightly if we add -w so it will only match entire words (which you want anyway since one read's name might be a substring of another's) and -F so that grep doesn't try to interpret the name as regular expressions:



$ time bash -c "samtools view bar.bam  | grep -m 1 -wFf names > out"

real 0m4.972s
user 0m6.527s
sys 0m0.595s





share|improve this answer














The reason your approach is slow is because you need to process the entire bam file from start to finish for every ID you are searching for. Here are some other solutions and a time comparison:



$ time bash -c 'for id in `cat names`; do 
samtools view foo.bam | grep "$
id" >> out;
done'

real 0m14.310s
user 0m14.170s
sys 0m2.540s


A very simple improvement, stop reading the file when you find the first match, will already speed things up a little:



$ time bash -c 'for id in `cat names`; do 
samtools view foo.bam | grep -m1 "$
id" >> out;
done'

real 0m14.091s
user 0m13.927s
sys 0m2.506s


So will a safer shell loop (you should never do for var in `command`; do) although, in any case, using shell loops for text parsing is inherently a bad idea):



$ time bash -c "while read -r id; do 
samtools view foo.bam | grep -m1 '$
id';
done < names > out"

real 0m0.302s
user 0m0.355s
sys 0m0.126s


I don't quite understand why that one is so much faster, but presumably the shel is doing some sort of optimization.



Now, Bioathlete's answer:



$ time bash -c "samtools view foo.bam | grep -f names > out"

real 0m0.176s
user 0m0.210s
sys 0m0.015s




Or, on a larger (336M) file so the times are worth measuring:



$ time bash -c "samtools view bar.bam  | grep -F names > out"

real 0m11.137s
user 0m10.653s
sys 0m1.972s


And we can probably improve on that slightly if we add -w so it will only match entire words (which you want anyway since one read's name might be a substring of another's) and -F so that grep doesn't try to interpret the name as regular expressions:



$ time bash -c "samtools view bar.bam  | grep -m 1 -wFf names > out"

real 0m4.972s
user 0m6.527s
sys 0m0.595s






share|improve this answer














share|improve this answer



share|improve this answer








edited Dec 6 at 22:38

























answered Dec 6 at 22:20









terdon

3,9161726




3,9161726












  • oh good call on the -w forgot about that
    – Bioathlete
    Dec 6 at 22:43










  • @Bioathlete it's the -m that really speeds it up though. But yeah, the -w is needed for safety.
    – terdon
    Dec 6 at 22:49


















  • oh good call on the -w forgot about that
    – Bioathlete
    Dec 6 at 22:43










  • @Bioathlete it's the -m that really speeds it up though. But yeah, the -w is needed for safety.
    – terdon
    Dec 6 at 22:49
















oh good call on the -w forgot about that
– Bioathlete
Dec 6 at 22:43




oh good call on the -w forgot about that
– Bioathlete
Dec 6 at 22:43












@Bioathlete it's the -m that really speeds it up though. But yeah, the -w is needed for safety.
– terdon
Dec 6 at 22:49




@Bioathlete it's the -m that really speeds it up though. But yeah, the -w is needed for safety.
– terdon
Dec 6 at 22:49


















draft saved

draft discarded




















































Thanks for contributing an answer to Bioinformatics Stack Exchange!


  • Please be sure to answer the question. Provide details and share your research!

But avoid



  • Asking for help, clarification, or responding to other answers.

  • Making statements based on opinion; back them up with references or personal experience.


Use MathJax to format equations. MathJax reference.


To learn more, see our tips on writing great answers.





Some of your past answers have not been well-received, and you're in danger of being blocked from answering.


Please pay close attention to the following guidance:


  • Please be sure to answer the question. Provide details and share your research!

But avoid



  • Asking for help, clarification, or responding to other answers.

  • Making statements based on opinion; back them up with references or personal experience.


To learn more, see our tips on writing great answers.




draft saved


draft discarded














StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fbioinformatics.stackexchange.com%2fquestions%2f5600%2fextracting-all-reads-from-bam-file-which-match-read-ids-in-another-file%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown





















































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown

































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown







Popular posts from this blog

Ellipse (mathématiques)

Quarter-circle Tiles

Mont Emei