Extracting all reads from bam file which match read IDs in another file
I have a long list of read IDs of interest to me in a file called read_names.txt. it is simply in the format:
m54197_180831_211346/4981510/ccs
m54197_180831_211346/6226723/ccs
m54197_180831_211346/6619607/ccs
...
etc where these are the actual read names from a fastq file. I am then trying to find those in a bam file for which the fastq has been mapped, and I can accomplish this like:
for ID in `cat read_names.txt`
do
samtools view inbam.bam | grep $ID >> read_locs.sam
done
However, this method is obscenely slow because it is rerunning samtools view for every ID iteration (several hours now for 600 read IDs), and I was hoping to do this for several read_names.txt
files.
I tried sort of flipping the script a bit and running samtools view first but it only returned the first read ID present in the file and stopped:
samtools view inbam.bam | for ID in `cat read_names.txt`; do grep $ID >> read_locs.txt; done
Am I missing something or is this the only way to accomplish this task?
bam samtools reads
add a comment |
I have a long list of read IDs of interest to me in a file called read_names.txt. it is simply in the format:
m54197_180831_211346/4981510/ccs
m54197_180831_211346/6226723/ccs
m54197_180831_211346/6619607/ccs
...
etc where these are the actual read names from a fastq file. I am then trying to find those in a bam file for which the fastq has been mapped, and I can accomplish this like:
for ID in `cat read_names.txt`
do
samtools view inbam.bam | grep $ID >> read_locs.sam
done
However, this method is obscenely slow because it is rerunning samtools view for every ID iteration (several hours now for 600 read IDs), and I was hoping to do this for several read_names.txt
files.
I tried sort of flipping the script a bit and running samtools view first but it only returned the first read ID present in the file and stopped:
samtools view inbam.bam | for ID in `cat read_names.txt`; do grep $ID >> read_locs.txt; done
Am I missing something or is this the only way to accomplish this task?
bam samtools reads
1
Just as a general rule never dofor var in $(cat file)
. Also known as bash pitfall number 1.
– terdon♦
Dec 6 at 21:45
thanks terdon, I will keep this in mind and find the better ways to handle this in the future. I use this method quite often so that is great for me to know!
– d_kennetz
Dec 6 at 21:51
Yeah, most of us use it very often until it bites us! :)
– terdon♦
Dec 6 at 22:03
add a comment |
I have a long list of read IDs of interest to me in a file called read_names.txt. it is simply in the format:
m54197_180831_211346/4981510/ccs
m54197_180831_211346/6226723/ccs
m54197_180831_211346/6619607/ccs
...
etc where these are the actual read names from a fastq file. I am then trying to find those in a bam file for which the fastq has been mapped, and I can accomplish this like:
for ID in `cat read_names.txt`
do
samtools view inbam.bam | grep $ID >> read_locs.sam
done
However, this method is obscenely slow because it is rerunning samtools view for every ID iteration (several hours now for 600 read IDs), and I was hoping to do this for several read_names.txt
files.
I tried sort of flipping the script a bit and running samtools view first but it only returned the first read ID present in the file and stopped:
samtools view inbam.bam | for ID in `cat read_names.txt`; do grep $ID >> read_locs.txt; done
Am I missing something or is this the only way to accomplish this task?
bam samtools reads
I have a long list of read IDs of interest to me in a file called read_names.txt. it is simply in the format:
m54197_180831_211346/4981510/ccs
m54197_180831_211346/6226723/ccs
m54197_180831_211346/6619607/ccs
...
etc where these are the actual read names from a fastq file. I am then trying to find those in a bam file for which the fastq has been mapped, and I can accomplish this like:
for ID in `cat read_names.txt`
do
samtools view inbam.bam | grep $ID >> read_locs.sam
done
However, this method is obscenely slow because it is rerunning samtools view for every ID iteration (several hours now for 600 read IDs), and I was hoping to do this for several read_names.txt
files.
I tried sort of flipping the script a bit and running samtools view first but it only returned the first read ID present in the file and stopped:
samtools view inbam.bam | for ID in `cat read_names.txt`; do grep $ID >> read_locs.txt; done
Am I missing something or is this the only way to accomplish this task?
bam samtools reads
bam samtools reads
asked Dec 6 at 20:01
d_kennetz
1929
1929
1
Just as a general rule never dofor var in $(cat file)
. Also known as bash pitfall number 1.
– terdon♦
Dec 6 at 21:45
thanks terdon, I will keep this in mind and find the better ways to handle this in the future. I use this method quite often so that is great for me to know!
– d_kennetz
Dec 6 at 21:51
Yeah, most of us use it very often until it bites us! :)
– terdon♦
Dec 6 at 22:03
add a comment |
1
Just as a general rule never dofor var in $(cat file)
. Also known as bash pitfall number 1.
– terdon♦
Dec 6 at 21:45
thanks terdon, I will keep this in mind and find the better ways to handle this in the future. I use this method quite often so that is great for me to know!
– d_kennetz
Dec 6 at 21:51
Yeah, most of us use it very often until it bites us! :)
– terdon♦
Dec 6 at 22:03
1
1
Just as a general rule never do
for var in $(cat file)
. Also known as bash pitfall number 1.– terdon♦
Dec 6 at 21:45
Just as a general rule never do
for var in $(cat file)
. Also known as bash pitfall number 1.– terdon♦
Dec 6 at 21:45
thanks terdon, I will keep this in mind and find the better ways to handle this in the future. I use this method quite often so that is great for me to know!
– d_kennetz
Dec 6 at 21:51
thanks terdon, I will keep this in mind and find the better ways to handle this in the future. I use this method quite often so that is great for me to know!
– d_kennetz
Dec 6 at 21:51
Yeah, most of us use it very often until it bites us! :)
– terdon♦
Dec 6 at 22:03
Yeah, most of us use it very often until it bites us! :)
– terdon♦
Dec 6 at 22:03
add a comment |
2 Answers
2
active
oldest
votes
It is still slow but grep has a -f
option to take in a file
samtools view inbam.bam | grep -f read_names.txt > read_locs.txt
This took 30 seconds as opposed to several hours, so I'd say the overhead was definitely in samtools view. I feel silly because of how simple this was. Thank you!
– d_kennetz
Dec 6 at 20:18
just as a warning this can take much longer if the read name file is big.
– Bioathlete
Dec 6 at 20:24
I appreciate the heads up! Comparatively, I do not think it will ever take longer than the method I was using so it is still a great answer. (This is serving as a caveat to your warning).
– d_kennetz
Dec 6 at 20:26
add a comment |
The reason your approach is slow is because you need to process the entire bam file from start to finish for every ID you are searching for. Here are some other solutions and a time comparison:
$ time bash -c 'for id in `cat names`; do
samtools view foo.bam | grep "$id" >> out;
done'
real 0m14.310s
user 0m14.170s
sys 0m2.540s
A very simple improvement, stop reading the file when you find the first match, will already speed things up a little:
$ time bash -c 'for id in `cat names`; do
samtools view foo.bam | grep -m1 "$id" >> out;
done'
real 0m14.091s
user 0m13.927s
sys 0m2.506s
So will a safer shell loop (you should never do for var in `command`; do
) although, in any case, using shell loops for text parsing is inherently a bad idea):
$ time bash -c "while read -r id; do
samtools view foo.bam | grep -m1 '$id';
done < names > out"
real 0m0.302s
user 0m0.355s
sys 0m0.126s
I don't quite understand why that one is so much faster, but presumably the shel is doing some sort of optimization.
Now, Bioathlete's answer:
$ time bash -c "samtools view foo.bam | grep -f names > out"
real 0m0.176s
user 0m0.210s
sys 0m0.015s
Or, on a larger (336M) file so the times are worth measuring:
$ time bash -c "samtools view bar.bam | grep -F names > out"
real 0m11.137s
user 0m10.653s
sys 0m1.972s
And we can probably improve on that slightly if we add -w
so it will only match entire words (which you want anyway since one read's name might be a substring of another's) and -F
so that grep
doesn't try to interpret the name as regular expressions:
$ time bash -c "samtools view bar.bam | grep -m 1 -wFf names > out"
real 0m4.972s
user 0m6.527s
sys 0m0.595s
oh good call on the-w
forgot about that
– Bioathlete
Dec 6 at 22:43
@Bioathlete it's the-m
that really speeds it up though. But yeah, the-w
is needed for safety.
– terdon♦
Dec 6 at 22:49
add a comment |
Your Answer
StackExchange.ifUsing("editor", function () {
return StackExchange.using("mathjaxEditing", function () {
StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\\(","\\)"]]);
});
});
}, "mathjax-editing");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "676"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fbioinformatics.stackexchange.com%2fquestions%2f5600%2fextracting-all-reads-from-bam-file-which-match-read-ids-in-another-file%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
2 Answers
2
active
oldest
votes
2 Answers
2
active
oldest
votes
active
oldest
votes
active
oldest
votes
It is still slow but grep has a -f
option to take in a file
samtools view inbam.bam | grep -f read_names.txt > read_locs.txt
This took 30 seconds as opposed to several hours, so I'd say the overhead was definitely in samtools view. I feel silly because of how simple this was. Thank you!
– d_kennetz
Dec 6 at 20:18
just as a warning this can take much longer if the read name file is big.
– Bioathlete
Dec 6 at 20:24
I appreciate the heads up! Comparatively, I do not think it will ever take longer than the method I was using so it is still a great answer. (This is serving as a caveat to your warning).
– d_kennetz
Dec 6 at 20:26
add a comment |
It is still slow but grep has a -f
option to take in a file
samtools view inbam.bam | grep -f read_names.txt > read_locs.txt
This took 30 seconds as opposed to several hours, so I'd say the overhead was definitely in samtools view. I feel silly because of how simple this was. Thank you!
– d_kennetz
Dec 6 at 20:18
just as a warning this can take much longer if the read name file is big.
– Bioathlete
Dec 6 at 20:24
I appreciate the heads up! Comparatively, I do not think it will ever take longer than the method I was using so it is still a great answer. (This is serving as a caveat to your warning).
– d_kennetz
Dec 6 at 20:26
add a comment |
It is still slow but grep has a -f
option to take in a file
samtools view inbam.bam | grep -f read_names.txt > read_locs.txt
It is still slow but grep has a -f
option to take in a file
samtools view inbam.bam | grep -f read_names.txt > read_locs.txt
answered Dec 6 at 20:04
Bioathlete
1,703215
1,703215
This took 30 seconds as opposed to several hours, so I'd say the overhead was definitely in samtools view. I feel silly because of how simple this was. Thank you!
– d_kennetz
Dec 6 at 20:18
just as a warning this can take much longer if the read name file is big.
– Bioathlete
Dec 6 at 20:24
I appreciate the heads up! Comparatively, I do not think it will ever take longer than the method I was using so it is still a great answer. (This is serving as a caveat to your warning).
– d_kennetz
Dec 6 at 20:26
add a comment |
This took 30 seconds as opposed to several hours, so I'd say the overhead was definitely in samtools view. I feel silly because of how simple this was. Thank you!
– d_kennetz
Dec 6 at 20:18
just as a warning this can take much longer if the read name file is big.
– Bioathlete
Dec 6 at 20:24
I appreciate the heads up! Comparatively, I do not think it will ever take longer than the method I was using so it is still a great answer. (This is serving as a caveat to your warning).
– d_kennetz
Dec 6 at 20:26
This took 30 seconds as opposed to several hours, so I'd say the overhead was definitely in samtools view. I feel silly because of how simple this was. Thank you!
– d_kennetz
Dec 6 at 20:18
This took 30 seconds as opposed to several hours, so I'd say the overhead was definitely in samtools view. I feel silly because of how simple this was. Thank you!
– d_kennetz
Dec 6 at 20:18
just as a warning this can take much longer if the read name file is big.
– Bioathlete
Dec 6 at 20:24
just as a warning this can take much longer if the read name file is big.
– Bioathlete
Dec 6 at 20:24
I appreciate the heads up! Comparatively, I do not think it will ever take longer than the method I was using so it is still a great answer. (This is serving as a caveat to your warning).
– d_kennetz
Dec 6 at 20:26
I appreciate the heads up! Comparatively, I do not think it will ever take longer than the method I was using so it is still a great answer. (This is serving as a caveat to your warning).
– d_kennetz
Dec 6 at 20:26
add a comment |
The reason your approach is slow is because you need to process the entire bam file from start to finish for every ID you are searching for. Here are some other solutions and a time comparison:
$ time bash -c 'for id in `cat names`; do
samtools view foo.bam | grep "$id" >> out;
done'
real 0m14.310s
user 0m14.170s
sys 0m2.540s
A very simple improvement, stop reading the file when you find the first match, will already speed things up a little:
$ time bash -c 'for id in `cat names`; do
samtools view foo.bam | grep -m1 "$id" >> out;
done'
real 0m14.091s
user 0m13.927s
sys 0m2.506s
So will a safer shell loop (you should never do for var in `command`; do
) although, in any case, using shell loops for text parsing is inherently a bad idea):
$ time bash -c "while read -r id; do
samtools view foo.bam | grep -m1 '$id';
done < names > out"
real 0m0.302s
user 0m0.355s
sys 0m0.126s
I don't quite understand why that one is so much faster, but presumably the shel is doing some sort of optimization.
Now, Bioathlete's answer:
$ time bash -c "samtools view foo.bam | grep -f names > out"
real 0m0.176s
user 0m0.210s
sys 0m0.015s
Or, on a larger (336M) file so the times are worth measuring:
$ time bash -c "samtools view bar.bam | grep -F names > out"
real 0m11.137s
user 0m10.653s
sys 0m1.972s
And we can probably improve on that slightly if we add -w
so it will only match entire words (which you want anyway since one read's name might be a substring of another's) and -F
so that grep
doesn't try to interpret the name as regular expressions:
$ time bash -c "samtools view bar.bam | grep -m 1 -wFf names > out"
real 0m4.972s
user 0m6.527s
sys 0m0.595s
oh good call on the-w
forgot about that
– Bioathlete
Dec 6 at 22:43
@Bioathlete it's the-m
that really speeds it up though. But yeah, the-w
is needed for safety.
– terdon♦
Dec 6 at 22:49
add a comment |
The reason your approach is slow is because you need to process the entire bam file from start to finish for every ID you are searching for. Here are some other solutions and a time comparison:
$ time bash -c 'for id in `cat names`; do
samtools view foo.bam | grep "$id" >> out;
done'
real 0m14.310s
user 0m14.170s
sys 0m2.540s
A very simple improvement, stop reading the file when you find the first match, will already speed things up a little:
$ time bash -c 'for id in `cat names`; do
samtools view foo.bam | grep -m1 "$id" >> out;
done'
real 0m14.091s
user 0m13.927s
sys 0m2.506s
So will a safer shell loop (you should never do for var in `command`; do
) although, in any case, using shell loops for text parsing is inherently a bad idea):
$ time bash -c "while read -r id; do
samtools view foo.bam | grep -m1 '$id';
done < names > out"
real 0m0.302s
user 0m0.355s
sys 0m0.126s
I don't quite understand why that one is so much faster, but presumably the shel is doing some sort of optimization.
Now, Bioathlete's answer:
$ time bash -c "samtools view foo.bam | grep -f names > out"
real 0m0.176s
user 0m0.210s
sys 0m0.015s
Or, on a larger (336M) file so the times are worth measuring:
$ time bash -c "samtools view bar.bam | grep -F names > out"
real 0m11.137s
user 0m10.653s
sys 0m1.972s
And we can probably improve on that slightly if we add -w
so it will only match entire words (which you want anyway since one read's name might be a substring of another's) and -F
so that grep
doesn't try to interpret the name as regular expressions:
$ time bash -c "samtools view bar.bam | grep -m 1 -wFf names > out"
real 0m4.972s
user 0m6.527s
sys 0m0.595s
oh good call on the-w
forgot about that
– Bioathlete
Dec 6 at 22:43
@Bioathlete it's the-m
that really speeds it up though. But yeah, the-w
is needed for safety.
– terdon♦
Dec 6 at 22:49
add a comment |
The reason your approach is slow is because you need to process the entire bam file from start to finish for every ID you are searching for. Here are some other solutions and a time comparison:
$ time bash -c 'for id in `cat names`; do
samtools view foo.bam | grep "$id" >> out;
done'
real 0m14.310s
user 0m14.170s
sys 0m2.540s
A very simple improvement, stop reading the file when you find the first match, will already speed things up a little:
$ time bash -c 'for id in `cat names`; do
samtools view foo.bam | grep -m1 "$id" >> out;
done'
real 0m14.091s
user 0m13.927s
sys 0m2.506s
So will a safer shell loop (you should never do for var in `command`; do
) although, in any case, using shell loops for text parsing is inherently a bad idea):
$ time bash -c "while read -r id; do
samtools view foo.bam | grep -m1 '$id';
done < names > out"
real 0m0.302s
user 0m0.355s
sys 0m0.126s
I don't quite understand why that one is so much faster, but presumably the shel is doing some sort of optimization.
Now, Bioathlete's answer:
$ time bash -c "samtools view foo.bam | grep -f names > out"
real 0m0.176s
user 0m0.210s
sys 0m0.015s
Or, on a larger (336M) file so the times are worth measuring:
$ time bash -c "samtools view bar.bam | grep -F names > out"
real 0m11.137s
user 0m10.653s
sys 0m1.972s
And we can probably improve on that slightly if we add -w
so it will only match entire words (which you want anyway since one read's name might be a substring of another's) and -F
so that grep
doesn't try to interpret the name as regular expressions:
$ time bash -c "samtools view bar.bam | grep -m 1 -wFf names > out"
real 0m4.972s
user 0m6.527s
sys 0m0.595s
The reason your approach is slow is because you need to process the entire bam file from start to finish for every ID you are searching for. Here are some other solutions and a time comparison:
$ time bash -c 'for id in `cat names`; do
samtools view foo.bam | grep "$id" >> out;
done'
real 0m14.310s
user 0m14.170s
sys 0m2.540s
A very simple improvement, stop reading the file when you find the first match, will already speed things up a little:
$ time bash -c 'for id in `cat names`; do
samtools view foo.bam | grep -m1 "$id" >> out;
done'
real 0m14.091s
user 0m13.927s
sys 0m2.506s
So will a safer shell loop (you should never do for var in `command`; do
) although, in any case, using shell loops for text parsing is inherently a bad idea):
$ time bash -c "while read -r id; do
samtools view foo.bam | grep -m1 '$id';
done < names > out"
real 0m0.302s
user 0m0.355s
sys 0m0.126s
I don't quite understand why that one is so much faster, but presumably the shel is doing some sort of optimization.
Now, Bioathlete's answer:
$ time bash -c "samtools view foo.bam | grep -f names > out"
real 0m0.176s
user 0m0.210s
sys 0m0.015s
Or, on a larger (336M) file so the times are worth measuring:
$ time bash -c "samtools view bar.bam | grep -F names > out"
real 0m11.137s
user 0m10.653s
sys 0m1.972s
And we can probably improve on that slightly if we add -w
so it will only match entire words (which you want anyway since one read's name might be a substring of another's) and -F
so that grep
doesn't try to interpret the name as regular expressions:
$ time bash -c "samtools view bar.bam | grep -m 1 -wFf names > out"
real 0m4.972s
user 0m6.527s
sys 0m0.595s
edited Dec 6 at 22:38
answered Dec 6 at 22:20
terdon♦
3,9161726
3,9161726
oh good call on the-w
forgot about that
– Bioathlete
Dec 6 at 22:43
@Bioathlete it's the-m
that really speeds it up though. But yeah, the-w
is needed for safety.
– terdon♦
Dec 6 at 22:49
add a comment |
oh good call on the-w
forgot about that
– Bioathlete
Dec 6 at 22:43
@Bioathlete it's the-m
that really speeds it up though. But yeah, the-w
is needed for safety.
– terdon♦
Dec 6 at 22:49
oh good call on the
-w
forgot about that– Bioathlete
Dec 6 at 22:43
oh good call on the
-w
forgot about that– Bioathlete
Dec 6 at 22:43
@Bioathlete it's the
-m
that really speeds it up though. But yeah, the -w
is needed for safety.– terdon♦
Dec 6 at 22:49
@Bioathlete it's the
-m
that really speeds it up though. But yeah, the -w
is needed for safety.– terdon♦
Dec 6 at 22:49
add a comment |
Thanks for contributing an answer to Bioinformatics Stack Exchange!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
Use MathJax to format equations. MathJax reference.
To learn more, see our tips on writing great answers.
Some of your past answers have not been well-received, and you're in danger of being blocked from answering.
Please pay close attention to the following guidance:
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fbioinformatics.stackexchange.com%2fquestions%2f5600%2fextracting-all-reads-from-bam-file-which-match-read-ids-in-another-file%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
1
Just as a general rule never do
for var in $(cat file)
. Also known as bash pitfall number 1.– terdon♦
Dec 6 at 21:45
thanks terdon, I will keep this in mind and find the better ways to handle this in the future. I use this method quite often so that is great for me to know!
– d_kennetz
Dec 6 at 21:51
Yeah, most of us use it very often until it bites us! :)
– terdon♦
Dec 6 at 22:03