Creating a csv file using scrapy
I've created a Python script with Scrapy that parses movie names and their release years from multiple pages of a torrent site. My goal is to write the parsed data to a CSV file without using Scrapy's built-in export command, because when I run

    scrapy crawl torrentdata -o outputfile.csv -t csv

I get a blank line in every alternate row of the CSV file. So I took a slightly different route to achieve the same thing, and I now get a correctly formatted, data-laden CSV file when I run the following script. Most importantly, I opened the CSV file in a with statement so that it is closed automatically once the writing is done, and I used CrawlerProcess to execute the script from within an IDE.

My question: isn't it a better idea to follow the approach I've tried below?
This is the working script:
    import scrapy
    from scrapy.crawler import CrawlerProcess
    import csv

    class TorrentSpider(scrapy.Spider):
        name = "torrentdata"
        start_urls = ["https://yts.am/browse-movies?page={}".format(page) for page in range(2, 20)]
        itemlist = []

        def parse(self, response):
            for record in response.css('.browse-movie-bottom'):
                items = {}
                items["Name"] = record.css('.browse-movie-title::text').extract_first(default='')
                items["Year"] = record.css('.browse-movie-year::text').extract_first(default='')
                self.itemlist.append(items)

            with open("outputfile.csv", "w", newline="") as f:
                writer = csv.DictWriter(f, ['Name', 'Year'])
                writer.writeheader()
                for data in self.itemlist:
                    writer.writerow(data)

    c = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0',
    })
    c.crawl(TorrentSpider)
    c.start()
python python-3.x web-scraping scrapy
asked 48 mins ago by robots.txt (162)
1 Answer
By putting the CSV-exporting logic into the spider itself, you are reinventing the wheel and giving up the advantages of Scrapy and its components. You are also slowing the crawl down, because the file is rewritten on disk every time the parse() callback is triggered.

As you mentioned, the CSV exporter is built in; you just need to yield/return items from the parse() callback:
    import scrapy

    class TorrentSpider(scrapy.Spider):
        name = "torrentdata"
        start_urls = ["https://yts.am/browse-movies?page={}".format(page) for page in range(2, 20)]

        def parse(self, response):
            for record in response.css('.browse-movie-bottom'):
                yield {
                    "Name": record.css('.browse-movie-title::text').extract_first(default=''),
                    "Year": record.css('.browse-movie-year::text').extract_first(default='')
                }
Then, by running

    scrapy runspider spider.py -o outputfile.csv -t csv

(or the crawl command) you would have the following in outputfile.csv:
    Name,Year
    "Faith, Love & Chocolate",2018
    Bennett's Song,2018
    ...
    Tender Mercies,1983
    You Might Be the Killer,2018
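Since you are executing the script from an IDE through CrawlerProcess anyway, you can still use the built-in exporter there, with no command line involved. A minimal sketch, assuming the FEED_FORMAT/FEED_URI settings of Scrapy 1.x (Scrapy 2.1+ replaced them with a single FEEDS dict):

    from scrapy.crawler import CrawlerProcess

    # TorrentSpider as defined above, yielding items from parse()
    c = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0',
        'FEED_FORMAT': 'csv',          # built-in CSV item exporter
        'FEED_URI': 'outputfile.csv',  # where the feed gets written
    })
    c.crawl(TorrentSpider)
    c.start()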
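And if you do want to keep writing the file yourself, at least move the writing out of parse(). A sketch of one way to do it, assuming Scrapy's closed() shortcut, which is called exactly once when the spider finishes, so the file is written a single time instead of on every callback:

    import csv
    import scrapy

    class TorrentSpider(scrapy.Spider):
        name = "torrentdata"
        start_urls = ["https://yts.am/browse-movies?page={}".format(page) for page in range(2, 20)]

        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self.itemlist = []  # per-instance state instead of a shared class attribute

        def parse(self, response):
            for record in response.css('.browse-movie-bottom'):
                self.itemlist.append({
                    "Name": record.css('.browse-movie-title::text').extract_first(default=''),
                    "Year": record.css('.browse-movie-year::text').extract_first(default=''),
                })

        def closed(self, reason):
            # Runs once when the spider closes (shortcut for the spider_closed signal).
            with open("outputfile.csv", "w", newline="") as f:
                writer = csv.DictWriter(f, ['Name', 'Year'])
                writer.writeheader()
                writer.writerows(self.itemlist)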
answered 30 mins ago by alecxe (14.6k)