Creating a csv file using scrapy
I've created a Python script with Scrapy that parses movie names and their release years from multiple pages of a torrent site. My goal is to write the parsed data to a CSV file without using Scrapy's built-in export command, because when I run

    scrapy crawl torrentdata -o outputfile.csv -t csv

I get a blank line in every alternate row of the CSV file. So I took a slightly different route to achieve the same thing, and I now get a correctly formatted, data-laden CSV file when I run the following script. Most importantly, I opened the CSV file in a with statement so that it is closed automatically once the writing is done, and I used CrawlerProcess to execute the script from within an IDE.

My question: isn't it a better idea to follow the approach I've tried below?
This is the working script:
    import scrapy
    from scrapy.crawler import CrawlerProcess
    import csv

    class TorrentSpider(scrapy.Spider):
        name = "torrentdata"
        start_urls = ["https://yts.am/browse-movies?page={}".format(page) for page in range(2, 20)]
        itemlist = []

        def parse(self, response):
            for record in response.css('.browse-movie-bottom'):
                items = {}
                items["Name"] = record.css('.browse-movie-title::text').extract_first(default='')
                items["Year"] = record.css('.browse-movie-year::text').extract_first(default='')
                self.itemlist.append(items)

            with open("outputfile.csv", "w", newline="") as f:
                writer = csv.DictWriter(f, ['Name', 'Year'])
                writer.writeheader()
                for data in self.itemlist:
                    writer.writerow(data)

    c = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0',
    })
    c.crawl(TorrentSpider)
    c.start()
python python-3.x web-scraping scrapy
asked 48 mins ago by robots.txt (162)
1 Answer
By putting the CSV-exporting logic into the spider itself, you are reinventing the wheel and giving up the advantages of Scrapy and its components. You are also slowing the crawl down, because the file is rewritten on disk every time the parse() callback is triggered.

As you mentioned, the CSV exporter is built in; you just need to yield/return items from the parse() callback:
    import scrapy

    class TorrentSpider(scrapy.Spider):
        name = "torrentdata"
        start_urls = ["https://yts.am/browse-movies?page={}".format(page) for page in range(2, 20)]

        def parse(self, response):
            for record in response.css('.browse-movie-bottom'):
                yield {
                    "Name": record.css('.browse-movie-title::text').extract_first(default=''),
                    "Year": record.css('.browse-movie-year::text').extract_first(default='')
                }
Then, by running

    scrapy runspider spider.py -o outputfile.csv -t csv

(or the crawl command) you would have the following in outputfile.csv:
    Name,Year
    "Faith, Love & Chocolate",2018
    Bennett's Song,2018
    ...
    Tender Mercies,1983
    You Might Be the Killer,2018
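Since you are executing the script from an IDE through CrawlerProcess anyway, you can still use the built-in exporter there, with no command line involved. A minimal sketch, assuming the FEED_FORMAT/FEED_URI settings of Scrapy 1.x (Scrapy 2.1+ replaced them with a single FEEDS dict):

    from scrapy.crawler import CrawlerProcess

    # TorrentSpider as defined above, yielding items from parse()
    c = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0',
        'FEED_FORMAT': 'csv',          # built-in CSV item exporter
        'FEED_URI': 'outputfile.csv',  # where the feed gets written
    })
    c.crawl(TorrentSpider)
    c.start()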
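And if you do want to keep writing the file yourself, at least move the writing out of parse(). A sketch of one way to do it, assuming Scrapy's closed() shortcut, which is called exactly once when the spider finishes, so the file is written a single time instead of on every callback:

    import csv
    import scrapy

    class TorrentSpider(scrapy.Spider):
        name = "torrentdata"
        start_urls = ["https://yts.am/browse-movies?page={}".format(page) for page in range(2, 20)]

        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self.itemlist = []  # per-instance state instead of a shared class attribute

        def parse(self, response):
            for record in response.css('.browse-movie-bottom'):
                self.itemlist.append({
                    "Name": record.css('.browse-movie-title::text').extract_first(default=''),
                    "Year": record.css('.browse-movie-year::text').extract_first(default=''),
                })

        def closed(self, reason):
            # Runs once when the spider closes (shortcut for the spider_closed signal).
            with open("outputfile.csv", "w", newline="") as f:
                writer = csv.DictWriter(f, ['Name', 'Year'])
                writer.writeheader()
                writer.writerows(self.itemlist)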
answered 30 mins ago by alecxe (14.6k)