Scrape data from a webpage and store them in a database











I've written a script in Python to parse the names, bubble ratings, and review counts of various restaurants from TripAdvisor, and store them in a table named webdata in a database. There are two functions in my script: one stores the scraped data in MySQL, and the other prints the stored rows to the console. The scraper does its job flawlessly. As this is my first attempt at storing items in a database and printing them back out from it, I may not have followed the rules of DRY.



I would greatly appreciate any modifications I should pursue to make the script more robust.



This is my attempt so far:



import mysql.connector
from bs4 import BeautifulSoup
import requests

URL = "https://www.tripadvisor.com.au/Restaurants-g255068-c8-Brisbane_Brisbane_Region_Queensland.html"

def store_info(link):
    mydb = mysql.connector.connect(
        host="localhost",
        user="root",
        passwd="test123",
        database="mydatabase"
    )

    mycursor = mydb.cursor()
    mycursor.execute("DROP TABLE IF EXISTS webdata")  # kick out the table if it already exists
    mycursor.execute("CREATE TABLE webdata (name VARCHAR(255), bubble VARCHAR(255), review VARCHAR(255))")

    response = requests.get(link)
    soup = BeautifulSoup(response.text, "lxml")
    for items in soup.find_all(class_="shortSellDetails"):
        name = items.find(class_="property_title").get_text(strip=True)
        bubble = items.find(class_="ui_bubble_rating").get("alt")
        review = items.find(class_="reviewCount").get_text(strip=True)

        mycursor.execute("INSERT INTO webdata (name,bubble,review) VALUES (%s,%s,%s)", (name, bubble, review))
        mydb.commit()

def get_data():
    mydb = mysql.connector.connect(
        host="localhost",
        user="root",
        passwd="test123",
        database="mydatabase"
    )
    mycursor = mydb.cursor()
    mycursor.execute("SELECT * FROM webdata")
    for item in mycursor.fetchall():
        print(item)
    mydb.close()

if __name__ == '__main__':
    store_info(URL)
    get_data()  # retrieve the data from that table









bumped to the homepage by Community 2 days ago


      python python-3.x mysql web-scraping






asked Sep 28 at 9:56 by SIM





1 Answer






• The database setup mydb should be in its own function, or even just
  a global since this script is rather minimal. There's no reason to
  duplicate this code.

• If mysql.connector doesn't support it, I'd look for a database
  connector that supports the with ... statement for connections and
  transactions (I'm relatively sure e.g. sqlalchemy does). You can
  always forget to call commit (or rollback), but with with you have
  a little less mental overhead.

• It's a simple script, but long-term I'd worry about losing data
  that way. Also, why drop the whole table instead of, say, running
  DELETE FROM webdata; and only creating the table if it doesn't
  already exist?

• The bs4 part looks okay, and the insert too; I'd perhaps add checks
  so that you don't overflow the 255-character limit with the review
  (or the name, I suppose). Some warnings around empty values would
  also help you notice when something's off.

• AFAIK requests is a good choice; you might also want to check the
  status code though.
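The first few points above can be sketched roughly as follows. This is a minimal sketch, not a drop-in replacement: it uses the standard-library sqlite3 module (with its ? placeholders) as a stand-in for mysql.connector, contextlib.closing to guarantee the connection is closed, and hypothetical helper names (get_connection, clip):

```python
import sqlite3
from contextlib import closing

DB_PATH = ":memory:"  # stand-in for the MySQL connection parameters

def get_connection():
    # A single place to configure the connection -- no duplication
    # between the store and retrieve functions.
    return sqlite3.connect(DB_PATH)

def clip(value, limit=255):
    # Guard against overflowing a VARCHAR(255) column.
    return value[:limit]

with closing(get_connection()) as conn:
    # Keep the table across runs; just clear old rows instead of
    # dropping and recreating it.
    conn.execute("CREATE TABLE IF NOT EXISTS webdata (name TEXT)")
    conn.execute("DELETE FROM webdata")
    conn.execute("INSERT INTO webdata (name) VALUES (?)",
                 (clip("Some Restaurant"),))
    conn.commit()
    rows = conn.execute("SELECT name FROM webdata").fetchall()
```

The same shape carries over to mysql.connector: the point is that connection setup lives in one function and cleanup happens even if an exception is raised mid-way.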


          Okay, so long-term:




• To be more flexible, look at argparse or any of the command-line
  parsing libraries, so that you never have to edit the script when
  one of the parameters changes (database connection, URL). I'd keep
  the HTML querying as it is; that's usually not worth putting into
  configuration unless it changes frequently.
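A minimal argparse sketch along those lines (the flag names and defaults here are hypothetical; pick whatever suits the script):

```python
import argparse

def build_parser():
    # Hypothetical flags mirroring the hard-coded values in the script.
    parser = argparse.ArgumentParser(
        description="Scrape restaurant data into a database")
    parser.add_argument(
        "--url",
        default="https://www.tripadvisor.com.au/Restaurants-g255068-c8-Brisbane_Brisbane_Region_Queensland.html",
        help="page to scrape")
    parser.add_argument("--host", default="localhost")
    parser.add_argument("--user", default="root")
    parser.add_argument("--database", default="mydatabase")
    return parser

# An explicit argument list is passed here only for demonstration;
# in the real script you'd call parse_args() with no arguments so it
# reads sys.argv.
args = build_parser().parse_args(["--url", "https://example.com", "--host", "db1"])
```

Unspecified flags fall back to their defaults, so the script keeps working unchanged when run with no arguments.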






answered Sep 28 at 20:14 by ferada





























