Scrape data from a webpage and store them in a database











up vote
1
down vote

favorite












I've written a script in python to parse different names ,bubbles and reviews of various restaurant names from tripadvisor and finally store them in a table named webdata in a database. There are two functions within my script: one for storing the scraped data in MySQL and the other for printing the same in the console. The scraper is doing its job just flawless. As this is my first ever try to store items in a database and print the already stored items in the console, I might not have been able to follow the rules of DRY.



I will vastly appreciate any modification I should pursue to make the script robust.



This is my attempt so far:



import mysql.connector
from bs4 import BeautifulSoup
import requests

URL = "https://www.tripadvisor.com.au/Restaurants-g255068-c8-Brisbane_Brisbane_Region_Queensland.html"

def store_info(link):
mydb = mysql.connector.connect(
host="localhost",
user="root",
passwd = "test123",
database="mydatabase"
)

mycursor = mydb.cursor()
mycursor.execute("DROP TABLE webdata") #kick out the table if it already exists
mycursor.execute("CREATE TABLE webdata (name VARCHAR(255), bubble VARCHAR(255), review VARCHAR(255))")

response = requests.get(link)
soup = BeautifulSoup(response.text,"lxml")
for items in soup.find_all(class_="shortSellDetails"):
name = items.find(class_="property_title").get_text(strip=True)
bubble = items.find(class_="ui_bubble_rating").get("alt")
review = items.find(class_="reviewCount").get_text(strip=True)

mycursor.execute("INSERT INTO webdata (name,bubble,review) VALUES (%s,%s,%s)",(name,bubble,review))
mydb.commit()

def get_data():
mydb = mysql.connector.connect(
host="localhost",
user="root",
passwd = "test123",
database="mydatabase"
)
mycursor = mydb.cursor()
mycursor.execute("SELECT * FROM webdata")
for item in mycursor.fetchall():
print(item)
mydb.close()

if __name__ == '__main__':
store_info(URL)
get_data() #retrieve the data from that table









share|improve this question














bumped to the homepage by Community 2 days ago


This question has answers that may be good or bad; the system has marked it active so that they can be reviewed.



















    up vote
    1
    down vote

    favorite












    I've written a script in python to parse different names ,bubbles and reviews of various restaurant names from tripadvisor and finally store them in a table named webdata in a database. There are two functions within my script: one for storing the scraped data in MySQL and the other for printing the same in the console. The scraper is doing its job just flawless. As this is my first ever try to store items in a database and print the already stored items in the console, I might not have been able to follow the rules of DRY.



    I will vastly appreciate any modification I should pursue to make the script robust.



    This is my attempt so far:



    import mysql.connector
    from bs4 import BeautifulSoup
    import requests

    URL = "https://www.tripadvisor.com.au/Restaurants-g255068-c8-Brisbane_Brisbane_Region_Queensland.html"

    def store_info(link):
    mydb = mysql.connector.connect(
    host="localhost",
    user="root",
    passwd = "test123",
    database="mydatabase"
    )

    mycursor = mydb.cursor()
    mycursor.execute("DROP TABLE webdata") #kick out the table if it already exists
    mycursor.execute("CREATE TABLE webdata (name VARCHAR(255), bubble VARCHAR(255), review VARCHAR(255))")

    response = requests.get(link)
    soup = BeautifulSoup(response.text,"lxml")
    for items in soup.find_all(class_="shortSellDetails"):
    name = items.find(class_="property_title").get_text(strip=True)
    bubble = items.find(class_="ui_bubble_rating").get("alt")
    review = items.find(class_="reviewCount").get_text(strip=True)

    mycursor.execute("INSERT INTO webdata (name,bubble,review) VALUES (%s,%s,%s)",(name,bubble,review))
    mydb.commit()

    def get_data():
    mydb = mysql.connector.connect(
    host="localhost",
    user="root",
    passwd = "test123",
    database="mydatabase"
    )
    mycursor = mydb.cursor()
    mycursor.execute("SELECT * FROM webdata")
    for item in mycursor.fetchall():
    print(item)
    mydb.close()

    if __name__ == '__main__':
    store_info(URL)
    get_data() #retrieve the data from that table









    share|improve this question














    bumped to the homepage by Community 2 days ago


    This question has answers that may be good or bad; the system has marked it active so that they can be reviewed.

















      up vote
      1
      down vote

      favorite









      up vote
      1
      down vote

      favorite











      I've written a script in python to parse different names ,bubbles and reviews of various restaurant names from tripadvisor and finally store them in a table named webdata in a database. There are two functions within my script: one for storing the scraped data in MySQL and the other for printing the same in the console. The scraper is doing its job just flawless. As this is my first ever try to store items in a database and print the already stored items in the console, I might not have been able to follow the rules of DRY.



      I will vastly appreciate any modification I should pursue to make the script robust.



      This is my attempt so far:



      import mysql.connector
      from bs4 import BeautifulSoup
      import requests

      URL = "https://www.tripadvisor.com.au/Restaurants-g255068-c8-Brisbane_Brisbane_Region_Queensland.html"

      def store_info(link):
      mydb = mysql.connector.connect(
      host="localhost",
      user="root",
      passwd = "test123",
      database="mydatabase"
      )

      mycursor = mydb.cursor()
      mycursor.execute("DROP TABLE webdata") #kick out the table if it already exists
      mycursor.execute("CREATE TABLE webdata (name VARCHAR(255), bubble VARCHAR(255), review VARCHAR(255))")

      response = requests.get(link)
      soup = BeautifulSoup(response.text,"lxml")
      for items in soup.find_all(class_="shortSellDetails"):
      name = items.find(class_="property_title").get_text(strip=True)
      bubble = items.find(class_="ui_bubble_rating").get("alt")
      review = items.find(class_="reviewCount").get_text(strip=True)

      mycursor.execute("INSERT INTO webdata (name,bubble,review) VALUES (%s,%s,%s)",(name,bubble,review))
      mydb.commit()

      def get_data():
      mydb = mysql.connector.connect(
      host="localhost",
      user="root",
      passwd = "test123",
      database="mydatabase"
      )
      mycursor = mydb.cursor()
      mycursor.execute("SELECT * FROM webdata")
      for item in mycursor.fetchall():
      print(item)
      mydb.close()

      if __name__ == '__main__':
      store_info(URL)
      get_data() #retrieve the data from that table









      share|improve this question













      I've written a script in python to parse different names ,bubbles and reviews of various restaurant names from tripadvisor and finally store them in a table named webdata in a database. There are two functions within my script: one for storing the scraped data in MySQL and the other for printing the same in the console. The scraper is doing its job just flawless. As this is my first ever try to store items in a database and print the already stored items in the console, I might not have been able to follow the rules of DRY.



      I will vastly appreciate any modification I should pursue to make the script robust.



      This is my attempt so far:



      import mysql.connector
      from bs4 import BeautifulSoup
      import requests

      URL = "https://www.tripadvisor.com.au/Restaurants-g255068-c8-Brisbane_Brisbane_Region_Queensland.html"

      def store_info(link):
      mydb = mysql.connector.connect(
      host="localhost",
      user="root",
      passwd = "test123",
      database="mydatabase"
      )

      mycursor = mydb.cursor()
      mycursor.execute("DROP TABLE webdata") #kick out the table if it already exists
      mycursor.execute("CREATE TABLE webdata (name VARCHAR(255), bubble VARCHAR(255), review VARCHAR(255))")

      response = requests.get(link)
      soup = BeautifulSoup(response.text,"lxml")
      for items in soup.find_all(class_="shortSellDetails"):
      name = items.find(class_="property_title").get_text(strip=True)
      bubble = items.find(class_="ui_bubble_rating").get("alt")
      review = items.find(class_="reviewCount").get_text(strip=True)

      mycursor.execute("INSERT INTO webdata (name,bubble,review) VALUES (%s,%s,%s)",(name,bubble,review))
      mydb.commit()

      def get_data():
      mydb = mysql.connector.connect(
      host="localhost",
      user="root",
      passwd = "test123",
      database="mydatabase"
      )
      mycursor = mydb.cursor()
      mycursor.execute("SELECT * FROM webdata")
      for item in mycursor.fetchall():
      print(item)
      mydb.close()

      if __name__ == '__main__':
      store_info(URL)
      get_data() #retrieve the data from that table






      python python-3.x mysql web-scraping






      share|improve this question













      share|improve this question











      share|improve this question




      share|improve this question










      asked Sep 28 at 9:56









      SIM

      1,132629




      1,132629





      bumped to the homepage by Community 2 days ago


      This question has answers that may be good or bad; the system has marked it active so that they can be reviewed.







      bumped to the homepage by Community 2 days ago


      This question has answers that may be good or bad; the system has marked it active so that they can be reviewed.
























          1 Answer
          1






          active

          oldest

          votes

















          up vote
          0
          down vote














          • The database setup mydb should be in it's own function, or even just
            a global since this script is rather minimal. There's no reason to
            duplicate this code.

          • If mysql.connector doesn't, I'd look for a database connector that
            supports the with ... statement for connections and transactions
            (relatively sure e.g. sqlalchemy supports that). Because you can
            always forget to call commit (or rollback), but with with you
            have a little bit less mental overhead.

          • It's a simple script ... long term I'd worry about losing data that
            way. Also why is it dropping the whole table instead of, say,
            DELETE FROM webdata; and only recreating the table if it doesn't
            exist?

          • The bs4 part looks okay and the insert too; I'd perhaps have some
            checks in there that you don't overflow on the 255 limit with the
            review (or the name I suppose). Also some warnings around empty
            values might be good for you to debug / notice that something's off.

          • AFAIK requests is a good choice, you might also want to check the
            status code though.


          Okay, so long-term:




          • To be more flexible, look at argparse or any of the command line
            parsing libraries so that you never have to edit the script in case
            one of the parameters changes (database connection, URL). I'd keep
            the HTML querying like it is, that's usually not wise to put into a
            configuration unless it changes frequently.






          share|improve this answer





















            Your Answer





            StackExchange.ifUsing("editor", function () {
            return StackExchange.using("mathjaxEditing", function () {
            StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
            StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["\$", "\$"]]);
            });
            });
            }, "mathjax-editing");

            StackExchange.ifUsing("editor", function () {
            StackExchange.using("externalEditor", function () {
            StackExchange.using("snippets", function () {
            StackExchange.snippets.init();
            });
            });
            }, "code-snippets");

            StackExchange.ready(function() {
            var channelOptions = {
            tags: "".split(" "),
            id: "196"
            };
            initTagRenderer("".split(" "), "".split(" "), channelOptions);

            StackExchange.using("externalEditor", function() {
            // Have to fire editor after snippets, if snippets enabled
            if (StackExchange.settings.snippets.snippetsEnabled) {
            StackExchange.using("snippets", function() {
            createEditor();
            });
            }
            else {
            createEditor();
            }
            });

            function createEditor() {
            StackExchange.prepareEditor({
            heartbeatType: 'answer',
            convertImagesToLinks: false,
            noModals: true,
            showLowRepImageUploadWarning: true,
            reputationToPostImages: null,
            bindNavPrevention: true,
            postfix: "",
            imageUploader: {
            brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
            contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
            allowUrls: true
            },
            onDemand: true,
            discardSelector: ".discard-answer"
            ,immediatelyShowMarkdownHelp:true
            });


            }
            });














            draft saved

            draft discarded


















            StackExchange.ready(
            function () {
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f204521%2fscrape-data-from-a-webpage-and-store-them-in-a-database%23new-answer', 'question_page');
            }
            );

            Post as a guest















            Required, but never shown

























            1 Answer
            1






            active

            oldest

            votes








            1 Answer
            1






            active

            oldest

            votes









            active

            oldest

            votes






            active

            oldest

            votes








            up vote
            0
            down vote














            • The database setup mydb should be in it's own function, or even just
              a global since this script is rather minimal. There's no reason to
              duplicate this code.

            • If mysql.connector doesn't, I'd look for a database connector that
              supports the with ... statement for connections and transactions
              (relatively sure e.g. sqlalchemy supports that). Because you can
              always forget to call commit (or rollback), but with with you
              have a little bit less mental overhead.

            • It's a simple script ... long term I'd worry about losing data that
              way. Also why is it dropping the whole table instead of, say,
              DELETE FROM webdata; and only recreating the table if it doesn't
              exist?

            • The bs4 part looks okay and the insert too; I'd perhaps have some
              checks in there that you don't overflow on the 255 limit with the
              review (or the name I suppose). Also some warnings around empty
              values might be good for you to debug / notice that something's off.

            • AFAIK requests is a good choice, you might also want to check the
              status code though.


            Okay, so long-term:




            • To be more flexible, look at argparse or any of the command line
              parsing libraries so that you never have to edit the script in case
              one of the parameters changes (database connection, URL). I'd keep
              the HTML querying like it is, that's usually not wise to put into a
              configuration unless it changes frequently.






            share|improve this answer

























              up vote
              0
              down vote














              • The database setup mydb should be in it's own function, or even just
                a global since this script is rather minimal. There's no reason to
                duplicate this code.

              • If mysql.connector doesn't, I'd look for a database connector that
                supports the with ... statement for connections and transactions
                (relatively sure e.g. sqlalchemy supports that). Because you can
                always forget to call commit (or rollback), but with with you
                have a little bit less mental overhead.

              • It's a simple script ... long term I'd worry about losing data that
                way. Also why is it dropping the whole table instead of, say,
                DELETE FROM webdata; and only recreating the table if it doesn't
                exist?

              • The bs4 part looks okay and the insert too; I'd perhaps have some
                checks in there that you don't overflow on the 255 limit with the
                review (or the name I suppose). Also some warnings around empty
                values might be good for you to debug / notice that something's off.

              • AFAIK requests is a good choice, you might also want to check the
                status code though.


              Okay, so long-term:




              • To be more flexible, look at argparse or any of the command line
                parsing libraries so that you never have to edit the script in case
                one of the parameters changes (database connection, URL). I'd keep
                the HTML querying like it is, that's usually not wise to put into a
                configuration unless it changes frequently.






              share|improve this answer























                up vote
                0
                down vote










                up vote
                0
                down vote










                • The database setup mydb should be in it's own function, or even just
                  a global since this script is rather minimal. There's no reason to
                  duplicate this code.

                • If mysql.connector doesn't, I'd look for a database connector that
                  supports the with ... statement for connections and transactions
                  (relatively sure e.g. sqlalchemy supports that). Because you can
                  always forget to call commit (or rollback), but with with you
                  have a little bit less mental overhead.

                • It's a simple script ... long term I'd worry about losing data that
                  way. Also why is it dropping the whole table instead of, say,
                  DELETE FROM webdata; and only recreating the table if it doesn't
                  exist?

                • The bs4 part looks okay and the insert too; I'd perhaps have some
                  checks in there that you don't overflow on the 255 limit with the
                  review (or the name I suppose). Also some warnings around empty
                  values might be good for you to debug / notice that something's off.

                • AFAIK requests is a good choice, you might also want to check the
                  status code though.


                Okay, so long-term:




                • To be more flexible, look at argparse or any of the command line
                  parsing libraries so that you never have to edit the script in case
                  one of the parameters changes (database connection, URL). I'd keep
                  the HTML querying like it is, that's usually not wise to put into a
                  configuration unless it changes frequently.






                share|improve this answer













                • The database setup mydb should be in it's own function, or even just
                  a global since this script is rather minimal. There's no reason to
                  duplicate this code.

                • If mysql.connector doesn't, I'd look for a database connector that
                  supports the with ... statement for connections and transactions
                  (relatively sure e.g. sqlalchemy supports that). Because you can
                  always forget to call commit (or rollback), but with with you
                  have a little bit less mental overhead.

                • It's a simple script ... long term I'd worry about losing data that
                  way. Also why is it dropping the whole table instead of, say,
                  DELETE FROM webdata; and only recreating the table if it doesn't
                  exist?

                • The bs4 part looks okay and the insert too; I'd perhaps have some
                  checks in there that you don't overflow on the 255 limit with the
                  review (or the name I suppose). Also some warnings around empty
                  values might be good for you to debug / notice that something's off.

                • AFAIK requests is a good choice, you might also want to check the
                  status code though.


                Okay, so long-term:




                • To be more flexible, look at argparse or any of the command line
                  parsing libraries so that you never have to edit the script in case
                  one of the parameters changes (database connection, URL). I'd keep
                  the HTML querying like it is, that's usually not wise to put into a
                  configuration unless it changes frequently.







                share|improve this answer












                share|improve this answer



                share|improve this answer










                answered Sep 28 at 20:14









                ferada

                9,1011554




                9,1011554






























                    draft saved

                    draft discarded




















































                    Thanks for contributing an answer to Code Review Stack Exchange!


                    • Please be sure to answer the question. Provide details and share your research!

                    But avoid



                    • Asking for help, clarification, or responding to other answers.

                    • Making statements based on opinion; back them up with references or personal experience.


                    Use MathJax to format equations. MathJax reference.


                    To learn more, see our tips on writing great answers.





                    Some of your past answers have not been well-received, and you're in danger of being blocked from answering.


                    Please pay close attention to the following guidance:


                    • Please be sure to answer the question. Provide details and share your research!

                    But avoid



                    • Asking for help, clarification, or responding to other answers.

                    • Making statements based on opinion; back them up with references or personal experience.


                    To learn more, see our tips on writing great answers.




                    draft saved


                    draft discarded














                    StackExchange.ready(
                    function () {
                    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f204521%2fscrape-data-from-a-webpage-and-store-them-in-a-database%23new-answer', 'question_page');
                    }
                    );

                    Post as a guest















                    Required, but never shown





















































                    Required, but never shown














                    Required, but never shown












                    Required, but never shown







                    Required, but never shown

































                    Required, but never shown














                    Required, but never shown












                    Required, but never shown







                    Required, but never shown







                    Popular posts from this blog

                    Ellipse (mathématiques)

                    Quarter-circle Tiles

                    Mont Emei