Web scraping google trends in python











up vote
4
down vote

favorite
2












Question



Recently I have tried to create a web scraping program to get data from google trends. It uses an RSS feed to do so. My question is as follows:



How can I improve my code such that it is more concise, efficient, and pleasing to a programmers eye? I would like to emphasise that I'm most concerned about if I've misused any functions, methods, or syntax in my code and if it could be 'cleaner'



Context




  • I'm not experienced at all. I'm doing this to learn.

  • Security isn't an issue since the program is running locally.


The code



"""Uses webscraping to retrieve html information from Google Trends for 
parsing"""

# Imports

import time
import ssl
from requests import get
from bs4 import BeautifulSoup

# Url for request

url = "https://trends.google.com/trends/trendingsearches/daily/rss?geo=US"

# Used to create an unverified certificate.

ssl._create_default_https_context = ssl._create_unverified_context


class connection(object):

def __init__(self, f):
self.f = f

def __call__(self):
"""Calculates the time taken to complete the request without any errors"""
try:
# Times the amount of time the function takes to complete
global html
start = time.time()
html = get(url)
self.f()
end = time.time()
time_taken = end - start
result = print("the function {} took {} to connect".format(self.f.__name__, time_taken))
return result
except:
print("function {} failed to connect".format(self.f.__name__))

@connection
def html_parser():
"""Parses the html into storable text"""
html_unparsed = html.text
soup = BeautifulSoup(html_unparsed, "html.parser")
titles_unformatted = soup.find_all("title")
traffic_unformatted = soup.find_all("ht:approx_traffic")
# Indexes through list to make data readable
titles = [x.text for x in titles_unformatted]
traffic =
for x in traffic_unformatted:
x = x.text
x = x.replace("+","")
x = x.replace(",", "")
traffic.append(x)
print(traffic, titles)
html_parser()


Output



['100000', '2000000', '2000000', '1000000', '1000000', '1000000', 
'1000000', '500000', '500000', '500000', '200000', '200000', '200000',
'200000', '200000', '200000', '200000', '200000', '200000', '200000']
['Daily Search Trends', '23 and Me', 'NFL scores', 'Best Buy', 'Walmart
Supercenter', 'NFL', 'Saints', 'GameStop', 'JCPenney', 'Lion King 2019',
'Starbucks', 'Dollar General', 'Amazon Black Friday', 'Mike Posner',
'NFL games On Thanksgiving', 'McDonalds', 'Bath And Body Works Black
Friday', 'Old Navy Black Friday 2018', 'Kroger', 'NFL standings',
'Safeway']
the function html_parser took 1.0186748504638672 to connect


Concerns




  • Makes programmers cringe.


As someone relatively new to python and programming in general my worst fear is that this code gives someone else a headache or a laugh to look at. At the end of the day I just want to improve, so to reiterate: how can I make this code look better? Have I misused any functions?










share|improve this question







New contributor




Ranch Mayonaise is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.
























    up vote
    4
    down vote

    favorite
    2












    Question



    Recently I have tried to create a web scraping program to get data from google trends. It uses an RSS feed to do so. My question is as follows:



    How can I improve my code such that it is more concise, efficient, and pleasing to a programmers eye? I would like to emphasise that I'm most concerned about if I've misused any functions, methods, or syntax in my code and if it could be 'cleaner'



    Context




    • I'm not experienced at all. I'm doing this to learn.

    • Security isn't an issue since the program is running locally.


    The code



    """Uses webscraping to retrieve html information from Google Trends for 
    parsing"""

    # Imports

    import time
    import ssl
    from requests import get
    from bs4 import BeautifulSoup

    # Url for request

    url = "https://trends.google.com/trends/trendingsearches/daily/rss?geo=US"

    # Used to create an unverified certificate.

    ssl._create_default_https_context = ssl._create_unverified_context


    class connection(object):

    def __init__(self, f):
    self.f = f

    def __call__(self):
    """Calculates the time taken to complete the request without any errors"""
    try:
    # Times the amount of time the function takes to complete
    global html
    start = time.time()
    html = get(url)
    self.f()
    end = time.time()
    time_taken = end - start
    result = print("the function {} took {} to connect".format(self.f.__name__, time_taken))
    return result
    except:
    print("function {} failed to connect".format(self.f.__name__))

    @connection
    def html_parser():
    """Parses the html into storable text"""
    html_unparsed = html.text
    soup = BeautifulSoup(html_unparsed, "html.parser")
    titles_unformatted = soup.find_all("title")
    traffic_unformatted = soup.find_all("ht:approx_traffic")
    # Indexes through list to make data readable
    titles = [x.text for x in titles_unformatted]
    traffic =
    for x in traffic_unformatted:
    x = x.text
    x = x.replace("+","")
    x = x.replace(",", "")
    traffic.append(x)
    print(traffic, titles)
    html_parser()


    Output



    ['100000', '2000000', '2000000', '1000000', '1000000', '1000000', 
    '1000000', '500000', '500000', '500000', '200000', '200000', '200000',
    '200000', '200000', '200000', '200000', '200000', '200000', '200000']
    ['Daily Search Trends', '23 and Me', 'NFL scores', 'Best Buy', 'Walmart
    Supercenter', 'NFL', 'Saints', 'GameStop', 'JCPenney', 'Lion King 2019',
    'Starbucks', 'Dollar General', 'Amazon Black Friday', 'Mike Posner',
    'NFL games On Thanksgiving', 'McDonalds', 'Bath And Body Works Black
    Friday', 'Old Navy Black Friday 2018', 'Kroger', 'NFL standings',
    'Safeway']
    the function html_parser took 1.0186748504638672 to connect


    Concerns




    • Makes programmers cringe.


    As someone relatively new to python and programming in general my worst fear is that this code gives someone else a headache or a laugh to look at. At the end of the day I just want to improve, so to reiterate: how can I make this code look better? Have I misused any functions?










    share|improve this question







    New contributor




    Ranch Mayonaise is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
    Check out our Code of Conduct.






















      up vote
      4
      down vote

      favorite
      2









      up vote
      4
      down vote

      favorite
      2






      2





      Question



      Recently I have tried to create a web scraping program to get data from google trends. It uses an RSS feed to do so. My question is as follows:



      How can I improve my code such that it is more concise, efficient, and pleasing to a programmers eye? I would like to emphasise that I'm most concerned about if I've misused any functions, methods, or syntax in my code and if it could be 'cleaner'



      Context




      • I'm not experienced at all. I'm doing this to learn.

      • Security isn't an issue since the program is running locally.


      The code



      """Uses webscraping to retrieve html information from Google Trends for 
      parsing"""

      # Imports

      import time
      import ssl
      from requests import get
      from bs4 import BeautifulSoup

      # Url for request

      url = "https://trends.google.com/trends/trendingsearches/daily/rss?geo=US"

      # Used to create an unverified certificate.

      ssl._create_default_https_context = ssl._create_unverified_context


      class connection(object):

      def __init__(self, f):
      self.f = f

      def __call__(self):
      """Calculates the time taken to complete the request without any errors"""
      try:
      # Times the amount of time the function takes to complete
      global html
      start = time.time()
      html = get(url)
      self.f()
      end = time.time()
      time_taken = end - start
      result = print("the function {} took {} to connect".format(self.f.__name__, time_taken))
      return result
      except:
      print("function {} failed to connect".format(self.f.__name__))

      @connection
      def html_parser():
      """Parses the html into storable text"""
      html_unparsed = html.text
      soup = BeautifulSoup(html_unparsed, "html.parser")
      titles_unformatted = soup.find_all("title")
      traffic_unformatted = soup.find_all("ht:approx_traffic")
      # Indexes through list to make data readable
      titles = [x.text for x in titles_unformatted]
      traffic =
      for x in traffic_unformatted:
      x = x.text
      x = x.replace("+","")
      x = x.replace(",", "")
      traffic.append(x)
      print(traffic, titles)
      html_parser()


      Output



      ['100000', '2000000', '2000000', '1000000', '1000000', '1000000', 
      '1000000', '500000', '500000', '500000', '200000', '200000', '200000',
      '200000', '200000', '200000', '200000', '200000', '200000', '200000']
      ['Daily Search Trends', '23 and Me', 'NFL scores', 'Best Buy', 'Walmart
      Supercenter', 'NFL', 'Saints', 'GameStop', 'JCPenney', 'Lion King 2019',
      'Starbucks', 'Dollar General', 'Amazon Black Friday', 'Mike Posner',
      'NFL games On Thanksgiving', 'McDonalds', 'Bath And Body Works Black
      Friday', 'Old Navy Black Friday 2018', 'Kroger', 'NFL standings',
      'Safeway']
      the function html_parser took 1.0186748504638672 to connect


      Concerns




      • Makes programmers cringe.


      As someone relatively new to python and programming in general my worst fear is that this code gives someone else a headache or a laugh to look at. At the end of the day I just want to improve, so to reiterate: how can I make this code look better? Have I misused any functions?










      share|improve this question







      New contributor




      Ranch Mayonaise is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
      Check out our Code of Conduct.











      Question



      Recently I have tried to create a web scraping program to get data from google trends. It uses an RSS feed to do so. My question is as follows:



      How can I improve my code such that it is more concise, efficient, and pleasing to a programmers eye? I would like to emphasise that I'm most concerned about if I've misused any functions, methods, or syntax in my code and if it could be 'cleaner'



      Context




      • I'm not experienced at all. I'm doing this to learn.

      • Security isn't an issue since the program is running locally.


      The code



      """Uses webscraping to retrieve html information from Google Trends for 
      parsing"""

      # Imports

      import time
      import ssl
      from requests import get
      from bs4 import BeautifulSoup

      # Url for request

      url = "https://trends.google.com/trends/trendingsearches/daily/rss?geo=US"

      # Used to create an unverified certificate.

      ssl._create_default_https_context = ssl._create_unverified_context


      class connection(object):

      def __init__(self, f):
      self.f = f

      def __call__(self):
      """Calculates the time taken to complete the request without any errors"""
      try:
      # Times the amount of time the function takes to complete
      global html
      start = time.time()
      html = get(url)
      self.f()
      end = time.time()
      time_taken = end - start
      result = print("the function {} took {} to connect".format(self.f.__name__, time_taken))
      return result
      except:
      print("function {} failed to connect".format(self.f.__name__))

      @connection
      def html_parser():
      """Parses the html into storable text"""
      html_unparsed = html.text
      soup = BeautifulSoup(html_unparsed, "html.parser")
      titles_unformatted = soup.find_all("title")
      traffic_unformatted = soup.find_all("ht:approx_traffic")
      # Indexes through list to make data readable
      titles = [x.text for x in titles_unformatted]
      traffic =
      for x in traffic_unformatted:
      x = x.text
      x = x.replace("+","")
      x = x.replace(",", "")
      traffic.append(x)
      print(traffic, titles)
      html_parser()


      Output



      ['100000', '2000000', '2000000', '1000000', '1000000', '1000000', 
      '1000000', '500000', '500000', '500000', '200000', '200000', '200000',
      '200000', '200000', '200000', '200000', '200000', '200000', '200000']
      ['Daily Search Trends', '23 and Me', 'NFL scores', 'Best Buy', 'Walmart
      Supercenter', 'NFL', 'Saints', 'GameStop', 'JCPenney', 'Lion King 2019',
      'Starbucks', 'Dollar General', 'Amazon Black Friday', 'Mike Posner',
      'NFL games On Thanksgiving', 'McDonalds', 'Bath And Body Works Black
      Friday', 'Old Navy Black Friday 2018', 'Kroger', 'NFL standings',
      'Safeway']
      the function html_parser took 1.0186748504638672 to connect


      Concerns




      • Makes programmers cringe.


      As someone relatively new to python and programming in general my worst fear is that this code gives someone else a headache or a laugh to look at. At the end of the day I just want to improve, so to reiterate: how can I make this code look better? Have I misused any functions?







      python object-oriented python-3.x web-scraping






      share|improve this question







      New contributor




      Ranch Mayonaise is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
      Check out our Code of Conduct.











      share|improve this question







      New contributor




      Ranch Mayonaise is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
      Check out our Code of Conduct.









      share|improve this question




      share|improve this question






      New contributor




      Ranch Mayonaise is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
      Check out our Code of Conduct.









      asked 2 days ago









      Ranch Mayonaise

      233




      233




      New contributor




      Ranch Mayonaise is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
      Check out our Code of Conduct.





      New contributor





      Ranch Mayonaise is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
      Check out our Code of Conduct.






      Ranch Mayonaise is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
      Check out our Code of Conduct.






















          1 Answer
          1






          active

          oldest

          votes

















          up vote
          2
          down vote



          accepted










          The idea behind your script is pretty cool and there's a couple ways to polish it up! While it certainly wouldn't cause a headache, I think that creating an entire class just to measure how long the request is going to take to complete is a little overkill.



          Within the class itself, let's look at 2 things which aren't the best practice.




          • Use of a global variable

          • Use of the try/except block


          Global variables



          There's a brilliant StackOverflow question with really good answers as to why using global variables should be discouraged. While in the scope of your script it is not harmful, imagine you had many functions and methods in your script. Using the global variable could decrease readability and potentially alter the behaviour of your code. Mind you, global variables aren't the same as global constants (also discussed in the StackOverflow question).



          Exception handling



          Using a bare except statement will handle all exceptions, even the ones you might not anticipate. Catch errors explicitly, otherwise you might cause unexpected behaviour in your code, should it become more complex. In this scope, your bare except statement should become more like this:



          try:
          ...
          except requests.exceptions.ConnectionError:
          print("failed to connect")


          This way, you are handling the case where your request actually fails and you let other errors happen in their intended way. Otherwise, a different exception could have been raised and you won't know what caused it.



          Additionally, on this class:




          • Following naming conventions, class names should use CapWords.

          • In Python 3+, classes no longer need to be inherited from object.


          Instead of the class, you could just define a simple function which retrieves the XML document from the RSS feed, declare your timing functionality within it and returns the XML document. Seeing as the URL query takes in what's seemingly an ISO 3166-2 country code as one of the parameters, you could pass the country code into this function and have the ability to fetch search trends for so many different countries! Please note that in the following snippet I am using f-strings which were introduced since Python 3.6.



          def fetch_xml(country_code):
          url = f"https://trendse.google.com/trends/trendingsearches/daily/rss?geo={country_code}"
          start = time.time()
          response = requests.get(url)
          response_time = time.time() - start
          print(f"The request took {response_time}s to complete.")
          return response.content`


          Parser



          Looking at the docstring of your parser, one might argue it's a little misleading. You're mentioning "storable text" while you're only printing your output! To come closer to the intention of the docstring, it would be better to return the parsed data and decide what to do with it later. I think in this case, a dictionary would a very fitting data structure. Dictionaries store key-value pairs, so lets use the trend title as the key and the traffic approximation as the value.



          Using re.sub() will allow you to remove different characters from a string in a single line, rather than replacing each character separately! This piece of code here:



          for x in traffic_unformatted: 
          x = x.text
          x = x.replace("+","")
          x = x.replace(",", "")


          Can now become:



              for x in traffic_unformatted:
          x = re.sub("[+,]", "", x.text)


          Zip the two lists of titles and traffic approximations together and iterate over them to create the dictionary. I am going to use a dictionary comprehension in the final script, however it works identically to the following.



          titles = soup.find_all("title")
          approximate_traffic = soup.find_all("ht:approx_traffic")
          trends = {}
          for title, traffic in zip(titles[1:], approximate_traffic):
          trends[title.text] = re.sub("[+,]", "", traffic.text)
          return trends


          Please pay attention to the titles function having its first element sliced off - this is because BeautifulSoup picked up the title of the entire XML document together with titles of the trends.



          Finally, I am using an if __name__ == "__main__" guard to separate functions from the block of code where I am calling the functions from.



          Entire script:



          import time
          import re
          import requests
          from bs4 import BeautifulSoup


          def fetch_xml(country_code):
          url = f"https://trends.google.com/trends/trendingsearches/daily/rss?geo={country_code}"
          start = time.time()
          response = requests.get(url)
          response_time = time.time() - start
          print(f"The request took {response_time}s to complete.")
          return response.content

          def trends_retriever(country_code):
          """Parses the Google Trends RSS feed using BeautifulSoup.

          Returns:
          dict: Trend title for key, trend approximate traffic for value.
          """
          xml_document = fetch_xml(country_code)
          soup = BeautifulSoup(xml_document, "lxml")
          titles = soup.find_all("title")
          approximate_traffic = soup.find_all("ht:approx_traffic")
          return {title.text: re.sub("[+,]", "", traffic.text)
          for title, traffic in zip(titles[1:], approximate_traffic)}


          if __name__ == '__main__':
          trends = trends_retriever("US")
          print(trends)


          Output:



          The request took 0.2512788772583008s to complete.
          {'Tiger vs Phil': '2000000', 'McKenzie Milton': '200000', 'Costco Black Friday': '200000', 'Nebraska Football': '200000', 'Patagonia': '100000', 'Michael Kors': '100000', 'Jcrew': '100000', 'finish line': '100000', 'WVU football': '100000', 'Texas football': '100000', 'Shoe Carnival': '100000', 'J Crew': '100000', 'Llbean': '100000', 'Cards Against Humanity': '100000', 'Bleacher Report': '100000', 'Steph Curry': '50000', 'Apple Cup': '50000', 'Bob McNair': '50000', 'Virginia Tech football': '50000', 'Glossier': '50000'}





          share|improve this answer





















            Your Answer





            StackExchange.ifUsing("editor", function () {
            return StackExchange.using("mathjaxEditing", function () {
            StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
            StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["\$", "\$"]]);
            });
            });
            }, "mathjax-editing");

            StackExchange.ifUsing("editor", function () {
            StackExchange.using("externalEditor", function () {
            StackExchange.using("snippets", function () {
            StackExchange.snippets.init();
            });
            });
            }, "code-snippets");

            StackExchange.ready(function() {
            var channelOptions = {
            tags: "".split(" "),
            id: "196"
            };
            initTagRenderer("".split(" "), "".split(" "), channelOptions);

            StackExchange.using("externalEditor", function() {
            // Have to fire editor after snippets, if snippets enabled
            if (StackExchange.settings.snippets.snippetsEnabled) {
            StackExchange.using("snippets", function() {
            createEditor();
            });
            }
            else {
            createEditor();
            }
            });

            function createEditor() {
            StackExchange.prepareEditor({
            heartbeatType: 'answer',
            convertImagesToLinks: false,
            noModals: true,
            showLowRepImageUploadWarning: true,
            reputationToPostImages: null,
            bindNavPrevention: true,
            postfix: "",
            imageUploader: {
            brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
            contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
            allowUrls: true
            },
            onDemand: true,
            discardSelector: ".discard-answer"
            ,immediatelyShowMarkdownHelp:true
            });


            }
            });






            Ranch Mayonaise is a new contributor. Be nice, and check out our Code of Conduct.










             

            draft saved


            draft discarded


















            StackExchange.ready(
            function () {
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f208277%2fweb-scraping-google-trends-in-python%23new-answer', 'question_page');
            }
            );

            Post as a guest















            Required, but never shown

























            1 Answer
            1






            active

            oldest

            votes








            1 Answer
            1






            active

            oldest

            votes









            active

            oldest

            votes






            active

            oldest

            votes








            up vote
            2
            down vote



            accepted










            The idea behind your script is pretty cool and there's a couple ways to polish it up! While it certainly wouldn't cause a headache, I think that creating an entire class just to measure how long the request is going to take to complete is a little overkill.



            Within the class itself, let's look at 2 things which aren't the best practice.




            • Use of a global variable

            • Use of the try/except block


            Global variables



            There's a brilliant StackOverflow question with really good answers as to why using global variables should be discouraged. While in the scope of your script it is not harmful, imagine you had many functions and methods in your script. Using the global variable could decrease readability and potentially alter the behaviour of your code. Mind you, global variables aren't the same as global constants (also discussed in the StackOverflow question).



            Exception handling



            Using a bare except statement will handle all exceptions, even the ones you might not anticipate. Catch errors explicitly, otherwise you might cause unexpected behaviour in your code, should it become more complex. In this scope, your bare except statement should become more like this:



            try:
            ...
            except requests.exceptions.ConnectionError:
            print("failed to connect")


            This way, you are handling the case where your request actually fails and you let other errors happen in their intended way. Otherwise, a different exception could have been raised and you won't know what caused it.



            Additionally, on this class:




            • Following naming conventions, class names should use CapWords.

            • In Python 3+, classes no longer need to be inherited from object.


            Instead of the class, you could just define a simple function which retrieves the XML document from the RSS feed, declare your timing functionality within it and returns the XML document. Seeing as the URL query takes in what's seemingly an ISO 3166-2 country code as one of the parameters, you could pass the country code into this function and have the ability to fetch search trends for so many different countries! Please note that in the following snippet I am using f-strings which were introduced since Python 3.6.



            def fetch_xml(country_code):
            url = f"https://trendse.google.com/trends/trendingsearches/daily/rss?geo={country_code}"
            start = time.time()
            response = requests.get(url)
            response_time = time.time() - start
            print(f"The request took {response_time}s to complete.")
            return response.content`


            Parser



            Looking at the docstring of your parser, one might argue it's a little misleading. You're mentioning "storable text" while you're only printing your output! To come closer to the intention of the docstring, it would be better to return the parsed data and decide what to do with it later. I think in this case, a dictionary would a very fitting data structure. Dictionaries store key-value pairs, so lets use the trend title as the key and the traffic approximation as the value.



            Using re.sub() will allow you to remove different characters from a string in a single line, rather than replacing each character separately! This piece of code here:



            for x in traffic_unformatted: 
            x = x.text
            x = x.replace("+","")
            x = x.replace(",", "")


            Can now become:



                for x in traffic_unformatted:
            x = re.sub("[+,]", "", x.text)


            Zip the two lists of titles and traffic approximations together and iterate over them to create the dictionary. I am going to use a dictionary comprehension in the final script, however it works identically to the following.



            titles = soup.find_all("title")
            approximate_traffic = soup.find_all("ht:approx_traffic")
            trends = {}
            for title, traffic in zip(titles[1:], approximate_traffic):
            trends[title.text] = re.sub("[+,]", "", traffic.text)
            return trends


            Please pay attention to the titles function having its first element sliced off - this is because BeautifulSoup picked up the title of the entire XML document together with titles of the trends.



            Finally, I am using an if __name__ == "__main__" guard to separate functions from the block of code where I am calling the functions from.



            Entire script:



            import time
            import re
            import requests
            from bs4 import BeautifulSoup


            def fetch_xml(country_code):
            url = f"https://trends.google.com/trends/trendingsearches/daily/rss?geo={country_code}"
            start = time.time()
            response = requests.get(url)
            response_time = time.time() - start
            print(f"The request took {response_time}s to complete.")
            return response.content

            def trends_retriever(country_code):
            """Parses the Google Trends RSS feed using BeautifulSoup.

            Returns:
            dict: Trend title for key, trend approximate traffic for value.
            """
            xml_document = fetch_xml(country_code)
            soup = BeautifulSoup(xml_document, "lxml")
            titles = soup.find_all("title")
            approximate_traffic = soup.find_all("ht:approx_traffic")
            return {title.text: re.sub("[+,]", "", traffic.text)
            for title, traffic in zip(titles[1:], approximate_traffic)}


            if __name__ == '__main__':
            trends = trends_retriever("US")
            print(trends)


            Output:



            The request took 0.2512788772583008s to complete.
            {'Tiger vs Phil': '2000000', 'McKenzie Milton': '200000', 'Costco Black Friday': '200000', 'Nebraska Football': '200000', 'Patagonia': '100000', 'Michael Kors': '100000', 'Jcrew': '100000', 'finish line': '100000', 'WVU football': '100000', 'Texas football': '100000', 'Shoe Carnival': '100000', 'J Crew': '100000', 'Llbean': '100000', 'Cards Against Humanity': '100000', 'Bleacher Report': '100000', 'Steph Curry': '50000', 'Apple Cup': '50000', 'Bob McNair': '50000', 'Virginia Tech football': '50000', 'Glossier': '50000'}





            share|improve this answer

























              up vote
              2
              down vote



              accepted










              The idea behind your script is pretty cool and there's a couple ways to polish it up! While it certainly wouldn't cause a headache, I think that creating an entire class just to measure how long the request is going to take to complete is a little overkill.



              Within the class itself, let's look at 2 things which aren't the best practice.




              • Use of a global variable

              • Use of the try/except block


              Global variables



              There's a brilliant StackOverflow question with really good answers as to why using global variables should be discouraged. While in the scope of your script it is not harmful, imagine you had many functions and methods in your script. Using the global variable could decrease readability and potentially alter the behaviour of your code. Mind you, global variables aren't the same as global constants (also discussed in the StackOverflow question).



              Exception handling



              Using a bare except statement will handle all exceptions, even the ones you might not anticipate. Catch errors explicitly, otherwise you might cause unexpected behaviour in your code, should it become more complex. In this scope, your bare except statement should become more like this:



              try:
              ...
              except requests.exceptions.ConnectionError:
              print("failed to connect")


              This way, you are handling the case where your request actually fails and you let other errors happen in their intended way. Otherwise, a different exception could have been raised and you won't know what caused it.



              Additionally, on this class:




              • Following naming conventions, class names should use CapWords.

              • In Python 3+, classes no longer need to be inherited from object.


              Instead of the class, you could just define a simple function which retrieves the XML document from the RSS feed, declare your timing functionality within it and returns the XML document. Seeing as the URL query takes in what's seemingly an ISO 3166-2 country code as one of the parameters, you could pass the country code into this function and have the ability to fetch search trends for so many different countries! Please note that in the following snippet I am using f-strings which were introduced since Python 3.6.



              def fetch_xml(country_code):
              url = f"https://trendse.google.com/trends/trendingsearches/daily/rss?geo={country_code}"
              start = time.time()
              response = requests.get(url)
              response_time = time.time() - start
              print(f"The request took {response_time}s to complete.")
              return response.content`


              Parser



              Looking at the docstring of your parser, one might argue it's a little misleading. You're mentioning "storable text" while you're only printing your output! To come closer to the intention of the docstring, it would be better to return the parsed data and decide what to do with it later. I think in this case, a dictionary would a very fitting data structure. Dictionaries store key-value pairs, so lets use the trend title as the key and the traffic approximation as the value.



              Using re.sub() will allow you to remove different characters from a string in a single line, rather than replacing each character separately! This piece of code here:



              for x in traffic_unformatted: 
              x = x.text
              x = x.replace("+","")
              x = x.replace(",", "")


              Can now become:



                  for x in traffic_unformatted:
              x = re.sub("[+,]", "", x.text)


              Zip the two lists of titles and traffic approximations together and iterate over them to create the dictionary. I am going to use a dictionary comprehension in the final script, however it works identically to the following.



              titles = soup.find_all("title")
              approximate_traffic = soup.find_all("ht:approx_traffic")
              trends = {}
              for title, traffic in zip(titles[1:], approximate_traffic):
              trends[title.text] = re.sub("[+,]", "", traffic.text)
              return trends


              Please pay attention to the titles function having its first element sliced off - this is because BeautifulSoup picked up the title of the entire XML document together with titles of the trends.



              Finally, I am using an if __name__ == "__main__" guard to separate functions from the block of code where I am calling the functions from.



              Entire script:



              import time
              import re
              import requests
              from bs4 import BeautifulSoup


              def fetch_xml(country_code):
              url = f"https://trends.google.com/trends/trendingsearches/daily/rss?geo={country_code}"
              start = time.time()
              response = requests.get(url)
              response_time = time.time() - start
              print(f"The request took {response_time}s to complete.")
              return response.content

              def trends_retriever(country_code):
              """Parses the Google Trends RSS feed using BeautifulSoup.

              Returns:
              dict: Trend title for key, trend approximate traffic for value.
              """
              xml_document = fetch_xml(country_code)
              soup = BeautifulSoup(xml_document, "lxml")
              titles = soup.find_all("title")
              approximate_traffic = soup.find_all("ht:approx_traffic")
              return {title.text: re.sub("[+,]", "", traffic.text)
              for title, traffic in zip(titles[1:], approximate_traffic)}


              if __name__ == '__main__':
              trends = trends_retriever("US")
              print(trends)


              Output:



              The request took 0.2512788772583008s to complete.
              {'Tiger vs Phil': '2000000', 'McKenzie Milton': '200000', 'Costco Black Friday': '200000', 'Nebraska Football': '200000', 'Patagonia': '100000', 'Michael Kors': '100000', 'Jcrew': '100000', 'finish line': '100000', 'WVU football': '100000', 'Texas football': '100000', 'Shoe Carnival': '100000', 'J Crew': '100000', 'Llbean': '100000', 'Cards Against Humanity': '100000', 'Bleacher Report': '100000', 'Steph Curry': '50000', 'Apple Cup': '50000', 'Bob McNair': '50000', 'Virginia Tech football': '50000', 'Glossier': '50000'}





              share|improve this answer























                up vote
                2
                down vote



                accepted







                up vote
                2
                down vote



                accepted






                The idea behind your script is pretty cool and there's a couple ways to polish it up! While it certainly wouldn't cause a headache, I think that creating an entire class just to measure how long the request is going to take to complete is a little overkill.



                Within the class itself, let's look at 2 things which aren't the best practice.




                • Use of a global variable

                • Use of the try/except block


                Global variables



                There's a brilliant StackOverflow question with really good answers as to why using global variables should be discouraged. While in the scope of your script it is not harmful, imagine you had many functions and methods in your script. Using the global variable could decrease readability and potentially alter the behaviour of your code. Mind you, global variables aren't the same as global constants (also discussed in the StackOverflow question).



                Exception handling



                Using a bare except statement will handle all exceptions, even the ones you might not anticipate. Catch errors explicitly, otherwise you might cause unexpected behaviour in your code, should it become more complex. In this scope, your bare except statement should become more like this:



                try:
                ...
                except requests.exceptions.ConnectionError:
                print("failed to connect")


                This way, you are handling the case where your request actually fails and you let other errors happen in their intended way. Otherwise, a different exception could have been raised and you won't know what caused it.



                Additionally, on this class:




                • Following naming conventions, class names should use CapWords.

                • In Python 3+, classes no longer need to be inherited from object.


                Instead of the class, you could just define a simple function which retrieves the XML document from the RSS feed, declare your timing functionality within it and returns the XML document. Seeing as the URL query takes in what's seemingly an ISO 3166-2 country code as one of the parameters, you could pass the country code into this function and have the ability to fetch search trends for so many different countries! Please note that in the following snippet I am using f-strings which were introduced since Python 3.6.



                def fetch_xml(country_code):
                url = f"https://trendse.google.com/trends/trendingsearches/daily/rss?geo={country_code}"
                start = time.time()
                response = requests.get(url)
                response_time = time.time() - start
                print(f"The request took {response_time}s to complete.")
                return response.content`


                Parser



                Looking at the docstring of your parser, one might argue it's a little misleading. You're mentioning "storable text" while you're only printing your output! To come closer to the intention of the docstring, it would be better to return the parsed data and decide what to do with it later. I think in this case, a dictionary would a very fitting data structure. Dictionaries store key-value pairs, so lets use the trend title as the key and the traffic approximation as the value.



                Using re.sub() will allow you to remove different characters from a string in a single line, rather than replacing each character separately! This piece of code here:



                for x in traffic_unformatted: 
                x = x.text
                x = x.replace("+","")
                x = x.replace(",", "")


                Can now become:



                    for x in traffic_unformatted:
                x = re.sub("[+,]", "", x.text)


                Zip the two lists of titles and traffic approximations together and iterate over them to create the dictionary. I am going to use a dictionary comprehension in the final script, however it works identically to the following.



                titles = soup.find_all("title")
                approximate_traffic = soup.find_all("ht:approx_traffic")
                trends = {}
                for title, traffic in zip(titles[1:], approximate_traffic):
                trends[title.text] = re.sub("[+,]", "", traffic.text)
                return trends


                Please pay attention to the titles function having its first element sliced off - this is because BeautifulSoup picked up the title of the entire XML document together with titles of the trends.



                Finally, I am using an if __name__ == "__main__" guard to separate functions from the block of code where I am calling the functions from.



                Entire script:



                import time
                import re
                import requests
                from bs4 import BeautifulSoup


                def fetch_xml(country_code):
                url = f"https://trends.google.com/trends/trendingsearches/daily/rss?geo={country_code}"
                start = time.time()
                response = requests.get(url)
                response_time = time.time() - start
                print(f"The request took {response_time}s to complete.")
                return response.content

                def trends_retriever(country_code):
                """Parses the Google Trends RSS feed using BeautifulSoup.

                Returns:
                dict: Trend title for key, trend approximate traffic for value.
                """
                xml_document = fetch_xml(country_code)
                soup = BeautifulSoup(xml_document, "lxml")
                titles = soup.find_all("title")
                approximate_traffic = soup.find_all("ht:approx_traffic")
                return {title.text: re.sub("[+,]", "", traffic.text)
                for title, traffic in zip(titles[1:], approximate_traffic)}


                if __name__ == '__main__':
                trends = trends_retriever("US")
                print(trends)


                Output:



                The request took 0.2512788772583008s to complete.
                {'Tiger vs Phil': '2000000', 'McKenzie Milton': '200000', 'Costco Black Friday': '200000', 'Nebraska Football': '200000', 'Patagonia': '100000', 'Michael Kors': '100000', 'Jcrew': '100000', 'finish line': '100000', 'WVU football': '100000', 'Texas football': '100000', 'Shoe Carnival': '100000', 'J Crew': '100000', 'Llbean': '100000', 'Cards Against Humanity': '100000', 'Bleacher Report': '100000', 'Steph Curry': '50000', 'Apple Cup': '50000', 'Bob McNair': '50000', 'Virginia Tech football': '50000', 'Glossier': '50000'}





                share|improve this answer












                The idea behind your script is pretty cool and there's a couple ways to polish it up! While it certainly wouldn't cause a headache, I think that creating an entire class just to measure how long the request is going to take to complete is a little overkill.



                Within the class itself, let's look at 2 things which aren't the best practice.




                • Use of a global variable

                • Use of the try/except block


                Global variables



                There's a brilliant StackOverflow question with really good answers as to why using global variables should be discouraged. While in the scope of your script it is not harmful, imagine you had many functions and methods in your script. Using the global variable could decrease readability and potentially alter the behaviour of your code. Mind you, global variables aren't the same as global constants (also discussed in the StackOverflow question).



                Exception handling



                Using a bare except statement will handle all exceptions, even the ones you might not anticipate. Catch errors explicitly, otherwise you might cause unexpected behaviour in your code, should it become more complex. In this scope, your bare except statement should become more like this:



                try:
                ...
                except requests.exceptions.ConnectionError:
                print("failed to connect")


                This way, you are handling the case where your request actually fails and you let other errors happen in their intended way. Otherwise, a different exception could have been raised and you won't know what caused it.



                Additionally, on this class:




                • Following naming conventions, class names should use CapWords.

                • In Python 3+, classes no longer need to be inherited from object.


                Instead of the class, you could just define a simple function which retrieves the XML document from the RSS feed, declare your timing functionality within it and returns the XML document. Seeing as the URL query takes in what's seemingly an ISO 3166-2 country code as one of the parameters, you could pass the country code into this function and have the ability to fetch search trends for so many different countries! Please note that in the following snippet I am using f-strings which were introduced since Python 3.6.



                def fetch_xml(country_code):
                url = f"https://trendse.google.com/trends/trendingsearches/daily/rss?geo={country_code}"
                start = time.time()
                response = requests.get(url)
                response_time = time.time() - start
                print(f"The request took {response_time}s to complete.")
                return response.content`


                Parser



                Looking at the docstring of your parser, one might argue it's a little misleading. You're mentioning "storable text" while you're only printing your output! To come closer to the intention of the docstring, it would be better to return the parsed data and decide what to do with it later. I think in this case, a dictionary would a very fitting data structure. Dictionaries store key-value pairs, so lets use the trend title as the key and the traffic approximation as the value.



                Using re.sub() will allow you to remove different characters from a string in a single line, rather than replacing each character separately! This piece of code here:



                for x in traffic_unformatted: 
                x = x.text
                x = x.replace("+","")
                x = x.replace(",", "")


                Can now become:



                    for x in traffic_unformatted:
                x = re.sub("[+,]", "", x.text)


                Zip the two lists of titles and traffic approximations together and iterate over them to create the dictionary. I am going to use a dictionary comprehension in the final script, however it works identically to the following.



                titles = soup.find_all("title")
                approximate_traffic = soup.find_all("ht:approx_traffic")
                trends = {}
                for title, traffic in zip(titles[1:], approximate_traffic):
                trends[title.text] = re.sub("[+,]", "", traffic.text)
                return trends


                Please pay attention to the titles function having its first element sliced off - this is because BeautifulSoup picked up the title of the entire XML document together with titles of the trends.



                Finally, I am using an if __name__ == "__main__" guard to separate functions from the block of code where I am calling the functions from.



                Entire script:



                import time
                import re
                import requests
                from bs4 import BeautifulSoup


                def fetch_xml(country_code):
                url = f"https://trends.google.com/trends/trendingsearches/daily/rss?geo={country_code}"
                start = time.time()
                response = requests.get(url)
                response_time = time.time() - start
                print(f"The request took {response_time}s to complete.")
                return response.content

                def trends_retriever(country_code):
                """Parses the Google Trends RSS feed using BeautifulSoup.

                Returns:
                dict: Trend title for key, trend approximate traffic for value.
                """
                xml_document = fetch_xml(country_code)
                soup = BeautifulSoup(xml_document, "lxml")
                titles = soup.find_all("title")
                approximate_traffic = soup.find_all("ht:approx_traffic")
                return {title.text: re.sub("[+,]", "", traffic.text)
                for title, traffic in zip(titles[1:], approximate_traffic)}


                if __name__ == '__main__':
                trends = trends_retriever("US")
                print(trends)


                Output:



                The request took 0.2512788772583008s to complete.
                {'Tiger vs Phil': '2000000', 'McKenzie Milton': '200000', 'Costco Black Friday': '200000', 'Nebraska Football': '200000', 'Patagonia': '100000', 'Michael Kors': '100000', 'Jcrew': '100000', 'finish line': '100000', 'WVU football': '100000', 'Texas football': '100000', 'Shoe Carnival': '100000', 'J Crew': '100000', 'Llbean': '100000', 'Cards Against Humanity': '100000', 'Bleacher Report': '100000', 'Steph Curry': '50000', 'Apple Cup': '50000', 'Bob McNair': '50000', 'Virginia Tech football': '50000', 'Glossier': '50000'}






                share|improve this answer












                share|improve this answer



                share|improve this answer










                answered 2 days ago









                Lukasz Salitra

                783216




                783216






















                    Ranch Mayonaise is a new contributor. Be nice, and check out our Code of Conduct.










                     

                    draft saved


                    draft discarded


















                    Ranch Mayonaise is a new contributor. Be nice, and check out our Code of Conduct.













                    Ranch Mayonaise is a new contributor. Be nice, and check out our Code of Conduct.












                    Ranch Mayonaise is a new contributor. Be nice, and check out our Code of Conduct.















                     


                    draft saved


                    draft discarded














                    StackExchange.ready(
                    function () {
                    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f208277%2fweb-scraping-google-trends-in-python%23new-answer', 'question_page');
                    }
                    );

                    Post as a guest















                    Required, but never shown





















































                    Required, but never shown














                    Required, but never shown












                    Required, but never shown







                    Required, but never shown

































                    Required, but never shown














                    Required, but never shown












                    Required, but never shown







                    Required, but never shown







                    Popular posts from this blog

                    Mont Emei

                    Province de Neuquén

                    Journaliste