Web scraping google trends in python

up vote
4
down vote

favorite

Question

Recently I have tried to create a web scraping program to get data from google trends. It uses an RSS feed to do so. My question is as follows:

How can I improve my code such that it is more concise, efficient, and pleasing to a programmers eye? I would like to emphasise that I'm most concerned about if I've misused any functions, methods, or syntax in my code and if it could be 'cleaner'

Context

I'm not experienced at all. I'm doing this to learn.

Security isn't an issue since the program is running locally.

The code

"""Uses webscraping to retrieve html information from Google Trends for 

parsing""" 



# Imports 



import time 

import ssl

from requests import get 

from bs4 import BeautifulSoup



# Url for request 



url = "https://trends.google.com/trends/trendingsearches/daily/rss?geo=US"



# Used to create an unverified certificate. 



ssl._create_default_https_context = ssl._create_unverified_context





class connection(object):



    def __init__(self, f):

        self.f = f



    def __call__(self):

        """Calculates the time taken to complete the request without any errors"""

        try: 

            # Times the amount of time the function takes to complete

            global html 

            start = time.time()

            html = get(url) 

            self.f()

            end = time.time() 

            time_taken = end - start 

            result = print("the function {} took {} to connect".format(self.f.__name__, time_taken))

            return result 

        except:

            print("function {} failed to connect".format(self.f.__name__))



@connection

def html_parser():

    """Parses the html into storable text""" 

    html_unparsed = html.text

    soup = BeautifulSoup(html_unparsed, "html.parser")

    titles_unformatted = soup.find_all("title")

    traffic_unformatted = soup.find_all("ht:approx_traffic")

    # Indexes through list to make data readable 

    titles = [x.text for x in titles_unformatted] 

    traffic =  

    for x in traffic_unformatted: 

        x = x.text 

        x = x.replace("+","")

        x = x.replace(",", "")

        traffic.append(x)

    print(traffic, titles)

html_parser()

Output

['100000', '2000000', '2000000', '1000000', '1000000', '1000000', 

'1000000', '500000', '500000', '500000', '200000', '200000', '200000', 

'200000', '200000', '200000', '200000', '200000', '200000', '200000'] 

['Daily Search Trends', '23 and Me', 'NFL scores', 'Best Buy', 'Walmart 

Supercenter', 'NFL', 'Saints', 'GameStop', 'JCPenney', 'Lion King 2019', 

'Starbucks', 'Dollar General', 'Amazon Black Friday', 'Mike Posner', 

'NFL games On Thanksgiving', 'McDonalds', 'Bath And Body Works Black 

Friday', 'Old Navy Black Friday 2018', 'Kroger', 'NFL standings', 

'Safeway']

the function html_parser took 1.0186748504638672 to connect

Concerns

Makes programmers cringe.

As someone relatively new to python and programming in general my worst fear is that this code gives someone else a headache or a laugh to look at. At the end of the day I just want to improve, so to reiterate: how can I make this code look better? Have I misused any functions?

asked 2 days ago

Ranch Mayonaise

233

New contributor

add a comment |

up vote
4
down vote

favorite

Question

Recently I have tried to create a web scraping program to get data from google trends. It uses an RSS feed to do so. My question is as follows:

Context

I'm not experienced at all. I'm doing this to learn.

Security isn't an issue since the program is running locally.

The code

"""Uses webscraping to retrieve html information from Google Trends for 

parsing""" 



# Imports 



import time 

import ssl

from requests import get 

from bs4 import BeautifulSoup



# Url for request 



url = "https://trends.google.com/trends/trendingsearches/daily/rss?geo=US"



# Used to create an unverified certificate. 



ssl._create_default_https_context = ssl._create_unverified_context





class connection(object):



    def __init__(self, f):

        self.f = f



    def __call__(self):

        """Calculates the time taken to complete the request without any errors"""

        try: 

            # Times the amount of time the function takes to complete

            global html 

            start = time.time()

            html = get(url) 

            self.f()

            end = time.time() 

            time_taken = end - start 

            result = print("the function {} took {} to connect".format(self.f.__name__, time_taken))

            return result 

        except:

            print("function {} failed to connect".format(self.f.__name__))



@connection

def html_parser():

    """Parses the html into storable text""" 

    html_unparsed = html.text

    soup = BeautifulSoup(html_unparsed, "html.parser")

    titles_unformatted = soup.find_all("title")

    traffic_unformatted = soup.find_all("ht:approx_traffic")

    # Indexes through list to make data readable 

    titles = [x.text for x in titles_unformatted] 

    traffic =  

    for x in traffic_unformatted: 

        x = x.text 

        x = x.replace("+","")

        x = x.replace(",", "")

        traffic.append(x)

    print(traffic, titles)

html_parser()

Output

['100000', '2000000', '2000000', '1000000', '1000000', '1000000', 

'1000000', '500000', '500000', '500000', '200000', '200000', '200000', 

'200000', '200000', '200000', '200000', '200000', '200000', '200000'] 

['Daily Search Trends', '23 and Me', 'NFL scores', 'Best Buy', 'Walmart 

Supercenter', 'NFL', 'Saints', 'GameStop', 'JCPenney', 'Lion King 2019', 

'Starbucks', 'Dollar General', 'Amazon Black Friday', 'Mike Posner', 

'NFL games On Thanksgiving', 'McDonalds', 'Bath And Body Works Black 

Friday', 'Old Navy Black Friday 2018', 'Kroger', 'NFL standings', 

'Safeway']

the function html_parser took 1.0186748504638672 to connect

Concerns

Makes programmers cringe.

asked 2 days ago

Ranch Mayonaise

233

New contributor

add a comment |

up vote
4
down vote

favorite

Question

Recently I have tried to create a web scraping program to get data from google trends. It uses an RSS feed to do so. My question is as follows:

Context

I'm not experienced at all. I'm doing this to learn.

Security isn't an issue since the program is running locally.

The code

"""Uses webscraping to retrieve html information from Google Trends for 

parsing""" 



# Imports 



import time 

import ssl

from requests import get 

from bs4 import BeautifulSoup



# Url for request 



url = "https://trends.google.com/trends/trendingsearches/daily/rss?geo=US"



# Used to create an unverified certificate. 



ssl._create_default_https_context = ssl._create_unverified_context





class connection(object):



    def __init__(self, f):

        self.f = f



    def __call__(self):

        """Calculates the time taken to complete the request without any errors"""

        try: 

            # Times the amount of time the function takes to complete

            global html 

            start = time.time()

            html = get(url) 

            self.f()

            end = time.time() 

            time_taken = end - start 

            result = print("the function {} took {} to connect".format(self.f.__name__, time_taken))

            return result 

        except:

            print("function {} failed to connect".format(self.f.__name__))



@connection

def html_parser():

    """Parses the html into storable text""" 

    html_unparsed = html.text

    soup = BeautifulSoup(html_unparsed, "html.parser")

    titles_unformatted = soup.find_all("title")

    traffic_unformatted = soup.find_all("ht:approx_traffic")

    # Indexes through list to make data readable 

    titles = [x.text for x in titles_unformatted] 

    traffic =  

    for x in traffic_unformatted: 

        x = x.text 

        x = x.replace("+","")

        x = x.replace(",", "")

        traffic.append(x)

    print(traffic, titles)

html_parser()

Output

['100000', '2000000', '2000000', '1000000', '1000000', '1000000', 

'1000000', '500000', '500000', '500000', '200000', '200000', '200000', 

'200000', '200000', '200000', '200000', '200000', '200000', '200000'] 

['Daily Search Trends', '23 and Me', 'NFL scores', 'Best Buy', 'Walmart 

Supercenter', 'NFL', 'Saints', 'GameStop', 'JCPenney', 'Lion King 2019', 

'Starbucks', 'Dollar General', 'Amazon Black Friday', 'Mike Posner', 

'NFL games On Thanksgiving', 'McDonalds', 'Bath And Body Works Black 

Friday', 'Old Navy Black Friday 2018', 'Kroger', 'NFL standings', 

'Safeway']

the function html_parser took 1.0186748504638672 to connect

Concerns

Makes programmers cringe.

asked 2 days ago

Ranch Mayonaise

233

New contributor

Question

Recently I have tried to create a web scraping program to get data from google trends. It uses an RSS feed to do so. My question is as follows:

Context

I'm not experienced at all. I'm doing this to learn.

Security isn't an issue since the program is running locally.

The code

"""Uses webscraping to retrieve html information from Google Trends for 

parsing""" 



# Imports 



import time 

import ssl

from requests import get 

from bs4 import BeautifulSoup



# Url for request 



url = "https://trends.google.com/trends/trendingsearches/daily/rss?geo=US"



# Used to create an unverified certificate. 



ssl._create_default_https_context = ssl._create_unverified_context





class connection(object):



    def __init__(self, f):

        self.f = f



    def __call__(self):

        """Calculates the time taken to complete the request without any errors"""

        try: 

            # Times the amount of time the function takes to complete

            global html 

            start = time.time()

            html = get(url) 

            self.f()

            end = time.time() 

            time_taken = end - start 

            result = print("the function {} took {} to connect".format(self.f.__name__, time_taken))

            return result 

        except:

            print("function {} failed to connect".format(self.f.__name__))



@connection

def html_parser():

    """Parses the html into storable text""" 

    html_unparsed = html.text

    soup = BeautifulSoup(html_unparsed, "html.parser")

    titles_unformatted = soup.find_all("title")

    traffic_unformatted = soup.find_all("ht:approx_traffic")

    # Indexes through list to make data readable 

    titles = [x.text for x in titles_unformatted] 

    traffic =  

    for x in traffic_unformatted: 

        x = x.text 

        x = x.replace("+","")

        x = x.replace(",", "")

        traffic.append(x)

    print(traffic, titles)

html_parser()

Output

['100000', '2000000', '2000000', '1000000', '1000000', '1000000', 

'1000000', '500000', '500000', '500000', '200000', '200000', '200000', 

'200000', '200000', '200000', '200000', '200000', '200000', '200000'] 

['Daily Search Trends', '23 and Me', 'NFL scores', 'Best Buy', 'Walmart 

Supercenter', 'NFL', 'Saints', 'GameStop', 'JCPenney', 'Lion King 2019', 

'Starbucks', 'Dollar General', 'Amazon Black Friday', 'Mike Posner', 

'NFL games On Thanksgiving', 'McDonalds', 'Bath And Body Works Black 

Friday', 'Old Navy Black Friday 2018', 'Kroger', 'NFL standings', 

'Safeway']

the function html_parser took 1.0186748504638672 to connect

Concerns

Makes programmers cringe.

python object-oriented python-3.x web-scraping

asked 2 days ago

Ranch Mayonaise

233

New contributor

asked 2 days ago

Ranch Mayonaise

233

New contributor

asked 2 days ago

Ranch Mayonaise

233

New contributor

asked 2 days ago

Ranch Mayonaise

233

asked 2 days ago

Ranch Mayonaise

233

New contributor

Ranch Mayonaise is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.

add a comment |

1 Answer
1

active

oldest

votes

up vote
2
down vote

accepted

The idea behind your script is pretty cool and there's a couple ways to polish it up! While it certainly wouldn't cause a headache, I think that creating an entire class just to measure how long the request is going to take to complete is a little overkill.

Within the class itself, let's look at 2 things which aren't the best practice.

Use of a global variable

Use of the try/except block

Global variables

There's a brilliant StackOverflow question with really good answers as to why using global variables should be discouraged. While in the scope of your script it is not harmful, imagine you had many functions and methods in your script. Using the global variable could decrease readability and potentially alter the behaviour of your code. Mind you, global variables aren't the same as global constants (also discussed in the StackOverflow question).

Exception handling

Using a bare except statement will handle all exceptions, even the ones you might not anticipate. Catch errors explicitly, otherwise you might cause unexpected behaviour in your code, should it become more complex. In this scope, your bare except statement should become more like this:

try:

    ...

except requests.exceptions.ConnectionError:

    print("failed to connect")

This way, you are handling the case where your request actually fails and you let other errors happen in their intended way. Otherwise, a different exception could have been raised and you won't know what caused it.

Additionally, on this class:

Following naming conventions, class names should use CapWords.

In Python 3+, classes no longer need to be inherited from object.

Instead of the class, you could just define a simple function which retrieves the XML document from the RSS feed, declare your timing functionality within it and returns the XML document. Seeing as the URL query takes in what's seemingly an ISO 3166-2 country code as one of the parameters, you could pass the country code into this function and have the ability to fetch search trends for so many different countries! Please note that in the following snippet I am using f-strings which were introduced since Python 3.6.

def fetch_xml(country_code):

    url = f"https://trendse.google.com/trends/trendingsearches/daily/rss?geo={country_code}"

    start = time.time()

    response = requests.get(url)

    response_time = time.time() - start

    print(f"The request took {response_time}s to complete.")

    return response.content`

Parser

Looking at the docstring of your parser, one might argue it's a little misleading. You're mentioning "storable text" while you're only printing your output! To come closer to the intention of the docstring, it would be better to return the parsed data and decide what to do with it later. I think in this case, a dictionary would a very fitting data structure. Dictionaries store key-value pairs, so lets use the trend title as the key and the traffic approximation as the value.

Using re.sub() will allow you to remove different characters from a string in a single line, rather than replacing each character separately! This piece of code here:

for x in traffic_unformatted: 

    x = x.text 

    x = x.replace("+","")

    x = x.replace(",", "")

Can now become:

    for x in traffic_unformatted:

        x = re.sub("[+,]", "", x.text)

Zip the two lists of titles and traffic approximations together and iterate over them to create the dictionary. I am going to use a dictionary comprehension in the final script, however it works identically to the following.

titles = soup.find_all("title")

approximate_traffic = soup.find_all("ht:approx_traffic")

trends = {}

for title, traffic in zip(titles[1:], approximate_traffic):

    trends[title.text] = re.sub("[+,]", "", traffic.text)

return trends

Please pay attention to the titles function having its first element sliced off - this is because BeautifulSoup picked up the title of the entire XML document together with titles of the trends.

Finally, I am using an if __name__ == "__main__" guard to separate functions from the block of code where I am calling the functions from.

Entire script:

import time

import re

import requests

from bs4 import BeautifulSoup





def fetch_xml(country_code):

    url = f"https://trends.google.com/trends/trendingsearches/daily/rss?geo={country_code}"

    start = time.time()

    response = requests.get(url)

    response_time = time.time() - start

    print(f"The request took {response_time}s to complete.")

    return response.content



def trends_retriever(country_code):

    """Parses the Google Trends RSS feed using BeautifulSoup.



    Returns:

        dict: Trend title for key, trend approximate traffic for value.

    """

    xml_document = fetch_xml(country_code)

    soup = BeautifulSoup(xml_document, "lxml")

    titles = soup.find_all("title")

    approximate_traffic = soup.find_all("ht:approx_traffic")

    return {title.text: re.sub("[+,]", "", traffic.text)

            for title, traffic in zip(titles[1:], approximate_traffic)}





if __name__ == '__main__':

    trends = trends_retriever("US")

    print(trends)

Output:

The request took 0.2512788772583008s to complete.

{'Tiger vs Phil': '2000000', 'McKenzie Milton': '200000', 'Costco Black Friday': '200000', 'Nebraska Football': '200000', 'Patagonia': '100000', 'Michael Kors': '100000', 'Jcrew': '100000', 'finish line': '100000', 'WVU football': '100000', 'Texas football': '100000', 'Shoe Carnival': '100000', 'J Crew': '100000', 'Llbean': '100000', 'Cards Against Humanity': '100000', 'Bleacher Report': '100000', 'Steph Curry': '50000', 'Apple Cup': '50000', 'Bob McNair': '50000', 'Virginia Tech football': '50000', 'Glossier': '50000'}

answered 2 days ago

Lukasz Salitra

783216

add a comment |

Your Answer

StackExchange.ifUsing("editor", function () {
return StackExchange.using("mathjaxEditing", function () {
StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["\$", "\$"]]);
});
});
}, "mathjax-editing");

StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "196"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

Ranch Mayonaise is a new contributor. Be nice, and check out our Code of Conduct.

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f208277%2fweb-scraping-google-trends-in-python%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

1 Answer
1

active

oldest

votes

1 Answer
1

active

oldest

votes

up vote
2
down vote

accepted

Within the class itself, let's look at 2 things which aren't the best practice.

Use of a global variable

Use of the try/except block

Global variables

Exception handling

try:

    ...

except requests.exceptions.ConnectionError:

    print("failed to connect")

Additionally, on this class:

Following naming conventions, class names should use CapWords.

In Python 3+, classes no longer need to be inherited from object.

def fetch_xml(country_code):

    url = f"https://trendse.google.com/trends/trendingsearches/daily/rss?geo={country_code}"

    start = time.time()

    response = requests.get(url)

    response_time = time.time() - start

    print(f"The request took {response_time}s to complete.")

    return response.content`

Parser

Using re.sub() will allow you to remove different characters from a string in a single line, rather than replacing each character separately! This piece of code here:

for x in traffic_unformatted: 

    x = x.text 

    x = x.replace("+","")

    x = x.replace(",", "")

Can now become:

    for x in traffic_unformatted:

        x = re.sub("[+,]", "", x.text)

titles = soup.find_all("title")

approximate_traffic = soup.find_all("ht:approx_traffic")

trends = {}

for title, traffic in zip(titles[1:], approximate_traffic):

    trends[title.text] = re.sub("[+,]", "", traffic.text)

return trends

Please pay attention to the titles function having its first element sliced off - this is because BeautifulSoup picked up the title of the entire XML document together with titles of the trends.

Finally, I am using an if __name__ == "__main__" guard to separate functions from the block of code where I am calling the functions from.

Entire script:

import time

import re

import requests

from bs4 import BeautifulSoup





def fetch_xml(country_code):

    url = f"https://trends.google.com/trends/trendingsearches/daily/rss?geo={country_code}"

    start = time.time()

    response = requests.get(url)

    response_time = time.time() - start

    print(f"The request took {response_time}s to complete.")

    return response.content



def trends_retriever(country_code):

    """Parses the Google Trends RSS feed using BeautifulSoup.



    Returns:

        dict: Trend title for key, trend approximate traffic for value.

    """

    xml_document = fetch_xml(country_code)

    soup = BeautifulSoup(xml_document, "lxml")

    titles = soup.find_all("title")

    approximate_traffic = soup.find_all("ht:approx_traffic")

    return {title.text: re.sub("[+,]", "", traffic.text)

            for title, traffic in zip(titles[1:], approximate_traffic)}





if __name__ == '__main__':

    trends = trends_retriever("US")

    print(trends)

Output:

The request took 0.2512788772583008s to complete.

{'Tiger vs Phil': '2000000', 'McKenzie Milton': '200000', 'Costco Black Friday': '200000', 'Nebraska Football': '200000', 'Patagonia': '100000', 'Michael Kors': '100000', 'Jcrew': '100000', 'finish line': '100000', 'WVU football': '100000', 'Texas football': '100000', 'Shoe Carnival': '100000', 'J Crew': '100000', 'Llbean': '100000', 'Cards Against Humanity': '100000', 'Bleacher Report': '100000', 'Steph Curry': '50000', 'Apple Cup': '50000', 'Bob McNair': '50000', 'Virginia Tech football': '50000', 'Glossier': '50000'}

answered 2 days ago

Lukasz Salitra

783216

add a comment |

up vote
2
down vote

accepted

Within the class itself, let's look at 2 things which aren't the best practice.

Use of a global variable

Use of the try/except block

Global variables

Exception handling

try:

    ...

except requests.exceptions.ConnectionError:

    print("failed to connect")

Additionally, on this class:

Following naming conventions, class names should use CapWords.

In Python 3+, classes no longer need to be inherited from object.

def fetch_xml(country_code):

    url = f"https://trendse.google.com/trends/trendingsearches/daily/rss?geo={country_code}"

    start = time.time()

    response = requests.get(url)

    response_time = time.time() - start

    print(f"The request took {response_time}s to complete.")

    return response.content`

Parser

Using re.sub() will allow you to remove different characters from a string in a single line, rather than replacing each character separately! This piece of code here:

for x in traffic_unformatted: 

    x = x.text 

    x = x.replace("+","")

    x = x.replace(",", "")

Can now become:

    for x in traffic_unformatted:

        x = re.sub("[+,]", "", x.text)

titles = soup.find_all("title")

approximate_traffic = soup.find_all("ht:approx_traffic")

trends = {}

for title, traffic in zip(titles[1:], approximate_traffic):

    trends[title.text] = re.sub("[+,]", "", traffic.text)

return trends

Please pay attention to the titles function having its first element sliced off - this is because BeautifulSoup picked up the title of the entire XML document together with titles of the trends.

Finally, I am using an if __name__ == "__main__" guard to separate functions from the block of code where I am calling the functions from.

Entire script:

import time

import re

import requests

from bs4 import BeautifulSoup





def fetch_xml(country_code):

    url = f"https://trends.google.com/trends/trendingsearches/daily/rss?geo={country_code}"

    start = time.time()

    response = requests.get(url)

    response_time = time.time() - start

    print(f"The request took {response_time}s to complete.")

    return response.content



def trends_retriever(country_code):

    """Parses the Google Trends RSS feed using BeautifulSoup.



    Returns:

        dict: Trend title for key, trend approximate traffic for value.

    """

    xml_document = fetch_xml(country_code)

    soup = BeautifulSoup(xml_document, "lxml")

    titles = soup.find_all("title")

    approximate_traffic = soup.find_all("ht:approx_traffic")

    return {title.text: re.sub("[+,]", "", traffic.text)

            for title, traffic in zip(titles[1:], approximate_traffic)}





if __name__ == '__main__':

    trends = trends_retriever("US")

    print(trends)

Output:

The request took 0.2512788772583008s to complete.

{'Tiger vs Phil': '2000000', 'McKenzie Milton': '200000', 'Costco Black Friday': '200000', 'Nebraska Football': '200000', 'Patagonia': '100000', 'Michael Kors': '100000', 'Jcrew': '100000', 'finish line': '100000', 'WVU football': '100000', 'Texas football': '100000', 'Shoe Carnival': '100000', 'J Crew': '100000', 'Llbean': '100000', 'Cards Against Humanity': '100000', 'Bleacher Report': '100000', 'Steph Curry': '50000', 'Apple Cup': '50000', 'Bob McNair': '50000', 'Virginia Tech football': '50000', 'Glossier': '50000'}

answered 2 days ago

Lukasz Salitra

783216

add a comment |

up vote
2
down vote

accepted

Within the class itself, let's look at 2 things which aren't the best practice.

Use of a global variable

Use of the try/except block

Global variables

Exception handling

try:

    ...

except requests.exceptions.ConnectionError:

    print("failed to connect")

Additionally, on this class:

Following naming conventions, class names should use CapWords.

In Python 3+, classes no longer need to be inherited from object.

def fetch_xml(country_code):

    url = f"https://trendse.google.com/trends/trendingsearches/daily/rss?geo={country_code}"

    start = time.time()

    response = requests.get(url)

    response_time = time.time() - start

    print(f"The request took {response_time}s to complete.")

    return response.content`

Parser

Using re.sub() will allow you to remove different characters from a string in a single line, rather than replacing each character separately! This piece of code here:

for x in traffic_unformatted: 

    x = x.text 

    x = x.replace("+","")

    x = x.replace(",", "")

Can now become:

    for x in traffic_unformatted:

        x = re.sub("[+,]", "", x.text)

titles = soup.find_all("title")

approximate_traffic = soup.find_all("ht:approx_traffic")

trends = {}

for title, traffic in zip(titles[1:], approximate_traffic):

    trends[title.text] = re.sub("[+,]", "", traffic.text)

return trends

Please pay attention to the titles function having its first element sliced off - this is because BeautifulSoup picked up the title of the entire XML document together with titles of the trends.

Finally, I am using an if __name__ == "__main__" guard to separate functions from the block of code where I am calling the functions from.

Entire script:

import time

import re

import requests

from bs4 import BeautifulSoup





def fetch_xml(country_code):

    url = f"https://trends.google.com/trends/trendingsearches/daily/rss?geo={country_code}"

    start = time.time()

    response = requests.get(url)

    response_time = time.time() - start

    print(f"The request took {response_time}s to complete.")

    return response.content



def trends_retriever(country_code):

    """Parses the Google Trends RSS feed using BeautifulSoup.



    Returns:

        dict: Trend title for key, trend approximate traffic for value.

    """

    xml_document = fetch_xml(country_code)

    soup = BeautifulSoup(xml_document, "lxml")

    titles = soup.find_all("title")

    approximate_traffic = soup.find_all("ht:approx_traffic")

    return {title.text: re.sub("[+,]", "", traffic.text)

            for title, traffic in zip(titles[1:], approximate_traffic)}





if __name__ == '__main__':

    trends = trends_retriever("US")

    print(trends)

Output:

The request took 0.2512788772583008s to complete.

{'Tiger vs Phil': '2000000', 'McKenzie Milton': '200000', 'Costco Black Friday': '200000', 'Nebraska Football': '200000', 'Patagonia': '100000', 'Michael Kors': '100000', 'Jcrew': '100000', 'finish line': '100000', 'WVU football': '100000', 'Texas football': '100000', 'Shoe Carnival': '100000', 'J Crew': '100000', 'Llbean': '100000', 'Cards Against Humanity': '100000', 'Bleacher Report': '100000', 'Steph Curry': '50000', 'Apple Cup': '50000', 'Bob McNair': '50000', 'Virginia Tech football': '50000', 'Glossier': '50000'}

answered 2 days ago

Lukasz Salitra

783216

Within the class itself, let's look at 2 things which aren't the best practice.

Use of a global variable

Use of the try/except block

Global variables

Exception handling

try:

    ...

except requests.exceptions.ConnectionError:

    print("failed to connect")

Additionally, on this class:

Following naming conventions, class names should use CapWords.

In Python 3+, classes no longer need to be inherited from object.

def fetch_xml(country_code):

    url = f"https://trendse.google.com/trends/trendingsearches/daily/rss?geo={country_code}"

    start = time.time()

    response = requests.get(url)

    response_time = time.time() - start

    print(f"The request took {response_time}s to complete.")

    return response.content`

Parser

Using re.sub() will allow you to remove different characters from a string in a single line, rather than replacing each character separately! This piece of code here:

for x in traffic_unformatted: 

    x = x.text 

    x = x.replace("+","")

    x = x.replace(",", "")

Can now become:

    for x in traffic_unformatted:

        x = re.sub("[+,]", "", x.text)

titles = soup.find_all("title")

approximate_traffic = soup.find_all("ht:approx_traffic")

trends = {}

for title, traffic in zip(titles[1:], approximate_traffic):

    trends[title.text] = re.sub("[+,]", "", traffic.text)

return trends

Please pay attention to the titles function having its first element sliced off - this is because BeautifulSoup picked up the title of the entire XML document together with titles of the trends.

Finally, I am using an if __name__ == "__main__" guard to separate functions from the block of code where I am calling the functions from.

Entire script:

import time

import re

import requests

from bs4 import BeautifulSoup





def fetch_xml(country_code):

    url = f"https://trends.google.com/trends/trendingsearches/daily/rss?geo={country_code}"

    start = time.time()

    response = requests.get(url)

    response_time = time.time() - start

    print(f"The request took {response_time}s to complete.")

    return response.content



def trends_retriever(country_code):

    """Parses the Google Trends RSS feed using BeautifulSoup.



    Returns:

        dict: Trend title for key, trend approximate traffic for value.

    """

    xml_document = fetch_xml(country_code)

    soup = BeautifulSoup(xml_document, "lxml")

    titles = soup.find_all("title")

    approximate_traffic = soup.find_all("ht:approx_traffic")

    return {title.text: re.sub("[+,]", "", traffic.text)

            for title, traffic in zip(titles[1:], approximate_traffic)}





if __name__ == '__main__':

    trends = trends_retriever("US")

    print(trends)

Output:

The request took 0.2512788772583008s to complete.

{'Tiger vs Phil': '2000000', 'McKenzie Milton': '200000', 'Costco Black Friday': '200000', 'Nebraska Football': '200000', 'Patagonia': '100000', 'Michael Kors': '100000', 'Jcrew': '100000', 'finish line': '100000', 'WVU football': '100000', 'Texas football': '100000', 'Shoe Carnival': '100000', 'J Crew': '100000', 'Llbean': '100000', 'Cards Against Humanity': '100000', 'Bleacher Report': '100000', 'Steph Curry': '50000', 'Apple Cup': '50000', 'Bob McNair': '50000', 'Virginia Tech football': '50000', 'Glossier': '50000'}

answered 2 days ago

Lukasz Salitra

783216

answered 2 days ago

Lukasz Salitra

783216

answered 2 days ago

Lukasz Salitra

783216

answered 2 days ago

Lukasz Salitra

783216

add a comment |

Ranch Mayonaise is a new contributor. Be nice, and check out our Code of Conduct.

draft saved

draft discarded

Ranch Mayonaise is a new contributor. Be nice, and check out our Code of Conduct.

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

搜尋此網誌

Krdytkyu