Making use of a decorator within a python script
up vote
1
down vote
favorite
I've written a script in python which is able to collect links of posts and then fetch the title of each post by going one layer deep from the target page.
I've applied @get_links decorator which scrapes the titles from its inner page.
However, I wish to get any suggestion to improve my existing approach keeping the decorator within as I'm very new to work with it.
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup
url = "https://stackoverflow.com/questions/tagged/web-scraping"
def get_links(func):
def get_target_link(*args,**kwargs):
titles =
for link in func(*args,**kwargs):
res = requests.get(link)
soup = BeautifulSoup(res.text,"lxml")
title = soup.select_one("h1[itemprop='name'] a").text
titles.append(title)
return titles
return get_target_link
@get_links
def get_info(link):
ilink =
res = requests.get(link)
soup = BeautifulSoup(res.text,"lxml")
for items in soup.select(".summary .question-hyperlink"):
ilink.append(urljoin(url,items.get('href')))
return ilink
if __name__ == '__main__':
print(get_info(url))
python python-3.x web-scraping
add a comment |
up vote
1
down vote
favorite
I've written a script in python which is able to collect links of posts and then fetch the title of each post by going one layer deep from the target page.
I've applied @get_links decorator which scrapes the titles from its inner page.
However, I wish to get any suggestion to improve my existing approach keeping the decorator within as I'm very new to work with it.
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup
url = "https://stackoverflow.com/questions/tagged/web-scraping"
def get_links(func):
def get_target_link(*args,**kwargs):
titles =
for link in func(*args,**kwargs):
res = requests.get(link)
soup = BeautifulSoup(res.text,"lxml")
title = soup.select_one("h1[itemprop='name'] a").text
titles.append(title)
return titles
return get_target_link
@get_links
def get_info(link):
ilink =
res = requests.get(link)
soup = BeautifulSoup(res.text,"lxml")
for items in soup.select(".summary .question-hyperlink"):
ilink.append(urljoin(url,items.get('href')))
return ilink
if __name__ == '__main__':
print(get_info(url))
python python-3.x web-scraping
I don't get howcheck_paginationis supposed to help from the code itself, can you explain its purpose in more details, please?
– Mathias Ettinger
yesterday
Right you were @Mathias Ettinger . The decorator in my earlier script was for nothing. Check the update. Thanks.
– asmitu
yesterday
3
Why do you think that using a decorator is appropriate here?
– 200_success
yesterday
Where did you find that I thought it would be appropriate here @200_success?. I'm trying to figure out how decorator works and that's it.
– asmitu
17 hours ago
1
Your bolded italicized paragraph seemed to insist on keeping the decorator at all costs.
– 200_success
17 hours ago
add a comment |
up vote
1
down vote
favorite
up vote
1
down vote
favorite
I've written a script in python which is able to collect links of posts and then fetch the title of each post by going one layer deep from the target page.
I've applied @get_links decorator which scrapes the titles from its inner page.
However, I wish to get any suggestion to improve my existing approach keeping the decorator within as I'm very new to work with it.
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup
url = "https://stackoverflow.com/questions/tagged/web-scraping"
def get_links(func):
def get_target_link(*args,**kwargs):
titles =
for link in func(*args,**kwargs):
res = requests.get(link)
soup = BeautifulSoup(res.text,"lxml")
title = soup.select_one("h1[itemprop='name'] a").text
titles.append(title)
return titles
return get_target_link
@get_links
def get_info(link):
ilink =
res = requests.get(link)
soup = BeautifulSoup(res.text,"lxml")
for items in soup.select(".summary .question-hyperlink"):
ilink.append(urljoin(url,items.get('href')))
return ilink
if __name__ == '__main__':
print(get_info(url))
python python-3.x web-scraping
I've written a script in python which is able to collect links of posts and then fetch the title of each post by going one layer deep from the target page.
I've applied @get_links decorator which scrapes the titles from its inner page.
However, I wish to get any suggestion to improve my existing approach keeping the decorator within as I'm very new to work with it.
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup
url = "https://stackoverflow.com/questions/tagged/web-scraping"
def get_links(func):
def get_target_link(*args,**kwargs):
titles =
for link in func(*args,**kwargs):
res = requests.get(link)
soup = BeautifulSoup(res.text,"lxml")
title = soup.select_one("h1[itemprop='name'] a").text
titles.append(title)
return titles
return get_target_link
@get_links
def get_info(link):
ilink =
res = requests.get(link)
soup = BeautifulSoup(res.text,"lxml")
for items in soup.select(".summary .question-hyperlink"):
ilink.append(urljoin(url,items.get('href')))
return ilink
if __name__ == '__main__':
print(get_info(url))
python python-3.x web-scraping
python python-3.x web-scraping
edited yesterday
asked yesterday
asmitu
1268
1268
I don't get howcheck_paginationis supposed to help from the code itself, can you explain its purpose in more details, please?
– Mathias Ettinger
yesterday
Right you were @Mathias Ettinger . The decorator in my earlier script was for nothing. Check the update. Thanks.
– asmitu
yesterday
3
Why do you think that using a decorator is appropriate here?
– 200_success
yesterday
Where did you find that I thought it would be appropriate here @200_success?. I'm trying to figure out how decorator works and that's it.
– asmitu
17 hours ago
1
Your bolded italicized paragraph seemed to insist on keeping the decorator at all costs.
– 200_success
17 hours ago
add a comment |
I don't get howcheck_paginationis supposed to help from the code itself, can you explain its purpose in more details, please?
– Mathias Ettinger
yesterday
Right you were @Mathias Ettinger . The decorator in my earlier script was for nothing. Check the update. Thanks.
– asmitu
yesterday
3
Why do you think that using a decorator is appropriate here?
– 200_success
yesterday
Where did you find that I thought it would be appropriate here @200_success?. I'm trying to figure out how decorator works and that's it.
– asmitu
17 hours ago
1
Your bolded italicized paragraph seemed to insist on keeping the decorator at all costs.
– 200_success
17 hours ago
I don't get how
check_pagination is supposed to help from the code itself, can you explain its purpose in more details, please?– Mathias Ettinger
yesterday
I don't get how
check_pagination is supposed to help from the code itself, can you explain its purpose in more details, please?– Mathias Ettinger
yesterday
Right you were @Mathias Ettinger . The decorator in my earlier script was for nothing. Check the update. Thanks.
– asmitu
yesterday
Right you were @Mathias Ettinger . The decorator in my earlier script was for nothing. Check the update. Thanks.
– asmitu
yesterday
3
3
Why do you think that using a decorator is appropriate here?
– 200_success
yesterday
Why do you think that using a decorator is appropriate here?
– 200_success
yesterday
Where did you find that I thought it would be appropriate here @200_success?. I'm trying to figure out how decorator works and that's it.
– asmitu
17 hours ago
Where did you find that I thought it would be appropriate here @200_success?. I'm trying to figure out how decorator works and that's it.
– asmitu
17 hours ago
1
1
Your bolded italicized paragraph seemed to insist on keeping the decorator at all costs.
– 200_success
17 hours ago
Your bolded italicized paragraph seemed to insist on keeping the decorator at all costs.
– 200_success
17 hours ago
add a comment |
1 Answer
1
active
oldest
votes
up vote
6
down vote
While decorators are fun to learn about (especially when you get to decorators taking arguments and class decorators) and they can be quite useful, I think this decorator should not be one. Sorry.
Your code becomes much easier to read and understand by making this into two functions, one that gets the links and one that gets the title from a link, which you then apply to each link:
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup
def get_title(link):
"""Load a link to get the page title."""
res = requests.get(link)
soup = BeautifulSoup(res.text,"lxml")
return soup.title.text.split(" - ")[1] # Will only work exactly like this with Stackexchange
# return soup.select_one("h1[itemprop='name'] a").text
def get_links(link):
"""Get all links from a page."""
res = requests.get(link)
soup = BeautifulSoup(res.text,"lxml")
relative_urls = soup.select(".summary .question-hyperlink")
return [urljoin(url, items.get('href')) for items in relative_urls]
if __name__ == '__main__':
url = "https://stackoverflow.com/questions/tagged/web-scraping"
links = get_links(url)
link_titles = [get_title(link) for link in links]
print(link_titles)
If you really want to, you can then make a new function that uses these two functions:
def get_link_titles(url):
"""Get the titles of all links present in `url`."""
return [get_title(link) for link in get_links(url)]
In addition, you should use requests.Session to reuse the connection to the website (since you are always connecting to the same host).
You could put getting a page and parsing it with BeautifulSoup into its own function:
SESSION = requests.Session()
def get_soup(url):
res = SESSION.get(url)
return BeautifulSoup(res.text,"lxml")
You might also want to check the headers for a rate limit, because when I ran your code and tried to time it, Stack Exchange temporarily blocked me after some time because the request rate was too high :).
add a comment |
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
up vote
6
down vote
While decorators are fun to learn about (especially when you get to decorators taking arguments and class decorators) and they can be quite useful, I think this decorator should not be one. Sorry.
Your code becomes much easier to read and understand by making this into two functions, one that gets the links and one that gets the title from a link, which you then apply to each link:
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup
def get_title(link):
"""Load a link to get the page title."""
res = requests.get(link)
soup = BeautifulSoup(res.text,"lxml")
return soup.title.text.split(" - ")[1] # Will only work exactly like this with Stackexchange
# return soup.select_one("h1[itemprop='name'] a").text
def get_links(link):
"""Get all links from a page."""
res = requests.get(link)
soup = BeautifulSoup(res.text,"lxml")
relative_urls = soup.select(".summary .question-hyperlink")
return [urljoin(url, items.get('href')) for items in relative_urls]
if __name__ == '__main__':
url = "https://stackoverflow.com/questions/tagged/web-scraping"
links = get_links(url)
link_titles = [get_title(link) for link in links]
print(link_titles)
If you really want to, you can then make a new function that uses these two functions:
def get_link_titles(url):
"""Get the titles of all links present in `url`."""
return [get_title(link) for link in get_links(url)]
In addition, you should use requests.Session to reuse the connection to the website (since you are always connecting to the same host).
You could put getting a page and parsing it with BeautifulSoup into its own function:
SESSION = requests.Session()
def get_soup(url):
res = SESSION.get(url)
return BeautifulSoup(res.text,"lxml")
You might also want to check the headers for a rate limit, because when I ran your code and tried to time it, Stack Exchange temporarily blocked me after some time because the request rate was too high :).
add a comment |
up vote
6
down vote
While decorators are fun to learn about (especially when you get to decorators taking arguments and class decorators) and they can be quite useful, I think this decorator should not be one. Sorry.
Your code becomes much easier to read and understand by making this into two functions, one that gets the links and one that gets the title from a link, which you then apply to each link:
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup
def get_title(link):
"""Load a link to get the page title."""
res = requests.get(link)
soup = BeautifulSoup(res.text,"lxml")
return soup.title.text.split(" - ")[1] # Will only work exactly like this with Stackexchange
# return soup.select_one("h1[itemprop='name'] a").text
def get_links(link):
"""Get all links from a page."""
res = requests.get(link)
soup = BeautifulSoup(res.text,"lxml")
relative_urls = soup.select(".summary .question-hyperlink")
return [urljoin(url, items.get('href')) for items in relative_urls]
if __name__ == '__main__':
url = "https://stackoverflow.com/questions/tagged/web-scraping"
links = get_links(url)
link_titles = [get_title(link) for link in links]
print(link_titles)
If you really want to, you can then make a new function that uses these two functions:
def get_link_titles(url):
"""Get the titles of all links present in `url`."""
return [get_title(link) for link in get_links(url)]
In addition, you should use requests.Session to reuse the connection to the website (since you are always connecting to the same host).
You could put getting a page and parsing it with BeautifulSoup into its own function:
SESSION = requests.Session()
def get_soup(url):
res = SESSION.get(url)
return BeautifulSoup(res.text,"lxml")
You might also want to check the headers for a rate limit, because when I ran your code and tried to time it, Stack Exchange temporarily blocked me after some time because the request rate was too high :).
add a comment |
up vote
6
down vote
up vote
6
down vote
While decorators are fun to learn about (especially when you get to decorators taking arguments and class decorators) and they can be quite useful, I think this decorator should not be one. Sorry.
Your code becomes much easier to read and understand by making this into two functions, one that gets the links and one that gets the title from a link, which you then apply to each link:
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup
def get_title(link):
"""Load a link to get the page title."""
res = requests.get(link)
soup = BeautifulSoup(res.text,"lxml")
return soup.title.text.split(" - ")[1] # Will only work exactly like this with Stackexchange
# return soup.select_one("h1[itemprop='name'] a").text
def get_links(link):
"""Get all links from a page."""
res = requests.get(link)
soup = BeautifulSoup(res.text,"lxml")
relative_urls = soup.select(".summary .question-hyperlink")
return [urljoin(url, items.get('href')) for items in relative_urls]
if __name__ == '__main__':
url = "https://stackoverflow.com/questions/tagged/web-scraping"
links = get_links(url)
link_titles = [get_title(link) for link in links]
print(link_titles)
If you really want to, you can then make a new function that uses these two functions:
def get_link_titles(url):
"""Get the titles of all links present in `url`."""
return [get_title(link) for link in get_links(url)]
In addition, you should use requests.Session to reuse the connection to the website (since you are always connecting to the same host).
You could put getting a page and parsing it with BeautifulSoup into its own function:
SESSION = requests.Session()
def get_soup(url):
res = SESSION.get(url)
return BeautifulSoup(res.text,"lxml")
You might also want to check the headers for a rate limit, because when I ran your code and tried to time it, Stack Exchange temporarily blocked me after some time because the request rate was too high :).
While decorators are fun to learn about (especially when you get to decorators taking arguments and class decorators) and they can be quite useful, I think this decorator should not be one. Sorry.
Your code becomes much easier to read and understand by making this into two functions, one that gets the links and one that gets the title from a link, which you then apply to each link:
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup
def get_title(link):
"""Load a link to get the page title."""
res = requests.get(link)
soup = BeautifulSoup(res.text,"lxml")
return soup.title.text.split(" - ")[1] # Will only work exactly like this with Stackexchange
# return soup.select_one("h1[itemprop='name'] a").text
def get_links(link):
"""Get all links from a page."""
res = requests.get(link)
soup = BeautifulSoup(res.text,"lxml")
relative_urls = soup.select(".summary .question-hyperlink")
return [urljoin(url, items.get('href')) for items in relative_urls]
if __name__ == '__main__':
url = "https://stackoverflow.com/questions/tagged/web-scraping"
links = get_links(url)
link_titles = [get_title(link) for link in links]
print(link_titles)
If you really want to, you can then make a new function that uses these two functions:
def get_link_titles(url):
"""Get the titles of all links present in `url`."""
return [get_title(link) for link in get_links(url)]
In addition, you should use requests.Session to reuse the connection to the website (since you are always connecting to the same host).
You could put getting a page and parsing it with BeautifulSoup into its own function:
SESSION = requests.Session()
def get_soup(url):
res = SESSION.get(url)
return BeautifulSoup(res.text,"lxml")
You might also want to check the headers for a rate limit, because when I ran your code and tried to time it, Stack Exchange temporarily blocked me after some time because the request rate was too high :).
edited 13 hours ago
answered yesterday
Graipher
23k53384
23k53384
add a comment |
add a comment |
Thanks for contributing an answer to Code Review Stack Exchange!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
Use MathJax to format equations. MathJax reference.
To learn more, see our tips on writing great answers.
Some of your past answers have not been well-received, and you're in danger of being blocked from answering.
Please pay close attention to the following guidance:
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f209153%2fmaking-use-of-a-decorator-within-a-python-script%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
I don't get how
check_paginationis supposed to help from the code itself, can you explain its purpose in more details, please?– Mathias Ettinger
yesterday
Right you were @Mathias Ettinger . The decorator in my earlier script was for nothing. Check the update. Thanks.
– asmitu
yesterday
3
Why do you think that using a decorator is appropriate here?
– 200_success
yesterday
Where did you find that I thought it would be appropriate here @200_success?. I'm trying to figure out how decorator works and that's it.
– asmitu
17 hours ago
1
Your bolded italicized paragraph seemed to insist on keeping the decorator at all costs.
– 200_success
17 hours ago