Making use of a decorator within a python script

up vote
1
down vote

favorite

I've written a script in python which is able to collect links of posts and then fetch the title of each post by going one layer deep from the target page.

I've applied @get_links decorator which scrapes the titles from its inner page.

However, I wish to get any suggestion to improve my existing approach keeping the decorator within as I'm very new to work with it.

import requests

from urllib.parse import urljoin

from bs4 import BeautifulSoup



url = "https://stackoverflow.com/questions/tagged/web-scraping"



def get_links(func):



    def get_target_link(*args,**kwargs):

        titles = 

        for link in func(*args,**kwargs):

            res = requests.get(link)

            soup = BeautifulSoup(res.text,"lxml")

            title = soup.select_one("h1[itemprop='name'] a").text

            titles.append(title)

        return titles

    return get_target_link



@get_links

def get_info(link):

    ilink = 

    res = requests.get(link)

    soup = BeautifulSoup(res.text,"lxml")

    for items in soup.select(".summary .question-hyperlink"):

        ilink.append(urljoin(url,items.get('href')))

    return ilink



if __name__ == '__main__':

    print(get_info(url))

edited yesterday

asked yesterday

asmitu

1268

I don't get how check_pagination is supposed to help from the code itself, can you explain its purpose in more details, please?
– Mathias Ettinger
yesterday

Right you were @Mathias Ettinger . The decorator in my earlier script was for nothing. Check the update. Thanks.
– asmitu
yesterday

3

Why do you think that using a decorator is appropriate here?
– 200_success
yesterday

Where did you find that I thought it would be appropriate here @200_success?. I'm trying to figure out how decorator works and that's it.
– asmitu
17 hours ago

1

Your bolded italicized paragraph seemed to insist on keeping the decorator at all costs.
– 200_success
17 hours ago

add a comment |

up vote
1
down vote

favorite

I've written a script in python which is able to collect links of posts and then fetch the title of each post by going one layer deep from the target page.

I've applied @get_links decorator which scrapes the titles from its inner page.

However, I wish to get any suggestion to improve my existing approach keeping the decorator within as I'm very new to work with it.

import requests

from urllib.parse import urljoin

from bs4 import BeautifulSoup



url = "https://stackoverflow.com/questions/tagged/web-scraping"



def get_links(func):



    def get_target_link(*args,**kwargs):

        titles = 

        for link in func(*args,**kwargs):

            res = requests.get(link)

            soup = BeautifulSoup(res.text,"lxml")

            title = soup.select_one("h1[itemprop='name'] a").text

            titles.append(title)

        return titles

    return get_target_link



@get_links

def get_info(link):

    ilink = 

    res = requests.get(link)

    soup = BeautifulSoup(res.text,"lxml")

    for items in soup.select(".summary .question-hyperlink"):

        ilink.append(urljoin(url,items.get('href')))

    return ilink



if __name__ == '__main__':

    print(get_info(url))

edited yesterday

asked yesterday

asmitu

1268

I don't get how check_pagination is supposed to help from the code itself, can you explain its purpose in more details, please?
– Mathias Ettinger
yesterday

Right you were @Mathias Ettinger . The decorator in my earlier script was for nothing. Check the update. Thanks.
– asmitu
yesterday

3

Why do you think that using a decorator is appropriate here?
– 200_success
yesterday

Where did you find that I thought it would be appropriate here @200_success?. I'm trying to figure out how decorator works and that's it.
– asmitu
17 hours ago

1

Your bolded italicized paragraph seemed to insist on keeping the decorator at all costs.
– 200_success
17 hours ago

add a comment |

up vote
1
down vote

favorite

I've written a script in python which is able to collect links of posts and then fetch the title of each post by going one layer deep from the target page.

I've applied @get_links decorator which scrapes the titles from its inner page.

However, I wish to get any suggestion to improve my existing approach keeping the decorator within as I'm very new to work with it.

import requests

from urllib.parse import urljoin

from bs4 import BeautifulSoup



url = "https://stackoverflow.com/questions/tagged/web-scraping"



def get_links(func):



    def get_target_link(*args,**kwargs):

        titles = 

        for link in func(*args,**kwargs):

            res = requests.get(link)

            soup = BeautifulSoup(res.text,"lxml")

            title = soup.select_one("h1[itemprop='name'] a").text

            titles.append(title)

        return titles

    return get_target_link



@get_links

def get_info(link):

    ilink = 

    res = requests.get(link)

    soup = BeautifulSoup(res.text,"lxml")

    for items in soup.select(".summary .question-hyperlink"):

        ilink.append(urljoin(url,items.get('href')))

    return ilink



if __name__ == '__main__':

    print(get_info(url))

edited yesterday

asked yesterday

asmitu

1268

I've written a script in python which is able to collect links of posts and then fetch the title of each post by going one layer deep from the target page.

I've applied @get_links decorator which scrapes the titles from its inner page.

However, I wish to get any suggestion to improve my existing approach keeping the decorator within as I'm very new to work with it.

import requests

from urllib.parse import urljoin

from bs4 import BeautifulSoup



url = "https://stackoverflow.com/questions/tagged/web-scraping"



def get_links(func):



    def get_target_link(*args,**kwargs):

        titles = 

        for link in func(*args,**kwargs):

            res = requests.get(link)

            soup = BeautifulSoup(res.text,"lxml")

            title = soup.select_one("h1[itemprop='name'] a").text

            titles.append(title)

        return titles

    return get_target_link



@get_links

def get_info(link):

    ilink = 

    res = requests.get(link)

    soup = BeautifulSoup(res.text,"lxml")

    for items in soup.select(".summary .question-hyperlink"):

        ilink.append(urljoin(url,items.get('href')))

    return ilink



if __name__ == '__main__':

    print(get_info(url))

python python-3.x web-scraping

edited yesterday

asked yesterday

asmitu

1268

edited yesterday

asked yesterday

asmitu

1268

edited yesterday

asked yesterday

asmitu

1268

asked yesterday

asmitu

1268

asked yesterday

asmitu

1268

I don't get how check_pagination is supposed to help from the code itself, can you explain its purpose in more details, please?
– Mathias Ettinger
yesterday

Right you were @Mathias Ettinger . The decorator in my earlier script was for nothing. Check the update. Thanks.
– asmitu
yesterday

3

Why do you think that using a decorator is appropriate here?
– 200_success
yesterday

Where did you find that I thought it would be appropriate here @200_success?. I'm trying to figure out how decorator works and that's it.
– asmitu
17 hours ago

1

Your bolded italicized paragraph seemed to insist on keeping the decorator at all costs.
– 200_success
17 hours ago

add a comment |

I don't get how check_pagination is supposed to help from the code itself, can you explain its purpose in more details, please?
– Mathias Ettinger
yesterday

Right you were @Mathias Ettinger . The decorator in my earlier script was for nothing. Check the update. Thanks.
– asmitu
yesterday

3

Why do you think that using a decorator is appropriate here?
– 200_success
yesterday

Where did you find that I thought it would be appropriate here @200_success?. I'm trying to figure out how decorator works and that's it.
– asmitu
17 hours ago

1

Your bolded italicized paragraph seemed to insist on keeping the decorator at all costs.
– 200_success
17 hours ago

I don't get how check_pagination is supposed to help from the code itself, can you explain its purpose in more details, please?
– Mathias Ettinger
yesterday

Right you were @Mathias Ettinger . The decorator in my earlier script was for nothing. Check the update. Thanks.
– asmitu
yesterday

Why do you think that using a decorator is appropriate here?
– 200_success
yesterday

Where did you find that I thought it would be appropriate here @200_success?. I'm trying to figure out how decorator works and that's it.
– asmitu
17 hours ago

Your bolded italicized paragraph seemed to insist on keeping the decorator at all costs.
– 200_success
17 hours ago

add a comment |

1 Answer
1

active

oldest

votes

up vote
6
down vote

While decorators are fun to learn about (especially when you get to decorators taking arguments and class decorators) and they can be quite useful, I think this decorator should not be one. Sorry.

Your code becomes much easier to read and understand by making this into two functions, one that gets the links and one that gets the title from a link, which you then apply to each link:

import requests

from urllib.parse import urljoin

from bs4 import BeautifulSoup



def get_title(link):

    """Load a link to get the page title."""

    res = requests.get(link)

    soup = BeautifulSoup(res.text,"lxml")

    return soup.title.text.split(" - ")[1] # Will only work exactly like this  with Stackexchange

    # return soup.select_one("h1[itemprop='name'] a").text



def get_links(link):

    """Get all links from a page."""

    res = requests.get(link)

    soup = BeautifulSoup(res.text,"lxml")

    relative_urls = soup.select(".summary .question-hyperlink")

    return [urljoin(url, items.get('href')) for items in relative_urls]





if __name__ == '__main__':

    url = "https://stackoverflow.com/questions/tagged/web-scraping"

    links = get_links(url)

    link_titles = [get_title(link) for link in links]

    print(link_titles)

If you really want to, you can then make a new function that uses these two functions:

def get_link_titles(url):

    """Get the titles of all links present in `url`."""

    return [get_title(link) for link in get_links(url)]

In addition, you should use requests.Session to reuse the connection to the website (since you are always connecting to the same host).

You could put getting a page and parsing it with BeautifulSoup into its own function:

SESSION = requests.Session()



def get_soup(url):

    res = SESSION.get(url)

    return BeautifulSoup(res.text,"lxml")

You might also want to check the headers for a rate limit, because when I ran your code and tried to time it, Stack Exchange temporarily blocked me after some time because the request rate was too high :).

edited 13 hours ago

answered yesterday

Graipher

23k53384

add a comment |

Your Answer

StackExchange.ifUsing("editor", function () {
return StackExchange.using("mathjaxEditing", function () {
StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["\$", "\$"]]);
});
});
}, "mathjax-editing");

StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "196"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f209153%2fmaking-use-of-a-decorator-within-a-python-script%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

1 Answer
1

active

oldest

votes

1 Answer
1

active

oldest

votes

up vote
6
down vote

While decorators are fun to learn about (especially when you get to decorators taking arguments and class decorators) and they can be quite useful, I think this decorator should not be one. Sorry.

Your code becomes much easier to read and understand by making this into two functions, one that gets the links and one that gets the title from a link, which you then apply to each link:

import requests

from urllib.parse import urljoin

from bs4 import BeautifulSoup



def get_title(link):

    """Load a link to get the page title."""

    res = requests.get(link)

    soup = BeautifulSoup(res.text,"lxml")

    return soup.title.text.split(" - ")[1] # Will only work exactly like this  with Stackexchange

    # return soup.select_one("h1[itemprop='name'] a").text



def get_links(link):

    """Get all links from a page."""

    res = requests.get(link)

    soup = BeautifulSoup(res.text,"lxml")

    relative_urls = soup.select(".summary .question-hyperlink")

    return [urljoin(url, items.get('href')) for items in relative_urls]





if __name__ == '__main__':

    url = "https://stackoverflow.com/questions/tagged/web-scraping"

    links = get_links(url)

    link_titles = [get_title(link) for link in links]

    print(link_titles)

If you really want to, you can then make a new function that uses these two functions:

def get_link_titles(url):

    """Get the titles of all links present in `url`."""

    return [get_title(link) for link in get_links(url)]

In addition, you should use requests.Session to reuse the connection to the website (since you are always connecting to the same host).

You could put getting a page and parsing it with BeautifulSoup into its own function:

SESSION = requests.Session()



def get_soup(url):

    res = SESSION.get(url)

    return BeautifulSoup(res.text,"lxml")

edited 13 hours ago

answered yesterday

Graipher

23k53384

add a comment |

up vote
6
down vote

While decorators are fun to learn about (especially when you get to decorators taking arguments and class decorators) and they can be quite useful, I think this decorator should not be one. Sorry.

Your code becomes much easier to read and understand by making this into two functions, one that gets the links and one that gets the title from a link, which you then apply to each link:

import requests

from urllib.parse import urljoin

from bs4 import BeautifulSoup



def get_title(link):

    """Load a link to get the page title."""

    res = requests.get(link)

    soup = BeautifulSoup(res.text,"lxml")

    return soup.title.text.split(" - ")[1] # Will only work exactly like this  with Stackexchange

    # return soup.select_one("h1[itemprop='name'] a").text



def get_links(link):

    """Get all links from a page."""

    res = requests.get(link)

    soup = BeautifulSoup(res.text,"lxml")

    relative_urls = soup.select(".summary .question-hyperlink")

    return [urljoin(url, items.get('href')) for items in relative_urls]





if __name__ == '__main__':

    url = "https://stackoverflow.com/questions/tagged/web-scraping"

    links = get_links(url)

    link_titles = [get_title(link) for link in links]

    print(link_titles)

If you really want to, you can then make a new function that uses these two functions:

def get_link_titles(url):

    """Get the titles of all links present in `url`."""

    return [get_title(link) for link in get_links(url)]

In addition, you should use requests.Session to reuse the connection to the website (since you are always connecting to the same host).

You could put getting a page and parsing it with BeautifulSoup into its own function:

SESSION = requests.Session()



def get_soup(url):

    res = SESSION.get(url)

    return BeautifulSoup(res.text,"lxml")

edited 13 hours ago

answered yesterday

Graipher

23k53384

add a comment |

up vote
6
down vote

While decorators are fun to learn about (especially when you get to decorators taking arguments and class decorators) and they can be quite useful, I think this decorator should not be one. Sorry.

Your code becomes much easier to read and understand by making this into two functions, one that gets the links and one that gets the title from a link, which you then apply to each link:

import requests

from urllib.parse import urljoin

from bs4 import BeautifulSoup



def get_title(link):

    """Load a link to get the page title."""

    res = requests.get(link)

    soup = BeautifulSoup(res.text,"lxml")

    return soup.title.text.split(" - ")[1] # Will only work exactly like this  with Stackexchange

    # return soup.select_one("h1[itemprop='name'] a").text



def get_links(link):

    """Get all links from a page."""

    res = requests.get(link)

    soup = BeautifulSoup(res.text,"lxml")

    relative_urls = soup.select(".summary .question-hyperlink")

    return [urljoin(url, items.get('href')) for items in relative_urls]





if __name__ == '__main__':

    url = "https://stackoverflow.com/questions/tagged/web-scraping"

    links = get_links(url)

    link_titles = [get_title(link) for link in links]

    print(link_titles)

If you really want to, you can then make a new function that uses these two functions:

def get_link_titles(url):

    """Get the titles of all links present in `url`."""

    return [get_title(link) for link in get_links(url)]

In addition, you should use requests.Session to reuse the connection to the website (since you are always connecting to the same host).

You could put getting a page and parsing it with BeautifulSoup into its own function:

SESSION = requests.Session()



def get_soup(url):

    res = SESSION.get(url)

    return BeautifulSoup(res.text,"lxml")

edited 13 hours ago

answered yesterday

Graipher

23k53384

While decorators are fun to learn about (especially when you get to decorators taking arguments and class decorators) and they can be quite useful, I think this decorator should not be one. Sorry.

Your code becomes much easier to read and understand by making this into two functions, one that gets the links and one that gets the title from a link, which you then apply to each link:

import requests

from urllib.parse import urljoin

from bs4 import BeautifulSoup



def get_title(link):

    """Load a link to get the page title."""

    res = requests.get(link)

    soup = BeautifulSoup(res.text,"lxml")

    return soup.title.text.split(" - ")[1] # Will only work exactly like this  with Stackexchange

    # return soup.select_one("h1[itemprop='name'] a").text



def get_links(link):

    """Get all links from a page."""

    res = requests.get(link)

    soup = BeautifulSoup(res.text,"lxml")

    relative_urls = soup.select(".summary .question-hyperlink")

    return [urljoin(url, items.get('href')) for items in relative_urls]





if __name__ == '__main__':

    url = "https://stackoverflow.com/questions/tagged/web-scraping"

    links = get_links(url)

    link_titles = [get_title(link) for link in links]

    print(link_titles)

If you really want to, you can then make a new function that uses these two functions:

def get_link_titles(url):

    """Get the titles of all links present in `url`."""

    return [get_title(link) for link in get_links(url)]

In addition, you should use requests.Session to reuse the connection to the website (since you are always connecting to the same host).

You could put getting a page and parsing it with BeautifulSoup into its own function:

SESSION = requests.Session()



def get_soup(url):

    res = SESSION.get(url)

    return BeautifulSoup(res.text,"lxml")

edited 13 hours ago

answered yesterday

Graipher

23k53384

edited 13 hours ago

answered yesterday

Graipher

23k53384

answered yesterday

Graipher

23k53384

answered yesterday

Graipher

23k53384

add a comment |

draft saved

draft discarded

Thanks for contributing an answer to Code Review Stack Exchange!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

Use MathJax to format equations. MathJax reference.

To learn more, see our tips on writing great answers.

Some of your past answers have not been well-received, and you're in danger of being blocked from answering.

Please pay close attention to the following guidance:

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

搜尋此網誌

Krdytkyu