Making use of a decorator within a python script











up vote
1
down vote

favorite












I've written a script in python which is able to collect links of posts and then fetch the title of each post by going one layer deep from the target page.



I've applied @get_links decorator which scrapes the titles from its inner page.



However, I wish to get any suggestion to improve my existing approach keeping the decorator within as I'm very new to work with it.



import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

url = "https://stackoverflow.com/questions/tagged/web-scraping"

def get_links(func):

def get_target_link(*args,**kwargs):
titles =
for link in func(*args,**kwargs):
res = requests.get(link)
soup = BeautifulSoup(res.text,"lxml")
title = soup.select_one("h1[itemprop='name'] a").text
titles.append(title)
return titles
return get_target_link

@get_links
def get_info(link):
ilink =
res = requests.get(link)
soup = BeautifulSoup(res.text,"lxml")
for items in soup.select(".summary .question-hyperlink"):
ilink.append(urljoin(url,items.get('href')))
return ilink

if __name__ == '__main__':
print(get_info(url))









share|improve this question
























  • I don't get how check_pagination is supposed to help from the code itself, can you explain its purpose in more details, please?
    – Mathias Ettinger
    yesterday










  • Right you were @Mathias Ettinger . The decorator in my earlier script was for nothing. Check the update. Thanks.
    – asmitu
    yesterday






  • 3




    Why do you think that using a decorator is appropriate here?
    – 200_success
    yesterday










  • Where did you find that I thought it would be appropriate here @200_success?. I'm trying to figure out how decorator works and that's it.
    – asmitu
    17 hours ago






  • 1




    Your bolded italicized paragraph seemed to insist on keeping the decorator at all costs.
    – 200_success
    17 hours ago

















up vote
1
down vote

favorite












I've written a script in python which is able to collect links of posts and then fetch the title of each post by going one layer deep from the target page.



I've applied @get_links decorator which scrapes the titles from its inner page.



However, I wish to get any suggestion to improve my existing approach keeping the decorator within as I'm very new to work with it.



import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

url = "https://stackoverflow.com/questions/tagged/web-scraping"

def get_links(func):

def get_target_link(*args,**kwargs):
titles =
for link in func(*args,**kwargs):
res = requests.get(link)
soup = BeautifulSoup(res.text,"lxml")
title = soup.select_one("h1[itemprop='name'] a").text
titles.append(title)
return titles
return get_target_link

@get_links
def get_info(link):
ilink =
res = requests.get(link)
soup = BeautifulSoup(res.text,"lxml")
for items in soup.select(".summary .question-hyperlink"):
ilink.append(urljoin(url,items.get('href')))
return ilink

if __name__ == '__main__':
print(get_info(url))









share|improve this question
























  • I don't get how check_pagination is supposed to help from the code itself, can you explain its purpose in more details, please?
    – Mathias Ettinger
    yesterday










  • Right you were @Mathias Ettinger . The decorator in my earlier script was for nothing. Check the update. Thanks.
    – asmitu
    yesterday






  • 3




    Why do you think that using a decorator is appropriate here?
    – 200_success
    yesterday










  • Where did you find that I thought it would be appropriate here @200_success?. I'm trying to figure out how decorator works and that's it.
    – asmitu
    17 hours ago






  • 1




    Your bolded italicized paragraph seemed to insist on keeping the decorator at all costs.
    – 200_success
    17 hours ago















up vote
1
down vote

favorite









up vote
1
down vote

favorite











I've written a script in python which is able to collect links of posts and then fetch the title of each post by going one layer deep from the target page.



I've applied @get_links decorator which scrapes the titles from its inner page.



However, I wish to get any suggestion to improve my existing approach keeping the decorator within as I'm very new to work with it.



import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

url = "https://stackoverflow.com/questions/tagged/web-scraping"

def get_links(func):

def get_target_link(*args,**kwargs):
titles =
for link in func(*args,**kwargs):
res = requests.get(link)
soup = BeautifulSoup(res.text,"lxml")
title = soup.select_one("h1[itemprop='name'] a").text
titles.append(title)
return titles
return get_target_link

@get_links
def get_info(link):
ilink =
res = requests.get(link)
soup = BeautifulSoup(res.text,"lxml")
for items in soup.select(".summary .question-hyperlink"):
ilink.append(urljoin(url,items.get('href')))
return ilink

if __name__ == '__main__':
print(get_info(url))









share|improve this question















I've written a script in python which is able to collect links of posts and then fetch the title of each post by going one layer deep from the target page.



I've applied @get_links decorator which scrapes the titles from its inner page.



However, I wish to get any suggestion to improve my existing approach keeping the decorator within as I'm very new to work with it.



import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

url = "https://stackoverflow.com/questions/tagged/web-scraping"

def get_links(func):

def get_target_link(*args,**kwargs):
titles =
for link in func(*args,**kwargs):
res = requests.get(link)
soup = BeautifulSoup(res.text,"lxml")
title = soup.select_one("h1[itemprop='name'] a").text
titles.append(title)
return titles
return get_target_link

@get_links
def get_info(link):
ilink =
res = requests.get(link)
soup = BeautifulSoup(res.text,"lxml")
for items in soup.select(".summary .question-hyperlink"):
ilink.append(urljoin(url,items.get('href')))
return ilink

if __name__ == '__main__':
print(get_info(url))






python python-3.x web-scraping






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited yesterday

























asked yesterday









asmitu

1268




1268












  • I don't get how check_pagination is supposed to help from the code itself, can you explain its purpose in more details, please?
    – Mathias Ettinger
    yesterday










  • Right you were @Mathias Ettinger . The decorator in my earlier script was for nothing. Check the update. Thanks.
    – asmitu
    yesterday






  • 3




    Why do you think that using a decorator is appropriate here?
    – 200_success
    yesterday










  • Where did you find that I thought it would be appropriate here @200_success?. I'm trying to figure out how decorator works and that's it.
    – asmitu
    17 hours ago






  • 1




    Your bolded italicized paragraph seemed to insist on keeping the decorator at all costs.
    – 200_success
    17 hours ago




















  • I don't get how check_pagination is supposed to help from the code itself, can you explain its purpose in more details, please?
    – Mathias Ettinger
    yesterday










  • Right you were @Mathias Ettinger . The decorator in my earlier script was for nothing. Check the update. Thanks.
    – asmitu
    yesterday






  • 3




    Why do you think that using a decorator is appropriate here?
    – 200_success
    yesterday










  • Where did you find that I thought it would be appropriate here @200_success?. I'm trying to figure out how decorator works and that's it.
    – asmitu
    17 hours ago






  • 1




    Your bolded italicized paragraph seemed to insist on keeping the decorator at all costs.
    – 200_success
    17 hours ago


















I don't get how check_pagination is supposed to help from the code itself, can you explain its purpose in more details, please?
– Mathias Ettinger
yesterday




I don't get how check_pagination is supposed to help from the code itself, can you explain its purpose in more details, please?
– Mathias Ettinger
yesterday












Right you were @Mathias Ettinger . The decorator in my earlier script was for nothing. Check the update. Thanks.
– asmitu
yesterday




Right you were @Mathias Ettinger . The decorator in my earlier script was for nothing. Check the update. Thanks.
– asmitu
yesterday




3




3




Why do you think that using a decorator is appropriate here?
– 200_success
yesterday




Why do you think that using a decorator is appropriate here?
– 200_success
yesterday












Where did you find that I thought it would be appropriate here @200_success?. I'm trying to figure out how decorator works and that's it.
– asmitu
17 hours ago




Where did you find that I thought it would be appropriate here @200_success?. I'm trying to figure out how decorator works and that's it.
– asmitu
17 hours ago




1




1




Your bolded italicized paragraph seemed to insist on keeping the decorator at all costs.
– 200_success
17 hours ago






Your bolded italicized paragraph seemed to insist on keeping the decorator at all costs.
– 200_success
17 hours ago












1 Answer
1






active

oldest

votes

















up vote
6
down vote













While decorators are fun to learn about (especially when you get to decorators taking arguments and class decorators) and they can be quite useful, I think this decorator should not be one. Sorry.



Your code becomes much easier to read and understand by making this into two functions, one that gets the links and one that gets the title from a link, which you then apply to each link:



import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def get_title(link):
"""Load a link to get the page title."""
res = requests.get(link)
soup = BeautifulSoup(res.text,"lxml")
return soup.title.text.split(" - ")[1] # Will only work exactly like this with Stackexchange
# return soup.select_one("h1[itemprop='name'] a").text

def get_links(link):
"""Get all links from a page."""
res = requests.get(link)
soup = BeautifulSoup(res.text,"lxml")
relative_urls = soup.select(".summary .question-hyperlink")
return [urljoin(url, items.get('href')) for items in relative_urls]


if __name__ == '__main__':
url = "https://stackoverflow.com/questions/tagged/web-scraping"
links = get_links(url)
link_titles = [get_title(link) for link in links]
print(link_titles)


If you really want to, you can then make a new function that uses these two functions:



def get_link_titles(url):
"""Get the titles of all links present in `url`."""
return [get_title(link) for link in get_links(url)]


In addition, you should use requests.Session to reuse the connection to the website (since you are always connecting to the same host).



You could put getting a page and parsing it with BeautifulSoup into its own function:



SESSION = requests.Session()

def get_soup(url):
res = SESSION.get(url)
return BeautifulSoup(res.text,"lxml")


You might also want to check the headers for a rate limit, because when I ran your code and tried to time it, Stack Exchange temporarily blocked me after some time because the request rate was too high :).






share|improve this answer























    Your Answer





    StackExchange.ifUsing("editor", function () {
    return StackExchange.using("mathjaxEditing", function () {
    StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
    StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["\$", "\$"]]);
    });
    });
    }, "mathjax-editing");

    StackExchange.ifUsing("editor", function () {
    StackExchange.using("externalEditor", function () {
    StackExchange.using("snippets", function () {
    StackExchange.snippets.init();
    });
    });
    }, "code-snippets");

    StackExchange.ready(function() {
    var channelOptions = {
    tags: "".split(" "),
    id: "196"
    };
    initTagRenderer("".split(" "), "".split(" "), channelOptions);

    StackExchange.using("externalEditor", function() {
    // Have to fire editor after snippets, if snippets enabled
    if (StackExchange.settings.snippets.snippetsEnabled) {
    StackExchange.using("snippets", function() {
    createEditor();
    });
    }
    else {
    createEditor();
    }
    });

    function createEditor() {
    StackExchange.prepareEditor({
    heartbeatType: 'answer',
    convertImagesToLinks: false,
    noModals: true,
    showLowRepImageUploadWarning: true,
    reputationToPostImages: null,
    bindNavPrevention: true,
    postfix: "",
    imageUploader: {
    brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
    contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
    allowUrls: true
    },
    onDemand: true,
    discardSelector: ".discard-answer"
    ,immediatelyShowMarkdownHelp:true
    });


    }
    });














    draft saved

    draft discarded


















    StackExchange.ready(
    function () {
    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f209153%2fmaking-use-of-a-decorator-within-a-python-script%23new-answer', 'question_page');
    }
    );

    Post as a guest















    Required, but never shown

























    1 Answer
    1






    active

    oldest

    votes








    1 Answer
    1






    active

    oldest

    votes









    active

    oldest

    votes






    active

    oldest

    votes








    up vote
    6
    down vote













    While decorators are fun to learn about (especially when you get to decorators taking arguments and class decorators) and they can be quite useful, I think this decorator should not be one. Sorry.



    Your code becomes much easier to read and understand by making this into two functions, one that gets the links and one that gets the title from a link, which you then apply to each link:



    import requests
    from urllib.parse import urljoin
    from bs4 import BeautifulSoup

    def get_title(link):
    """Load a link to get the page title."""
    res = requests.get(link)
    soup = BeautifulSoup(res.text,"lxml")
    return soup.title.text.split(" - ")[1] # Will only work exactly like this with Stackexchange
    # return soup.select_one("h1[itemprop='name'] a").text

    def get_links(link):
    """Get all links from a page."""
    res = requests.get(link)
    soup = BeautifulSoup(res.text,"lxml")
    relative_urls = soup.select(".summary .question-hyperlink")
    return [urljoin(url, items.get('href')) for items in relative_urls]


    if __name__ == '__main__':
    url = "https://stackoverflow.com/questions/tagged/web-scraping"
    links = get_links(url)
    link_titles = [get_title(link) for link in links]
    print(link_titles)


    If you really want to, you can then make a new function that uses these two functions:



    def get_link_titles(url):
    """Get the titles of all links present in `url`."""
    return [get_title(link) for link in get_links(url)]


    In addition, you should use requests.Session to reuse the connection to the website (since you are always connecting to the same host).



    You could put getting a page and parsing it with BeautifulSoup into its own function:



    SESSION = requests.Session()

    def get_soup(url):
    res = SESSION.get(url)
    return BeautifulSoup(res.text,"lxml")


    You might also want to check the headers for a rate limit, because when I ran your code and tried to time it, Stack Exchange temporarily blocked me after some time because the request rate was too high :).






    share|improve this answer



























      up vote
      6
      down vote













      While decorators are fun to learn about (especially when you get to decorators taking arguments and class decorators) and they can be quite useful, I think this decorator should not be one. Sorry.



      Your code becomes much easier to read and understand by making this into two functions, one that gets the links and one that gets the title from a link, which you then apply to each link:



      import requests
      from urllib.parse import urljoin
      from bs4 import BeautifulSoup

      def get_title(link):
      """Load a link to get the page title."""
      res = requests.get(link)
      soup = BeautifulSoup(res.text,"lxml")
      return soup.title.text.split(" - ")[1] # Will only work exactly like this with Stackexchange
      # return soup.select_one("h1[itemprop='name'] a").text

      def get_links(link):
      """Get all links from a page."""
      res = requests.get(link)
      soup = BeautifulSoup(res.text,"lxml")
      relative_urls = soup.select(".summary .question-hyperlink")
      return [urljoin(url, items.get('href')) for items in relative_urls]


      if __name__ == '__main__':
      url = "https://stackoverflow.com/questions/tagged/web-scraping"
      links = get_links(url)
      link_titles = [get_title(link) for link in links]
      print(link_titles)


      If you really want to, you can then make a new function that uses these two functions:



      def get_link_titles(url):
      """Get the titles of all links present in `url`."""
      return [get_title(link) for link in get_links(url)]


      In addition, you should use requests.Session to reuse the connection to the website (since you are always connecting to the same host).



      You could put getting a page and parsing it with BeautifulSoup into its own function:



      SESSION = requests.Session()

      def get_soup(url):
      res = SESSION.get(url)
      return BeautifulSoup(res.text,"lxml")


      You might also want to check the headers for a rate limit, because when I ran your code and tried to time it, Stack Exchange temporarily blocked me after some time because the request rate was too high :).






      share|improve this answer

























        up vote
        6
        down vote










        up vote
        6
        down vote









        While decorators are fun to learn about (especially when you get to decorators taking arguments and class decorators) and they can be quite useful, I think this decorator should not be one. Sorry.



        Your code becomes much easier to read and understand by making this into two functions, one that gets the links and one that gets the title from a link, which you then apply to each link:



        import requests
        from urllib.parse import urljoin
        from bs4 import BeautifulSoup

        def get_title(link):
        """Load a link to get the page title."""
        res = requests.get(link)
        soup = BeautifulSoup(res.text,"lxml")
        return soup.title.text.split(" - ")[1] # Will only work exactly like this with Stackexchange
        # return soup.select_one("h1[itemprop='name'] a").text

        def get_links(link):
        """Get all links from a page."""
        res = requests.get(link)
        soup = BeautifulSoup(res.text,"lxml")
        relative_urls = soup.select(".summary .question-hyperlink")
        return [urljoin(url, items.get('href')) for items in relative_urls]


        if __name__ == '__main__':
        url = "https://stackoverflow.com/questions/tagged/web-scraping"
        links = get_links(url)
        link_titles = [get_title(link) for link in links]
        print(link_titles)


        If you really want to, you can then make a new function that uses these two functions:



        def get_link_titles(url):
        """Get the titles of all links present in `url`."""
        return [get_title(link) for link in get_links(url)]


        In addition, you should use requests.Session to reuse the connection to the website (since you are always connecting to the same host).



        You could put getting a page and parsing it with BeautifulSoup into its own function:



        SESSION = requests.Session()

        def get_soup(url):
        res = SESSION.get(url)
        return BeautifulSoup(res.text,"lxml")


        You might also want to check the headers for a rate limit, because when I ran your code and tried to time it, Stack Exchange temporarily blocked me after some time because the request rate was too high :).






        share|improve this answer














        While decorators are fun to learn about (especially when you get to decorators taking arguments and class decorators) and they can be quite useful, I think this decorator should not be one. Sorry.



        Your code becomes much easier to read and understand by making this into two functions, one that gets the links and one that gets the title from a link, which you then apply to each link:



        import requests
        from urllib.parse import urljoin
        from bs4 import BeautifulSoup

        def get_title(link):
        """Load a link to get the page title."""
        res = requests.get(link)
        soup = BeautifulSoup(res.text,"lxml")
        return soup.title.text.split(" - ")[1] # Will only work exactly like this with Stackexchange
        # return soup.select_one("h1[itemprop='name'] a").text

        def get_links(link):
        """Get all links from a page."""
        res = requests.get(link)
        soup = BeautifulSoup(res.text,"lxml")
        relative_urls = soup.select(".summary .question-hyperlink")
        return [urljoin(url, items.get('href')) for items in relative_urls]


        if __name__ == '__main__':
        url = "https://stackoverflow.com/questions/tagged/web-scraping"
        links = get_links(url)
        link_titles = [get_title(link) for link in links]
        print(link_titles)


        If you really want to, you can then make a new function that uses these two functions:



        def get_link_titles(url):
        """Get the titles of all links present in `url`."""
        return [get_title(link) for link in get_links(url)]


        In addition, you should use requests.Session to reuse the connection to the website (since you are always connecting to the same host).



        You could put getting a page and parsing it with BeautifulSoup into its own function:



        SESSION = requests.Session()

        def get_soup(url):
        res = SESSION.get(url)
        return BeautifulSoup(res.text,"lxml")


        You might also want to check the headers for a rate limit, because when I ran your code and tried to time it, Stack Exchange temporarily blocked me after some time because the request rate was too high :).







        share|improve this answer














        share|improve this answer



        share|improve this answer








        edited 13 hours ago

























        answered yesterday









        Graipher

        23k53384




        23k53384






























            draft saved

            draft discarded




















































            Thanks for contributing an answer to Code Review Stack Exchange!


            • Please be sure to answer the question. Provide details and share your research!

            But avoid



            • Asking for help, clarification, or responding to other answers.

            • Making statements based on opinion; back them up with references or personal experience.


            Use MathJax to format equations. MathJax reference.


            To learn more, see our tips on writing great answers.





            Some of your past answers have not been well-received, and you're in danger of being blocked from answering.


            Please pay close attention to the following guidance:


            • Please be sure to answer the question. Provide details and share your research!

            But avoid



            • Asking for help, clarification, or responding to other answers.

            • Making statements based on opinion; back them up with references or personal experience.


            To learn more, see our tips on writing great answers.




            draft saved


            draft discarded














            StackExchange.ready(
            function () {
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f209153%2fmaking-use-of-a-decorator-within-a-python-script%23new-answer', 'question_page');
            }
            );

            Post as a guest















            Required, but never shown





















































            Required, but never shown














            Required, but never shown












            Required, but never shown







            Required, but never shown

































            Required, but never shown














            Required, but never shown












            Required, but never shown







            Required, but never shown







            Popular posts from this blog

            Mont Emei

            Province de Neuquén

            Journaliste