Sourcing content from a webpage in an organized manner























I've written a Python script that scrapes the title and address of each property from a website. The script first collects all the property links from the landing page, then follows each link one level deep to collect the title and address. When I run the script, I get the expected results.



Would it be a better approach to call a single function and have the rest of the functions work like a chain to produce the same results? If so, what is the right way to do that?



This is the link to that site



Here is the working script:



import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

url = 'https://www.booking.com/searchresults.html?aid=304142;label=gen173nr-1FCAEoggJCAlhYSDNYBGhpiAEBmAExuAEHyAEM2AEB6AEB-AECkgIBeagCAw;sid=7abf6bf275d09e8d4f617d51f4d6c803;class_interval=1;dest_id=102;dest_type=country;dtdisc=0;from_sf=1;group_adults=2;group_children=0;inac=0;index_postcard=0;label_click=undef;no_rooms=1;offset=0;postcard=0;raw_dest_type=country;room1=A%2CA;sb_price_type=total;search_selected=1;slp_r_match=0;src=index;src_elem=sb;srpvid=134253d18d97017e;ss=Ireland;ss_all=0;ss_raw=ireland;ssb=empty;sshis=0&'
headers = {"User-Agent":"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36"}

def get_token():  # getting csrf_token
    r = requests.get(url)
    soup = BeautifulSoup(r.text, "lxml")
    item = soup.select_one("[name='bhc_csrf_token']")['value']

    payload = {
        'bhc_csrf_token': item,
        'logout': 1
    }
    return payload

def make_post(payload):  # making a POST request with the payload and parsing target page links
    res = requests.post(url, data=payload, headers=headers)
    sauce = BeautifulSoup(res.text, "lxml")
    linklist = []
    for elem in sauce.select(".sr_property_block a.hotel_name_link"):
        linklist.append(urljoin(url, elem.get("href").strip()))
    return linklist

def get_info(link):  # scraping title and address by using each of the links
    response = requests.get(link, headers=headers)
    soupobj = BeautifulSoup(response.text, "lxml")
    name = soupobj.select_one("h2#hp_hotel_name").get_text(strip=True)
    addr = soupobj.select_one(".hp_address_subtitle").get_text(strip=True)
    print(f'{name}\n{addr}\n')

if __name__ == '__main__':
    for link in make_post(get_token()):
        get_info(link)









      python python-3.x web-scraping beautifulsoup






asked Oct 12 at 16:16 by asmitu
edited Oct 13 at 11:50




          1 Answer
Perhaps I misunderstand you, but it looks as though you've already written the functions to call one another: get_info() calls make_post() internally, and make_post() in turn delegates fetching the token to get_token(). So in the final lines of your code, you should be able to replace the three calls with a single call to get_info().

Whether that is "better" than calling them independently is a question of style. Personally, I think it's best to keep the functions independent and call them separately. That means rewriting them to take as parameters the results of the preceding function call.

This way the functions don't need to know about each other's existence, and you can change each one independently as requirements evolve without needing to update the others. This is called low coupling, a very useful design quality to keep in mind.
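
To make the low-coupling idea concrete, here is a minimal sketch of how the steps could be arranged so that each function only receives the result of the previous one. It is an illustration under assumptions, not a rewrite of the script above: the names parse_links, collect_pages, scrape, fetch and FAKE_SITE are hypothetical, the standard library's html.parser stands in for BeautifulSoup so the example runs without third-party packages, and an in-memory dictionary replaces real HTTP requests.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class _LinkCollector(HTMLParser):
    """Collects href values of <a> tags that carry a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "a" and self.css_class in classes and attrs.get("href"):
            self.hrefs.append(attrs["href"].strip())


def parse_links(html, base_url, css_class="hotel_name_link"):
    """Pure step: turn a results page into a list of absolute links."""
    collector = _LinkCollector(css_class)
    collector.feed(html)
    return [urljoin(base_url, href) for href in collector.hrefs]


def collect_pages(links, fetch):
    """Fetch every link using the injected fetch(url) -> str callable."""
    return {link: fetch(link) for link in links}


def scrape(base_url, fetch):
    """Single entry point: each step is handed the previous step's result."""
    listing_html = fetch(base_url)
    links = parse_links(listing_html, base_url)
    return collect_pages(links, fetch)


# Hypothetical in-memory "site" so the sketch runs without network access.
FAKE_SITE = {
    "https://example.com/search": (
        '<a class="hotel_name_link" href="/hotel/a.html">A</a>'
        '<a class="other" href="/x">X</a>'
    ),
    "https://example.com/hotel/a.html": "<h2>Hotel A</h2>",
}

pages = scrape("https://example.com/search", FAKE_SITE.__getitem__)
print(pages)
```

Only scrape knows the order of the steps. parse_links and collect_pages can be changed or tested on their own, and swapping the fake fetch for a function built on requests.get would not require touching either of them.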






answered Oct 13 at 5:52 by Julian Suggate
