Sourcing content from a webpage in an organized manner

I've written a Python script to grab the title and address from different pages of a website. The script first collects all the property links from the landing page and then goes one layer deep to collect the title and address from each property page. When I run the script, I get the expected results.

Would it be a better approach to call a single function and have the rest of the functions work like a chain to produce the same results? If so, what is the right way to do so?

This is the link to that site

Here is the working script:

import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

url = 'https://www.booking.com/searchresults.html?aid=304142;label=gen173nr-1FCAEoggJCAlhYSDNYBGhpiAEBmAExuAEHyAEM2AEB6AEB-AECkgIBeagCAw;sid=7abf6bf275d09e8d4f617d51f4d6c803;class_interval=1;dest_id=102;dest_type=country;dtdisc=0;from_sf=1;group_adults=2;group_children=0;inac=0;index_postcard=0;label_click=undef;no_rooms=1;offset=0;postcard=0;raw_dest_type=country;room1=A%2CA;sb_price_type=total;search_selected=1;slp_r_match=0;src=index;src_elem=sb;srpvid=134253d18d97017e;ss=Ireland;ss_all=0;ss_raw=ireland;ssb=empty;sshis=0&'
headers = {"User-Agent":"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36"}


def get_token():
    # Fetch the search page and build the payload carrying the csrf_token
    r = requests.get(url)
    soup = BeautifulSoup(r.text, "lxml")
    item = soup.select_one("[name='bhc_csrf_token']")['value']

    payload = {
        'bhc_csrf_token': item,
        'logout': 1
    }
    return payload


def make_post(payload):
    # Make a POST request with the payload and parse the target page links
    res = requests.post(url, data=payload, headers=headers)
    sauce = BeautifulSoup(res.text, "lxml")
    linklist = []
    for elem in sauce.select(".sr_property_block a.hotel_name_link"):
        linklist.append(urljoin(url, elem.get("href").strip()))
    return linklist


def get_info(link):
    # Scrape the title and address from a single property page
    response = requests.get(link, headers=headers)
    soupobj = BeautifulSoup(response.text, "lxml")
    name = soupobj.select_one("h2#hp_hotel_name").get_text(strip=True)
    addr = soupobj.select_one(".hp_address_subtitle").get_text(strip=True)
    print(f'{name}\n{addr}\n')


if __name__ == '__main__':
    for link in make_post(get_token()):
        get_info(link)

edited Oct 13 at 11:50

asked Oct 12 at 16:16

asmitu

1268

python python-3.x web-scraping beautifulsoup

1 Answer

Perhaps I misunderstand you, but it looks as though you've already written the functions to call one another: get_info() calls make_post() internally, and make_post() in turn delegates fetching the token to get_token(). So in your final line of code, you should be able to replace the three calls with a single call to get_info().
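To make the chained style concrete, here is a minimal sketch in which each function calls the next one down internally. It is an illustration only: the bodies are hypothetical offline stand-ins (no network requests are made), and the link format and token value are invented for the example.

```python
# Chained style: each function obtains what it needs by calling the
# function "below" it. All data here is made up so the sketch runs offline.

def get_token():
    """Innermost step: stand-in for fetching the CSRF token."""
    return {"bhc_csrf_token": "dummy-token", "logout": 1}


def make_post():
    """Middle step: gets its own payload by calling get_token()."""
    payload = get_token()
    # A real version would POST `payload` and parse links from the response.
    return [f"/hotel/{n}?token={payload['bhc_csrf_token']}" for n in ("a", "b")]


def get_info():
    """Outermost step: drives the whole chain by calling make_post()."""
    return [f"info for {link}" for link in make_post()]


if __name__ == "__main__":
    # A single call is enough; the chain runs underneath it.
    for line in get_info():
        print(line)
```

The caller only ever sees get_info(), but each function is now tied to the one it calls and cannot easily be run or tested on its own.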

Whether or not that's "better" than calling them independently is a question of style. Personally, I think it's best to keep the functions independent and call them separately. This would mean re-writing them to take parameters that are the results of the preceding function call.

This way the functions don't have to know about each other's existence, and you will be able to change each one independently over time as requirements evolve, without needing to update the others (this is called low coupling, a very useful design quality to keep in mind).
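A rough sketch of the independent style follows, assuming the steps can be expressed as plain functions that receive their inputs as parameters. The names build_payload, absolute_links, run, fake_fetch_hrefs and fake_fetch_info are hypothetical and not part of the original script; the two "fake" functions stand in for the real HTTP calls so the example runs offline.

```python
from typing import Callable, Dict, List, Tuple
from urllib.parse import urljoin

Payload = Dict[str, object]


def build_payload(token: str) -> Payload:
    """Knows only about the token it is given, not where it came from."""
    return {"bhc_csrf_token": token, "logout": 1}


def absolute_links(base_url: str, hrefs: List[str]) -> List[str]:
    """Knows only about hrefs, not how the listing page was fetched."""
    return [urljoin(base_url, href.strip()) for href in hrefs]


def run(
    base_url: str,
    token: str,
    fetch_hrefs: Callable[[Payload], List[str]],
    fetch_info: Callable[[str], Tuple[str, str]],
) -> List[Tuple[str, str]]:
    """The only place where the individual steps are wired together."""
    payload = build_payload(token)
    links = absolute_links(base_url, fetch_hrefs(payload))
    return [fetch_info(link) for link in links]


# Offline stand-ins for the network steps (invented data, for illustration).
def fake_fetch_hrefs(payload: Payload) -> List[str]:
    return [" /hotel/ie/a.html ", "/hotel/ie/b.html"]


def fake_fetch_info(link: str) -> Tuple[str, str]:
    return (link.rsplit("/", 1)[-1], "Dublin, Ireland")


if __name__ == "__main__":
    base = "https://www.booking.com/searchresults.html"
    for name, addr in run(base, "dummy-token", fake_fetch_hrefs, fake_fetch_info):
        print(f"{name}\n{addr}\n")
```

Because each step depends only on its arguments, the fetching functions can be swapped for real requests-based ones, or for test doubles as here, without touching the others.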

answered Oct 13 at 5:52

Julian Suggate
