Sourcing content from a webpage in an organized manner
I've written a script in Python to grab the title and address from different pages of a website. The script first collects all the property links from the landing page and then goes one layer deep to collect the title and address of each property. When I run the script, I get the expected results.
Would it be a better approach to call a single function and have the rest of the functions work like a chain to produce the same results? If so, what is the right way to do so?
This is the link to that site
Here is the working script:
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

url = 'https://www.booking.com/searchresults.html?aid=304142;label=gen173nr-1FCAEoggJCAlhYSDNYBGhpiAEBmAExuAEHyAEM2AEB6AEB-AECkgIBeagCAw;sid=7abf6bf275d09e8d4f617d51f4d6c803;class_interval=1;dest_id=102;dest_type=country;dtdisc=0;from_sf=1;group_adults=2;group_children=0;inac=0;index_postcard=0;label_click=undef;no_rooms=1;offset=0;postcard=0;raw_dest_type=country;room1=A%2CA;sb_price_type=total;search_selected=1;slp_r_match=0;src=index;src_elem=sb;srpvid=134253d18d97017e;ss=Ireland;ss_all=0;ss_raw=ireland;ssb=empty;sshis=0&'
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36"}


def get_token():
    # Fetch the landing page and build the payload holding the CSRF token.
    r = requests.get(url)
    soup = BeautifulSoup(r.text, "lxml")
    item = soup.select_one("[name='bhc_csrf_token']")['value']
    payload = {
        'bhc_csrf_token': item,
        'logout': 1
    }
    return payload


def make_post(payload):
    # Send a POST request with the payload and collect the target page links.
    res = requests.post(url, data=payload, headers=headers)
    sauce = BeautifulSoup(res.text, "lxml")
    linklist = []
    for elem in sauce.select(".sr_property_block a.hotel_name_link"):
        linklist.append(urljoin(url, elem.get("href").strip()))
    return linklist


def get_info(link):
    # Scrape and print the title and address found at a single property link.
    response = requests.get(link, headers=headers)
    soupobj = BeautifulSoup(response.text, "lxml")
    name = soupobj.select_one("h2#hp_hotel_name").get_text(strip=True)
    addr = soupobj.select_one(".hp_address_subtitle").get_text(strip=True)
    print(f'{name}\n{addr}\n')


if __name__ == '__main__':
    for link in make_post(get_token()):
        get_info(link)
python python-3.x web-scraping beautifulsoup
edited Oct 13 at 11:50
asked Oct 12 at 16:16
asmitu
1 Answer
Perhaps I misunderstand you, but it looks as though you've already written the functions to call one another: get_info() calls make_post() internally, and make_post() in turn delegates getting the token to get_token() by calling it internally. So in your final line of code, you should be able to replace the three calls with a single call to get_info().
Whether or not that's "better" than calling them independently is a question of style. Personally, I think it's best to keep the functions independent and call them separately. This would mean rewriting them to take parameters that are the results of the preceding function call.
This way the functions don't have to know about each other's existence, which means you can change each one independently over time as requirements evolve, without needing to update the others (this is called low coupling, a very useful design quality to keep in mind).
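As a rough sketch of what I mean (not your actual script): each step below takes the previous step's result as a parameter, and one entry point wires the steps together. The names `scrape`, `extract_links`, `extract_info` and the `fetch` callable are hypothetical. I've used the standard library's `html.parser` in place of BeautifulSoup so the sketch runs without extra installs, and I've left out the CSRF-token POST; in a real script `fetch` might wrap your `requests` calls.

```python
from html.parser import HTMLParser
from typing import Callable, Dict, Iterator, List
from urllib.parse import urljoin


class _Collector(HTMLParser):
    """Collect hotel links and the name/address text from one HTML page."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []        # hrefs of <a class="hotel_name_link">
        self.fields: Dict[str, str] = {}  # "name" / "addr" -> raw text
        self._current = None              # field currently being captured
        self._tag = None                  # tag that opened the capture

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        # Simplification: the .sr_property_block ancestor is not checked.
        if tag == "a" and "hotel_name_link" in classes and attrs.get("href"):
            self.links.append(attrs["href"].strip())
        if attrs.get("id") == "hp_hotel_name":
            self._current, self._tag = "name", tag
        elif "hp_address_subtitle" in classes:
            self._current, self._tag = "addr", tag

    def handle_data(self, data):
        if self._current:
            self.fields[self._current] = self.fields.get(self._current, "") + data

    def handle_endtag(self, tag):
        # Simplification: a nested tag of the same type ends the capture early.
        if tag == self._tag:
            self._current = self._tag = None


def _parse(html: str) -> _Collector:
    collector = _Collector()
    collector.feed(html)
    collector.close()
    return collector


def extract_links(html: str, base_url: str) -> List[str]:
    """Step 1: turn a results page into absolute property links."""
    return [urljoin(base_url, href) for href in _parse(html).links]


def extract_info(html: str) -> Dict[str, str]:
    """Step 2: turn a property page into its name and address."""
    fields = _parse(html).fields
    return {key: fields.get(key, "").strip() for key in ("name", "addr")}


def scrape(start_url: str, fetch: Callable[[str], str]) -> Iterator[Dict[str, str]]:
    """Single entry point: only this function knows the order of the steps."""
    for link in extract_links(fetch(start_url), start_url):
        yield extract_info(fetch(link))


if __name__ == "__main__":
    # Canned pages stand in for the network so the sketch runs offline.
    pages = {
        "https://example.com/search":
            '<div class="sr_property_block">'
            '<a class="hotel_name_link" href="/hotel/a.html">A</a></div>',
        "https://example.com/hotel/a.html":
            '<h2 id="hp_hotel_name">Hotel A</h2>'
            '<span class="hp_address_subtitle">1 Main St, Dublin</span>',
    }
    for info in scrape("https://example.com/search", pages.__getitem__):
        print(f"{info['name']}\n{info['addr']}\n")
```

Because the parsing steps never touch the network themselves, each one can be tested with a plain HTML string, and swapping the way pages are fetched only means passing a different `fetch`.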
answered Oct 13 at 5:52
Julian Suggate