Getting HTML content – Dataplatform guide for contributors

The standard way to get HTML content in Python is with the requests library:

import requests

r = requests.get('http://example.com')
print(r.text)
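
In practice a request can hang or come back with an error page. A minimal sketch of a more defensive fetch follows; the helper name `fetch_html` and the 10-second default timeout are illustrative assumptions, not part of the platform:

```python
import requests


def fetch_html(url, timeout=10):
    """Return the body of `url` as text, raising on HTTP errors.

    The function name and default timeout are illustrative choices.
    """
    # Without a timeout, requests may wait indefinitely for a response.
    resp = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError on 4xx/5xx instead of returning an error page.
    resp.raise_for_status()
    return resp.text
```

Calling `fetch_html('http://example.com')` returns the same string as `r.text`, but fails loudly on a 404 or 500 response.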

This is how you could do the same using the Beautiful Soup library (also a popular option for web scraping). This can be valuable because it can, for example, strip out unnecessary whitespace for you.

from bs4 import BeautifulSoup

import requests

# request web page
resp = requests.get("http://example.com")

# get the response text. in this case it is HTML
html = resp.text

# parse the HTML
soup = BeautifulSoup(html, "html.parser")

# print the HTML as text
print(soup.get_text())
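
To illustrate the whitespace handling, here is a small sketch that parses an inline HTML string (an invented sample, so no network access is needed) and uses `get_text` with `strip=True` to trim each text fragment:

```python
from bs4 import BeautifulSoup

# Invented sample markup with extra whitespace around the text.
html = """
<html>
  <body>
    <h1>   Example Domain   </h1>
    <p>
        This domain is for use in examples.
    </p>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# strip=True trims each text fragment and drops whitespace-only ones;
# the separator is placed between the remaining fragments.
text = soup.get_text(separator="\n", strip=True)
print(text)
```

Without `strip=True`, the output would keep the indentation and blank lines from the original markup.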

Here is the pattern we recommend using on our platform. It has the added benefit that, if needed, it can use third-party services to avoid being blocked.

from dphelper import DPHelper

helper = DPHelper(is_verbose=True)
headers = helper.create_headers(authority="example.com")
 
content = helper.from_url('http://example.com', headers=headers)
