{"id":28,"date":"2023-11-02T22:50:53","date_gmt":"2023-11-02T22:50:53","guid":{"rendered":"https:\/\/blog.dataplatform.lt\/?p=28"},"modified":"2023-11-02T22:50:53","modified_gmt":"2023-11-02T22:50:53","slug":"getting-html-content","status":"publish","type":"post","link":"https:\/\/blog.dataplatform.lt\/?p=28","title":{"rendered":"Getting HTML content"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\">The standard way to get HTML content in Python is with the requests library:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import requests\n\n# the URL must include the scheme (http:\/\/ or https:\/\/)\nr = requests.get('http:\/\/example.com')\nprint(r.text)<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">This is how you could use the Beautiful Soup library, another popular option for web scraping. It can be valuable because it can, for example, strip out unnecessary whitespace for you.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>from bs4 import BeautifulSoup\n\nimport requests\n\n# request web page\nresp = requests.get(\"http:\/\/example.com\")\n\n# get the response text; in this case it is HTML\nhtml = resp.text\n\n# parse the HTML\nsoup = BeautifulSoup(html, \"html.parser\")\n\n# print the HTML as text\nprint(soup.get_text())\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Here is the pattern we recommend using on our platform. It has the added value that, if needed, it can use third-party services to avoid blocking.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>from dphelper import DPHelper\n\nhelper = DPHelper(is_verbose=True)\nheaders = helper.create_headers(authority=\"example.com\")\n\ncontent = helper.from_url('http:\/\/example.com', headers=headers)<\/code><\/pre>\n","protected":false},"excerpt":{"rendered":"<p>The standard way to get HTML content in Python is with the requests library: This is how you could use the Beautiful Soup library, another popular option for web scraping. It can be valuable because it can, for example, strip out unnecessary whitespace for you. Here is the pattern we recommend using on our platform. 
It &hellip; <a href=\"https:\/\/blog.dataplatform.lt\/?p=28\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Getting HTML content&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[8],"tags":[9],"class_list":["post-28","post","type-post","status-publish","format-standard","hentry","category-helper-library-dphelper","tag-dphelper"],"_links":{"self":[{"href":"https:\/\/blog.dataplatform.lt\/index.php?rest_route=\/wp\/v2\/posts\/28","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blog.dataplatform.lt\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.dataplatform.lt\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.dataplatform.lt\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.dataplatform.lt\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=28"}],"version-history":[{"count":1,"href":"https:\/\/blog.dataplatform.lt\/index.php?rest_route=\/wp\/v2\/posts\/28\/revisions"}],"predecessor-version":[{"id":29,"href":"https:\/\/blog.dataplatform.lt\/index.php?rest_route=\/wp\/v2\/posts\/28\/revisions\/29"}],"wp:attachment":[{"href":"https:\/\/blog.dataplatform.lt\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=28"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.dataplatform.lt\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=28"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.dataplatform.lt\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=28"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}