Introduction to dphelper

We have seen many scenarios and therefore decided to create a set of tools that will help you automate data extraction and reuse tested patterns. We have created the dphelper pip package for Python (version 3.7 and up is supported). Here are some use cases where we provide helper functions:

Parsing data. Some data will be present in an unstructured way, e.g. an HTML table consisting of a list of strings. You can provide a schema of the data you expect, and we will transform it into the correct format, for example (str) "100,234.00 eur" => (int) 100234 (see the first sketch after this list).

Avoiding getting blocked. Some targets (websites from which we extract data) may perform extra checks, which can result in captchas and in failing to retrieve the HTML content. We have integrated 3rd-party tools that use proxy servers and other techniques, such as headless browsers, to solve this for you (see the second sketch after this list).

Data platform API. Sometimes you will have to reuse or break down your data acquisition pipelines to reduce their complexity and improve reusability (see best practices). We provide an API to combine, filter, and adjust data, or to retrieve a snapshot for a specific date.
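
To make the first use case concrete, here is a plain-Python sketch of the kind of normalization dphelper performs for you. This is an illustration of the idea, not the dphelper API itself:

# plain-Python sketch of the kind of value normalization dphelper performs
raw = "100,234.00 eur"

# strip the currency suffix and the thousands separators, then convert
number = raw.lower().replace("eur", "").strip().replace(",", "")
print(int(float(number)))  # 100234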
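
For the second use case, the underlying technique is routing requests through a proxy. Below is a generic sketch using plain requests; dphelper wires this up for you when needed, and the proxy URL shown here is only a placeholder:

import requests

# placeholder proxy address; dphelper manages real proxy pools internally
proxies = {
    "http": "http://user:password@proxy.example.com:8080",
    "https": "http://user:password@proxy.example.com:8080",
}

resp = requests.get("http://example.com", proxies=proxies, timeout=30)
print(resp.status_code)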

Getting HTML content

The standard approach of getting HTML content in Python is as follows:

import requests

r = requests.get('http://example.com')
print(r.text)

This is how you could do it using the Beautiful Soup library (also a popular option for web scraping). This may be valuable since it can, for example, strip out unnecessary whitespace for you.

import requests
from bs4 import BeautifulSoup

# request the web page
resp = requests.get("http://example.com")

# get the response text; in this case it is HTML
html = resp.text

# parse the HTML
soup = BeautifulSoup(html, "html.parser")

# print the parsed HTML as plain text
print(soup.get_text())

Here is the pattern we recommend using on our platform. Its added value is that, if needed, it can use 3rd-party services to avoid blocking.

from dphelper import DPHelper

helper = DPHelper(is_verbose=True)
headers = helper.create_headers(authority="example.com")
 
content = helper.from_url('http://example.com', headers=headers)
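
You can also feed the result straight into Beautiful Soup from the earlier example, assuming from_url returns the page's HTML as a string:

from bs4 import BeautifulSoup

soup = BeautifulSoup(content, "html.parser")
print(soup.title)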

Parsing structured data

In many scenarios you will be able to access structured data (for example, JSON via an API). However, the data platform is designed in such a way that the elements of your data must have a specific, exact shape (i.e., match a JSON Schema). So let's say the API of a target website returns data about a list of apartments in this format:

[ 
    {"kaina": 5, "plotas": "70 m2"},
    {"kaina": "$25,99", "plotas": “50.22”},
    {"kaina": 0, "plotas": "12"},
]

The source schema has different field names and different value formats, while we want the data to be in this format:

[ 
    {"price": 5, "area": 70.00},
    {"price": 25.99, "area": 50.22},
    {"price": 0, "area": 12.00},
]
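
For reference, here is a minimal sketch of validating one element of the target format with the jsonschema package; the field names are taken from the example above, and the exact schema your pipeline uses may differ:

from jsonschema import validate

# minimal schema sketch for one apartment record; adjust to your pipeline
apartment_schema = {
    "type": "object",
    "properties": {
        "price": {"type": "number"},
        "area": {"type": "number"},
    },
    "required": ["price", "area"],
}

# raises jsonschema.ValidationError if the record has the wrong shape
validate(instance={"price": 25.99, "area": 50.22}, schema=apartment_schema)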

For this scenario, we can create a schema field mapping and perform the transformation more easily:

from dphelper import DPHelper

helper = DPHelper(is_verbose=True)

schema_mapping = {
  "kaina": "price", 
  "plotas": "area",
}

# data in the source (invalid) format; some fields missing or unparsed
partial_data = [
    {"kaina": 5, "kambariai": "1.5 kamb."},
    {"kaina": "$25,99"},
    {"kambariai": "2.2"},
]

# convert to the right format
print(helper.transform_map(schema_mapping, partial_data, parse=True))

Tip: always try to use structured data when possible. For example, if the data is stored in a JSON file (or as a variable in the HTML), read the entire thing as a string and parse it as JSON. This is less fragile than regex extraction, since the order of attributes may (and likely will) change, and new attributes will be added.

Example

HTML with JSON:

<div data-drupal-messages-fallback class="hidden"></div>
<script>
  var data = {"posts": [{"id":"1","post_title":"LP", ..}]};
</script>

Instead of doing this:

import re

pattern = re.compile(r'{"post_id":"(.*?)","post_title":"(.*?)","post_adress":"(.*?)","longitude":"(.*?)","latitude":"(.*?)","type":"(.*?)",.*?}')
results = pattern.findall(content)

Do this instead:

import json

# cut out the JSON assigned to `var data` and parse it
content = content.split('var data = ')[1].split(';')[0]
json_data = json.loads(content)
posts = json_data.get('posts')
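
Putting it all together with the fetch step, here is an end-to-end sketch; the URL and the embedded-variable marker come from the example above, so adjust them to your target:

import json

from dphelper import DPHelper

helper = DPHelper(is_verbose=True)
headers = helper.create_headers(authority="example.com")

# fetch the page that embeds the JSON in a <script> tag
content = helper.from_url('http://example.com', headers=headers)

# cut out the JSON assigned to `var data` and parse it
raw_json = content.split('var data = ')[1].split(';')[0]
posts = json.loads(raw_json).get('posts')
print(posts)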