Solution review

When validation passes, our experts will review your solution and schedule it to run automatically. If needed, we may reach out and ask for minor adjustments to meet our quality bar.

Below are the stages for review:

Validators: Your code's output is run through validators for the particular data type. Warnings may be acceptable, but output with errors will not move on to the next review stage.
Data spot check: The data produced by your program is compared with the data on the target website via random spot checks; we also check column quality and that the number of rows matches.
Code review: If the data quality is good, we inspect the code to make sure it follows best practices and will keep working long term without breaking.
Resource optimization: If the code uses too much memory, compute, or third-party services, we will look into ways to make your pipeline more efficient and reduce our cost basis.

Maintaining your solution

It is likely that your solution will break at some point due to external factors (for example, website design changes). At the moment we don’t expect you to fix these issues, but we are working on a reward system for keeping your pipelines healthy.

If a pipeline breaks, however, our team will create a new task (challenge) with a dedicated budget to get it fixed.

Introduction to dphelper

We have seen many scenarios and therefore decided to create a set of tools to help you automate data extraction and reuse tested patterns. We have created the dphelper pip package for Python (version 3.7 and up is supported). Here are some use cases where we provide helper functions:

Parsing data. Some data will be present in an unstructured way, e.g. an HTML table parsed into a list of strings. You can provide the schema of the data you expect, and we will transform it into the correct format, for example (str) "100,234.00 eur" => (int) 100234.
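Purely for illustration, here is roughly what that conversion looks like when done by hand in plain Python (this is not the dphelper API, just the kind of work the helper saves you):

import re

def parse_price(raw: str) -> int:
    # keep only digits, commas, and dots: "100,234.00 eur" -> "100,234.00"
    cleaned = re.sub(r"[^\d.,]", "", raw)
    # drop the thousands separators, then truncate the decimal part
    return int(float(cleaned.replace(",", "")))

print(parse_price("100,234.00 eur"))  # 100234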

Avoiding getting blocked. Some targets (websites we extract data from) may perform extra checks, resulting in captchas and failure to retrieve the HTML content. We have integrated 3rd party tools which use proxy servers and other techniques, such as headless browsers, to solve this for you.

Data platform API. Sometimes you will have to reuse or break down your data acquisition pipelines to reduce complexity and improve reusability (see best practices). We provide you with an API to combine, filter, and adjust data, or to retrieve a snapshot for a specific date.
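The exact calls are not shown here, so the method name below is hypothetical (check the platform API docs for the real one); the point is the shape of the workflow:

from dphelper import DPHelper

helper = DPHelper()

# hypothetical method name: retrieve a stored snapshot for a given date
snapshot = helper.fetch_snapshot("apartments", date="2023-01-01")

# plain-Python filtering on the returned rows
cheap = [row for row in snapshot if row["price"] < 100000]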

Getting HTML content

The standard approach to getting HTML content in Python is as follows:

import requests

r = requests.get('http://example.com')
print(r.text)

Here is how you could do the same using the BeautifulSoup library (also a popular option for web scraping). This may be valuable since it can, for example, strip unnecessary whitespace for you.

from bs4 import BeautifulSoup

import requests

# request web page
resp = requests.get("http://example.com")

# get the response text. In this case it is HTML
html = resp.text

# parse the HTML
soup = BeautifulSoup(html, "html.parser")

# print the HTML as text (tags stripped)
print(soup.get_text())

Here is the pattern we recommend on our platform. It has the added value that, if needed, it can use 3rd party services to avoid blocking.

from dphelper import DPHelper

helper = DPHelper(is_verbose=True)
headers = helper.create_headers(authority="example.com")
 
content = helper.from_url('http://example.com', headers=headers)
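The returned content is the page HTML, so it can be fed straight into a parser. For example, combining it with BeautifulSoup (assuming, as the snippet above suggests, that from_url returns the HTML as a string):

from bs4 import BeautifulSoup

soup = BeautifulSoup(content, "html.parser")
print(soup.title)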

Parsing structured data

In many scenarios you will be able to access structured data (for example, JSON via an API). However, the data platform is designed in such a way that the elements of your data must have a SPECIFIC, EXACT shape (they must match the JSON Schema). So let's say the target website's API returns data about a list of apartments in this format:

[ 
    {"kaina": 5, "plotas": "70 m2"},
    {"kaina": "$25,99", "plotas": “50.22”},
    {"kaina": 0, "plotas": "12"},
]

The field names and value formats differ from our schema, and we want the data in this format:

[ 
    {"price": 5, "area": 70.00},
    {"price": 25.99, "area": 50.22},
    {"price": 0, "area": 12.00},
]

For this scenario, we can create a schema field mapping and make the transformation easier:

from dphelper import DPHelper

helper = DPHelper(is_verbose=True)

schema_mapping = {
  "kaina": "price", 
  "plotas": "area",
}

# data in the wrong format; some rows have missing or unmapped fields
partial_data = [
    {"kaina": 5, "kambariai": "1.5 kamb."},
    {"kaina": "$25,99"},
    {"kambariai": "2.2"},
]

# convert to the right format
print(helper.transform_map(schema_mapping, partial_data, parse=True))

Tip: always try to use structured data when possible. For example, if data is stored in a JSON file (or as a variable in the HTML), read the entire thing as a string and parse it as JSON. This is less fragile, since the order of attributes may (and likely will) change and new attributes will be added.

Example

HTML with JSON:

<div data-drupal-messages-fallback class="hidden"></div>
<script>
  var data = {"posts": [{"id":"1","post_title":"LP", ..}]};
</script>

Instead of doing this:

import re

pattern = re.compile(r'{"post_id":"(.*?)","post_title":"(.*?)","post_adress":"(.*?)","longitude":"(.*?)","latitude":"(.*?)","type":"(.*?)",.*?}')
results = pattern.findall(content)

Do this:

import json

# take everything between 'var data = ' and the closing ';'
raw = content.split('var data = ')[1].split(';')[0]
json_data = json.loads(raw)
posts = json_data.get('posts')

Parsing unstructured data

In some cases you will have access to structured data (for example, an API request returning JSON), but in many situations the data will be stored as unstructured HTML (plain text), for example a list of apartments rendered as an HTML table.

One valid approach is to use a regexp and convert each value into the corresponding format, for example:

import re

rg = re.compile('<tr class=".*?"><td class=".*?">(.*?)</td><td class=".*?">(.*?)</td><td class=".*?">(.*?)</td><td class=".*?">(.*?)</td><td class=".*?">(.*?)</td><td class=".*?">(.*?)</td><td class=".*?">(.*?)</td><td class=".*?">(.*?)</td><td class=".*?"><a href="(.*?)".*?></a>.*?</td><td class=".*?">(.*?)</td></tr>')
results = rg.findall(content)

tmp = []
for id, floor, area, room, status, .. in results:
  tmp.append({
     'area': float(area.replace('m2', '')),
     'status': 0 if status == 'Laisvas' else 1,
     ..
  })


However, this may be repetitive and error-prone. We provide a helper function which accepts a schema and converts the data into the right format:

import re

rg = re.compile('<tr class=".*?"><td class=".*?">(.*?)</td><td class=".*?">(.*?)</td><td class=".*?">(.*?)</td><td class=".*?">(.*?)</td><td class=".*?">(.*?)</td><td class=".*?">(.*?)</td><td class=".*?">(.*?)</td><td class=".*?">(.*?)</td><td class=".*?"><a href="(.*?)".*?></a>.*?</td><td class=".*?">(.*?)</td></tr>')
results = rg.findall(content)

apts = helper.parse_rows(
  ['id','floor','area','rooms','status'],
  results,
  verbose=True,
)
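With verbose output on, the helper logs each conversion it performs. The result should be a list of dicts with typed values, along these lines (illustrative only; the exact conversions are governed by the field types listed below):

# e.g. [{'id': '1', 'floor': 3, 'area': 50.5, 'rooms': 2, 'status': 0}, ...]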

Useful: while building your data pipeline you can pass verbose=True to make debugging easier, and later switch it to verbose=False (or delete the extra parameter).

Here is the list of currently supported field types:

id: unique identifier.
floor: floor of the apartment. Examples: "trecias" (=3), "Antras" (=2).
area: area of the apartment. Examples: "50.5" (=50.5), "50,5" (=50.5).
rooms: number of rooms in the apartment. Examples: "du" (=2), "2 kamb." (=2).
orientation: orientation. Example: "Siaure" (=["N"]).
price: price of the apartment. Example: "100.200,00" (=100200).
status: sale status of the apartment. Examples: "Laisvas" (=0), "Rezervuotas" (=1), "Parduotas" (=2).
www: extracted link.

Important: if a schema field is not recognized, its values are passed through the utility unchanged (same format).
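A minimal sketch of what we would expect, reusing parse_rows from above with a hypothetical unrecognized field name 'notes':

apts = helper.parse_rows(
  ['id', 'notes'],  # 'notes' is not a recognized field type
  [('1', 'south facing')],
)
# expected: the 'notes' values come through untouched,
# e.g. [{'id': '1', 'notes': 'south facing'}]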

JSON schema

Our platform operates on structured data. We chose JSON Schema as the tool to make sure data is in a standard format.


At this point we only support lists of structured data. In other words: a list of items with a certain shape. Learn more about the open JSON Schema format.

Example of JSON schema

Internally we save the definition for each data challenge as a JSON schema. As a contributor, you can open it in a text editor and inspect the details. At the moment, mainly to avoid breaking changes, editing schemas is restricted to platform moderators.

Example: a schema for a list of apartments with pricing and meta information (area of the apartment, floor, rooms, etc.) could look like this:

{
 "$schema": "http://json-schema.org/schema#",
 "type": "array",
 "items": {
   "type": "object",
   "properties": {
     "id": {
       "type": "string"
     },
     "www": {
       "type": "string"
     },
     "area": {
       "type": "number"
     },
     "rooms": {
       "type": "integer"
     },
     "floor": {
       "type": "integer"
     },
     "price": {
       "type": "number"
     },
     "status": {
       "type": "integer"
     }
   },
   "required": [
     "area",
     "floor",
     "id",
     "rooms",
     "status"
   ]
 }
}
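To check your output against a schema like this locally, you can use the third-party jsonschema pip package (not part of dphelper; the file name below is just an assumption for the example):

import json

from jsonschema import validate

# load the schema above, saved locally as e.g. apartments.schema.json
with open("apartments.schema.json") as f:
    schema = json.load(f)

data = [
    {"id": "1", "area": 70.0, "rooms": 2, "floor": 3, "price": 100234, "status": 0},
]

# raises jsonschema.exceptions.ValidationError if the data does not match
validate(instance=data, schema=schema)
print("data matches the schema")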