Getting HTML content

The standard approach to getting HTML content in Python is the requests library:

import requests

# fetch the page and print the raw HTML
r = requests.get('http://example.com')
print(r.text)

Here is how you could use the Beautiful Soup library (another popular option for web scraping). This may be valuable since it can, for example, strip out unnecessary whitespace for you.

from bs4 import BeautifulSoup
import requests

# request the web page
resp = requests.get("http://example.com")

# get the response text, in this case HTML
html = resp.text

# parse the HTML
soup = BeautifulSoup(html, "html.parser")

# print the page content as plain text, with markup and extra whitespace stripped
print(soup.get_text(" ", strip=True))
Here is the pattern we recommend on our platform. Its added value is that, if needed, it can use third-party services to avoid blocking.

from dphelper import DPHelper

helper = DPHelper(is_verbose=True)
headers = helper.create_headers(authority="example.com")

content = helper.from_url('http://example.com', headers=headers)

Parsing structured data

In many scenarios you will be able to access structured data (for example, JSON via an API). However, the data platform is designed in such a way that the elements of your data must have a SPECIFIC, EXACT shape (matching a JSON Schema). So let's say the target website's API returns a list of apartments in this format:

[ 
    {"kaina": 5, "plotas": "70 m2"},
    {"kaina": "$25,99", "plotas": “50.22”},
    {"kaina": 0, "plotas": "12"},
]

The source has different field names and different formats, while we want the data to be in this format:

[ 
    {"price": 5, "area": 70.00},
    {"price": 25.99, "area": 50.22},
    {"price": 0, "area": 12.00},
]

For this scenario, we can create a schema field mapping and make the transformation easier:

from dphelper import DPHelper

helper = DPHelper(is_verbose=True)

schema_mapping = {
  "kaina": "price", 
  "plotas": "area",
}

# source data in the wrong (and partial) format
partial_data = [
    {"kaina": 5, "kambariai": "1.5 kamb."},
    {"kaina": "$25,99"},
    {"kambariai": "2.2"},
]

# convert to the right format
print(helper.transform_map(schema_mapping, partial_data, parse=True))

Tip: always try to use structured data when possible. For example, if data is stored in a JSON file (or as a variable in HTML), read the entire thing as a string and parse it as JSON. This is less fragile, since the order of attributes may (and likely will) change, and new attributes will be added.

Example

HTML with JSON:

<div data-drupal-messages-fallback class="hidden"></div>
<script>
  var data = {"posts": [{"id":"1","post_title":"LP", ..}]};
</script>

Instead of doing this:

import re

pattern = re.compile(r'{"post_id":"(.*?)","post_title":"(.*?)","post_adress":"(.*?)","longitude":"(.*?)","latitude":"(.*?)","type":"(.*?)",.*?}')
results = pattern.findall(content)

Do this:

import json

content = content.split('var data = ')[1].split(';')[0]
json_data = json.loads(content)
posts = json_data.get('posts')

Parsing unstructured data

In some cases you will have access to structured data (for example, an API request returning JSON), but in many situations data will be stored as HTML in an unstructured format (text), for example a list of apartments rendered as a table.

One valid approach is to use a regular expression and convert each captured value into the corresponding format, for example:

import re

# content: the page HTML fetched earlier
rg = re.compile('<tr class=".*?"><td class=".*?">(.*?)</td><td class=".*?">(.*?)</td><td class=".*?">(.*?)</td><td class=".*?">(.*?)</td><td class=".*?">(.*?)</td><td class=".*?">(.*?)</td><td class=".*?">(.*?)</td><td class=".*?">(.*?)</td><td class=".*?"><a href="(.*?)".*?></a>.*?</td><td class=".*?">(.*?)</td></tr>')
results = rg.findall(content)

tmp = []
for id, floor, area, room, status, *rest in results:
    tmp.append({
        'area': float(area.replace('m2', '')),
        'status': 0 if status == 'Laisvas' else 1,
        # ... remaining fields
    })


However, this may be repetitive and error-prone. We provide a helper function which accepts a schema and converts the data into the right format:

import re

from dphelper import DPHelper

helper = DPHelper()

rg = re.compile('<tr class=".*?"><td class=".*?">(.*?)</td><td class=".*?">(.*?)</td><td class=".*?">(.*?)</td><td class=".*?">(.*?)</td><td class=".*?">(.*?)</td><td class=".*?">(.*?)</td><td class=".*?">(.*?)</td><td class=".*?">(.*?)</td><td class=".*?"><a href="(.*?)".*?></a>.*?</td><td class=".*?">(.*?)</td></tr>')
results = rg.findall(content)

apts = helper.parse_rows(
  ['id', 'floor', 'area', 'rooms', 'status'],
  results,
  verbose=True,
)
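
For illustration, here is a hypothetical input row and the shape of output we would expect, based on the field types listed below:

# hypothetical input row:
#   ('A1', 'Antras', '50,5', '2 kamb.', 'Laisvas')
# expected output element:
#   {'id': 'A1', 'floor': 2, 'area': 50.5, 'rooms': 2, 'status': 0}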

Useful: while building a data pipeline you can use the flag verbose=True to make debugging easier, and later switch it to verbose=False (or remove the parameter).

Here is the list of currently supported field types (field name: description, example inputs and outputs):

id: unique identifier
floor: floor of the apartment, e.g. "trecias" (=3), "Antras" (=2)
area: area of the apartment, e.g. "50.5" (=50.5), "50,5" (=50.5)
rooms: number of rooms in the apartment, e.g. "du" (=2), "2 kamb." (=2)
orientation: orientation, e.g. "Siaure" (=["N"])
price: price of the apartment, e.g. "100.200,00" (=100200)
status: status of the apartment, e.g. "Laisvas" (=0), "Rezervuotas" (=1), "Parduotas" (=2)
www: extracts a link

Important: if a schema field is not recognized, its values are kept unchanged (same format) after passing through the utility.
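
For example, assuming a hypothetical unsupported field named note, values for it would simply pass through:

# "note" is not a supported field type (hypothetical example),
# so the utility keeps its values unchanged
apts = helper.parse_rows(
  ['id', 'floor', 'note'],
  [('A1', 'Antras', 'corner unit')],
)
# expected: [{'id': 'A1', 'floor': 2, 'note': 'corner unit'}]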

JSON schema

Our platform operates on structured data. We chose JSON Schema as the tool to make sure data is in a standard format.

At this point we only support lists of structured data; in other words, a list of items of a certain shape. Learn more about the open JSON Schema format.

Example of JSON schema

Internally we save the definition for each data challenge as a JSON schema. As a contributor, you can open the text editor and inspect the details. At the moment, mainly to avoid breaking changes, editing of schemas is restricted to platform moderators.

Example: a schema for a list of apartments with pricing and meta information (area of the apartment, floor, rooms, etc.) could look like this:

{
 "$schema": "http://json-schema.org/schema#",
 "type": "array",
 "items": {
   "type": "object",
   "properties": {
     "id": {
       "type": "string"
     },
     "www": {
       "type": "string"
     },
     "area": {
       "type": "number"
     },
     "rooms": {
       "type": "integer"
     },
     "floor": {
       "type": "integer"
     },
     "price": {
       "type": "number"
     },
     "status": {
       "type": "integer"
     }
   },
   "required": [
     "area",
     "floor",
     "id",
     "rooms",
     "status"
   ]
 }
}

Nullable and optional fields

To require that some fields are always present, add them to the “required” list.

Warning: this is different from the nullable concept below. A field can be required but nullable. Conversely, a field can be optional but, if provided, must not be null. Use the example below as a pattern for handling an optional, non-nullable field when it is missing.

import requests

items = requests.get('http://items.com').json()
results = []
for item in items:
    result = {
        'price': item.get('kaina'),  # always present
        'name': item.get('pavadinimas'),  # always present
    }

    # schema does not allow a NULL (None) value
    discount = item.get('discount')
    if discount is not None:
        result['discount'] = discount

    results.append(result)

Nullable fields

To allow a “None” value for a field, you must make it nullable.

Use case example: in an apartment list table, sold apartments usually no longer have pricing information, while for apartments still available to purchase this data is present.

Currently the way to make a field nullable is to use the text version of the JSON schema editor and change the type of the field, for example:

    "price": {
       "type": [
         "number",
         "null"
       ]
     },
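
On the scraper side, a nullable price lets you emit None explicitly for sold apartments. A minimal sketch, assuming parsed_rows is a hypothetical list of already-parsed apartment dicts:

results = []
for apartment in parsed_rows:  # parsed_rows: hypothetical pre-parsed input
    results.append({
        'id': apartment['id'],
        'status': apartment['status'],
        # sold apartments (status == 2) no longer show pricing, so emit None;
        # this only validates if "price" is nullable in the schema
        'price': apartment['price'] if apartment['status'] != 2 else None,
    })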

Custom validators

One of the most important goals of the data platform is to ensure data quality. One of the mechanisms is JSON Schema, as discussed above. However, we also provide the ability to define custom rules for your dataset. Here are some scenarios where this may be useful.

Contextual validation. While JSON Schema works for most scenarios where we need to validate a specific value (for example, the number of rooms in an apartment), it cannot validate context (for example, that when the number of rooms is 1, the area should be less than 50 square meters).

Custom logic. You may need very custom logic, for example normalizing a value (making a string lowercase and then checking whether it is in an allowed list). A rough sketch of both scenarios is shown below.
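
A rough sketch of both scenarios (the field name raw_status and the allowed values are hypothetical):

ALLOWED_STATUSES = {'laisvas', 'rezervuotas', 'parduotas'}

def check_context(apartment):
    # contextual rule: a 1-room apartment should be smaller than 50 m2
    if apartment.get('rooms') == 1 and apartment.get('area', 0) >= 50:
        raise Exception('Apartment %s: 1 room but area %.2f m2'
                        % (apartment.get('id'), apartment['area']))

def check_status(apartment):
    # custom logic: normalize the value, then check it against an allowed list
    status = str(apartment.get('raw_status', '')).lower()
    if status and status not in ALLOWED_STATUSES:
        raise Exception('Apartment %s: unknown status %r'
                        % (apartment.get('id'), status))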

We support warning-level and error-level validators. A warning-level validator indicates that data MAY be invalid; it is implemented by printing to the console (the message is shown as a warning). Error-level validations block the pipeline and need immediate attention; they are implemented by raising an exception with an error message. In both cases the input data is read from standard input as JSON.

Example validator

Let's explore a real-life scenario: a custom validator to ensure that each apartment in the list has a reasonable price per square meter. For this we define a (larger) range outside of which an exception is raised, and a narrower range for warnings.

import sys
import json

apartments = json.loads(sys.stdin.read())

# CONFIG START
WARNING_PRICE_MIN = 1500
WARNING_PRICE_MAX = 6000

ERROR_PRICE_MIN = 500
ERROR_PRICE_MAX = 10000
# CONFIG END

for apartment in apartments:
    flat_id = apartment.get('id')
    price = apartment.get('price')
    area = apartment.get('area')

    if area and price:
        price_sqm = price / area
        if price_sqm < ERROR_PRICE_MIN:
            raise Exception('Price for %s too low %f' % (flat_id, price_sqm))
        elif price_sqm > ERROR_PRICE_MAX:
            raise Exception('Price for %s too high %f' % (flat_id, price_sqm))
        elif price_sqm < WARNING_PRICE_MIN:
            print('Price for %s MAY BE too low %f' % (flat_id, price_sqm))
        elif price_sqm > WARNING_PRICE_MAX:
            print('Price for %s MAY BE too high %f' % (flat_id, price_sqm))

Let's break down the parts of the validator. First, let's read the data:

import sys
import json
apartments = json.loads(sys.stdin.read())

Then write your logic. This is an example of warning-level handling:

elif price_sqm < WARNING_PRICE_MIN:
    print('Price for %s MAY BE too low %f' % (flat_id, price_sqm))

This is an example of error-level handling:

elif price_sqm > ERROR_PRICE_MAX:
    raise Exception('Price for %s too high %f' % (flat_id, price_sqm))

It is possible that a validator triggers both a warning and an error. In that case the higher-importance (error) level validation is shown in our tool.

Testing your validator

Now that you have your validator code, it's time to test it! We recommend testing with both a valid and an invalid scenario to make sure it works. First, let's create a valid JSON file, APARTMENTS_VALID.json:

[{"area": 50, "price": 200000, "id": "1"}]

Test it by piping the file in via the terminal:

>> python3 price_validator.py < APARTMENTS_VALID.json
>>

Since the data in the JSON file is valid, as expected nothing is printed. Now let's create an invalid data sample, APARTMENTS_INVALID.json:

[{"area": 50, "price": 20000000, "id": "1"}]

Let's run it from the command line:

>> python3 price_validator.py < APARTMENTS_INVALID.json
Traceback (most recent call last):
  ...
Exception: Price for 1 too high 400000.000000

We have just finished writing and testing our first validator! Good luck creating new rules to ensure data quality.


Best practices

Avoid mixing several ideas into one validator, for example checking the price per square meter as above and also checking whether apartment IDs are unique. Write separate validators instead: when an error occurs (an exception is raised) code execution stops, so separate validators give you more insight into data quality.
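
For example, the ID-uniqueness check mentioned above would live in its own validator. A minimal sketch:

import sys
import json
from collections import Counter

apartments = json.loads(sys.stdin.read())

# count how many times each apartment ID appears
counts = Counter(apartment.get('id') for apartment in apartments)
duplicates = [flat_id for flat_id, n in counts.items() if n > 1]
if duplicates:
    raise Exception('Duplicate apartment IDs: %s'
                    % ', '.join(str(d) for d in duplicates))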