Helper library (dphelper) – Dataplatform guide for contributors

Image service

This section describes how to use our internal image service. First you need to initialize DPHelper with your api_key, which is stored on the Dataplatform backend and associated with your user. Note that we do not have a UEX yet to create, show, or find the key.

from dphelper import DPHelper

dphelper = DPHelper(api_key='ADD_YOUR_KEY_TO_TEST')

After authorizing with DPHelper, you can use the image service. To upload a single image:

dphelper.upload_image_from_url('https://www.google.com/images/branding/googlelogo/2x/googlelogo_color_92x30dp.png')

To upload multiple images by URL:

dphelper = DPHelper(api_key='ADD_YOUR_KEY_TO_TEST')
image_urls = [
    'https://google.com/image1.png',
    'https://google.com/image2.png',
]
results = dphelper.upload_all_images(image_urls, max_concurrent=10)

Note: the API is permissive, i.e. it does not throw if some of the images are already downloaded.
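
Before passing a long list to upload_all_images, it may help to drop duplicates and entries that are not absolute URLs. The following is a minimal sketch using only the standard library; prepare_image_urls is a hypothetical helper name and is not part of dphelper.

```python
from urllib.parse import urlparse


def prepare_image_urls(urls):
    """Keep the first occurrence of each absolute http(s) URL, in order."""
    seen = set()
    cleaned = []
    for url in urls:
        url = url.strip()
        parts = urlparse(url)
        # skip relative paths and anything that is not http/https
        if parts.scheme not in ("http", "https") or not parts.netloc:
            continue
        if url in seen:
            continue
        seen.add(url)
        cleaned.append(url)
    return cleaned


image_urls = prepare_image_urls([
    "https://google.com/image1.png",
    " https://google.com/image1.png",  # duplicate with stray whitespace
    "image2.png",                      # not an absolute URL
    "https://google.com/image2.png",
])
print(image_urls)
```

The cleaned list can then be passed to upload_all_images as shown above.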

Geocoding

DPHelper can convert an address to coordinates (lat, lng) or the reverse. However, this is a gated feature. If you need it, ask your coordinator for an API key.

Usage pattern

from dphelper import DPHelper

helper = DPHelper(api_key='XXX', geo_provider='mixed')
print(helper.geocode(location='Vilnius', is_reverse=False))
print(helper.get_coords("Didlaukio g. 59, Vilnius"))

Geo Provider

rc – cheap, strict format, good for big LT cities; use it at the risk of getting no coords;
google – moderate cost, loose format, handles worldwide addresses;
mixed – cheap and safe, because it uses rc where applicable and google otherwise;
any – uses “google”.
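
The "mixed" behaviour described above can be pictured as a simple fallback. This is only a sketch of the idea, not dphelper's implementation: geocode_mixed and both stub providers are hypothetical, and the coordinates are placeholder values.

```python
from typing import Callable, Optional, Tuple

Coords = Tuple[float, float]
Geocoder = Callable[[str], Optional[Coords]]


def geocode_mixed(address: str, cheap: Geocoder, fallback: Geocoder) -> Optional[Coords]:
    """Try the cheap, strict provider first; use the fallback only if it finds nothing."""
    coords = cheap(address)
    if coords is not None:
        return coords
    return fallback(address)


def strict_stub(address: str) -> Optional[Coords]:
    # stand-in for a strict provider: knows only one address (placeholder coords)
    known = {"Didlaukio g. 59, Vilnius": (1.0, 2.0)}
    return known.get(address)


def loose_stub(address: str) -> Optional[Coords]:
    # stand-in for a loose provider: always returns something (placeholder coords)
    return (3.0, 4.0)


print(geocode_mixed("Didlaukio g. 59, Vilnius", strict_stub, loose_stub))
print(geocode_mixed("Martinavos k., Martinavos g. 8", strict_stub, loose_stub))
```

With this shape, the more expensive provider is called only for addresses the cheap one cannot resolve.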

“rc” Supported examples:

Vilniaus g. 1, Vilnius 
žvejų g . 1 Lazdijai 
Didlaukio g. 59, 08302 Vilnius, Lithuania 
Genio g. 59A-1012 Vilnius, Lithuania 
.,,,,,.Lazdijai žvejų g 1 lt67120 lietuva.  . 
Vilnius genio gatvė 9 
Vilnius genio skveras 9, 
genio a. 9 Žemaičių kalvarija LT 
Vilnius, Didlaukio g. 59 
Kudirkos Naumiestis Dariaus ir Girėno g. 1 Lietuva 
Lazdijai žvejų g.1 
Lazdijai m. k. čiurlionio g. 9 
m. k. čiurlionio g. 9 Lazdijai 
M. K. Čiurlionio g. 9, Lazdijai, 67104 Lazdijų r. sav. 
Vilniaus g. 17, Parudaminys, Vilniaus raj. 
Vilniaus g. 17, Parudaminys, Vilniaus rajonas 
Vilniaus g. 17, Parudaminio k., Vilniaus raj.

Unsupported examples:

Martinavos k., Martinavos g. 8
Kauno g. 1, Šilagalio k.
Garnio g. 32, Gineitiškių k.
Parudaminio k. Vilniaus raj. Vilniaus g. 17
lt67120

Introduction to dphelper

We have seen many scenarios and therefore decided to create a set of tools that help you automate data extraction and reuse tested patterns. We have created the dphelper pip package for Python (version 3.7 and later is supported). Here are some use cases where we provide helper functions:

Parsing data. Some data will be present in an unstructured way, e.g. an HTML table consisting of a list of strings. You can provide the schema of the data you expect, and we will transform it into the correct format, for example (str) “100,234.00 eur” => (int) 100234.

Avoiding getting blocked. Some targets (websites from which we extract data) may perform extra checks, which results in captchas and makes it impossible to retrieve the HTML content. We have integrated 3rd party tools which use proxy servers and other techniques, such as headless browsers, to solve this for you.

Data platform API. Sometimes you will have to reuse or break down your data acquisition pipelines to reduce their complexity and improve reusability (see best practices). We provide you with an API to combine, filter, and adjust data, or to retrieve a specific snapshot for a specific date.
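
To make the "Parsing data" use case above concrete, here is a minimal sketch of the kind of conversion involved. It is not how dphelper does it internally; parse_amount is a hypothetical name, and the rule "a separator followed by one or two trailing digits is the decimal part" is an assumption of this sketch.

```python
import re


def parse_amount(text: str) -> int:
    """Turn a price string such as '100,234.00 eur' into a whole number."""
    # keep only digits and separators, dropping currency names and spaces
    cleaned = re.sub(r"[^\d.,]", "", text)
    # assumption: a final separator followed by 1-2 digits marks the decimals
    decimals = re.search(r"[.,]\d{1,2}$", cleaned)
    integer_part = cleaned[: decimals.start()] if decimals else cleaned
    digits = re.sub(r"\D", "", integer_part)
    if not digits:
        raise ValueError(f"no number found in {text!r}")
    return int(digits)


print(parse_amount("100,234.00 eur"))
print(parse_amount("100.200,00"))
```

Both calls return a plain int with the decimal part dropped, regardless of which character is used as the thousands separator.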

Install helper library

Start by installing the package on your local machine or development server.

pip install dphelper

Important! Keep your dphelper package up to date to make sure you have the latest version of our APIs:

pip install dphelper --upgrade
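
To check which version is currently installed, one option is the standard library's importlib.metadata. Note that it requires Python 3.8 or later; installed_version is a hypothetical helper name used only for this sketch.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# prints the version string, or None when dphelper is not installed
print(installed_version("dphelper"))
```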

Getting HTML content

The standard approach of getting HTML content in Python is as follows:

import requests

r = requests.get('http://example.com')
print(r.text)

This is how you could do it with the Beautiful Soup library (also a popular option for web scraping). This may be valuable since it can, for example, strip out unnecessary whitespace for you.

from bs4 import BeautifulSoup

import requests

# request web page
resp = requests.get("http://example.com")

# get the response text. in this case it is HTML
html = resp.text

# parse the HTML
soup = BeautifulSoup(html, "html.parser")

# print the HTML as text
print(soup.get_text())
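
To illustrate the whitespace clean-up on a fixed input without any network access, here is a small sketch that uses the standard library's html.parser instead of Beautiful Soup. TextExtractor and html_to_text are hypothetical names; unlike a full scraper, this sketch also keeps the text inside script and style tags.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, collapsing runs of whitespace."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = " ".join(data.split())
        if text:  # ignore whitespace-only nodes between tags
            self.chunks.append(text)


def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


sample = "<html><body>\n  <h1> Hello </h1>\n <p>big   world</p></body></html>"
print(html_to_text(sample))
```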

Here is the pattern we recommend using on our platform. Its added value is that, if needed, it can use 3rd party services to avoid blocking.

from dphelper import DPHelper

helper = DPHelper(is_verbose=True)
headers = helper.create_headers(authority="example.com")
 
content = helper.from_url('http://example.com', headers=headers)

Parsing structured data

In many scenarios you will be able to access structured data (for example, as JSON via an API). However, the data platform is designed in such a way that the elements of your data must have a specific, exact shape (match a JSONSchema). Let's say the API of the target website returns data about a list of apartments in this format:

[ 
    {"kaina": 5, "plotas": "70 m2"},
    {"kaina": "$25,99", "plotas": "50.22"},
    {"kaina": 0, "plotas": "12"},
]

The schema, however, has different field names and different formats, and we want the data to be in this format:

[ 
    {"price": 5, "area": 70.00},
    {"price": 25.99, "area": 50.22},
    {"price": 0, "area": 12.00},
]

For this scenario, we can create a schema field mapping to make the transformation easier:

from dphelper import DPHelper

helper = DPHelper(is_verbose=True)

schema_mapping = {
  "kaina": "price", 
  "plotas": "area",
}

# data in the source (invalid) format
partial_data = [
    {"kaina": 5, "kambariai": "1.5 kamb."},
    {"kaina": "$25,99"},
    {"kambariai": "2.2"},
]

# convert to the right format
print(helper.transform_map(schema_mapping, partial_data, parse=True))
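
For intuition, the renaming and parsing step can be sketched in plain Python on the apartment data shown earlier. This is not the implementation of transform_map: rename_and_parse and to_number are hypothetical names, keys missing from the mapping are simply dropped, and a comma is assumed to be a decimal separator.

```python
import re

SCHEMA_MAPPING = {"kaina": "price", "plotas": "area"}


def to_number(value):
    """Return ints/floats unchanged; pull the first number out of a string."""
    if isinstance(value, (int, float)):
        return value
    match = re.search(r"\d+(?:[.,]\d+)?", str(value))
    if match is None:
        return None
    # assumption: a comma is a decimal separator, as in "$25,99"
    return float(match.group().replace(",", "."))


def rename_and_parse(mapping, rows):
    result = []
    for row in rows:
        item = {}
        for source_key, value in row.items():
            if source_key in mapping:  # keys outside the mapping are dropped
                item[mapping[source_key]] = to_number(value)
        result.append(item)
    return result


source = [
    {"kaina": 5, "plotas": "70 m2"},
    {"kaina": "$25,99", "plotas": "50.22"},
    {"kaina": 0, "plotas": "12"},
]
print(rename_and_parse(SCHEMA_MAPPING, source))
```

The printed list has the same field names and numeric values as the target format shown earlier.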

Tip: always try to use structured data when possible. For example, if the data is stored in a JSON file (or as a variable in HTML), read the entire thing as a string and parse it as JSON. This is less fragile, since the order of attributes may (and likely will) change and new attributes will be added.

Example

HTML with JSON:

<div data-drupal-messages-fallback class="hidden"></div>
<script>
  var data = {"posts": [{"id":"1","post_title":"LP", ..}]};
</script>

Instead of doing this:

pattern = re.compile(r'{"post_id":"(.*?)","post_title":"(.*?)","post_adress":"(.*?)","longitude":"(.*?)","latitude":"(.*?)","type":"(.*?)",.*?}')
results = pattern.findall(data)

Do this:

import json

# take the text after "var data = " and cut it where the statement ends
content = content.split('var data = ')[1].split(';\n</script>')[0]
json_data = json.loads(content)
posts = json_data.get('posts')
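
Splitting on fixed strings depends on the exact text that follows the JSON. A slightly more tolerant sketch uses json.JSONDecoder.raw_decode from the standard library, which parses one JSON value and ignores whatever comes after it. The function name extract_json_after and the sample HTML are made up for this illustration.

```python
import json


def extract_json_after(html: str, marker: str):
    """Parse the JSON value that starts right after `marker` in `html`."""
    start = html.index(marker) + len(marker)
    # raw_decode does not skip leading whitespace, so strip it first
    value, _end = json.JSONDecoder().raw_decode(html[start:].lstrip())
    return value


sample_html = (
    '<script>\n'
    '  var data = {"posts": [{"id": "1", "post_title": "LP; v2"}]};\n'
    '</script>'
)
data = extract_json_after(sample_html, "var data = ")
print(data["posts"][0]["post_title"])
```

Because the decoder stops at the end of the JSON value, a semicolon inside a string (as in the sample title) does not break the extraction.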

Parsing unstructured data

In some cases you will have access to structured data (for example, an API request returning data as JSON), but in many situations the data will be stored in HTML in an unstructured format (text), for example as a list of apartments.

One valid approach is to use a regexp and convert each value into the corresponding format, for example:

import re

rg = re.compile('<tr class=".*?"><td class=".*?">(.*?)</td><td class=".*?">(.*?)</td><td class=".*?">(.*?)</td><td class=".*?">(.*?)</td><td class=".*?">(.*?)</td><td class=".*?">(.*?)</td><td class=".*?">(.*?)</td><td class=".*?">(.*?)</td><td class=".*?"><a href="(.*?)".*?></a>.*?</td><td class=".*?">(.*?)</td></tr>')
results = rg.findall(content)

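tmp = []
# the remaining captured columns are collected in `rest`
for id, floor, area, room, status, *rest in results:
  tmp.append({
     'area': float(area.replace('m2', '')),
     'status': 0 if status == 'Laisvas' else 1,
     # .. other fields are converted in the same way
  })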

However, this may be repetitive and error-prone. We provide a helper function which accepts a schema and converts the data into the right format:

import re

rg = re.compile('<tr class=".*?"><td class=".*?">(.*?)</td><td class=".*?">(.*?)</td><td class=".*?">(.*?)</td><td class=".*?">(.*?)</td><td class=".*?">(.*?)</td><td class=".*?">(.*?)</td><td class=".*?">(.*?)</td><td class=".*?">(.*?)</td><td class=".*?"><a href="(.*?)".*?></a>.*?</td><td class=".*?">(.*?)</td></tr>')
results = rg.findall(content)

apts = helper.parse_rows(
  ['id','floor','area','rooms','status'],
  results,
  verbose=True,
)

Useful: while building a data pipeline you can use the flag verbose=True to make debugging easier, and later switch it to verbose=False (or delete the extra parameter).

Here is the list of currently supported field types:

Field name	Description	Example inputs (= outputs)
id	unique identifier
floor	floor number for apt	“trecias” (=3); “Antras” (=2)
area	area for apt	“50.5” (=50.5); “50,5” (=50,5)
rooms	number of rooms for apt	“du” (=2); “2 kamb.” (=2)
orientation	orientation	“Siaure” (=[“N”])
price	price of apt	“100.200,00” (=100200)
status	status for apt	“Laisvas” (=0); “Rezervuotas” (=1); “Parduotas” (=2)
www	extract link

Important: if a schema field is not recognized, it will be kept unchanged (same format) after passing through the utility.
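
The idea of schema-driven parsing, including the pass-through of unrecognized fields, can be sketched as follows. This is an illustration only and not the code behind parse_rows: parse_rows_sketch, the converter table, and the sample rows are all hypothetical, and only the area and status fields are handled here.

```python
STATUS_CODES = {"laisvas": 0, "rezervuotas": 1, "parduotas": 2}


def parse_area(text):
    # assumption: a comma is a decimal separator and "m2" is the only unit
    return float(text.replace("m2", "").replace(",", ".").strip())


def parse_status(text):
    return STATUS_CODES[text.strip().lower()]


# field name -> converter; fields without an entry are kept unchanged
CONVERTERS = {"area": parse_area, "status": parse_status}


def parse_rows_sketch(fields, rows):
    parsed = []
    for row in rows:
        item = {}
        for name, raw in zip(fields, row):
            convert = CONVERTERS.get(name)
            item[name] = convert(raw) if convert else raw
        parsed.append(item)
    return parsed


rows = [
    ("A-1", "3", "50,5 m2", "2", "Laisvas"),
    ("A-2", "4", "70 m2", "3", "Parduotas"),
]
print(parse_rows_sketch(["id", "floor", "area", "rooms", "status"], rows))
```

Keeping the converters in one table means a new field type only needs one new entry, rather than another branch in every pipeline.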