Reading XLS files

Reading structured data from XLS format

Sometimes data is provided as an XLS file.

Example: borrowing rates at the Bank of Lithuania are published as a statistics page. The data is structured as a table, but it is always recommended to work with structured data, in this case the XLS download.

The best strategy for fetching such data is the Python pandas library, as it has quite advanced XLS reading capabilities. Here is an example of how to read the structured data in this case:

import json

import pandas as pd

# download and parse the XLS export directly from the URL
excel_url = "https://www.lb.lt/lt/m_statistika/t-paskolu-palukanu-normos/?export=xlsx"
df = pd.read_excel(excel_url)

# skip the header rows and keep only the date and value columns
selected_data = df.iloc[12:, [0, 6]]
selected_data.columns = ['date', 'value']
json_data = selected_data.to_dict(orient='records')

# strip spaces from the month label and append a day to get a full date
results = [{
  'date': row.get('date').replace(' ', '') + '-01',
  'value': row.get('value'),
} for row in json_data]

print(json.dumps(results, indent=2))

Sharing charts

Sometimes we want to visualise data in the form of an infographic, chart or dashboard. For example, here is an interactive chart depicting the salary growth rate:

This could be used in media articles as well to back content with (live) data.

Infogram tool

We use the Infogram tool for data visualisation, as it makes charts easy to share and offers quite a large collection of chart types.

Example: line chart

We will use the example above (the charts) to demonstrate how to build an interactive chart.

Let's build a static chart first. Pick a line chart and add the data manually:

To keep it interactive, you will need to use the JSON feeds option.

Important: this works for a static snapshot; live updates (i.e. real-time refresh) require the paid version of the app.

Data should be in proper JSON format, with sheets as the outer dimension (similar to spreadsheets in Infogram), followed by rows and columns.

[[["Index of Happiness","09:00","10:00","11:00","12:00","13:00"],["Eden",794,404,301,563,406],["Shambhala",470,940,142,575,294],["Avalon",332,914,860,616,741],["Camelot",351,686,720,447,467],["El Dorado",863,353,612,872,533],["Atlantis",382,882,240,320,420]]]

Here is the API link which returns JUST the data, without other meta information (snapshot creation date, etc.). You will have to use this in the Infogram tool when prompted to “Enter JSON feed URL” (replace the challenge ID with your own):

https://api.dataplatform.lt/snapshots/latest/json/?challenge_id=459
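
If you want to check what the feed returns before wiring it into Infogram, here is a quick sketch (assuming the challenge ID from the example above and that the feed returns JSON as described):

import requests
import json

feed_url = 'https://api.dataplatform.lt/snapshots/latest/json/?challenge_id=459'
feed = requests.get(feed_url).json()

# pretty-print the sheets -> rows -> columns structure described above
print(json.dumps(feed, indent=2))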

You can also inspect the results in the browser. Here is an example which would produce data in the required format on the data platform:


import requests
import json


def get_last_snapshot(challenge_id):
  # fetch the latest snapshot of another challenge from the data platform
  api_url = 'https://data-platform-backend-4ddpl.ondigitalocean.app/snapshots/latest/?challenge_id=%d' % challenge_id
  response = requests.get(api_url)
  data = response.json()
  return data.get('json_data', [])


payroll_data = get_last_snapshot(237)

# keep only gross (Bruto) salaries for Vilnius
location = 'Vilnius'
payroll_data = [
  p for p in payroll_data
  if p.get('location') == location
  and p.get('salary') == 'Bruto'
  and p.get('company_status') == 'With'
]
payroll_data.sort(key=lambda d: d.get('date'))

# build rows: header first ('Data' = date, 'Alga' = salary), then one row per month
results = [['Data', 'Alga']]
for item in payroll_data:
  results.append([item.get('date'), item.get('value')])

# wrap the rows in an outer list — the "sheets" dimension expected by Infogram
combined_results = [results]
print(json.dumps(combined_results))

Debugging

Lost as to why your pipeline breaks? Add a print statement, typically combined with exiting the program immediately afterwards, for example by raising an exception.
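
A minimal sketch of this pattern, here fetching the catalog endpoint from the “Data from APIs” example below purely for illustration:

import requests

# fetch something mid-pipeline that we want to inspect
response = requests.get('https://gedimino37.lt/catalog.php')

# print the intermediate result...
print(response.status_code)
print(response.text[:500])

# ...and stop the pipeline right here so the output is easy to spot
raise Exception('stopping here while debugging')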

Now check the intermediate print results shown in the output.

Another approach to debugging is to simply print intermediate debug output to stderr. The pattern is as follows:

import sys
print('This is an error message', file=sys.stderr)

# Output:
# This is an error message

Tip: the data platform will only read sys.stdout as output. You can use this as a feature: if you need some print statements (for example, statistics for each step of the pipeline), you can route them to stderr.
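
For example, a pipeline can report per-step statistics on stderr while keeping stdout clean for the data itself (the data below is a placeholder):

import sys
import json

results = [{'date': '2024-01-01', 'value': 42}]  # placeholder data

# statistics go to stderr, so the platform ignores them
print('rows collected: %d' % len(results), file=sys.stderr)

# only this line is read by the data platform
print(json.dumps(results))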

Index and crawling pipelines

For use cases like listing portals and e-commerce sites (eshops), the typical pattern is a very large number of pages in the system. Write one pipeline which crawls the entire website and produces URLs, while another populates the data using those URLs.

In many cases you can find most of the required information already at the indexing stage, which is faster. Furthermore, this puts less load on the target website and decreases the chance of being blocked.

Tip: please refer to the section on how to reuse data from another pipeline in our platform; a sketch of this pattern follows below.
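
A rough sketch of the second (crawling) pipeline, assuming the index pipeline already stores its list of URLs as a snapshot; the challenge ID and field names here are illustrative:

import requests
import json


def get_last_snapshot(challenge_id):
  # reuse the latest snapshot produced by the index pipeline
  api_url = 'https://data-platform-backend-4ddpl.ondigitalocean.app/snapshots/latest/?challenge_id=%d' % challenge_id
  response = requests.get(api_url)
  return response.json().get('json_data', [])


# illustrative challenge ID of the index pipeline
index_rows = get_last_snapshot(100)

results = []
for row in index_rows:
  url = row.get('url')
  page = requests.get(url).text
  # ... parse the fields you need from the page here ...
  results.append({'url': url, 'page_length': len(page)})

print(json.dumps(results))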

Schema “Flats”

An end-to-end example:

import json
import re

from dphelper import DPHelper

helper = DPHelper(is_verbose=True)
headers = helper.create_headers(authority="berzunamai.lt")

# download the page containing the table of flats
content = helper.from_url('https://berzunamai.lt/butai/', headers=headers)

# extract the table rows (one tuple of cell values per flat)
rg = re.compile('<tr class=".*?"><td class=".*?">(.*?)</td><td class=".*?">(.*?)</td><td class=".*?">(.*?)</td><td class=".*?">(.*?)</td><td class=".*?">(.*?)</td><td class=".*?">(.*?)</td><td class=".*?">(.*?)</td><td class=".*?">(.*?)</td><td class=".*?"><a href="(.*?)".*?></a>.*?</td><td class=".*?">(.*?)</td></tr>')
results = rg.findall(content)

# map the captured cells to named columns
apts = helper.parse_rows(
  ['id', 'floor_romanian', 'area', 'rooms', 'orientation', 'price', 'status', 'empty_column', 'www', 'floor'],
  results,
  verbose=True,
)

# add the flat link into the 'www' column
for row in apts:
  row['www'] = f"https://berzunamai.lt/butai/{row['floor']}-aukstas/{row['id']}-butas/"

# export to JSON
print(json.dumps(apts))

Data from APIs

Sometimes data for a web page is loaded dynamically, i.e. on demand. This can be done for performance reasons, to make the initial load faster. You can often identify this pattern by observing spinners or loaders, or content appearing incrementally:

In the browser, this can be observed by inspecting the Network tab of the developer tools. For example, in Chrome you can open the developer tools via the menu “View > Developer > Developer Tools” and go to the Network tab. Watch the screencast.

Having problems finding the right request? Try searching by the data itself.

Now you can find the API request which yields the required data:

curl https://gedimino37.lt//catalog.php

Which (in this case) would yield structured data:

[{"id":"B0.1","status":"sold","direction":"PR","floor":"0","size":"87,69","rooms":"6}]

The code snippet in Python to retrieve this kind of data:

import json
from dphelper import DPHelper

helper = DPHelper()
headers = helper.create_headers(authority="gedimino37.lt")

content = helper.from_url('https://gedimino37.lt//catalog.php', headers=headers)
data = json.loads(content)

Or, using the standard requests library instead of the helper's from_url (headers are still created with dphelper):

import requests
from dphelper import DPHelper

helper = DPHelper()
headers = helper.create_headers(authority="gedimino37.lt")

r = requests.get('https://gedimino37.lt/catalog.php', headers=headers)
data = r.json()
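
Whichever variant you use, remember (see the debugging section above) that the data platform reads only stdout, so finish the pipeline by printing the parsed data as JSON (this assumes import json at the top of the script):

print(json.dumps(data))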