Parsing unstructured data – Dataplatform guide for contributors

In some cases you will have access to structured data (for example, an API request that returns JSON), but in many situations the data is stored as unstructured text inside HTML, for example a list of apartments rendered as an HTML table.

One valid approach is to use a regular expression and convert each captured value into the corresponding format, for example:

import re

# `content` holds the page HTML as a string.
rg = re.compile(r'<tr class=".*?"><td class=".*?">(.*?)</td><td class=".*?">(.*?)</td><td class=".*?">(.*?)</td><td class=".*?">(.*?)</td><td class=".*?">(.*?)</td><td class=".*?">(.*?)</td><td class=".*?">(.*?)</td><td class=".*?">(.*?)</td><td class=".*?"><a href="(.*?)".*?></a>.*?</td><td class=".*?">(.*?)</td></tr>')
results = rg.findall(content)

tmp = []
# Each match is a tuple with one value per captured column;
# the columns not named here are collected in `rest`.
for id, floor, area, room, status, *rest in results:
    tmp.append({
        'area': float(area.replace('m2', '')),
        'status': 0 if status == 'Laisvas' else 1,
        # .. (remaining fields)
    })
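The snippet above assumes that `content` already holds the page. As a minimal, self-contained sketch of the same manual approach, the example below runs a shortened pattern against a made-up two-row table. The HTML, its class names, and the three-column layout are assumptions for illustration only; real pages will have more columns and different markup.

```python
import re

# Hypothetical sample input: two table rows with id, area and status columns.
content = (
    '<tr class="row"><td class="c">A1</td><td class="c">50.5 m2</td><td class="c">Laisvas</td></tr>'
    '<tr class="row"><td class="c">A2</td><td class="c">72 m2</td><td class="c">Rezervuotas</td></tr>'
)

# One capture group per <td>; findall returns one tuple per matched row.
rg = re.compile(
    r'<tr class=".*?">'
    r'<td class=".*?">(.*?)</td>'
    r'<td class=".*?">(.*?)</td>'
    r'<td class=".*?">(.*?)</td>'
    r'</tr>'
)
results = rg.findall(content)

apartments = []
for apt_id, area, status in results:
    apartments.append({
        'id': apt_id,
        # float() tolerates the whitespace left after removing the unit.
        'area': float(area.replace('m2', '')),
        'status': 0 if status == 'Laisvas' else 1,
    })

print(apartments)
```

Every new column means another capture group and another hand-written conversion, which is where this approach becomes repetitive.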

However, this can be repetitive and error-prone. Instead, you can use a helper function that accepts a schema and converts the data into the right format:

import re

rg = re.compile('<tr class=".*?"><td class=".*?">(.*?)</td><td class=".*?">(.*?)</td><td class=".*?">(.*?)</td><td class=".*?">(.*?)</td><td class=".*?">(.*?)</td><td class=".*?">(.*?)</td><td class=".*?">(.*?)</td><td class=".*?">(.*?)</td><td class=".*?"><a href="(.*?)".*?></a>.*?</td><td class=".*?">(.*?)</td></tr>')
results = rg.findall(content)

apts = helper.parse_rows(
  ['id','floor','area','rooms','status'],
  results,
  verbose=True,
)

Useful: while building a data pipeline, you can pass verbose=True to make debugging easier, and later switch it to verbose=False (or remove the extra parameter).
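The platform's `helper.parse_rows` is not reproduced here. The following is only a rough sketch of how a schema-driven helper of this kind could work, assuming that each schema name is paired with the column at the same position and selects a converter for it. The names `parse_rows_sketch`, `CONVERTERS`, `to_area` and `to_status`, and the exact output printed in verbose mode, are hypothetical.

```python
def to_area(value):
    # "50.5 m2" -> 50.5
    return float(value.replace('m2', '').strip())


STATUSES = {'Laisvas': 0, 'Rezervuotas': 1, 'Parduotas': 2}


def to_status(value):
    # "Laisvas" -> 0, "Rezervuotas" -> 1, "Parduotas" -> 2
    return STATUSES[value.strip()]


# Hypothetical registry: schema field name -> converter function.
CONVERTERS = {'area': to_area, 'status': to_status}


def parse_rows_sketch(schema, rows, verbose=False):
    """Pair each schema name with the column at the same position and convert it."""
    parsed = []
    for row in rows:
        item = {}
        for name, raw in zip(schema, row):
            convert = CONVERTERS.get(name)
            # Fields without a converter keep their raw value.
            value = convert(raw) if convert else raw
            if verbose:
                print(f'{name}: {raw!r} -> {value!r}')
            item[name] = value
        parsed.append(item)
    return parsed


rows = [('A1', '50.5 m2', 'Laisvas'), ('A2', '72 m2', 'Parduotas')]
apts = parse_rows_sketch(['id', 'area', 'status'], rows, verbose=True)
```

In this sketch, verbose mode prints every raw value next to its converted value, which makes it easy to spot a column that was matched to the wrong schema field.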

Here is the list of currently supported field types:

Field name	Description	Example input (= output)
id	unique identifier
floor	floor of the apartment	“trecias” (=3); “Antras” (=2)
area	area of the apartment	“50.5” (=50.5); “50,5” (=50,5)
rooms	number of rooms in the apartment	“du” (=2); “2 kamb.” (=2)
orientation	orientation	“Siaure” (=[“N”])
price	price of the apartment	“100.200,00” (=100200)
status	status of the apartment	“Laisvas” (=0); “Rezervuotas” (=1); “Parduotas” (=2)
www	extracted link

Important: if a schema field is not recognized, its values pass through the utility unchanged (they keep their original format).
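To make the conversions in the table and the pass-through rule concrete, here is a hedged sketch of two converters and a dispatcher. It is not the platform's implementation: `parse_price`, `parse_rooms`, `ROOM_WORDS`, `FIELD_PARSERS` and `convert_field` are hypothetical names, and the word list contains only the single example given in the table.

```python
import re


def parse_price(value):
    # "100.200,00" -> 100200.0: dots group thousands, the comma marks decimals.
    return float(value.replace('.', '').replace(',', '.'))


# Only the word shown in the table; a real list would need more entries.
ROOM_WORDS = {'du': 2}


def parse_rooms(value):
    # "du" -> 2 via the word list, "2 kamb." -> 2 via the first run of digits.
    text = value.strip().lower()
    if text in ROOM_WORDS:
        return ROOM_WORDS[text]
    match = re.search(r'\d+', text)
    return int(match.group()) if match else value


FIELD_PARSERS = {'price': parse_price, 'rooms': parse_rooms}


def convert_field(name, value):
    # Unrecognized field names are returned unchanged.
    parser = FIELD_PARSERS.get(name)
    return parser(value) if parser else value


print(convert_field('price', '100.200,00'))
print(convert_field('rooms', '2 kamb.'))
print(convert_field('heating', 'central'))
```

Passing unknown fields through rather than raising an error keeps a pipeline running while a schema is still being worked out, at the cost of leaving raw strings in the output for those fields.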
