Parsing unstructured data

In some cases you will have access to structured data (for example, an API request that returns JSON), but in many situations the data is stored as HTML, in an unstructured text format. For example, a list of apartments:

One valid approach is to use a regular expression and convert each captured value into the corresponding format, for example:

import re

rg = re.compile('<tr class=".*?"><td class=".*?">(.*?)</td><td class=".*?">(.*?)</td><td class=".*?">(.*?)</td><td class=".*?">(.*?)</td><td class=".*?">(.*?)</td><td class=".*?">(.*?)</td><td class=".*?">(.*?)</td><td class=".*?">(.*?)</td><td class=".*?"><a href="(.*?)".*?></a>.*?</td><td class=".*?">(.*?)</td></tr>')
results = rg.findall(content)

tmp = []
# Unpack the first five captured groups; *rest collects the remaining ones.
for id, floor, area, room, status, *rest in results:
    tmp.append({
        'area': float(area.replace('m2', '')),
        'status': 0 if status == 'Laisvas' else 1,
        # ..
    })
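
The manual approach above can be tried end to end on a small input. The following is a minimal sketch, assuming hypothetical two-cell rows (`SAMPLE_HTML`) and a shortened pattern (`ROW_RE`); real listings have more cells, and the names `parse_area` and `extract_rows` are illustrative only.

```python
import re

# Hypothetical sample: two rows with an area cell and a status cell.
SAMPLE_HTML = (
    '<tr class="row"><td class="c">50,5 m2</td><td class="c">Laisvas</td></tr>'
    '<tr class="row"><td class="c">72 m2</td><td class="c">Parduotas</td></tr>'
)

# Shortened version of the pattern: one capture group per table cell.
ROW_RE = re.compile(
    '<tr class=".*?"><td class=".*?">(.*?)</td><td class=".*?">(.*?)</td></tr>'
)


def parse_area(text):
    """Convert a string such as '50,5 m2' to a float."""
    return float(text.replace("m2", "").replace(",", ".").strip())


def extract_rows(html):
    """Run the pattern over the HTML and convert each captured value."""
    rows = []
    for area, status in ROW_RE.findall(html):
        rows.append({
            "area": parse_area(area),
            # Same rule as in the example above: free is 0, anything else is 1.
            "status": 0 if status == "Laisvas" else 1,
        })
    return rows


rows = extract_rows(SAMPLE_HTML)
print(rows)
```

Every new field needs its own conversion line in the loop, which is where the repetition comes from.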


However, this can be repetitive and error-prone. We provide a helper function that accepts a schema and converts the data into the right format:

import re

rg = re.compile('<tr class=".*?"><td class=".*?">(.*?)</td><td class=".*?">(.*?)</td><td class=".*?">(.*?)</td><td class=".*?">(.*?)</td><td class=".*?">(.*?)</td><td class=".*?">(.*?)</td><td class=".*?">(.*?)</td><td class=".*?">(.*?)</td><td class=".*?"><a href="(.*?)".*?></a>.*?</td><td class=".*?">(.*?)</td></tr>')
results = rg.findall(content)

apts = helper.parse_rows(
  ['id','floor','area','rooms','status'],
  results,
  verbose=True,
)

Useful: While building the data pipeline you can pass verbose=True to make debugging easier, and later switch it to verbose=False (or remove the parameter).

Here is the list of currently supported field types:
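
To illustrate the idea of a schema-driven helper, here is a rough sketch of how such a function might work. It is not the actual implementation of `helper.parse_rows`: the name `parse_rows_sketch`, the `CONVERTERS` table, and the behaviour of `verbose` (printing each conversion) are assumptions made for this example.

```python
def to_area(text):
    """Convert a string such as '50,5 m2' to a float."""
    return float(text.replace("m2", "").replace(",", ".").strip())


# Status codes as listed in the supported field types.
STATUS_CODES = {"Laisvas": 0, "Rezervuotas": 1, "Parduotas": 2}

# Hypothetical registry: field name -> conversion function.
CONVERTERS = {
    "area": to_area,
    "status": lambda text: STATUS_CODES[text.strip()],
}


def parse_rows_sketch(schema, rows, verbose=False):
    """Pair each captured value with its schema name and convert it."""
    parsed = []
    for row in rows:
        record = {}
        # zip stops at the shorter sequence, so captured groups beyond
        # the schema are ignored in this sketch.
        for name, raw in zip(schema, row):
            convert = CONVERTERS.get(name)
            # Fields without a converter are kept unchanged.
            value = convert(raw) if convert else raw
            if verbose:
                print(f"{name}: {raw!r} -> {value!r}")
            record[name] = value
        parsed.append(record)
    return parsed


apts = parse_rows_sketch(
    ["id", "area", "status"],
    [("A1", "50,5 m2", "Laisvas")],
)
```

In this sketch `id` has no converter, so it passes through as the original string, while `area` and `status` are converted.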

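| Field name  | Description            | Example input / outputs                                    |
|-------------|------------------------|------------------------------------------------------------|
| id          | unique identifier      |                                                            |
| floor       | floor for apt          | "trecias" (=3)                                             |
| area        | area for apt           | "50.5" (=50.5), "50,5" (=50,5)                             |
| rooms       | number of rooms for apt | "du" (=2), "2 kamb." (=2)                                 |
| orientation | orientation            | "Siaure" (=["N"])                                          |
| price       | price of apt           | "100.200,00" (=100200)                                     |
| status      | status for apt         | "Laisvas" (=0), "Rezervuotas" (=1), "Parduotas" (=2)       |
| floor       | floor no for apt       | "Antras" (=2)                                              |
| www         | extract link           |                                                            |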
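
The conversions listed above can be approximated with a few small functions. This is a hedged sketch covering only the example inputs from the list; the function names and the word tables (`FLOOR_WORDS`, `ROOM_WORDS`) are hypothetical, and the real helper may handle many more spellings.

```python
import re

# Only the words that appear in the examples above.
FLOOR_WORDS = {"antras": 2, "trecias": 3}
ROOM_WORDS = {"du": 2}


def parse_price(text):
    """Convert '100.200,00' (dot for thousands, comma for decimals) to a float."""
    return float(text.replace(".", "").replace(",", "."))


def _word_or_digits(text, words):
    """Look the text up in a word table, else take the first run of digits."""
    key = text.strip().lower()
    if key in words:
        return words[key]
    match = re.search(r"\d+", key)
    if match is None:
        raise ValueError(f"cannot interpret {text!r} as a number")
    return int(match.group())


def parse_floor(text):
    return _word_or_digits(text, FLOOR_WORDS)


def parse_rooms(text):
    return _word_or_digits(text, ROOM_WORDS)
```

For example, `parse_rooms("2 kamb.")` falls back to the digit search, while `parse_floor("Antras")` is resolved through the word table.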

Important: if a schema field is not recognized, the utility leaves its value unchanged (same format).
