In some cases you will have access to structured data (for example, an API request that returns JSON), but in many situations the data is stored as unstructured text inside HTML. For example, a list of apartments:
One valid approach is to use a regular expression and convert each captured value into the corresponding format, for example:
import re
rg = re.compile('<tr class=".*?"><td class=".*?">(.*?)</td><td class=".*?">(.*?)</td><td class=".*?">(.*?)</td><td class=".*?">(.*?)</td><td class=".*?">(.*?)</td><td class=".*?">(.*?)</td><td class=".*?">(.*?)</td><td class=".*?">(.*?)</td><td class=".*?"><a href="(.*?)".*?></a>.*?</td><td class=".*?">(.*?)</td></tr>')
results = rg.findall(content)
tmp = []
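# each match is a tuple of captured cells; only the first five are named here
for id, floor, area, room, status, *rest in results:
    tmp.append({
        'area': float(area.replace('m2', '')),
        'status': 0 if status == 'Laisvas' else 1,
        # .. remaining fields
    })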
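To try the manual approach end to end, here is a minimal, self-contained sketch. The sample markup, its three-column layout, and the names `row_re` and `apartments` are made up for illustration; the real page has more columns.

```python
import re

# Hypothetical sample markup: three columns only, invented for this sketch.
content = (
    '<tr class="row"><td class="c">A1</td><td class="c">50.5 m2</td>'
    '<td class="c">Laisvas</td></tr>'
    '<tr class="row"><td class="c">A2</td><td class="c">72 m2</td>'
    '<td class="c">Rezervuotas</td></tr>'
)

# One capture group per table cell.
row_re = re.compile(
    r'<tr class=".*?"><td class=".*?">(.*?)</td>'
    r'<td class=".*?">(.*?)</td><td class=".*?">(.*?)</td></tr>'
)

apartments = []
for apt_id, area, status in row_re.findall(content):
    apartments.append({
        'id': apt_id,
        # drop the unit suffix before converting to a number
        'area': float(area.replace('m2', '').strip()),
        'status': 0 if status == 'Laisvas' else 1,
    })

print(apartments)
```

Every new column needs another capture group and another hand-written conversion, which is where the repetition comes from.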
However, this can be repetitive and error-prone. A helper function is available that accepts a schema and converts the data into the right format:
import re
rg = re.compile('<tr class=".*?"><td class=".*?">(.*?)</td><td class=".*?">(.*?)</td><td class=".*?">(.*?)</td><td class=".*?">(.*?)</td><td class=".*?">(.*?)</td><td class=".*?">(.*?)</td><td class=".*?">(.*?)</td><td class=".*?">(.*?)</td><td class=".*?"><a href="(.*?)".*?></a>.*?</td><td class=".*?">(.*?)</td></tr>')
results = rg.findall(content)
apts = helper.parse_rows(
    ['id', 'floor', 'area', 'rooms', 'status'],
    results,
    verbose=True,
)
Useful: While building a data pipeline, you can pass verbose=True to make debugging easier, then switch to verbose=False (or remove the parameter) later.
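To make the idea of a schema-driven helper concrete, here is a rough sketch of how such a function could work. It is not the actual implementation of helper.parse_rows: the name `parse_rows_sketch`, the `CONVERTERS` table, and the format of the verbose output are all assumptions made for illustration.

```python
# Hypothetical converters; the real helper's rules are not shown here.
STATUS_CODES = {'Laisvas': 0, 'Rezervuotas': 1, 'Parduotas': 2}
CONVERTERS = {
    'area': lambda raw: float(raw.replace('m2', '').strip()),
    'status': lambda raw: STATUS_CODES.get(raw, raw),
}


def parse_rows_sketch(schema, rows, verbose=False):
    """Pair each row's values with the schema field names and convert them."""
    parsed = []
    for row in rows:
        record = {}
        # zip stops at the shorter sequence, so in this sketch any
        # columns beyond the schema are ignored
        for field, raw in zip(schema, row):
            convert = CONVERTERS.get(field)
            # fields without a converter are kept unchanged
            value = convert(raw) if convert else raw
            if verbose:
                print(f'{field}: {raw!r} -> {value!r}')
            record[field] = value
        parsed.append(record)
    return parsed


rows = [('A1', '3', '50.5 m2', '2', 'Laisvas')]
apts = parse_rows_sketch(
    ['id', 'floor', 'area', 'rooms', 'status'],
    rows,
    verbose=True,
)
```

With verbose=True each raw value is printed next to its converted form, which makes it easy to spot a column that was mapped to the wrong field.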
Here is the list of currently supported field types:
Field name | Description | Example input / output |
id | unique identifier | |
floor | floor number of the apartment | "trecias" (=3); "Antras" (=2) |
area | area of the apartment | "50.5" (=50.5); "50,5" (=50,5) |
rooms | number of rooms in the apartment | "du" (=2); "2 kamb." (=2) |
orientation | orientation | "Siaure" (=["N"]) |
price | price of the apartment | "100.200,00" (=100200) |
status | status of the apartment | "Laisvas" (=0); "Rezervuotas" (=1); "Parduotas" (=2) |
www | extracted link | |
Important: if a schema field is not recognized, its value is left unchanged (same format) after passing through the utility.
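Some of the conversions listed above can be sketched as follows. These functions are illustrative only: the names `parse_price`, `parse_rooms`, and `parse_floor` are hypothetical, the word tables contain only the examples from the table, and returning the raw value when nothing matches is an assumption, not documented behaviour.

```python
import re

# Only the words that appear in the examples; a real mapping would be longer.
ROOM_WORDS = {'du': 2}
FLOOR_WORDS = {'antras': 2, 'trecias': 3}


def parse_price(raw):
    # "100.200,00": dot as thousands separator, comma as decimal separator
    return float(raw.replace('.', '').replace(',', '.'))


def parse_rooms(raw):
    text = raw.strip().lower()
    if text in ROOM_WORDS:
        return ROOM_WORDS[text]
    # fall back to the first run of digits, e.g. "2 kamb."
    match = re.search(r'\d+', text)
    return int(match.group()) if match else raw


def parse_floor(raw):
    # case-insensitive lookup; unknown words are returned unchanged
    return FLOOR_WORDS.get(raw.strip().lower(), raw)


print(parse_price('100.200,00'))
print(parse_rooms('2 kamb.'), parse_rooms('du'))
print(parse_floor('trecias'), parse_floor('Antras'))
```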