In many scenarios you will be able to access structured data (for example, via JSON via API). However, data platform is designed such way that the elements of your data should have SPECIFIC EXACT shape (match JSONSchema). Thus lets say API from target website returns data in such format about list of apartments:
[
{"kaina": 5, "plotas": "70 m2"},
{"kaina": "$25,99", "plotas": “50.22”},
{"kaina": 0, "plotas": "12"},
]
While the schema has different field names, and different formats, and we want data to be in this format
[
{"price": 5, "area": 70.00},
{"price": 25.99, "area": 50.22},
{"price": 0, "area": 12.00},
]
For this scenario, we can create schema field mapping and do transformation easier:
from dphelper import DPHelper
helper = DPHelper(is_verbose=True)
schema_mapping = {
"kaina": "price",
"plotas": "area",
}
# invalid format;
partial_data = [
{"kaina": 5, "kambariai": "1.5 kamb."},
{"kaina": "$25,99"},
{"kambariai": "2.2"},
]
# convert to right format
print(helper.transform_map(schema_mapping, data, parse=True))
Tip: always try to use structured data when possible. For example if data is stored in JSON file (or as variable in HTML), read entire thing as string and parse it as JSON. This is less fragile, since the order of attributes may (and likely will) change, new attributes will be added.
Example
HTML with JSON:
<div data-drupal-messages-fallback class="hidden"></div>
<script>
var data = {"posts": [{"id":"1","post_title":"LP", ..}]};
</script>
Instead of doing this:
pattern = re.compile(r'{"post_id":"(.*?)","post_title":"(.*?)","post_adress":"(.*?)","longitude":"(.*?)","latitude":"(.*?)","type":"(.*?)",.*?}')
results = pattern.findall(data)
Do this
content = content.split('var datat = ')[1].split(';var data =')[0]
json_data = json.loads(content)
posts = json_data.get('posts')