{"id":33,"date":"2023-11-02T22:57:18","date_gmt":"2023-11-02T22:57:18","guid":{"rendered":"https:\/\/blog.dataplatform.lt\/?p=33"},"modified":"2023-11-02T22:57:18","modified_gmt":"2023-11-02T22:57:18","slug":"parsing-structured-data","status":"publish","type":"post","link":"https:\/\/blog.dataplatform.lt\/?p=33","title":{"rendered":"Parsing structured data"},"content":{"rendered":"\n<p>In many scenarios you will be able to access structured data (for example, JSON returned by an API). However, the data platform requires every element of your data to have one specific, exact shape (it must match the JSON Schema). Suppose the API of the target website returns a list of apartments in this format:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>&#91;\n    {\"kaina\": 5, \"plotas\": \"70 m2\"},\n    {\"kaina\": \"$25,99\", \"plotas\": \"50.22\"},\n    {\"kaina\": 0, \"plotas\": \"12\"}\n]\n<\/code><\/pre>\n\n\n\n<p>The schema, however, uses different field names and value formats, so we want the data to look like this:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>&#91;\n    {\"price\": 5, \"area\": 70.00},\n    {\"price\": 25.99, \"area\": 50.22},\n    {\"price\": 0, \"area\": 12.00}\n]\n<\/code><\/pre>\n\n\n\n<p>For this scenario, we can define a schema field mapping and let the helper do the transformation:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>from dphelper import DPHelper\n\nhelper = DPHelper(is_verbose=True)\n\n# source field name -> schema field name\nschema_mapping = {\n  \"kaina\": \"price\",\n  \"plotas\": \"area\",\n}\n\n# input that does not match the schema yet\npartial_data = &#91;\n    {\"kaina\": 5, \"kambariai\": \"1.5 kamb.\"},\n    {\"kaina\": \"$25,99\"},\n    {\"kambariai\": \"2.2\"},\n]\n\n# convert to the right format\nprint(helper.transform_map(schema_mapping, partial_data, parse=True))\n<\/code><\/pre>\n\n\n\n<p class=\"has-light-blue-background-color has-background\">Tip: always try to use structured data when possible. 
For example, if the data is stored in a JSON file (or as a variable in HTML), read the entire thing as a string and parse it as JSON. This is less fragile than matching the text with a pattern, since the order of attributes may (and likely will) change and new attributes may be added.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Example<\/h3>\n\n\n\n<p>HTML with JSON:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>&lt;div data-drupal-messages-fallback class=\"hidden\">&lt;\/div>\n&lt;script>\n<strong>  var data = {\"posts\": &#91;{\"id\":\"1\",\"post_title\":\"LP\", ..}]};<\/strong>\n&lt;\/script>\n<\/code><\/pre>\n\n\n\n<p>Instead of doing this:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import re\n\npattern = re.compile(r'{\"post_id\":\"(.*?)\",\"post_title\":\"(.*?)\",\"post_adress\":\"(.*?)\",\"longitude\":\"(.*?)\",\"latitude\":\"(.*?)\",\"type\":\"(.*?)\",.*?}')\nresults = pattern.findall(data)\n<\/code><\/pre>\n\n\n\n<p>Do this:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import json\n\n# content holds the page HTML; cut out the object assigned to var data\nraw = content.split('var data = ', 1)&#91;1].split('&lt;\/script>', 1)&#91;0]\njson_data = json.loads(raw.strip().rstrip(';'))\nposts = json_data.get('posts')\n<\/code><\/pre>\n","protected":false},"excerpt":{"rendered":"<p>In many scenarios you will be able to access structured data (for example, JSON returned by an API). However, the data platform requires every element of your data to have one specific, exact shape (it must match the JSON Schema). 
Suppose the API of the target website returns a list of apartments in this format: While the &hellip; <a href=\"https:\/\/blog.dataplatform.lt\/?p=33\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Parsing structured data&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[8],"tags":[9],"class_list":["post-33","post","type-post","status-publish","format-standard","hentry","category-helper-library-dphelper","tag-dphelper"],"_links":{"self":[{"href":"https:\/\/blog.dataplatform.lt\/index.php?rest_route=\/wp\/v2\/posts\/33","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blog.dataplatform.lt\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.dataplatform.lt\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.dataplatform.lt\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.dataplatform.lt\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=33"}],"version-history":[{"count":1,"href":"https:\/\/blog.dataplatform.lt\/index.php?rest_route=\/wp\/v2\/posts\/33\/revisions"}],"predecessor-version":[{"id":34,"href":"https:\/\/blog.dataplatform.lt\/index.php?rest_route=\/wp\/v2\/posts\/33\/revisions\/34"}],"wp:attachment":[{"href":"https:\/\/blog.dataplatform.lt\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=33"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.dataplatform.lt\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=33"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.dataplatform.lt\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=33"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}