{"id":30,"date":"2023-11-02T22:52:48","date_gmt":"2023-11-02T22:52:48","guid":{"rendered":"https:\/\/blog.dataplatform.lt\/?p=30"},"modified":"2023-11-02T22:53:12","modified_gmt":"2023-11-02T22:53:12","slug":"parsing-unstructured-data","status":"publish","type":"post","link":"https:\/\/blog.dataplatform.lt\/?p=30","title":{"rendered":"Parsing unstructured data"},"content":{"rendered":"\n<p>In some cases you will have access to structured data (for example, an API request that returns JSON), but in many situations the data is only available as unstructured text embedded in HTML. For example, a list of apartments:<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table><tbody><tr><td><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/lh7-us.googleusercontent.com\/z3xnfYpNmaMF-P_dXMiGQEcTKnAqT7i3HsSFqI9BXwGl9t6r53G8qnVNKoHZsLn672xoAkdrhjr2Nw2CKtWY1M0IetpDqZvXGVGrKxzXHcnwXsDcX349THkGbcy6VWFKx0K-Xayo-InxhEgbWwZlnKI\" width=\"286\" height=\"147\"><\/td><td><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/lh7-us.googleusercontent.com\/9XSP7xcDckWvbMzbxErJ7YkfqxGz1-6zE5Lui2-pAqjKAjAP9M4OPmSQvm5Or9sAEgSJXWXc1WNxUR3Q_SH_rs6aqvmKpNpRkHwoXizq_XScM_exV7IgexCX-ht3nSAqIhSHsV22_IprBPz56_CylKc\" width=\"286\" height=\"181\"><\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p>One valid approach is to extract the values with a regular expression and convert each one into the corresponding format, for example:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import re\n\nrg = re.compile('&lt;tr class=\".*?\">&lt;td class=\".*?\">(.*?)&lt;\/td>&lt;td class=\".*?\">(.*?)&lt;\/td>&lt;td class=\".*?\">(.*?)&lt;\/td>&lt;td class=\".*?\">(.*?)&lt;\/td>&lt;td class=\".*?\">(.*?)&lt;\/td>&lt;td class=\".*?\">(.*?)&lt;\/td>&lt;td class=\".*?\">(.*?)&lt;\/td>&lt;td class=\".*?\">(.*?)&lt;\/td>&lt;td class=\".*?\">&lt;a href=\"(.*?)\".*?>&lt;\/a>.*?&lt;\/td>&lt;td class=\".*?\">(.*?)&lt;\/td>&lt;\/tr>')\nresults = rg.findall(content)\n\ntmp = &#91;]\nfor id, floor, area, room, status, .. 
in results:\n  tmp.append({\n     'area': float(area.replace('m2', '')),  # strip the unit suffix\n     'status': 0 if status == 'Laisvas' else 1,  # 0 = Laisvas, 1 = anything else\n     # .. remaining fields\n  })\n<\/code><\/pre>\n\n\n\n<p>However, this can be repetitive and error-prone. Instead, you can use a helper function that accepts a schema and converts the data into the right format:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import re\n\nrg = re.compile('&lt;tr class=\".*?\">&lt;td class=\".*?\">(.*?)&lt;\/td>&lt;td class=\".*?\">(.*?)&lt;\/td>&lt;td class=\".*?\">(.*?)&lt;\/td>&lt;td class=\".*?\">(.*?)&lt;\/td>&lt;td class=\".*?\">(.*?)&lt;\/td>&lt;td class=\".*?\">(.*?)&lt;\/td>&lt;td class=\".*?\">(.*?)&lt;\/td>&lt;td class=\".*?\">(.*?)&lt;\/td>&lt;td class=\".*?\">&lt;a href=\"(.*?)\".*?>&lt;\/a>.*?&lt;\/td>&lt;td class=\".*?\">(.*?)&lt;\/td>&lt;\/tr>')\nresults = rg.findall(content)\n\napts = helper.parse_rows(\n  &#91;'id','floor','area','rooms','status'],\n  results,\n  verbose=True,\n)\n\n<\/code><\/pre>\n\n\n\n<p class=\"has-light-blue-background-color has-background\">Useful: while building a data pipeline you can pass verbose=True to make debugging easier, and later switch it to verbose=False (or remove the parameter).<\/p>\n\n\n\n<p>Here is the list of currently supported field types:<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table><tbody><tr><td><strong>Field name<\/strong><\/td><td><strong>Description<\/strong><\/td><td><strong>Example inputs \/ outputs<\/strong><\/td><\/tr><tr><td>id<\/td><td>unique identifier<\/td><td><\/td><\/tr><tr><td>floor<\/td><td>floor for apt<\/td><td>\u201ctrecias\u201d (=3)<\/td><\/tr><tr><td>area<\/td><td>area for apt<\/td><td>\u201c50.5\u201d (=50.5)<br>\u201c50,5\u201d (=50.5)<\/td><\/tr><tr><td>rooms<\/td><td>number of rooms for apt<\/td><td>\u201cdu\u201d (=2)<br>\u201c2 kamb.\u201d (=2)<\/td><\/tr><tr><td>orientation<\/td><td>orientation<\/td><td>\u201cSiaure\u201d (=[\u201cN\u201d])<\/td><\/tr><tr><td>price<\/td><td>price of 
apt<\/td><td>\u201c100.200,00\u201d (=100200)<\/td><\/tr><tr><td>status<\/td><td>status for apt<\/td><td>\u201cLaisvas\u201d (=0)<br>\u201cRezervuotas\u201d (=1)<br>\u201cParduotas\u201d (=2)<\/td><\/tr><tr><td>floor<\/td><td>floor no for apt<\/td><td>\u201cAntras\u201d (=2)<\/td><\/tr><tr><td>www<\/td><td>extract link<\/td><td><\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p>Important: if a schema field is not recognized, its value is passed through the helper unchanged (in its original format).<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In some cases you will have access to structured data (for example, an API request that returns JSON), but in many situations the data is only available as unstructured text embedded in HTML. For example, a list of apartments: One valid approach is to extract the values with a regular expression and convert each one into the corresponding format, for example: &hellip; <a href=\"https:\/\/blog.dataplatform.lt\/?p=30\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Parsing unstructured 
data&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[8],"tags":[],"class_list":["post-30","post","type-post","status-publish","format-standard","hentry","category-helper-library-dphelper"],"_links":{"self":[{"href":"https:\/\/blog.dataplatform.lt\/index.php?rest_route=\/wp\/v2\/posts\/30","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blog.dataplatform.lt\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.dataplatform.lt\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.dataplatform.lt\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.dataplatform.lt\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=30"}],"version-history":[{"count":2,"href":"https:\/\/blog.dataplatform.lt\/index.php?rest_route=\/wp\/v2\/posts\/30\/revisions"}],"predecessor-version":[{"id":32,"href":"https:\/\/blog.dataplatform.lt\/index.php?rest_route=\/wp\/v2\/posts\/30\/revisions\/32"}],"wp:attachment":[{"href":"https:\/\/blog.dataplatform.lt\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=30"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.dataplatform.lt\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=30"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.dataplatform.lt\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=30"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}