{"id":56,"date":"2023-11-05T22:50:48","date_gmt":"2023-11-05T22:50:48","guid":{"rendered":"https:\/\/blog.dataplatform.lt\/?p=56"},"modified":"2023-11-05T22:50:48","modified_gmt":"2023-11-05T22:50:48","slug":"writing-solution-for-challenge","status":"publish","type":"post","link":"https:\/\/blog.dataplatform.lt\/?p=56","title":{"rendered":"Writing solution for challenge"},"content":{"rendered":"\n<p>Let\u2019s say the data challenge is to extract product IDs and pricing information from an e-commerce site. Thus the schema may look like this:<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table><tbody><tr><td><strong>Field name<\/strong><\/td><td><strong>Field type<\/strong><\/td><\/tr><tr><td>product_id<\/td><td>INTEGER (REQUIRED)<\/td><\/tr><tr><td>price<\/td><td>FLOAT (REQUIRED)<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p>A very basic Python program that just prints hardcoded values in the required format looks like this:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import json\n\n# Hardcoded sample rows (placeholder values)\ndata = &#91;\n  {'product_id': 'iphone_10', 'price': 119.99},\n  {'product_id': 'iphone_11', 'price': 159.99},\n]\n\nprint(json.dumps(data))<\/code><\/pre>\n\n\n\n<p>This <em>would<\/em> be accepted by the platform, but of course the idea is that the data is real and up to date. 
Therefore, your code should instead fetch the HTML content, extract the required data, and structure it in the required format.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import json\nimport re\n\nimport requests\n\n\nresponse = requests.get('http:\/\/www.iphones.lt')\n\n# Models and prices are stored in an HTML table, one row per product\nrg = re.compile(r'&lt;tr>&lt;td>(.*?)&lt;\/td>&lt;td>(.*?)&lt;\/td>&lt;\/tr>')\n# Use response.text (str); response.content is bytes and cannot be searched with a str pattern\nresults = rg.findall(response.text)\n\ndata = &#91;]\nfor model_str, price_str in results:\n  data.append({\n    'product_id': model_str,\n    'price': float(price_str)\n  })\n\n\nprint(json.dumps(data))<\/code><\/pre>\n\n\n\n<p>The key part is the last step, where you print the JSON-serialized data to standard output using a regular print statement. Don\u2019t forget to serialize it using json.dumps()! Typically, the program should call <em>json.dumps(data)<\/em> only once, so that the output is valid JSON.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>print(json.dumps(data))<\/code><\/pre>\n\n\n\n<p class=\"has-yellow-background-color has-background\">Warning: don\u2019t forget to remove your debugging prints, as they will result in invalid data.<\/p>\n\n\n\n<p class=\"has-light-blue-background-color has-background\">Tip: go to the Utils section to learn how to make processing even easier.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Let\u2019s say the data challenge is to extract product IDs and pricing information from an e-commerce site. 
Thus the schema may look like this: Field name Field type product_id INTEGER (REQUIRED) price FLOAT (REQUIRED) An example of very basic program (in Python) which just prints hardcoded values in required format looks like this: This &hellip; <a href=\"https:\/\/blog.dataplatform.lt\/?p=56\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Writing solution for challenge&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[4],"tags":[],"class_list":["post-56","post","type-post","status-publish","format-standard","hentry","category-workflow"],"_links":{"self":[{"href":"https:\/\/blog.dataplatform.lt\/index.php?rest_route=\/wp\/v2\/posts\/56","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blog.dataplatform.lt\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.dataplatform.lt\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.dataplatform.lt\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.dataplatform.lt\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=56"}],"version-history":[{"count":1,"href":"https:\/\/blog.dataplatform.lt\/index.php?rest_route=\/wp\/v2\/posts\/56\/revisions"}],"predecessor-version":[{"id":57,"href":"https:\/\/blog.dataplatform.lt\/index.php?rest_route=\/wp\/v2\/posts\/56\/revisions\/57"}],"wp:attachment":[{"href":"https:\/\/blog.dataplatform.lt\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=56"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.dataplatform.lt\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=56"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.dataplatform.lt\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=56"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","te
mplated":true}]}}