{"id":17,"date":"2023-11-02T22:07:34","date_gmt":"2023-11-02T22:07:34","guid":{"rendered":"https:\/\/blog.dataplatform.lt\/?p=17"},"modified":"2023-11-02T22:07:34","modified_gmt":"2023-11-02T22:07:34","slug":"custom-validators","status":"publish","type":"post","link":"https:\/\/blog.dataplatform.lt\/?p=17","title":{"rendered":"Custom validators"},"content":{"rendered":"\n<p>One of the most important goals of the data platform is to ensure data quality. One mechanism for this is JSON schema, as discussed above. However, we also let you define custom validation rules for your dataset. Here are some scenarios where this may be useful.<\/p>\n\n\n\n<p><em>Contextual validation<\/em>. JSON schema works for most scenarios where a single value needs to be validated (for example, the number of rooms in an apartment), but it is poorly suited to validating a value in context (for example, that when the number of rooms is 1, the area should be less than 50 square meters).<\/p>\n\n\n\n<p><em>Custom logic. <\/em>You may need very specific logic, for example normalizing a value (making a string lowercase and then checking whether it is in a list of allowed values).<\/p>\n\n\n\n<p>We support warning-level and error-level validators. Warning-level validators indicate that data MAY be invalid. They are implemented by printing to the console, and the printed text is shown as the message. Error-level validators block the pipeline and need immediate attention. They are implemented by raising an exception with an error message. Input data is read from standard input as JSON.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Example validator<\/h3>\n\n\n\n<p>Let's explore a real-life scenario: a custom validator that ensures each apartment in the list has a reasonable price per square meter.
For this we define a wider range, outside of which the validator raises an exception, and a narrower range for warnings.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import sys\nimport json\n\n# Input data arrives on standard input as a JSON list of apartments.\napartments = json.loads(sys.stdin.read())\n\n# CONFIG START\n# A price per square meter outside this range prints a warning.\nWARNING_PRICE_MIN = 1500\nWARNING_PRICE_MAX = 6000\n\n# A price per square meter outside this range raises an error.\nERROR_PRICE_MIN = 500\nERROR_PRICE_MAX = 10000\n# CONFIG END\n\nfor apartment in apartments:\n    flat_id = apartment.get('id')\n    price = apartment.get('price')\n    area = apartment.get('area')\n\n    # Skip apartments with a missing or zero price or area.\n    if area and price:\n        price_sqm = price \/ area\n        if price_sqm &lt; ERROR_PRICE_MIN:\n            raise Exception('Price for %s too low %f' % (flat_id, price_sqm))\n        elif price_sqm > ERROR_PRICE_MAX:\n            raise Exception('Price for %s too high %f' % (flat_id, price_sqm))\n        elif price_sqm &lt; WARNING_PRICE_MIN:\n            print('Price for %s MAY BE too low %f' % (flat_id, price_sqm))\n        elif price_sqm > WARNING_PRICE_MAX:\n            print('Price for %s MAY BE too high %f' % (flat_id, price_sqm))<\/code><\/pre>\n\n\n\n<p>Let's break down the parts of the validator. First, read the data:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import sys\nimport json\napartments = json.loads(sys.stdin.read())<\/code><\/pre>\n\n\n\n<p>Then write your logic. This is an example of warning-level handling:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>elif price_sqm &lt; WARNING_PRICE_MIN:\n    print('Price for %s MAY BE too low %f' % (flat_id, price_sqm))<\/code><\/pre>\n\n\n\n<p>This is an example of error-level handling:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>elif price_sqm > ERROR_PRICE_MAX:\n    raise Exception('Price for %s too high %f' % (flat_id, price_sqm))<\/code><\/pre>\n\n\n\n<p>A validator may trigger both a warning and an error.
In that case, the higher-severity (error) result is shown in our tool.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Testing your validator<\/h3>\n\n\n\n<p>Now that you have your validator code, it's time to test it! We recommend testing with both a valid and an invalid scenario to make sure it works. First, let's create a valid JSON file, APARTMENTS_VALID.json:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>&#91;{\"area\": 50, \"price\": 200000, \"id\": \"1\"}]<\/code><\/pre>\n\n\n\n<p>Test it by piping the file into the validator from a terminal:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>>>\n>> python3 price_validator.py &lt; APARTMENTS_VALID.json\n>>\n<\/code><\/pre>\n\n\n\n<p>Since the data is valid (200000 \/ 50 = 4000 per square meter, which is inside the warning range), nothing is printed, as expected. Now let's create an invalid data sample, APARTMENTS_INVALID.json:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>&#91;{\"area\": 50, \"price\": 20000000, \"id\": \"1\"}]<\/code><\/pre>\n\n\n\n<p>Let's run it from the command line. The price per square meter is 20000000 \/ 50 = 400000, so the validator raises an exception and Python prints a traceback that ends with the error message:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>>>\n>> python3 price_validator.py &lt; APARTMENTS_INVALID.json\n>> Exception: Price for 1 too high 400000.000000\n<\/code><\/pre>\n\n\n\n<p>We have just written and tested our first validator! Good luck creating new rules to ensure data quality.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><br>Best practices<\/h3>\n\n\n\n<p>Avoid mixing several checks in one validator, for example checking the price per square meter as above and also checking that apartment IDs are unique. Write separate validators instead: when an error occurs (an exception is raised), execution stops, so any remaining checks in the same validator never run and further insights about data quality are lost.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>One of the most important goals of the data platform is to ensure data quality. One of the mechanisms is using JSON schema, as discussed above. However, we also provide ability to provide custom rules for your dataset.
Here are some scenarios when it may be useful. Contextual validation. While JSON schema works for most &hellip; <a href=\"https:\/\/blog.dataplatform.lt\/?p=17\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Custom validators&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[5],"tags":[],"class_list":["post-17","post","type-post","status-publish","format-standard","hentry","category-format"],"_links":{"self":[{"href":"https:\/\/blog.dataplatform.lt\/index.php?rest_route=\/wp\/v2\/posts\/17","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blog.dataplatform.lt\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.dataplatform.lt\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.dataplatform.lt\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.dataplatform.lt\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=17"}],"version-history":[{"count":1,"href":"https:\/\/blog.dataplatform.lt\/index.php?rest_route=\/wp\/v2\/posts\/17\/revisions"}],"predecessor-version":[{"id":18,"href":"https:\/\/blog.dataplatform.lt\/index.php?rest_route=\/wp\/v2\/posts\/17\/revisions\/18"}],"wp:attachment":[{"href":"https:\/\/blog.dataplatform.lt\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=17"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.dataplatform.lt\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=17"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.dataplatform.lt\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=17"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}