We have seen many scenarios and have therefore created a set of tools to help you automate data extraction and reuse tested patterns. These tools are available as dphelper, a pip package for Python (version 3.7 and later is supported). Here are some use cases for which we provide helper functions:
Parsing data. Some data arrives in an unstructured form, e.g. an HTML table represented as a list of strings. You provide the schema of the data you expect, and we transform it into the correct format, for example (str) “100,234.00 eur” => (int) 100234.
Avoiding getting blocked. Some targets (the websites we extract data from) perform extra checks that can result in captchas, so the HTML content cannot be retrieved. We have integrated third-party tools that use proxy servers and other techniques, such as headless browsers, to solve this for you.
Data platform API. Sometimes you will need to break down your data acquisition pipelines and reuse their parts, to reduce pipeline complexity and improve reusability (see best practices). We provide an API to combine, filter, and adjust data, or to retrieve a snapshot for a specific date.
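To make the parsing use case above more concrete, here is a minimal sketch of schema-driven conversion in plain Python. It does not use dphelper and is not its API: the schema format (column name mapped to a target type) and the names `SCHEMA`, `parse_number`, `convert` and `parse_row` are hypothetical, chosen only for illustration. It also assumes that "," is a thousands separator and "." is the decimal point.

```python
import re
from decimal import Decimal

# Hypothetical schema format (an assumption, not dphelper's API):
# column name -> target Python type, in the order the cells appear.
SCHEMA = {"name": str, "price": int}


def parse_number(text: str) -> Decimal:
    """Extract the first number in text, treating ',' as a thousands separator."""
    match = re.search(r"-?\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        raise ValueError(f"no number found in {text!r}")
    return Decimal(match.group().replace(",", ""))


def convert(value: str, target: type):
    """Convert one raw cell to the type requested by the schema."""
    if target is int:
        # int() drops any fractional part: "100,234.00" becomes 100234.
        return int(parse_number(value))
    if target is float:
        return float(parse_number(value))
    return value.strip()


def parse_row(cells, schema):
    """Pair each cell with its schema column and convert it."""
    if len(cells) != len(schema):
        raise ValueError("row length does not match schema")
    return {
        name: convert(cell, target)
        for (name, target), cell in zip(schema.items(), cells)
    }


row = parse_row(["  Widget ", "100,234.00 eur"], SCHEMA)
print(row)  # {'name': 'Widget', 'price': 100234}
```

Going through `Decimal` rather than `float` avoids rounding surprises before the final conversion. Locales that swap the roles of "," and "." would need a different separator rule.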