Data is a vital element in business decision-making. Modern technologies and algorithms allow for processing and storage of huge amounts of data, converting it into useful predictions and insights. But they also require high-quality data to ensure prediction accuracy and insight value.
In today’s world, the importance of data quality validation is hard to overstate. For instance, a 2020 Gartner survey found that organizations estimate the average cost of poor data quality at $12.8 million per year, and this number will likely rise as business environments become increasingly complex.
Assuring data quality is possible with modern data pipelines, which should include data quality components by default. I have solid experience in the Data Quality Assurance (Data QA) niche and understand how to achieve data quality effectively. I will share some of that expertise in this article.
Great Expectations – A Data QA Tool of Choice
To begin with, let’s talk about one of the best Data QA tools – Great Expectations (GX).
Great Expectations is an open-source, Python-based data quality tool. GX helps data teams profile, test, and build reports on data. Its friendly command-line interface (CLI) lets you easily set up new tests and quickly customize the available test reports. GX integrates with various extract, transform, and load (ETL) tools, such as Airflow, as well as with many databases. (The list of integrations and the official documentation are available on the Great Expectations website.)
Most importantly, Great Expectations supports AWS.
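GX tests are declarative "expectations" grouped into suites that are stored as JSON, which is what makes generating them automatically (for example, from profiling results) practical. As a minimal hand-written sketch of that format (the suite name and column names here are hypothetical examples, not from the original pipeline):

```python
import json

# A minimal expectation suite in the classic GX JSON layout.
# The suite name and columns ("order_id", "amount") are hypothetical.
suite = {
    "expectation_suite_name": "orders.warning",
    "expectations": [
        {
            "expectation_type": "expect_column_values_to_not_be_null",
            "kwargs": {"column": "order_id"},
        },
        {
            "expectation_type": "expect_column_values_to_be_between",
            "kwargs": {"column": "amount", "min_value": 0, "max_value": 10000},
        },
    ],
    "meta": {"notes": "Generated from profiling results"},
}

# Serialize the suite so it can be stored (e.g., in S3) and loaded later.
print(json.dumps(suite, indent=2))
```

Because the suite is plain JSON, a profiling step can emit one expectation entry per discovered column constraint and hand the whole file to GX for validation.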
Reporting on Data with Allure
Allure is the gold standard for reporting in QA. Allure enables managers and non-technical professionals to review test results and keep track of the testing process. That is why we decided to use Allure to display Data QA results, and we implemented a self-written adapter that converts GX results to the Allure format.
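The adapter itself is not shown here, but the core idea can be sketched in a few lines: each expectation result inside a GX validation result becomes one Allure test-result record. This is a simplified, hypothetical version; the field names follow the Allure 2 result-file layout (`uuid`, `name`, `status`, `start`, `stop`) and should be treated as an assumption rather than the actual production adapter:

```python
import json
import time
import uuid


def gx_to_allure(validation_result: dict) -> list:
    """Convert a GX validation result into a list of Allure-style result dicts."""
    allure_results = []
    now_ms = int(time.time() * 1000)
    for res in validation_result.get("results", []):
        config = res.get("expectation_config", {})
        name = config.get("expectation_type", "unknown_expectation")
        kwargs = config.get("kwargs", {})
        allure_results.append({
            "uuid": str(uuid.uuid4()),
            "name": f"{name} ({kwargs.get('column', 'n/a')})",
            "status": "passed" if res.get("success") else "failed",
            "start": now_ms,
            "stop": now_ms,
            # Expose expectation kwargs as Allure parameters for readability.
            "parameters": [{"name": k, "value": str(v)} for k, v in kwargs.items()],
        })
    return allure_results


# Example: a tiny GX-style validation result with one failed expectation.
sample = {
    "success": False,
    "results": [
        {
            "expectation_config": {
                "expectation_type": "expect_column_values_to_not_be_null",
                "kwargs": {"column": "order_id"},
            },
            "success": False,
        }
    ],
}

for record in gx_to_allure(sample):
    # Allure picks up files named "<uuid>-result.json"; here we just print.
    print(json.dumps(record))
```

In the real pipeline, each record would be written to its own `<uuid>-result.json` file in S3, from which the Allure report generator builds the final HTML report.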
We suggest the following Data QA approach for automating test creation:
- Retrieve tested data from data sources using AWS Lambda
- Run AWS Lambda with Pandas Profiling and generate tests for GX
- Run a GX test suite for each dataset, with all suites executing in parallel
- Store/serve results for each data source as a static Amazon S3 website
- Convert GX results to the Allure report format using AWS Lambda
- Store results in Amazon S3
- Generate Allure reports from the Allure format; reports are stored and served in Amazon S3
- Send the reports to a Slack channel with AWS Lambda
- Push results to Amazon DynamoDB (or Amazon S3 to reduce costs)
- Query the data in Amazon DynamoDB using Amazon Athena
- Create a dashboard with Amazon QuickSight
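The validation-and-store steps above can be tied together with a small Lambda handler. The sketch below is hypothetical: the event shape, bucket variable, key layout, and the placeholder validation result are assumptions rather than the actual implementation; only the boto3 S3 upload reflects real AWS API usage:

```python
import datetime
import json


def result_key(data_source: str, when: datetime.datetime) -> str:
    """Build the S3 key under which a validation result is stored (date-partitioned)."""
    return f"dq-results/{data_source}/{when:%Y/%m/%d}/result.json"


def handler(event, context):
    """Hypothetical Lambda entry point: validate one data source, store the result."""
    import boto3  # imported lazily so the module stays importable without AWS deps

    data_source = event["data_source"]
    # Placeholder for running the GX suite against the dataset; in the real
    # pipeline this step would load the data and invoke GX validation.
    validation_result = {"success": True, "results": []}

    key = result_key(data_source, datetime.datetime.utcnow())
    boto3.client("s3").put_object(
        Bucket=event["results_bucket"],
        Key=key,
        Body=json.dumps(validation_result).encode("utf-8"),
        ContentType="application/json",
    )
    return {"data_source": data_source, "s3_key": key, "success": True}


# Date-partitioned keys make the results easy to query later with Athena.
print(result_key("orders", datetime.datetime(2022, 5, 1)))
```

One handler per step (profiling, validation, Allure conversion, Slack notification) keeps each Lambda small and lets the per-dataset suites fan out in parallel.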
Building a Data Quality Gate
We now have all the components needed to build an efficient data quality gate. To simplify their deployment to AWS, we created a Terraform module – Data Quality Gate – that enables you to assure the quality of your data in one click. This module allows you to quickly deploy the infrastructure for DQ and generate the first test suite for your data. Use this module as a standard Terraform module for AWS-based deployments.
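Usage looks like a standard Terraform module call. The module source path and input variable names below are illustrative assumptions, not the module's actual interface; check the module's own documentation for the real inputs:

```hcl
module "data_quality_gate" {
  # Illustrative source; the real repository or registry path may differ.
  source = "github.com/your-org/terraform-aws-data-quality-gate"

  # Hypothetical inputs: where the tested data lives and where reports go.
  data_bucket_name    = "my-company-raw-data"
  reports_bucket_name = "my-company-dq-reports"
  slack_webhook_url   = var.slack_webhook_url
}
```

A plain `terraform init && terraform apply` then provisions the Lambdas, S3 buckets, and reporting pieces described above.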
Data quality is a fast-growing field, and many engineers work on it daily. Data Quality Engineers should build a solid pipeline for testing data and presenting the results to stakeholders, and leveraging open-source tools to deploy solutions faster plays a crucial role in that work.
The post Assuring Data Quality: How to Build a Serverless Data Quality Gate on AWS appeared first on Datafloq.