AWS Athena + DBT Integration

In 15 minutes or less

Photo by Etienne Girardet on Unsplash

Context

AWS Athena is serverless and intended for ad-hoc SQL queries against data in AWS S3. However, maintaining data lineage and dependencies is tedious and error-prone (though this is no different in data warehouses).

DBT (Data Build Tool) has recently become extremely popular because it can automatically draw data lineage and generate documentation for a data pipeline, not to mention its other features like snapshots, Jinja, and macro support.

There are tons of references on integrating DBT with data warehouses (Snowflake, Redshift, Databricks, etc.) but not with Athena (yes, it is not a data warehouse). However, there are many teams and use cases using Athena to generate BI reports and more, for example when:

  • The report has 20+ views generated from various AWS Data Catalog databases and tables.
  • Data lineage is important in order to understand the business logic, which changes frequently.
  • Athena's query performance is satisfactory.
  • There is a lack of resources/budget to implement and maintain a data warehouse.

Is there any chance we could integrate DBT with Athena to provide an affordable solution?

Solution

Thanks to the community, which built a DBT Athena adapter, I was able to build a demo to verify how the integration works. All AWS infra resources are managed by Terraform and provided in my GitHub repo, so you can build the same E2E demo in 15 minutes (or even less) for POC (proof of concept), internal demo, or self-learning purposes.

Raw Data

The Women’s E-Commerce Clothing Reviews dataset¹ (CC0: Public Domain license²) is used; a statistical summary of the data can be found on the Kaggle website.

Image by Author

In short, this data contains customer reviews of products and their corresponding ratings. A typical use case is to run sentiment analysis on this data after proper feature engineering (data transformation).

Dataflow

In this demo, the raw data is stored in S3, with Glue services used to extract its schema for Athena to query, followed by DBT's transformations.

Image by Author
  1. A Glue crawler is manually triggered to infer the data schema and update the Glue Data Catalog.
  2. The Glue Data Catalog creates a reference to the data (an S3-to-table mapping); no data is moved.
  3. Athena queries the data in S3 using SQL, with schema lookups against the Glue Data Catalog; no data needs to be loaded.
  4. DBT applies software engineering methodology to the data transformations (SQL).

It is expected that the reader has some basic knowledge of Terraform, AWS, and DBT; otherwise, this is also a good mini-project for learning about those services.

Implementation

Terraform Part

  • Make sure a proper AWS user and permissions are set up. For the sake of simplicity, my AWS user simply has the AdministratorAccess policy attached.
  • The profile name in the Terraform main.tf should match the user profile defined in “~/.aws/credentials”, which is used to call AWS services.

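As a rough sketch, the provider and random-id setup in main.tf might look like the following (resource and variable names here are illustrative; see the GitHub repo for the actual configuration):

```hcl
# The profile must match an entry in ~/.aws/credentials
provider "aws" {
  region  = var.region             # "ap-southeast-2" by default
  profile = var.aws_profile_name
}

# Random suffix used to keep globally unique resources (e.g. S3 buckets) unique
resource "random_id" "suffix" {
  byte_length = 8                  # renders as a 16-hex-digit id
}

resource "aws_s3_bucket" "demo" {
  bucket = "athena-dbt-demo-${random_id.suffix.hex}"
}
```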

Notes:

  • The region is set to “ap-southeast-2” by default; feel free to update it in variables.tf.
  • Some AWS resources (S3 buckets etc.) need to be globally unique, so a random id generator is used in the Terraform code to ensure uniqueness across different users. You will see the {16 digits id} convention several times in this article; it refers to this random id.

Run the “terraform init” command to initialize the project.

Image by Author

Run the “terraform apply” command to create all AWS resources.

Image by Author

Manually trigger/run the crawler job (named “athena_dbt_demo_crawler” per this demo and configured in terraform) on AWS web console.

Image by Author

Once done, you will find the table “women_clothes_reviews_raw_data” under the database raw_data_{16 digits id} in Athena (note: please select the athena-dbt-demo-workgroup instead of primary). Now this raw data is ready to query.
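As a quick sanity check, you could preview the crawled table from the Athena console (using the sample random id from this article; substitute your own):

```sql
-- Preview the raw table registered by the Glue crawler
SELECT *
FROM "raw_data_9f40f41fee84f437"."women_clothes_reviews_raw_data"
LIMIT 10;
```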

Image by Author

You will also see the source raw data under the following bucket in S3: Athena-dbt-demo-{16 random id}/raw_data/women_clothing_ecommerce_reviews.csv. This file is uploaded to S3 automatically during the Terraform Apply phase.

DBT Part

Image by Author
  • I would suggest installing DBT through a clean virtual environment (through conda etc.). Note: requirements.txt is under the folder “Athena-dbt-demo-dbt-demo”.

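For reference, a clean environment setup might look like the following sketch (the environment name and adapter package name are assumptions; installing from the repo's requirements.txt is the safe option):

```shell
# Create and activate an isolated environment
conda create -n athena-dbt python=3.9 -y
conda activate athena-dbt

# Install dbt plus the community Athena adapter
pip install -r requirements.txt    # or: pip install dbt-athena-community

# Verify the installation
dbt --version
```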

Once done, check the version; now we are good to go!

Image by Author

Create your DBT profile if it doesn’t exist (~/.dbt/profiles.yml) or add this project to your existing DBT profile.

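A minimal profile for the dbt-athena adapter could look like this sketch (the profile and target names are illustrative; field names follow the adapter's conventions, and the placeholder values must be replaced per the notes below):

```yaml
athena_dbt_demo:
  target: dev
  outputs:
    dev:
      type: athena
      region_name: ap-southeast-2
      aws_profile_name: my-aws-profile        # match the profile in main.tf
      s3_staging_dir: s3://athena-dbt-demo-athena-query-result-bucket-{16 digits id}/
      schema: athena_dbt_models_{16 digits id}  # database for DBT model results
      database: awsdatacatalog
      work_group: athena-dbt-demo-workgroup
      threads: 4
```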

Note:

You should only update the following three values in the above file to match your setup; keep the other values (which match the Terraform configuration) untouched.

  • aws_profile_name: The same AWS profile name as configured in the Terraform main.tf and in your local AWS credentials.
  • s3_staging_dir: The S3 bucket created to store Athena query results; it is named “athena-dbt-demo-athena-query-result-bucket-{16 digits id}” in your S3 account.
  • schema: The database configured to hold the DBT model results; it should match athena_dbt_models_{16 digits id}.

Lastly, update the database referenced in models/vw_women_clothes_reviews_rename_columns.sql to match the raw_data_{16 digits id} created.

...
FROM "RAW_DATA_9f40f41fee84f437"."WOMEN_CLOTHES_REVIEWS_RAW_DATA"

About Data Transformation

Within DBT, we implemented three transformations (/models) on top of the raw data.

  • (View) vw_women_clothes_reviews_rename_columns conducts the first transformation: selecting the needed columns and renaming them.

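A sketch of what such a renaming model might look like (the raw column names are assumptions based on the Kaggle dataset and how the crawler might infer them; the exact selection in the repo may differ):

```sql
-- models/vw_women_clothes_reviews_rename_columns.sql
SELECT
    clothing_id,
    age,
    title,
    review_text,
    rating          AS star_rating,
    recommended_ind AS recommended
FROM "raw_data_{16 digits id}"."women_clothes_reviews_raw_data"
```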

  • (View) vw_women_clothes_reviews_sentiments creates a sentiment score based on a pre-defined rule on the column star_rating.

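The sentiment rule can be sketched as a simple CASE expression on the star rating (the thresholds below are illustrative assumptions, not necessarily the repo's exact rule); note the {{ ref() }} call, which is what lets DBT draw the lineage between models:

```sql
-- models/vw_women_clothes_reviews_sentiments.sql
SELECT
    *,
    CASE
        WHEN star_rating >= 4 THEN 1   -- positive
        WHEN star_rating = 3  THEN 0   -- neutral
        ELSE -1                        -- negative
    END AS sentiment_score
FROM {{ ref('vw_women_clothes_reviews_rename_columns') }}
```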

  • (View) vw_women_clothes_reviews_training_data conducts some final filtering and serves as the transformed result.

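A sketch of this final step (the filter condition is illustrative):

```sql
-- models/vw_women_clothes_reviews_training_data.sql
SELECT
    review_text,
    sentiment_score
FROM {{ ref('vw_women_clothes_reviews_sentiments') }}
WHERE review_text IS NOT NULL
```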

DBT Run

Finally, running the command “dbt run” will compile all three models defined above and land the results under the AwsDataCatalog database “athena_dbt_models_{16 digits id}”, where you can also find them in Athena.

Image by Author

One key feature of DBT is that it provides data lineage and documentation.

  • Running the command “dbt docs generate” will generate the docs.
  • Then set up a local web server through the command “dbt docs serve --port 8001” (note the two hyphens before port) to visualise the docs.

Data Lineage:

  • vw_women_clothes_reviews_training_data depends on vw_women_clothes_reviews_sentiments, which is in turn based on vw_women_clothes_reviews_rename_columns.
Image by Author

Data Document:

Image by Author

Summary

Thanks for your interest; I hope you enjoyed it. Despite the fact that Athena is normally used for ad-hoc queries, it can still benefit from DBT integration at a low cost.

Let’s have a quick recap of this article:

  • Product review data is loaded into S3 and “connected” to the SQL query service Athena through AWS Glue services.
  • All AWS resources in this demo are managed by Terraform (IaC) so users can deploy/destroy them in seconds.
  • BI reports based on Athena will benefit a lot from DBT’s data lineage and documentation features.

[1] Women’s E-Commerce Clothing Reviews https://www.kaggle.com/datasets/nicapotato/womens-ecommerce-clothing-reviews?resource=download

[2] CC0: Public Domain License https://creativecommons.org/publicdomain/zero/1.0/


AWS Athena + DBT Integration was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.
