Why should you use dbt for data engineering?

Transforming data in a database can be a tedious task, especially if you need several transformations in a row (as in most data warehouses). In analytics it is common to have several steps between the source of the data and the reporting layer where the transformations happen. Plain-vanilla SQL does the trick, but scheduling and maintaining the workflow can be time-consuming and frustrating. Fortunately, dbt allows analysts to define a workflow and run it in a very easy and efficient way. As the dbt website states:

Analytics engineering is the data transformation work that happens between loading data into your warehouse and analyzing it. dbt allows anyone comfortable with SQL to own that workflow.

This post explains some cornerstones of dbt and why I like to use it in analytical projects.

Structure:

  1. General Information about dbt
  2. SQL and Jinja
  3. Directed Acyclic Graph (DAG)
  4. Do Not Repeat Yourself Principle (DRY)
  5. Testing
  6. Conclusion

1. General Information about dbt:

In general, dbt is a framework that transforms your data inside a data warehouse like BigQuery, Redshift, and the like. Basically, the framework lets the analyst define a workflow that transforms the data from the source tables to the data marts that are used for reporting or data science projects.
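
To make this concrete, here is a minimal sketch of what a single dbt model could look like (the source name shop, the table raw_orders, and the column names are assumptions for illustration; the source would also have to be declared in the project):

    -- models/staging/stg_orders.sql
    -- A model is just a SELECT statement; dbt materializes it as a
    -- view or a table depending on the configuration.
    {{ config(materialized='view') }}

    select
        id as order_id,
        customer_id,
        cast(created_at as date) as order_date
    from {{ source('shop', 'raw_orders') }}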

2. SQL and Jinja:

dbt is based on plain-vanilla SQL mixed with Jinja functionality. The analyst can build templates and macros; the templating capabilities in particular are incredibly powerful and can greatly improve the productivity of an analytics team.
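
As a sketch of the templating, the following model uses a Jinja loop to pivot payment methods into columns (the model stg_payments and its columns are made up for illustration):

    -- models/payments_pivoted.sql
    -- The Jinja for-loop expands into one aggregation per payment
    -- method instead of three hand-written CASE statements.
    {% set payment_methods = ['card', 'paypal', 'bank_transfer'] %}

    select
        order_id,
        {% for method in payment_methods %}
        sum(case when payment_method = '{{ method }}' then amount else 0 end)
            as {{ method }}_amount{% if not loop.last %},{% endif %}
        {% endfor %}
    from {{ ref('stg_payments') }}
    group by order_id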

3. Directed Acyclic Graph (DAG):

dbt is built around the notion of models and DAGs. A model is a single SQL script that produces one output (a table or a view). Multiple models stacked together through the ref() function form one workflow, and that workflow is organized as a DAG: the graph flows in one direction without ever returning to a previous node (i.e., without cycles). Suppose you have three models: model A (perhaps data cleaning), model B (some sort of business logic), and model C (perhaps some aggregation for reporting). Then it seems reasonable that C builds on top of B's results and B on top of A's results, so the data flows A -> B -> C. This is a valid DAG, and dbt can easily execute this workflow. A -> B -> C -> B would be invalid because of the cycle B -> C -> B.
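
Here is a minimal sketch of such a chain, with ref() wiring the models together (the model names, columns, and the 1000 threshold are invented for illustration):

    -- models/model_a.sql: data cleaning
    select order_id, customer_id, amount
    from {{ source('shop', 'raw_orders') }}
    where amount is not null

    -- models/model_b.sql: business logic, builds on A via ref()
    select customer_id, sum(amount) as revenue
    from {{ ref('model_a') }}
    group by customer_id

    -- models/model_c.sql: reporting aggregate, builds on B
    select count(*) as key_accounts
    from {{ ref('model_b') }}
    where revenue >= 1000

dbt parses the ref() calls, derives the graph A -> B -> C from them, and runs the models in that order.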

4. Do Not Repeat Yourself Principle (DRY):

Furthermore, dbt embraces the idea that analysts should not repeat their code. The framework is built around the DRY principle, and analysts should always be encouraged to follow it. The goal is to keep the code clean and to put business logic and reusable parts into macros, so that other analysts can understand and reuse the code.
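
As a sketch, a reusable piece of logic can live in a macro that every model in the project can call (the macro cents_to_dollars and the column names are made up for illustration):

    -- macros/cents_to_dollars.sql
    {% macro cents_to_dollars(column_name, precision=2) %}
        round({{ column_name }} / 100.0, {{ precision }})
    {% endmacro %}

    -- usage inside a model:
    select
        order_id,
        {{ cents_to_dollars('amount_cents') }} as amount
    from {{ ref('stg_payments') }}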

5. Testing:

Keeping the data consistent is normally one of the most important but also most complex tasks. dbt lets you write test scripts for each model and also across different models. It comes with off-the-shelf tests, and you can write your own test scripts as well, tailored to your business problem.
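
Besides the built-in generic tests such as unique and not_null, a custom (singular) test is just a SQL file in the tests/ directory; dbt fails the test if the query returns any rows. A sketch against the hypothetical model_b from above:

    -- tests/assert_revenue_not_negative.sql
    -- The test fails if any customer ends up with negative revenue.
    select customer_id, revenue
    from {{ ref('model_b') }}
    where revenue < 0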

6. Conclusion:

In my opinion, dbt is a powerful tool for building the whole data stack. Especially with cloud data warehouses like BigQuery or Redshift, in combination with data integration services like StitchData or Fivetran, it can be a very nice tool. It is easy to learn and built on top of SQL. In my previous job, I built the whole data stack with dbt, from the sources to the data marts that fed several Tableau reports. I can highly recommend the tool to every data analyst.

But now, let’s install dbt.