What is dbt?
dbt (data build tool) is an open-source data transformation tool that has gained a lot of support and enthusiasm in the data analytics market over the past five years. dbt allows data analysts to be more involved in, or even own, the entire analytics engineering workflow. While that may sound like a bold claim, the popularity of dbt and its successful track record make it worth considering for any data analytics stack.
dbt is a command-line tool that performs the T (Transform) of ETL/ELT.
(Image Source: https://www.getdbt.com/product/)
At a minimum, you can use dbt to store and run your SQL code against the database in an organized and version-controlled way. At a maximum, you can build out complex data transformation workflows that leverage all the additional dbt functionality.
It is an all-too-familiar scenario: multiple team members run reports based on the same business logic and produce different results, which is frustrating to say the least. Keeping that logic in one shared, version-controlled dbt project helps prevent this kind of issue. What really differentiates the tool is that it was designed to follow software engineering best practices in terms of development workflow and lifecycle. This analytic framework empowers data analysts to write code that is version-controlled, thoroughly tested, documented, and easy to collaborate on.
Let us dive into the top seven reasons why dbt might be worth your time:
All the work remains in SQL. There is no need to learn a new language or figure out the abstraction layer behind a new tool. You essentially use the same SQL you would run in your data warehouse, so you know exactly what is going on. Jinja (a templating language for Python) is a great addition that makes SQL even more powerful by allowing control structures such as if statements and for loops.
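As a minimal sketch of what Jinja adds, the model below uses a for loop to generate one aggregated column per payment method instead of repeating the same expression by hand. The model name stg_payments, its columns, and the list of payment methods are assumptions made up for illustration.

```sql
{# Hypothetical list of values; in a real project this might come from a variable or a query. #}
{% set payment_methods = ['bank_transfer', 'credit_card', 'gift_card'] %}

select
    order_id,
    {# Emit one sum(...) column per payment method; skip the comma after the last one. #}
    {% for method in payment_methods %}
    sum(case when payment_method = '{{ method }}' then amount else 0 end) as {{ method }}_amount{% if not loop.last %},{% endif %}
    {% endfor %}
from {{ ref('stg_payments') }}  -- stg_payments is an assumed upstream model
group by order_id
```

When dbt compiles this model, the loop expands into three ordinary sum(...) columns, so the warehouse only ever sees plain SQL.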
dbt allows you to neatly organize all data transformations into discrete models. Each dbt model is a single select statement that either transforms raw data into the target dataset or serves as an intermediate step in such a transformation. Frequently used logic can be organized and materialized in the way that makes the most sense for collaboration, version control, and efficiency.
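A model file can be as small as the sketch below: one select statement plus an optional config call that tells dbt how to materialize the result. The file path, the raw table raw.shop.orders, and the column names are hypothetical, and how the raw table is qualified depends on your warehouse.

```sql
-- models/staging/stg_orders.sql (hypothetical file name)
-- Materialize this model as a view; 'table' is another common choice.
{{ config(materialized='view') }}

select
    id as order_id,
    customer_id,
    cast(created_at as date) as order_date,
    status
from raw.shop.orders  -- assumed raw table; adjust to your warehouse
```

dbt wraps the select in the appropriate create statement at run time, so the model file itself contains no DDL.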
You get access to an auto-generated dbt documentation website that shows existing models, related database objects as well as detailed information about every model. For example, one can see the model definition, both the source and the compiled SQL behind it, column definitions, and any data integrity checks.
In addition, dbt builds a directed acyclic graph (DAG) based on the interdependencies between models, which determines the order of model execution and provides a clear visualization of dependencies. That way, you never have to worry about how to sequence the model run to capture all the changes or how to parallelize it. Every model’s lineage graph is part of the documentation website, which can be very helpful for both organizational transparency and knowledge sharing.
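The edges of that graph come from ref() calls inside the models. In the sketch below, a hypothetical customer_orders model selects from a hypothetical upstream model named stg_orders; because of the ref(), dbt knows to build stg_orders first.

```sql
-- models/marts/customer_orders.sql (hypothetical file name)
{{ config(materialized='table') }}

select
    customer_id,
    count(*) as order_count,
    min(order_date) as first_order_date,
    max(order_date) as most_recent_order_date
from {{ ref('stg_orders') }}  -- ref() declares the dependency on the upstream model
group by customer_id
```

At compile time, ref('stg_orders') is replaced with the fully qualified name of that model in the current target, which is also what lets the same code run against a development schema or production.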
4. Version Control & CI/CD
dbt integrates nicely with Git. What is even better, each developer can have their own development sandbox, with the target configured as their own database or schema. That way, any new code can be safely tested, reviewed, and documented before being merged into the master branch. In other words, as long as the development target points at a separate database or schema, work in progress does not overwrite or change production tables.
With dbt Cloud, you can also schedule jobs, including an option to trigger a job automatically every time a pull request is opened. The setup process is straightforward and can be fully managed by a data analyst.
5. Data Quality
There are several ways you can set up and enforce data quality checks in dbt. The tool allows you to declare data integrity checks (not null, uniqueness, referential integrity, accepted values) alongside the documentation of a given model.
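In dbt these checks are declared in a model's YAML properties file rather than written by hand, but each one boils down to a query that returns the offending rows. The two queries below are a rough sketch of what a not-null check and a uniqueness check amount to; the customers model and customer_id column are assumptions, and the SQL dbt actually generates may differ.

```sql
-- Not-null check: any row returned is a failure.
select *
from {{ ref('customers') }}
where customer_id is null

-- Uniqueness check (a separate query): any value returned appears more than once.
select
    customer_id,
    count(*) as n_records
from {{ ref('customers') }}
where customer_id is not null
group by customer_id
having count(*) > 1
```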
You can also build custom data tests that are driven by business logic. A custom test can check almost anything, as long as the assertion is formulated as a select statement that returns the rows violating it. For example, to test whether all values in the AGE column are positive, the test query selects every row where AGE is zero or negative, and the test passes if no rows come back.
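A minimal sketch of such a test, saved as a SQL file in the project's tests directory, could look like this. The file name and the customers model are assumptions; only the AGE column comes from the example above.

```sql
-- tests/assert_age_is_positive.sql (hypothetical file name)
-- The test fails if this query returns any rows.
select *
from {{ ref('customers') }}  -- assumed model containing the AGE column
where age <= 0
```

Note that rows with a null AGE are not returned by this query, so missing values would need a separate not-null check.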
Lastly, dbt gives you the option to create snapshot tables that track changes to the data over time. This approach comes in handy when dealing with mutable data, since it gives you full visibility into historical changes in the source data.
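A snapshot can be sketched with a Jinja snapshot block like the one below. The source shop.orders (which would need to be declared in a YAML file), the snapshots schema, and the order_id and updated_at columns are all assumptions, and the exact configuration options depend on your dbt version.

```sql
{% snapshot orders_snapshot %}

{{
    config(
        target_schema='snapshots',
        unique_key='order_id',
        strategy='timestamp',
        updated_at='updated_at'
    )
}}

-- Each run compares the current rows with the stored ones, keyed on order_id,
-- and records a new version of any row whose updated_at has changed.
select * from {{ source('shop', 'orders') }}

{% endsnapshot %}
```

With the timestamp strategy, dbt relies on the updated_at column to detect changes, so this sketch only works if the source reliably maintains that column.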
dbt is a highly configurable tool that was built for integration. There are multiple features (e.g., macros) that can enhance the development workflow and tailor it to a project. However, you have the freedom to use all or very few of them, depending on what the project needs. The same is true for the model organizational structure or naming conventions: do what works best for you.
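As an illustration of macros, the sketch below defines a small reusable snippet and then calls it from a model. The macro name, the amount_cents column, and the stg_payments model are hypothetical; the two parts would live in separate files.

```sql
{# macros/cents_to_dollars.sql (hypothetical file name) #}
{% macro cents_to_dollars(column_name, precision=2) %}
    round({{ column_name }} / 100.0, {{ precision }})
{% endmacro %}

{# models/payments.sql (hypothetical file name): the macro call expands to the round(...) expression #}
select
    payment_id,
    {{ cents_to_dollars('amount_cents') }} as amount_usd
from {{ ref('stg_payments') }}
```

Keeping the conversion in one macro means a later change to the rounding rule is made in a single place rather than in every model that uses it.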
7. Resources and dbt community
The documentation is extensive, featuring everything from reference docs to step-by-step installation tutorials and FAQs. There is no steep learning curve for anyone familiar with SQL, and the available resources make getting up to speed even smoother. The setup process itself is straightforward and does not require dedicated IT support.
One of the often-overlooked benefits of this open-source community is that you can access dbt packages, i.e., libraries of models and macros that tackle a specific problem someone else has already solved (for example, models with Salesforce data transformations). Coupled with a dedicated Slack community and a Discourse forum, dbt provides multiple avenues for learning on your own or getting community support when needed.
Overall, dbt can be a great tool for analytic workflows, not only resolving a lot of common headaches for data analysts but also creating a stable platform for cross-functional collaboration and knowledge sharing. It can also truly empower data analysts by giving them a way to apply the best practices of software development.