DBT is a tool that simplifies building the tables in your data warehouse and the relationships between them. With DBT, you specify the queries used to create your tables and parameterize portions of those queries. You can also add data tests.
Introduction: Getting to Know Data Build Tool (dbt)
Ever wished for a smoother way to handle your data without the headaches of complex ETL processes? Enter Data Build Tool, or as we fondly call it, DBT. It makes data transformations and modeling feel like a breeze.
DBT takes a refreshing approach to data pipelines, letting you express your data transformations in good old SQL. No need for fancy jargon or convoluted workflows—just simple, straightforward SQL magic.
In this article, we’ll take a relaxed stroll through the world of DBT, from setting it up to some cool tricks it can do. So grab your favorite drink, kick back, and let’s dive into the world of DBT together!
Installation and Project Bootstrapping
To get started with dbt, the recommended method for installation is to use pip.
pip install dbt-{your adapter}
I am using BigQuery, but there are adapters for all the major database engines, including Snowflake, Postgres, Cassandra, MySQL, SQL Server, SQLite, Oracle, Athena, Redshift and many more.
So for me, the install was done with:
pip install dbt-bigquery
This may take a while because it has to install some Google Cloud dependencies, such as grpc. You can then confirm your DBT installed correctly with:
dbt --version
Now you will be ready to start your own DBT project. I have been keeping a GCP console open to BigQuery to verify my changes.
First navigate to the parent folder of where you want to store your DBT project, and then run:
dbt init
This will start a configuration wizard which has fairly reasonable defaults. When the configuration is complete, cd to your new folder and you will be ready to start working with DBT.
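At this point your project folder will have a standard layout, roughly like the sketch below (the exact contents vary a little between dbt versions, and my_proj is whatever name you gave in the wizard):
my_proj/
    dbt_project.yml
    models/
    seeds/
    macros/
    tests/
    analyses/
    snapshots/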
DBT Profiles
DBT init will have created a file named dbt_project.yml in your project root, and a connection profile in ~/.dbt/profiles.yml. Take a look! The profile holds the connection info you entered in the setup wizard, and it's where you configure connections to your data warehouse. We can point our profile at the public dbt-tutorial dataset in GCP with an entry like this in profiles.yml:
jaffle_shop:
  target: dev
  outputs:
    dev:
      type: bigquery
      method: oauth
      project: dbt-tutorial
      dataset: jaffle_shop
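To check that DBT can actually reach your warehouse with this profile, you can run the built-in connection check:
dbt debug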
Your dbt_project.yml is also the place where you configure your materialization strategy. The default is view. You can configure this by adding a models block:
models:
  my_dataset:
    # Config indicated by + and applies to all files
    +materialized: table
You can override this materialization config on a per-model basis in your model.
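For example, to materialize a single model as a table regardless of the project-wide default, you can put a config block at the top of that model's SQL file. A minimal sketch (the trivial select is just a placeholder):
{{ config(materialized='table') }}

select 1 as id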
Database Seeds
DBT supports database seeds. These are described as:
Seeds are CSV files in your dbt project (typically in your seeds directory), that dbt can load into your data warehouse using the dbt seed command. Seeds are best suited to static data which changes infrequently.
Seeds are not meant as a way of loading database dumps, but it does seem like it would be pretty easy to abuse this.
Here are some seeds I will use for my demo.
seeds/movies.csv:
id,title,year
1,Speed,1994
2,Avatar,2008
3,The Matrix,1998
4,Paul,2012
5,Terminator 2,1992
seeds/actors.csv:
id,name
1,Sandra Bullock
2,Keanu Reeves
3,Mickey Rooney
4,Sigourney Weaver
5,Nick Frost
6,Linda Hamilton
seeds/stg_actor_movies.csv:
id,movie_id,actor_id
1,1,1
2,1,2
3,2,4
4,3,2
5,4,4
6,5,6
Note that this is closer to “loading a dump” than proper seed loading, but this is for a demo, so that is fine.
If you now run:
dbt seed
and then look in your database, you should see these three tables.
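By default, DBT infers column types from the CSV contents. If you want to pin a type (for example, forcing year to an integer), you can add a seed config to dbt_project.yml. A minimal sketch, assuming the project is named my_proj:
seeds:
  my_proj:
    movies:
      +column_types:
        year: int64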
Models
The models/ folder is where you will put your model files. You can put a file in here such as:
flowers.sql:
select 1 as id, 'flea bane' as plant
union all
select 2, 'showy tick trefoil'
After a DBT run, these will appear in your database. It would be more common to do something like this:
stg_customers.sql:
select
    id as customer_id,
    first_name,
    last_name
from {{ ref('jaffle_shop', 'customers') }}
Note the use of the ref function here. DBT will resolve this reference to the table you set up in your profile; note how the first parameter jaffle_shop matches the profile name, and customers is the name of the table in that public dataset.
Macros
In the macros/ folder, you can define Jinja macros that will be accessible from your models. For example, you could make a good customers macro called good_customer.sql:
{% macro good_customer(number_of_orders) %}
CASE when {{ number_of_orders }} >= 3 then
true
else
false
end
{% endmacro %}
This is needlessly verbose but serves to illustrate how to use a macro. Assume then that we use this macro in a model called customers.sql:
with customers as (
    select * from {{ ref('stg_customers') }}
),
orders as (
    select * from {{ ref('stg_orders') }}
),
customer_orders as (
    select
        customer_id,
        min(order_date) as first_order_date,
        max(order_date) as most_recent_order_date,
        count(order_id) as number_of_orders
    from orders
    group by 1
),
final as (
    select
        customers.customer_id,
        customers.first_name,
        customers.last_name,
        customer_orders.first_order_date,
        customer_orders.most_recent_order_date,
        coalesce(customer_orders.number_of_orders, 0) as number_of_orders,
        {{ good_customer('customer_orders.number_of_orders') }} as is_good_customer
    from customers
    left join customer_orders using (customer_id)
)

select * from final
Here you see the reference to the good_customer macro. You can see the generated SQL by looking in the target folder. So for this model, you could run cat target/compiled/my_proj/models/customers.sql, where you would see something like this:
with customers as (
    select * from `my_proj`.`my_dataset`.`stg_customers`
),
orders as (
    select * from `my_proj`.`my_dataset`.`stg_orders`
),
customer_orders as (
    select
        customer_id,
        min(order_date) as first_order_date,
        max(order_date) as most_recent_order_date,
        count(order_id) as number_of_orders
    from orders
    group by 1
),
final as (
    select
        customers.customer_id,
        customers.first_name,
        customers.last_name,
        customer_orders.first_order_date,
        customer_orders.most_recent_order_date,
        coalesce(customer_orders.number_of_orders, 0) as number_of_orders,
        CASE when customer_orders.number_of_orders >= 3 then
            true
        else
            false
        end as is_good_customer
    from customers
    left join customer_orders using (customer_id)
)

select * from final
Note how the case statement has been inlined from the macro.
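If you just want to rebuild this one model while iterating, rather than the whole project, you can select it by name:
dbt run --select customers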
Schemas
In the models folder, there is a file called schema.yml that will contain the schemas of all of your tables in YAML format. This is not auto-generated. You should be going in here and populating any fields you create. So for my customers table, I have:
models:
  - name: customers
    description: One record per customer
    columns:
      - name: customer_id
        description: Primary key
        tests:
          - unique
          - not_null
      - name: first_order_date
        description: NULL when a customer has not yet placed an order.
Note the presence of the tests key. This is where you will define your data tests, including the ones I’ve shown here for column constraints, but you can do more interesting stuff as well.
A DBT test passes when the query it represents returns zero rows. So imagine I made a macro called test_does_not_contain.sql like this:
{% macro test_does_not_contain(model, column_name, unwanted) %}
select {{column_name}}
from {{model}}
where contains_substr({{column_name}}, '{{unwanted}}')
{% endmacro %}
(Note that test macro names must begin with test_.)
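Newer versions of dbt also let you write the same thing as a generic test block, conventionally placed under tests/generic/. A sketch of the equivalent:
{% test does_not_contain(model, column_name, unwanted) %}
select {{ column_name }}
from {{ model }}
where contains_substr({{ column_name }}, '{{ unwanted }}')
{% endtest %}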
I can then use this test for my fields by adding it in the schema, like so:
- name: customer_orders_by_name
  description: a report on how many orders come from customers per first name.
  columns:
    - name: first_name
      description: the 1st name.
      tests:
        - unique
        - not_null
        - does_not_contain:
            unwanted: 😂
    - name: total_orders
      tests:
        - dbt_expectations.expect_column_values_to_be_between:
            min_value: 0
            strictly: false
This will now throw an error if any first names contain the ‘😂’ emoji.
Note the second added test. After installing the dbt_expectations package you can use its tests, and this one does range checking. You can install the package by running dbt deps --add-package dbt_expectations:1.0.0.
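If that flag isn’t available in your dbt version, the more traditional route is to declare the dependency in a packages.yml file at the project root and then run dbt deps. A sketch (check the dbt package hub for the current package name and version; the one here is just an example):
packages:
  - package: calogica/dbt_expectations
    version: 0.10.1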
Data Tests
For larger tests, such as data quality tests, you can add a test in the tests folder, like this:
select
    order_amount_cents
from {{ ref('order_amounts') }}
where order_amount_cents <= 0
As before, DBT expects zero rows to be returned from a test for it to pass. You can configure the number of rows that counts as a failure or a warning, as shown below. You can run all the tests with dbt test. Tests will likely fail if you haven’t first populated their underlying tables. Tests also run when you do dbt build.
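For example, a singular test can declare that a handful of bad rows is only a warning while a larger number is a hard failure, by adding a config block at the top of the test file. A minimal sketch with arbitrary thresholds:
{{ config(warn_if='>10', error_if='>100') }}

select
    order_amount_cents
from {{ ref('order_amounts') }}
where order_amount_cents <= 0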
Analyses
If you want the goodness of re-usable code to generate SQL without necessarily using it to populate a table, you can put SQL in the analyses folder. For example, you could make movies_by_year.sql:
select year, count(*)
from
{{ ref('movies') }}
group by year
Then run:
dbt compile
Then inspect the file:
cat target/compiled/my_proj/analyses/movies_by_year.sql
This will give you something like:
select year, count(*)
from
`my_project`.`my_dataset`.`movies`
group by year
Python models
DBT also supports Python models. You could make something like transformed_customers.py:
def my_transform(name):
    return f"{name} {name[0].lower()}"


def model(dbt, session):
    dbt.config(materialized="table")
    df = dbt.ref("customers")
    pdf = df.to_pandas()
    pdf['name2'] = pdf['first_name'].apply(my_transform)
    return pdf
You don’t have to use pandas here; in fact, it is probably quite often a bad idea. df will be a Spark DataFrame. This will apply your custom transformation logic to each row and save the result in the new table. Note that the Python logic will be run through whatever flavour of Spark is convenient for your platform, so since I’m using BigQuery, that means Dataproc.
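As a rough sketch of the pandas-free version, assuming the usual PySpark API available on Dataproc and the column names from the earlier customers model:
from pyspark.sql import functions as F


def model(dbt, session):
    dbt.config(materialized="table")
    df = dbt.ref("customers")
    # Same derived column as before, computed with Spark expressions
    # instead of converting the whole table to pandas.
    return df.withColumn(
        "name2",
        F.concat(F.col("first_name"), F.lit(" "), F.lower(F.substring("first_name", 1, 1))),
    )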
Conclusion
And there you have it: your crash course in Data Build Tool (DBT). It’s like the Swiss Army knife of data, simplifying your life and making building and managing data projects a whole lot easier.
As you venture into the world of DBT, remember that you’re equipped with a powerful tool that’s trusted by data professionals worldwide. Whether you’re a seasoned analyst or a newcomer to the data scene, DBT offers a friendly and intuitive platform to work with.
So take a deep breath, relax, and dive into the wonderful world of DBT. Experiment, explore, and have fun with it! After all, data doesn’t have to be daunting; it can be downright delightful with DBT by your side.
Here’s to embracing data adventures with a smile—dbt style!