
Can you elaborate more on the "roles" of the "new stack"? To me dbt/dataform and airflow/dagster are quite similar, so why do you need one of each? fivetran/stitch/singer are all new


I've used all of these so I might be able to offer some perspective here

In an ELT/ETL pipeline:

Airflow maps to the "extract" portion of the pipeline (and to orchestration in general): it's great for scheduling tasks and provides the high-level view for understanding state changes and the status of a given system. I'll typically use Airflow to schedule a job that pulls raw data from xyz source(s), does something else with it, then drops it into S3. This can then trigger other tasks/workflows/slack notifications as necessary.
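Roughly, a minimal DAG for that pattern might look like the sketch below - the source, bucket, schedule, and task bodies are placeholders, not a real pipeline:

    # Minimal Airflow DAG sketch: pull raw data, land it in S3, notify downstream.
    # Source, bucket name, and schedule are hypothetical placeholders.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def extract_from_source(**context):
        # call the upstream API / database and stash the raw payload somewhere
        ...

    def load_to_s3(**context):
        # write the raw payload to s3://example-raw-bucket/... (made-up bucket)
        ...

    with DAG(
        dag_id="raw_source_to_s3",
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        extract = PythonOperator(task_id="extract", python_callable=extract_from_source)
        load = PythonOperator(task_id="load_to_s3", python_callable=load_to_s3)

        # Slack notifications or downstream triggers would hang off "load"
        extract >> load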

You can think of dbt as the "transform" part. It really shines with how it enables data teams to write modular, testable, and version-controlled SQL - similar to how a more traditional developer writes code. For example, when modeling a schema in a data warehouse, all of the various source tables, transformation and aggregation logic, as well as materialization methods can live in their own files and be referenced elsewhere through templating. All of the table/view dependencies are handled under the hood by dbt. For my organization, it helped untangle the web of views building views building views and made it simpler to grok exactly what was changing where and how a change might affect something downstream. Airflow could do this too in theory, but since you interface with dbt by writing SQL, it's far more accessible for a wider audience to contribute.
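To make the templating concrete, here's roughly what a pair of models might look like - the model and source names (stg_orders, orders_daily, app_db) are made up for illustration:

    -- models/staging/stg_orders.sql (hypothetical)
    -- staging model that cleans up a raw source table
    select
        order_id,
        customer_id,
        cast(created_at as timestamp) as created_at,
        amount_cents / 100.0 as amount
    from {{ source('app_db', 'orders') }}

    -- models/marts/orders_daily.sql (hypothetical)
    -- dbt resolves {{ ref(...) }}, so it knows to build stg_orders first
    select
        date_trunc('day', created_at) as order_date,
        count(*) as orders,
        sum(amount) as revenue
    from {{ ref('stg_orders') }}
    group by 1

Because the dependencies come from ref()/source(), dbt can build the whole graph of views/tables in the right order and you never hand-maintain that ordering.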

Fivetran/Stitch/Singer can serve as both the "extract" and "load" parts of the equation. Fivetran "does it for you" more or less with their range of connectors for various sources and destinations. Singer simply defines a spec for sources (taps) and destinations (targets) to be used as a standard when writing a pipeline. I think the way Singer drew a line in the sand and defined a way of doing things is pretty cool - however, active development on it really took a hit when the company was acquired. Stitch came up with the Singer spec, and their offered service manages running and scheduling the various taps and targets for you.
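To give a feel for the spec: a tap is just a program that writes SCHEMA and RECORD messages as JSON lines to stdout, and a target reads them from stdin. A toy tap (the "users" stream and its fields are invented) is basically:

    # Toy Singer tap sketch: emits one SCHEMA message and a few RECORD messages.
    # Stream name and fields are made up; a real tap would pull from an API or DB.
    import json
    import sys
    from datetime import datetime, timezone

    def write_message(msg):
        sys.stdout.write(json.dumps(msg) + "\n")

    write_message({
        "type": "SCHEMA",
        "stream": "users",
        "key_properties": ["id"],
        "schema": {
            "type": "object",
            "properties": {
                "id": {"type": "integer"},
                "email": {"type": "string"},
            },
        },
    })

    for row in [{"id": 1, "email": "a@example.com"}, {"id": 2, "email": "b@example.com"}]:
        write_message({
            "type": "RECORD",
            "stream": "users",
            "record": row,
            "time_extracted": datetime.now(timezone.utc).isoformat(),
        })

Piping a tap's stdout into a target's stdin is the whole integration contract, which is why taps and targets written by different people compose.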


Airflow allows for more complex transformations of data that SQL may not be suited for. DBT is largely stuck utilizing the SQL capabilities of the warehouse it sits on, so for instance, with Redshift you have a really bad time working with JSON-based data in DBT; Airflow can solve this problem. That's one example, but the last time I was working with it we found DBT was great for analytical-modeling-type transformations, while for getting whatever munged-up data into a usable format in the first place, Airflow was king.
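For a concrete (made-up) example of the kind of pre-warehouse munging I mean: something like this would run as an Airflow task to flatten nested JSON events into flat rows before they're ever loaded into Redshift, because doing the same thing in Redshift SQL (and therefore in DBT) is painful. Field names and file paths here are hypothetical:

    # Flatten nested JSON events into a flat CSV before loading into Redshift.
    # Field names and file paths are placeholders.
    import csv
    import json

    def flatten_events(in_path, out_path):
        with open(in_path) as f_in, open(out_path, "w", newline="") as f_out:
            writer = csv.DictWriter(
                f_out, fieldnames=["event_id", "user_id", "plan", "country"]
            )
            writer.writeheader()
            for line in f_in:
                event = json.loads(line)
                writer.writerow({
                    "event_id": event["id"],
                    "user_id": event["user"]["id"],
                    "plan": event["user"].get("plan"),
                    "country": event.get("geo", {}).get("country"),
                })

    # In Airflow this would be a PythonOperator task upstream of the Redshift COPY.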

We also trained our analysts to write the more analytical DBT transformations, which was nice - it shifted that work onto them.

Don't get me wrong though, you can get really far with just DBT + Fivetran - in fact, it removes like 80% of the really tedious but trivial ETL work. Airflow is just there for the last 20%.

(Plus you can then utilize airflow as a general job scheduler)



