View in #prefect-community on Slack
@Andrew_Black: Re:dbt features — Hi everyone, I look after Prefect’s partnerships and we’re collaborating with dbt on what we should enable together. What would you like to see?
@Noah_Holm: Hey Andrew, would you say this is mostly about (verified) Prefect tasks, or could the partnership entail other things? I’m not hinting at any particular idea, just interested to hear what you think the partnership entails, since most of what I’ve understood and seen so far are Prefect tasks.
@Jared_Noynaert: Two functional gaps today when using Prefect + dbt. No promises about how relevant they are to the typical Prefect+dbt user, though:
- Separated lineage
a. The “run” node is just a single task with a lot of logs; you end up heading to your dbt docs to figure out the impact of a failure, or to dbt Cloud/your logs/your hacky custom viz to figure out execution-time bottlenecks
b. We intentionally push things into Prefect that don’t have to live there so that it can serve as an observability layer, yet we have poor-ish visibility into dbt from Prefect
c. Dagster’s new support for this in 0.14 looks really appealing so there is a bit of a competitive aspect here (their approach to data-aware tasks is much more general, but dbt is going to be the flagship example for that feature for a long time)
- Inserting tasks into the dbt DAG
An example of a feature that could help with separated lineage would be importing dbt run details as a subgraph during or after a run. Another option might be improved post-run artifacts; as a truly minimal example, just having a Graphviz rendering of the dbt DAG with timings and node status might be helpful (or a snapshot/summary + clickthrough to dbt Cloud, once they support better visualization and reporting around this).
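The Graphviz idea can be sketched from dbt’s manifest, which records each node’s parents; this is a rough, illustrative sketch (in a real version, timings and statuses would be joined in from run_results.json):

```python
import json  # in practice: manifest = json.load(open("target/manifest.json"))

def manifest_to_dot(manifest):
    """Turn dbt's manifest node graph into a Graphviz DOT string (sketch only)."""
    lines = ["digraph dbt {"]
    for name, node in manifest["nodes"].items():
        # Each node lists its upstream dependencies under depends_on.nodes
        for parent in node.get("depends_on", {}).get("nodes", []):
            lines.append(f'  "{parent}" -> "{name}";')
    lines.append("}")
    return "\n".join(lines)
```

The resulting DOT text could be rendered and attached as a post-run artifact.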
For the second one, I’m not sure what a good solution looks like, but the basic problem is when people want to do this:
- Run part of your dbt DAG
- Execute some code that modifies assets in your DWH (e.g. run a scoring model and upload the results, execute something in Snowpark)
- Run the downstream parts of your DAG
You can handle that today by using node selectors in different task steps, but it feels clunky. I think this use case will get more common with Snowpark and serverless Spark on BigQuery, and while the endgame for those specific scenarios is dbt supporting things like Snowpark, that could be a while. You might conclude that (upcoming) Python UDF support and external functions don’t leave much of a sweet spot to focus on here, but it’s worth investigating.
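The selector-splitting workaround Jared describes might look roughly like this; a sketch only, where the model names, the script name, and the selector split are made up, and each string would be handed to a separate shell task in the flow:

```python
def dbt_run(select, exclude=None):
    """Build a `dbt run` command for one slice of the DAG (sketch)."""
    cmd = f"dbt run --select {select}"
    if exclude:
        cmd += f" --exclude {exclude}"
    return cmd

pipeline = [
    dbt_run("+scoring_input", exclude="scoring+"),  # 1. upstream dbt models
    "python run_scoring_model.py",                  # 2. non-dbt step in the DWH
    dbt_run("scoring+"),                            # 3. downstream dbt models
]
```

It works, but the DAG split lives in hand-maintained selector strings, which is the clunkiness being pointed at.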
@Noah_Holm: We don’t use dbt Cloud, so we use the ShellTask, and I think it’s great in that it gives a lot of flexibility with the “give whatever shell command to run” approach. I guess there could be a case for having several shell tasks each specialised in a single dbt thing (run, test, snapshot, etc.), but I don’t really see the point. We rename the tasks responsible for run or test to give more clarity. So I don’t have many thoughts regarding new tasks or improvements to the task library (hence my earlier question).
My unicorn wish would be along the lines of Jared’s: being able to inspect the dbt graph in the Prefect UI schematic, with runtimes and any failures, would be more than fantastic. I also find myself going to the dbt docs page to start understanding the impact of a failed model.
I could also see benefits to using Artifacts to give a better overview of the run report than you get from the logs. In the logs you lose the coloured text, and some of the formatting is off. Having an artifact page that resembles a clear run report, like the one inside dbt Cloud, would be nice.
@Matthias: I tend to agree with @Noah_Holm! The shell task is good as-is and if you want more flexibility in changing e.g. model selectors, you can create a wrapper task that does just that based on an optional input (string or dict).
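The wrapper idea Matthias mentions can be sketched as a small function that appends a `--select` clause only when a selector is supplied; the defaults and names here are illustrative, not Prefect API:

```python
def build_dbt_command(base="dbt run", select=None):
    """Build a dbt CLI command, adding --select only when a selector is given (sketch)."""
    if select is None:
        return base
    if isinstance(select, (list, tuple)):
        select = " ".join(select)  # allow passing several selectors at once
    return f"{base} --select {select}"
```

A wrapper task would call this and hand the result to the shell task, so flows can override the selector per run without redefining the task.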
The biggest gain would be to focus on functionality to enhance observability (viz of dbt DAG and using the generated Artifacts to make it easier to inspect/troubleshoot flows)
@Andrew_Black: This is wonderful feedback thank you!
@Matthias: Someone on this channel asked if dbt-ol was supported by the dbt shell task. After further exploration of OpenLineage (I had never heard of it), I figured out the command “just” optionally parses the manifest to generate the necessary events. Perhaps it’s worth exploring whether we can create an additional task to do this? Anyway, I thought this could give some inspiration…
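Mechanically, using the wrapper mostly means swapping the `dbt` binary for `dbt-ol` and pointing it at a lineage backend via environment variables. A hedged sketch (the backend URL and namespace are placeholders, and this only builds the invocation rather than running it):

```python
import os
import shlex

def openlineage_dbt(dbt_args, url, namespace):
    """Return the env and argv for running dbt through the dbt-ol wrapper (sketch)."""
    env = {
        **os.environ,
        "OPENLINEAGE_URL": url,            # OpenLineage-compatible backend
        "OPENLINEAGE_NAMESPACE": namespace,
    }
    return env, ["dbt-ol"] + shlex.split(dbt_args)
```

The env/argv pair could then be passed to `subprocess.run` or an existing shell task.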
GitHub: GitHub - OpenLineage/OpenLineage: An Open Standard for lineage metadata collection
@Andrew_Black: Thank you @Matthias. Are you thinking a Prefect task to do this, or a way for Prefect and dbt to work together?
@Noah_Holm I’m sorry for the slow reply. To us, a task is basically a technical integration, while a partnership entails working together to make our products better, usually with some marketing activities around that. dbt is a great example, where we are coordinating to make our products as powerful and easy as possible for our joint end users.
@Matthias @Jared_Noynaert - Such great feedback. Thank you for outlining it so extensively; I’m making sure our teams see all of it.
It sounds like the summary is: observability into the sub-steps of a dbt run, plus the ability to isolate different parts of a dbt run so they can be interleaved with Prefect tasks outside of dbt?
@Matthias: @Andrew_Black as mentioned before, I had never heard of OpenLineage before, so I have no idea about its adoption… I mainly pointed to the project for inspiration.
@Andrew_Black: Got it, excellent!
We’ve definitely explored it and are considering it. Thanks for the comparison, though.
@Matthias: @Andrew_Black I’ve been doing some digging this weekend on how Dagster is approaching this. It’s nicely explained in this blog post. I was thinking: would it be possible to integrate dbt info as a Prefect artifact? It would be nice to have e.g. the lineage graph available there.
Rebundling the Data Platform | Dagster Blog