How can I trigger data quality checks as soon as my data arrives in a data lake?

View in #random on Slack

Gustavo_Puma @Gustavo_Puma: Hi peeps :wave:

At my company we’re trying to improve the data quality of our (Databricks) Delta Lake by introducing monitoring. We’d like to add assertions (as SQL statements) that we could then schedule and alert on (based on the query results). Any of you familiar with such tools that you could recommend:question:

Databricks has its own thing called SQL analytics but unfortunately it doesn’t have any form of git integration

Anna_Geller @Anna_Geller: I can very much relate to your problem! I used to approach it event-based - check this blog post as an example. The idea is that any time a new file or partition arrives in your data lake, data quality checks fire and alert you when data validation fails.

You could combine it with Prefect for enhanced observability - instead of implementing the data quality checks in AWS Lambda, your lambda function could just trigger a Prefect flow as explained here.

Medium: Put a Stop to Data Swamps With Event-Driven Data Testing

Jeremiah @Jeremiah: @Gustavo_Puma it’s a little early to discuss, but I think in the second half of 2022 Prefect will introduce the building blocks to become a first class solution for your problem. I’m not saying you should wait for that, but do keep it in mind :slightly_smiling_face:

Rio_McMahon @Rio_McMahon: We’ve been experimenting internally with Great Expectations (https://greatexpectations.io/expectations/) and evidently (https://github.com/evidentlyai/evidently). I am not sure if this exactly fits your use case but may be worth looking into.

GitHub: GitHub - evidentlyai/evidently: Evaluate and monitor ML models from validation to production. Join our Discord: Evidently AI - ML in Production

Anna_Geller @Anna_Geller: thanks for sharing, Rio!

Gustavo_Puma @Gustavo_Puma: Thanks all for your responses, we actually already use something similar to great expectations for the incoming , streaming data. I’m looking now for something after the data has already been persisted. This is use case driven and would involve joining different tables, that’s why alerting based on SQL queries would fit us nicely

Anna_Geller @Anna_Geller: Gustavo, I’d be curious to clarify the problem even more. When you say you want to run the data tests after the load, why is that important? Is it just some sort of “integration test” to see if after the load your data e.g. matches your expected value distribution?

Or is the underlying problem that you want to run those data quality tests any time new data arrives in the given destination (table, data lake path) regardless of which process (data pipeline from orchestrator, manual load from your dev machine or manual run from dbt run command) loaded that data? Just to confirm whether we have the same understanding of the problem

Gustavo_Puma @Gustavo_Puma: Hi Anna, more like the second case, I want to run these tests on either my entire dataset or a sliding window. Running these assertions as part of our ETL would increase the runtime by a lot and our Analysts would also be interested in defining them, this is why we’d like a post-etl step.

The type of issues are more related towards aggregation: e.g. a null value is acceptable in a certain column but if 80% of the values in a slice of data are null then this could be an issue. Another example is the value of a column in a table should match the one from another table, we already have some cases like this but would like to monitor if this increases