Task failures when cached results are missing

Overall the caching features in Prefect 2 are very welcome; however, I’m struggling to understand the design decision behind a task failing if the cached results are no longer available. In the typical caching scenarios I’ve dealt with, a cache miss, for whatever reason, simply falls back to the underlying computation. In Prefect, however, if I set up caching for a task and the cached results are no longer available, the task just fails.

My setup is that any cached results end up in an S3 bucket with a short lifecycle (2 days), so cached results are automatically deleted. While I do explicitly set a cache_expiration lower than the bucket lifecycle, there’s still a very real possibility that someone on the team misconfigures the cache expiration (or some other parameter for the cache). Without any way to invalidate the cache aside from updating the task (or deleting and redeploying the flow?), this makes me somewhat uneasy about using the cache.
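For reference, a minimal sketch of the kind of task I have in mind, assuming a recent Prefect 2 version where tasks accept `persist_result`/`result_storage`; the block slug, names, and values are purely illustrative:

```python
from datetime import timedelta

from prefect import flow, task
from prefect.tasks import task_input_hash


# cache_expiration is deliberately shorter than the bucket's 2-day lifecycle,
# so the engine should expire the key before S3 deletes the persisted result
@task(
    cache_key_fn=task_input_hash,
    cache_expiration=timedelta(days=1),
    persist_result=True,
    result_storage="s3/cached-results",  # hypothetical S3 storage block slug
)
def expensive_computation(dataset: str):
    # stand-in for the real expensive work
    return {"dataset": dataset, "rows": 42}


@flow
def daily_flow():
    expensive_computation("daily-report")
```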

It would be great to at least have the option of re-running the task when there is a cache miss. To me, a very expensive computation is still better than not being able to complete the work at all, especially when the cached results could be completely unrecoverable.

  1. why would those cached results no longer be available? are you manually deleting those?
  2. did you try using the cache key expiration feature to automatically expire the cache when no longer needed? you shouldn’t have to manually do anything with the cache, it’s for internal use by the engine; no need to do it manually via S3 lifecycle policy
  3. what else would you expect if not a failure if the cache is forcibly removed? failing the task seems plausible

it looks like you could try a shorter expiration on the cache in that case

> why would those cached results no longer be available? are you manually deleting those?

Yes (and no). The S3 bucket has a purposefully short lifecycle that automatically deletes any objects older than 2 days. There’s no need to keep cached results around indefinitely.

> did you try using the cache key expiration feature to automatically expire the cache when no longer needed? you shouldn’t have to manually do anything with the cache, it’s for internal use by the engine; no need to do it manually via S3 lifecycle policy

Yes. I do have a cache expiration on the task set shorter than the lifecycle on the bucket, and everything works great. My concern is that someone might misconfigure the task in some way, say by setting too long an expiration (or none at all); then we’re forced to apply a hotfix to the code instead of taking the hit of performing the expensive computation. In a perfect world this would never happen, but mistakes will no doubt happen.

> what else would you expect if not a failure if the cache is forcibly removed? failing the task seems plausible

Failing on a cache miss is obviously a conscious design decision, but I’m still trying to get at the heart of why that decision was made. In a typical Cache-Aside pattern (which this looks to be a variant of), a cache miss falls back to the original data source.
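For reference, here is the fallback behaviour I mean, as a minimal dict-backed cache-aside sketch (names are purely illustrative):

```python
from typing import Any, Callable, Dict, Hashable


def cache_aside(cache: Dict[Hashable, Any], key: Hashable, compute: Callable[[], Any]) -> Any:
    # classic cache-aside: a hit returns the cached value; a miss (or an
    # evicted/expired entry) falls back to the underlying computation and
    # repopulates the cache -- it never fails just because the entry is gone
    try:
        return cache[key]
    except KeyError:
        value = compute()
        cache[key] = value
        return value
```

The point being that losing a cache entry only costs a recomputation, not a failed run.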

There are other scenarios where being more lenient about clearing the cache would be very helpful. Say we have a task that queries a MySQL database, and for the sake of argument the query is very expensive, so we cache the results. Then we find the data in one of the tables is wrong and the flow has to be re-run. Since the query has no dynamic parameters, we’re using a static cache key. With no way to invalidate the cache, it seems like the only way to get the task to re-run involves something along the lines of a new deployment.
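To make that concrete, the static key I have in mind looks roughly like this (the key string and task body are made up):

```python
from prefect import task


# the query takes no dynamic parameters, so every run maps to the same cache key
@task(cache_key_fn=lambda context, parameters: "nightly-mysql-rollup")
def expensive_query():
    # stand-in for the actual expensive MySQL query
    ...
```

Once the underlying table is fixed, that key still resolves to the stale cached result, and there’s no knob to drop it.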

The above could also be solved by having some way to trigger a flow run through the UI with caching disabled, but what’s the argument against it also being solvable by simply dropping the cache?

perhaps it’s worth addressing this at the process level in your team when reviewing pull requests? you could enforce that no cache expiration longer than X is possible, e.g. by building tests to validate that behavior and ensuring that a PR with a flow that doesn’t pass the test won’t get merged
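a rough sketch of what such a test could look like, assuming your tasks are defined at module level and that the `Task` object keeps its `cache_expiration` attribute (the module path and the limit are made up):

```python
from datetime import timedelta

from prefect.tasks import Task

import my_project.flows as flows  # hypothetical module holding your tasks and flows

MAX_CACHE_EXPIRATION = timedelta(days=1)  # must stay below the 2-day S3 lifecycle


def test_cache_expiration_is_bounded():
    # collect every Task object defined at module level
    tasks = [obj for obj in vars(flows).values() if isinstance(obj, Task)]
    for t in tasks:
        if t.cache_key_fn is None:
            continue  # task doesn't use caching
        assert t.cache_expiration is not None, f"{t.name} caches without an expiration"
        assert t.cache_expiration <= MAX_CACHE_EXPIRATION, f"{t.name} caches for too long"
```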

we want to implement a feature to make it possible to manually invalidate a given cache key; it’s a bit complicated and, to be transparent, not high on the roadmap atm, but it’s on our radar for sure :+1: