Overall the caching features in Prefect 2 are very welcome; however, I’m struggling to understand the design decision behind a task failing when its cached results are no longer available. In the typical caching scenarios I’ve dealt with, a cache miss, for whatever reason, simply falls back to the underlying computation. In Prefect, however, if I set up caching for a task and the cached results are no longer available, the task just fails.
My setup is that cached results end up in an S3 bucket with a short lifecycle (2 days), so cached results are automatically deleted. While I do explicitly set a cache_expiration lower than the bucket lifecycle, there’s still a very real possibility that someone on the team misconfigures the cache timeout (or some other cache parameter). Without any way to invalidate the cache aside from updating the task (or deleting and redeploying the flow?), this makes me somewhat uneasy about using the cache.
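For concreteness, the setup looks roughly like this (a sketch; the block name and task body are hypothetical, and the 2-day lifecycle rule lives on the bucket itself):

```python
from datetime import timedelta

from prefect import task
from prefect.filesystems import S3
from prefect.tasks import task_input_hash

# Hypothetical block name; the bucket behind it has a 2-day lifecycle rule.
s3_storage = S3.load("results-bucket")

@task(
    cache_key_fn=task_input_hash,
    cache_expiration=timedelta(days=1),  # deliberately shorter than the bucket lifecycle
    persist_result=True,
    result_storage=s3_storage,
)
def expensive_computation(x: int) -> int:
    ...  # the expensive work
```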
It would be great to at least have the option of running the task on a cache miss. To me, a very expensive computation is still better than not being able to complete the work at all, especially when the cached results may be completely unrecoverable.
why would those cached results no longer be available? are you manually deleting those?
did you try using the cache key expiration feature to automatically expire the cache when it’s no longer needed? you shouldn’t have to manage the cache manually; it’s for internal use by the engine, so there’s no need to handle it via an S3 lifecycle policy
what else would you expect, if not a failure, when the cache is forcibly removed? failing the task seems plausible
it looks like you can try a shorter expiration on the cache in such a case
why would those cached results no longer be available? are you manually deleting those?
Yes (and no). The S3 bucket has a purposefully short lifecycle that automatically deletes any objects older than 2 days. There’s no need to keep cached results around indefinitely.
did you try using the cache key expiration feature to automatically expire the cache when it’s no longer needed? you shouldn’t have to manage the cache manually; it’s for internal use by the engine, so there’s no need to handle it via an S3 lifecycle policy
Yes. I do have a cache expiration set on the task that is shorter than the lifecycle on the bucket, and everything works great. My concern is what happens if someone misconfigures the task in some way, say by setting too long an expiration (or none at all); then we’re forced to apply a hotfix to the code instead of just taking the hit and performing the expensive computation. In a perfect world this would never happen, but mistakes will no doubt happen.
what else would you expect, if not a failure, when the cache is forcibly removed? failing the task seems plausible
Failing on a cache miss is obviously a conscious design decision, but I’m still trying to get at the heart of why that decision was made. In a typical Cache-Aside pattern (which this looks to be a variant of), a cache miss falls back to the original data source.
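For contrast, this is the cache-aside behaviour I’d expect, as a minimal sketch in plain Python:

```python
def cache_aside_get(key, cache: dict, compute):
    """A cache miss falls back to the underlying computation."""
    try:
        return cache[key]     # hit: return the cached value
    except KeyError:
        value = compute(key)  # miss: fall back to the original source
        cache[key] = value    # repopulate the cache for next time
        return value
```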
There are other scenarios where being more lenient with clearing the cache would be very helpful. Maybe we have a task that queries a MySQL database. For the sake of argument the query is very expensive, so that’s why we cache the results. We find the data in one of the tables is wrong and the flow has to be re-run. Also for the sake of argument we’re using a static cache key since the query has no dynamic parameters. Since we have no way to invalidate the cache, it seems like the only way to get the task to re-run involves something along the lines of a deployment.
The above could also be solved by having some way to trigger a flow run through the UI with caching disabled, but what’s the argument against also making it solvable by simply dropping the cache?
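To make the static-cache-key scenario above concrete, a sketch (query_mysql and the key string are hypothetical):

```python
from prefect import task

def query_mysql(sql: str):
    """Stand-in for the expensive query described above."""
    ...

# The key is a constant, so once a result is cached this task will not
# re-run until the key string itself is changed and redeployed.
@task(cache_key_fn=lambda context, parameters: "report-query-v1")
def run_report_query():
    return query_mysql("SELECT ... FROM reports")
```

Bumping the key to "report-query-v2" would force a recompute, but that is exactly the code-change-plus-deployment round trip described above.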
perhaps it’s worth enforcing that at the process level in your team when reviewing pull requests? you could ensure that no cache expiration longer than X is possible, e.g. by building tests to validate that behavior, so that a PR with a flow that doesn’t pass the test won’t get merged
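for example, a test along these lines (a sketch; the import path is hypothetical, but Prefect 2 Task objects do expose cache_expiration as an attribute):

```python
from datetime import timedelta

from my_project.flows import expensive_computation  # hypothetical import

MAX_CACHE_EXPIRATION = timedelta(days=1)

def test_cache_expiration_is_bounded():
    # Prefect Task objects keep their decorator arguments as attributes,
    # so the test can assert on cache_expiration directly.
    assert expensive_computation.cache_expiration is not None, "cache must expire"
    assert expensive_computation.cache_expiration <= MAX_CACHE_EXPIRATION
```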
we want to implement a feature to make it possible to manually invalidate a given cache key; it’s a bit complicated and, to be transparent, not high on the roadmap atm, but it’s on our radar for sure
To add to this, I’ve had cases in large flow runs with mapped tasks across 500+ keys where a task fails for various reasons. Since caching is on, you’d expect to be able to just re-run the flow and have it skip the completed tasks and re-compute the failed/incomplete ones. However, some tasks will write their cache information to the Prefect database but not the actual cache file to the cache storage location, so on the re-run Prefect expects the file to exist and fails because it cannot find it (no cache files were manually deleted). Even with retries, those tasks fail straight away on each retry due to the missing cache file, and the flow can never complete. The only way out is refresh_cache=True on the entire mapped task, which essentially wastes the working cache files of the other 99% of mapped tasks that completed and cached successfully. Effectively, this makes caching useless for completing a large flow run that is prone to failure.
It would make a lot of sense to simply (and optionally) recompute the task when the database expects a cache file that no longer exists in cache storage.
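In the meantime, that behaviour can be approximated by hand inside the flow (a sketch, assuming a Prefect version where refresh_cache is accepted by with_options):

```python
from datetime import timedelta

from prefect import flow, task
from prefect.tasks import task_input_hash

@task(cache_key_fn=task_input_hash, cache_expiration=timedelta(days=2), persist_result=True)
def step(x: int) -> int:
    return x * 2  # stand-in for the real work

@flow
def resilient_flow(items: list[int]) -> list[int]:
    results = []
    for x in items:
        state = step(x, return_state=True)  # don't raise on failure
        if state.is_failed():
            # Fall back to recomputing just this item with a fresh cache,
            # instead of refresh_cache=True on the entire mapped task.
            results.append(step.with_options(refresh_cache=True)(x))
        else:
            results.append(state.result())
    return results
```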
I’m running the Prefect agent in a Docker container on infrastructure that is created on demand, so I just want to persist results in order to save memory (via cache_result_in_memory), support retries, and get some added protection from a short-lived cache on local storage when flow runs are triggered frequently.
Remote storage is more trouble than it’s worth in our case, so this is just about minimizing negative engineering and taking advantage of a fast local cache on a best-effort basis.
Just the peace of mind that nothing will break if the cache gets deleted would be very nice; otherwise I’m leaning towards just provisioning more resources and disabling the cache.
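For reference, the local-only setup I’m describing looks roughly like this (a sketch; the basepath and numbers are placeholders):

```python
from datetime import timedelta

from prefect import task
from prefect.filesystems import LocalFileSystem
from prefect.tasks import task_input_hash

# Short-lived, best-effort cache on the container's local disk.
local_cache = LocalFileSystem(basepath="/tmp/prefect-cache")

@task(
    cache_key_fn=task_input_hash,
    cache_expiration=timedelta(hours=1),
    persist_result=True,
    result_storage=local_cache,
    cache_result_in_memory=False,  # persist instead of holding results in memory
    retries=2,
)
def heavy_step(x: int) -> int:
    ...  # the expensive work
```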