Best way to recover long-running task runs

In our use case, we sometimes hit a situation where one of many expensive tasks in a flow fails while the others are still running. It would be great if we could retry that specific task run without disturbing the task runs still in progress. I'd appreciate any ideas or approaches people have tried to make this happen.

We considered retries, but let’s say we configure 2 retries and the task still fails because of a data issue. With a recovery option, we could fix the data issue while letting the other task runs continue, then re-run the failed task and its downstream tasks once we are ready.

I would suggest configuring result persistence. In the case you mentioned, where a data issue must be resolved before the retry, you could fix the data after the failure occurs and then manually kick off a retry of that flow. All tasks that finished in a succeeded state would simply load their persisted results from the completed state instead of running again, so only the tasks that did not succeed do real work on the retry.
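
To make the idea concrete, here is a minimal, library-agnostic sketch of what result persistence buys you on a retry. It is not the orchestrator's actual API: `ResultStore`, `run_task`, `run_flow`, and `expensive_task` are hypothetical names made up for illustration, and the "data issue" is simulated with a `None` input. It also runs the tasks one after another rather than concurrently, so it only demonstrates the skip-on-retry behaviour.

```python
import json
import tempfile
from pathlib import Path


class ResultStore:
    """Hypothetical on-disk store: one JSON file per succeeded task."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, key):
        return self.root / f"{key}.json"

    def has(self, key):
        return self._path(key).exists()

    def load(self, key):
        return json.loads(self._path(key).read_text())

    def save(self, key, value):
        self._path(key).write_text(json.dumps(value))


def expensive_task(payload):
    """Stand-in for a long-running task; fails on bad input data."""
    if payload is None:
        raise ValueError("bad input data")
    return payload * 2


def run_task(store, key, fn, *args):
    """Return the persisted result if the task already succeeded, else run it."""
    if store.has(key):
        return store.load(key), "cached"
    value = fn(*args)  # may raise; nothing is persisted on failure
    store.save(key, value)
    return value, "executed"


def run_flow(store, data):
    """Run every task and record failures instead of aborting the siblings."""
    statuses = {}
    for key, payload in data.items():
        try:
            _, statuses[key] = run_task(store, key, expensive_task, payload)
        except ValueError:
            statuses[key] = "failed"
    return statuses


store = ResultStore(tempfile.mkdtemp())
data = {"a": 1, "b": None, "c": 3}

first = run_flow(store, data)   # "b" fails, "a" and "c" are persisted
data["b"] = 2                   # fix the data issue
second = run_flow(store, data)  # only "b" actually runs again
```

On the second run, `a` and `c` are served from the store and only `b` is executed. With a real orchestrator you would not write this yourself; assuming Prefect 2.x, for example, the corresponding switch is the `persist_result=True` option on tasks and flows, but check the documentation for the version you run.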