Is there a good way to design flows for restart after failure?

I am currently working on a flow of flows, with millions of tasks and thousands of subflows.
It is a step-by-step iterative scraping workflow whose scope grows at each step (starting from 100 resources to scrape and growing to millions by the last step).
I follow the Prefect precept of atomic tasks, so each scraping task corresponds to a single request (one resource to fetch).
I will skip the details, except that I implemented the subflows as batches of 1,000 requests, executed as a mapping of subflow runs. The task immediately following such a mapped scraping subflow run has its trigger set to all_finished, so that a single batch failure does not stop the whole parent flow.
I persist the task results at the parent-flow level in order to be able to restart the flow for failed tasks.
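The batching scheme described above can be sketched in plain Python; the batch size and the use of `range` as a stand-in for a resource list are illustrative only:

```python
from itertools import islice

def batched(items, batch_size=1000):
    """Split an iterable of resources into fixed-size batches.

    In the setup described above, each batch would become one
    subflow run of ~1,000 scraping requests.
    """
    it = iter(items)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

# Illustrative: 2,500 resources split into batches of 1000, 1000, and 500
batches = list(batched(range(2500)))
```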

This works well; however, the flow fails when it runs out of storage for new task results (40 GB filled).

My questions are the following:

  1. Is there a way to mark a task as automatically restartable when its result is needed by a restarted downstream task? (This would save most of the storage used by generated task result files, since most of my tasks are inexpensive request-preparation tasks.)
  2. Is there a way to link a task run to its upstream and downstream task runs? (My idea would be to delete the persisted results once all downstream tasks have succeeded.)
  3. Is there simply a better approach to my problem?
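Prefect does not expose question 2 directly in its task API, but the idea can be sketched in plain Python; the graph representation and the notion of a "succeeded" set are hypothetical stand-ins for whatever the backend would report:

```python
def collectable_results(downstream, succeeded):
    """Return the ids of tasks whose persisted result can be deleted
    because every direct downstream task has already succeeded.

    downstream: dict mapping task id -> list of downstream task ids
    succeeded:  set of task ids that finished successfully
    """
    return {
        task_id
        for task_id, deps in downstream.items()
        if deps and all(d in succeeded for d in deps)
    }

# Illustrative dependency chain: a -> b -> c, where only b has succeeded.
graph = {"a": ["b"], "b": ["c"], "c": []}
done = {"b"}
ready = collectable_results(graph, done)  # a's result is no longer needed
```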

I would start with a question: would you rather design this solution in Prefect 2.0?

I’m asking since 2.0 is the LTS version and 1.0 is in maintenance mode, and we encourage users to migrate. If you are currently in the design phase of such a big project, it would be more beneficial to design it in the LTS product, i.e. Prefect 2.0 - docs.prefect.io

Also, this would be easier to accomplish in 2.0.

The migration resources are linked in:


I intended to perform the migration, but it was a lower priority than this problem.
But if there is a way to fix the problem “easily” in Prefect 2.0, I will swap priorities.

So my questions remain, but for Prefect 2.0.


I would recommend designing this workflow using a functional programming pattern in pure Python. Then add flow and task decorators where needed to turn it into a Prefect flow; given how easy it is to build composable subflows in 2.0, you can easily add retries, even to subflows.
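A minimal sketch of that approach, with hypothetical function names and stubbed scraping logic: write plain functions first, verify they work, and only then wrap them with Prefect 2.0 decorators:

```python
# Step 1: plain functional design, no orchestrator involved.
def prepare_requests(resources):
    """Turn raw resource ids into request URLs (illustrative logic)."""
    return [f"https://example.com/api/{r}" for r in resources]

def scrape_batch(urls):
    """Fetch one batch of URLs; stubbed here instead of real HTTP calls."""
    return {url: f"payload for {url}" for url in urls}

# Step 2 (sketch): once the functions behave correctly, decorate them,
# e.g. @task(retries=3) on scrape_batch and @flow on an orchestrating
# function that maps scrape_batch over batches of 1,000 URLs.
results = scrape_batch(prepare_requests(["r1", "r2"]))
```

The benefit of this ordering is that the business logic stays testable in isolation, and the Prefect decorators only add orchestration concerns (retries, state tracking) on top.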

Designing subflows indeed seems much easier with Prefect 2.0.

However, it seems the ability to manually restart a flow after failure (or simply to restart a flow) has been lost, along with the persistence of task results. Is that correct?
Is this temporary, or will it be brought back in a potentially different form?

As I am dealing with heavy and long flows, with a number of unavoidable and sometimes unforeseeable errors, I found the ability to manually restart a flow from its point of failure (or even from any point) to be very practical.


Restart and retry are very different features.

I do not believe we will introduce restarts from an arbitrary point, since you can simply rerun your script, even from a local machine, for the relevant subflows when needed, and all the execution metadata will be tracked in the backend.

Are all task results systematically persisted in Prefect 2.0?
I already had a problem in Prefect 1.0 when persisting all task results: it filled the storage very quickly, causing the flow to fail.

Is there a way in 2.0 to disable the persistence of task results, or to remove persisted results systematically once they are no longer needed?

Not yet, but Michael is starting work on it this week.

Regarding restart, this conversation might be relevant for you too:

In the meantime, what would be the best way to delete task results after a flow is completed? Or is the release of the feature imminent enough that waiting for it would make more sense?

I was thinking of running a clean-up task at the end of the flow that looks up all task run IDs via the flow run context and deletes their result files from storage.
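A sketch of such a clean-up step, assuming a hypothetical layout where each task run persists one local file named after its task run id (the directory structure and `.pickle` suffix are assumptions, not Prefect's documented behavior):

```python
from pathlib import Path

def delete_persisted_results(result_dir, task_run_ids):
    """Delete the persisted result file for each given task run.

    Assumes one file named <task_run_id>.pickle per task run in
    result_dir; returns the number of files actually removed.
    """
    removed = 0
    for task_run_id in task_run_ids:
        path = Path(result_dir) / f"{task_run_id}.pickle"
        if path.exists():
            path.unlink()
            removed += 1
    return removed
```

In the scenario described above, the list of task run ids would come from the flow run context at the end of the flow; already-missing files are simply skipped so the clean-up is safe to rerun.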

We have started working on an improved design for it this week; you should see a more configurable setup in the next releases.