Hi there, I'm looking for some more experienced Prefect users to help me with an architecture problem.
TL;DR: for very large workloads, is it good practice (or at least not bad) to have one orchestrating flow run group jobs into batches based on size, and trigger those batches gradually, releasing more as earlier ones complete, so as not to overcrowd the Prefect server and dashboard? And is it better to use run_deployment or subflows to achieve this? (For example, a run_deployment can be picked up by any worker in the pool, but does a subflow always run on the same machine?)
General Setup:
I'm currently rolling out self-hosted Prefect throughout our company and it's great.
We're using Dockerised flows to create a few different deployments, and many of them follow the same pattern:
We have a very large backlog of processing that is sometimes hundreds of millions of jobs.
We may never get through them all, but we can prioritise the list.
In general this means running many flow runs in order of priority (which is simple enough).
However, occasionally a business case bumps the priority of a subset of those jobs, so being able to go in and re-schedule specific jobs at a higher priority is great. When the background work eventually reaches those jobs it will see they've already been processed and skip them (though it would be better not to create duplicates at all, if there is a way to programmatically change a job's priority…?)
So a naïve solution would be to set up a work pool with two priority queues. Unless told otherwise, the workers pick up jobs from the lower-priority queue whenever the high-priority queue is empty.
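Concretely, this is roughly how I imagined submitting work to those two queues. The deployment and queue names are just placeholders, and I'm assuming run_deployment's work_queue_name and timeout parameters behave the way I've understood them (timeout=0 meaning fire-and-forget):

```python
from prefect.deployments import run_deployment

# Placeholder names: "process-job/backlog" is my deployment, "high"/"low"
# are the two queues created on the work pool with different priorities.
def submit_job(job_id: str, high_priority: bool = False) -> None:
    run_deployment(
        name="process-job/backlog",                          # "<flow name>/<deployment name>"
        parameters={"job_id": job_id},
        work_queue_name="high" if high_priority else "low",  # route to the right queue
        timeout=0,  # return immediately instead of waiting for the run to finish
    )
```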
The complications to this basic setup are these:
- Is it a bad idea to trigger, say, 1 million flow runs at once? Is the staggered approach from the TL;DR a reasonable alternative?
- In some cases we know up front that the jobs will be short-lived, so it makes sense to batch them in groups of, say, 10 and run them as tasks (Docker + Prefect has a not-insignificant startup cost per flow run). With the approach from the TL;DR it would be easy to group these jobs dynamically by size as the orchestrator triggers them; does that seem like a good idea? (See the sketch after this list.)
- As mentioned in the TL;DR, would you use subflows or run_deployment for this? Can they both split work across the workers in the pool? I found it hard to find a working example of either where I could listen for runs completing in order to schedule more.
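For reference, here's a stripped-down sketch of the staggered/batched orchestrator I've been describing. All the names are placeholders, I'm assuming run_deployment can be awaited from an async flow and that, with no timeout set, it blocks until the triggered run reaches a terminal state; and for a real backlog I'd page through the job list rather than hold it all in memory:

```python
import asyncio

from prefect import flow, task
from prefect.deployments import run_deployment

BATCH_SIZE = 10     # group short-lived jobs so the Docker/Prefect startup cost is shared
MAX_IN_FLIGHT = 50  # cap on concurrent batch runs, so the server/UI isn't flooded


@task
def process_job(job_id: str) -> None:
    ...  # the actual unit of work; skips itself if the job has already been processed


@flow
def process_batch(job_ids: list[str]) -> None:
    # One flow run (one container) works through a whole batch of short jobs as tasks.
    process_job.map(job_ids)


@flow
async def orchestrate(job_ids: list[str]) -> None:
    batches = [job_ids[i : i + BATCH_SIZE] for i in range(0, len(job_ids), BATCH_SIZE)]
    in_flight = asyncio.Semaphore(MAX_IN_FLIGHT)

    async def submit(batch: list[str]) -> None:
        async with in_flight:
            # run_deployment creates an ordinary flow run on the work pool, so any
            # worker can pick it up. Awaiting it (with no timeout) blocks until the
            # run finishes, which releases the semaphore and lets the next batch go.
            await run_deployment(
                name="process-batch/backlog",  # placeholder deployment for process_batch
                parameters={"job_ids": batch},
            )

    await asyncio.gather(*(submit(b) for b in batches))
```

My (possibly wrong) understanding is that calling process_batch directly as a subflow would keep all the work inside the orchestrator's own container, whereas run_deployment hands each batch back to the pool, which is why I've leaned towards the latter.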
I'm finding myself a bit overwhelmed by these more complex setups, and I'm unsure how much to invest in debugging the subflow approach without knowing whether what I'm doing is needlessly overcomplicated.
Massively appreciate any advice from more experienced Prefect users!