How scalable is Prefect Server for scheduling concurrent runs of tens of thousands of flows?

View in #prefect-server on Slack

Sam_Brownlow @Sam_Brownlow: tl;dr How scalable is Prefect Server for scheduling concurrent runs of 10s of thousands of flows?

I’m evaluating Prefect Server as a solution for managing a data pipeline.

How I envisioned this working was using the Prefect framework as a computationally minimal “glue.” Prefect would be the central pipe through which the state of the data flows (but not the data itself). It would hand off all computation to microservices via (a)sync requests and its only concern would be controlling how the state is pushed from service to service.

Prefect seems to be a great tool for this job; I really like the API and how quickly a developer can get up and running.

However, there are a couple of things that I don’t yet have total clarity on.

This documentation describes Server as a “highly scalable scheduler” but also says that it may start degrading at “~10-20 tasks running concurrently against a typical Server.”

Does the above degradation occur because the typical Server is deployed to a single node via docker-compose with a single agent, instead of being deployed via something like Helm, which horizontally scales scheduled flows across a cluster of agents?

How quickly does the size of the PostgreSQL database generally grow, relative to the number of flows run? Is there any reason that regularly deleting old runs would be any more complicated than as suggested here?

Are there any case studies with Prefect Server being used to run 10s of thousands of concurrent flows?

I see that there used to be a Nomad Agent, are there any helpful resources for running Prefect on a nomad cluster?

Thanks for any advice you are able to share. I have been diving into Prefect for only the past couple of days so greatly appreciate any referential pointers here.

Server Overview | Prefect Docs

Stack Overflow: Cleaning ~/.prefect/pg_data/ when using Prefect

GitHub: Background on Nomad Agent removal? · Discussion #3575 · PrefectHQ/prefect

Anna_Geller @Anna_Geller: Thanks for a detailed and thorough writeup of your use case! Let me try to unpack the answer into multiple steps to give you an equally thorough response.

A bit old but still valid response from Chris on the question of scale:

“The open-source Prefect Server was not designed for that sort of scale; as described in this new doc, this is one of the reasons people migrate to Prefect Cloud, which is designed for scale and performance.”

Stack Overflow: How does Prefect scale with thousands of workflows concurrently?

Server Overview | Prefect Docs

> Prefect would be the central pipe through which the state of the data flows (but not the data itself). It would hand off all computation to microservices via (a)sync requests and its only concern would be controlling how the state is pushed from service to service.
Prefect is a dataflow automation tool, so it makes sense to leverage the benefits of Prefect e.g. with respect to passing data between tasks. If you instead write your actual pipelines in some separate (micro)service, you don’t take advantage of Prefect with respect to visibility, and Prefect then merely serves as a job scheduler rather than a workflow orchestrator and data automation tool. This is a valid use case for Prefect, but not really taking advantage of its capabilities.

Still, we have many users who use Prefect e.g. to orchestrate AWS Batch jobs - a Prefect task triggers a batch job and polls for execution state of that job. When it finishes successfully, you see that a task and flow run were successful. At the same time, you don’t really have any information about what is happening within that job - again, a totally valid use case if you prefer this option.
> Does the above degradation occur because the typical Server is deployed to a single node via docker-compose with a single agent, instead of being deployed via something like Helm, which horizontally scales scheduled flows across a cluster of agents?
Scaling Server is hard, to be honest. For Prefect Cloud, we have an entire infrastructure team managing the underlying compute and ensuring that all services scale and are running reliably at all times.

But if you want to do it yourself, the database is the only stateful component for which you need to ensure that it scales properly. How quickly your storage grows depends on so many factors (number of flows, flow runs, logs, …) but if you pick some Postgres-compatible Cloud database like AWS Aurora or GCP Cloud Spanner, it should allow you to grow as you need to.

I don’t know enough here to judge if simply having 3 Apollo containers instead of one can be immediately used to scale the service. I would suggest starting by scaling vertically, i.e. assigning more vCPU and RAM to your Server components when needed before trying to scale horizontally and introducing load balancing (which is not trivial IMO).
> horizontally scales scheduled flows across a cluster of agents?
Scaling agents that way is not necessary if you use e.g. KubernetesAgent - each agent (even a local one!) itself is a lightweight process that polls for new scheduled flow runs and deploys those - for KubernetesAgent, it deploys flow runs as separate Kubernetes jobs
> Are there any case studies with Prefect Server being used to run 10s of thousands of concurrent flows?
I’m not aware of any such case study, most users that need that scale are using Prefect Cloud
> I see that there used to be a Nomad Agent, are there any helpful resources for running Prefect on a nomad cluster?
There was one user in the community, but the conversation about it is gone from Slack and they didn’t contribute it to the main repo

GitHub: Background on Nomad Agent removal? · Discussion #3575 · PrefectHQ/prefect

ok, I guess all the questions are answered.

Next time, it would be great if you could ask each question in a separate thread as it makes it much easier to respond to specific questions.
and since you mentioned async - I assumed you meant asynchronous requests i.e. you send a request to process a job on some external service and rather than waiting, it “asynchronously” responds immediately with a job status code and you then poll for the status of that job rather than waiting synchronously

but if you instead meant Python async, check out Prefect 2.0 https://orion-docs.prefect.io/getting-started/overview/

Sam_Brownlow @Sam_Brownlow: Thank you for these detailed responses; they are very helpful.

I will give some thought to what you’ve said about the benefits of using Prefect as a dataflow framework instead of just a scheduler framework, while also considering that the “database is the only stateful component for which you need to ensure that it scales properly.” fwiw, my infra coworker has said that Hasura scales well horizontally.