View in #prefect-community on Slack
I have a question regarding running Prefect on AWS ECS. I am currently using Fargate to launch my flows. I have a pretty big Docker image (~2-3 GB uncompressed) which adds some dependencies (not only Python) on top of the Prefect Docker image. The problem is that for every flow, Fargate pulls the image from AWS ECR (in the same VPC), which results in multiple minutes of startup time. Most of the runs are small, so I need to wait a couple of minutes for the start and then they finish within a few seconds. Assuming I start 100 flows a day, this would result in 200-300 GB of pulling the same image.
My first idea was to split the image into multiple images and use subflows. Then every subflow could specify which image and dependencies it needs. Or I could try to reduce the image size somehow. But in both cases, even at 0.5 GB per image, it would still result in pulling 50 GB a day.
I found this AWS issue regarding caching the image: https://github.com/aws/containers-roadmap/issues/696
Unfortunately caching is currently only supported for EC2, not for Fargate. So my second idea was to use EC2 instead? But I am not sure how well it scales. This would mean starting and stopping EC2 instances depending on how many flows are running, so it might just shift the startup problem, as flows might need to wait for another EC2 instance to start.
I used this tutorial to set-up everything for Fargate: https://towardsdatascience.com/how-to-cut-your-aws-ecs-costs-with-fargate-spot-and-prefect-1a1ba5d2e2df (thanks to @Anna_Geller, this works great!)
But I could not figure out how to do it for EC2 properly. If I understand correctly, EC2 has one IP per instance while Fargate has one IP per flow, so the set-up would be a little different.
My main problem is the long startup time of multiple minutes and I am not sure what’s the best way to deal with it. Maybe someone experienced the same problem and found a better solution?
@Noah_Holm: With such a large image I’d start off with trying to optimise the image. Have you spent any time on that? E.g., looking at multi-stage builds
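To illustrate the multi-stage idea: build-time toolchains (compilers, headers) often account for a large share of image size, and a multi-stage build keeps them out of the final image. A minimal sketch — the base images, file names, and apt packages here are assumptions, not from this thread:

```dockerfile
# Build stage: install the full toolchain and pre-build wheels
# (build-essential is a placeholder for whatever your deps need to compile)
FROM python:3.9-slim AS builder
RUN apt-get update && apt-get install -y --no-install-recommends build-essential
COPY requirements.txt .
RUN pip wheel --wheel-dir /wheels -r requirements.txt

# Final stage: start from the Prefect base image and install only the
# pre-built wheels — the compiler toolchain never reaches this image
FROM prefecthq/prefect:latest
COPY --from=builder /wheels /wheels
COPY requirements.txt .
RUN pip install --no-index --find-links=/wheels -r requirements.txt && rm -rf /wheels
```

The final image then contains only the installed packages, not the build dependencies, which can cut hundreds of MB depending on what you compile.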
@Nico_Neumann: Thanks for your reply!
It is not fully optimised yet, so this would definitely help a little. But my dependencies may grow in the future, so it might be a good idea to split them into multiple images
@Noah_Holm: I think it’s very valuable to have some tricks to optimise the size so def worth looking into. Depending on what you’re including in the image it could be easy or hard ofc. But splitting them up might be relevant as well, maybe you end up making it more understandable with smaller sized flows?
An example from our environment: we have an image that is 405 MB, or ~125 MB compressed on ECR. Looking at a couple of example runs, it takes about 30 seconds from submission to running on ECS Fargate
@Anna_Geller: There are a couple of different problems here and things to unpack so I’ll try to approach it in a structured way.
1. Docker image size
There are multiple ways to reduce the Docker image size. You can check e.g. this tutorial to learn more.
Unfortunately, with serverless the image must be pulled every time, since each time your ECS task may get deployed to a completely different EC2 instance.
1.1 Why is this image so big? Can you share the Dockerfile?
1.2 Why do you have non-Python dependencies directly in your flow Docker image? Do you trigger those non-Python processes via subprocess, e.g. using ShellTask? If so, you could consider moving them into a separate ECS task to separate out those dependencies and potentially mitigate the issue, especially if you run flows with this non-Python dependency infrequently. You could e.g. use this unofficial ecsruntask, and here is a usage example for this task in a flow: ecsruntask_usage.py
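To sketch what such a task does under the hood: launching a separate ECS task amounts to a boto3 `ecs.run_task` call. The helper below only assembles the call arguments (cluster, task definition, and subnet names are placeholders, not from this thread); the actual AWS call is shown in a comment since it requires credentials:

```python
def build_run_task_kwargs(cluster, task_definition, subnets):
    """Assemble keyword arguments for boto3's ecs.run_task.

    All names passed in are placeholders — substitute your own
    cluster, registered task definition, and VPC subnets.
    """
    return {
        "cluster": cluster,
        "taskDefinition": task_definition,
        "launchType": "FARGATE",
        "networkConfiguration": {
            "awsvpcConfiguration": {
                "subnets": subnets,
                "assignPublicIp": "ENABLED",
            }
        },
    }

# Inside a flow task you would then run (requires AWS credentials):
#   import boto3
#   ecs = boto3.client("ecs")
#   response = ecs.run_task(**build_run_task_kwargs(
#       "prefect-cluster", "my-shell-task:1", ["subnet-0abc123"]))
```

This way the heavy non-Python dependencies live in the image of the separate task definition, and your flow image stays small.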
2. Latency/Startup time differences between serverless and non-serverless ECS data plane
The fact that you need to wait a couple of minutes until your ECS container starts is completely normal. You need to consider all the work that AWS is doing here:
- checking your workload’s requirements
- identifying an EC2 instance that satisfies those requirements (enough CPU/memory, the right region and AZ)
- pulling the image for your container
- starting the container
- sending logs to CloudWatch
- and then Prefect also needs to do a lot of work to communicate with this container to manage the task and flow run states, pull the logs, etc.
If this latency is not acceptable for your workflows, you should consider either:
- switching to ECS with a self-managed EC2 data plane with instances that are always on → you’re spot on that this doesn’t scale as well, since you would need to manage that compute yourself
- switching to a Kubernetes agent on AWS EKS with managed node groups, which gives almost all the same benefits of a fully managed service as Fargate does, but without the “serverless latency” (at a slightly higher price) → regarding your worry about scale, EKS makes it incredibly easy to add additional managed nodes to your cluster as you need, or you could even combine it with EKS on Fargate; I discussed this in detail in this blog post.
3. Consider AWS EKS with managed node groups instead of ECS Fargate for latency-sensitive workloads
If you want to spin up an EKS cluster instead, you can do that using eksctl. To spin up a cluster with a single node you need a single command which under the hood triggers a CloudFormation stack handling all the infrastructure details:
eksctl create cluster --name=prefect-eks --nodes=1
This blog post discusses the topic in much more detail, incl. a full walkthrough and even a CI/CD pipeline for that.
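For repeatable setups, eksctl also accepts a cluster config file instead of flags, which makes the node group sizing explicit. A sketch — the region, names, instance type, and sizes below are illustrative assumptions:

```yaml
# cluster.yaml — spin up with: eksctl create cluster -f cluster.yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: prefect-eks
  region: us-east-1
managedNodeGroups:
  - name: prefect-nodes
    instanceType: m5.large   # placeholder — size to your flows' needs
    desiredCapacity: 1
    minSize: 1
    maxSize: 3               # headroom for scaling out managed nodes
```

Scaling the node group later is a single command against the same config, so the "more flows than capacity" worry is mostly a matter of raising maxSize.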
An additional benefit of using AWS EKS with managed node groups is that your instances are always on, therefore you don’t have to pull the Docker image if it already exists on the machine! You can do that by setting the imagePullPolicy in your Kubernetes job template (see the example in src/prefect/agent/kubernetes/job_template.yaml) that you can pass to a KubernetesRun, or you can set it up directly when starting the agent:
prefect agent kubernetes install --key YOUR_API_KEY --label eks
This will return a Kubernetes manifest for the KubernetesAgent that contains an imagePullPolicy field. You can see that by default it’s set to “Always”, but you can change it to “IfNotPresent” so that an image already cached on the node is reused. You could also set it directly on your run config:
flow.run_config = KubernetesRun(image_pull_policy="IfNotPresent")
More on that policy here.
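For reference, the relevant fragment of such a job template might look like this (a sketch — the container name and image are placeholders, not your actual values):

```yaml
apiVersion: batch/v1
kind: Job
spec:
  template:
    spec:
      containers:
        - name: flow                                    # placeholder name
          image: <account>.dkr.ecr.<region>.amazonaws.com/flow-image:latest
          imagePullPolicy: IfNotPresent   # reuse the image cached on the node
```

With IfNotPresent, the kubelet only pulls the image when no local copy exists, which is exactly what eliminates the repeated multi-GB pulls on always-on nodes.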
So overall, I totally understand your concerns and your frustration with the ECS Fargate latency. Out of curiosity I once benchmarked EKS on Fargate vs. EKS on managed node groups, and Fargate was 12 times slower than managed node groups. So serverless is great and all, but you need to be patient with it. Perhaps you can do a small PoC with the setup from the dbt part 2 article and compare it with the ECS Fargate setup to decide which one works better for you.