Explanation of the issue
Any time you see that your flow run has transitioned from a Scheduled into a Pending state, but it doesn’t move to the Running state, this indicates an issue in the execution layer, e.g.:
- your agent can’t deploy your flow run to a given infrastructure
- something is wrong in your Kubernetes or ECS cluster, or your VM
A Pending state means that the agent was able to pick up the scheduled flow run and submit it for execution, but something went wrong while deploying the flow run to the infrastructure.
How to resolve it?
- Verify that the agent process is running (e.g. its Kubernetes deployment, ECS service, or the dockerd daemon)
- Check the agent logs to see if anything suspicious stands out there
- Verify that your execution layer is able to pull your flow run’s image. For example, if the image needs to be pulled from a container registry, make sure your container can reach the Internet and has appropriate permissions to pull the image
- Verify that your execution layer has sufficient permissions to spin up the required resources, e.g. IAM roles and a valid Prefect API key
- Verify that your execution layer has enough capacity on the cluster to deploy your flow run. We’ve seen similar issues when the agent is starved for resources, so try allocating more CPU and memory to the agent process and see whether that helps
- Check whether the agent is polling too frequently. Frequent polling can consume so many resources that the agent has too little left to deploy runs to the infrastructure. Try decreasing the poll frequency to, e.g., once every 30 seconds:
prefect config set PREFECT_AGENT_QUERY_INTERVAL='30.0'
- Check whether more than one agent is polling for runs from the same work queue. We’ve seen issues when a user had multiple agents polling the same work queue, which often led to Pending runs that couldn’t get deployed efficiently
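
To decide which runs are worth investigating with the checklist above, it can help to flag runs that have sat in Pending longer than you would expect. Below is a minimal sketch in plain Python. It does not call the Prefect API: `FlowRunRecord`, `find_stuck_pending`, and the 5-minute threshold are hypothetical names and values chosen for illustration, and you would need to fill the records from wherever you read your flow run states (UI export, API client, logs).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional


@dataclass
class FlowRunRecord:
    """Hypothetical, simplified view of a flow run (not a Prefect class)."""
    name: str
    state: str              # e.g. "Scheduled", "Pending", "Running"
    state_since: datetime   # when the run entered its current state


def find_stuck_pending(
    runs: List[FlowRunRecord],
    threshold: timedelta = timedelta(minutes=5),  # assumed cutoff, tune to your setup
    now: Optional[datetime] = None,
) -> List[FlowRunRecord]:
    """Return runs that have been in the Pending state longer than `threshold`."""
    current = now or datetime.now(timezone.utc)
    return [
        run for run in runs
        if run.state == "Pending" and current - run.state_since > threshold
    ]


if __name__ == "__main__":
    now = datetime(2023, 1, 1, 12, 0, tzinfo=timezone.utc)
    runs = [
        FlowRunRecord("run-a", "Pending", now - timedelta(minutes=10)),
        FlowRunRecord("run-b", "Pending", now - timedelta(minutes=1)),
        FlowRunRecord("run-c", "Running", now - timedelta(minutes=30)),
    ]
    for run in find_stuck_pending(runs, now=now):
        print(f"{run.name} has been Pending since {run.state_since.isoformat()}")
```

Only runs that are still Pending past the threshold are returned, so a run that was picked up a moment ago and is about to start is not reported as stuck.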