Why is my flow stuck in a Submitted State?

Explanation of the issue

If your flow run has transitioned from a Scheduled to a Submitted state but never moves to a Running state (i.e. it’s stuck in a Submitted state), this indicates an issue in the execution layer, e.g.:

  • your agent can’t deploy your flow run to a given execution layer
  • something is wrong in your Kubernetes or ECS cluster

A Submitted state means that the agent picked up the scheduled flow run and submitted it for execution, but something went wrong while the flow run was being deployed.

After some time, this stuck flow run will be resurrected by the Lazarus process.
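To make "stuck" concrete, here is a minimal Python sketch that filters a list of flow runs down to those that have sat in a Submitted state longer than a chosen threshold. The record shape (`id`, `state`, `state_timestamp`), the sample data, and the 10-minute threshold are assumptions made for illustration. They are not taken from the Prefect API or from the Lazarus process, so adapt them to whatever your backend actually returns.

```python
from datetime import datetime, timedelta, timezone


def find_stuck_runs(flow_runs, now, threshold=timedelta(minutes=10)):
    """Return IDs of runs that have been Submitted for longer than `threshold`."""
    stuck = []
    for run in flow_runs:
        if run["state"] != "Submitted":
            continue
        # Timestamps are assumed to be ISO 8601 strings with a UTC offset.
        since = datetime.fromisoformat(run["state_timestamp"])
        if now - since > threshold:
            stuck.append(run["id"])
    return stuck


# Hypothetical sample records, not real API output.
now = datetime(2022, 1, 1, 12, 30, tzinfo=timezone.utc)
sample = [
    {"id": "run-a", "state": "Submitted", "state_timestamp": "2022-01-01T12:00:00+00:00"},
    {"id": "run-b", "state": "Submitted", "state_timestamp": "2022-01-01T12:28:00+00:00"},
    {"id": "run-c", "state": "Running", "state_timestamp": "2022-01-01T11:00:00+00:00"},
]
print(find_stuck_runs(sample, now))  # ['run-a']
```

Only run-a is reported: run-b was submitted two minutes ago and run-c has already reached Running.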

How can I resolve it?

Here are some steps you may take to investigate it further:

  • check the agent logs for anything suspicious
    (running the agent with the --show-flow-logs option can produce more useful output)
  • verify that the agent process is running (e.g. the Kubernetes deployment, the ECS service, or the dockerd daemon)
  • verify that your execution layer is able to pull your flow run’s image, e.g. if the image needs to be pulled from a container registry, make sure your container can reach the Internet and has the appropriate permissions to pull the image
  • verify that your execution layer has sufficient permissions to spin up the required resources. For instance:
    • the Kubernetes agent needs proper RBAC permissions to deploy your flow run as a Kubernetes job
    • the ECS agent needs a custom task_execution_role with permissions to describe VPCs, register a task definition and run the ECS task
  • verify that your execution layer has enough capacity on the cluster to deploy your flow run.
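
For a Kubernetes execution layer, the checks above can be sketched as a small Python helper that assembles read-only kubectl commands. The namespace "prefect" and the service account "default" are placeholders, not values from this post, so substitute the ones your agent uses. Image pull failures usually appear in the events as ErrImagePull or ImagePullBackOff, and capacity problems usually show up as pods that stay Pending.

```python
import shlex
import subprocess


def kubectl_checks(namespace, service_account):
    """Build read-only kubectl commands covering the checks listed above."""
    subject = f"system:serviceaccount:{namespace}:{service_account}"
    return {
        # Is the agent deployment up?
        "agent_running": ["kubectl", "get", "deployments", "-n", namespace],
        # Pods stuck in Pending often point to missing cluster capacity.
        "pending_pods": ["kubectl", "get", "pods", "-n", namespace,
                         "--field-selector=status.phase=Pending"],
        # Recent events surface image pull and scheduling errors.
        "recent_events": ["kubectl", "get", "events", "-n", namespace,
                          "--sort-by=.lastTimestamp"],
        # RBAC: may the service account create jobs in this namespace?
        "can_create_jobs": ["kubectl", "auth", "can-i", "create", "jobs",
                            "-n", namespace, "--as", subject],
    }


def run_checks(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for name, argv in commands.items():
        print(f"== {name}: {shlex.join(argv)}")
        if not dry_run:
            result = subprocess.run(argv, capture_output=True, text=True)
            print(result.stdout or result.stderr)


# Placeholder namespace and service account; dry run only prints the commands.
commands = kubectl_checks("prefect", "default")
run_checks(commands)
```

Calling run_checks(commands, dry_run=False) on a machine with kubectl configured for your cluster executes the commands. The can-i check uses impersonation, so your own user needs permission to impersonate the service account.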

If none of that helped, reply to this message or open an issue on GitHub!

A similar issue occurs when sys.exit() is used in the flow code.
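
One plausible reason, shown here as a sketch rather than a confirmed explanation of Prefect's internals: sys.exit() raises SystemExit, which inherits from BaseException and is therefore not caught by an `except Exception` handler. The runner function below is hypothetical and only mimics code that reports a final state by catching Exception.

```python
import sys


def task_body():
    sys.exit(1)  # raises SystemExit instead of a regular exception


def run_with_exception_handler(fn):
    """Hypothetical runner that turns ordinary exceptions into a Failed state."""
    try:
        fn()
        return "Success"
    except Exception:
        return "Failed"


try:
    state = run_with_exception_handler(task_body)
except SystemExit as exc:
    # SystemExit bypassed the handler, so the runner never returned a state.
    state = f"no state reported (SystemExit code {exc.code})"

print(state)  # no state reported (SystemExit code 1)
```

Raising an ordinary exception in the flow code instead of calling sys.exit() lets a handler like this one record the failure.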