My flow run in Prefect 2 is stuck in a Pending state - what can I do?

Explanation of the issue

If your flow run has transitioned from the Scheduled to the Pending state but never moves on to Running, this indicates an issue in the execution layer, e.g.:

  • your agent can’t deploy your flow run to a given infrastructure
  • something is wrong in your Kubernetes or ECS cluster, or your VM

A Pending state means that the agent was able to pick up the scheduled flow run and submitted it for execution, but something went wrong while deploying the flow run.
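The lifecycle described above can be sketched as a toy state machine. This is purely illustrative (the transition table and `next_state` helper below are not part of Prefect's API), but it shows where Pending sits: after the agent's submission, before the infrastructure actually starts the run.

```python
# Illustrative sketch of the flow run lifecycle around the Pending state.
# This is NOT Prefect's implementation -- just a toy model of the
# transitions described above.

VALID_TRANSITIONS = {
    "Scheduled": {"Pending"},           # agent picks up the run and submits it
    "Pending": {"Running", "Crashed"},  # infrastructure starts the run, or fails to
    "Running": {"Completed", "Failed"},
}

def next_state(current: str, infra_started: bool) -> str:
    """Return the state a Pending run moves to, given whether the
    execution layer (Kubernetes job, ECS task, process, ...) came up."""
    if current != "Pending":
        raise ValueError(f"expected a Pending run, got {current!r}")
    return "Running" if infra_started else "Crashed"

# A run stuck in Pending is one where infra_started never becomes True:
print(next_state("Pending", infra_started=True))   # healthy run
print(next_state("Pending", infra_started=False))  # the failure mode above
```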

How to resolve it?

  1. Verify that the agent process is running, e.g. as a Kubernetes deployment, an ECS service, or the dockerd daemon
  2. Check the agent logs to see if anything suspicious stands out there
  3. Verify that your execution layer is able to pull your flow run’s image, e.g. if the image needs to be pulled from a container registry, make sure your container can reach the Internet and has the appropriate permissions to pull the image
  4. Verify that your execution layer has enough permissions to spin up the required resources, e.g. IAM roles and a valid Prefect API key
  5. Verify that your cluster has enough capacity to deploy your flow run - we’ve seen similar issues when the agent is starved for resources - try allocating more CPU and memory to the agent process and see whether that helps
  6. Check whether the agent is polling too frequently, consuming so many resources that too few are left to deploy runs to the infrastructure: try decreasing the poll frequency, e.g. to 30 seconds: prefect config set PREFECT_AGENT_QUERY_INTERVAL='30.0'
  7. Check if there is more than one agent polling for runs from the same work queue - we’ve seen issues when a user had multiple agents polling from the same work queue, which often led to Pending runs that couldn’t get deployed efficiently
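The triage heuristic behind these steps - a run that sits in Pending for much longer than an agent polling interval is probably stuck - can be sketched in plain Python. The record shape and the 60-second threshold below are illustrative assumptions, not Prefect's API; in practice you would read flow run states from the Prefect API or UI.

```python
from datetime import datetime, timedelta

# Hypothetical record shape, for illustration only.
flow_runs = [
    {"name": "run-a", "state": "Pending", "since": datetime(2022, 12, 12, 18, 0)},
    {"name": "run-b", "state": "Running", "since": datetime(2022, 12, 12, 18, 24)},
]

def stuck_in_pending(runs, now, threshold=timedelta(seconds=60)):
    """Flag runs that have been Pending longer than the threshold.

    60 seconds is an arbitrary cutoff: a couple of agent polling
    intervals should be plenty for a healthy run to reach Running."""
    return [
        r["name"]
        for r in runs
        if r["state"] == "Pending" and now - r["since"] > threshold
    ]

now = datetime(2022, 12, 12, 18, 25)
print(stuck_in_pending(flow_runs, now))  # run-a has been Pending for 25 minutes
```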

Hi, I built a toy example and ran into the Pending issue (the run was marked as “Late” and then “Pending”). I have checked everything listed above but could not resolve the issue. Can you please shed more light on what could go wrong there?

Here is my example:

from prefect import flow
from prefect.deployments import Deployment, run_deployment

@flow(name="my_flow", log_prints=True)
def my_flow():
    print("123")

def main():
    # Build and apply the deployment programmatically
    Deployment.build_from_flow(
        flow=my_flow,
        name="test",
        work_queue_name="prod-test",
        skip_upload=False,
        apply=True,
    )
    # timeout=0 returns immediately instead of waiting for the run to finish
    run_deployment(name="my_flow/test", timeout=0)

if __name__ == "__main__":
    main()

And I made sure that both Orion and the agent are running:

prefect orion start
prefect agent start --match 'prod-'

What I got from the UI: screenshots of the work queue, the deployment, and the flow run stuck in a Pending state.

PS: Neither Orion nor the agent reported any errors.

Perhaps try deploying with the CLI as shown in this example, also using the match pattern?

Hi Anna, thanks for your reply. My actual use case involves a main flow that accepts many dataclasses as arguments. The problem with the prefect deployment CLI command is that it does not accept dataclasses. Even if I refactored my main flow to convert all dataclasses into plain parameters, there would be too many parameters to type into the CLI.

Why would you type those parameters into the CLI? The parameter schema should be automatically inferred from the flow function.

Sorry, I didn’t make that clear. I meant that when building a deployment with the CLI, passing lots of parameters via the --params option might not be viable for me. So I am looking to programmatically generate the params under the “parameters” section in deployment.yaml. I will let you know should the Pending issue occur again.


Passing params through the CLI is only required if you want to override them for a custom deployment that uses different default values than those set on the flow function.

Perhaps changing those defaults directly on the flow function would work better?

The alternative is to first build a deployment (this generates a YAML file) and then modify those default values in the YAML file.
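For the dataclass use case discussed above, one option is to serialize the dataclass values into a plain dict with the standard library’s dataclasses.asdict and use the result as the parameters mapping - either written into the “parameters” section of the generated YAML file, or (if I understand build_from_flow correctly) passed through its parameters= keyword. A minimal sketch; the TrainingConfig and DataConfig classes and their fields are made up for illustration:

```python
from dataclasses import dataclass, asdict
import json

# Hypothetical flow parameters -- stand-ins for the "many dataclasses"
# mentioned above.
@dataclass
class TrainingConfig:
    learning_rate: float = 0.01
    epochs: int = 10

@dataclass
class DataConfig:
    path: str = "s3://bucket/data"
    batch_size: int = 32

# asdict() recursively turns dataclasses into plain dicts, which is the
# shape the deployment's parameters mapping expects.
parameters = {
    "training": asdict(TrainingConfig()),
    "data": asdict(DataConfig()),
}

# Verify the result is JSON/YAML-serializable before handing it to Prefect:
print(json.dumps(parameters, sort_keys=True))
```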

I haven’t found a GitHub issue regarding problem number 7 - “Check if there is more than one agent polling for runs from the same work queue…”.

I am currently having this problem: I have 5 agents all working on the same queue, each with a concurrency limit of 10. I prefer to have many agents instead of one with a higher concurrency, in case any of them (or the node it runs on) crashes.

Is there a plan to fix this?

Thanks in advance 🙂

Hello Anna,

I am having a similar issue. I am running everything from a docker compose file, as described by your colleague in the rpeden/prefect-docker-compose repository on GitHub (a repository that makes it easy to get up and running with Prefect 2 using Docker Compose).

I believe this issue has to do with the agent talking to the Prefect server - or, rather, not talking.
It seems like my agent is able to execute the flow, as the terminal contains logs like:

prefect_docker-agent-1  | 17:19:00.643 | INFO    | prefect.infrastructure.process - Process 'space-tuna' exited cleanly.
prefect_docker-agent-1  | 17:19:00.657 | INFO    | prefect.infrastructure.process - Process 'astute-degu' exited cleanly.
prefect_docker-agent-1  | 17:19:00.680 | INFO    | prefect.agent - Completed submission of flow run '8fca7466-eda8-4141-8149-002367585715'
prefect_docker-agent-1  | 17:19:00.694 | INFO    | prefect.agent - Completed submission of flow run '5f73e53d-e64b-4f90-aa76-2ff72bc08cfb'
...

Meanwhile, the result in the Prefect UI is exactly as shown here: Pending. When I open the flow up, it’s stuck on “Waiting for logs…”

I tried launching an agent from the CLI and it works. My ideal solution would be to not use the CLI, though. What do you think?

Edit: one other thing - I tried using the --expose flag as a workaround, as mentioned here. Unfortunately, this results in an error when I run docker compose --profile server up (as described in the prefect-docker-compose GitHub repo I linked).

prefect_docker-server-1    | ╭─ Error ──────────────────────────────────────────────────────────────────────╮
prefect_docker-server-1    | │ No such option: --expose                                                     │
prefect_docker-server-1    | ╰──────────────────────────────────────────────────────────────────────────────╯
prefect_docker-server-1 exited with code 2

Note that my base image is prefecthq/prefect:2.10.16-python3.11, so I’m not sure why the flag isn’t supported.

To add, I also tried utilizing the --show-flow-logs flag on the agent, from here:

  ## Prefect Agent
  agent:
    image: prefecthq/prefect:2.10.16-python3.11
    restart: always
    entrypoint: ["prefect", "agent", "start", "-q", "default", "DEBUG", "--show-flow-logs"]
    environment:
      - PREFECT_API_URL=http://server:4200/api
    profiles: ["agent"]

Just as --expose did, --show-flow-logs also failed.