Why does this error occur when running flows with a Kubernetes agent? "Pod prefect-job-xxx failed. No container statuses found for pod"

View in #prefect-community on Slack

Andrew_Lawlor @Andrew_Lawlor: I ran a flow-of-flows process that was supposed to kick off 12,000 flows on GKE. It started 10,000 flows and then failed with the message

Pod prefect-job-5e3af599-tl2xs failed. No container statuses found for pod

Where can I look for a more detailed message? Any idea what actually caused it to fail?
Also, of those jobs, all but one passed. The one that failed had the same message (no container statuses found).

Kevin_Kho @Kevin_Kho: Is the pod still up? You could look for the pod logs

Andrew_Lawlor @Andrew_Lawlor: It's not. I think it went down right away when it failed. Is there any way to see logs on old pods?

Kevin_Kho @Kevin_Kho: I think the pod still needs to exist. I assume it doesn't?

Andrew_Lawlor @Andrew_Lawlor: No, it doesn't.

Kevin_Kho @Kevin_Kho: I know this is hard to pair, but does that have a corresponding run in Prefect Cloud?
So this happens when there is a Failed pod. Prefect is the one that emits this log.

For example here, there is an underlying issue with the pod:

GitHub: prefect/agent.py at master · PrefectHQ/prefect

[May 2nd, 2022 2:18 PM] zach160: hi - we started seeing a new log from prefect manifesting in some failed k8s jobs. Any idea what could be going on here?

Andrew_Lawlor @Andrew_Lawlor: Hmm, OK. I'm not seeing other errors like he is, but it does seem pretty similar.

@Anna_Geller: It looks like your flow run was Submitted to your infrastructure and Prefect:
• submitted a Kubernetes job for the flow run,
• the flow run started (moved to a Running state)
• but then something failed (it could be that the container image couldn’t be pulled from the container registry, the Prefect flow couldn’t be pulled from storage, or there was some issue allocating resources for the run)
As a result, the flow run was marked as Failed - this is my understanding.

You may tackle this issue using a flow-level state handler - if you see this specific type of error, create a new flow run of this flow to sort of “restart/retrigger” the entire process
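A minimal sketch of that idea in Prefect 1.x (the handler name, the retry criterion, and the run name are illustrative assumptions, not an exact recipe):

```python
import prefect
from prefect import Flow
from prefect.client import Client


def retrigger_on_failure(flow, old_state, new_state):
    # Called on every flow-run state change; only act when the run ends up Failed.
    # You could additionally inspect new_state.message for the specific
    # "No container statuses found" error before retriggering.
    if new_state.is_failed():
        Client().create_flow_run(
            flow_id=prefect.context.get("flow_id"),
            run_name="retriggered-after-infra-failure",  # hypothetical run name
        )
    return new_state


flow = Flow("flow-of-flows", state_handlers=[retrigger_on_failure])
```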

Are you on Prefect Cloud? Can you send an example flow run ID to check the logs and confirm?

Andrew_Lawlor @Andrew_Lawlor: I am on Prefect Cloud. 94fb6788-3052-4c41-9a38-579d584c6fd7 is a flow ID.
And it did start: it ran some tasks successfully, then ran a mapped task 10,000 times successfully before it failed. I would rather not restart the entire process, but I would like to be able to restart from the point of failure.

@Anna_Geller: Thanks for providing more info. I guess I was confused when you said in the original message that you triggered 10,000 flow runs, but it looks like it’s rather a single flow run with 10,000 mapped task runs, correct? Let me check the logs.

Andrew_Lawlor @Andrew_Lawlor: The mapped task is a create_flow_run task.

@Anna_Geller: I see, so the ID you sent me is the flow run ID of the parent flow run that triggered 10,000 child flow runs via mapped create_flow_run task?

Andrew_Lawlor @Andrew_Lawlor: yes
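For reference, a parent flow following this pattern looks roughly like the sketch below (Prefect 1.x; the flow name, project name, and parameter list are placeholders):

```python
from prefect import Flow, unmapped
from prefect.tasks.prefect import create_flow_run

with Flow("parent-flow") as flow:
    # one child flow run per parameter dict; in this thread, roughly 10,000 of them
    child_run_ids = create_flow_run.map(
        parameters=[{"chunk": i} for i in range(10_000)],
        flow_name=unmapped("child-flow"),
        project_name=unmapped("my-project"),
    )
```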

@Anna_Geller: Thanks, the logs are helpful:

"Failed to set task state with error: ClientError([{'path': ['set_task_run_states'], 'message': 'State update failed for task run ID dbd483e5-a5f9-4155-b0a5-a6e96e6e8c2b: provided a running state but associated flow run 94fb6788-3052-4c41-9a38-579d584c6fd7 is not in a running state.', 'extensions': {'code': 'INTERNAL_SERVER_ERROR'}}])\nTraceback (most recent call last):\n  File \""/usr/local/lib/python3.9/site-packages/prefect/engine/cloud/task_runner.py\"", line 91, in call_runner_target_handlers\n    state = self.client.set_task_run_state(\n  File \""/usr/local/lib/python3.9/site-packages/prefect/client/client.py\"", line 1598, in set_task_run_state\n    result = self.graphql(\n  File \""/usr/local/lib/python3.9/site-packages/prefect/client/client.py\"", line 473, in graphql\n    raise ClientError(result[\""errors\""])\nprefect.exceptions.ClientError: [{'path': ['set_task_run_states'], 'message': 'State update failed for task run ID dbd483e5-a5f9-4155-b0a5-a6e96e6e8c2b: provided a running state but associated flow run 94fb6788-3052-4c41-9a38-579d584c6fd7 is not in a running state.', 'extensions': {'code': 'INTERNAL_SERVER_ERROR'}}]""}"

It looks like there are too many flow runs queued up and the execution layer cannot process them all at once. I wonder if you could try setting a concurrency limit to mitigate this? E.g. perhaps a concurrency limit of 500 for these child flow runs, so that your execution layer can handle them better rather than having all those child flow runs created at once?
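One possible way to apply that suggestion (a sketch, assuming Prefect 1.x with Prefect Cloud): give the child flow a dedicated label via its run config, then set a flow run concurrency limit for that label (e.g. 500) in Prefect Cloud. The label name and limit below are illustrative, and the agent must also carry the label so it still picks those runs up:

```python
from prefect import Flow
from prefect.run_configs import KubernetesRun

with Flow("child-flow") as child_flow:
    ...  # child flow tasks go here

# Every run of the child flow carries this label; a flow run concurrency limit
# set on the "child-flows" label in Prefect Cloud then caps how many of the
# child runs execute at the same time.
child_flow.run_config = KubernetesRun(labels=["child-flows"])
```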

Andrew_Lawlor @Andrew_Lawlor: I thought I was seeing those after the pod failed, so the tasks had been queued up but the flow had already failed.
But yeah, that does make sense. Is there a limit on what the execution layer can process at once? There were never more than 30 runs going at once. (I actually was going to ask about that too: it took 4 hours to create all those flow runs, and I was wondering why it took so long.)

@Anna_Geller: > is there a limit on what the execution layer can process at once?
There isn’t; it depends on what your infrastructure (your K8s cluster) can handle.

> i was wondering why it took so long
You had the right intuition, it takes some time to:
• create the underlying K8s job for each flow run
• and communicate state updates with the Prefect backend.
Queuing them up with concurrency limits could help prevent them from all being submitted at once (it would effectively batch-submit them), which could mitigate some flow runs getting stuck.