Work queue concurrency limit and deleted flow runs

kkamenev · January 24, 2023, 1:26pm

Hi,

First some background. Our Prefect setup consists of Prefect Cloud and one agent that runs on Kubernetes. The agent pulls runs from a queue called agent_1. The queue has a concurrency limit of 4. Without the concurrency limit, the agent pulls all the work there is (we have quite a lot of runs), which leads to out-of-memory issues.

Sometimes the agent process is restarted (e.g. when Kubernetes updates the Pod). If a flow run was running on the agent when it was restarted, it ends up in a corrupted state. Orion thinks it is running on the agent, and the restarted agent knows nothing about that run, so it never updates the state → the flow run remains Running forever. The same also happens with the Pending state. As soon as we have 4 such “ghost” runs, the concurrency limit is reached and the agent no longer picks up new runs. To deal with that, we implemented a scheduled job that checks for flow runs that have remained in the Running or Pending state for too long and deletes them.
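As an illustration of the selection step of such a cleanup job, here is a minimal sketch in plain Python. It does not call the Prefect API: `RunRecord`, `find_stale_runs` and the two-hour threshold are hypothetical names and values chosen for the example, and fetching or deleting runs is left out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class RunRecord:
    """Hypothetical, simplified view of a flow run (not a Prefect class)."""
    run_id: str
    state: str             # e.g. "Pending", "Running", "Completed"
    state_since: datetime  # when the run entered its current state


def find_stale_runs(runs, now, max_age):
    """Return runs that have sat in Pending or Running longer than max_age."""
    return [
        r for r in runs
        if r.state in ("Pending", "Running") and now - r.state_since > max_age
    ]


now = datetime(2023, 1, 24, 13, 0, tzinfo=timezone.utc)
runs = [
    RunRecord("a", "Running", now - timedelta(hours=6)),     # stale
    RunRecord("b", "Running", now - timedelta(minutes=5)),   # still fresh
    RunRecord("c", "Pending", now - timedelta(hours=3)),     # stale
    RunRecord("d", "Completed", now - timedelta(hours=9)),   # final state, ignored
]
stale = find_stale_runs(runs, now, max_age=timedelta(hours=2))
print([r.run_id for r in stale])
```

A check like this only looks at age, so the threshold has to be longer than the longest legitimate run. Otherwise runs that are still executing get selected as well.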

The real problem is here:
Due to a bug on our side, we sometimes delete runs that are still being executed by the agent, which leads to this sequence:

1. Agent pulls a run, state=Running.
2. Scheduled job deletes the run.
3. Agent finishes the execution and tries to set state=Completed. Because the run has been deleted, this fails with an ObjectNotFound error.
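The sequence above can be modelled with a toy queue. This is only a sketch of the suspected behaviour, not Prefect's implementation: `ToyQueue` and its methods are hypothetical, and it assumes that a slot is released only by a successful state update and that deleting a run does not release it.

```python
class ObjectNotFound(Exception):
    """Stand-in for the error the agent sees; not Prefect's own class."""


class ToyQueue:
    """Toy work queue whose slots are only freed by a final state update.

    Assumption: a slot is taken when a run is pulled and released only when
    the run is marked Completed. Deleting a run does NOT release its slot.
    """

    def __init__(self, limit):
        self.limit = limit
        self.slots_in_use = 0
        self.runs = {}  # run_id -> state

    def pull(self, run_id):
        if self.slots_in_use >= self.limit:
            return False  # limit reached, nothing is handed out
        self.slots_in_use += 1
        self.runs[run_id] = "Running"
        return True

    def delete(self, run_id):
        # The cleanup job removes the run; the slot stays counted.
        del self.runs[run_id]

    def complete(self, run_id):
        if run_id not in self.runs:
            raise ObjectNotFound(run_id)
        self.runs[run_id] = "Completed"
        self.slots_in_use -= 1


queue = ToyQueue(limit=4)
for i in range(4):
    run_id = f"run-{i}"
    queue.pull(run_id)          # 1. agent pulls the run
    queue.delete(run_id)        # 2. scheduled job deletes it
    try:
        queue.complete(run_id)  # 3. agent reports Completed
    except ObjectNotFound:
        pass                    # the slot is never released

blocked = not queue.pull("run-4")  # all 4 slots are leaked
queue.limit = 8                    # raising the limit adds 4 free slots
unblocked = queue.pull("run-4")
print(blocked, unblocked, queue.slots_in_use)
```

Under these assumptions, a restart of the agent changes nothing, because the leaked count lives in the queue rather than in the agent.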

After this happens 4 times, the agent stops receiving new runs (presumably due to the queue concurrency limit). Restarting the agent does not help. The only thing that helps is increasing the concurrency limit: 4 → 8. And the next time it happens, we have to increase it again. It looks like the concurrency limit state is cached somewhere in Prefect Cloud, but this is just my guess.
Could you please give your opinion on whether this is a plausible explanation and what could be done to fix it?
Thank you!

Christopher_Boyd · February 22, 2023, 2:56pm

Hi kkamenev,

This seems pretty accurate and is a known issue that is being worked on.
You can see issues here:

github.com/PrefectHQ/prefect

Handle flow run restarts caused by infrastructure events

opened 12:01PM - 10 Oct 22 UTC

tekumara

enhancement needs:design status:roadmap

### First check

- [X] I added a descriptive title to this issue.
- [X] I used the GitHub search to find a similar request and didn't find it.
- [X] I searched the Prefect documentation for this feature.

### Prefect Version

2.x

### Describe the current behavior

The underlying infrastructure may decide to restart the flow, eg: as part of normal operations Kubernetes may wish to reschedule pods running the flow job (eg: because of node scale down or failure). When this happens the flow is stopped and then started again by Kubernetes. However when it starts for the second time it immediately fails with:

```
prefect.engine - Engine execution of flow run 'd53003f4-9f6a-4322-834a-406185693232' aborted by orchestrator: This run cannot transition to the RUNNING state from the RUNNING state.
```

The flow execution aborts here, but the flow is left in the RUNNING state, even though nothing is running.

### Describe the proposed behavior

Prefect flows are resilient to infrastructure restarts, eg: the flow begins again from the last running task and continues instead of aborting. Flows are not left hanging in the RUNNING state.

### Example Use

To reliably run flows in a cloud native environment (eg: Kubernetes).

### Additional context

This also occurs in Prefect 1.x, eg:

```
Beginning Flow run for 'child'
Task 'hello_task': Starting task run...
Task 'hello_task': Finished task run for task with final state: 'Running'
Flow run RUNNING: terminal tasks are incomplete.
```

Here the task and flow try to run but are already in the Running state and so the flow aborts.

and here:

github.com/PrefectHQ/prefect

Task Runs Concurrency slots not released when flow runs in Kubernetes are cancelled

opened 10:15PM - 16 Feb 23 UTC

masonmenges

bug status:triage

### First check

- [X] I added a descriptive title to this issue.
- [X] I used the GitHub search to find a similar issue and didn't find it.
- [X] I searched the Prefect documentation for this issue.
- [X] I checked that this issue is related to Prefect and not one of its dependencies.

### Bug summary

When running a Prefect flow as a Kubernetes job, if the flow run is cancelled while tasks are in a running state, the concurrency slots used by the tasks are not released though the tasks are in a cancelled state.

This is reproducible via the following steps with the code below, with a flow run triggered as a Kubernetes job:

1. Create a concurrency limit in Prefect Cloud
2. Add a task label to use that concurrency limit
3. Trigger the flow and cancel the flow once the tasks that use that concurrency limit are in a “running” state and populate the concurrency limit queue
4. The tasks’ state will change from “running” to “canceled”, but will remain in the concurrency limit queue

KubernetesJob Config:

![k8sjobconfig](https://user-images.githubusercontent.com/76698667/219498682-a8356a48-5384-4863-be3d-f7d5a24f28cf.png)

potentially related but separate issue: https://github.com/PrefectHQ/prefect/issues/7732

### Reproduction

```python3
from prefect import flow, task, get_run_logger
import time

@task(tags=["some_concurrency_tag"])
def log_something(x):
    logger = get_run_logger()
    logger.info(f"this is log number {x}")
    time.sleep(60)

@flow
def smoke_test_flow():
    for x in range(0, 100):
        log_something.submit(x)

if __name__ == "__main__":
    smoke_test_flow()
```

### Error

_No response_

### Versions

```Text
runs from the base docker image prefecthq/prefect:2.8.0-python3.10
```

### Additional context

Cluster config, minus any sensitive information

```
{
  "location": "southcentralus",
  "name": "prefect-k8s-dev",
  "tags": {
    "Application": "",
    "BudgetAlert": "",
    "BusinessGroup": "Data Analytics",
    "CostCode": "",
    "Priority": "",
    "TechnicalContact": "",
    "environment": "dev",
    "prefect": "true"
  },
  "type": "Microsoft.ContainerService/ManagedClusters",
  "properties": {
    "provisioningState": "Succeeded",
    "powerState": { "code": "Running" },
    "kubernetesVersion": "1.24.9",
    "dnsPrefix": "prefect-k8s-dev",
    "agentPoolProfiles": [
      {
        "name": "default",
        "count": 2,
        "vmSize": "Standard_DS2_v2",
        "osDiskSizeGB": 50,
        "osDiskType": "Ephemeral",
        "kubeletDiskType": "OS",
        "maxPods": 110,
        "type": "VirtualMachineScaleSets",
        "enableAutoScaling": false,
        "provisioningState": "Succeeded",
        "powerState": { "code": "Running" },
        "orchestratorVersion": "1.24.9",
        "enableNodePublicIP": false,
        "mode": "System",
        "enableEncryptionAtHost": false,
        "enableUltraSSD": false,
        "osType": "Linux",
        "osSKU": "Ubuntu",
        "nodeImageVersion": "AKSUbuntu-1804gen2containerd-2023.01.20",
        "upgradeSettings": {},
        "enableFIPS": false
      }
    ]
  }
}
```

The core of the issue is as you surmised: if the job is restarted by infrastructure, there are issues in communicating and updating the concurrency and state. Namely, was it a new run with a duplicate ID? Was it a retried infrastructure event?
