We have a work queue with a concurrency limit of 15 flow runs at a time. If there are already runs in the queue, new flow runs are marked as Late. Once a Late flow run finally gets its turn in the queue, it starts and shows CRASHED after 0s. The Job status in k8s is marked as Complete.
In the Kubernetes Job/Pod logs, I see:
Engine execution of flow run 'db723917-9091-43b4-8ac3-904f6792fd29' aborted by orchestrator: Unable to take work pool or work queue concurrency slot for flow run
What I did to try to fix it:
- Increased concurrency from 15 to 20
- Increased the memory and CPU resource limits for the k8s job.
- Cleared long-running tasks
These changes fix the issue temporarily, but it keeps recurring on a daily basis. Super URGENT, please help!
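
In case it helps with diagnosis, here is a rough sketch of the check I can run to see whether stale runs are holding the queue's slots. It is not Prefect API code: it assumes the flow runs have been exported as plain dicts, the field names (`id`, `work_queue`, `state`) and the queue name `my-queue` are made up for illustration, and the idea that only PENDING and RUNNING runs count against the limit is my assumption.

```python
# Assumption: only runs in these state types occupy a concurrency slot.
SLOT_HOLDING_STATES = {"PENDING", "RUNNING"}


def slot_usage(runs, queue_name, limit):
    """Return (used, free, holders) for one work queue.

    `runs` is a list of dicts with hypothetical keys
    'id', 'work_queue' and 'state'.
    """
    holders = [
        r for r in runs
        if r["work_queue"] == queue_name and r["state"] in SLOT_HOLDING_STATES
    ]
    used = len(holders)
    free = max(limit - used, 0)  # never report a negative number of free slots
    return used, free, holders


# Hypothetical exported records, not real data from our deployment.
runs = [
    {"id": "a", "work_queue": "my-queue", "state": "RUNNING"},
    {"id": "b", "work_queue": "my-queue", "state": "PENDING"},
    {"id": "c", "work_queue": "my-queue", "state": "SCHEDULED"},  # e.g. a Late run
    {"id": "d", "work_queue": "other-queue", "state": "RUNNING"},
]

used, free, holders = slot_usage(runs, "my-queue", limit=2)
holder_ids = [r["id"] for r in holders]
print(f"used={used} free={free} holders={holder_ids}")
```

If `used` stays at the limit while the matching Kubernetes Jobs are already Complete, that would suggest the slots are held by runs whose state was never updated, which might explain why raising the limit only helps for a while.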