Hello, my current setup is:
- Prefect v2.10.11
- Infrastructure: kubernetes v1.24.6
- I run 2 agents watching the same work queues for resiliency
I run a simple flow every 10s, linked to a work queue with a concurrency limit of 10. It works fine at first, but eventually flow execution stops because 10 flow runs get stuck in the Pending state and block all the others from running.
The UI, the CLI, and the agent/server logs don't reveal any error. When I exec into the agent pod and inspect one of the stuck runs, this is what I get:
$ prefect flow-run inspect 678db3d9-747a-4d50-9c22-3913d80dc751
05:59:49.195 | DEBUG | prefect.profiles - Using profile 'default'
05:59:49.319 | DEBUG | prefect.client - Connecting to API at https://****/api/
FlowRun(
    id='678db3d9-747a-4d50-9c22-3913d80dc751',
    created=DateTime(2023, 6, 28, 18, 48, 54, 179379, tzinfo=Timezone('+00:00')),
    updated=DateTime(2023, 6, 28, 19, 32, 36, 635283, tzinfo=Timezone('+00:00')),
    name='complex-aardwolf',
    flow_id='085c9387-e60f-4478-8649-95a503905f7e',
    state_id='69c3e7f8-891b-4d73-8672-4ea4fd89e38c',
    deployment_id='390398eb-4a14-4d1b-a372-120eb40a2e9a',
    work_queue_id='8bc6421c-4143-4303-b32a-f951e7954cce',
    work_queue_name='default',
    idempotency_key='scheduled 390398eb-4a14-4d1b-a372-120eb40a2e9a 2023-06-28T21:13:41.259000+02:00',
    tags=['auto-scheduled'],
    state_type=StateType.PENDING,
    state_name='Pending',
    expected_start_time=DateTime(2023, 6, 28, 19, 13, 41, 259000, tzinfo=Timezone('+00:00')),
    estimated_start_time_delta=datetime.timedelta(seconds=38768, microseconds=111511),
    auto_scheduled=True,
    infrastructure_document_id='8e418b68-d957-46ef-acc5-8656f769b37d',
    work_pool_id='a005d8d5-0dd0-4f0d-862f-9c57ef74e443',
    work_pool_name='default-agent-pool',
    state=State(
        id='69c3e7f8-891b-4d73-8672-4ea4fd89e38c',
        type=StateType.PENDING,
        name='Pending',
        timestamp=DateTime(2023, 6, 28, 19, 32, 36, 633732, tzinfo=Timezone('+00:00')),
        state_details=StateDetails(
            flow_run_id='678db3d9-747a-4d50-9c22-3913d80dc751',
            scheduled_time=DateTime(2023, 6, 28, 19, 13, 41, 259000, tzinfo=Timezone('+00:00'))
        )
    )
)
If I manually cancel the pending flow runs, everything starts working again until the next occurrence of the problem.
I understand that there is a watcher routine in the server that moves a flow run from Pending to Crashed after a timeout, but it does not seem to work.
Any hints on how to solve this?
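To make clear what behaviour I was expecting, here is a minimal sketch of the kind of timeout check I have in mind. It is not Prefect's actual implementation and does not call the Prefect API: the function name `find_stale_pending`, the dict layout, and the 10-minute timeout are all my own assumptions, and the second run id is made up. Only the first id and its state timestamp come from the output above.

```python
from datetime import datetime, timedelta, timezone


def find_stale_pending(runs, now, timeout=timedelta(minutes=10)):
    """Return the ids of runs that have been Pending for longer than `timeout`.

    `runs` is a list of plain dicts (hypothetical layout, not a Prefect type).
    """
    return [
        run["id"]
        for run in runs
        if run["state_type"] == "PENDING"
        and now - run["state_timestamp"] > timeout
    ]


runs = [
    {
        # Values taken from the `prefect flow-run inspect` output above.
        "id": "678db3d9-747a-4d50-9c22-3913d80dc751",
        "state_type": "PENDING",
        "state_timestamp": datetime(2023, 6, 28, 19, 32, 36, tzinfo=timezone.utc),
    },
    {
        # Made-up healthy run, only here for contrast.
        "id": "hypothetical-running-run",
        "state_type": "RUNNING",
        "state_timestamp": datetime(2023, 6, 29, 5, 59, 40, tzinfo=timezone.utc),
    },
]

# Approximate time at which I ran the inspect command (assumed to be UTC).
now = datetime(2023, 6, 29, 5, 59, 49, tzinfo=timezone.utc)

stale = find_stale_pending(runs, now)
print(stale)
```

With a check like this, the run above would be flagged, since it has been Pending for more than ten hours. Those are the runs I currently have to cancel by hand to free the work queue slots.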
Thanks in advance.