Prefect flows failing due to PrefectHTTPStatusError when number of flows is increased

Hi,
I am using Prefect 2.14.15 with my own Prefect Orion server and multiple work pools deployed on different servers.
Flows running on my machine are failing with PrefectHTTPStatusError on some of the internal Prefect API endpoints, such as /flow_run, /flows, and /set_state.

Error Traceback:
Encountered exception during execution:
Traceback (most recent call last):
File "/home/azureuser/.local/lib/python3.10/site-packages/prefect/engine.py", line 849, in orchestrate_flow_run
result = await flow_call.aresult()
File "/home/azureuser/.local/lib/python3.10/site-packages/prefect/_internal/concurrency/calls.py", line 293, in aresult
return await asyncio.wrap_future(self.future)
File "/home/azureuser/.local/lib/python3.10/site-packages/prefect/_internal/concurrency/calls.py", line 318, in _run_sync
result = self.fn(*self.args, **self.kwargs)
File "/var/www/apt/apt/prefect_flows/workflow.py", line 20, in execute_workflow
return state.result()
File "/home/azureuser/.local/lib/python3.10/site-packages/prefect/client/schemas/objects.py", line 212, in result
return get_state_result(self, raise_on_failure=raise_on_failure, fetch=fetch)
File "/home/azureuser/.local/lib/python3.10/site-packages/prefect/states.py", line 71, in get_state_result
return _get_state_result(state, raise_on_failure=raise_on_failure)
File "/home/azureuser/.local/lib/python3.10/site-packages/prefect/utilities/asyncutils.py", line 255, in coroutine_wrapper
return call()
File "/home/azureuser/.local/lib/python3.10/site-packages/prefect/_internal/concurrency/calls.py", line 398, in __call__
return self.result()
File "/home/azureuser/.local/lib/python3.10/site-packages/prefect/_internal/concurrency/calls.py", line 284, in result
return self.future.result(timeout=timeout)
File "/home/azureuser/.local/lib/python3.10/site-packages/prefect/_internal/concurrency/calls.py", line 168, in result
return self.__get_result()
File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
raise self._exception
File "/home/azureuser/.local/lib/python3.10/site-packages/prefect/_internal/concurrency/calls.py", line 355, in _run_async
result = await coro
File "/home/azureuser/.local/lib/python3.10/site-packages/prefect/states.py", line 91, in _get_state_result
raise await get_state_exception(state)
File "/home/azureuser/.local/lib/python3.10/site-packages/prefect/engine.py", line 849, in orchestrate_flow_run
result = await flow_call.aresult()
File "/home/azureuser/.local/lib/python3.10/site-packages/prefect/_internal/concurrency/calls.py", line 293, in aresult
return await asyncio.wrap_future(self.future)
File "/home/azureuser/.local/lib/python3.10/site-packages/prefect/_internal/concurrency/calls.py", line 318, in _run_sync
result = self.fn(*self.args, **self.kwargs)
File "/var/www/apt/apt/prefect_flows/workflow.py", line 28, in run_node_flow
res = run_workflow(workflow, tenant, payload_data)
File "/var/www/apt/apt/prefect_flows/workflow.py", line 48, in run_workflow
process_conditional_task(task, tenant, result_of_tasks)
File "/var/www/apt/apt/prefect_flows/workflow.py", line 95, in process_conditional_task
filtered_data = check_condition(task.get("condition"), data_to_filter)
File "/home/azureuser/.local/lib/python3.10/site-packages/prefect/tasks.py", line 569, in __call__
return enter_task_run_engine(
File "/home/azureuser/.local/lib/python3.10/site-packages/prefect/engine.py", line 1382, in enter_task_run_engine
return from_sync.wait_for_call_in_loop_thread(begin_run)
File "/home/azureuser/.local/lib/python3.10/site-packages/prefect/_internal/concurrency/api.py", line 243, in wait_for_call_in_loop_thread
return call.result()
File "/home/azureuser/.local/lib/python3.10/site-packages/prefect/_internal/concurrency/calls.py", line 284, in result
return self.future.result(timeout=timeout)
File "/home/azureuser/.local/lib/python3.10/site-packages/prefect/_internal/concurrency/calls.py", line 168, in result
return self.__get_result()
File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
raise self._exception
File "/home/azureuser/.local/lib/python3.10/site-packages/prefect/_internal/concurrency/calls.py", line 355, in _run_async
result = await coro
File "/home/azureuser/.local/lib/python3.10/site-packages/prefect/engine.py", line 1550, in get_task_call_return_value
return await future._result()
File "/home/azureuser/.local/lib/python3.10/site-packages/prefect/futures.py", line 237, in _result
return await final_state.result(raise_on_failure=raise_on_failure, fetch=True)
File "/home/azureuser/.local/lib/python3.10/site-packages/prefect/states.py", line 91, in _get_state_result
raise await get_state_exception(state)
File "/home/azureuser/.local/lib/python3.10/site-packages/prefect/task_runners.py", line 231, in submit
result = await call()
File "/home/azureuser/.local/lib/python3.10/site-packages/prefect/engine.py", line 1780, in begin_task_run
state = await orchestrate_task_run(
File "/home/azureuser/.local/lib/python3.10/site-packages/prefect/engine.py", line 2023, in orchestrate_task_run
task_run = await client.read_task_run(task_run.id)
File "/home/azureuser/.local/lib/python3.10/site-packages/prefect/client/orchestration.py", line 2031, in read_task_run
response = await self._client.get(f"/task_runs/{task_run_id}")
File "/home/azureuser/.local/lib/python3.10/site-packages/httpx/_client.py", line 1757, in get
return await self.request(
File "/home/azureuser/.local/lib/python3.10/site-packages/httpx/_client.py", line 1530, in request
return await self.send(request, auth=auth, follow_redirects=follow_redirects)
File "/home/azureuser/.local/lib/python3.10/site-packages/prefect/client/base.py", line 312, in send
response.raise_for_status()
File "/home/azureuser/.local/lib/python3.10/site-packages/prefect/client/base.py", line 164, in raise_for_status
raise PrefectHTTPStatusError.from_httpx_error(exc) from exc.__cause__
prefect.exceptions.PrefectHTTPStatusError: Server error '500 Internal Server Error' for url 'https://prefect-server/api/task_runs/<task_id>'
Response: {'exception_message': 'Internal Server Error'}
For more information check: 500 Internal Server Error - HTTP | MDN

Similar errors occur for other API endpoints:
prefect.exceptions.PrefectHTTPStatusError: Server error '500 Internal Server Error' for url '/api/task_runs/'
Response: {'exception_message': 'Internal Server Error'}

prefect.exceptions.PrefectHTTPStatusError: Server error '500 Internal Server Error' for url '/api/task_runs//set_state'
Response: {'exception_message': 'Internal Server Error'}

prefect.exceptions.PrefectHTTPStatusError: Server error '500 Internal Server Error' for url '/api/flows/'
Response: {'exception_message': 'Internal Server Error'}
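To check whether one endpoint accounts for most of the 500s, the error lines could be tallied per route. The following is only a rough sketch: the function name `tally_failing_endpoints` and the `<id>` placeholder are made up for illustration, and it assumes the worker logs contain the `PrefectHTTPStatusError` message in the same shape as shown above.

```python
import re
from collections import Counter

# Captures the status code and URL from a PrefectHTTPStatusError message.
# Accepts straight or curly quotes, since forum/log rendering may vary.
ERROR_RE = re.compile(
    r"Server error ['‘](\d{3}) [^'’]*['’] for url ['‘]([^'’]*)['’]"
)
# UUID path segments are collapsed so runs of the same route group together.
UUID_RE = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)


def tally_failing_endpoints(log_text):
    """Count error lines per (status code, normalised path). Hypothetical helper."""
    counts = Counter()
    for status, url in ERROR_RE.findall(log_text):
        path = re.sub(r"^https?://[^/]+", "", url)  # drop scheme and host
        path = UUID_RE.sub("<id>", path)            # drop run-specific IDs
        counts[(status, path)] += 1
    return counts
```

Calling `tally_failing_endpoints(open("worker.log").read()).most_common()` (with `worker.log` standing in for wherever the agent output is captured) would list the routes by failure count, which may help show whether reads, creates, or state changes fail most often.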

I am not able to figure out the root cause of this issue.

My infra setup:

  • prefect-orion server: 16 GB RAM, 4 cores

Work pools:

  • default-pool: same machine as the prefect-orion server | 1 agent
  • pool1: server2, 16 GB RAM, 4 cores | 5 agents
  • pool2: same server as pool1 | 7 agents

Pool1 has a daily load of nearly 30k flows, of which ~5% fail due to these errors.
Pool2 has a daily load of nearly 10k flows, of which ~8% fail due to these errors.
There have been around 350k flow runs in total so far.
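For scale, those percentages translate into rough daily failure counts. This is only arithmetic on the approximate figures quoted above; the helper name `failed_per_day` is made up for illustration.

```python
# Daily flow counts and failure rates as quoted above (approximate figures).
pools = {
    "pool1": (30_000, 0.05),
    "pool2": (10_000, 0.08),
}


def failed_per_day(flows, rate):
    """Approximate number of failed flow runs per day."""
    return round(flows * rate)


failures = {name: failed_per_day(flows, rate) for name, (flows, rate) in pools.items()}
total_flows = sum(flows for flows, _ in pools.values())
total_failed = sum(failures.values())
overall_rate = total_failed / total_flows

print(failures)                 # per-pool failed runs per day
print(total_failed)             # failed runs per day across both pools
print(f"{overall_rate:.2%}")    # combined failure rate
```

So the two pools together lose on the order of a couple of thousand runs per day, which is too many to retry by hand.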

All the machines are in the same VPC. I am using a PostgreSQL DB connected to my Prefect Orion server, with some tables (like log and flow_run_state) nearing 700k records.

Earlier I thought the large amount of data in the DB was causing timeouts, but I have removed nearly 30% of my old data from the DB and the error still occurs.

Does anyone have any idea what might be causing this?

Same issue here. I'm running at a smaller scale (3K flow runs/day) but get the same error. Any resolutions? I've also tried cleaning out the DB. I suspect some of it may have to do with traffic on the network card, but I'm hovering around 32 Mbps of in/out traffic, so that doesn't seem to be the main problem.