Flow could not be retrieved from deployment - intermittent error

Hello everyone,

I’m facing a bizarre error in my deployments.
It looks like the one reported here, but not quite…

What I’m doing is the following:

  • we deploy 3 services on k8s: prefect-orion, an ETL service (let’s call it svc-A), and a fraud detection/data aggregation service (call it svc-B).
  • both svc-A and svc-B are scheduled deployments that rely on prefect-orion
  • when a service is (re)deployed on our k8s cluster with ArgoCD, it cancels any running workflows for that deployment, submits the new version of the scheduled deployment (using deployments.Deployment.build_from_flow(…)), and runs one flow straight away (using deployments.run_deployment(…))
  • until svc-B appeared, everything was fine. We are now testing svc-B before rolling it into production, and that’s where the problem arises

The problem is:

  • sometimes, everything goes fine and as expected: the two services have their own schedule and share the same work queue;
  • other times svc-B just fails with an error:
Flow could not be retrieved from deployment.
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/prefect/engine.py", line 262, in retrieve_flow_then_begin_flow_run
    flow = await load_flow_from_flow_run(flow_run, client=client)
  File "/usr/local/lib/python3.10/dist-packages/prefect/client/utilities.py", line 47, in with_injected_client
    return await fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/prefect/deployments.py", line 166, in load_flow_from_flow_run
    await storage_block.get_directory(from_path=deployment.path, local_path=".")
  File "/usr/local/lib/python3.10/dist-packages/prefect/filesystems.py", line 143, in get_directory
    copytree(from_path, local_path, dirs_exist_ok=True)
  File "/usr/lib/python3.10/shutil.py", line 556, in copytree
    with os.scandir(src) as itr:
FileNotFoundError: [Errno 2] No such file or directory: '/usr/local/lib/python3.10/dist-packages/mm_datascripts'

Then, very likely, the next scheduled flow run for svc-B will work as expected. In the meantime, nothing has changed on either the application or the infrastructure side.

The intermittent nature of this issue puzzles me. Has anyone seen this behavior before?

Cheers,

Hugo

Hi hugocatlas -

Simplifying this problem a little bit, I think the construct and design of your system are not as relevant as the core problem -

Flow could not be retrieved from deployment.
FileNotFoundError: [Errno 2] No such file or directory: '/usr/local/lib/python3.10/dist-packages/mm_datascripts'

How and when is your deployment created, and applied?
Further, when this does occur:
What image is being used?
What do your sys.path and PYTHONPATH look like?
How does mm_datascripts get to this path?
What is the path and entrypoint for your deployment registration?
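
To help answer the questions above, here is a minimal diagnostic sketch that could be run inside the container where the flow run fails. It only uses the Python standard library; the helper name describe_environment is hypothetical, and mm_datascripts is simply the package named in the traceback.

```python
import importlib.util
import os
import pprint
import sys


def describe_environment(package_name: str) -> dict:
    """Collect the import-related facts for one top-level package."""
    # find_spec returns None when a top-level package cannot be found.
    spec = importlib.util.find_spec(package_name)
    locations = list(spec.submodule_search_locations or []) if spec else []
    return {
        "sys_path": list(sys.path),
        "pythonpath": os.environ.get("PYTHONPATH", ""),
        "importable": spec is not None,
        "locations": locations,
        # Does each directory the package claims to live in actually exist?
        "locations_exist": {loc: os.path.isdir(loc) for loc in locations},
    }


if __name__ == "__main__":
    pprint.pprint(describe_environment("mm_datascripts"))
```

Comparing this output between a run that succeeds and one that fails should show whether the two runs execute in the same image.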


Hi Christopher,

Thank you for replying.
As a matter of fact, the problem seems to have been solved yesterday, after I found this bug report: BUG: If multiple agents have same type + labels, only one shows up on "Agents" page · Issue #384 · PrefectHQ/ui · GitHub, and more particularly one of the comments on that issue.

The problem seemed to be that both services start agents on the same work queues (named after our clients). As a result, when flow runs are created from deployments, orion seems to send the run sometimes to one agent and sometimes to the other.
I solved it by removing this behavior: each service now starts its agents with unique work queue names, and so far everything seems to be working fine.
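
To illustrate the idea, here is a toy model in plain Python. It is not Prefect's API: the Agent class, the queue_for naming helper, the queue name client-x and the package name etl_scripts are all made up for the example, and it assumes that svc-A's image does not contain mm_datascripts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Agent:
    name: str
    queues: frozenset    # work queues this agent polls
    packages: frozenset  # packages installed in this agent's image


def agents_that_may_fail(agents, queue, required_package):
    """Names of agents polling `queue` whose image lacks `required_package`."""
    return [
        a.name
        for a in agents
        if queue in a.queues and required_package not in a.packages
    ]


def queue_for(service, client):
    """Hypothetical naming scheme: one work queue per (service, client) pair."""
    return f"{service}-{client}"


# Before: both services poll the same client-named queue.
shared = [
    Agent("svc-A-agent", frozenset({"client-x"}), frozenset({"etl_scripts"})),
    Agent("svc-B-agent", frozenset({"client-x"}), frozenset({"mm_datascripts"})),
]

# After: each service polls its own queue.
unique = [
    Agent("svc-A-agent", frozenset({queue_for("svc-A", "client-x")}), frozenset({"etl_scripts"})),
    Agent("svc-B-agent", frozenset({queue_for("svc-B", "client-x")}), frozenset({"mm_datascripts"})),
]

print(agents_that_may_fail(shared, "client-x", "mm_datascripts"))
# ['svc-A-agent']
print(agents_that_may_fail(unique, queue_for("svc-B", "client-x"), "mm_datascripts"))
# []
```

With a shared queue, a svc-B run can land on an agent that cannot find the package, which would match the intermittent FileNotFoundError; with one queue per service, that cannot happen.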

Alternatively, being able to name the agents, as that comment suggests, would presumably also solve the problem for good, but the documentation no longer mentions this feature.

Cheers,

Hugo
