Hello everyone,
I’m facing a bizarre error in my deployments.
It looks like the one reported here, but not quite…
What I’m doing is the following:
- we deploy 3 services on k8s: prefect-orion, an ETL service (let’s call it svc-A), and a fraud detection/data aggregation service (call it svc-B).
- both svc-A and svc-B are scheduled deployments that rely on prefect-orion
- when a service is (re)deployed on our k8s cluster with ArgoCD, it cancels any running flow runs for that deployment, submits the new version of the scheduled deployment (using deployments.Deployment.build_from_flow(…)), and runs one flow straightaway (using deployments.run_deployment(…))
- until svc-B appeared, everything was fine. Now we’re running tests to roll this service into production, and that’s where the problem arises
The problem is:
- sometimes, everything goes fine and as expected: the two services have their own schedule and share the same work queue;
- other times svc-B just fails with an error:
Flow could not be retrieved from deployment.
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/prefect/engine.py", line 262, in retrieve_flow_then_begin_flow_run
    flow = await load_flow_from_flow_run(flow_run, client=client)
  File "/usr/local/lib/python3.10/dist-packages/prefect/client/utilities.py", line 47, in with_injected_client
    return await fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/prefect/deployments.py", line 166, in load_flow_from_flow_run
    await storage_block.get_directory(from_path=deployment.path, local_path=".")
  File "/usr/local/lib/python3.10/dist-packages/prefect/filesystems.py", line 143, in get_directory
    copytree(from_path, local_path, dirs_exist_ok=True)
  File "/usr/lib/python3.10/shutil.py", line 556, in copytree
    with os.scandir(src) as itr:
FileNotFoundError: [Errno 2] No such file or directory: '/usr/local/lib/python3.10/dist-packages/mm_datascripts'
Then, very likely, the next scheduled flow run for svc-B works as expected. In the meantime, nothing has changed on either the application or the infrastructure side.
The intermittent nature of this issue is what puzzles me. Has anyone seen this behavior before?
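In case it helps narrow things down, here is a small check that could be run inside each container that picks up flow runs from the shared work queue, to see whether the directory from the traceback exists there. It is only a sketch: `check_flow_path` is a helper name made up for this post, and it assumes the package is importable as `mm_datascripts`, going by the path in the error.

```python
import importlib.util
import os
import socket


def check_flow_path(path: str, package: str) -> dict:
    """Report whether `path` exists on this host and whether `package` resolves.

    `path` is the directory the flow code would be copied from (deployment.path
    in the traceback); `package` is a top-level package name (an assumption).
    """
    # find_spec returns None for a top-level name that cannot be found.
    spec = importlib.util.find_spec(package)
    return {
        "host": socket.gethostname(),
        "path": path,
        "path_exists": os.path.isdir(path),
        "package_importable": spec is not None,
        "package_origin": spec.origin if spec is not None else None,
    }


# Path taken verbatim from the FileNotFoundError above.
print(
    check_flow_path(
        "/usr/local/lib/python3.10/dist-packages/mm_datascripts",
        "mm_datascripts",
    )
)
```

If `path_exists` differs between the svc-A and svc-B containers, that would at least be consistent with the failure depending on which one picks up the run from the shared queue, though I have not confirmed that this is the cause.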
Cheers,
Hugo