Error "Infrastructure returned without reporting flow run" when using custom Infrastructure Block

yaron · February 26, 2023, 3:41pm

Hi,

As I continue to break through walls on the way to a working custom Infrastructure Block for a Render job, I've now reached the point where I can actually run a flow inside a Render instance. The flow finishes in a success state, but then, weirdly enough, the agent says: “Infrastructure returned without reporting flow run”.

Let me explain what I currently have.
This is the Infrastructure block:

The render_start_job() function calls the Render REST API and tells it to start a new instance, with a “start” command that contains the current runId:

So, looking at the logs of the agent, everything actually looks and runs nicely.
The agent is getting the flow, and inside the Infrastructure block I am logging the current runId, starting a new Render instance and polling (blocking) until it’s finished:

So the Render instance is running and we are inside a loop, waiting and polling for it to finish. Then, 5 minutes later, when the instance has finished doing its stuff:

From the image above you can see my log, “job finished successfully”, and right after it the “Infrastructure returned without…” error. But if you look at my code, just after my “success” log I return a successful InfrastructureResult:

So the question is: why is the agent yelling about me not returning an InfrastructureResult?

(As always, I’ve tried looking at the GCP and AWS repos regarding their custom infra block and tried to find some line about reporting status to the agent that I didn’t make, but I could not find anything.)

Also, this is the flow URL:
https://app.prefect.cloud/account/8eed9803-456a-4126-a7f7-074aa44aa1b2/workspace/8ff919f1-a2c3-4660-9ab5-66b57758d46d/flow-runs/flow-run/ed5fb5b3-b7cf-446b-88b2-26e47bb5b6ed

yaron · February 26, 2023, 3:44pm

@ryan_peden @desertaxle @EmilRex
Would love to get your input on this, as you’ve helped me break the previous wall.
Thanks!

ryan_peden · February 26, 2023, 8:06pm

Hi Yaron! Your Render infrastructure block is looking good.

The error message you see appears because the agent expects you to call the task_status.started callback to let it know the flow has started running. The Azure Container Instances block and the ECS block both do this in their run methods.

The agent wants this because the callback gives you a chance to provide a unique ID that can be used to cancel the flow if the user requests cancellation. When that happens, the agent will call your block’s kill method and pass the unique ID in as the first parameter. The ECS block’s kill method is an example of how this can be handled.

In your infrastructure block, you could call the callback with something like task_status.started(render_job_id).

I see you noted that render_start_job only returns after the Render instance finishes running. So perhaps you could pass task_status into render_start_job as a parameter, and then call it with the Render job ID after you create the job, but before you start polling to check for job completion.
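As a rough sketch of that ordering (not your actual block), the snippet below uses only the standard library so it can run on its own. Everything in it is a hypothetical stand-in: RecordingTaskStatus plays the role of the task_status object the agent passes in, FakeRenderClient stands in for your Render API wrapper, RenderJobResult for the InfrastructureResult you return, and run_render_job / kill_render_job for your block's run and kill methods.

```python
import asyncio
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RenderJobResult:
    # Stand-in for the InfrastructureResult the block returns.
    identifier: str
    status_code: int


class RecordingTaskStatus:
    # Stand-in for the task_status object the agent passes to run().
    def __init__(self) -> None:
        self.reported_id: Optional[str] = None

    def started(self, value: Optional[str] = None) -> None:
        self.reported_id = value


@dataclass
class FakeRenderClient:
    # Hypothetical wrapper; a real one would call the Render REST API.
    polls_until_done: int = 3
    cancelled: List[str] = field(default_factory=list)

    async def create_job(self, flow_run_id: str) -> str:
        return f"render-job-for-{flow_run_id}"

    async def is_finished(self, job_id: str) -> bool:
        self.polls_until_done -= 1
        return self.polls_until_done <= 0

    async def cancel_job(self, job_id: str) -> None:
        self.cancelled.append(job_id)


async def run_render_job(client, flow_run_id, task_status=None, poll_seconds=0.01):
    # 1. Create the job first so there is an ID to report.
    job_id = await client.create_job(flow_run_id)

    # 2. Report "started" with that ID BEFORE blocking on the poll loop.
    if task_status is not None:
        task_status.started(job_id)

    # 3. Only now wait for the job to finish.
    while not await client.is_finished(job_id):
        await asyncio.sleep(poll_seconds)

    return RenderJobResult(identifier=job_id, status_code=0)


async def kill_render_job(client, infrastructure_pid):
    # On cancellation, the ID reported in step 2 comes back as the first argument.
    await client.cancel_job(infrastructure_pid)


async def demo():
    client = FakeRenderClient()
    status = RecordingTaskStatus()
    result = await run_render_job(client, "example-run-id", status)
    await kill_render_job(client, status.reported_id)
    return client, status, result


client, status, result = asyncio.run(demo())
print(status.reported_id, result.status_code)
```

The part that matters is step 2: the ID is reported as soon as the job exists, so the agent knows the run has started and has something to hand to kill later.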

I hope this helps, but if you have any other questions, please feel free to post a follow-up message!

yaron · March 6, 2023, 8:58am

@ryan_peden Yes! It worked. So I believe everything is working now. One small thing that still bothers me is this warning when the instance starts working:

/usr/local/lib/python3.7/runpy.py:125: RuntimeWarning: 'prefect.engine' found in sys.modules after import of package 'prefect', but prior to execution of 'prefect.engine'; this may result in unpredictable behaviour

Any ideas about what could be the root cause?

ryan_peden · March 10, 2023, 12:32am

Nothing specific yet, but I’ve seen the same warning sometimes appear when running flows in other containerized environments like Azure Container Instances and it’s on our list of issues to address.

For what it’s worth, this should not actually cause unpredictable behavior in prefect.engine, but the warning does look ominous so I would like to make it disappear.

yaron · March 25, 2023, 3:12pm

@ryan_peden Hi Ryan, just as I thought I had a working end-to-end solution, I hit another weird wall.

So this flow runs great:

And this is the output log:

But when I try to import some of my custom code like so:

Then I get this error about permission when accessing the repo/code:

This is strange as in the first example the flow clearly has access to the code (we see logs and the flow runs to success).

Also, this is not specific to the piece of code that deals with Snowflake. Any custom code I import causes the error you see here.

Any direction as to what might be the problem here?

Sorry for that (-: Already solved this issue. It was just some missing PYTHONPATH.

yaron · March 27, 2023, 7:53am

OK, so another update. The permission issue is back. It seems flaky and I can’t quite put my finger on it.

ryan_peden · March 27, 2023, 11:15am

That’s odd. Are you able to try it with Python 3.8 or above?

On 3.7 we use a custom version of copytree and it looks like that’s where the problem is starting. It’s hard to tell if our code is causing the problem, or if it’s something external. If the problem disappears on 3.8, that would help narrow things down.
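Another way to narrow it down, offered only as a sketch: copytree reports "Permission denied" for a destination path when a file already exists there and the process is not allowed to overwrite it. Assuming (and this is a guess, not something confirmed here) that the working directory on the instance still holds files from an earlier download, a helper like the hypothetical find_read_only_files below would list regular files that lack the owner-write bit. Git normally writes its pack files read-only, so a leftover .git directory would be a natural place to look.

```python
import os
import stat
import tempfile
from pathlib import Path
from typing import List


def find_read_only_files(root: str) -> List[str]:
    """Return paths (relative to root) of regular files without the owner-write bit."""
    read_only = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            mode = os.lstat(full).st_mode
            if stat.S_ISREG(mode) and not mode & stat.S_IWUSR:
                read_only.append(os.path.relpath(full, root))
    return sorted(read_only)


# Demo on a throwaway directory that mimics a checkout with a read-only pack file.
with tempfile.TemporaryDirectory() as workdir:
    pack_dir = Path(workdir, ".git", "objects", "pack")
    pack_dir.mkdir(parents=True)
    pack_file = pack_dir / "example.pack"
    pack_file.write_bytes(b"")
    pack_file.chmod(0o444)  # read-only, like a git pack file

    Path(workdir, "flow.py").write_text("print('hello')\n")  # ordinary writable file

    found = find_read_only_files(workdir)
    pack_file.chmod(0o644)  # restore so cleanup works on every platform

print(found)
```

Running find_read_only_files(".") in the flow's working directory just before the download step would show whether read-only leftovers are present on the runs that fail.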

yaron · March 27, 2023, 5:01pm

OK, got it. I’ll check it on 3.8 and report back.

yaron · March 27, 2023, 5:34pm

@ryan_peden Same error with Python 3.8:

/opt/render/project/python/Python-3.8.16/lib/python3.8/runpy.py:127: RuntimeWarning: 'prefect.engine' found in sys.modules after import of package 'prefect', but prior to execution of 'prefect.engine'; this may result in unpredictable behaviour
Mar 27 08:25:16 PM    warn(RuntimeWarning(msg))
Mar 27 08:25:16 PM  17:25:16.709 | INFO    | Flow run 'enlightened-marten' - Downloading flow code from storage at None
Mar 27 08:25:17 PM  17:25:17.868 | ERROR   | Flow run 'enlightened-marten' - Flow could not be retrieved from deployment.
Mar 27 08:25:17 PM  Traceback (most recent call last):
Mar 27 08:25:17 PM    File "/opt/render/project/src/.venv/lib/python3.8/site-packages/prefect/engine.py", line 274, in retrieve_flow_then_begin_flow_run
Mar 27 08:25:17 PM      flow = await load_flow_from_flow_run(flow_run, client=client)
Mar 27 08:25:17 PM    File "/opt/render/project/src/.venv/lib/python3.8/site-packages/prefect/client/utilities.py", line 47, in with_injected_client
Mar 27 08:25:17 PM      return await fn(*args, **kwargs)
Mar 27 08:25:17 PM    File "/opt/render/project/src/.venv/lib/python3.8/site-packages/prefect/deployments.py", line 194, in load_flow_from_flow_run
Mar 27 08:25:17 PM      await storage_block.get_directory(from_path=deployment.path, local_path=".")
Mar 27 08:25:17 PM    File "/opt/render/project/src/.venv/lib/python3.8/site-packages/prefect/filesystems.py", line 966, in get_directory
Mar 27 08:25:17 PM      copytree(
Mar 27 08:25:17 PM    File "/opt/render/project/python/Python-3.8.16/lib/python3.8/shutil.py", line 557, in copytree
Mar 27 08:25:17 PM      return _copytree(entries=entries, src=src, dst=dst, symlinks=symlinks,
Mar 27 08:25:17 PM    File "/opt/render/project/python/Python-3.8.16/lib/python3.8/shutil.py", line 513, in _copytree
Mar 27 08:25:17 PM      raise Error(errors)
Mar 27 08:25:17 PM  shutil.Error: [('/tmp/tmpg09rf7_lprefect/.git/objects/pack/pack-0bfd76b4389ea94e4d681cb18045eac511b6db34.pack', './.git/objects/pack/pack-0bfd76b4389ea94e4d681cb18045eac511b6db34.pack', "[Errno 13] Permission denied: './.git/objects/pack/pack-0bfd76b4389ea94e4d681cb18045eac511b6db34.pack'"), ('/tmp/tmpg09rf7_lprefect/.git/objects/pack/pack-0bfd76b4389ea94e4d681cb18045eac511b6db34.idx', './.git/objects/pack/pack-0bfd76b4389ea94e4d681cb18045eac511b6db34.idx', "[Errno 13] Permission denied: './.git/objects/pack/pack-0bfd76b4389ea94e4d681cb18045eac511b6db34.idx'")]

What is interesting, though, is that I figured out the exact line that causes the error.

So this code CRASHES:

But if I remove that last line, it works:

And I’ve checked, it’s nothing specific to Snowflake. Even a print() will cause this.

Any ideas?
