Flow is failing with an error message "No heartbeat detected from the remote task"

anna_geller · January 26, 2022, 10:13pm

Problem description

Prefect flows have heartbeats that signal to Prefect Cloud that your flow is alive. If Prefect didn’t have heartbeats, flows that lose communication and die would be shown permanently with a Running state in the UI.

In most cases we have seen, “no heartbeat detected” is a sign of memory issues, i.e. your flow run ran out of memory. It can also happen with long-running external jobs such as Kubernetes or Databricks jobs.

What is a “heartbeat”?

As part of executing a flow run, flow runners spin up either a subprocess or background thread to periodically record “heartbeats” to the Prefect Cloud API. The goal of doing so is to detect when something has interrupted the execution of a run but has not been able to report to the API (e.g. the Kubernetes container ran out of memory).

Subprocess heartbeats are more likely to be killed because they take up more CPU and memory. When encountering possible resource issues, switching to threaded heartbeats is a good debugging step.
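To make the idea of a threaded heartbeat concrete, here is a minimal sketch using only the Python standard library. It is not Prefect's actual implementation: the `start_heartbeat` helper, the `send` callback and the interval are hypothetical names chosen for illustration. It only shows the general pattern of a background thread that reports periodically while the main work runs.

```python
import threading
import time
from typing import Callable


def start_heartbeat(send: Callable[[], None], interval: float) -> Callable[[], None]:
    """Call `send` every `interval` seconds on a daemon thread; return a stop function.

    Illustrative only: `send` stands in for "record a heartbeat with the API".
    """
    stop_event = threading.Event()

    def loop() -> None:
        # Event.wait returns False on timeout (keep beating) and True once stop is requested.
        while not stop_event.wait(interval):
            try:
                send()
            except Exception:
                # A failed heartbeat should not crash the work it is reporting on.
                pass

    thread = threading.Thread(target=loop, name="heartbeat", daemon=True)
    thread.start()

    def stop() -> None:
        stop_event.set()
        thread.join()

    return stop


beats = []
stop = start_heartbeat(lambda: beats.append(time.monotonic()), interval=0.05)
time.sleep(0.3)  # stand-in for the real work of the flow run
stop()
print(f"recorded {len(beats)} heartbeats")
```

Because the thread lives inside the same process as the work, it adds no extra process for the operating system to schedule or kill, which is one plausible reason the threaded mode is lighter than a separate subprocess.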

Possible solutions

1) Setting heartbeat mode to threads

In Prefect 0.15.4 and above, additional logging propagates the underlying error whenever this error occurs.

You can configure heartbeats to use threads instead of processes.

from prefect.run_configs import UniversalRun

# `flow` is your existing Flow object
flow.run_config = UniversalRun(env={"PREFECT__CLOUD__HEARTBEAT_MODE": "thread"})

This has proven to be more stable for a lot of users.

2) Moving the failing task to a subflow and turning off flow heartbeat

In the event that the flow continues to fail after switching to using threads, some users have had success in making the failing task a subflow, and then turning off heartbeats for the subflow.

mutation {
  disable_flow_heartbeat(input: { flow_id: "your-subflows-flow-id-here" }) {
    success
  }
}
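If you prefer to send this mutation from Python rather than from the Interactive API, the following is a rough sketch. It assumes Prefect 1.x with an authenticated `prefect.Client`, whose `graphql` method accepts a query string. The two helper function names are hypothetical, and the flow ID is the same placeholder as above.

```python
import json


def build_disable_heartbeat_mutation(flow_id: str) -> str:
    """Return the disable_flow_heartbeat mutation for the given flow ID."""
    # json.dumps produces a double-quoted, escaped string for the ID.
    return (
        "mutation {\n"
        f"  disable_flow_heartbeat(input: {{ flow_id: {json.dumps(flow_id)} }}) {{\n"
        "    success\n"
        "  }\n"
        "}"
    )


def disable_flow_heartbeat(flow_id: str):
    """Send the mutation to the API (assumes Prefect 1.x and valid Cloud credentials)."""
    # Imported lazily so the builder above can be used without Prefect installed.
    from prefect import Client

    return Client().graphql(build_disable_heartbeat_mutation(flow_id))


print(build_disable_heartbeat_mutation("your-subflows-flow-id-here"))
```

Calling `disable_flow_heartbeat("...")` with the real ID of your subflow would then perform the request; the `print` call only shows the query text that would be sent.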

Flow heartbeat issues with externally executed jobs (e.g. a Kubernetes job)

Tracking flow heartbeats in a hybrid execution model is challenging for several reasons:

- The Kubernetes job for the flow run is a fully independent containerized application that communicates with the Prefect Cloud API purely via outbound API calls, i.e. your Kubernetes job must send a request to our API to perform any state change.
- Heartbeats allow us to track whether the flow and task runs are still in progress, but we have no inbound access to your pod to check its state. If something goes wrong within the pod (e.g. someone deletes the pod, or the pod crashes or hits network issues), heartbeats are the only way for Prefect to detect the infrastructure issue without access to your infrastructure.

It can be especially challenging if within this flow run pod you are spinning up some external resources such as:

- subprocesses triggered with a ShellTask,
- long-running external jobs such as DatabricksSubmitRun.

These use cases double the challenge: not only is your flow run executed in a completely isolated environment (the container in your Kubernetes job pod), but within that container you also spin up a subprocess, e.g. for the ShellTask. If something happens in this subprocess, there is one more layer to get through before the issue surfaces.

We could, for example, mark task runs or flow runs as failed and cancel the flow run as soon as we miss flow heartbeats for a couple of seconds. However, this would risk cancelling a computation that is working properly and only had a transient issue with the heartbeat itself or with sending a state update to the API.
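The trade-off can be illustrated with a small sketch. This is not how Prefect Cloud decides that a run has died; the `heartbeat_expired` function and the grace periods below are hypothetical and only show why a short threshold is risky.

```python
from datetime import datetime, timedelta, timezone


def heartbeat_expired(last_heartbeat: datetime, now: datetime, grace: timedelta) -> bool:
    """Return True if more than `grace` has passed since the last heartbeat."""
    return now - last_heartbeat > grace


now = datetime(2022, 1, 26, 22, 13, tzinfo=timezone.utc)
# Suppose a transient network blip delayed the heartbeat by 20 seconds.
last_heartbeat = now - timedelta(seconds=20)

# An aggressive threshold would declare a healthy run dead...
print(heartbeat_expired(last_heartbeat, now, grace=timedelta(seconds=5)))
# ...while a more tolerant one rides out the transient issue.
print(heartbeat_expired(last_heartbeat, now, grace=timedelta(minutes=2)))
```

A longer grace period avoids cancelling healthy runs, at the cost of runs that really have died staying in a Running state for longer.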

This description may not be 100% technically accurate, but it should make the challenge clearer and show why flow heartbeat issues occur and under what circumstances they are particularly challenging.

How can you mitigate that issue?

- You can run your subprocess with a local agent rather than within a Kubernetes job.
- You can offload long-running jobs that cause heartbeat issues into a separate flow (subflow):
  - this subflow can run on a local machine while your dependency-heavy containerized child flows run on Kubernetes,
  - you may optionally turn off flow heartbeats for such a flow using the disable_flow_heartbeat mutation, as explained in Section 2 of this topic,
  - you can orchestrate those child flows from a parent flow (i.e. build a flow-of-flows).


Amir · November 7, 2022, 4:23pm

Hi Anna,

Is it possible that flow heartbeat issues with externally executed jobs (as described above) can cause CPU to drop to 0 while maintaining “Running” status for the flow in the cloud UI?

Using Prefect 2.x in a hybrid setup, I was running into an issue using ShellTask where the pod CPU usage would drop to 0 after a prolonged period of no (outbound) API communication to the Cloud (i.e. a lack of DEBUG | prefect.client - Connecting to API at ...). This problem was averted by generating logs regularly, which produced regular outbound communication and kept the flow processing so that it finished properly.

anna_geller · November 7, 2022, 4:53pm

Could you elaborate? Are you saying that enabling debug logs on the agent fixed the heartbeat issue for you?

Amir · November 7, 2022, 5:19pm

Sure! Sorry, this is a bit convoluted so it’s tough to explain. Debug logs helped uncover the issue, as they showed when the agent was communicating with the Cloud - they didn’t resolve it. I’ll break it down a bit more:

1. Initially, I was experiencing an issue where running my model through ShellTask would cause CPU usage to drop to zero after ~1 hour. Although CPU usage was zero (and therefore no processes were being made), RAM usage would remain high (~30 GB) and the flow would still be marked as Running. This would continue indefinitely (at least ~2 days, over the weekend). A more thorough description of the issue can be found in Slack.
2. I turned debug logs on to try to uncover what the issue was. That didn’t produce any logs once ShellTask began running the model. The model also did not produce any logs. Essentially, from the terminal, the last thing you would see was ShellTask invoking the model, and then a line saying the model was running. This would last for about an hour before CPU would drop (there was zero outbound communication to the Cloud while this was happening).
3. From within the model, I then changed the model’s __main__.py configure_logging() method to use logging.StreamHandler(stream=sys.stdout), which caused the model’s logs to be output to the terminal. From then on, since the model was writing logs to the terminal, these logs were sent to the Cloud via the outbound API, which was confirmed by the regularly occurring lines saying DEBUG | prefect.client - Connecting to API at .... CPU utilization then continued, and the model did not ‘stall’.
4. The latter parts of the model did not generate any logs (I had yet to add them). So when the model reached the point where it no longer generated any logs, outbound communication ceased. No new logs were being generated, and after a period of time CPU utilization dropped to zero again.
5. I added logs, communication to the Cloud began again, and the model ran successfully while CPU utilization remained consistent.
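For readers who want to try the same logging change (routing the model's logs to stdout), a minimal sketch could look like the following. The actual configure_logging() in the model is not shown in this thread, so the function body, the "model" logger name and the format string are assumptions.

```python
import logging
import sys


def configure_logging(level: int = logging.INFO) -> logging.Logger:
    """Send the model's log records to stdout so the parent process can pick them up."""
    handler = logging.StreamHandler(stream=sys.stdout)
    handler.setFormatter(
        logging.Formatter("%(asctime)s | %(levelname)s | %(name)s - %(message)s")
    )
    logger = logging.getLogger("model")  # hypothetical logger name
    logger.setLevel(level)
    logger.handlers.clear()  # avoid duplicate lines if this is called more than once
    logger.addHandler(handler)
    return logger


logger = configure_logging()
logger.info("model step finished")
```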

Therefore, I’m now thinking this could be a heartbeat issue, but it doesn’t explain the CPU drop - at least, I’m unsure if it does. I’m currently trying to replicate this issue using a python script that simulates CPU load over a long time, and calling it via ShellTask to see if CPU drops. Let me know if the above makes sense - I’m happy to hop on a call with whoever to try explaining the issue a little better!

anna_geller · November 9, 2022, 12:25am

Great write-up! This may show that I’m not the right person to ask about low-level hardware issues, but is low CPU usage (or a drop in usage) bad? I would be more concerned if the usage spiked all of a sudden for no obvious reason.

We don’t do calls in the community, but you can reach paid support for troubleshooting calls via cs@prefect.io.
