Problem description
Prefect flows have heartbeats that signal to Prefect Cloud that your flow is alive. If Prefect didn’t have heartbeats, flows that lose communication and die would be shown permanently with a Running state in the UI.
In most cases we have seen, “no heartbeat detected” is a sign of memory issues, i.e. your flow run ran out of memory. It also sometimes happens with long-running external jobs such as Kubernetes or Databricks jobs.
What is a “heartbeat”?
As part of executing a flow run, flow runners spin up either a subprocess or background thread to periodically record “heartbeats” to the Prefect Cloud API. The goal of doing so is to detect when something has interrupted the execution of a run but has not been able to report to the API (e.g. the Kubernetes container ran out of memory).
Subprocess heartbeats are more likely to be killed because they take up more CPU and memory. When encountering possible resource issues, switching to threaded heartbeats is a good debugging step.
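To make the threaded variant concrete, here is a minimal sketch of a background heartbeat thread. It is not Prefect's actual implementation: `start_heartbeat_thread` and `send_heartbeat` are hypothetical names, and the callback stands in for whatever API call records the heartbeat.

```python
import threading
import time
from typing import Callable


def start_heartbeat_thread(
    send_heartbeat: Callable[[], None],  # hypothetical callback, e.g. an API call
    interval_seconds: float,
) -> threading.Event:
    """Start a daemon thread that calls `send_heartbeat` until stopped.

    Returns an Event; call `.set()` on it to stop the heartbeat loop.
    """
    stop_event = threading.Event()

    def _loop() -> None:
        # Event.wait returns False on timeout and True once the event is set,
        # so the loop sends one heartbeat per interval until it is stopped.
        while not stop_event.wait(interval_seconds):
            try:
                send_heartbeat()
            except Exception:
                # A failed heartbeat should not crash the run itself.
                pass

    threading.Thread(target=_loop, daemon=True, name="heartbeat").start()
    return stop_event


if __name__ == "__main__":
    beats = []
    stop = start_heartbeat_thread(lambda: beats.append(time.time()), 0.05)
    time.sleep(0.3)  # stand-in for the actual flow run work
    stop.set()
    print(f"sent {len(beats)} heartbeats")
```

Because the thread lives inside the flow run process, it dies together with that process (for example when the container is killed for exceeding its memory limit), and the missing heartbeats are what the API eventually notices.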
Possible solutions
1) Setting heartbeat mode to threads
Prefect 0.15.4 and above include additional logging that surfaces the underlying error when a heartbeat failure occurs.
You can configure heartbeats to use threads instead of processes.
from prefect.run_configs import UniversalRun
# `flow` is your existing prefect.Flow object
flow.run_config = UniversalRun(env={"PREFECT__CLOUD__HEARTBEAT_MODE": "thread"})
This has proven to be more stable for a lot of users.
2) Moving the failing task to a subflow and turning off flow heartbeat
If the flow continues to fail after switching to threads, some users have had success moving the failing task into a subflow and then turning off heartbeats for that subflow with the following GraphQL mutation:
mutation {
  disable_flow_heartbeat(input: { flow_id: "your-subflows-flow-id-here" }) {
    success
  }
}
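If you would rather send this mutation from Python than from the Interactive API tab, the sketch below builds the same query string and passes it to any client object that exposes a `graphql` method. The helper names `build_disable_heartbeat_mutation` and `disable_flow_heartbeat`, and the `GraphQLClient` protocol, are hypothetical and introduced only for illustration.

```python
from typing import Any, Protocol


class GraphQLClient(Protocol):
    """Anything with a `graphql(query)` method (an assumption of this sketch)."""

    def graphql(self, query: Any, **kwargs: Any) -> Any: ...


def build_disable_heartbeat_mutation(flow_id: str) -> str:
    """Return the disable_flow_heartbeat mutation for the given flow ID."""
    return (
        "mutation {\n"
        f'  disable_flow_heartbeat(input: {{ flow_id: "{flow_id}" }}) {{\n'
        "    success\n"
        "  }\n"
        "}"
    )


def disable_flow_heartbeat(client: GraphQLClient, flow_id: str) -> Any:
    """Send the mutation through the given client and return its response."""
    return client.graphql(build_disable_heartbeat_mutation(flow_id))
```

With Prefect 1.x installed and authenticated against Prefect Cloud, passing `prefect.Client()` as the client should work, since that class provides a `graphql` method; treat this as an untested assumption for your particular version.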
Flow heartbeat issues with externally executed jobs (e.g. a Kubernetes job)
Tracking flow heartbeats in a hybrid execution model is challenging for several reasons:
- The Kubernetes job for the flow run is a fully independent containerized application that communicates with the Prefect Cloud API purely via outbound API calls, i.e. your Kubernetes job must send a request to our API to perform any state change.
- Heartbeats allow us to track whether the flow and task runs are still in progress, but we don’t have inbound access to your pod to check its state. If something goes wrong within the pod (e.g. someone deletes the pod, or the pod crashes or faces network issues), heartbeats are the only way for Prefect to detect such an infrastructure issue without having access to your infrastructure.
It can be especially challenging if within this flow run pod you are spinning up some external resources such as:
- subprocesses triggered with a ShellTask,
- external long-running jobs such as DatabricksSubmitRun.
These use cases double the challenge: not only is your flow run executed in a completely separate, isolated environment (the container in your Kubernetes job pod), but within that container you are also spinning up a subprocess, e.g. for the ShellTask. If something goes wrong in this subprocess, there is one more layer to get through before the issue can be detected.
We could, for example, immediately mark task runs or flow runs as failed and cancel the flow run if we don’t receive flow heartbeats for a couple of seconds. However, this would risk cancelling a computation that is working properly and only had a transient issue with the heartbeat itself or with sending a state update to the API.
This description may not be 100% technically accurate, but it should make the challenge clearer and show why flow heartbeat issues occur and under what circumstances they are particularly challenging.
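The trade-off between reacting quickly and tolerating transient heartbeat gaps can be illustrated with a small sketch. The function name `missed_heartbeat` and the grace-period values are hypothetical and chosen only for illustration; they are not Prefect's actual thresholds.

```python
from datetime import datetime, timedelta


def missed_heartbeat(
    last_heartbeat: datetime,
    now: datetime,
    grace_period: timedelta,
) -> bool:
    """Return True if no heartbeat arrived within the grace period."""
    return now - last_heartbeat > grace_period


# A healthy run whose heartbeat was delayed by a 45-second network blip.
last = datetime(2022, 1, 1, 12, 0, 0)
now = last + timedelta(seconds=45)

# An aggressive threshold flags the healthy run as dead ...
print(missed_heartbeat(last, now, timedelta(seconds=10)))  # True
# ... while a more lenient one tolerates the transient gap.
print(missed_heartbeat(last, now, timedelta(minutes=2)))   # False
```

A short grace period detects crashed pods sooner but cancels healthy runs on any brief hiccup, while a long one does the opposite.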
How can you mitigate that issue?
- You can run your subprocess with a local agent rather than within a Kubernetes job.
- You can offload long-running jobs that cause heartbeat issues into a separate flow (subflow):
  - this subflow can run on some local machine while your dependency-heavy containerized child flows run on Kubernetes,
  - you may optionally turn off flow heartbeats for such a flow using the disable_flow_heartbeat mutation, as explained in Section 2 of this topic,
  - you can orchestrate those child flows from a parent flow (i.e. build a flow-of-flows).