Why is there no Automation to send alerts when a flow run is stuck in a Submitted state for a given time?

TL;DR - because this is what Lazarus is for.

View in #prefect-community on Slack

Matthew_Seligson @Matthew_Seligson: We need alerting when a flow is submitted for execution but is never executed. Can we create an automation for this? Basically for every flow, send an email if a flow is submitted but not started after 60 seconds. Is this possible?

Kevin_Kho @Kevin_Kho: I actually think there is none because there is only "Does not start" and this refers to something not leaving a Scheduled state. I think we’d need an issue for this
@Marvin open “Feature Request: Automation for Flow stuck in Submitted”

Marvin @Marvin: https://github.com/PrefectHQ/prefect/issues/5652

Matthew_Seligson @Matthew_Seligson: What other solutions are there to this problem? We have state handlers that alert when a flow fails but no coverage for cases where a flow doesn’t even start.

Kevin_Kho @Kevin_Kho: I am not seeing one at the moment myself, let me ask the team for ideas

Anna_Geller @Anna_Geller: @Matthew_Seligson this is what Lazarus is for. If the flow run stays in the Submitted state for 10+ minutes, Lazarus reschedules it up to 3 times. There is nothing you have to do on your end to handle that, and I don’t think we need an Automation for this.

In general, this indicates an issue with your execution layer, e.g.:

• your agent can’t deploy your flow run to a given execution layer
• something is wrong in your Kubernetes or ECS cluster

The Submitted state means that the agent was able to pick up the scheduled flow run and submitted it for execution, but something has gone wrong in the flow run deployment.

Here are some steps you may take to investigate it further:
• check the agent logs to see if anything suspicious stands out there (running an agent with the --show-flow-logs option can show more useful output)
• verify that your execution layer is able to pull your flow run’s image e.g. if the image needs to be pulled from a container registry, make sure your container can reach the Internet and has appropriate permissions to pull the image
• verify that your execution layer has enough permissions to spin up all required resources. For instance:
  • the Kubernetes agent needs proper RBAC permissions to deploy your flow run as a Kubernetes job
  • the ECS agent needs a custom task_execution_role with permissions to describe VPCs, register a task definition and run the ECS task
• verify that your execution layer has enough capacity on the cluster to deploy your flow run.

Matthew_Seligson @Matthew_Seligson: Thanks @Anna_Geller. We don’t want to reschedule it; we just want to know that it failed to execute a flow. As you mentioned, the agent is picking up the flow but something is going wrong in the flow run deployment. Looking at logs is certainly helpful for a post-mortem, but logs are not the same as an alert. We’d like to be notified when the flow run is submitted but not running.

Anna_Geller @Anna_Geller: in that case, perhaps you can create a custom flow-level state handler?

in general, it would be better to tackle the root cause of the issue. Do you know why your infrastructure cannot deploy the flow run?
also, just to be 100% sure - you are on Prefect Cloud, right?

Matthew_Seligson @Matthew_Seligson: Thanks Anna. We have a flow state handler but it is not triggered if the flow isn’t started. The issue is between the submitted and running state. Yes we are on Prefect Cloud.

Anna_Geller @Anna_Geller: Gotcha. I can totally understand why getting such an alert is useful, but to set expectations - I could reopen this issue, but I wouldn’t expect it to be tackled anytime soon since we already have a dedicated service that handles such issues in 1.0, which is Lazarus, and 2.0 is the priority

so the best bet would be to tackle the root cause of the issue and prevent those scenarios in the first place e.g. by allocating more resources and monitoring such use cases on the infrastructure level
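
Since no built-in Automation covers this case, one possible workaround for the alert Matthew describes is a small external watchdog that periodically looks at flow run states and flags anything that has sat in Submitted past a threshold. The sketch below shows only the filtering step in plain Python. The function name find_stuck_submitted, the record layout (id, state, state_timestamp) and the 60-second default are assumptions made for illustration. How the records are fetched (for example from the Prefect API) and how the alert is delivered are deliberately left out.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Optional


def find_stuck_submitted(
    flow_runs: List[Dict[str, str]],
    threshold: timedelta = timedelta(seconds=60),
    now: Optional[datetime] = None,
) -> List[Dict[str, str]]:
    """Return flow runs that have been in Submitted for longer than `threshold`.

    Assumed (hypothetical) record layout:
    {"id": ..., "state": ..., "state_timestamp": <ISO 8601 string with a UTC offset>}
    """
    now = now or datetime.now(timezone.utc)
    stuck = []
    for run in flow_runs:
        # Only runs that are still waiting between Submitted and Running matter here.
        if run["state"] != "Submitted":
            continue
        entered = datetime.fromisoformat(run["state_timestamp"])
        if now - entered > threshold:
            stuck.append(run)
    return stuck


if __name__ == "__main__":
    # Illustrative records only; real data would come from your own polling code.
    example_runs = [
        {"id": "a", "state": "Submitted", "state_timestamp": "2022-04-11T12:00:00+00:00"},
        {"id": "b", "state": "Running", "state_timestamp": "2022-04-11T12:00:00+00:00"},
        {"id": "c", "state": "Submitted", "state_timestamp": "2022-04-11T12:04:30+00:00"},
    ]
    fixed_now = datetime(2022, 4, 11, 12, 5, 0, tzinfo=timezone.utc)
    for run in find_stuck_submitted(example_runs, now=fixed_now):
        print(f"Flow run {run['id']} is stuck in Submitted")
```

Run on a schedule outside the affected execution layer, a check like this could feed an email or Slack notification. It complements Lazarus rather than replacing it, because it only reports the stuck run and does not reschedule it.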