How to handle ECS errors during flow run deployment? Is it possible to implement a flow-level retry for errors such as this one: "Failed to start [ECS] task for flow run SOME_UUID"?

View in #prefect-community on Slack

Myles_Steinhauser @Myles_Steinhauser: Does Prefect 1.1 support retries on Flows? Specifically, I’m trying to work around some delayed scaling issues with ECS on EC2 instances (not ECS with Fargate tasks).

Often, this failure is reported back to Prefect like the following error until Capacity Provider scaling has caught up again:

FAIL signal raised: FAIL('a4f09101-0577-41ce-b8b0-31b84f26d855 finished in state <Failed: "Failed to start task for flow run a4f09101-0577-41ce-b8b0-31b84f26d855. Failures: [{\'arn\': \'arn:aws:ecs:us-east-1:<redacted>:container-instance/a8bc98b7c6864874bc6d1138f758e8ea\', \'reason\': \'RESOURCE:CPU\'}]">')

I’m using the following calls to launch the sub-flows (as part of a larger script):

from prefect.tasks.prefect import create_flow_run, wait_for_flow_run

flow_a = create_flow_run(flow_name="A", project_name="myles")
wait_for_flow_a = wait_for_flow_run(flow_a, raise_final_state=True, stream_logs=True)

Anna_Geller @Anna_Geller: There are no flow-level retries; to retry a flow run, you need to create a new flow run. You could do this in a state handler or using Automations.
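A minimal sketch of the state-handler approach, assuming Prefect 1.x with Cloud/Server credentials configured (the handler name is hypothetical; Prefect imports are deferred into the failure branch so the happy path stays a no-op):

```python
def retry_on_failure(flow, old_state, new_state):
    """Flow-level state handler: when this run fails, create a fresh run.

    Hypothetical sketch; assumes Prefect 1.x and a registered flow whose
    flow_id is available in prefect.context at runtime.
    """
    if new_state.is_failed():
        # Deferred imports: only needed when we actually re-run the flow.
        import prefect
        from prefect.client import Client

        # Kick off a new run of the same registered flow version.
        Client().create_flow_run(flow_id=prefect.context.get("flow_id"))
    return new_state
```

Attach it via `Flow("FLOW_NAME", state_handlers=[retry_on_failure])`. Note that a handler like this only fires once the flow run is actually executing, so it cannot catch failures where Prefect never manages to deploy the run to ECS in the first place.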

Are you on Prefect Cloud?

Myles_Steinhauser @Myles_Steinhauser: Yep, running on Prefect Cloud.
We’re looking forward to Prefect 2.0 and its formal support for subflows, but the current experience at beta.prefect.io doesn’t meet our needs yet.

Anna_Geller @Anna_Geller: Sure, understandable! To approximate a flow-run retry, you could set an SLA using Automations: if a flow run has failed to move to a Running state after, say, 10 minutes (e.g. due to some ECS provisioning issue), start a new flow run.

But I’m not sure this is the right approach here; let me check something.

Myles_Steinhauser @Myles_Steinhauser: The behavior I’m seeing is that the parent flow starts (we can usually provision capacity for it ahead of time via a Scheduled Scaling Policy on our ASGs), but a child flow fails to start due to insufficient resources.

Retrying the launch of the child flow, while the parent flow keeps running, is what I’m hoping to do.

Anna_Geller @Anna_Geller: I checked the logs of your flow run, and Automations won’t help you here: the flow run fails immediately because Prefect cannot deploy it to ECS when there aren’t enough CPU resources on your EC2 instances.

I hate to say it because it’s a bit of a lazy answer, but the best option going forward is to either:
• ensure you have enough capacity in your self-managed EC2 data plane, e.g. by implementing a more aggressive scaling policy, or
• move to Fargate and stop worrying about infrastructure management, accepting the latency of serverless.

Managing this on the Prefect side is doable, but a bit hacky and not super elegant. In theory, you could use something like:

from datetime import timedelta

from prefect import Flow
from prefect.tasks.prefect import StartFlowRun

start_flow_run = StartFlowRun(
    project_name="PROJECT_NAME",
    wait=True,
    max_retries=10,
    retry_delay=timedelta(minutes=5),
)

with Flow("FLOW_NAME") as flow:
    staging = start_flow_run(flow_name="child_flow_name")

The retry_delay timedelta should be aligned with the time your scaling policy needs: e.g. if scale-out takes 3-4 minutes, a 5-minute retry delay can make sense.

Is this helpful?

Myles_Steinhauser @Myles_Steinhauser: Yup, this makes sense and gives me something to play with! I’m trying to avoid moving to Fargate because of existing security-vendor deployments that we would also need to update. We might at some point; we just can’t do it yet.
Ahh, okay, this makes sense now, comparing the API of StartFlowRun (task-based) to create_flow_run.
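For reference, the same retry settings can also be attached to the functional create_flow_run task via its task_args keyword, which copies the task with the given attributes so the retries apply to that call only. A sketch, assuming Prefect 1.x; the flow and project names are placeholders, and the function only builds the flow rather than registering or running it:

```python
def build_parent_flow():
    """Build a parent flow whose child-flow launch retries on failure.

    Sketch assuming Prefect 1.x. task_args sets max_retries/retry_delay on
    a copy of the create_flow_run task, mirroring the StartFlowRun example.
    """
    from datetime import timedelta

    from prefect import Flow
    from prefect.tasks.prefect import create_flow_run, wait_for_flow_run

    with Flow("parent_flow") as flow:
        child = create_flow_run(
            flow_name="A",
            project_name="myles",
            task_args=dict(max_retries=10, retry_delay=timedelta(minutes=5)),
        )
        wait_for_flow_run(child, raise_final_state=True, stream_logs=True)
    return flow
```

As with the StartFlowRun example, pick a retry_delay that matches your ASG scale-out time.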