View in #prefect-community on Slack
@Henning_Holgersen : Hi,
I’m using Prefect Cloud and an AKS (Azure Kubernetes Service) cluster running a vanilla k8s agent (from the prefect agent kubernetes install command). The agent connects to Prefect Cloud, but when I run a flow I get an error saying 'Connection aborted.', OSError(0, 'Error'). I have added the RBAC resources to the deployment and opened outbound traffic to api.prefect.io on port 443, but I have not been able to get anything to run on the AKS agent. Any ideas?
@Anna_Geller : What Prefect version do you use for your agent? It could be some urllib3 version issue.
We had some users reporting similar errors because the AKS load balancer was dropping connections, and the solution recommended by Azure was to increase the timeout settings on the AKS load balancer. I can search for the GitHub issue that discussed this in more detail.
OK, it’s not deleted yet; here is the Slack thread:
[December 6th, 2021 6:38 AM] anna: <@U01NXDFCMS4> Your issue may occur due to the Azure Load Balancer dropping connections in the AKS Kubernetes cluster, as described here: https://github.com/PrefectHQ/prefect/pull/3344#issuecomment-696643851 . The issue was already reported by the community in <Slack thread> - sharing in case you want to have a look.
Main take-aways:
• Azure’s advice was to increase the idle timeout on the load balancer, but according to our community member this solution did not fix the problem, and the connection reset still appeared after roughly 4 minutes
• This PR #5066 (https://github.com/PrefectHQ/prefect/pull/5066) addressed the problem and was released in 0.15.8 (https://github.com/PrefectHQ/prefect/releases/tag/0.15.8) - what Prefect version do you use?
@Henning_Holgersen : My prefect version is 0.15.13 on my local machine, and 0.15.12 on the agent. Skimming through the thread you linked, this issue comes up immediately - within 30 seconds. Networking is my prime suspect, since that’s the biggest difference between my private AKS+prefect setup (which is working) and the professional one I’m dealing with now (kubenet vs Azure CNI).
Do you have any idea what part of the system this comes from? Did the agent simply stop responding? Is this a response from the agent about some Docker/image issue? Is there communication on ports other than 443?
I can add that I’m able to run a normal flask-app on the cluster without issues.
@Anna_Geller : it’s 100% a network issue; something is not right with the SSL/TLS handshake. Could it be a missing firewall rule? I know you opened 443, but did you open 8080 as well? I see the agent uses it for health checks.
@Henning_Holgersen : Thanks, that’s good to know. I was told port 443 was all I needed and that the health check was optional. But yes, the deployment is set up with it. Port 8080 is like the YOLO port of the internet; can’t wait to argue with the security team about that one…
The agent uses python 3.6, while I’m on 3.9 locally. My private (working) aks agent is on prefect 0.15.7 and python 3.6.
@Anna_Geller : thanks for providing more details. To be 100% honest: I’m no networking pro and I don’t know whether port 8080 must be opened for your agent, but it seems so to me given this health check definition in the default Kubernetes agent spec:
livenessProbe:
  failureThreshold: 2
  httpGet:
    path: /api/health
    port: 8080
  initialDelaySeconds: 40
  periodSeconds: 40
name: agent
I will ask the team if someone has any concrete recommendations
@Henning_Holgersen : I removed the health check and upgraded the image, now I’m getting a more informative error message that points to the internal k8s networking. I think prefect is no longer a suspect here, at least not directly. Thank you for the pointers.
@Anna_Geller : Nice work! I wonder whether removing the health checks has any negative impact on how your agent works. I think it should not - you just probably won’t get the indication in the Cloud UI of whether the agent is healthy.
Keep us posted if there is anything we can help with
@Henning_Holgersen : @Anna_Geller After a lot more debugging I’m (almost) back where I started. The informative error message I mentioned was a sidetrack.
I have verified that the AKS cluster can reach and establish an HTTPS connection with api.prefect.io (via a simple Flask app using the requests library). I also registered a flow that used a public Docker image instead of a private ACR, with the same result. So… I’m a little out of ideas; if you have any, let me know.
@Anna_Geller : Can you confirm that your KubernetesAgent can reach the Internet and can communicate with Prefect Cloud? Do you see the agent being healthy in the UI?
Also, can you share the flow definition and the Dockerfile of the image you used to register your first flow with this agent? It would be best to start with a trivial hello-world example such as this one:
from prefect import task, Flow

@task(log_stdout=True)
def hello_world():
    print("hello world")

with Flow("hello") as flow:
    hw = hello_world()
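Once the local run works, registering the flow so the agent can pick it up would look roughly like this (the project name is a placeholder):

flow.run()  # quick local sanity check before involving the agent
flow.register(project_name="my-project")  # placeholder project name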
Lastly, if the agent you set up yourself doesn’t work, perhaps you can try the one from the Azure Marketplace?
@Henning_Holgersen : The agent appears healthy (green) in the UI, and there is consistent contact every 10 seconds.
The flow I’m testing (one of them, anyways) is a copy of this one (but with a different registry): https://github.com/radbrt/prefect_pipeline/blob/master/dockerstorage/dockerflow.py
This flow worked on my personal setup with AKS and prefect cloud.
I am hesitant to go for the marketplace offering because we use infrastructure as code with a somewhat involved templating language on top of ARM, and we are inside a “landing zone”, which means, among other things, that all resources must be deployed in a predefined subnet. That’s not to say it wouldn’t be possible, but it would be very cumbersome.
@Anna_Geller : 1. Can you share the YAML file of the Kubernetes agent?
2. Can you confirm that your KubernetesAgent can reach the internet in order to download the image? If the agent polls Prefect Cloud, then it should have this outbound internet access, but it would be good to know whether the images can be pulled as well.
3. Can you enable DEBUG logs on your agent and in your run config to get more information?
4. Are you sure the labels match between the agent and the flow run? Have you tried using the KubernetesRun run config?
I will ask the team as well to get more ideas of what you can try.
Also, you could use the KubernetesAgent from the Azure Marketplace just to see whether this setup works in your custom subnets. Perhaps doing this with the template can give you more hints about what is wrong in the current IaC setup?
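For reference, a minimal sketch of what I mean by matching labels via the KubernetesRun run config - the image, registry, label, and project name below are all placeholders:

from prefect import task, Flow
from prefect.run_configs import KubernetesRun
from prefect.storage import Docker

@task(log_stdout=True)
def hello_world():
    print("hello world")

with Flow("hello-k8s") as flow:
    hello_world()

# The label must match a label assigned to the Kubernetes agent
flow.run_config = KubernetesRun(
    image="prefecthq/prefect:0.15.12-python3.9",  # placeholder image
    labels=["aks"],                               # placeholder label
)
flow.storage = Docker(registry_url="myregistry.azurecr.io")  # placeholder registry

flow.register(project_name="my-project")  # placeholder project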
@Henning_Holgersen : Thank you, especially the debug logs are a new idea to me. I will give it a try on Monday, gather the info you asked for, and get back to you. In the meantime, have a nice weekend.
@Anna_Geller : Great to hear, have a nice weekend, too!
@Henning_Holgersen : Hi again, I am still having no luck with this.
See below for the deploy file. It is very vanilla. I have tried a myriad of variations on this, but the result is always the same.
I don’t really have a way to verify that the agent itself can pull images, but I’m not seeing any attempts at creating a pod. Usually I’d see ImagePullBackOff and similar errors during pod creation, but this seems to fail before we get there.
Enabling DEBUG (the PREFECT__LOGGING__LEVEL = DEBUG env variable in the deploy file) didn’t yield any extra info.
The labels are matching; I believe the flow wouldn’t have started otherwise.
I have also tried a separate template (sidecar pattern), and deploying to a preexisting dask cluster (in case I somehow wasn’t allowed to create sidecars).
Because of the firewalls, I’m having trouble accessing the k8s logs.
I’m back to networking as the most likely culprit: as mentioned, our firewall is very restrictive and allows only outbound traffic (and the corresponding responses) to api.prefect.io on port 443. Could there be anything else missing? As mentioned, the agent pops up in the UI as if everything were fine.
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: prefect-agent
  name: prefect-agent
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prefect-agent
  template:
    metadata:
      labels:
        app: prefect-agent
    spec:
      containers:
      - args:
        - prefect agent kubernetes start
        command:
        - /bin/bash
        - -c
        env:
        - name: PREFECT__CLOUD__AGENT__AUTH_TOKEN
          value: ''
        - name: PREFECT__CLOUD__API
          value: https://api.prefect.io
        - name: NAMESPACE
          value: default
        - name: IMAGE_PULL_SECRETS
          value: ''
        - name: PREFECT__CLOUD__AGENT__LABELS
          value: '[]'
        - name: JOB_MEM_REQUEST
          value: ''
        - name: JOB_MEM_LIMIT
          value: ''
        - name: JOB_CPU_REQUEST
          value: ''
        - name: JOB_CPU_LIMIT
          value: ''
        - name: IMAGE_PULL_POLICY
          value: ''
        - name: SERVICE_ACCOUNT_NAME
          value: ''
        - name: PREFECT__BACKEND
          value: cloud
        - name: PREFECT__CLOUD__AGENT__AGENT_ADDRESS
          value: http://:8080
        - name: PREFECT__CLOUD__API_KEY
          value: '<SuperSecretSystemUserAPIKey>'
        - name: PREFECT__CLOUD__TENANT_ID
          value: ''
        image: prefecthq/prefect:0.15.7-python3.6
        imagePullPolicy: Always
        livenessProbe:
          failureThreshold: 2
          httpGet:
            path: /api/health
            port: 8080
          initialDelaySeconds: 40
          periodSeconds: 40
        name: agent
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: prefect-agent-rbac
  namespace: default
rules:
- apiGroups:
  - batch
  - extensions
  resources:
  - jobs
  verbs:
  - '*'
- apiGroups:
  - ''
  resources:
  - events
  - pods
  verbs:
  - '*'
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: RoleBinding
metadata:
  name: prefect-agent-rbac
  namespace: default
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: prefect-agent-rbac
subjects:
- kind: ServiceAccount
  name: default
@Anna_Geller : Thanks for sharing all the details, this is helpful. Two things strike me:
• You said that the flow and the agent have matching labels, but your agent’s YAML file doesn’t have any labels assigned - could you try specifying labels explicitly on the agent and on the flow’s KubernetesRun? I know it may sound annoying, but a labels mismatch is a really frequent culprit.
• You said that changing the logging level to DEBUG didn’t yield any extra info - this is suspicious, since you should see many more log messages.
I will share your use case with the team to gather more ideas, but you can definitely cross-check those two points above. I also believe the network is most likely the issue here, but it’s still good to cross-check the basics first before doing a deep dive into the networking rabbit hole.
@Henning_Holgersen : Ah, sorry about the missing label - it was supposed to be there. I’m on two different computers, so I had to recreate the deployment and forgot it. But yes, the label is usually some variation of “aks”. I keep changing the label while debugging and launching multiple agents, and I manually change the label on the flow to match. The agents always have a label and the flows always start; whenever I have mucked this up, the flow won’t start at all. Is there a chance the flow may start but error out if the label is wrong? I haven’t given much attention to that possibility.
The DEBUG logs thing struck me as well; do you have any more info about how to set it up? I could only find a reference to the env var, so I might have misunderstood something.
@Anna_Geller : Gotcha, thanks for explaining the issue with the labels.
To enable debug logs for the agent, you could add this environment variable to the agent deployment file:
env:
- name: PREFECT__LOGGING__LEVEL
  value: 'DEBUG'
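To get debug logs from the flow run itself as well, the same variable can be set on the flow’s run config - a sketch, assuming you use KubernetesRun (the label is a placeholder):

from prefect.run_configs import KubernetesRun

# Pass the logging level through to the flow-run job
flow.run_config = KubernetesRun(
    labels=["aks"],  # placeholder label
    env={"PREFECT__LOGGING__LEVEL": "DEBUG"},
)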
I shared the issue with our team, I’ll let you know if I get any pointers on what may be wrong in your setup.
@Henning_Holgersen : Just to add the answer here, which (as was predicted) has nothing to do with Prefect: our AKS runs behind a firewall, but the AKS endpoint that Azure creates is public. The firewall needs an outbound rule for the "API server address" (probably ending in azmk8s.io) that is listed in the portal. Prefect needs to reach this endpoint in order to create new jobs (= flow runs). The needed firewall rule is mentioned in the Azure documentation but is hard to find. And the error message was completely undecipherable.
@Anna_Geller : Thank you so much for the update @Henning_Holgersen ! But to be honest, I still don’t get it. Can you explain what exactly you did to fix the issue? For example:
• which ports did you have to open on Azure to allow outbound traffic to Prefect Cloud API?
• what extra networking configuration/routes did you have to set up in Azure?
• can you share the link to this elusive documentation that was helpful to fix your issue?
Is your Kubernetes agent working now?
@Henning_Holgersen : I don’t actually have the documentation; I simply met the right person with a lot of Azure experience who recognized the situation and said he had spent two weeks debugging it when he first encountered it. He also said it was mentioned in the docs somewhere, but didn’t provide the link and… I can’t find it.
In short, the firewall rule is on port 443, outbound from the AKS cluster IP (or IP range) to the <something>.azmk8s.io domain that is listed as the "API Server Address" of the cluster. The address can be found on the AKS overview page in the portal. Depending on the firewall rule type, you might have to find the IP behind the API server domain and use that in the rule instead of the URL, but that’s a detail.
No extra rules to Prefect Cloud were needed; the trusty old api.prefect.io:443 outbound rule was enough once the AKS/k8s firewall rule above had been sorted.
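If anyone wants to sanity-check this, a quick reachability probe run from a pod inside the cluster might look like the sketch below (the hostname is a placeholder; any HTTP response, even 401/403, means the firewall lets the traffic through):

import requests

# The "API Server Address" from the AKS overview page (placeholder value)
API_SERVER = "https://my-cluster-dns-12345678.hcp.westeurope.azmk8s.io"

try:
    # Certificate verification is skipped because this is purely a connectivity test
    resp = requests.get(API_SERVER + "/version", timeout=10, verify=False)
    print("Reached the API server, status:", resp.status_code)
except requests.exceptions.RequestException as exc:
    print("Could not reach the API server - likely a firewall/routing issue:", exc)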
@Anna_Geller : Thank you so much. I’ll export that to Discourse for posterity.