View in #prefect-community on Slack
@Henning_Holgersen : Hi,
I’m using Prefect Cloud and an AKS (Azure Kubernetes Service) cluster running a vanilla k8s agent (from the prefect agent kubernetes install command). The agent connects to Prefect Cloud, but when I run a flow I get an error saying 'Connection aborted.', OSError(0, 'Error'). I have added the RBAC resources to the deployment and opened outbound traffic to api.prefect.io on port 443, but I have not been able to get anything to run on the AKS agent. Any ideas?
@Anna_Geller : What Prefect version do you use for your agent? It could be some urllib3 version issue.
We had some users reporting similar errors because the AKS load balancer was dropping connections, and the solution recommended by Azure was to increase the timeout settings on the AKS load balancer. I can search for the GitHub issue that discussed this in more detail.
OK, it’s not deleted yet; here is the Slack thread:
[December 6th, 2021 6:38 AM] anna: <@U01NXDFCMS4> Your issue may occur due to the Azure Load Balancer dropping connections in the AKS Kubernetes cluster, as described here: https://github.com/PrefectHQ/prefect/pull/3344#issuecomment-696643851 . The issue was already reported by the community in <Slack thread> - sharing in case you want to have a look.
Main take-aways:
• Azure’s advice was to increase the idle timeout on the load balancer, but according to our community member this solution did not fix the problem, and the connection reset still appeared after roughly 4 minutes
• This PR #5066 (https://github.com/PrefectHQ/prefect/pull/5066) addressed the problem and was released in 0.15.8 (https://github.com/PrefectHQ/prefect/releases/tag/0.15.8) - what Prefect version do you use?
@Henning_Holgersen : My prefect version is 0.15.13 on my local machine, and 0.15.12 on the agent. Skimming through the thread you linked, this issue comes up immediately - within 30 seconds. Networking is my prime suspect, since that’s the biggest difference between my private AKS+prefect setup (which is working) and the professional one I’m dealing with now (kubenet vs Azure CNI).
Do you have any idea what part of the system this comes from? Did the agent simply stop responding? Is this a response from the agent about some Docker/image issue? Is there communication on ports other than 443?
I can add that I’m able to run a normal flask-app on the cluster without issues.
@Anna_Geller : it’s 100% a network issue; something is not right with the SSL/TLS handshake. Could it be a missing firewall rule? I know you opened 443, but did you open 8080 as well? I see the agent uses it for health checks.
@Henning_Holgersen : Thanks, that’s good to know. I was told port 443 was all I needed and that the health check was optional. But yes, the deployment is set up with it. Port 8080 is like the YOLO port of the internet; can’t wait to argue with the security team about that one…
The agent uses python 3.6, while I’m on 3.9 locally. My private (working) aks agent is on prefect 0.15.7 and python 3.6.
@Anna_Geller : thanks for providing more details. To be 100% honest: I’m no networking pro and I don’t know whether port 8080 must be opened for your agent, but it seems so to me given this health check definition in the default Kubernetes agent spec:
livenessProbe:
  failureThreshold: 2
  httpGet:
    path: /api/health
    port: 8080
  initialDelaySeconds: 40
  periodSeconds: 40
name: agent
I will ask the team if someone has any concrete recommendations
@Henning_Holgersen : I removed the health check and upgraded the image, now I’m getting a more informative error message that points to the internal k8s networking. I think prefect is no longer a suspect here, at least not directly. Thank you for the pointers.
@Anna_Geller : Nice work! I wonder whether removing the health checks has any negative impact on how your agent works. I think it should not - you just probably won’t get the indication in the Cloud UI of whether the agent is healthy.
Keep us posted if there is anything we can help with
@Henning_Holgersen : @Anna_Geller After a lot more debugging I’m (almost) back where I started. The informative error message I mentioned was a sidetrack.
I have verified that the AKS cluster can reach and establish an HTTPS connection with api.prefect.io (via a simple Flask app using the requests library). I also registered a flow that used a public Docker image instead of a private ACR, with the same result. So… I’m a little out of ideas; if you have any, let me know.
@Anna_Geller : Can you confirm that your KubernetesAgent can reach the Internet and can communicate with Prefect Cloud? Do you see the agent being healthy in the UI?
Also, can you share the flow definition and the Dockerfile of the image you used to register your first flow with this agent? It would be best to start with a trivial hello-world example such as this one:
from prefect import task, Flow

@task(log_stdout=True)
def hello_world():
    print("hello world")

with Flow("hello") as flow:
    hw = hello_world()
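Once the local run works, registering the flow so the agent can pick it up would look roughly like this (the project name is a placeholder):

flow.run()  # quick local sanity check before involving the agent
flow.register(project_name="my-project")  # placeholder project name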
Lastly, if the agent you set up yourself doesn’t work, perhaps you can try the one from the Azure Marketplace?
@Henning_Holgersen : The agent appears healthy (green) in the UI, and there is consistent contact every 10 seconds.
The flow I’m testing (one of them, anyways) is a copy of this one (but with a different registry): https://github.com/radbrt/prefect_pipeline/blob/master/dockerstorage/dockerflow.py
This flow worked on my personal setup with AKS and prefect cloud.
I am hesitant to go for the marketplace offering because we use infrastructure as code with a somewhat involved templating language on top of ARM, and we are inside a “landing zone”, which means, among other things, that all resources must be deployed in a predefined subnet. That’s not to say it wouldn’t be possible, but it would be very cumbersome.
@Anna_Geller : 1. Can you share the YAML file of the Kubernetes agent?
2. Can you confirm that your KubernetesAgent can reach the internet in order to download the image? If the agent polls Prefect Cloud, then it should have this outbound internet access, but it would be good to know whether the images can be pulled as well.
3. Can you enable DEBUG logs on your agent and in your run config to get more information?
4. Are you sure the labels match between the agent and the flow run? Have you tried using the KubernetesRun run config?
I will ask the team as well to get more ideas of what you can try.
Also, you could use the KubernetesAgent from the Azure Marketplace just to see whether this setup works in your custom subnets. Perhaps doing this with the template can give you more hints about what is wrong in the current IaC setup?
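For reference, a minimal sketch of what I mean by matching labels via the KubernetesRun run config - the image, registry, label, and project name below are all placeholders:

from prefect import task, Flow
from prefect.run_configs import KubernetesRun
from prefect.storage import Docker

@task(log_stdout=True)
def hello_world():
    print("hello world")

with Flow("hello-k8s") as flow:
    hello_world()

# The label must match a label assigned to the Kubernetes agent
flow.run_config = KubernetesRun(
    image="prefecthq/prefect:0.15.12-python3.9",  # placeholder image
    labels=["aks"],                               # placeholder label
)
flow.storage = Docker(registry_url="myregistry.azurecr.io")  # placeholder registry

flow.register(project_name="my-project")  # placeholder project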
@Henning_Holgersen : Thank you, especially the debug logs are a new idea to me. I will give it a try on Monday, gather the info you asked for, and get back to you. In the meantime, have a nice weekend.
@Anna_Geller : Great to hear, have a nice weekend, too!
@Henning_Holgersen : Hi again, I am still having no luck with this.
See below for the deploy file. It is very vanilla. I have tried a myriad of variations on this, but the result is always the same.
I don’t really have a way to verify that the agent itself can pull images, but I’m not seeing any attempts at creating a pod. Usually I’d see ImagePullBackOff and similar errors during pod creation, but this seems to fail before we get there.
Enabling DEBUG (the PREFECT__LOGGING__LEVEL = DEBUG env variable in the deploy file) didn’t yield any extra info.
The labels are matching; I believe the flow wouldn’t have started otherwise.
I have also tried a separate template (sidecar pattern), and deploying to a preexisting dask cluster (in case I somehow wasn’t allowed to create sidecars).
Because of the firewalls, I’m having trouble accessing the k8s logs.
I’m back to networking as the most likely culprit: as mentioned, our firewall is very restrictive and allows only outbound traffic (and the corresponding responses) to api.prefect.io on port 443. Could there be anything else missing? As mentioned, the agent pops up in the UI as if everything were fine.
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: prefect-agent
  name: prefect-agent
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prefect-agent
  template:
    metadata:
      labels:
        app: prefect-agent
    spec:
      containers:
      - args:
        - prefect agent kubernetes start
        command:
        - /bin/bash
        - -c
        env:
        - name: PREFECT__CLOUD__AGENT__AUTH_TOKEN
          value: ''
        - name: PREFECT__CLOUD__API
          value: https://api.prefect.io
        - name: NAMESPACE
          value: default
        - name: IMAGE_PULL_SECRETS
          value: ''
        - name: PREFECT__CLOUD__AGENT__LABELS
          value: '[]'
        - name: JOB_MEM_REQUEST
          value: ''
        - name: JOB_MEM_LIMIT
          value: ''
        - name: JOB_CPU_REQUEST
          value: ''
        - name: JOB_CPU_LIMIT
          value: ''
        - name: IMAGE_PULL_POLICY
          value: ''
        - name: SERVICE_ACCOUNT_NAME
          value: ''
        - name: PREFECT__BACKEND
          value: cloud
        - name: PREFECT__CLOUD__AGENT__AGENT_ADDRESS
          value: http://:8080
        - name: PREFECT__CLOUD__API_KEY
          value: '<SuperSecretSystemUserAPIKey>'
        - name: PREFECT__CLOUD__TENANT_ID
          value: ''
        image: prefecthq/prefect:0.15.7-python3.6
        imagePullPolicy: Always
        livenessProbe:
          failureThreshold: 2
          httpGet:
            path: /api/health
            port: 8080
          initialDelaySeconds: 40
          periodSeconds: 40
        name: agent
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: prefect-agent-rbac
  namespace: default
rules:
- apiGroups:
  - batch
  - extensions
  resources:
  - jobs
  verbs:
  - '*'
- apiGroups:
  - ''
  resources:
  - events
  - pods
  verbs:
  - '*'
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: RoleBinding
metadata:
  name: prefect-agent-rbac
  namespace: default
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: prefect-agent-rbac
subjects:
- kind: ServiceAccount
  name: default
@Anna_Geller : Thanks for sharing all the details, this is helpful. Two things strike me:
• You said that the flow and the agent have matching labels, but your agent’s YAML file doesn’t have any labels assigned - could you try specifying labels explicitly on the agent and on the flow’s KubernetesRun? I know it may sound annoying, but a labels mismatch is a really frequent culprit.
• You said that changing the logging level to DEBUG didn’t yield any extra info - this is suspicious, since you should see many more log messages.
I will share your use case with the team to gather more ideas, but you can definitely cross-check those two points above. I also believe the network is most likely the issue here, but it’s still good to cross-check the basics first before doing a deep dive into the networking rabbit hole.
@Henning_Holgersen : Ah, sorry about the missing label - it was supposed to be there. I’m on two different computers, so I had to recreate the deployment and forgot it. But yes, the label is usually some variation of “aks”. I keep changing the label while debugging and launching multiple agents, and I manually change the label on the flow to match. The agents always have a label and the flows always start; whenever I have mucked this up, the flow won’t start at all. Is there a chance the flow may start but error out if the label is wrong? I haven’t given much attention to that possibility.
The DEBUG logs thing struck me as well; do you have any more info about how to set it up? I could only find a reference to the env var, so I might have misunderstood something.
@Anna_Geller : Gotcha, thanks for explaining the issue with the labels.
To enable debug logs for the agent, you could add this environment variable to the agent deployment file:
env:
- name: PREFECT__LOGGING__LEVEL
  value: 'DEBUG'
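To get debug logs from the flow run itself as well, the same variable can be set on the flow’s run config - a sketch, assuming you use KubernetesRun (the label is a placeholder):

from prefect.run_configs import KubernetesRun

# Pass the logging level through to the flow-run job
flow.run_config = KubernetesRun(
    labels=["aks"],  # placeholder label
    env={"PREFECT__LOGGING__LEVEL": "DEBUG"},
)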
I shared the issue with our team, I’ll let you know if I get any pointers on what may be wrong in your setup.
@Henning_Holgersen : Just to add the answer here, which (as was predicted) has nothing to do with Prefect: our AKS runs behind a firewall, but the AKS endpoint that Azure creates is public. The firewall needs an outbound rule for the "API server address" (probably ending in azmk8s.io) that is listed in the portal. Prefect needs to reach this endpoint in order to create new jobs (= flow runs). The needed firewall rule is mentioned in the Azure documentation but is hard to find. And the error message was completely undecipherable.
@Anna_Geller : Thank you so much for the update @Henning_Holgersen ! But to be honest, I still don’t get it. Can you explain what exactly you did to fix the issue? For example:
• which ports did you have to open on Azure to allow outbound traffic to Prefect Cloud API?
• what extra networking configuration/routes did you have to set up in Azure?
• can you share the link to this elusive documentation that was helpful to fix your issue?
Is your Kubernetes agent working now?
@Henning_Holgersen : I don’t actually have the documentation; I simply met the right person with a lot of Azure experience who recognized the situation and said he had spent two weeks debugging it when he first encountered it. He also said it was mentioned in the docs somewhere, but didn’t provide the link and… I can’t find it.
In short, the firewall rule is on port 443, outbound from the AKS cluster IP (or IP range) to the <something>.azmk8s.io domain that is listed as the "API Server Address" of the cluster. The address can be found on the AKS overview page in the portal. Depending on the firewall rule type, you might have to find the IP behind the API server domain and use that in the rule instead of the URL, but that’s a detail.
No extra rules to Prefect Cloud were needed; the trusty old api.prefect.io:443 outbound rule was enough once the AKS/k8s firewall rule above had been sorted.
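If anyone wants to sanity-check this, a quick reachability probe run from a pod inside the cluster might look like the sketch below (the hostname is a placeholder; any HTTP response, even 401/403, means the firewall lets the traffic through):

import requests

# The "API Server Address" from the AKS overview page (placeholder value)
API_SERVER = "https://my-cluster-dns-12345678.hcp.westeurope.azmk8s.io"

try:
    # Certificate verification is skipped because this is purely a connectivity test
    resp = requests.get(API_SERVER + "/version", timeout=10, verify=False)
    print("Reached the API server, status:", resp.status_code)
except requests.exceptions.RequestException as exc:
    print("Could not reach the API server - likely a firewall/routing issue:", exc)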
@Anna_Geller : Thank you so much. I’ll export that to Discourse for posterity.