When using a KubernetesAgent on Azure AKS with Prefect Cloud, I got an error: "'Connection aborted.', OSError(0, 'Error')"

Problem description

User’s setup: Prefect Cloud with a KubernetesAgent on Azure AKS, running in a custom isolated VNet whose firewall allows only outbound connections on port 443 to api.prefect.io.

The agent deploys successfully, and the user can confirm in the Cloud UI that it polls the API for new flow runs every 10 seconds. But whenever they run a flow, they get the error 'Connection aborted.', OSError(0, 'Error'), and the flow run pod does not even appear to be created. So far I have suggested:

  • trying the Azure Marketplace template for the KubernetesAgent - they declined because they want to manage everything as Infrastructure as Code.
  • enabling debug log level on the agent - the user said this did not reveal any new information.
  • sharing their agent deployment YAML file - it looks very standard, except that the agent has no labels. Since the agent picked up the flow run, we can assume there is no label mismatch.
  • confirming that the cluster can reach the Internet to download the image (to rule out the Kubernetes job failing to pull the base Prefect image) - the user responded “I don’t see any attempts at creating a pod. Usually, I’d see imagePullBackoff during pod creation, but this seems to fail before we get there.”

The user also added: “I have also tried a separate template (sidecar pattern), and deploying to a preexisting dask cluster (in case I somehow wasn’t allowed to create sidecars). Because of firewalls, I’m having trouble accessing k8s logs. I’m back to networking as the most likely culprit: As mentioned our firewall is very restrictive, and has enabled only traffic out (and the response) to api.prefect.io port 443. Could there be anything else missing? As mentioned, the agent pops up in the UI as if everything was fine.”

Solution

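The missing piece was a firewall rule allowing outbound traffic on port 443 from the AKS cluster IP (or IP range) to the cluster’s own Kubernetes API server, i.e. the <something>.azmk8s.io (or similar) domain that is listed as the “API Server Address” of the cluster. The agent has to call this API to create the flow run job, which explains why it could poll Prefect Cloud yet fail before any pod was created. The address can be found on the AKS overview page in the Azure portal. Depending on the firewall rule type, you might have to look up the IP behind the API server domain and use that in the rule instead of the domain name, but that’s a detail.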
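
If your firewall only accepts IP-based rules, a small sketch like the one below can look up the addresses behind the API server domain. It uses only the Python standard library; the function name resolve_ips is made up for illustration, and the hostname is a placeholder you would replace with the “API Server Address” from your own AKS overview page. Keep in mind that the resolved IPs may change over time, so treat the result as a starting point rather than a permanent value.

```python
import socket
from typing import List


def resolve_ips(hostname: str, port: int = 443) -> List[str]:
    """Return the sorted, de-duplicated IP addresses that `hostname` resolves to.

    `resolve_ips` is a hypothetical helper name, not part of Prefect or Azure.
    """
    # getaddrinfo returns tuples of (family, type, proto, canonname, sockaddr);
    # the first element of sockaddr is the IP address for both IPv4 and IPv6.
    infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})


# Example usage (placeholder hostname, replace with your API Server Address):
# print(resolve_ips("<your-api-server-address>"))
```

Run it from a machine that uses the same DNS as the cluster, since a private cluster may resolve the API server name differently from the public Internet.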

No extra rules for Prefect Cloud were needed: the existing api.prefect.io:443 outbound rule was enough once the AKS API server firewall rule above was in place.
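
To check that both outbound paths are open, something like the following sketch could be run from inside the cluster network, for example from a temporary pod. It only tests whether a TCP connection on port 443 can be opened; it does not verify TLS or authentication. The function name can_connect is hypothetical, and the API server hostname is a placeholder.

```python
import socket


def can_connect(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port can be opened within `timeout`.

    `can_connect` is a hypothetical helper name used only for this illustration.
    """
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


# Example usage (the second hostname is a placeholder):
# for host in ("api.prefect.io", "<your-api-server-address>"):
#     print(host, "reachable" if can_connect(host) else "blocked or unreachable")
```

If api.prefect.io is reachable but the API server address is not, that matches the symptom described above: the agent shows up in the UI, but no flow run pod is ever created.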

Detailed Slack discussion