Network issues when deploying a new Kubernetes agent in a different namespace than all Prefect Server components

View in #prefect-community on Slack

Gabriel_Milan @Gabriel_Milan: Hi everyone! I’ve got two agents on my k8s cluster; both of them are in the prefect namespace and work fine. I needed to add one more agent to the cluster, but this one should be in another namespace, say prefect-agent-xxxx. When I do this, I can successfully submit runs to it and they do get deployed, but they don’t seem to actually run and no logs are shown. I’ve tried configuring the Apollo URL to http://prefect-apollo.prefect.svc.cluster.local:4200 and also setting an ExternalName to it in the prefect-agent-xxxx namespace and using that, but neither works. Any ideas on how I could debug this?
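
A quick sanity check for this kind of problem is to confirm that the Apollo hostname resolves from inside the new namespace, for example with a throwaway pod - a minimal sketch, assuming the namespace is called prefect-agent-xxxx as above; the pod name and image are only for illustration:

# Start a temporary pod in the agent's namespace and resolve the Apollo service name
kubectl run dns-check --rm -it --restart=Never \
  --image=busybox:1.36 -n prefect-agent-xxxx -- \
  nslookup prefect-apollo.prefect.svc.cluster.local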

Kevin_Kho @Kevin_Kho: Maybe put debug level logs on the agent?
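
One way to do that with a running Kubernetes agent (a sketch, assuming Prefect 0.15.x and that the agent deployment and namespace are named as in this thread; names may differ in your setup) is to set Prefect’s logging level env variable on the agent deployment:

# Bump the agent's log level to DEBUG; the deployment rolls out with the new env var
kubectl set env deployment/prefect-agent PREFECT__LOGGING__LEVEL=DEBUG -n prefect-agent-xxxx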

Gabriel_Milan @Gabriel_Milan: agent logs seem to be fine

it’s just the jobs that keep "running" forever, and then I eventually get this in the UI

Kevin_Kho @Kevin_Kho: Oh, I think debug-level logs might give more insight, but that does look like it’s just unable to get compute. Not sure how to debug off the top of my head; let’s see if the community chimes in

Matthias @Matthias: Maybe try spinning up an agent with --show-flow-logs. It could give you more insights into the issue

Gabriel_Milan @Gabriel_Milan: this option doesn’t work here, unfortunately; I’m launching the agent using prefect agent kubernetes start --job-template {{ .Values.agent.jobTemplateFilePath }}

Matthias @Matthias: And why can’t you add the additional argument? The job template only references the manifest of the job that the agent is supposed to submit.

Gabriel_Milan @Gabriel_Milan: because this is not actually an option

Matthias @Matthias: What you could do is change the deployment spec manually and deploy that one (just for debugging)

Gabriel_Milan @Gabriel_Milan: That’s what I’m trying to do: I’m changing the deployment command to add the option you mentioned, but it doesn’t work

Matthias @Matthias: Oh yeah, my mistake! That option is not part of the Kubernetes agent code

@Anna_Geller: I haven’t quite understood what exactly the issue is here - do your flow run pods die instantly because they can’t talk to your Server API?

It seems that you have successfully deployed your third agent to a separate namespace.

Some questions:
Q1: What label did you assign to that agent?
Your flow runs are correctly deployed and we can see that in the agent logs, so the label shouldn’t be the issue, but it’s still worth sharing for debugging.
Q2: Can you inspect the flow run pods and check the logs there? You could list the Kubernetes jobs and pods deployed in this namespace and inspect them (see the kubectl commands after this list). Are flow run and task run states getting updated in your Server backend? You could potentially check that in your Server logs.
Q3: You wrote “it doesn’t seem to actually run and no logs are shown” - what doesn’t run? Do you mean you don’t see the flow run logs and updates being reflected in your Server UI?
Q4: Where did you configure your Server Apollo endpoint - did you set it in the agent manifest as an env variable, as shown here?

env:
  - name: PREFECT__CLOUD__AGENT__AUTH_TOKEN
    value: ''
  - name: PREFECT__CLOUD__API
    value: "http://[prefect-apollo.prefect.svc.cluster.local](http://prefect-apollo.prefect.svc.cluster.local:4200/):4200/graphql" # paste your GraphQL Server
  - name: PREFECT__BACKEND
    value: server

Q5: How did you configure your flow runs that got deployed to this agent (KubernetesRun)?
Q6: Didn’t you explicitly set the namespace when deploying the YAML file of the KubernetesAgent?
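
For Q2, something along these lines should show whether the flow run jobs and pods look healthy (a sketch; the namespace and pod name are placeholders):

# List the Kubernetes jobs and pods the agent created in its namespace
kubectl get jobs,pods -n prefect-agent-xxxx

# Inspect events (image pulls, scheduling) and logs for one flow run pod
kubectl describe pod <flow-run-pod-name> -n prefect-agent-xxxx
kubectl logs <flow-run-pod-name> -n prefect-agent-xxxx
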
Some immediate ideas to check/inspect or try:
I would recommend creating a manifest file using:

prefect agent kubernetes install --rbac > third_agent.yaml

Then adjusting the env variables as above and deploying it to a desired namespace this way:

kubectl apply -f third_agent.yaml -n yournamespace

Then, all the flow run Kubernetes jobs should also be deployed to this namespace.

Then only networking and permission issues remain: your flow run pods need to be able to talk to your Server in a separate namespace, and your Service with ExternalName seems like the right approach.

kind: Service
apiVersion: v1
metadata:
  name: server-third-agent
  namespace: yournamespace
spec:
  type: ExternalName
  externalName: prefect-apollo.prefect.svc.cluster.local
  ports:
  - name: http
    port: 80
  - name: https
    port: 443
  - name: apollo
    port: 4200

I’m mostly guessing here when it comes to ports - I have no idea exactly which ports would need to be open, but I’d like to open this up for discussion; it may be an issue with ports.
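
One way to take the guessing out of the ports question (a sketch; this assumes the ExternalName Service above has been created in yournamespace) is to curl the GraphQL endpoint from a temporary pod in that namespace - any HTTP response at all, even an error page, proves the route and port are reachable:

# Temporary curl pod in the agent's namespace, hitting Apollo through the ExternalName Service
kubectl run curl-check --rm -it --restart=Never \
  --image=curlimages/curl --command -n yournamespace -- \
  curl -sv http://server-third-agent:4200/graphql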

Can you then also check the logs of your Server components and of the above Service to see if there are any errors or missing permissions there?
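
The Server component logs can usually be pulled straight from the deployments in the Server’s namespace - a sketch, assuming deployments named prefect-apollo and prefect-graphql in the prefect namespace; names may differ in your install:

# Tail the Apollo and GraphQL logs and look for failed requests coming from the new namespace
kubectl logs deploy/prefect-apollo -n prefect --tail=100
kubectl logs deploy/prefect-graphql -n prefect --tail=100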

Finally, I would check RBAC on your third agent. It may also be an issue of missing RoleBinding to bind your third agent’s permissions to your Server’s namespace (or both namespaces):

apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: prefect-agent-rbac
  namespace: default # add your prefect-agent-xxxx here
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: prefect-agent-rbac
subjects:
- kind: ServiceAccount
  name: default
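
A quick way to verify the RBAC side (a sketch; the service account and namespace names are placeholders) is to ask the API server whether the agent’s service account is allowed to manage jobs:

# Check whether the agent's service account can create jobs and list pods in its namespace
kubectl auth can-i create jobs \
  --as=system:serviceaccount:prefect-agent-xxxx:prefect-agent \
  -n prefect-agent-xxxx
kubectl auth can-i list pods \
  --as=system:serviceaccount:prefect-agent-xxxx:prefect-agent \
  -n prefect-agent-xxxx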

Gabriel_Milan @Gabriel_Milan: Before I proceed to the questions: no, the flow run pods don’t die. They’re there “running” forever.

Q1: datario label
Q2: there are no logs whatsoever in the flow run pods. I can only see a change of state in the UI when the run is actually submitted, but nothing else. Where could I get the Server logs?
Q3: yes, and there are also no logs on the pods themselves
Q4: the only env that you’ve shown that is not set on my agent is PREFECT__CLOUD__AGENT__AUTH_TOKEN, could that be a problem? all of my other agents work without it
Q5: I’ve set them using flow.run_config = KubernetesRun(image=constants.DOCKER_IMAGE.value). This constants.DOCKER_IMAGE.value is a valid Docker image, the same one I’m using with my other agents
Q6: I’ve deployed it by doing helm upgrade --install prefect-agent -n <namespace> <mychart> -f values.yaml . The chart I’m using is this one and my values.yaml file looks like this:

agent:
  apollo_url: http://prefect-apollo.prefect.svc.cluster.local:4200/
  env: []
  image:
    name: prefecthq/prefect
    tag: 0.15.9
  job:
    resources:
      limits:
        cpu: ''
        memory: ''
      requests:
        cpu: ''
        memory: ''
  jobTemplateFilePath: myjobtemplateurl.yaml
  name: prefect-agent
  prefectLabels:
  - datario
  replicas: 1
  resources:
    limits:
      cpu: 100m
      memory: 128Mi
  serviceAccountName: prefect-agent

the job template looks like this

apiVersion: batch/v1
kind: Job
spec:
  template:
    spec:
      containers:
        - name: flow
          envFrom:
            - secretRef:
                name: gcp-credentials
            - secretRef:
                name: vault-credentials
          env:
            - name: GOOGLE_APPLICATION_CREDENTIALS
              value: /mnt/creds.json
          volumeMounts:
            - name: gcp-sa
              mountPath: /mnt/
              readOnly: true
      volumes:
        - name: gcp-sa
          secret:
            secretName: gcp-sa

and all of those secrets are properly configured.
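
For anyone reading along, the secrets referenced by the job template can be double-checked in the agent’s namespace with something like this (namespace is a placeholder):

# Confirm the secrets referenced by the job template exist in the agent's namespace
kubectl get secret gcp-credentials vault-credentials gcp-sa -n prefect-agent-xxxx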

Finally, I just wanted to add that I’ll check those steps you’ve mentioned and get back asap

Alright, I found it out. Turns out the "issue" was our Docker image for runs: as we’re using linkerd for deploying agents on multiple k8s clusters, our image uses linkerd-await to block on linkerd readiness. This third agent was deployed in a non-linkerd-injected namespace, thus "awaiting" forever on readiness. That’s why our run pods would never die, show logs, or update their state. After I injected the namespace with linkerd, everything works. Thank you so much for the effort on understanding our scenario and all the help!
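
For reference, the fix described above boils down to enabling linkerd’s automatic proxy injection on the new namespace and restarting the workloads there so the sidecar gets added - a sketch, with placeholder namespace and deployment names:

# Mark the namespace for automatic linkerd sidecar injection
kubectl annotate namespace prefect-agent-xxxx linkerd.io/inject=enabled

# Restart the agent so new pods (and subsequent flow run jobs) get the proxy injected
kubectl rollout restart deployment/prefect-agent -n prefect-agent-xxxx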

@Anna_Geller: I’m glad you found it out since I’ve never heard of linkerd :sweat_smile: It sounds like you spent a lot of time setting all this up and you know a lot about managing Prefect Server with Kubernetes and Helm. Did you think about writing this up as a GitHub repo, README, or a blog post? :slight_smile: Absolutely no pressure, but if you’d like to share your setup, I’m sure many users could benefit from your knowledge, in whichever form you’d choose to share it. Even opening a topic on discourse.prefect.io with a couple of bullet points and code snippets might be insightful.

Gabriel_Milan @Gabriel_Milan: This is something we’re planning to do in the near future for the city hall. I’ll be glad to translate it and share it with you then!