View in #prefect-community on Slack
@Gabriel_Milan: Hi everyone! I’ve got two agents on my k8s cluster, both of them in the prefect namespace, and they work fine. I needed to add one more agent to the cluster, but this one should live in another namespace, say prefect-agent-xxxx. When I do this, I can successfully submit runs to it and they do get deployed, but they don’t seem to actually run and no logs are shown. I’ve tried configuring the Apollo URL to http://prefect-apollo.prefect.svc.cluster.local:4200 and also setting an ExternalName service pointing to it in the prefect-agent-xxxx namespace, but neither works. Any ideas on how I could debug this?
@Kevin_Kho: Maybe put debug level logs on the agent?
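(A minimal sketch of one way to do that, by adding env variables to the agent’s Deployment; the double-underscore keys below follow Prefect 1.x config naming conventions but are an assumption here:)
env:
  - name: PREFECT__LOGGING__LEVEL
    value: DEBUG   # assumed key: raises the log level of the agent process
  - name: PREFECT__CLOUD__AGENT__LEVEL
    value: DEBUG   # assumed key: agent-specific log level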
@Gabriel_Milan: agent logs seem to be fine
it’s just the jobs that keep “running” forever, and then I eventually get this on the UI
@Kevin_Kho: Oh, I think debug-level logs might give more insight, but that does look like it’s just unable to get compute. Not sure how to debug this off the top of my head; let’s see if the community chimes in
@Matthias: Maybe try spinning up an agent with --show-flow-logs. It could give you more insights into the issue
@Gabriel_Milan: unfortunately, this option doesn’t work here; I’m launching the agent using prefect agent kubernetes start --job-template {{ .Values.agent.jobTemplateFilePath }}
@Matthias: And why can’t you add the additional argument? The job template only references the manifest of the job that the agent is supposed to submit.
@Gabriel_Milan: because this is not actually an option
@Matthias: What you could do is change the deployment spec manually and deploy that one (just for debugging)
@Gabriel_Milan: That’s what I’m trying to do: I’m changing the deployment command to add the option you mentioned, but it doesn’t work
@Matthias: Oh yeah, my mistake! It is not part of the agent code
@Anna_Geller: I haven’t quite understood what exactly the issue is here - do your flow run pods die instantly because they can’t talk to your Server API?
It seems that you have successfully deployed your third agent to a separate namespace.
Some questions:
Q1: What label did you assign to that agent? Your flow runs are correctly deployed and we can see that in the agent logs, so the label shouldn’t be an issue, but it’s still worth sharing for debugging.
Q2: Can you inspect the flow run pods and check their logs? You could list the Kubernetes jobs and pods deployed in that namespace and inspect them - see the kubectl sketch after these questions. Are flow run and task run states getting updated in your Server backend? You could check that in your Server component logs.
Q3: You wrote “it doesn’t seem to actually run and no logs are shown” - what doesn’t run? Do you mean you don’t see the flow run logs and updates being reflected in your Server UI?
Q4: Where did you configure your Server Apollo endpoint - did you set it in the agent manifest as an env variable, as shown here?
env:
- name: PREFECT__CLOUD__AGENT__AUTH_TOKEN
value: ''
- name: PREFECT__CLOUD__API
value: "http://[prefect-apollo.prefect.svc.cluster.local](http://prefect-apollo.prefect.svc.cluster.local:4200/):4200/graphql" # paste your GraphQL Server
- name: PREFECT__BACKEND
value: server
Q5: How did you configure your flow runs that got deployed to this agent (KubernetesRun)?
Q6: Didn’t you explicitly set the namespace when deploying the YAML file of the KubernetesAgent?
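(A minimal kubectl sketch for the inspection suggested in Q2, assuming the third agent’s namespace is prefect-agent-xxxx; <flow-run-pod> is a placeholder:)
# list the flow-run Jobs and Pods the agent created in its namespace
kubectl get jobs,pods -n prefect-agent-xxxx
# describe a stuck flow-run pod; the Events section often reveals scheduling, image-pull or sidecar issues
kubectl describe pod <flow-run-pod> -n prefect-agent-xxxx
# check the flow-run container's logs directly
kubectl logs <flow-run-pod> -n prefect-agent-xxxx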
Some immediate ideas to check/inspect or try:
I would recommend creating a manifest file using:
prefect agent kubernetes install --rbac > third_agent.yaml
Then adjusting the env variables as above and deploying it to a desired namespace this way:
kubectl apply -f third_agent.yaml -n yournamespace
Then, all the flow run Kubernetes jobs should also be deployed to this namespace.
Then, only networking and permissions remain to be sorted out so that your flow run pods can talk to your Server across namespaces, and your Service with ExternalName seems like the right solution:
kind: Service
apiVersion: v1
metadata:
name: server-third-agent
namespace: yournamespace
spec:
type: ExternalName
externalName: prefect-apollo.prefect.svc.cluster.local
ports:
- port: 80
- port: 443
- port: 4200
I’m particularly guessing here when it comes to ports - I have no idea exactly which ports would need to be open, but I’d like to open this up for discussion; it may be an issue with ports.
Can you then also check the logs of your Server components and of the service above to see if there are any errors or missing permissions there?
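(One way to test that networking piece in isolation - a sketch assuming the ExternalName service above is deployed as server-third-agent in yournamespace, and using curlimages/curl as an arbitrary image that ships curl:)
# run a throwaway pod in the agent's namespace and hit Apollo through the ExternalName service
kubectl run curl-test --rm -it --restart=Never -n yournamespace \
  --image=curlimages/curl -- \
  curl -sv http://server-third-agent:4200/graphql
# any HTTP response proves DNS and routing work; a timeout points at networking rather than Prefect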
Finally, I would check RBAC on your third agent. It may also be an issue of a missing RoleBinding to bind your third agent’s permissions to your Server’s namespace (or both namespaces):
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: prefect-agent-rbac
namespace: default # add your prefect-agent-xxxx here
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: Role
name: prefect-agent-rbac
subjects:
- kind: ServiceAccount
name: default
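(For completeness, a sketch of the Role this RoleBinding would point to; the rules roughly mirror what prefect agent kubernetes install --rbac generates - the agent needs to manage Jobs and read Pods and events - but treat the exact resource list and verbs as an assumption:)
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: prefect-agent-rbac
  namespace: default  # the namespace the agent should be allowed to act in
rules:
  - apiGroups: ["batch", "extensions"]
    resources: ["jobs"]
    verbs: ["*"]      # create, watch and clean up the flow-run Jobs
  - apiGroups: [""]
    resources: ["pods", "events"]
    verbs: ["*"]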
@Gabriel_Milan: Before I proceed to the questions: no, the flow run pods don’t die. They’re there “running” forever.
Q1: the datario label
Q2: there are no logs whatsoever in the flow run pods. I can only see a change of state in the UI when the run is actually submitted, but nothing else. Where could I get Server logs?
Q3: yes, and there are also no logs in the pods themselves
Q4: the only env var you’ve shown that is not set on my agent is PREFECT__CLOUD__AGENT__AUTH_TOKEN, could that be a problem? All of my other agents work without it
Q5: I’ve set them using flow.run_config = KubernetesRun(image=constants.DOCKER_IMAGE.value). This constants.DOCKER_IMAGE.value is a valid Docker image, the same one I’m using for the other agents
Q6: I’ve deployed it by running helm upgrade --install prefect-agent -n <namespace> <mychart> -f values.yaml. The chart I’m using is this one and my values.yaml file looks like this:
agent:
apollo_url: http://prefect-apollo.prefect.svc.cluster.local:4200/
env: []
image:
name: prefecthq/prefect
tag: 0.15.9
job:
resources:
limits:
cpu: ''
memory: ''
requests:
cpu: ''
memory: ''
jobTemplateFilePath: myjobtemplateurl.yaml
name: prefect-agent
prefectLabels:
- datario
replicas: 1
resources:
limits:
cpu: 100m
memory: 128Mi
serviceAccountName: prefect-agent
the job template looks like this
apiVersion: batch/v1
kind: Job
spec:
template:
spec:
containers:
- name: flow
envFrom:
- secretRef:
name: gcp-credentials
- secretRef:
name: vault-credentials
env:
- name: GOOGLE_APPLICATION_CREDENTIALS
value: /mnt/creds.json
volumeMounts:
- name: gcp-sa
mountPath: /mnt/
readOnly: true
volumes:
- name: gcp-sa
secret:
secretName: gcp-sa
and all of those secrets are properly configured.
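(A quick sanity check, assuming the agent namespace is prefect-agent-xxxx - the referenced secrets must exist in the same namespace the flow-run pods run in:)
kubectl get secret gcp-credentials vault-credentials gcp-sa -n prefect-agent-xxxx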
Finally, I just wanted to add that I’ll check those steps you’ve mentioned and get back asap
Alright, I found it out. Turns out the “issue” was our Docker image for runs: as we’re using linkerd for deploying agents on multiple k8s clusters, our image uses linkerd-await to block on linkerd readiness. This third agent was deployed in a non-linkerd-injected namespace, so it “awaited” forever on readiness. That’s why our run pods would never die, show logs, or update their state. After I injected the namespace with linkerd, everything works. Thank you so much for the effort in understanding our scenario and all the help!
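(For anyone hitting the same thing - a sketch of namespace-level injection; the linkerd.io/inject annotation is linkerd’s standard mechanism, while the namespace and deployment names here are assumptions:)
# mark the namespace so linkerd's proxy injector adds a sidecar to new pods created there
kubectl annotate namespace prefect-agent-xxxx linkerd.io/inject=enabled
# new flow-run Jobs pick this up automatically; restart the agent so its own pod gets the proxy too
kubectl rollout restart deployment/prefect-agent -n prefect-agent-xxxx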
@Anna_Geller: I’m glad you found it out, since I’ve never heard of linkerd. It sounds like you spent a lot of time configuring all this and you know a lot about managing Prefect Server with Kubernetes and Helm. Did you think about writing this up in a GitHub repo, a README, or a blog post? Absolutely no pressure, but if you’d like to share your setup, I’m sure many users could benefit from your knowledge, in whichever form you’d choose. Even opening a topic on discourse.prefect.io with a couple of bullet points and code snippets might be insightful.
@Gabriel_Milan: This is something we’re planning to do in the near future for the city hall. I’ll be glad to translate it and share it with you then!