@Omar_Sultan: Hi everyone, we have a Prefect Server running on Kubernetes; the setup was done using the Helm chart. Everything is running smoothly, but occasionally we get this error:
File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 341, in _raise_timeout
self, url, "Read timed out. (read timeout=%s)" % timeout_value
urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='prefect-apollo.prefect', port=4200): Read timed out. (read timeout=15)
This happens especially when we use the StartFlowRun task. It does not happen very often, but I was wondering if there is a way to force a retry, or if anyone knows why this would be happening? Thanks!
@Anna_Geller: This error means that the backend took too long to respond to an API request; e.g., the task run state updates may be taking too long.
Here is how you can increase the read timeout on Server (a sketch of the env variable follows these links):
• Is it possible to increase the GraphQL API request timeout?
• How to check a GraphQL query timeout settings on Server? How to increase that timeout value?
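For reference, the read timeout in the traceback above appears to come from the client-side setting config.cloud.request_timeout, and Prefect maps config keys to environment variables of the form PREFECT__SECTION__KEY, so as a sketch (the 60-second value mirrors the documented maximum mentioned further down) the env entry would look like:
- name: PREFECT__CLOUD__REQUEST_TIMEOUT
  value: "60"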
To set retries on the StartFlowRun task, you can do:
from datetime import timedelta
from prefect import Flow
from prefect.tasks.prefect import StartFlowRun

# retry the child-flow trigger up to 3 times, 5 minutes apart
start_flow_run = StartFlowRun(
    project_name="PROJECT_NAME", wait=True, max_retries=3, retry_delay=timedelta(minutes=5)
)

with Flow("FLOW_NAME") as flow:
    staging = start_flow_run(flow_name="child_flow_name")
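Note that with wait=True the parent flow run keeps polling the backend for the child flow run's state until it finishes, so that polling is also where a slow backend response can surface as the read timeout above.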
@Omar_Sultan: Thank you so much for that
Hey Anna, quick follow-up: I applied the change in the Helm values and applied the configuration, and I can now see the env variable being assigned to the Apollo pods. However, when I print
prefect.context.config.cloud.request_timeout
from any task that runs on the server, it still shows 15. Is there anything else I need to apply? Do I need to restart the agent pod, for example?
@Anna_Geller: You may restart the Apollo pod so that it picks up the changes. But I'm not sure what the max allowed value is here; what did you set?
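For example, a rollout restart recreates the pod with the new env values; the deployment and namespace names here are assumptions based on the prefect-apollo.prefect host in your traceback:
kubectl rollout restart deployment/prefect-apollo -n prefect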
@Omar_Sultan: I set it to 60; I believe the documentation said that the max allowed value was 60.
@Anna_Geller: I see. If this doesn't help, I'm afraid we have to find the root cause of those timeouts. Can you share your flow with the StartFlowRun task that times out? Could it be that the child flow run gets stuck, and the parent flow run polling for its status at some point times out?
@Omar_Sultan: Hey Anna, I've been doing some investigating, and it seems that this error only happens right around the same time the Hasura pod tries to submit telemetry. Because I am working in a closed environment without internet access, this operation times out, and it seems it's causing the flows to time out as well. Not sure if that makes any sense.
I can see from the Hasura documentation that telemetry can be disabled; however, I'm not sure how I can do that in the Helm values file. Any ideas?
@Anna_Geller: Nice work finding that out! There is an easy way to disable it; check out this page: Telemetry | Prefect Docs
In the Helm chart, you should be able to set an environment variable:
- name: PREFECT__SERVER__TELEMETRY__ENABLED
  value: "false"
@Omar_Sultan: So I followed the link above, and it disabled telemetry for Prefect and was reflected correctly. But I believe the issue is still coming from the Hasura image itself; it needs an env variable called HASURA_GRAPHQL_ENABLE_TELEMETRY set to false in the pod. I'm looking at the values.yaml of the Helm chart but can't find anywhere to pass env variables to the service.
sorry to keep bugging you with this
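As a side note, one way to test this outside of the chart would be to set the variable on the Hasura deployment directly (hypothetical deployment and namespace names):
kubectl set env deployment/prefect-hasura HASURA_GRAPHQL_ENABLE_TELEMETRY=false -n prefect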
@Anna_Geller: I have a meeting with a Prefect employee who knows more about this telemetry thing in just 10 minutes; I'll ask and update you afterwards.
@Omar_Sultan: Wow, nice, thank you so much!
@Anna_Geller: @Omar_Sultan sorry for getting back to you a little later than planned: I checked, and this is the part of the code where the telemetry info is set.
You can see there that the environment variable PREFECT_SERVER__TELEMETRY__ENABLED is the right one. I seem to have mistyped the underscores earlier (there should be only one underscore between PREFECT and SERVER), which may be the reason why this didn't work for you.
Also, once you set that, you may need to restart the Apollo pod to make sure the changes get applied.
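With the corrected name, the env entry from earlier would read:
- name: PREFECT_SERVER__TELEMETRY__ENABLED
  value: "false"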
To set it on the Helm chart, see server/values.yaml at master · PrefectHQ/server · GitHub, e.g.:
helm upgrade \
$NAME \
prefecthq/prefect-server \
--set apollo.options.telemetryEnabled=false
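After the upgrade, you can verify that the variable landed on the pod, e.g. (hypothetical deployment and namespace names):
kubectl exec -n prefect deploy/prefect-apollo -- printenv | grep TELEMETRY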