OSError: [Errno 24] "Too many open files" when running a docker agent for some time

Hello there,

I have deployed my first flow in production with Prefect Cloud and a Docker agent.

As per the command below, I use a volume when running the agent.

prefect agent docker start --show-flow-logs --env PATH_TO_FILES=/path/to/file --volume /home/Skam/path/to/file:/path/to/file

Everything works fine; however, every 2-3 days the runs start to fail with the error docker.errors.DockerException: Credentials store error: StoreError('Unexpected OS error "Too many open files", errno=24') (full stack trace at the end).

This is not a big problem: restarting the agent and rescheduling the failed runs works, but I’d rather not hit the error in the first place :grin:

A quick Google search does not reveal much (an SO thread that advises using a now-deprecated --ulimit argument), so I am coming here for advice!

Full stack trace:

[2022-05-16 09:10:00,005] INFO - agent | Deploying flow run 9f0890cb-e068-4ad2-b2a6-bea91ac70e0f to execution environment...
[2022-05-16 09:10:00,357] INFO - agent | Pulling image <Image registry>
[2022-05-16 09:10:00,359] ERROR - agent | Exception encountered while deploying flow run 9f0890cb-e068-4ad2-b2a6-bea91ac70e0f
Traceback (most recent call last):
  File "/home/lixoloic/.local/lib/python3.9/site-packages/docker/credentials/store.py", line 76, in _execute
    output = subprocess.check_output(
  File "/usr/lib/python3.9/subprocess.py", line 424, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/usr/lib/python3.9/subprocess.py", line 505, in run
    with Popen(*popenargs, **kwargs) as process:
  File "/usr/lib/python3.9/subprocess.py", line 951, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/usr/lib/python3.9/subprocess.py", line 1722, in _execute_child
    errpipe_read, errpipe_write = os.pipe()
OSError: [Errno 24] Too many open files

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/Skam/.local/lib/python3.9/site-packages/docker/auth.py", line 262, in _resolve_authconfig_credstore
    data = store.get(registry)
  File "/home/Skam/.local/lib/python3.9/site-packages/docker/credentials/store.py", line 33, in get
    data = self._execute('get', server)
  File "/home/Skam/.local/lib/python3.9/site-packages/docker/credentials/store.py", line 89, in _execute
    raise errors.StoreError(
docker.credentials.errors.StoreError: Unexpected OS error "Too many open files", errno=24

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/Skam/.local/lib/python3.9/site-packages/prefect/agent/agent.py", line 388, in _deploy_flow_run
    deployment_info = self.deploy_flow(flow_run)
  File "/home/Skam/.local/lib/python3.9/site-packages/prefect/agent/docker/agent.py", line 387, in deploy_flow
    pull_output = self.docker_client.pull(image, stream=True, decode=True)
  File "/home/Skam/.local/lib/python3.9/site-packages/docker/api/image.py", line 409, in pull
    header = auth.get_config_header(self, registry)
  File "/home/Skam/.local/lib/python3.9/site-packages/docker/auth.py", line 45, in get_config_header
    authcfg = resolve_authconfig(
  File "/home/Skam/.local/lib/python3.9/site-packages/docker/auth.py", line 322, in resolve_authconfig
    return authconfig.resolve_authconfig(registry)
  File "/home/Skam/.local/lib/python3.9/site-packages/docker/auth.py", line 233, in resolve_authconfig
    cfg = self._resolve_authconfig_credstore(registry, store_name)
  File "/home/Skam/.local/lib/python3.9/site-packages/docker/auth.py", line 278, in _resolve_authconfig_credstore
    raise errors.DockerException(
docker.errors.DockerException: Credentials store error: StoreError('Unexpected OS error "Too many open files", errno=24')
[2022-05-16 09:10:00,617] ERROR - agent | Updating flow run 9f0890cb-e068-4ad2-b2a6-bea91ac70e0f state to Failed...

Hi @Skam, great to have you here on Discourse! And thanks for this great write-up of your issue! :clap:

If you are just getting started with Prefect, perhaps it’s easier to start directly with Prefect 2.0?

Regarding the error itself, before we jump into possible fixes, I wonder why so many files get opened and never closed. Could you explain your setup a bit more?

  1. What registry are you pulling the images from?
  2. How did you configure credentials to that registry?
  3. Does this error happen on every flow run? Based on what you said it looks like you use some registry credentials that expire after 2-3 days - could this be the issue?

A quick and dirty solution might be to add

DefaultLimitNOFILE=100000

to
/etc/systemd/system.conf

But it would be better to understand why this error is occurring in the first place.
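One thing to keep in mind: the systemd default only applies to processes started after a `systemctl daemon-reexec` (or a reboot), so the agent needs to be restarted afterwards. To verify what limit a running process actually has, you can read its limits from /proc on Linux. A sketch, using /proc/self as a stand-in for the agent's PID:

```shell
# Check the per-process open-file limit a *running* process actually has.
# /proc/self is a stand-in here; substitute the agent's PID
# (e.g. /proc/730/limits) to inspect the agent itself.
grep "Max open files" /proc/self/limits

# Reminder: after editing /etc/systemd/system.conf, run
#   sudo systemctl daemon-reexec
# and restart the agent for the new default to take effect.
```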

Hi @anna_geller, thank you for the quick reply!

To answer your questions:
1 & 2. I am pulling images from a private GCP Artifact Registry. My Docker agent is running on a Google Compute Engine instance, so I did not have to authenticate to access the registry.
My docker config simply contains

{
  "credHelpers": {
    "europe-docker.pkg.dev": "gcloud"
  }
}
  3. This error seems to happen at regular intervals (every 2-3 days), but once it happens, it happens for all subsequent flow runs. As per points 1 & 2 above, I don’t think the credentials expire (GCP seems to handle that on every pull), and if that were the case, I don’t think restarting the agent would solve the problem?

I will try your proposed solution and come back to you if it works!

PS: Regarding Prefect 2.0, I was interested in using it, but the documentation recommends using 1.0 for production workloads, so I did not look further :slight_smile:

It would if the credentials were set during agent creation or shortly before.

The entire question of what is “production-ready” is a bit subjective :smile: but I understand.

Let me ask the team if someone has any other ideas.

For troubleshooting, could you also send me your DockerRun run config and storage configuration? (redact any sensitive info)

Oh yes, I get it. I understood your question as configuring credentials before running the Docker agent, but your point makes sense.

Here is the extract from my Python file that defines the storage (GitLab) and run_config:

import datetime

from prefect import Flow
from prefect.run_configs import DockerRun
from prefect.schedules import IntervalSchedule
from prefect.storage import GitLab

schedule = IntervalSchedule(interval=datetime.timedelta(minutes=10))

storage = GitLab(
    repo="Skamito/prefect-jobs",
    path="prefect_jobs/flows/myflow.py",
    access_token_secret="GITLAB_ACCESS_TOKEN",
    ref="main",
)

with Flow(
    "Scheduled flow",
    schedule=schedule,
    run_config=DockerRun(
        image="europe-docker.pkg.dev/:vX.XX.X", labels=[]
    ),
    storage=storage,
) as flow:
    some_task()
    ...

The flow is registered in Prefect Cloud via:

prefect register --project test -p prefect_jobs/flows/myflow.py --label None

Thank you!

Can you run lsof after your Docker agent has been running for a while? It will show open files, so we’ll have a better sense of files that we’re failing to close correctly.

To run this command you may need to first install the lsof package:

apt-get update
apt-get install lsof
# finally run the actual command:
lsof

Here is some extra info I got from a colleague:

  • Credential helpers are a simple subprocess-based protocol for handling authentication tokens.
  • When you run a command like docker pull, Docker executes the credential helper program with specific parameters to obtain a bearer token to use for authentication. This may leak file handles even if we never hit the max ulimit.
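For illustration, that subprocess protocol is easy to see in isolation. The sketch below uses a mock helper script to show the contract (the real helper in this setup is docker-credential-gcloud; the mock name and its /tmp path are purely hypothetical). Every such invocation spawns a subprocess with its own pipes, which is exactly where descriptors can leak:

```shell
# Mock credential helper illustrating the protocol Docker uses:
#   docker-credential-<name> <action>, with the registry URL on stdin.
cat > /tmp/docker-credential-mock <<'EOF'
#!/bin/sh
# The "get" action reads the server URL from stdin and must print
# a JSON object with the credentials on stdout.
read server
printf '{"ServerURL":"%s","Username":"_token","Secret":"redacted"}\n' "$server"
EOF
chmod +x /tmp/docker-credential-mock

# Docker invokes the configured helper this way on every pull:
echo "europe-docker.pkg.dev" | /tmp/docker-credential-mock get
```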

Could you also confirm how many containers your Docker agent is handling? It could be a scale-related issue.

To count the number of open files related to Prefect, you can try:

lsof -n | grep prefect | wc -l 

Then, you can check the docker-py code related to credentials; perhaps the issue lies there?

And the ulimit command can also help with troubleshooting.
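For example (a sketch; the values shown depend entirely on the host's configuration):

```shell
# Soft limit: the cap currently enforced for new processes in this shell
ulimit -Sn
# Hard limit: the ceiling the soft limit can be raised to without root
ulimit -Hn
```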

The agent only handles a single container at the moment.

From a Prefect agent that has been running for approximately 24 hours, I get ~1100 open files from this command. Looking into the details, I see a repetition of these types of entries:

prefect     730    Skam    47u    unix    0x000000003dbdff92      0t0     142533 type=STREAM
prefect     730    Skam    48w    FIFO    0,11                    0t0     20342  pipe

The pipe entries make up ~90% of the results, and type=STREAM a bit more than 5%.
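To get such a breakdown without counting by hand, you could tally the lsof output by its TYPE column. A sketch (the tally_fd_types helper name is an invention for this example; PID 730 is taken from the listing above, substitute your agent's PID):

```shell
# Tally lines of lsof output by the TYPE column (5th field:
# FIFO, unix, IPv4, REG, ...), most frequent type first.
tally_fd_types() {
  awk 'NR > 1 {print $5}' | sort | uniq -c | sort -rn
}

# Usage against the live agent (PID from the lsof listing above):
#   lsof -p 730 | tally_fd_types
```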

I’ll try to deep-dive into the docker-py code when I get a bit of time :slight_smile: But first I’ll keep an eye on whether the DefaultLimitNOFILE=100000 method works.


Nice, keep us posted! It will be interesting to see what the root cause really is.


Hello, just a quick update regarding my pipeline in production:

Even with the open-file limit raised as indicated above, I still run into the same error, and there does not seem to be any change in the number of runs completed before it appears.

Thanks for the update. Hard to say why; keep us posted if you find out more.

You can also try recreating the agent entirely.

A plot twist about the potential source of this error.

I had a hunch that, since the errors came from Docker authentication, adding the --no-pull argument could change things.

I was somewhat right: the agent seemed to hold out a bit longer before hitting these errors again (around 3 days instead of 2).
However, I am still getting a “too many open files” error message.

The interesting part is that it now originates from somewhere else in the code, as shown below:

Traceback (most recent call last):
  File "/home/lixoloic/.local/lib/python3.9/site-packages/urllib3/util/ssl_.py", line 402, in ssl_wrap_socket
    context.load_verify_locations(ca_certs, ca_cert_dir, ca_cert_data)
OSError: [Errno 24] Too many open files

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/lixoloic/.local/lib/python3.9/site-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/home/lixoloic/.local/lib/python3.9/site-packages/urllib3/connectionpool.py", line 386, in _make_request
    self._validate_conn(conn)
  File "/home/lixoloic/.local/lib/python3.9/site-packages/urllib3/connectionpool.py", line 1040, in _validate_conn
    conn.connect()
  File "/home/lixoloic/.local/lib/python3.9/site-packages/urllib3/connection.py", line 414, in connect
    self.sock = ssl_wrap_socket(
  File "/home/lixoloic/.local/lib/python3.9/site-packages/urllib3/util/ssl_.py", line 404, in ssl_wrap_socket
    raise SSLError(e)
urllib3.exceptions.SSLError: [Errno 24] Too many open files

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/lixoloic/.local/lib/python3.9/site-packages/requests/adapters.py", line 440, in send
    resp = conn.urlopen(
  File "/home/lixoloic/.local/lib/python3.9/site-packages/urllib3/connectionpool.py", line 813, in urlopen
    return self.urlopen(
  File "/home/lixoloic/.local/lib/python3.9/site-packages/urllib3/connectionpool.py", line 813, in urlopen
    return self.urlopen(
  File "/home/lixoloic/.local/lib/python3.9/site-packages/urllib3/connectionpool.py", line 813, in urlopen
    return self.urlopen(
  [Previous line repeated 3 more times]
  File "/home/lixoloic/.local/lib/python3.9/site-packages/urllib3/connectionpool.py", line 785, in urlopen
    retries = retries.increment(
  File "/home/lixoloic/.local/lib/python3.9/site-packages/urllib3/util/retry.py", line 592, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='api.prefect.io', port=443): Max retries exceeded with url: / (Caused by SSLError(OSError(24, 'Too many open files')))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/lixoloic/.local/lib/python3.9/site-packages/prefect/agent/agent.py", line 320, in _submit_deploy_flow_run_jobs
    flow_run_ids = self._get_ready_flow_runs()
  File "/home/lixoloic/.local/lib/python3.9/site-packages/prefect/agent/agent.py", line 571, in _get_ready_flow_runs
    result = self.client.graphql(
  File "/home/lixoloic/.local/lib/python3.9/site-packages/prefect/client/client.py", line 452, in graphql
    result = self.post(
  File "/home/lixoloic/.local/lib/python3.9/site-packages/prefect/client/client.py", line 407, in post
    response = self._request(
  File "/home/lixoloic/.local/lib/python3.9/site-packages/prefect/client/client.py", line 641, in _request
    response = self._send_request(
  File "/home/lixoloic/.local/lib/python3.9/site-packages/prefect/client/client.py", line 506, in _send_request
    response = session.post(
  File "/home/lixoloic/.local/lib/python3.9/site-packages/requests/sessions.py", line 577, in post
    return self.request('POST', url, data=data, json=json, **kwargs)
  File "/home/lixoloic/.local/lib/python3.9/site-packages/requests/sessions.py", line 529, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/lixoloic/.local/lib/python3.9/site-packages/requests/sessions.py", line 645, in send
    r = adapter.send(request, **kwargs)
  File "/home/lixoloic/.local/lib/python3.9/site-packages/requests/adapters.py", line 517, in send
    raise SSLError(e, request=request)
requests.exceptions.SSLError: HTTPSConnectionPool(host='api.prefect.io', port=443): Max retries exceeded with url: / (Caused by SSLError(OSError(24, 'Too many open files')))