@Chu_Lục_Ninh: Hi, I faced the problem with
kubernetes.RunNamespacedJob, when run a long-running job (in my case half of the day). I got huge memory leak. My suspect is this job use
ThreadPoolExecutor to poll job log every 5s using
ReadNamespacedPodLogs, since ReadNamespacedPodLogs executed in other thread context, its refCount didn’t reduce to 0 to trigger GC when we assign new value to
read_pod_logs, and that leads to orphan object and leak the memory
@Anna_Geller: Thanks for a great description. I understand the issue but don’t know how to fix that tbh. Do you have some idea on how you would approach it?
How did you identify that this is the line causing the memory leak? was it clear from the execution logs?
one workaround I can think of would be looking at whether it may perhaps make sense to split the job into smaller components? what is your use case here - why does a single job take half of the day to complete? typically a task is a small atomic component in a Prefect flow. While such large tasks are supported, they are hard to debug and troubleshoot - something you experienced just now
@Chu_Lục_Ninh: Hi Anna, actually it is the service we want to run everyday in some time range. You can relate it to the fx trading business when we need the service run only in trading hours.
We monitored the job in grafana/prometheus and experienced that the more logs it produce, the more mem it consumer. Also in execution logs I saw the logs were queued to send to prefect server. Another experience we made was increase job polling interval makes mem leak speed slower.
Finally, it is just Python with multithread, and it may mess up easily with wrong setup. One thing I noticed is previously, this job also messed up losing context when be passed to threadpool executor. Glad that the community fixed it in v1.1.0
@Anna_Geller: > Glad that the community fixed it in v1.1.0
does it mean that upgrading your Server and flows will fix the issue for you?
> it is the service we want to run everyday in some time range.
when you submit your trades, do you think it must be a long-running service? I could easily imagine having it implemented as parametrized flow and to coordinate multiple trades, you could build it as a flow with multiple tasks or even a flow of flows, each flow for a specific market/sector/trade type submission. This way, you could actually have way more visibility than running it as a long-running
KubernetesPodJob. Not something you could change immediately, but worth thinking about
@Chu_Lục_Ninh: > does it mean that upgrading your Server and flows will fix the issue for you?
v1.1.0 only fixed losing thread context. The memory leak problem still persist in master branch.
> I could easily imagine having it implemented as parametrized flow and to coordinate multiple trades
This is not our purpose. We run the service that receives market feed from the exchange from 9AM to 3PM everyday. After that range of time, the exchange close the connection so we have to gracefully shutdown our service and open it back tomorrow.
@Anna_Geller: I understand the restrictions of the exchange opening and closing at specific times, but you didn’t persuade me that it requires a persistent service - do you have some sort of a while loop there that grabs time-series data starting from 9 AM, loads it to some time-series database or dumps to a file and keeps continuing until the exchange closes?
I’m asking because I had to deal with a similar use case related to energy trading
@Chu_Lục_Ninh: The exchange continuously sends market feed data to us from 9AM to 3PM. We then broadcast the data to our clients. The service must be persistence because the exchange push message to us via TCP, and we broadcast the message right away when we receive it to maintain low-latency data stream.
@Anna_Geller: for a persistent service, Prefect 1.0 with Kubernetes tasks is not really meant for this.
Prefect 2.0 would be more helpful since you could have a persistent service running flow runs, even every 1-2 seconds if needed, in some sort of while loop, and every run would be trackable from the UI, giving you visibility and all Prefect features such as retries, within that persistent service. Or you could schedule your deployments every 10 seconds to achieve near real-time processing of those messages
@Chu_Lục_Ninh: Thanks, I definitly will try Orion. That version is still in our testing phase while we figuring out how to migrate v1 flow to v2. Most of the reason is the lack of task library, we are re-writing some library target v2 and may contribute back to the community.