Prefect "real time" flow execution options

Hi there! My question is less technical / coding but falls more into architecture / design category.

“real time” in this topic means “good enough for application user interaction”, not a “real” real-time

I’ve studied documentation, read through number of discourse and github discussions as well as number of blog posts and articles related to prefect.
So at this point I know that prefect is not suitable for real-time processing and sub-second precision use-cases.

Conceptually, I’d like to run number of workflows most of the time on the schedule (1), but also have a possibility to trigger some of those workflows on a user interaction (2), in a latter case application user is waiting for data, so in this case latency is critical. The most obvious solution is not to use prefect flows for such scenarios - I suspect prefect is not built for such scenarios, which is fine - it does good what it is built for. However there are number of reasons this could be actualy a reasonable thing to do:

  • avoid code duplication since workflow for cases (1) and (2) are almost the same or identical
  • benefit from prefect’s retry and caching logic

This is how I currently imagine the use-case:

  • in an app user adds number of data-sources
  • on a back-end for each data-source type we define flows
  • as mentioned before, workflows are mostly executed on a schedule
  • flows generate data which will be served via api and displayed to the user

For each data source flows are quite simple

EL -> T0 -> (data for ui ready) -> T1 -> T2 -> ... Tn
  • EL - extract / load flow
    • call one or few external APIs
    • stores raw API call results into EventStoreDB (or similar)
    • triggers T0
  • T0 - transform flow which produces data user wants to see in the ui immediately
    • retrieves raw API results from ESDB
    • performs quick data transformation to prepare data for display in UI
    • stores result into DB/Cache for api server
    • at this point API server is ready to serve data to the user
    • triggers T1…Tn
  • T1…Tn - other transform flows which are potentially heavier and I’m fine they are running in the background and take as long as they need
    • do whatever else I need to do with the data

Reusing code in my case could be quite important, let say I support 20-50 datasources types:

  • I need to maintain 20-50 EL and T0 prefect flow definitions
  • and almost identical copies of these flow definitions for “interactive” use outside of prefect ecosystem

Obviously in case (2 - user interactive) it is undesirable to have extra latency caused by infrastruture taking some time preparing execution environments, etc.

I come-up with a quite straight-forward workaround for case (2): I run EL and T0 directly inside the API server process in background task (using ephemeral server) and then trigger T1…Tn to run in infrastructure managed by proper prefect set-up.

<user/app> -> API endpoint -> trigger "interactive" flow
-> execute EL and T0 immediately 
-> store results -> trigger T1..Tn

I’m testing it on local dev environment and timing is actually not that bad - it takes 3-4 seconds to get data ready to display for user:

  • 0.5s - spin-up prefect ephemeral server (compared to 1-2s when using proper prefect server setup) - Update: it seems it takes this time only on first flow run, and after that it does not shutdown ephemeral server immediately, so this is more-or-less one-time-fee until instance restart.
  • 1 - 1.5s - external API calls (when hosted in “real” environment, I expect it to be faster)
  • 1.5s - interaction with app database (this delay will be no issue in “real” environment with propertly configured and tuned db)
  • 100-200ms - interaction with Event Store DB
  • few hunderd ms - everything else - this is a “fee” I’m ready to pay for code de-duplication retries / caching / etc.

I can probably save 0.5s but just running local prefect server instances alongside with API server instances in the same container, external API calls and app database interaction will be much faster in proper environment, so entire use-case will execute and send response to a user under a second, which is quite good.

Obviously in this scenarior

  • I’m missing proper observability: central prefect server does not now anything about my local / ephemeral prefect servers
  • this is something which could actually be nice-to-have but not critical at all - it should be fairly easy to grab flow logs into ELK or similar
  • e.g. it would probably be nice to have an option to replicate job run information from “local” prefect server instances to “central” prefect
  • I don’t care much about failures. If “interactive” part of the flow (EL, T0) fails we quickly give user feedback in an app and “interactivity” is preserved.

I’ve read discussions about alternative options, e.g. implement “interactive” flows as an ininite while True loop (in this case execution environment is effectively “reused”) - but this solution is actually a bit cumbersome to implement, it obscures otherwise simple and comprehensible flow definition code, and I think will actually be harder to scale. E.g. scaling solution with ephemeral prefect intance is as simple as introducing message queue between API server(s) and <“interactive” flow execution server + “local” prefect instance>.

So I’m just looking for an opinion from the community:

  • does such set-up looks good enough? maybe I’m missing some other simpler and obvious approach?
  • what are pitfalls of running ephemeral or “local” prefect server instances for “interactive” flows in real production systems?