Choosing the right flow storage, repository structure, agents and execution layer for new teams

There are several options for flow storage and run types. Advanced users seem to prefer Docker/Kubernetes, but for new team members, learning Docker can be a hurdle on top of everything else.

The way I see it, this is a functionality/learning-curve trade-off. Docker is more flexible, but it is perhaps excessive for small and medium teams that are not comfortable with it.

How does your team manage this? Have you opted for Docker/k8s knowing it is an additional hurdle? Are you keeping it simple with local agents? If so, how do you deploy new flows?

I have played around with deployment patterns for local agents and come up with a working system based on two git repos:

Repo 1 contains the “business” flows, organized into different projects and structured so that they can be registered automatically.

Repo 2 is a “janitor” repo, containing a flow that pulls Repo 1 and (re-)registers the flows it finds there. This way developers can commit flows to Repo 1, and they will be deployed automatically on some cadence. Poetically, it uses Prefect to deploy Prefect flows. Changes to Repo 2 are not deployed automatically, so those must be handled manually by SSHing into the agent.
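For illustration, here is a minimal sketch of what such a janitor flow could look like in Prefect 1.0 using ShellTask; the repo path, project name, and schedule are placeholders for whatever your setup uses.

```python
from datetime import timedelta

from prefect import Flow
from prefect.schedules import IntervalSchedule
from prefect.tasks.shell import ShellTask

# Placeholder values; adjust to your environment.
REPO_PATH = "/opt/repo1"
PROJECT = "business-flows"

git_pull = ShellTask(name="git-pull")
register_flows = ShellTask(name="register-flows")

with Flow("janitor", schedule=IntervalSchedule(interval=timedelta(hours=1))) as flow:
    # Fetch the latest business flows from Repo 1
    pulled = git_pull(command=f"cd {REPO_PATH} && git pull")
    # (Re-)register every flow found under flows/ into the given project
    register_flows(
        command=f"cd {REPO_PATH} && prefect register --project {PROJECT} -p flows/",
        upstream_tasks=[pulled],
    )
```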

Obviously this is not scalable, but it might work with 1-3 agents and uncomplicated flows.

Thanks for opening up this topic for discussion.

I agree that in Prefect 1.0:

  • a Local agent is the easiest way to get started,
  • a Docker/Kubernetes agent is better when you have to manage several (possibly conflicting) package dependencies across multiple projects in your team.
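
To make that trade-off concrete, here is a rough Prefect 1.0 sketch: the agent type you need is largely decided by the flow's run_config (the flow name, labels, and image below are placeholders).

```python
from prefect import Flow, task
from prefect.run_configs import DockerRun, LocalRun

@task
def say_hello():
    print("hello")

with Flow("example-flow") as flow:
    say_hello()

# Simplest option: a Local agent runs the flow as a process on the agent machine.
flow.run_config = LocalRun(labels=["dev"])

# The same flow executed in a container picked up by a Docker agent, with the
# project's (possibly conflicting) dependencies baked into the image
# (this also needs a storage option the container can reach, e.g. Docker storage):
# flow.run_config = DockerRun(image="my-team/flows:latest", labels=["docker"])
```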

In Prefect 2.0, this will get even easier since:

  • there is support for virtual environments, which mitigates the problem you mentioned of Docker being too difficult for someone just getting started in data engineering,
  • agents are more flexible and can pick up and deploy work for multiple different flow runners (e.g. a SubprocessFlowRunner, similar to a local process deployed by a local agent, a DockerFlowRunner, etc.).
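
As a hedged sketch of the Prefect 2.0 API mentioned above (module paths have been shifting during the beta, so treat this as illustrative rather than definitive), a DeploymentSpec pairs a flow with a flow runner, and the same agent can pick up either kind of deployment:

```python
from prefect import flow
from prefect.deployments import DeploymentSpec
from prefect.flow_runners import DockerFlowRunner, SubprocessFlowRunner

@flow
def hello():
    print("hello")

# Runs as a local subprocess on whichever machine the agent is on;
# the subprocess runner can also be pointed at a virtual environment.
DeploymentSpec(
    flow=hello,
    name="hello-subprocess",
    flow_runner=SubprocessFlowRunner(),
)

# The same flow deployed to run in a container (image name is a placeholder).
DeploymentSpec(
    flow=hello,
    name="hello-docker",
    flow_runner=DockerFlowRunner(image="my-team/flows:latest"),
)
```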

Regarding managing the various storage and run configurations, I understand that this may not be straightforward at first, and seeing some examples is useful. This repository contains examples and shows one way you could structure your code repository for Prefect flows:

Why do you think you need a separate “janitor” repo? Is this something you want to use for additional QA before registering flows to your production environment?

The overarching idea of the “janitor” is to register new flows automatically. In a sense it has nothing to do with Repo 1, other than pulling it and running prefect register ... regularly. The reason I wanted this as a separate repo is that it is a separate type of functionality, managed by a separate team (or sub-team). I have thought about using GitHub Actions for registering new flows, but it seems a little over the top.

I think using a CI/CD pipeline to register flows automatically is a pretty common approach. If you are interested in seeing some examples, this article shows how you could do that using CircleCI: