Prefect 2.0
Prefect integrates with Dask via the task runner interface. The DaskTaskRunner runs Prefect tasks on Dask's distributed scheduler. It can be used locally on a single machine, but it's most useful when scaling out across multiple nodes. This documentation page provides more details.
from prefect import flow
from prefect_dask import DaskTaskRunner

@flow(task_runner=DaskTaskRunner())
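For example, here is a minimal runnable sketch (the task and flow names are illustrative); each task submitted inside the flow is executed on a Dask worker:

from prefect import flow, task
from prefect_dask import DaskTaskRunner

@task
def say_hello(name):
    print(f"hello {name}")

@flow(task_runner=DaskTaskRunner())
def greetings(names):
    # each submitted task run is scheduled on the Dask cluster
    for name in names:
        say_hello.submit(name)

if __name__ == "__main__":
    greetings(["arthur", "trillian", "ford"])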
Prefect 1.0
There are two ways to leverage Dask with Prefect 1.0.
1) Dask executor
The first and easiest way to use Dask is to offload task run execution to a Dask cluster via an executor. You can do that by assigning one of the Dask executors:
- LocalDaskExecutor
- DaskExecutor
For a detailed comparison between those two, check out this topic:
Also, if you want to learn more about the differences between a temporary and a static Dask cluster, check out:
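To make the distinction concrete, here is a sketch of the common configurations (the scheduler address is a placeholder, not a real endpoint):

from prefect.executors import DaskExecutor, LocalDaskExecutor

# threads or processes on the machine running the flow
executor = LocalDaskExecutor(scheduler="threads", num_workers=8)

# a temporary Dask cluster created (and torn down) per flow run
executor = DaskExecutor(cluster_class="dask.distributed.LocalCluster")

# an existing static cluster, identified by its scheduler address
executor = DaskExecutor(address="tcp://scheduler-host:8786")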
You can then attach the executor directly to your Flow object:
from prefect import Flow
from prefect.executors import DaskExecutor

with Flow("parallel_task_runs", executor=DaskExecutor()) as flow:
    ...
Note that the executor information is not stored in the backend during flow registration. Instead, it is retrieved from Storage at runtime since it may contain sensitive information such as the Dask scheduler address.
Then, once you use mapping, Prefect automatically parallelizes task run execution across processes, threads, or even several Dask workers, as in the sketch below. More on mapping:
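A minimal sketch of a mapped flow (the inc task and flow name are illustrative): mapping over a list spawns one task run per element, and the executor runs them in parallel.

from prefect import Flow, task
from prefect.executors import LocalDaskExecutor

@task
def inc(x):
    return x + 1

with Flow("mapped_flow", executor=LocalDaskExecutor(scheduler="threads")) as flow:
    # one task run per list element, executed in parallel by Dask
    results = inc.map([1, 2, 3])

flow.run()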
2) Resource manager with Dask cluster client
The second way to use Dask with Prefect is to combine it with the resource manager abstraction. For more details on that, check out this blog post:
A short code snippet that illustrates this (the task definitions, such as create_list and get_github_data, are omitted for brevity):
import coiled
from dask.distributed import Client
from prefect import Flow, Parameter, resource_manager

# Define a ResourceManager object
@resource_manager
class DaskCluster:
    def __init__(self, cluster_type="local", n_workers=None, software=None, account=None, name=None):
        self.cluster_type = cluster_type
        self.n_workers = n_workers
        self.software = software
        self.account = account
        self.name = name

    def setup(self):
        if self.cluster_type == "local":
            return Client(processes=False)
        elif self.cluster_type == "coiled":
            cluster = coiled.Cluster(
                name=self.name,
                software=self.software,
                n_workers=self.n_workers,
                account=self.account,
            )
            return Client(cluster)

    def cleanup(self, client):
        client.close()
        if self.cluster_type == "coiled":
            client.cluster.close()
# Build Prefect Flow
with Flow(name="Github ETL Test") as flow:
    # define parameters
    n_workers = Parameter("n_workers", default=4)
    software = Parameter("software", default='coiled-examples/prefect')
    account = Parameter("account", default=None)
    name = Parameter("name", default='cluster-name')
    start_date = Parameter("start_date", default="01-01-2015")
    end_date = Parameter("end_date", default="31-12-2015")

    # build flow
    filenames = create_list(start_date=start_date, end_date=end_date)
    cluster_type = determine_cluster_type(filenames)

    # use ResourceManager object
    with DaskCluster(
        cluster_type=cluster_type,
        n_workers=n_workers,
        software=software,
        account=account,
        name=name,
    ) as client:
        push_events = get_github_data(filenames)
        df = to_dataframe(push_events)
        to_parquet(df)

# Run flow with parameters
flow.run(
    parameters=dict(
        end_date="02-01-2015",
        n_workers=15,
        name="prefect-on-coiled",
    )
)
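Note that the resource manager's cleanup step runs even when tasks inside the block fail, so the Dask cluster gets shut down regardless of the flow run's outcome.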