It depends a lot on where your Spark cluster is running.
Databricks
Prefect is an official Databricks partner, so if you want to leverage Spark on Databricks, Prefect can help you orchestrate those workflows:
Docs
https://prefecthq.github.io/prefect-databricks/
Code Examples
from prefect import flow
from prefect_databricks import DatabricksCredentials
from prefect_databricks.jobs import jobs_list


@flow
def example_execute_endpoint_flow():
    # load a saved DatabricksCredentials block and list the first 5 jobs in the workspace
    databricks_credentials = DatabricksCredentials.load("my-block")
    jobs = jobs_list(
        databricks_credentials,
        limit=5
    )
    return jobs


example_execute_endpoint_flow()
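Listing jobs is only one task in the collection; to actually kick off a Spark workload you can submit a job run. The sketch below is based on the collection's jobs_runs_submit_and_wait_for_completion flow and assumes a saved "my-block" credentials block - the notebook path, node type, Spark version, and cluster size are placeholder values you would swap for your own workspace settings:

from prefect import flow
from prefect_databricks import DatabricksCredentials
from prefect_databricks.flows import jobs_runs_submit_and_wait_for_completion
from prefect_databricks.models.jobs import JobTaskSettings, NewCluster, NotebookTask


@flow
async def submit_notebook_run_flow(notebook_path):
    databricks_credentials = await DatabricksCredentials.load("my-block")

    # placeholder cluster spec - adjust node type, Spark version, and size for your workspace
    new_cluster = NewCluster(
        node_type_id="m5.large",
        spark_version="11.3.x-scala2.12",
        num_workers=1,
    )
    job_task_settings = JobTaskSettings(
        task_key="prefect-job",
        notebook_task=NotebookTask(notebook_path=notebook_path),
        new_cluster=new_cluster,
    )

    # submits the run and waits until Databricks reports a terminal state
    return await jobs_runs_submit_and_wait_for_completion(
        databricks_credentials=databricks_credentials,
        run_name="prefect-job",
        tasks=[job_task_settings],
    )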
Fugue
Also, we integrate with Fugue:
Docs
https://fugue-project.github.io/prefect-fugue/
Code Examples
from prefect import flow
from prefect_fugue import fugue_engine, fsql


@flow
def hello_flow():
    # runs Fugue SQL on the default (local) engine
    fsql("""
    CREATE [[0]] SCHEMA a:int
    PRINT
    """)


hello_flow()
@flow
def world_flow(n, engine):
    # the same Fugue SQL can run on whichever engine is passed in (DuckDB, Spark, Dask, ...)
    with fugue_engine(engine):
        fsql("""
        CREATE [[0],[1]] SCHEMA a:int
        SELECT * WHERE a>{{n}}
        PRINT
        """, n=n)


world_flow(1, "duckdb")  # running using duckdb (assuming duckdb is installed)
AWS EMR
AWS's awswrangler library provides an extremely easy way to run EMR jobs:
https://aws-data-wrangler.readthedocs.io/en/stable/tutorials/015%20-%20EMR.html
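Here is a minimal sketch based on that tutorial, wrapped in a Prefect flow - the subnet ID and the S3 path of the PySpark script are placeholders for your own resources:

import time

import awswrangler as wr
from prefect import flow


@flow
def emr_spark_flow(subnet_id: str, script_s3_path: str):
    # spin up a transient EMR cluster in the given subnet (placeholder value)
    cluster_id = wr.emr.create_cluster(subnet_id)

    # submit the PySpark script stored on S3 (placeholder path) as a step
    step_id = wr.emr.submit_step(
        cluster_id, command=f"spark-submit {script_s3_path}"
    )

    # poll until the step reaches a terminal state
    while wr.emr.get_step_state(cluster_id, step_id) not in (
        "COMPLETED", "FAILED", "CANCELLED"
    ):
        time.sleep(30)

    wr.emr.terminate_cluster(cluster_id)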
Self-hosted Spark on Kubernetes
Running a Spark job on a self-hosted Kubernetes cluster is more difficult because you have to submit jobs to the cluster yourself (even with Kubeflow). But it's possible - you need to run the spark-submit command to send the job to the cluster and poll for its status.
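A minimal sketch of that approach, calling spark-submit from a Prefect task - the Kubernetes master URL, container image, and application path below are placeholders you would replace with your own values:

import subprocess

from prefect import flow, task


@task
def spark_submit(master_url: str, app_path: str):
    # master URL, container image, and app path are placeholders - adjust for your cluster
    cmd = [
        "spark-submit",
        "--master", master_url,
        "--deploy-mode", "cluster",
        "--conf", "spark.kubernetes.container.image=my-registry/spark:3.4.0",
        app_path,
    ]
    # in cluster mode spark-submit waits on the driver pod and polls its status,
    # so a non-zero exit code fails the task (and the flow run)
    subprocess.run(cmd, check=True)


@flow
def spark_on_k8s_flow():
    spark_submit(
        master_url="k8s://https://kubernetes.default.svc:443",
        app_path="local:///opt/spark/examples/src/main/python/pi.py",
    )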