Managing long-running Bayesian optimization campaigns (days/weeks)

I’d like to use Prefect to manage long-running Bayesian optimization campaigns; in my case, these are tied to automated chemistry and materials science laboratories with both experiments and simulations. The campaigns can span weeks or, in some cases, months, but the optimization client spends a significant portion of this time idle, simply waiting to receive experimental results (see also: run history for long-running jobs).

For context, a basic optimization script is shown below; it instantiates an optimization client and iteratively seeks to minimize the value returned by a function:

import numpy as np
from ax.service.ax_client import AxClient, ObjectiveProperties


def branin(x1, x2):
    """Branin benchmark function; stands in for a real experiment or simulation."""
    y = float(
        (x2 - 5.1 / (4 * np.pi**2) * x1**2 + 5.0 / np.pi * x1 - 6.0) ** 2
        + 10 * (1 - 1.0 / (8 * np.pi)) * np.cos(x1)
        + 10
    )

    return y


ax_client = AxClient()
ax_client.create_experiment(
    parameters=[
        {"name": "x1", "type": "range", "bounds": [-5.0, 10.0]},
        {"name": "x2", "type": "range", "bounds": [0.0, 10.0]},
    ],
    objectives={
        "branin": ObjectiveProperties(minimize=True),
    },
)


# Ask-tell loop: request a suggestion, evaluate it, and report the result back to Ax
for _ in range(10):
    parameters, trial_index = ax_client.get_next_trial()
    results = branin(parameters["x1"], parameters["x2"])
    ax_client.complete_trial(trial_index=trial_index, raw_data=results)


best_parameters, metrics = ax_client.get_best_parameters()

Typically, I use MQTT via HiveMQ to communicate between the devices and the Python orchestrator, where the orchestrator has so far been a Jupyter notebook or a Python script running locally. This setup has worked well for communicating with microcontrollers such as the Raspberry Pi Pico W. I also prefer to use MongoDB via MongoDB Atlas Cloud to record experimental data; my field has some history of MongoDB integrations, and I appreciate the flexibility of its document representation. For Bayesian optimization, I prefer the Ax Platform (used in the example above), which exposes many state-of-the-art algorithms through relatively simple interfaces and is highly customizable.
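
Concretely, the device round trip looks roughly like the sketch below; the broker host, topic names, credentials, and database/collection names are all placeholders:

import json

import paho.mqtt.publish as publish
import paho.mqtt.subscribe as subscribe
from pymongo import MongoClient

HIVEMQ_HOST = "example.s1.eu.hivemq.cloud"  # placeholder broker host
AUTH = {"username": "sdl-user", "password": "********"}  # placeholder credentials


def run_experiment(parameters: dict) -> dict:
    """Send suggested parameters to the device and block until it reports results."""
    publish.single(
        "sdl-demo/request",
        json.dumps(parameters),
        hostname=HIVEMQ_HOST,
        port=8883,
        auth=AUTH,
        tls={},  # default TLS settings; HiveMQ Cloud requires TLS on port 8883
    )
    msg = subscribe.simple(
        "sdl-demo/response",
        hostname=HIVEMQ_HOST,
        port=8883,
        auth=AUTH,
        tls={},
    )
    return json.loads(msg.payload)


# Record the observation for posterity
mongo = MongoClient("mongodb+srv://<user>:<password>@cluster0.example.mongodb.net")
results = run_experiment({"x1": 1.0, "x2": 2.0})
mongo["sdl"]["observations"].insert_one(results)

(In practice I would keep a persistent client subscribed to the response topic rather than doing this one-shot request/response, but it illustrates the flow of data.)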

I would like to integrate Prefect into this stack as the workflow orchestrator to leverage features such as human-in-the-loop, pause/resume, cancel/restart, avoiding accidental spending, conditions/dependencies, visualization, and asynchrony (see Workflow orchestration - what goes where? · sparks-baird/self-driving-lab-demo · Discussion #233 · GitHub for additional context).
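
As a rough sketch of what I am imagining (not a working solution), one optimization iteration could be wrapped as a flow with tasks, assuming the parent process owns the AxClient and the branin helper from the script above:

from prefect import flow, task


@task
def suggest(ax_client):
    # Ask Ax for the next set of parameters to try
    return ax_client.get_next_trial()


@task
def measure(parameters: dict) -> float:
    # Stand-in for the real experiment; in practice this would be the MQTT round trip
    return branin(parameters["x1"], parameters["x2"])


@task
def record(ax_client, trial_index: int, result: float):
    ax_client.complete_trial(trial_index=trial_index, raw_data=result)


@flow
def optimization_iteration(ax_client):
    parameters, trial_index = suggest(ax_client)
    result = measure(parameters)
    record(ax_client, trial_index, result)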

I’ve considered various options. Running a Prefect managed work pool constantly isn’t feasible (whether free tier or Pro) due to usage limits (10 hrs / 250 hrs). From what I understand, even running the flows on EC2 workers won’t solve it completely, since flow runs get killed automatically after a certain period of time (free tier = 7 days, Pro = 30 days?).

Another complication is that, in more advanced setups, the optimization can occur asynchronously: as soon as one experimental “worker” is freed, a suggested experiment is generated and assigned to it. My thought is to use a MongoDB Atlas function (JavaScript) to call the Prefect REST API, following the guidelines in How to create a flow run from a deployment with REST API via http client, to start a flow run. If this were only ever a sequential (rather than asynchronous) optimization setup, I could follow the loop of “record to database → trigger flow run → suggest new experiment → run experiment → record to database → …”. However, the AxClient object has state that needs to be preserved. For example, it limits the total number of pending suggestions (i.e., suggested but not yet completed) via a max_parallelism argument, and it takes pending trials into account when suggesting new ones so it doesn’t suggest the same points over and over.
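
One workaround I can imagine (a sketch only, with made-up database and collection names) is to persist the AxClient state as a JSON snapshot in MongoDB, so that each independently triggered flow run can restore it, ask for a suggestion, and save it back; Ax provides to_json_snapshot / from_json_snapshot for this:

import json

from ax.service.ax_client import AxClient
from pymongo import MongoClient

mongo = MongoClient("mongodb+srv://<user>:<password>@cluster0.example.mongodb.net")
state = mongo["sdl"]["ax_state"]  # made-up database/collection names


def load_ax_client(campaign_id: str) -> AxClient:
    # Restore the optimizer exactly as it was left by the previous flow run
    doc = state.find_one({"_id": campaign_id})
    return AxClient.from_json_snapshot(json.loads(doc["snapshot"]))


def save_ax_client(campaign_id: str, ax_client: AxClient) -> None:
    # Overwrite the stored snapshot with the latest optimizer state
    snapshot = json.dumps(ax_client.to_json_snapshot())
    state.replace_one(
        {"_id": campaign_id},
        {"_id": campaign_id, "snapshot": snapshot},
        upsert=True,
    )


# Inside a flow run triggered when an experimental worker frees up:
ax_client = load_ax_client("campaign-1")
parameters, trial_index = ax_client.get_next_trial()  # accounts for pending trials / max_parallelism
save_ax_client("campaign-1", ax_client)  # persist the new pending trial before dispatching it

Concurrent flow runs would still need some form of locking (or a single dedicated “suggester” flow) so that two runs don’t clobber each other’s snapshots.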

Another option would be to have a normal Python script running continuously on EC2 that triggers flow runs. This ensures there is only one instance of the AxClient object; however, I lose Prefect’s observability of this script and the ability to have subflows and tasks attached to a parent flow.
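
Something along these lines is what I mean (a sketch assuming Prefect 2.x and a made-up deployment name):

from prefect.deployments import run_deployment


def dispatch_next_experiment(ax_client):
    # Ask the single long-lived AxClient for a suggestion and hand it to a deployed flow
    parameters, trial_index = ax_client.get_next_trial()
    run_deployment(
        name="run-experiment/production",  # "<flow name>/<deployment name>", made up here
        parameters={"parameters": parameters, "trial_index": trial_index},
        timeout=0,  # return as soon as the run is created rather than waiting for completion
    )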

I also thought about running the EC2 script as a Prefect deployment through the serve function (see Serving flows on long-lived infrastructure in the Prefect docs). If I trigger a flow run here (where the flow is the main Bayesian optimization loop), does this circumvent the 7-day retention period? In other words, could I run a single flow indefinitely this way, and is there anything in particular I would lose with this approach in terms of observability?
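
In other words, something like the following (hypothetical flow and deployment names), where the script itself stays up indefinitely and the served flow contains the whole optimization loop:

from prefect import flow


@flow
def optimization_campaign(n_trials: int = 100):
    ...  # the full suggest → run experiment → record loop from the first script


if __name__ == "__main__":
    # A lightweight, long-lived process on EC2 that listens for runs of this flow
    optimization_campaign.serve(name="bo-campaign")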

Is this a case where I should consider using agents for long-running tasks, even though they’re not the preferred method for most use cases?

I would really appreciate any help or suggestions on how to approach this problem.