Preparing datasets with benchopt prepare

Preparing datasets with `benchopt prepare`#

Benchopt separates data preparation (heavy one-time work: downloads, extraction, pre-processing) from data loading (fast, per-run work done by get_data()).

Preparation is cached by joblib so that invoking:

$ benchopt prepare path/to/benchmark

is a no-op when the result is already in the cache.
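
As a rough mental model, the "no-op on a warm cache" behaviour can be sketched with the standard library alone. This is not how joblib or benchopt implement caching; `cache_key`, `cached_prepare` and the marker-file layout are hypothetical names chosen for illustration.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def cache_key(dataset_name, params):
    """Build a stable key from the dataset name and its parameters."""
    payload = json.dumps([dataset_name, params], sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()


def cached_prepare(cache_dir, dataset_name, params, prepare):
    """Run ``prepare`` only if no marker exists for this key.

    Returns True when the work was actually done, False on a cache hit.
    """
    marker = Path(cache_dir) / cache_key(dataset_name, params)
    if marker.exists():
        return False  # already prepared: nothing to do
    prepare(**params)
    marker.touch()  # record success only after prepare() returned
    return True


calls = []
with tempfile.TemporaryDirectory() as cache_dir:
    for _ in range(2):  # two consecutive preparation passes
        for n_samples in (100, 1000):
            cached_prepare(
                cache_dir, "simulated", {"n_samples": n_samples},
                lambda n_samples: calls.append(n_samples),
            )

print(calls)  # the second pass adds nothing
```

The key depends only on the dataset name and its parameters, so the second pass finds both markers and skips the work.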

from benchopt.helpers.run_examples import ExampleBenchmark
from benchopt.helpers.run_examples import benchopt_cli

Simple case: `get_data()` as fallback#

When a dataset does not define a custom prepare() method, benchopt falls back to calling get_data() during preparation. This preserves backward compatibility with benchmarks written before the prepare step was introduced.
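
One way to picture this fallback is shown below, using a toy stand-in class rather than benchopt's real `BaseDataset`, whose internals are not shown here. `ToyBaseDataset` and `run_preparation` are hypothetical names, and the override check is only an assumption about how such a dispatch could be written.

```python
class ToyBaseDataset:
    """Toy stand-in for a dataset base class (not benchopt's BaseDataset)."""

    def prepare(self):
        # Default: no dedicated preparation step.
        raise NotImplementedError

    def get_data(self):
        raise NotImplementedError


def run_preparation(dataset):
    """Call prepare() if the subclass overrides it, else use get_data()."""
    overrides_prepare = type(dataset).prepare is not ToyBaseDataset.prepare
    if overrides_prepare:
        dataset.prepare()
        return "prepare"
    dataset.get_data()
    return "get_data"


class OnlyGetData(ToyBaseDataset):
    def get_data(self):
        return dict(X=[0.0, 1.0])


class WithPrepare(ToyBaseDataset):
    def prepare(self):
        pass  # heavy one-time work would go here

    def get_data(self):
        return dict(X=[0.0, 1.0])


used = [run_preparation(OnlyGetData()), run_preparation(WithPrepare())]
print(used)
```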

We start from the minimal_benchmark example. Its dataset only defines get_data(), so the prepare step will call that directly.

DATASET_SIMPLE = """
    from benchopt import BaseDataset
    import numpy as np

    class Dataset(BaseDataset):
        name = 'simulated'
        parameters = {'n_samples': [100, 1000]}

        def get_data(self):
            print(f"\\n\\tGetting data for n_samples={self.n_samples}")

            rng = np.random.default_rng(0)
            X = rng.standard_normal((self.n_samples, 10))
            return dict(X=X)
"""

benchmark = ExampleBenchmark(
    base="minimal_benchmark", name="prepare_example",
    ignore=["custom_plot.py", "example_config.yml"],
    datasets={"simulated.py": DATASET_SIMPLE},
)
benchmark

objective.py

from benchopt import BaseObjective
import numpy as np


class Objective(BaseObjective):
    # Name of the Objective function
    name = 'Quadratic'

    # The three methods below define the links between the Dataset,
    # the Objective and the Solver.
    def set_data(self, X):
        """Set the data from a Dataset to compute the objective.

        The arguments are the keys of the dictionary returned by
        ``Dataset.get_data``.
        """
        self.X = X

    def get_objective(self):
        "Returns a dict passed to ``Solver.set_objective`` method."
        return dict(X=self.X)

    def evaluate_result(self, X_hat):
        """Compute the objective value(s) given the output of a solver.

        The arguments are the keys in the dictionary returned
        by ``Solver.get_result``.
        """
        return dict(value=np.linalg.norm(self.X - X_hat))

    def get_one_result(self):
        """Return one solution for which the objective can be evaluated.

        This function is mostly used for testing and debugging purposes.
        """
        return dict(X_hat=1)


datasets/simulated.py

from benchopt import BaseDataset
import numpy as np

class Dataset(BaseDataset):
    name = 'simulated'
    parameters = {'n_samples': [100, 1000]}

    def get_data(self):
        print(f"\n\tGetting data for n_samples={self.n_samples}")

        rng = np.random.default_rng(0)
        X = rng.standard_normal((self.n_samples, 10))
        return dict(X=X)


solvers/gd.py

from benchopt import BaseSolver
import numpy as np


class Solver(BaseSolver):
    # Name of the Solver, used to select it in the CLI
    name = 'gd'

    # By default, benchopt will evaluate the result of a method after various
    # numbers of iterations. Setting the sampling_strategy controls how this is
    # done. Here, we use a callback function that is called at each iteration.
    sampling_strategy = 'callback'

    # Parameters of the method, that will be tested by the benchmark.
    # Each parameter ``param_name`` will be accessible as ``self.param_name``.
    parameters = {'lr': [1e-3, 1e-2]}

    # The three methods below define the necessary methods for the Solver, to
    # get the info from the Objective, to run the method and to return a
    # result that can be evaluated by the Objective.
    def set_objective(self, X):
        """Set the info from an Objective, to run the method.

        This method is also typically used to adapt the solver's parameters to
        the data (e.g. scaling) or to initialize the algorithm.

        The kwargs are the keys of the dictionary returned by
        ``Objective.get_objective``.
        """
        self.X = X
        self.X_hat = np.zeros_like(X)

    def run(self, cb):
        """Run the actual method to benchmark.

        Here, as we use a "callback", we need to call it at each iteration to
        evaluate the result as the procedure progresses.

        The callback implements a stopping mechanism, based on the number of
        iterations, the time and the evolution of the performances.
        """
        while cb():
            self.X_hat = self.X_hat - self.lr * (self.X_hat - self.X)

    def get_result(self):
        """Format the output of the method to be evaluated in the Objective.

        Returns a dict which is passed to ``Objective.evaluate_result`` method.
        """
        return {'X_hat': self.X_hat}


config.yml

# loaded from minimal_benchmark/config.yml
plot_configs:
  Subopt. (log):
    plot_kind: objective_curve
    scale: loglog
  Runtimes:
    plot_kind: bar_chart

Running benchopt prepare triggers the preparation step for every parameter combination. Because prepare() is not overridden here, get_data() is called as a fallback. The Getting data for n_samples=... print confirms it runs once for each combination.

benchopt_cli(f"prepare {benchmark.benchmark_dir}")

$ benchopt prepare temp_benchmark_f67u04wi/prepare_example

Preparing datasets for benchmark 'prepare_example'
Preparing simulated[n_samples=100] ...
        Getting data for n_samples=100
done
Preparing simulated[n_samples=1000] ...
        Getting data for n_samples=1000
done
Summary: 2/2 datasets ready.

A second call is a no-op: the cache recognises every combination and skips all preparation work. Each status line now reads Preparing ... done on a single line, and get_data() is never called.

benchopt_cli(f"prepare {benchmark.benchmark_dir}")

$ benchopt prepare temp_benchmark_f67u04wi/prepare_example

Preparing datasets for benchmark 'prepare_example'
Preparing simulated[n_samples=100] ... done
Preparing simulated[n_samples=1000] ... done
Summary: 2/2 datasets ready.

The --prepare flag of benchopt install runs the same preparation step right after installing the benchmark dependencies, so data is ready before the first run:

$ benchopt install path/to/benchmark --prepare

This is convenient in CI pipelines or when setting up a benchmark for the first time.

Custom `prepare()` method#

For datasets that require genuinely heavy work (downloading an archive, extracting files, training a feature extractor, …), define a prepare() method. It is called at most once per unique parameter combination and its result is cached.

prepare() is meant to be idempotent: calling it multiple times is always safe. Think of it as a setup step that guarantees data is on disk and in the right form before get_data() ever runs.
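
A minimal sketch of what idempotent preparation can look like, written as plain functions with the standard library only so that it runs without benchopt. The `data_dir` argument, the file name and the JSON format are assumptions made for this illustration, not part of the benchopt API.

```python
import json
import random
import tempfile
from pathlib import Path


def prepare(data_dir, n_samples):
    """Write the dataset to disk unless it is already there."""
    target = Path(data_dir) / f"simulated_{n_samples}.json"
    if target.exists():
        return False  # nothing to do: a previous call finished the job

    rng = random.Random(0)
    rows = [[rng.gauss(0, 1) for _ in range(10)] for _ in range(n_samples)]

    # Write to a temporary file first, then rename, so that an interrupted
    # call never leaves a half-written file that looks complete.
    tmp = target.with_suffix(".tmp")
    tmp.write_text(json.dumps(rows))
    tmp.replace(target)
    return True


def get_data(data_dir, n_samples):
    """Fast path: only read what prepare() left on disk."""
    target = Path(data_dir) / f"simulated_{n_samples}.json"
    return dict(X=json.loads(target.read_text()))


with tempfile.TemporaryDirectory() as data_dir:
    did_work = [prepare(data_dir, 100) for _ in range(3)]
    X = get_data(data_dir, 100)["X"]

print(did_work, len(X), len(X[0]))
```

Only the first call does any work; later calls see the file and return immediately, which is what makes repeated invocation safe.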

DATASET_PREPARE = """
from benchopt import BaseDataset
import numpy as np

class Dataset(BaseDataset):
    name = 'simulated'
    parameters = {'n_samples': [100, 1000]}

    def prepare(self):
        # Heavy one-time work goes here: downloading archives, feature
        # extraction, pre-processing …
        # Here we just simulate it with a print statement.
        print(f"\\n    > Preparing n_samples={self.n_samples}")

    def get_data(self):
        rng = np.random.default_rng(0)
        X = rng.standard_normal((self.n_samples, 10))
        return dict(X=X)
"""

benchmark.update(datasets={"simulated.py": DATASET_PREPARE})

            
We now update the following files:

datasets/simulated.py

from benchopt import BaseDataset
import numpy as np

class Dataset(BaseDataset):
    name = 'simulated'
    parameters = {'n_samples': [100, 1000]}

    def prepare(self):
        # Heavy one-time work goes here: downloading archives, feature
        # extraction, pre-processing …
        # Here we just simulate it with a print statement.
        print(f"\n    > Preparing n_samples={self.n_samples}")

    def get_data(self):
        rng = np.random.default_rng(0)
        X = rng.standard_normal((self.n_samples, 10))
        return dict(X=X)

After modifying the dataset, the cache is invalidated. Calling benchopt prepare now runs the new prepare() method for every parameter combination (here n_samples ∈ {100, 1000}). The > Preparing n_samples=... print confirms both combinations are executed.

benchopt_cli(f"prepare {benchmark.benchmark_dir}")

$ benchopt prepare temp_benchmark_f67u04wi/prepare_example

Preparing datasets for benchmark 'prepare_example'
Preparing simulated[n_samples=100] ...
    > Preparing n_samples=100
done
Preparing simulated[n_samples=1000] ...
    > Preparing n_samples=1000
done
Summary: 2/2 datasets ready.

`prepare_cache_ignore`: reducing redundant preparation work#

Some parameters influence the benchmark run but not the data that prepare() produces. A typical example is a random seed: the preparation step (e.g. downloading a fixed dataset) is identical across all seed values.

prepare_cache_ignore lists those parameters. Benchopt groups all parameter combinations that differ only in ignored dimensions and runs prepare() at most once per group:

prepare_cache_ignore = ('seed',) — ignore the seed parameter; preparation runs once per unique value of the remaining parameters.
prepare_cache_ignore = 'all' — ignore every parameter; prepare() runs at most once per dataset class regardless of parameterization.

DATASET_CACHE_IGNORE = """
from benchopt import BaseDataset
import numpy as np

class Dataset(BaseDataset):
    name = 'simulated'
    parameters = {'n_samples': [100, 1000], 'seed': [0, 1, 2]}

    # The preparation does not depend on the random seed, so we exclude
    # 'seed' from the cache key.  This reduces 6 prepare() calls to 2.
    prepare_cache_ignore = ('seed',)

    def prepare(self):
        print(f"\\n    > Preparing n_samples={self.n_samples}")

    def get_data(self):
        rng = np.random.default_rng(self.seed)
        X = rng.standard_normal((self.n_samples, 10))
        return dict(X=X)
"""

benchmark.update(datasets={"simulated.py": DATASET_CACHE_IGNORE})

            
We now update the following files:

datasets/simulated.py

from benchopt import BaseDataset
import numpy as np

class Dataset(BaseDataset):
    name = 'simulated'
    parameters = {'n_samples': [100, 1000], 'seed': [0, 1, 2]}

    # The preparation does not depend on the random seed, so we exclude
    # 'seed' from the cache key.  This reduces 6 prepare() calls to 2.
    prepare_cache_ignore = ('seed',)

    def prepare(self):
        print(f"\n    > Preparing n_samples={self.n_samples}")

    def get_data(self):
        rng = np.random.default_rng(self.seed)
        X = rng.standard_normal((self.n_samples, 10))
        return dict(X=X)

With 6 parameter combinations (2 × 3) but seed ignored, benchopt deduplicates to 2 effective preparation jobs. Only the seed=0 representative of each group appears in the output, and prepare() prints exactly twice.

benchopt_cli(f"prepare {benchmark.benchmark_dir}")

$ benchopt prepare temp_benchmark_f67u04wi/prepare_example

Preparing datasets for benchmark 'prepare_example'
Preparing simulated[n_samples=100,seed=0] ...
    > Preparing n_samples=100
done
Preparing simulated[n_samples=1000,seed=0] ...
    > Preparing n_samples=1000
done
Summary: 2/2 datasets ready.

Setting prepare_cache_ignore = 'all' is even more aggressive: it runs prepare() at most once per dataset class, regardless of any parameter values. Use this when the dataset is a fixed external file that requires no per-parameter processing at all.
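
The grouping described in this section can be sketched in a few lines of plain Python. This only illustrates the deduplication logic and is not benchopt's implementation; `preparation_groups` is a hypothetical helper, and the choice of the first combination as each group's representative is an assumption that happens to match the output above.

```python
from itertools import product


def preparation_groups(parameters, prepare_cache_ignore=()):
    """Group parameter combinations that share the same preparation key.

    Returns a dict mapping each key (the non-ignored parameters) to the
    list of full combinations it covers.
    """
    names = list(parameters)
    groups = {}
    for values in product(*parameters.values()):
        combo = dict(zip(names, values))
        if prepare_cache_ignore == "all":
            key = ()  # every combination collapses into one group
        else:
            key = tuple(
                (name, value) for name, value in combo.items()
                if name not in prepare_cache_ignore
            )
        groups.setdefault(key, []).append(combo)
    return groups


parameters = {"n_samples": [100, 1000], "seed": [0, 1, 2]}

no_ignore = preparation_groups(parameters)
ignore_seed = preparation_groups(parameters, ("seed",))
ignore_all = preparation_groups(parameters, "all")

# Number of prepare() jobs in each configuration.
print(len(no_ignore), len(ignore_seed), len(ignore_all))
# First member of each group when 'seed' is ignored.
print([combos[0] for combos in ignore_seed.values()])
```

With the parameters of the example dataset, the three configurations yield 6, 2 and 1 preparation jobs respectively.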
