benchopt.BaseDataset

class benchopt.BaseDataset(**parameters)

Base class to define a dataset in a benchmark.

Datasets that derive from this class should implement one method:

  • get_data(): retrieves or simulates the data for this dataset and returns it as a dictionary. This dictionary is unpacked as keyword arguments to the objective’s set_data method.

Optionally, datasets can implement:

  • prepare(): performs expensive one-time preparation (downloads, extraction, preprocessing). The result is cached, and this step is kept separate from data loading.

abstractmethod get_data()

Return the data to feed to the objective.

Returns:
data: dict

Extra parameters of the objective. The objective will be instantiated by calling Objective.set_data(**data).
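The contract between get_data() and Objective.set_data can be sketched in plain Python. This is only an illustration under stated assumptions: the Dataset and Objective classes below are stand-ins that do not import benchopt, and the name "simulated", the parameters n_samples and seed, and the keys X and y are hypothetical. In a real benchmark the dataset would subclass benchopt.BaseDataset.

```python
import random


class Dataset:  # stand-in; a real benchmark would subclass benchopt.BaseDataset
    name = "simulated"  # hypothetical dataset name

    def __init__(self, n_samples=5, seed=0):
        self.n_samples = n_samples
        self.seed = seed

    def get_data(self):
        # Simulate a small 1-D regression problem, reproducibly.
        rng = random.Random(self.seed)
        X = [rng.gauss(0.0, 1.0) for _ in range(self.n_samples)]
        y = [2.0 * x for x in X]
        # The keys must match the keyword arguments of Objective.set_data.
        return dict(X=X, y=y)


class Objective:  # stand-in for the benchmark's objective
    def set_data(self, X, y):
        self.X, self.y = X, y


data = Dataset(n_samples=5, seed=0).get_data()
objective = Objective()
objective.set_data(**data)  # the dictionary is unpacked as keyword arguments
```

A key in the returned dictionary that set_data does not accept would raise a TypeError at the unpacking step, so the two signatures need to stay in sync.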

prepare()

Prepare the dataset for use (optional).

Called before benchmark runs to perform expensive one-time operations such as downloading data, extracting archives, or pre-processing. Benchopt caches the result with joblib so that repeated calls with the same parameters are no-ops. Triggered via benchopt prepare.
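As a rough sketch of what a safely repeatable prepare() might look like, the stand-in class below writes a file once and skips the work on later calls. It does not import benchopt and does not reproduce benchopt's joblib caching. The file name data.txt, the data_dir argument, and the n_builds counter are hypothetical and exist only to make the behavior observable.

```python
import tempfile
from pathlib import Path


class Dataset:  # stand-in; a real benchmark would subclass benchopt.BaseDataset
    name = "cached-file"  # hypothetical dataset name

    def __init__(self, data_dir):
        self.path = Path(data_dir) / "data.txt"  # hypothetical location
        self.n_builds = 0  # counts how often the expensive step really ran

    def prepare(self):
        # Skip the expensive step when its output already exists.
        if self.path.exists():
            return
        self.path.parent.mkdir(parents=True, exist_ok=True)
        # Placeholder for a download, extraction, or preprocessing step.
        self.path.write_text("1.0\n2.0\n3.0\n")
        self.n_builds += 1

    def get_data(self):
        # Loading stays cheap: it only reads what prepare() produced.
        values = [float(line) for line in self.path.read_text().splitlines()]
        return dict(X=values)


with tempfile.TemporaryDirectory() as tmp:
    dataset = Dataset(tmp)
    dataset.prepare()
    dataset.prepare()  # second call does nothing
    data = dataset.get_data()
```

Checking for the output file is one simple way to make the method safe to call repeatedly, independent of any caching layer placed around it.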

Notes

  • Defaults to a no-op; datasets without a custom prepare() fall back to calling get_data() for backward compatibility.

  • Should be idempotent: calling it multiple times must be safe.

  • Parameters listed in prepare_cache_ignore are excluded from the cache key. Use prepare_cache_ignore = "all" to cache at most once per dataset class regardless of parameterization, or list specific names to ignore (e.g. a random seed):

    from benchopt import BaseDataset

    class Dataset(BaseDataset):
        parameters = {'n_samples': [100, 1000], 'seed': [0, 1, 2]}
        prepare_cache_ignore = ('seed',)  # 2 prepare() calls (one per n_samples) instead of 6
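The count of 2 instead of 6 can be checked with a small stand-alone calculation. The cache_key function below is a hypothetical model of how ignored parameters drop out of a cache key. It is not benchopt's actual implementation and only mirrors the rule described above.

```python
from itertools import product

parameters = {"n_samples": [100, 1000], "seed": [0, 1, 2]}
prepare_cache_ignore = ("seed",)


def cache_key(params, ignore):
    """Hypothetical cache key: the parameters that are not ignored."""
    if ignore == "all":
        return ()  # a single key for the whole dataset class
    return tuple(sorted((k, v) for k, v in params.items() if k not in ignore))


# Every parameter combination the benchmark would run (2 x 3 = 6).
names = list(parameters)
combos = [dict(zip(names, values)) for values in product(*parameters.values())]

# Distinct keys when 'seed' is ignored: one per n_samples value.
distinct_keys = {cache_key(c, prepare_cache_ignore) for c in combos}

# Distinct keys when every parameter is ignored.
n_all = len({cache_key(c, "all") for c in combos})
```

Under this model, the six combinations collapse to two keys when seed is ignored and to a single key with "all".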