benchopt.BaseDataset
- class benchopt.BaseDataset(**parameters)
Base class to define a dataset in a benchmark.
Datasets that derive from this class should implement one method:
get_data(): retrieves or simulates the data in this dataset and returns it as a dictionary. This dictionary is passed as keyword arguments to the objective's set_data method.
Optionally, datasets can implement:
prepare(): performs expensive one-time preparation (downloads, extraction, preprocessing) that is cached and separated from loading.
- abstractmethod get_data()
Return the data to feed to the objective.
- Returns:
- data: dict
Extra parameters of the objective. The objective will be instantiated by calling Objective.set_data(**data).
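The contract between get_data() and Objective.set_data(**data) can be sketched as follows. This is a minimal illustration that does not import benchopt: the Dataset and Objective classes below are hypothetical stand-ins, and the parameter names n_samples and seed and the keys X and y are assumptions chosen for the example.

```python
import random


class Dataset:
    """Stand-in for a benchopt dataset (hypothetical, no benchopt import)."""

    def __init__(self, n_samples=5, seed=0):
        self.n_samples = n_samples
        self.seed = seed

    def get_data(self):
        # Simulate a small regression problem with a fixed seed.
        rng = random.Random(self.seed)
        X = [[rng.gauss(0, 1) for _ in range(2)] for _ in range(self.n_samples)]
        y = [row[0] - row[1] for row in X]
        # The keys must match the keyword arguments of Objective.set_data.
        return dict(X=X, y=y)


class Objective:
    """Stand-in for a benchopt objective (hypothetical)."""

    def set_data(self, X, y):
        self.X, self.y = X, y


data = Dataset(n_samples=5, seed=0).get_data()
objective = Objective()
objective.set_data(**data)  # dictionary keys become keyword arguments
```

Because the dictionary is unpacked with **, a key that set_data does not accept would raise a TypeError in this sketch, so the dataset's keys and the objective's signature have to agree.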
- prepare()
Prepare the dataset for use (optional).
Called before benchmark runs to perform expensive one-time operations such as downloading data, extracting archives, or pre-processing. Benchopt caches the result with joblib so that repeated calls with the same parameters are no-ops. Triggered via benchopt prepare.
Notes
Defaults to a no-op; datasets without a custom prepare() fall back to calling get_data() for backward compatibility.
Should be idempotent: calling it multiple times must be safe.
Parameters listed in prepare_cache_ignore are excluded from the cache key. Use prepare_cache_ignore = "all" to cache at most once per dataset class regardless of parameterization, or list specific names to ignore (e.g. a random seed):

    class Dataset(BaseDataset):
        parameters = {'n_samples': [100, 1000], 'seed': [0, 1, 2]}
        prepare_cache_ignore = ('seed',)  # 2 calls instead of 6
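The "2 calls instead of 6" figure in the prepare_cache_ignore example follows from counting distinct cache keys over the parameter grid. The sketch below reproduces only that arithmetic. It is not how benchopt or joblib build the key internally, and count_prepare_calls is a hypothetical helper written for this illustration.

```python
from itertools import product


def count_prepare_calls(parameters, prepare_cache_ignore=()):
    """Count distinct cache keys over a parameter grid (illustrative helper)."""
    if prepare_cache_ignore == "all":
        return 1  # a single cache entry for the whole dataset class
    names = sorted(parameters)
    keys = set()
    for values in product(*(parameters[name] for name in names)):
        combo = dict(zip(names, values))
        # Drop ignored parameters before building the cache key.
        key = tuple(
            (k, v) for k, v in combo.items() if k not in prepare_cache_ignore
        )
        keys.add(key)
    return len(keys)


parameters = {"n_samples": [100, 1000], "seed": [0, 1, 2]}
print(count_prepare_calls(parameters))             # 6
print(count_prepare_calls(parameters, ("seed",)))  # 2
print(count_prepare_calls(parameters, "all"))      # 1
```

Ignoring seed collapses the three seeds for each n_samples value onto one key, so only the two n_samples values remain distinct.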