torchdata.cachers

This module contains the interface required of cachers (used by the cache method of td.Dataset).

To cache all samples on disk in the folder cache using Python’s pickle (assuming you have already created a td.Dataset instance named dataset):

import pathlib

import torchdata as td

...
dataset.cache(td.cachers.Pickle(pathlib.Path("./cache")))

Users are encouraged to write their own cachers if the ones provided below are too slow or otherwise unsuitable for their purposes (see the Cacher abstract interface below).

class torchdata.cachers.Cacher

Interface an object must fulfil to be compatible with the torchdata.Dataset.cache method.

If you want to implement your own caching functionality, inherit from this class and implement the methods described below.

abstract __contains__(index: int) → bool

Return True if the sample under index is cached.

If False is returned, the cacher’s __setitem__ will be called. Hence, if you do not intend to cache the sample under this index, implement that decision in __setitem__; this method is simply a boolean indicator of whether the sample is cached.

If True is returned, the cacher’s __getitem__ will be called, and it is the user’s responsibility to return the correct value in that case.

Parameters: index (int) – Index of sample
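
The contract described above can be pictured as the following lookup flow. This is only a sketch of how a dataset might consult a cacher, not torchdata’s actual implementation: fetch, compute and expensive are hypothetical names, and a plain dict stands in for the cacher because it already provides all three methods.

```python
from typing import Any, Callable


def fetch(cacher, index: int, compute: Callable[[int], Any]) -> Any:
    """Hypothetical helper mimicking the contract described above."""
    if index in cacher:        # calls cacher.__contains__(index)
        return cacher[index]   # calls cacher.__getitem__(index)
    sample = compute(index)    # generate the sample as usual
    cacher[index] = sample     # calls cacher.__setitem__(index, sample)
    return sample


calls = []


def expensive(index: int) -> int:
    calls.append(index)  # record every real computation
    return index * 10


cache = {}  # a dict satisfies the three-method protocol
first = fetch(cache, 3, expensive)
second = fetch(cache, 3, expensive)  # served from the cache this time
```

After both calls, expensive has run only once; the second lookup goes through __contains__ and __getitem__ alone.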

abstract __getitem__(index) → Any

Retrieve a sample from the cache.

This function MUST return a valid data sample; if you implement a custom cacher, ensuring this is your responsibility.

Return the data sample stored under the given index.

Parameters: index (int) – Index of sample

abstract __setitem__(index: int, data: Any) → None

Save the sample under index in the cache, or do nothing.

This function should save the sample under index so that it can later be retrieved by __getitem__. If you don’t want to save a specific index, you can implement that logic in the cacher or create a separate modifier solely for this purpose (the second approach is highly recommended).

Parameters

index (int) – Index of sample
data (Any) – Data generated by dataset.
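
As an illustration of the three methods working together, here is a minimal sketch of a custom cacher that skips some indices inside __setitem__ (the first of the two approaches mentioned above). FirstN and its limit parameter are hypothetical names. With torchdata installed the class would inherit from td.cachers.Cacher; it is written as a plain class here so the example runs on its own.

```python
from typing import Any, Dict


class FirstN:
    """Hypothetical cacher that only stores samples with index < limit."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self._storage: Dict[int, Any] = {}

    def __contains__(self, index: int) -> bool:
        # Only a boolean indicator: is this sample already stored?
        return index in self._storage

    def __getitem__(self, index: int) -> Any:
        # Reached only after __contains__ returned True.
        return self._storage[index]

    def __setitem__(self, index: int, data: Any) -> None:
        # The decision not to cache lives here, not in __contains__.
        if index < self.limit:
            self._storage[index] = data


cacher = FirstN(limit=2)
for i in range(4):
    cacher[i] = i * i  # indices 2 and 3 are silently ignored
```

Samples that are never stored keep reporting False from __contains__, so they would simply be regenerated on each access.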

class torchdata.cachers.Memory

Save and load data in a Python dictionary.

This cacher is used by default inside torchdata.Dataset.

__contains__(index: int) → bool: True if index is in the dictionary.

__getitem__(index: int): Retrieve data from the dictionary.

__setitem__(index: int, data: int): Add data to the dictionary.

class torchdata.cachers.Pickle(path: pathlib.Path, extension: str = '.pkl')

Save and load data from disk using the pickle module.

Data will be saved as pickle files (with the .pkl extension by default) in the specified path. If the path does not exist, it will be created.

This object can be used as a context manager, in which case it will delete path at the end of the block:

with td.cachers.Pickle(pathlib.Path("./disk")) as pickler:
    dataset = dataset.map(lambda x: x+1).cache(pickler)
    ... # Do something with dataset
... # Folder removed

You can also call the clean() method manually for the same effect (though this is discouraged, as it might cause __setitem__ to crash).

Important:

This cacher can persist between consecutive runs, provided you neither call clean() nor delete the folder manually. If you rely on this, please ensure consistent sampling (same seed and sampling order) for reproducible behaviour between runs.
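
Because cached files are keyed by index alone, index i has to refer to the same sample in every run for a persisted cache to stay valid. A minimal sketch of one way to keep the order stable, assuming samples are shuffled with Python’s random module (sample_order and the file names are hypothetical, not part of torchdata):

```python
import random
from typing import List, Sequence


def sample_order(items: Sequence[str], seed: int) -> List[str]:
    """Return items shuffled deterministically for a given seed."""
    rng = random.Random(seed)  # local generator, global state untouched
    order = list(items)
    rng.shuffle(order)
    return order


files = ["a.png", "b.png", "c.png", "d.png"]
run_one = sample_order(files, seed=0)
run_two = sample_order(files, seed=0)  # same seed, same order
```

With a different seed in the second run, position i could map to a different file and the cached entry for i would no longer match the data.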

path

Path to the folder where samples will be saved and loaded from.

Type: pathlib.Path

extension

Extension to use for saved pickle files. Default: .pkl

Type: str

__contains__(index: int) → bool

Check whether the file exists on disk.

If the file is available, the sample is considered cached; hence you can cache data between multiple runs (if you ensure repeatable sampling).

__getitem__(index: int)

Retrieve data specified by index.

Name of the item will be equal to {self.path}/{index}{extension}.

__setitem__(index: int, data: int)

Save data in specified folder.

Name of the item will be equal to {self.path}/{index}{extension}.

clean() → None

Recursively remove the folder self.path.

Behaves just like shutil.rmtree, but does nothing if the directory does not exist.
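
To make the behaviour documented for this class concrete, the following stand-in reproduces it using only the standard library. It is a sketch under the assumptions stated in this section (file naming, folder creation, removal on context exit), not the library’s source; PickleSketch and its _file helper are hypothetical names.

```python
import pathlib
import pickle
import shutil
import tempfile
from typing import Any


class PickleSketch:
    """Stand-in illustrating the documented behaviour; not the library's code."""

    def __init__(self, path: pathlib.Path, extension: str = ".pkl") -> None:
        self.path = pathlib.Path(path)
        self.extension = extension
        self.path.mkdir(parents=True, exist_ok=True)  # create folder if missing

    def _file(self, index: int) -> pathlib.Path:
        # Mirrors the documented name: {self.path}/{index}{extension}
        return self.path / f"{index}{self.extension}"

    def __contains__(self, index: int) -> bool:
        return self._file(index).is_file()

    def __getitem__(self, index: int) -> Any:
        with open(self._file(index), "rb") as handle:
            return pickle.load(handle)

    def __setitem__(self, index: int, data: Any) -> None:
        with open(self._file(index), "wb") as handle:
            pickle.dump(data, handle)

    def clean(self) -> None:
        if self.path.is_dir():  # do nothing when the folder is absent
            shutil.rmtree(self.path)

    def __enter__(self) -> "PickleSketch":
        return self

    def __exit__(self, exc_type, exc_value, traceback) -> None:
        self.clean()


root = pathlib.Path(tempfile.mkdtemp()) / "cache"
with PickleSketch(root) as cacher:
    cacher[0] = {"label": 1}
    restored = cacher[0]
    existed_inside = root.is_dir()
existed_after = root.is_dir()  # folder is removed on exit
cacher.clean()  # safe to call again: the folder no longer exists
```

Skipping the context manager (and never calling clean()) leaves the files in place, which is what allows a later run to find them through __contains__.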