
torchdata

This module contains PyTorch compatible datasets with extended capabilities.

To quickly start with torchdata, just inherit from torchdata.Dataset and create your dataset as you normally would, for example:

import pathlib

import torchdata as td
from PIL import Image

# Image loading dataset (use td.datasets.Files for even less typing :D )
class Dataset(td.Dataset):
    def __init__(self, path: pathlib.Path):
        super().__init__() # This is necessary
        self.files = [file for file in path.glob("*")]

    def __getitem__(self, index):
        return Image.open(self.files[index])

    def __len__(self):
        return len(self.files)

Now you can use cache, map, apply and reduce just by calling the appropriate methods (standard torch.utils.data.Dataset can still be used):

import torch
import torchvision

# Map PIL images to tensors and cache the dataset
dataset = Dataset(pathlib.Path("data")).map(torchvision.transforms.ToTensor()).cache()
# You can create DataLoader as well
dataloader = torch.utils.data.DataLoader(dataset)

td.Iterable is an extension of torch.utils.data.IterableDataset, which allows the user to use map, apply and filter, for example:

# Based on original PyTorch example
class Dataset(td.Iterable):
    def __init__(self, start: int, end: int):
        super().__init__() # This is necessary
        self.start: int = start
        self.end: int = end

    def __iter__(self):
        return iter(range(self.start, self.end))

# Only elements divisible by 2
dataset = Dataset(0, 100).filter(lambda value: value % 2 == 0)

Concrete implementations of the datasets described above are located inside the datasets module.

For custom caching routines and how to use them, see cachers and their modifiers. For the available general-purpose mapping functions, see maps.

Custom sampling techniques useful with torch.utils.data.DataLoader are located inside samplers.
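A custom sampler is any object implementing the standard torch.utils.data.Sampler interface; the sketch below illustrates the idea with plain PyTorch only and is not one of the concrete classes from samplers:

import torch

# Minimal sketch: a sampler that yields indices in reverse order.
# It relies only on the standard torch.utils.data.Sampler interface.
class ReverseSampler(torch.utils.data.Sampler):
    def __init__(self, data_source):
        self.data_source = data_source

    def __iter__(self):
        return iter(range(len(self.data_source) - 1, -1, -1))

    def __len__(self):
        return len(self.data_source)

# Samplers apply to map-style datasets (e.g. the cached image dataset above),
# not to Iterable/IterableDataset pipelines.
dataloader = torch.utils.data.DataLoader(dataset, sampler=ReverseSampler(dataset))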

class torchdata.Dataset[source]

torch.utils.data.Dataset with extended capabilities.

This class inherits from torch.utils.data.Dataset, so it can be used in the same manner after inheritance. It allows the user to perform the following operations:

  • cache - cache all/part of data in memory or on disk

  • map - apply function to each element of dataset

  • apply - apply function to all elements of dataset

  • reduce - reduce dataset to single value with specified function

Important:

  • The last cache able to hold a sample is used, no matter whether it is in-memory, on-disk or user-specified.

  • Although multiple cache calls in different parts of map should work, users are encouraged to use them as rarely as possible and as late as possible for best performance, as sketched below.
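As a rough illustration of the second point, a cache placed after an expensive map stores the already-transformed samples, so later accesses skip the per-sample work (a minimal sketch reusing the image Dataset from the example below):

# Cache after the transform: ToTensor runs once per sample
dataset = Dataset(pathlib.Path("data")).map(torchvision.transforms.ToTensor()).cache()

# Cache before the transform: raw samples are cached and ToTensor runs on every access
dataset = Dataset(pathlib.Path("data")).cache().map(torchvision.transforms.ToTensor())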

Example:

import pathlib

import torch
import torchdata
import torchvision
from PIL import Image

# Image loading dataset (use Files for more serious business)
class Dataset(torchdata.Dataset):
    def __init__(self, path: pathlib.Path):
        super().__init__() # This is necessary
        self.files = [file for file in path.glob("*")]

    def __getitem__(self, index):
        return Image.open(self.files[index])

    def __len__(self):
        return len(self.files)

# Map PIL images to tensors and cache the dataset
dataset = Dataset(pathlib.Path("data")).map(torchvision.transforms.ToTensor()).cache()
# Create DataLoader as normally
dataloader = torch.utils.data.DataLoader(dataset)
apply(function)

Apply function to every element of the dataset.

The specified function has to take a Python generator as its first argument. This generator yields consecutive samples from the dataset, and the function is free to do whatever it wants with them.

Other arguments will be forwarded to function.

WARNING:

This method returns whatever function returns, and it is up to the user to ensure the pipeline keeps working correctly after applying this transformation.

Example:

import torchdata

class Dataset(torchdata.Dataset):
    def __init__(self, max: int):
        super().__init__() # This is necessary
        self.range = list(range(max))

    def __getitem__(self, index):
        return self.range[index]

    def __len__(self):
        return len(self.range)

def summation(generator):
    return sum(value for value in generator)

summed_dataset = Dataset(101).apply(summation) # Returns 5050
Parameters

function (typing.Callable) – Function (or functional object) taking an item generator as its first argument and a variable list of other arguments (if necessary).

Returns

Value returned by function

Return type

typing.Any

cache(cacher: Callable = None)[source]

Cache data in memory, on disk, or specify custom caching.

By default, all samples are cached in memory. To change this behaviour, specify the cacher argument. Some cacher implementations can be found in the torchdata.cacher module, or you can provide your own by inheriting from torchdata.cacher.Cacher and implementing the appropriate methods.

Parameters

cacher (torchdata.cacher.Cacher, optional) – Instance of torchdata.cacher.Cacher (or any other object with compatible interface). Check cacher module documentation for more information. Default: torchdata.cacher.Memory which caches data in-memory

Returns

Returns self

Return type

Dataset
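Since cache also accepts any other object with a compatible interface, a user-provided cacher can be as simple as the dict-backed sketch below. The __contains__/__getitem__/__setitem__ method set is an assumption made here for illustration; check the cacher module documentation for the exact interface:

# Hypothetical dict-backed cacher; the required method names are assumed,
# not confirmed by this page.
class DictCacher:
    def __init__(self):
        self.data = {}

    def __contains__(self, index):
        return index in self.data

    def __getitem__(self, index):
        return self.data[index]

    def __setitem__(self, index, sample):
        self.data[index] = sample

dataset = dataset.cache(DictCacher())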

map(function: Callable)

Map function to each element of the dataset.

The function has no specified signature; it is the user's responsibility to ensure it takes the correct arguments, as returned from __getitem__ (in case of Dataset) or __iter__ (in case of Iterable).

Parameters

function (typing.Callable) – Function (or functor) taking arguments returned from __getitem__ and returning anything.

Returns

self

Return type

Dataset
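For instance, with the image-loading Dataset from the class example above, the mapped function receives exactly what __getitem__ returns, here a PIL image (a minimal sketch):

# __getitem__ returns a PIL image, so the mapped function receives a PIL image
dataset = Dataset(pathlib.Path("data")).map(lambda image: image.convert("L"))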

class torchdata.Iterable[source]

torch.utils.data.IterableDataset with extended capabilities.

This class inherits from torch.utils.data.IterableDataset, so it can be used in the same manner after inheritance.

It allows the user to perform the following operations:

  • map - apply function to each element of dataset

  • apply - apply function to all elements of dataset

  • filter - return elements for which predicate returns True

Example:

import torchdata

# Based on original PyTorch example
class Dataset(torchdata.Iterable):
    def __init__(self, start: int, end: int):
        super().__init__() # This is necessary
        self.start: int = start
        self.end: int = end

    def __iter__(self):
        return iter(range(self.start, self.end))

# range(1,25) originally, mapped to range(13, 37)
dataset = Dataset(1, 25).map(lambda value: value + 12)
# Sample-wise concatenation: yields pairs from range(13, 37) and range(1, 25)
for first, second in dataset | Dataset(1, 25):
    print(first, second) # "13 1" up to "36 24"
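The filter operation listed above chains in the same way; a minimal sketch using the class defined in this example:

# Keep only even values from range(1, 25)
dataset = Dataset(1, 25).filter(lambda value: value % 2 == 0)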
apply(function)

Apply function to every element of the dataset.

The specified function has to take a Python generator as its first argument. This generator yields consecutive samples from the dataset, and the function is free to do whatever it wants with them.

Other arguments will be forwarded to function.

WARNING:

This method returns whatever function returns, and it is up to the user to ensure the pipeline keeps working correctly after applying this transformation.

Example:

import torchdata

class Dataset(torchdata.Dataset):
    def __init__(self, max: int):
        super().__init__() # This is necessary
        self.range = list(range(max))

    def __getitem__(self, index):
        return self.range[index]

    def __len__(self):
        return len(self.range)

def summation(generator):
    return sum(value for value in generator)

summed_dataset = Dataset(101).apply(summation) # Returns 5050
Parameters

function (typing.Callable) – Function (or functional object) taking an item generator as its first argument and a variable list of other arguments (if necessary).

Returns

Value returned by function

Return type

typing.Any

map(function: Callable)

Map function to each element of the dataset.

The function has no specified signature; it is the user's responsibility to ensure it takes the correct arguments, as returned from __getitem__ (in case of Dataset) or __iter__ (in case of Iterable).

Parameters

function (typing.Callable) – Function (or functor) taking arguments returned from __getitem__ and returning anything.

Returns

self

Return type

Iterable