torchdata
This module contains PyTorch-compatible datasets with extended capabilities.
To quickly start with torchdata, just inherit from torchdata.Dataset and create your dataset as you normally would, for example:
import pathlib

import torchdata as td
from PIL import Image

# Image loading dataset (use td.datasets.Files for even less typing :D )
class Dataset(td.Dataset):
    def __init__(self, path: pathlib.Path):
        super().__init__()  # This is necessary
        self.files = list(path.glob("*"))

    def __getitem__(self, index):
        return Image.open(self.files[index])

    def __len__(self):
        return len(self.files)
Now you can use cache, map, apply, and reduce just by calling the appropriate methods (a standard torch.utils.data.Dataset can still be used):
import torch
import torchvision

# Map PIL images to Tensors and cache the dataset
dataset = Dataset(pathlib.Path("data")).map(torchvision.transforms.ToTensor()).cache()
# You can create a DataLoader as well
dataloader = torch.utils.data.DataLoader(dataset)
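reduce only appears in the reference below, described as reducing the dataset to a single value with a specified function. A minimal sketch, assuming a functools.reduce-style two-argument accumulator (an assumption - check the reduce entry for the exact signature):

# Hedged sketch: assumes reduce takes a two-argument accumulator function
# in the spirit of functools.reduce; verify against the reference entry
total_pixels = (
    Dataset(pathlib.Path("data"))
    .map(lambda image: image.size[0] * image.size[1])  # PIL size is (width, height)
    .reduce(lambda accumulated, sample: accumulated + sample)
)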
td.Iterable is an extension of torch.utils.data.IterableDataset which allows the user to use map, apply, and filter, for example:
# Based on the original PyTorch example
class Dataset(td.Iterable):
    def __init__(self, start: int, end: int):
        super().__init__()  # This is necessary
        self.start: int = start
        self.end: int = end

    def __iter__(self):
        return iter(range(self.start, self.end))

# Only elements divisible by 2
dataset = Dataset(0, 100).filter(lambda value: value % 2 == 0)
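Since each of these calls returns the dataset itself, transformations chain naturally; a short illustrative sketch (bounds and lambdas here are arbitrary):

# map runs first, then filter sees the mapped values
dataset = Dataset(0, 100).map(lambda value: value * 3).filter(lambda value: value % 2 == 0)
for value in dataset:  # yields 0, 6, 12, ...
    print(value)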
Concrete implementations of the datasets described above are located in the datasets module.
For custom caching routines and how to use them, see cachers and their modifiers.
To check the available general map-related functions, see maps.
Custom sampling techniques useful with torch.utils.data.DataLoader are located in samplers.
class torchdata.Dataset
torch.utils.data.Dataset with extended capabilities.
This class inherits from torch.utils.data.Dataset, so it can be used in the same manner after inheritance. It allows the user to perform the following operations:
- cache - cache all/part of the data in memory or on disk
- map - apply a function to each element of the dataset
- apply - apply a function to all elements of the dataset
- reduce - reduce the dataset to a single value with a specified function
Important:
The last cache which is able to hold a sample is used, no matter whether it is in-memory, on-disk, or user-specified. Although multiple cache calls in different parts of a map pipeline should work, users are encouraged to use cache as rarely as possible, and as late as possible, for best performance.
Example:

import pathlib

import torch
import torchdata
import torchvision
from PIL import Image

# Image loading dataset (use Files for more serious business)
class Dataset(torchdata.Dataset):
    def __init__(self, path: pathlib.Path):
        super().__init__()  # This is necessary
        self.files = list(path.glob("*"))

    def __getitem__(self, index):
        return Image.open(self.files[index])

    def __len__(self):
        return len(self.files)

# Map PIL to Tensor and cache dataset
dataset = Dataset(pathlib.Path("data")).map(torchvision.transforms.ToTensor()).cache()
# Create DataLoader as normally
dataloader = torch.utils.data.DataLoader(dataset)
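A sketch of the placement advice from the note above: cache after the deterministic, expensive decoding step, and keep random augmentation after the cache so it stays random every epoch (the torchvision transforms here are illustrative):

import torchdata
import torchvision

# Cache decoded tensors (deterministic, expensive); the random crop stays
# after the cache so each epoch still sees fresh augmentations
dataset = (
    Dataset(pathlib.Path("data"))
    .map(torchvision.transforms.ToTensor())
    .cache(torchdata.cacher.Memory())  # the documented default cacher
    .map(torchvision.transforms.RandomCrop(224))
)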
apply(function)
Apply function to every element of the dataset.
The specified function has to take a Python generator as its first argument. This generator yields consecutive samples from the dataset, and the function is free to do whatever it wants with them. Other arguments will be forwarded to the function.
WARNING:
This function returns whatever is returned from function, and it is up to the user to ensure the pipeline still functions correctly after this transformation.
Example:

class Dataset(torchdata.Dataset):
    def __init__(self, max: int):
        super().__init__()  # This is necessary
        self.range = list(range(max))

    def __getitem__(self, index):
        return self.range[index]

    def __len__(self):
        return len(self.range)

def summation(generator):
    return sum(value for value in generator)

summed_dataset = Dataset(101).apply(summation)  # Returns 5050
- Parameters
    function (typing.Callable) - Function (or functional object) taking an item generator as its first argument and a variable list of other arguments (if necessary).
- Returns
    Value returned by function
- Return type
    typing.Any
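Because other arguments are forwarded to function (per the note above), a parameterized aggregation might look like the sketch below; the forwarding of extra positional arguments is an assumption worth verifying:

def top(generator, count):
    # `count` is assumed to arrive via the forwarded extra arguments
    return sorted(generator, reverse=True)[:count]

top_three = Dataset(101).apply(top, 3)  # [100, 99, 98]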
cache(cacher: Callable = None)
Cache data in memory, on disk, or via custom caching.
By default all samples are cached in memory. To change this behaviour, specify the cacher argument. Some cacher implementations can be found in the torchdata.cacher module, or you can provide your own by inheriting from torchdata.cacher.Cacher and implementing the appropriate methods.
- Parameters
    cacher (torchdata.cacher.Cacher, optional) - Instance of torchdata.cacher.Cacher (or any other object with a compatible interface). Check the cacher module documentation for more information. Default: torchdata.cacher.Memory, which caches data in-memory.
- Returns
    self
- Return type
    torchdata.Dataset
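A minimal sketch of a user-provided cacher; the dict-like methods below are an assumption about the "compatible interface" mentioned above - consult the cachers module for the authoritative torchdata.cacher.Cacher signatures:

import pathlib
import pickle

class DiskCacher:
    """Hypothetical cacher storing each cached sample as a pickle file."""

    def __init__(self, path: pathlib.Path):
        self.path = path
        self.path.mkdir(parents=True, exist_ok=True)

    def __contains__(self, index: int) -> bool:
        # Assumed hook: whether a sample was already cached
        return (self.path / str(index)).exists()

    def __setitem__(self, index: int, data):
        # Assumed hook: store a freshly computed sample
        with open(self.path / str(index), "wb") as file:
            pickle.dump(data, file)

    def __getitem__(self, index: int):
        # Assumed hook: load a previously cached sample
        with open(self.path / str(index), "rb") as file:
            return pickle.load(file)

dataset = dataset.cache(DiskCacher(pathlib.Path("./disk_cache")))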
map(function: Callable)
Map function to each element of the dataset.
The function has no specified signature; it is the user's responsibility to ensure it takes the correct arguments as returned from __getitem__ (in the case of Dataset) or __iter__ (in the case of Iterable).
- Parameters
    function (typing.Callable) - Function (or functor) taking arguments returned from __getitem__ and returning anything.
- Returns
    self
- Return type
    torchdata.Dataset
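A small illustrative sketch: the mapped function simply receives whatever __getitem__ returns for a given index (a float tensor here, assuming the image dataset above):

def normalize(sample):
    # Receives exactly what __getitem__ (after any earlier maps) returned
    return (sample - sample.mean()) / sample.std()

dataset = dataset.map(normalize)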
class torchdata.Iterable
torch.utils.data.IterableDataset with extended capabilities.
This class inherits from torch.utils.data.IterableDataset, so it can be used in the same manner after inheritance. It allows the user to perform the following operations:
- map - apply a function to each element of the dataset
- apply - apply a function to all elements of the dataset
- filter - return elements for which predicate returns True
Example:

import torchdata

# Based on the original PyTorch example
class Dataset(torchdata.Iterable):
    def __init__(self, start: int, end: int):
        super().__init__()  # This is necessary
        self.start: int = start
        self.end: int = end

    def __iter__(self):
        return iter(range(self.start, self.end))

# range(1, 25) originally, mapped to range(13, 37)
dataset = Dataset(1, 25).map(lambda value: value + 12)
# Sample-wise concatenation, yields pairs from range(13, 37) and range(1, 25)
for first, second in dataset | Dataset(1, 25):
    print(first, second)  # 13 1 up to 36 24
apply(function)
Apply function to every element of the dataset.
The specified function has to take a Python generator as its first argument. This generator yields consecutive samples from the dataset, and the function is free to do whatever it wants with them. Other arguments will be forwarded to the function.
WARNING:
This function returns whatever is returned from function, and it is up to the user to ensure the pipeline still functions correctly after this transformation.
Example:

class Dataset(torchdata.Dataset):
    def __init__(self, max: int):
        super().__init__()  # This is necessary
        self.range = list(range(max))

    def __getitem__(self, index):
        return self.range[index]

    def __len__(self):
        return len(self.range)

def summation(generator):
    return sum(value for value in generator)

summed_dataset = Dataset(101).apply(summation)  # Returns 5050

- Parameters
    function (typing.Callable) - Function (or functional object) taking an item generator as its first argument and a variable list of other arguments (if necessary).
- Returns
    Value returned by function
- Return type
    typing.Any
map(function: Callable)
Map function to each element of the dataset.
The function has no specified signature; it is the user's responsibility to ensure it takes the correct arguments as returned from __getitem__ (in the case of Dataset) or __iter__ (in the case of Iterable).
- Parameters
    function (typing.Callable) - Function (or functor) taking arguments returned from __iter__ and returning anything.
- Returns
    self
- Return type
    torchdata.Iterable