Helper Module for Deep Learning.
Module that provides core functions to load and split a dataset.
-
class
pynet.datasets.core.DataItem(inputs, outputs, labels)¶ -
property
inputs¶ Alias for field number 0
-
property
labels¶ Alias for field number 2
-
property
outputs¶ Alias for field number 1
-
class
pynet.datasets.core.DataManager(input_path, metadata_path, output_path=None, labels=None, stratify_label=None, custom_stratification=None, projection_labels=None, number_of_folds=10, batch_size=1, sampler='random', input_transforms=None, output_transforms=None, data_augmentation_transforms=None, add_input=False, test_size=0.1, label_mapping=None, patch_size=None, continuous_labels=False, sample_size=1, **dataloader_kwargs)[source]¶ Data manager used to split a dataset into train, test and validation pytorch datasets.
-
__init__(input_path, metadata_path, output_path=None, labels=None, stratify_label=None, custom_stratification=None, projection_labels=None, number_of_folds=10, batch_size=1, sampler='random', input_transforms=None, output_transforms=None, data_augmentation_transforms=None, add_input=False, test_size=0.1, label_mapping=None, patch_size=None, continuous_labels=False, sample_size=1, **dataloader_kwargs)[source]¶ Splits an input numpy array using memory-mapping into three sets: test, train and validation. This function can stratify the data.
The train/test indices are generated with a ShuffleSplit, which is stratified when a stratification is requested.
TODO: In the case of custom stratification, enable the weighted random sampler.
- Parameters
input_path: str
the path to the numpy array containing the input tensor data that will be split/loaded, or the dataset itself.
metadata_path: str
the path to the metadata table in tsv format.
output_path: str, default None
the path to the numpy array containing the output tensor data that will be split/loaded.
labels: list of str, default None
in case of classification/regression, the name of the column(s) in the metadata table to be predicted.
projection_labels: dict, default None
selects only the data that match the conditions. Use this dictionary to filter the input data from the metadata table: {<column_name>: <value>}.
stratify_label: str, default None
the name of the column in the metadata table containing the label used during the stratification (mutually exclusive with ‘custom_stratification’).
custom_stratification: dict, default None
split the dataset into train/validation/test according to the defined stratification strategy. The filtering is performed as for the labels projection (mutually exclusive with ‘stratify_label’).
number_of_folds: int, default 10
the number of folds that will be used in the cross validation.
batch_size: int, default 1
the size of each mini-batch.
sampler: str or Sampler, default ‘random’
whether to use a sequential, random or weighted random sampler (the latter to deal with imbalanced classes) when generating the mini-batches: None, ‘random’, ‘weighted_random’ or a custom Sampler class.
input_transforms, output_transforms: list of callable, default None
transforms a list of samples with pre-defined transformations.
data_augmentation_transforms: list of callable, default None
transforms the training dataset input with pre-defined transformations on the fly during the training.
add_input: bool, default False
if true, concatenate the input tensor to the output tensor.
test_size: float, default 0.1
should be between 0.0 and 1.0 and represents the proportion of the dataset to include in the test split.
label_mapping: dict, default None
a mapping that can be used to convert labels to be predicted (string to int conversion).
patch_size: tuple, default None
the size of the patches that will be extracted from the input/output images.
continuous_labels: bool, default False
if set, consider labels as continuous values (i.e. floats); otherwise as discrete values (i.e. integers).
sample_size: float, default 1
should be between 0.0 and 1.0 and represents the proportion of the dataset used by the manager (random selection that can be useful during testing).
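The combination of memory-mapping and a stratified shuffle split described above can be illustrated with a minimal sketch. This is not pynet's implementation: the helper `stratified_shuffle_split`, the toy data and the file name are hypothetical, and only NumPy calls (`np.save`, `np.load` with `mmap_mode`) are used.

```python
import os
import tempfile

import numpy as np


def stratified_shuffle_split(labels, test_size=0.1, seed=0):
    """Return (train_idx, test_idx) preserving class proportions.

    Hypothetical helper written for illustration only.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    train_idx, test_idx = [], []
    for value in np.unique(labels):
        # Shuffle the indices of one class, then cut off its test share.
        members = rng.permutation(np.flatnonzero(labels == value))
        n_test = int(round(test_size * len(members)))
        test_idx.extend(members[:n_test])
        train_idx.extend(members[n_test:])
    return np.sort(np.array(train_idx)), np.sort(np.array(test_idx))


# Toy data: 100 samples, 80 of class 0 and 20 of class 1.
data = np.arange(100 * 4, dtype=np.float32).reshape(100, 4)
labels = np.array([0] * 80 + [1] * 20)

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "inputs.npy")
    np.save(path, data)
    # mmap_mode="r" maps the file instead of reading it fully into memory.
    inputs = np.load(path, mmap_mode="r")
    train_idx, test_idx = stratified_shuffle_split(labels, test_size=0.1)
    # Converting the selection to a plain array copies only those rows.
    test_inputs = np.array(inputs[test_idx])
    del inputs  # release the memory map before the directory is removed
```

With `test_size=0.1`, each class contributes roughly 10% of its samples to the test set, so the class balance of the full dataset is kept in both splits.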
-
collate_fn(list_samples)[source]¶ After fetching a list of samples using the indices from sampler, the function passed as the collate_fn argument is used to collate lists of samples into batches.
A custom collate_fn is used here to apply the transformations.
See https://pytorch.org/docs/stable/data.html#dataloader-collate-fn.
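As a rough illustration of a collate function that applies transformations before batching, here is a minimal NumPy-only sketch. The factory `make_collate_fn` and the `rescale` transform are hypothetical names, not part of pynet; with PyTorch, a function of this kind would be passed to `DataLoader` through its `collate_fn` argument.

```python
import numpy as np


def make_collate_fn(input_transforms=None):
    """Build a collate function that transforms, then stacks, samples.

    Hypothetical helper written for illustration only.
    """
    transforms = input_transforms or []

    def collate_fn(list_samples):
        batch = []
        for sample in list_samples:
            # Apply every transform in order to each sample.
            for transform in transforms:
                sample = transform(sample)
            batch.append(sample)
        # Stack along a new leading axis to form the mini-batch.
        return np.stack(batch, axis=0)

    return collate_fn


def rescale(arr):
    """Example transform: map intensities from [0, 255] to [0, 1]."""
    return arr / 255.0


collate = make_collate_fn(input_transforms=[rescale])
samples = [np.full((2, 2), 255.0), np.zeros((2, 2))]
batch = collate(samples)
```

Applying the transforms at collate time means they run once per mini-batch fetch, on the samples selected by the sampler.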
-
classmethod
from_dataset(test_dataset=None, train_dataset=None, validation_dataset=None, batch_size=1, sampler='random', multi_bloc=False)[source]¶ Create a data manager from torch datasets.
- Parameters
*_dataset: Dataset
the train/validation/test datasets.
batch_size: int, default 1
the size of each mini-batch.
sampler: str or Sampler, default ‘random’
whether to use a sequential, random or weighted random sampler (the latter to deal with imbalanced classes) when generating the mini-batches: None, ‘random’, ‘weighted_random’ or a custom Sampler class.
multi_bloc: bool, default False
if set, expect multi-bloc datasets that return a list of N blocs of data.
- Returns
ins: DataManager
a data manager.
-
classmethod
from_numpy(test_inputs=None, test_outputs=None, test_labels=None, train_inputs=None, train_outputs=None, train_labels=None, validation_inputs=None, validation_outputs=None, validation_labels=None, batch_size=1, sampler='random', input_transforms=None, output_transforms=None, data_augmentation_transforms=None, add_input=False, label_mapping=None, patch_size=None, continuous_labels=False)[source]¶ Create a data manager from numpy arrays.
- Parameters
*_inputs, *_outputs, *_labels: ndarrays
the train, validation and test data.
batch_size: int, default 1
the size of each mini-batch.
sampler: str or Sampler, default ‘random’
whether to use a sequential, random or weighted random sampler (the latter to deal with imbalanced classes) when generating the mini-batches: None, ‘random’, ‘weighted_random’ or a custom Sampler class.
input_transforms, output_transforms: list of callable, default None
transforms a list of samples with pre-defined transformations.
data_augmentation_transforms: list of callable, default None
transforms the training dataset input with pre-defined transformations on the fly during the training.
add_input: bool, default False
if true, concatenate the input tensor to the output tensor.
label_mapping: dict, default None
a mapping that can be used to convert labels to be predicted (string to int conversion).
patch_size: tuple, default None
the size of the patches that will be extracted from the input/output images.
continuous_labels: bool, default False
if set, consider labels as continuous values (i.e. floats); otherwise as discrete values (i.e. integers).
- Returns
ins: DataManager
a data manager.
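The interplay between label_mapping and continuous_labels can be pictured with the following sketch. The function `convert_labels` is a hypothetical stand-in, not pynet code, and the chosen dtypes (`int64` for discrete labels, `float32` for continuous ones) are assumptions.

```python
import numpy as np


def convert_labels(raw_labels, label_mapping=None, continuous_labels=False):
    """Map raw labels to numbers and pick a dtype (illustrative only)."""
    if label_mapping is not None:
        # String to int conversion, e.g. {"control": 0, "patient": 1}.
        raw_labels = [label_mapping[label] for label in raw_labels]
    # Assumed dtypes: floats for regression, integers for classification.
    dtype = np.float32 if continuous_labels else np.int64
    return np.asarray(raw_labels, dtype=dtype)


classes = convert_labels(
    ["control", "patient", "control"],
    label_mapping={"control": 0, "patient": 1},
)
ages = convert_labels([23.5, 41.0], continuous_labels=True)
```

Here `classes` holds integer class indices suitable for classification, while `ages` keeps floating-point targets for regression.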
-
get_dataloader(train=False, validation=False, test=False, fold_index=0)[source]¶ Generate a pytorch DataLoader.
- Parameters
train: bool, default False
return the dataloader over the train set.
validation: bool, default False
return the dataloader over the validation set.
test: bool, default False
return the dataloader over the test set.
fold_index: int, default 0
the index of the fold to use for the training.
- Returns
loaders: list of DataLoader
the requested data loaders.
-
static
get_mask(df, projection_labels=None, sample_size=1)[source]¶ Filter a table.
- Parameters
df: a pandas DataFrame
a table data.
projection_labels: dict, default None
selects only the data that match the conditions in the dict {<column_name>: <value>}.
sample_size: float, default 1
should be between 0.0 and 1.0 and represents the proportion of the dataset used by the manager (random selection that can be useful during testing).
- Returns
mask: a list of boolean values.
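A filter of this kind could look like the sketch below. It is an assumption-laden illustration rather than the actual `get_mask` code: the name `build_mask`, the `seed` argument and the acceptance of a list of values per column are all hypothetical.

```python
import numpy as np
import pandas as pd


def build_mask(df, projection_labels=None, sample_size=1, seed=0):
    """Return a list of booleans selecting rows of df (illustrative only)."""
    mask = np.ones(len(df), dtype=bool)
    for column, value in (projection_labels or {}).items():
        # Assumption: accept either a single value or a list of values.
        values = value if isinstance(value, (list, tuple)) else [value]
        mask &= df[column].isin(values).to_numpy()
    if sample_size < 1:
        # Randomly keep only a proportion of the rows that passed the filter.
        rng = np.random.default_rng(seed)
        kept = np.flatnonzero(mask)
        n_keep = int(round(sample_size * len(kept)))
        chosen = rng.choice(kept, size=n_keep, replace=False)
        mask = np.zeros(len(df), dtype=bool)
        mask[chosen] = True
    return mask.tolist()


df = pd.DataFrame({"site": ["A", "B", "A", "A"], "age": [20, 30, 40, 50]})
mask = build_mask(df, projection_labels={"site": "A"})
half = build_mask(df, sample_size=0.5)
```

The first call keeps the three rows acquired at site "A"; the second keeps a random half of the table, which can be handy for quick tests on a reduced dataset.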
-