Helper Module for Deep Learning.

Module that provides core functions to load and split a dataset.

class pynet.datasets.core.DataItem(inputs, outputs, labels)¶

property inputs¶: Alias for field number 0

property labels¶: Alias for field number 2

property outputs¶: Alias for field number 1
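The "Alias for field number N" descriptions are the ones Python generates for namedtuple fields. The following is a minimal sketch of how such an item can be built and read, assuming DataItem is a plain namedtuple with the three documented fields; the sample values are made up for illustration.

```python
from collections import namedtuple

# Assumption: DataItem is a plain namedtuple whose field order matches the
# documented "field number" aliases (inputs=0, outputs=1, labels=2).
DataItem = namedtuple("DataItem", ["inputs", "outputs", "labels"])

# Hypothetical sample: one input vector, no output tensor, one label.
item = DataItem(inputs=[0.1, 0.2], outputs=None, labels=[1])

# Fields can be read by name or by position.
assert item.inputs is item[0]
assert item.outputs is item[1]
assert item.labels is item[2]

# Tuple unpacking follows the same field order.
inputs, outputs, labels = item
```

Reading fields by name keeps calling code readable, while positional unpacking stays available for code that treats the item as an ordinary tuple.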

class pynet.datasets.core.DataManager(input_path, metadata_path, output_path=None, labels=None, stratify_label=None, custom_stratification=None, projection_labels=None, number_of_folds=10, batch_size=1, sampler='random', input_transforms=None, output_transforms=None, data_augmentation_transforms=None, add_input=False, test_size=0.1, label_mapping=None, patch_size=None, continuous_labels=False, sample_size=1, **dataloader_kwargs)[source]¶

Data manager used to split a dataset into train, test and validation PyTorch datasets.

__init__(input_path, metadata_path, output_path=None, labels=None, stratify_label=None, custom_stratification=None, projection_labels=None, number_of_folds=10, batch_size=1, sampler='random', input_transforms=None, output_transforms=None, data_augmentation_transforms=None, add_input=False, test_size=0.1, label_mapping=None, patch_size=None, continuous_labels=False, sample_size=1, **dataloader_kwargs)[source]¶

Splits an input NumPy array, loaded using memory mapping, into three sets: test, train and validation. The split can optionally be stratified.

The train/test indices are generated with a ShuffleSplit, stratified or not.

TODO: In the case of custom stratification, enable the weighted random sampler.
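To make the idea of a stratified shuffle split concrete, here is a dependency-free sketch. It is not the code used by DataManager: the function name `stratified_shuffle_split` and the toy labels are assumptions chosen for illustration, and it only shows how shuffling within each class keeps the class proportions in both subsets.

```python
import random
from collections import defaultdict


def stratified_shuffle_split(labels, test_size=0.1, seed=0):
    """Return (train_indices, test_indices), preserving class proportions.

    Hypothetical helper written for this example only.
    """
    rng = random.Random(seed)

    # Group the sample indices by class label.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, test = [], []
    for indices in by_class.values():
        # Shuffle inside each class, then cut the same fraction from each.
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Toy imbalanced dataset: 80 samples of class "a", 20 of class "b".
labels = ["a"] * 80 + ["b"] * 20
train_idx, test_idx = stratified_shuffle_split(labels, test_size=0.1)
```

With these toy labels the test subset keeps the 80/20 class ratio of the full dataset, which is the property stratification is meant to guarantee.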

Parameters

input_path: str

the path to the numpy array containing the input tensor data that will be split/loaded, or the dataset itself.

metadata_path: str

the path to the metadata table in tsv format.

output_path: str, default None

the path to the numpy array containing the output tensor data that will be split/loaded.

labels: list of str, default None

in case of classification/regression, the name of the column(s) in the metadata table to be predicted.

projection_labels: dict, default None

selects only the data that match the conditions. Use this dictionary to filter the input data from the metadata table: {<column_name>: <value>}.

stratify_label: str, default None

the name of the column in the metadata table containing the label used during the stratification (mutually exclusive with ‘custom_stratification’).

custom_stratification: dict, default None

split the dataset into train/validation/test according to the defined stratification strategy. The filtering is performed as for the labels projection (mutually exclusive with ‘stratify_label’).

number_of_folds: int, default 10

the number of folds that will be used in the cross validation.

batch_size: int, default 1

the size of each mini-batch.

sampler: str or Sampler, default ‘random’

whether we use a sequential, random or weighted random sampler (to deal with imbalanced classes issue) during the generation of the mini-batches: None, ‘random’, ‘weighted_random’ or a custom Sampler class.

input_transforms, output_transforms: list of callable, default None

transforms a list of samples with pre-defined transformations.

data_augmentation_transforms: list of callable, default None

transforms the training dataset input with pre-defined transformations on the fly during the training.

add_input: bool, default False

if true, concatenate the input tensor to the output tensor.

test_size: float, default 0.1

a value between 0.0 and 1.0 giving the proportion of the dataset to include in the test split.

label_mapping: dict, default None

a mapping that can be used to convert labels to be predicted (string to int conversion).

patch_size: tuple, default None

the size of the patches that will be extracted from the input/output images.

continuous_labels: bool, default False

if set, consider labels as continuous values, i.e. floats; otherwise as discrete values, i.e. integers.

sample_size: float, default 1

a value between 0.0 and 1.0 giving the proportion of the dataset used by the manager (random selection that can be useful during testing).
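The interplay between label_mapping and continuous_labels described in the parameters above can be sketched as follows. This is an illustration under assumptions, not the library's code: `convert_labels` is a hypothetical helper, and the class names and values are invented.

```python
def convert_labels(raw_labels, label_mapping=None, continuous_labels=False):
    """Map raw labels, then cast them to float or int.

    Hypothetical helper: labels missing from the mapping are kept as-is
    before the cast.
    """
    mapping = label_mapping or {}
    # Continuous labels become floats, discrete labels become integers.
    cast = float if continuous_labels else int
    return [cast(mapping.get(label, label)) for label in raw_labels]


# Classification: string classes converted to integer codes.
classes = convert_labels(
    ["control", "patient", "control"],
    label_mapping={"control": 0, "patient": 1},
)

# Regression: values read as text from a table, cast to floats.
ages = convert_labels(["23.5", "41"], continuous_labels=True)
```

Here `classes` holds integer codes suitable for a classification loss, while `ages` holds floats suitable for regression.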

collate_fn(list_samples)[source]¶

After fetching a list of samples using the indices from sampler, the function passed as the collate_fn argument is used to collate lists of samples into batches.

A custom collate_fn is used here to apply the transformations.

See https://pytorch.org/docs/stable/data.html#dataloader-collate-fn.
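The role of a custom collate function can be sketched without any dependency. The example below is an assumption-laden illustration, not DataManager.collate_fn itself: `make_collate_fn` is a hypothetical factory, samples are plain `(input, label)` pairs, and the batch is returned as Python lists where a real implementation would typically stack tensors.

```python
def make_collate_fn(input_transforms=None):
    """Build a collate function that applies transforms to each sample.

    Hypothetical factory written for this example only.
    """
    transforms = input_transforms or []

    def collate_fn(list_samples):
        inputs, labels = [], []
        for sample_input, sample_label in list_samples:
            # Apply every transform, in order, to the sample input.
            for transform in transforms:
                sample_input = transform(sample_input)
            inputs.append(sample_input)
            labels.append(sample_label)
        # One batch: all inputs together, all labels together.
        return inputs, labels

    return collate_fn


def double(values):
    """Toy transform: multiply every value by two."""
    return [2 * value for value in values]


collate = make_collate_fn(input_transforms=[double])
batch_inputs, batch_labels = collate([([1.0, 2.0], 0), ([3.0, 4.0], 1)])
```

In PyTorch, a function of this shape is passed to `torch.utils.data.DataLoader` through its `collate_fn` argument, so the transformations run once per mini-batch.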

classmethod from_dataset(test_dataset=None, train_dataset=None, validation_dataset=None, batch_size=1, sampler='random', multi_bloc=False)[source]¶

Create a data manager from torch datasets.

Parameters

*_dataset: Dataset

the train/validation/test datasets.

batch_size: int, default 1

the size of each mini-batch.

sampler: str or Sampler, default ‘random’

whether we use a sequential, random or weighted random sampler (to deal with imbalanced classes issue) during the generation of the mini-batches: None, ‘random’, ‘weighted_random’ or a custom Sampler class.

multi_bloc: bool, default False

if set, expect multi-bloc datasets that return a list with N blocs of data.

Returns

ins: DataManager

a data manager.
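The ‘weighted_random’ sampler option exists to counter class imbalance. A common way to derive per-sample weights for that purpose is inverse class frequency, sketched below. The weighting scheme actually used by the data manager is not stated on this page, so `inverse_frequency_weights` and the toy labels are assumptions.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Give each sample a weight of 1 / (size of its class).

    Hypothetical helper: with these weights every class carries the same
    total weight, so rare classes are drawn as often as frequent ones.
    """
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]


# Toy imbalanced labels: three samples of class 0, one of class 1.
labels = [0, 0, 0, 1]
weights = inverse_frequency_weights(labels)
```

Weights of this kind can be handed to `torch.utils.data.WeightedRandomSampler`, which draws sample indices with probability proportional to the given weights.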

classmethod from_numpy(test_inputs=None, test_outputs=None, test_labels=None, train_inputs=None, train_outputs=None, train_labels=None, validation_inputs=None, validation_outputs=None, validation_labels=None, batch_size=1, sampler='random', input_transforms=None, output_transforms=None, data_augmentation_transforms=None, add_input=False, label_mapping=None, patch_size=None, continuous_labels=False)[source]¶

Create a data manager from numpy arrays.

Parameters

*_inputs, *_outputs, *_labels: ndarrays

the train/validation/test data.

batch_size: int, default 1

the size of each mini-batch.

sampler: str or Sampler, default ‘random’

whether we use a sequential, random or weighted random sampler (to deal with imbalanced classes issue) during the generation of the mini-batches: None, ‘random’, ‘weighted_random’ or a custom Sampler class.

input_transforms, output_transforms: list of callable, default None

transforms a list of samples with pre-defined transformations.

data_augmentation_transforms: list of callable, default None

transforms the training dataset input with pre-defined transformations on the fly during the training.

add_input: bool, default False

if true, concatenate the input tensor to the output tensor.

label_mapping: dict, default None

a mapping that can be used to convert labels to be predicted (string to int conversion).

patch_size: tuple, default None

the size of the patches that will be extracted from the input/output images.

continuous_labels: bool, default False

if set, consider labels as continuous values, i.e. floats; otherwise as discrete values, i.e. integers.

Returns

ins: DataManager

a data manager.

get_dataloader(train=False, validation=False, test=False, fold_index=0)[source]¶

Generate a pytorch DataLoader.

Parameters

train: bool, default False

return the dataloader over the train set.

validation: bool, default False

return the dataloader over the validation set.

test: bool, default False

return the dataloader over the test set.

fold_index: int, default 0

the index of the fold to use for the training.

Returns

loaders: list of DataLoader

the requested data loaders.
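How a fold index selects a train/validation partition can be illustrated with a plain k-fold scheme. This is only one possible scheme and an assumption: the page does not say how the manager builds its folds, and `kfold_indices` is a hypothetical helper.

```python
def kfold_indices(n_samples, number_of_folds, fold_index):
    """Return (train, validation) indices for one fold.

    Hypothetical helper: every number_of_folds-th sample, starting at
    fold_index, goes to validation; the rest go to training.
    """
    indices = list(range(n_samples))
    validation = indices[fold_index::number_of_folds]
    held_out = set(validation)
    train = [index for index in indices if index not in held_out]
    return train, validation


# Ten samples, five folds: inspect the first fold.
train, validation = kfold_indices(n_samples=10, number_of_folds=5, fold_index=0)
```

Iterating `fold_index` from 0 to `number_of_folds - 1` places every sample in the validation subset exactly once.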

static get_mask(df, projection_labels=None, sample_size=1)[source]¶

Filter a table.

Parameters

df: a pandas DataFrame

the table data.

projection_labels: dict, default None

selects only the data that match the conditions in the dict {<column_name>: <value>}.

sample_size: float, default 1

a value between 0.0 and 1.0 giving the proportion of the dataset used by the manager (random selection that can be useful during testing).

Returns

mask: a list of boolean values.

static get_mask_indices(mask)[source]¶: From an input mask vector, return the true indices.
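The two static methods above can be pictured with a dependency-free sketch. It is not the library code: a plain dict of columns stands in for the pandas DataFrame, `build_mask` and `mask_indices` are hypothetical names, conditions are assumed to be simple equality tests, and the random sub-sampling controlled by sample_size is left out.

```python
def build_mask(table, projection_labels=None):
    """Return one boolean per row, True where all conditions match.

    Hypothetical helper: table maps column names to lists of cell values.
    """
    n_rows = len(next(iter(table.values())))
    mask = [True] * n_rows
    for column, value in (projection_labels or {}).items():
        # A row stays selected only if it matches every condition.
        mask = [keep and cell == value
                for keep, cell in zip(mask, table[column])]
    return mask


def mask_indices(mask):
    """Return the positions at which the mask is True."""
    return [index for index, keep in enumerate(mask) if keep]


# Toy metadata table with two columns and four rows.
table = {"site": ["A", "B", "A", "A"], "sex": ["F", "F", "M", "F"]}
mask = build_mask(table, projection_labels={"site": "A", "sex": "F"})
selected = mask_indices(mask)
```

With this toy table, only the first and last rows satisfy both conditions, so `selected` contains their positions.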

class pynet.datasets.core.SetItem(test, train, validation)¶

property test¶: Alias for field number 0

property train¶: Alias for field number 1

property validation¶: Alias for field number 2
