deepnog.data package

deepnog.data.dataset module

Author: Lukas Gosch

Date: 2019-10-03

Description:

Dataset classes and helper functions for usage with deep network models written in PyTorch.

class deepnog.data.dataset.ProteinDataset(*args: Any, **kwargs: Any)

Bases: torch.utils.data.Dataset

Protein dataset with sequences and labels for training.

If sequences and labels are provided as files rather than objects, all proteins are loaded from the input files and stored during construction. This costs some startup delay, but it makes it possible to truly shuffle the complete dataset during training.

Parameters

sequences (list, str, Path) – Protein sequences as a list of Biopython Seq objects, or the path to a FASTA file containing the sequences.
labels (DataFrame, str, Path, optional) – Protein orthologous group labels as a DataFrame, or the path to a CSV file containing such a DataFrame. This is required for training, and ignored during inference. The file must be in CSV format with a header line and an index column, that is, readable by pandas.read_csv(…, index_col=0). The labels are expected in a column named “eggnog_id” or in the last column, and sequence IDs in a column named “protein_id”.
f_format (str, optional) – File format in which to expect the protein sequences. Must be supported by Biopython’s Bio.SeqIO class.
label_encoder (LabelEncoder, optional) – The label encoder maps str class names to numerical labels. Provide a label encoder during validation.
verbose (int, optional) – Control verbosity of logging.
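
The labels file layout described above can be illustrated with a minimal sketch. It uses only the standard library instead of pandas, and the file contents, the IDs, and the helper name read_labels are hypothetical; it is not the code ProteinDataset runs, only a picture of the expected columns.

```python
import csv
import io

# Hypothetical labels file: header line, unnamed index column,
# then "protein_id" and "eggnog_id" columns (IDs are made up).
LABELS_CSV = """,protein_id,eggnog_id
0,prot_1,COG0001
1,prot_2,COG0002
"""


def read_labels(handle):
    """Return {protein_id: label} from a labels CSV that has an index column."""
    reader = csv.reader(handle)
    header = next(reader)
    columns = header[1:]  # drop the index column, as index_col=0 would
    id_col = columns.index("protein_id")
    # Labels come from "eggnog_id" if present, else from the last column.
    if "eggnog_id" in columns:
        label_col = columns.index("eggnog_id")
    else:
        label_col = len(columns) - 1
    labels = {}
    for row in reader:
        if not row:
            continue  # tolerate blank lines
        values = row[1:]
        labels[values[id_col]] = values[label_col]
    return labels


labels = read_labels(io.StringIO(LABELS_CSV))
```

With pandas available, the same file would be read with pandas.read_csv(path, index_col=0), as stated in the parameter description.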

class deepnog.data.dataset.ProteinIterableDataset(*args: Any, **kwargs: Any)

Bases: torch.utils.data.IterableDataset

Protein dataset holding the proteins to classify.

Does not load and store all proteins from a given sequence file but only holds an iterator to the next sequence to load.

Thread-safe class allowing for multi-worker loading of sequences from a given data file.

Parameters

file (str) – Path to file storing the protein sequences.
labels_file (str, optional) – Path to file storing labels associated with the sequences. This is required for training, and ignored during inference. The file must be in CSV format with a header line and an index column, that is, readable by pandas.read_csv(…, index_col=0). The labels are expected in a column named “eggnog_id” or in the last column.
f_format (str) – File format in which to expect the protein sequences. Must be supported by Biopython’s Bio.SeqIO class.
label_encoder (LabelEncoder, optional) – The label encoder maps str class names to numerical labels. Provide a label encoder during validation.

class deepnog.data.dataset.ProteinIterator(file_, labels: pandas.DataFrame, aa_vocab, f_format, n_skipped: Union[int, deepnog.utils.sync.SynchronizedCounter] = 0, num_workers=1, worker_id=0)

Bases: object

Iterator allowing for multiprocess data loading of a sequence file.

ProteinIterator is a wrapper for the iterator returned by Biopython’s Bio.SeqIO class when parsing a sequence file. It defines a custom __next__() method to support single- and multi-process data loading.

In the single-process loading case, nothing special happens: the ProteinIterator iterates sequentially over the data file. At the end, it reports the number of skipped sequences (due to empty IDs) to the main module by setting a global variable there.

In the multi-process loading case, each ProteinIterator loads a sequence and then skips the next few sequences, which are dedicated to the other workers. Each worker skips num_workers - 1 data samples on every call to __next__(). In addition, each worker skips worker_id data samples during initialization.
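
The striding scheme described above can be sketched with itertools.islice. This is only an illustration of the skipping pattern, not the ProteinIterator implementation; the function name worker_records and the example records are hypothetical.

```python
from itertools import islice


def worker_records(records, num_workers=1, worker_id=0):
    """Yield every num_workers-th record, starting at offset worker_id.

    Skipping worker_id records up front and num_workers - 1 records after
    each yielded one gives every worker a disjoint share of the file.
    """
    return islice(records, worker_id, None, num_workers)


# Hypothetical stand-ins for the records parsed from a sequence file.
records = [f"seq{i}" for i in range(7)]

# Each worker gets its own iterator over the same file.
shards = [
    list(worker_records(iter(records), num_workers=3, worker_id=w))
    for w in range(3)
]
```

With num_workers=1 and worker_id=0 the function reduces to plain sequential iteration, which matches the single-process case.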

The ProteinIterator class also makes sure that a unique ID is set for each SeqRecord obtained from the data-iterator. This allows unambiguous handling of large protein datasets which may have duplicate IDs from merging multiple sources or may have no IDs at all. For easy and efficient sorting of batches of sequences as well as for direct access to the original IDs, the index is stored separately.

Parameters

file_ (str) – Path to sequence file, from which an iterator over the sequences will be created with Biopython’s Bio.SeqIO.parse() function.
labels (pd.DataFrame) – DataFrame storing labels associated with the sequences. This is required for training, and ignored during inference. Must contain ‘protein_id’ and ‘label_num’ columns providing identifiers and numerical labels.
aa_vocab (dict) – Amino-acid vocabulary mapping letters to integers
f_format (str) – File format in which to expect the protein sequences. Must be supported by Biopython’s Bio.SeqIO class.
num_workers (int) – Number of workers set in the DataLoader, or one if no workers are set. If greater than or equal to two, multi-process loading is used.
worker_id (int) – ID of worker this iterator belongs to

class deepnog.data.dataset.ShuffledProteinIterableDataset(*args: Any, **kwargs: Any)

Bases: deepnog.data.dataset.ProteinIterableDataset

Shuffle an iterable ProteinDataset by introducing a shuffle buffer.

Parameters

file (str) – Path to file storing the protein sequences.
labels_file (str, optional) – Path to file storing labels associated with the sequences. This is required for training, and ignored during inference. The file must be in CSV format with a header line and an index column, that is, readable by pandas.read_csv(…, index_col=0). The labels are expected in a column named “eggnog_id” or in the last column.
f_format (str) – File format in which to expect the protein sequences. Must be supported by Biopython’s Bio.SeqIO class.
label_encoder (LabelEncoder, optional) – The label encoder maps str class names to numerical labels. Provide a label encoder during validation.
buffer_size (int) – Number of objects held in the shuffle buffer, i.e. the number of objects available to choose from at each step.

References

Adapted from code by Sharvil Nanavati, see https://discuss.pytorch.org/t/how-to-shuffle-an-iterable-dataset/64130/5
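
A shuffle buffer over a stream can be sketched as follows. This is a generic version of the technique under the assumption that buffer_size items are kept in memory and one is drawn at random per step; the function name shuffle_buffered and the seed parameter are hypothetical and not part of the deepnog API.

```python
import random


def shuffle_buffered(iterable, buffer_size, seed=None):
    """Yield items of iterable in approximately shuffled order."""
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)
    buffer = []
    for item in iterable:
        if len(buffer) < buffer_size:
            # Fill the buffer before emitting anything.
            buffer.append(item)
            continue
        # Buffer is full: emit a random buffered item and put the
        # incoming item in its place.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Input exhausted: drain the remaining items in random order.
    rng.shuffle(buffer)
    yield from buffer


shuffled = list(shuffle_buffered(range(10), buffer_size=4, seed=0))
```

A larger buffer_size gives a better approximation of a full shuffle at the cost of memory; with buffer_size=1 the input order is preserved.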

deepnog.data.dataset.collate_sequences(batch: Union[List[deepnog.data.dataset.sequence], deepnog.data.dataset.sequence], zero_padding: bool = True, min_length: int = 36, random_padding: bool = False) → deepnog.data.dataset.collated_sequences

Collate and zero-pad encoded sequences.

Parameters

batch (namedtuple, or list of namedtuples) – Batch of protein sequences to classify stored as a namedtuple sequence.
zero_padding (bool) – Zero-pad protein sequences, that is, append zeros until every sequence is as long as the longest sequence in the batch. NOTE: currently unused. Zero-padding is always performed.
min_length (int, optional) – Zero-pad sequences to at least min_length. By default, this is set to 36, which is the largest kernel size in the default DeepNOG architecture.
random_padding (bool, optional) – Zero-pad sequences by both prepending and appending zeros, with the fraction on each side determined randomly. This may counter detrimental effects that could arise if short sequences always had long zero-tails.

Returns

batch – Input batch zero-padded and stored in namedtuple collated_sequences.

Return type

NamedTuple
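
The padding rules described above can be sketched on plain lists of integers. This is an illustration under stated assumptions, not the collate_sequences implementation: the function name pad_batch and the rng parameter are hypothetical, and the real function works on namedtuples and returns collated_sequences.

```python
import random


def pad_batch(encoded, min_length=36, random_padding=False, rng=None):
    """Zero-pad integer-encoded sequences to a common length.

    The target length is the longest sequence in the batch, but never
    less than min_length.
    """
    rng = rng or random.Random()
    target = max(min_length, max(len(seq) for seq in encoded))
    padded = []
    for seq in encoded:
        n_pad = target - len(seq)
        # With random padding, a random share of the zeros goes in front;
        # otherwise all zeros are appended.
        n_front = rng.randint(0, n_pad) if random_padding else 0
        padded.append([0] * n_front + list(seq) + [0] * (n_pad - n_front))
    return padded


batch = pad_batch([[1, 2, 3], [4, 5]], min_length=5)
```

Zero is free to serve as the padding value because the amino-acid vocabulary starts at 1.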

deepnog.data.dataset.gen_amino_acid_vocab(alphabet=None)

Create vocabulary for protein sequences.

A vocabulary is defined as a mapping from the amino-acid letters in the alphabet to numbers. As this mapping is aware of zero-padding, it maps the first letter in the alphabet to 1 instead of 0.

Parameters

alphabet (str) – Alphabet to use for vocabulary. If None, use ‘ACDEFGHIKLMNPQRSTVWYBXZJUO’ (equivalent to Biopython’s deprecated ExtendedIUPACProtein).

Returns

vocab – Mapping of amino acid characters to numbers.

Return type

dict
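
A padding-aware vocabulary of this kind can be sketched in one line with enumerate. The function name make_vocab is hypothetical, and the actual gen_amino_acid_vocab may do more (for example, handle lowercase letters); the sketch only shows the 1-based mapping.

```python
def make_vocab(alphabet="ACDEFGHIKLMNPQRSTVWYBXZJUO"):
    """Map each letter to a 1-based index, leaving 0 free for padding."""
    return {letter: index for index, letter in enumerate(alphabet, start=1)}


vocab = make_vocab()

# Encode a short, made-up peptide as integers.
encoded = [vocab[aa] for aa in "MKV"]
```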

deepnog.data.split module

class deepnog.data.split.DataSplit(X_train: pandas.DataFrame, X_val: pandas.DataFrame, X_test: pandas.DataFrame, y_train: pandas.DataFrame, y_val: pandas.DataFrame, y_test: pandas.DataFrame, uniref_train: Optional[pandas.DataFrame], uniref_val: Optional[pandas.DataFrame], uniref_test: Optional[pandas.DataFrame])

Bases: object

Class for returned data, labels, and groups after train/val/test split.

X_test: pandas.DataFrame

X_train: pandas.DataFrame

X_val: pandas.DataFrame

uniref_test: Optional[pandas.DataFrame]

uniref_train: Optional[pandas.DataFrame]

uniref_val: Optional[pandas.DataFrame]

y_test: pandas.DataFrame

y_train: pandas.DataFrame

y_val: pandas.DataFrame

deepnog.data.split.group_train_val_test_split(df: pandas.DataFrame, train_ratio: float = 0.96, validation_ratio: float = 0.02, test_ratio: float = 0.02, random_state: int = 123, with_replacement: bool = True, verbose: int = 0) → deepnog.data.split.DataSplit

Create training/validation/test split for deepnog experiments.

Takes UniRef cluster IDs into account, that is, makes sure that sequences from the same cluster go into the same set. In other words, training, validation, and test sets are disjoint in terms of UniRef clusters.

Parameters

df (pandas DataFrame) – Must contain ‘string_id’, ‘eggnog_id’, ‘uniref_id’ columns
train_ratio (float) – Fraction of total sequences for training set
validation_ratio (float) – Fraction of total sequences for validation set
test_ratio (float) – Fraction of total sequences for test set
random_state (int) – Set random state for reproducible results
with_replacement (bool) – By default, scikit-learn GroupShuffleSplit samples objects with replacement. Disabling replacement removes
verbose (int) – Level of logging verbosity

Returns

data_split – Split X, y, groups

Return type

DataSplit
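
The idea of a cluster-aware split can be sketched with the standard library. This is a greedy stand-in under stated assumptions, not the function's method (which, per the with_replacement description, relies on scikit-learn's GroupShuffleSplit); the function name group_split and the cluster IDs are hypothetical.

```python
import math
import random
from collections import defaultdict


def group_split(cluster_ids, train_ratio=0.96, validation_ratio=0.02,
                test_ratio=0.02, seed=123):
    """Split row indices so that no cluster spans two sets."""
    if not math.isclose(train_ratio + validation_ratio + test_ratio, 1.0):
        raise ValueError("ratios must sum to 1")
    # Collect the row indices belonging to each cluster.
    groups = defaultdict(list)
    for index, cluster in enumerate(cluster_ids):
        groups[cluster].append(index)
    clusters = sorted(groups)
    random.Random(seed).shuffle(clusters)
    n_total = len(cluster_ids)
    train, val, test = [], [], []
    # Assign whole clusters: fill train first, then validation, then test.
    for cluster in clusters:
        if len(train) < train_ratio * n_total:
            train.extend(groups[cluster])
        elif len(val) < validation_ratio * n_total:
            val.extend(groups[cluster])
        else:
            test.extend(groups[cluster])
    return train, val, test


# Made-up cluster IDs, one per sequence.
cluster_ids = ["c1", "c1", "c2", "c3", "c3", "c4", "c5", "c5", "c6", "c7"]
train, val, test = group_split(cluster_ids, 0.6, 0.2, 0.2)
```

Because whole clusters are assigned at once, the resulting set sizes only approximate the requested ratios.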

deepnog.data.split.train_val_test_split(df: pandas.DataFrame, train_ratio: float = 0.96, validation_ratio: float = 0.02, test_ratio: float = 0.02, stratify: bool = True, shuffle: bool = True, random_state: int = 123, verbose: int = 0) → deepnog.data.split.DataSplit

Create training/validation/test split for deepnog experiments.

Does not take UniRef clusters into account. Do not use for UniRef50/90 experiments.

Parameters

df (pandas DataFrame) – Must contain ‘string_id’, ‘eggnog_id’ columns
train_ratio (float) – Fraction of total sequences for training set
validation_ratio (float) – Fraction of total sequences for validation set
test_ratio (float) – Fraction of total sequences for test set
stratify (bool) – Stratify the splits according to the orthology labels
shuffle (bool) – Shuffle the sequences
random_state (int) – Set random state for reproducible results
verbose (int) – Level of logging verbosity

Returns

data_split – Split X, y, groups

Return type

DataSplit