deepnog.data package

deepnog.data.dataset module

Author: Lukas Gosch

Date: 2019-10-03

Description:

Dataset classes and helper functions for usage with deep network models written in PyTorch.

class deepnog.data.dataset.ProteinIterator(file_, labels: pandas.DataFrame, aa_vocab, f_format, n_skipped: Union[int, deepnog.utils.sync.SynchronizedCounter] = 0, num_workers=1, worker_id=0)

Bases: object

Iterator allowing for multiprocess data loading of a sequence file.

ProteinIterator is a wrapper for the iterator returned by Biopython’s Bio.SeqIO module when parsing a sequence file. It defines a custom __next__() method to support single- and multi-process data loading.

In the single-process case, the ProteinIterator simply iterates over the data file sequentially. When it finishes, it reports the number of skipped sequences (those with empty IDs) to the main module by setting a global variable there.

In the multi-process case, each ProteinIterator loads a sequence and then skips the sequences assigned to the other workers: each worker skips num_workers - 1 data samples on every call to __next__(). In addition, each worker skips worker_id data samples during initialization.
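The skipping scheme described above can be illustrated with a minimal sketch in plain Python. The helper name shard_records is hypothetical and is not part of deepnog; the sketch only mirrors the described behaviour (skip worker_id records once, then num_workers - 1 records after each yielded one) on an ordinary list instead of a sequence file.

```python
from itertools import islice


def shard_records(records, num_workers=1, worker_id=0):
    """Yield the records one worker would see (hypothetical helper, not deepnog API)."""
    iterator = iter(records)
    # Initialization: skip `worker_id` records so each worker starts at its own offset.
    next(islice(iterator, worker_id, worker_id), None)
    for record in iterator:
        yield record
        # After each record, skip the ones that belong to the other workers.
        next(islice(iterator, num_workers - 1, num_workers - 1), None)


records = [f"seq{i}" for i in range(7)]
shards = [list(shard_records(records, num_workers=3, worker_id=w)) for w in range(3)]
```

Taken together, the three shards cover every record exactly once, which is the property the multi-process loading relies on.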

The ProteinIterator class also makes sure that a unique ID is set for each SeqRecord obtained from the data-iterator. This allows unambiguous handling of large protein datasets which may have duplicate IDs from merging multiple sources or may have no IDs at all. For easy and efficient sorting of batches of sequences as well as for direct access to the original IDs, the index is stored separately.

Parameters

file_ (str) – Path to the sequence file, from which an iterator over the sequences will be created with Biopython’s Bio.SeqIO.parse() function.
labels (pd.DataFrame) – Dataframe storing the labels associated with the sequences. This is required for training and ignored during inference. Must contain ‘protein_id’ and ‘label_num’ columns providing identifiers and numerical labels.
aa_vocab (dict) – Amino-acid vocabulary mapping letters to integers.
f_format (str) – File format in which to expect the protein sequences. Must be supported by Biopython’s Bio.SeqIO module.
num_workers (int) – Number of workers set in the DataLoader, or one if no workers are set. If greater than or equal to two, multi-process loading is used.
worker_id (int) – ID of the worker this iterator belongs to.

deepnog.data.dataset.collate_sequences(batch: Union[List[deepnog.data.dataset.sequence], deepnog.data.dataset.sequence], zero_padding: bool = True, min_length: int = 36, random_padding: bool = False) → deepnog.data.dataset.collated_sequences

Collate and zero-pad encoded sequences.

Parameters

batch (namedtuple, or list of namedtuples) – Batch of protein sequences to classify stored as a namedtuple sequence.
zero_padding (bool) – Zero-pad protein sequences, that is, append zeros until every sequence is as long as the longest sequence in the batch. NOTE: currently unused. Zero-padding is always performed.
min_length (int, optional) – Zero-pad sequences to at least min_length. By default, this is set to 36, which is the largest kernel size in the default DeepNOG/DeepEncoding architecture.
random_padding (bool, optional) – Zero-pad sequences by both prepending and appending zeros, with the fraction placed in front chosen at random. This may counter detrimental effects that could arise if short sequences always had long trailing runs of zeros.

Returns

batch – Input batch zero-padded and stored in namedtuple collated_sequences.

Return type

NamedTuple
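The padding behaviour described by these parameters can be sketched in plain Python. This is an illustration under assumptions, not the deepnog implementation: the function name pad_batch and the rng argument are hypothetical, and nested lists stand in for the tensors and namedtuples the real function works with.

```python
import random


def pad_batch(encoded, min_length=36, random_padding=False, rng=None):
    """Zero-pad integer-encoded sequences to a common length (hypothetical helper)."""
    rng = rng or random.Random()
    # Target length: the longest sequence in the batch, but at least `min_length`.
    target = max(min_length, max((len(seq) for seq in encoded), default=0))
    padded = []
    for seq in encoded:
        n_pad = target - len(seq)
        # With random padding, a randomly chosen share of the zeros goes in front.
        n_front = rng.randint(0, n_pad) if random_padding else 0
        padded.append([0] * n_front + list(seq) + [0] * (n_pad - n_front))
    return padded


batch = pad_batch([[1, 2, 3], [4, 5]], min_length=5)
shifted = pad_batch([[1, 2, 3], [4, 5]], min_length=5, random_padding=True,
                    rng=random.Random(0))
```

In this sketch the zero index is reserved for padding, which is why the encoded sequences themselves must not contain zeros.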

deepnog.data.dataset.gen_amino_acid_vocab(alphabet=None)

Create vocabulary for protein sequences.

A vocabulary is defined as a mapping from the amino-acid letters in the alphabet to numbers. As this mapping is aware of zero-padding, it maps the first letter in the alphabet to 1 instead of 0.

Parameters

alphabet (str) – Alphabet to use for the vocabulary. If None, use ‘ACDEFGHIKLMNPQRSTVWYBXZJUO’ (equivalent to Biopython’s deprecated ExtendedIUPACProtein alphabet).

Returns

vocab – Mapping of amino-acid characters to numbers.

Return type

dict
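As a rough illustration of a padding-aware vocabulary, the sketch below numbers the letters starting at 1 so that 0 stays free for zero-padding. The name build_vocab is hypothetical, and the actual deepnog function may handle additional cases not shown here.

```python
def build_vocab(alphabet="ACDEFGHIKLMNPQRSTVWYBXZJUO"):
    """Map each letter to its 1-based position in the alphabet (hypothetical helper)."""
    # Start at 1 so that index 0 remains reserved for zero-padding.
    return {letter: index for index, letter in enumerate(alphabet, start=1)}


vocab = build_vocab()
encoded = [vocab[aa] for aa in "MKV"]
```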

deepnog.data.split module

class deepnog.data.split.DataSplit(X_train: pandas.DataFrame, X_val: pandas.DataFrame, X_test: pandas.DataFrame, y_train: pandas.DataFrame, y_val: pandas.DataFrame, y_test: pandas.DataFrame, uniref_train: Optional[pandas.DataFrame], uniref_val: Optional[pandas.DataFrame], uniref_test: Optional[pandas.DataFrame])

Bases: object

Class for returned data, labels, and groups after train/val/test split.

X_test: pandas.DataFrame

X_train: pandas.DataFrame

X_val: pandas.DataFrame

uniref_test: Optional[pandas.DataFrame]

uniref_train: Optional[pandas.DataFrame]

uniref_val: Optional[pandas.DataFrame]

y_test: pandas.DataFrame

y_train: pandas.DataFrame

y_val: pandas.DataFrame

deepnog.data.split.group_train_val_test_split(df: pandas.DataFrame, train_ratio: float = 0.96, validation_ratio: float = 0.02, test_ratio: float = 0.02, random_state: int = 123, with_replacement: bool = True, verbose: int = 0) → deepnog.data.split.DataSplit

Create training/validation/test split for deepnog experiments.

Takes UniRef cluster IDs into account, that is, makes sure that sequences from the same cluster go into the same set. In other words, training, validation, and test sets are disjoint in terms of UniRef clusters.

Parameters

df (pandas DataFrame) – Must contain ‘string_id’, ‘eggnog_id’, ‘uniref_id’ columns
train_ratio (float) – Fraction of total sequences for training set
validation_ratio (float) – Fraction of total sequences for validation set
test_ratio (float) – Fraction of total sequences for test set
random_state (int) – Set random state for reproducible results
with_replacement (bool) – By default, scikit-learn GroupShuffleSplit samples objects with replacement. Disabling replacement removes
verbose (int) – Level of logging verbosity

Returns

data_split – Split X, y, groups

Return type

DataSplit
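The idea of a cluster-aware split can be sketched without any dependencies. The greedy assignment below is a stand-in chosen for brevity and is not the procedure deepnog uses; the function name group_split and the example cluster IDs are hypothetical. Whole clusters are shuffled and assigned to one set each, so the resulting ratios are only approximate.

```python
import random
from collections import defaultdict


def group_split(group_ids, train_ratio=0.96, validation_ratio=0.02, random_state=123):
    """Assign sample indices to train/val/test without splitting any group."""
    members = defaultdict(list)
    for index, group in enumerate(group_ids):
        members[group].append(index)
    groups = sorted(members)
    random.Random(random_state).shuffle(groups)

    n_total = len(group_ids)
    split = {"train": [], "val": [], "test": []}
    for group in groups:
        # Fill the training set first, then validation; the rest is the test set.
        if len(split["train"]) < train_ratio * n_total:
            target = "train"
        elif len(split["val"]) < validation_ratio * n_total:
            target = "val"
        else:
            target = "test"
        split[target].extend(members[group])
    return split


cluster_ids = ["c1", "c1", "c2", "c3", "c3", "c3", "c4", "c5", "c5", "c6"]
split = group_split(cluster_ids, train_ratio=0.6, validation_ratio=0.2)
```

Because a cluster is never divided, no cluster ID appears in more than one of the three sets, regardless of how close the final sizes come to the requested ratios.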

deepnog.data.split.train_val_test_split(df: pandas.DataFrame, train_ratio: float = 0.96, validation_ratio: float = 0.02, test_ratio: float = 0.02, stratify: bool = True, shuffle: bool = True, random_state: int = 123, verbose: int = 0) → deepnog.data.split.DataSplit

Create training/validation/test split for deepnog experiments.

Does not take UniRef clusters into account. Do not use for UniRef50/90 experiments.

Parameters

df (pandas DataFrame) – Must contain ‘string_id’, ‘eggnog_id’ columns
train_ratio (float) – Fraction of total sequences for training set
validation_ratio (float) – Fraction of total sequences for validation set
test_ratio (float) – Fraction of total sequences for test set
stratify (bool) – Stratify the splits according to the orthology labels
shuffle (bool) – Shuffle the sequences
random_state (int) – Set random state for reproducible results
verbose (int) – Level of logging verbosity

Returns

data_split – Split X, y, groups

Return type

DataSplit
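What stratification means here can be shown with a dependency-free sketch: every label is shuffled and divided separately, so each set receives roughly the same label proportions. The name stratified_split and the example labels are hypothetical, and the sketch returns index lists rather than a DataSplit; it does not reproduce the library's implementation.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_ratio=0.96, validation_ratio=0.02, random_state=123):
    """Split sample indices per label so that label proportions are preserved."""
    rng = random.Random(random_state)
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    split = {"train": [], "val": [], "test": []}
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        n_train = round(train_ratio * len(indices))
        n_val = round(validation_ratio * len(indices))
        split["train"].extend(indices[:n_train])
        split["val"].extend(indices[n_train:n_train + n_val])
        # Whatever remains after training and validation goes to the test set.
        split["test"].extend(indices[n_train + n_val:])
    return split


labels = ["COG1"] * 10 + ["COG2"] * 10
split = stratified_split(labels, train_ratio=0.8, validation_ratio=0.1)
```

Unlike the group-aware variant, nothing in this sketch prevents closely related sequences from landing in different sets, which is consistent with the warning against using this function for UniRef50/90 experiments.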