deepnog.data package

deepnog.data.dataset module

Author: Lukas Gosch

Date: 2019-10-03

Description:

Dataset classes and helper functions for usage with deep network models written in PyTorch.

class deepnog.data.dataset.ProteinIterator(file_, labels: pandas.DataFrame, aa_vocab, f_format, n_skipped: Union[int, deepnog.utils.sync.SynchronizedCounter] = 0, num_workers=1, worker_id=0)[source]

Bases: object

Iterator allowing for multiprocess data loading of a sequence file.

ProteinIterator is a wrapper for the iterator returned by Biopython’s Bio.SeqIO module when parsing a sequence file. It defines a custom __next__() method to support single- and multi-process data loading.

In the single-process loading case, the ProteinIterator simply iterates over the data file sequentially. When it finishes, it informs the main module about the number of skipped sequences (due to empty IDs) by setting a global variable in the main module.

In the multi-process loading case, each ProteinIterator loads one sequence and then skips the sequences assigned to the other workers: every call to __next__() skips num_workers - 1 data samples. In addition, each worker skips worker_id data samples during initialization.
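The round-robin scheme described above can be illustrated with a minimal sketch. This is not the ProteinIterator implementation; shard_records is a hypothetical helper that uses itertools.islice to reproduce the same skipping pattern on any iterator.

```python
from itertools import islice


def shard_records(records, num_workers=1, worker_id=0):
    """Hypothetical helper: yield every num_workers-th record,
    starting at index worker_id."""
    # islice skips worker_id items up front, then takes one item and
    # skips the num_workers - 1 items that belong to the other workers.
    return islice(records, worker_id, None, num_workers)


# Seven placeholder records split across three workers.
records = [f"seq{i}" for i in range(7)]
shards = [
    list(shard_records(iter(records), num_workers=3, worker_id=w))
    for w in range(3)
]
```

Under these assumptions, every record is seen by exactly one worker, and with num_workers=1 the iterator degenerates to plain sequential iteration.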

The ProteinIterator class also makes sure that a unique ID is set for each SeqRecord obtained from the data-iterator. This allows unambiguous handling of large protein datasets which may have duplicate IDs from merging multiple sources or may have no IDs at all. For easy and efficient sorting of batches of sequences as well as for direct access to the original IDs, the index is stored separately.

Parameters
  • file_ (str) – Path to sequence file, from which an iterator over the sequences will be created with Biopython’s Bio.SeqIO.parse() function.

  • labels (pd.DataFrame) – Dataframe storing the labels associated with the sequences. This is required for training and ignored during inference. Must contain ‘protein_id’ and ‘label_num’ columns providing identifiers and numerical labels.

  • aa_vocab (dict) – Amino-acid vocabulary mapping letters to integers

  • f_format (str) – File format in which to expect the protein sequences. Must be supported by Biopython’s Bio.SeqIO module.

  • num_workers (int) – Number of workers set in DataLoader, or one if no workers are set. If greater than or equal to two, multi-process loading is used.

  • worker_id (int) – ID of worker this iterator belongs to

deepnog.data.dataset.collate_sequences(batch: Union[List[deepnog.data.dataset.sequence], deepnog.data.dataset.sequence], zero_padding: bool = True, min_length: int = 36, random_padding: bool = False) → deepnog.data.dataset.collated_sequences[source]

Collate and zero-pad encoded sequences.

Parameters
  • batch (namedtuple, or list of namedtuples) – Batch of protein sequences to classify stored as a namedtuple sequence.

  • zero_padding (bool) – Zero-pad protein sequences, that is, append zeros until every sequence is as long as the longest sequence in the batch. NOTE: currently unused. Zero-padding is always performed.

  • min_length (int, optional) – Zero-pad sequences to at least min_length. By default, this is set to 36, which is the largest kernel size in the default DeepNOG/DeepEncoding architecture.

  • random_padding (bool, optional) – Zero-pad sequences by both prepending and appending zeros, with the fraction placed on each side determined randomly. This may counter the detrimental effects of short sequences otherwise always having long zero-tails.

Returns

batch – Input batch zero-padded and stored in namedtuple collated_sequences.

Return type

NamedTuple
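The padding behaviour described by min_length and random_padding can be sketched on plain Python lists. This is only an illustration under stated assumptions, not the collate_sequences implementation: pad_batch and its rng argument are hypothetical, and the real function operates on the sequence namedtuple rather than on bare lists.

```python
import random


def pad_batch(encoded, min_length=36, random_padding=False, rng=None):
    """Hypothetical helper: zero-pad integer-encoded sequences
    to a common length of at least min_length."""
    rng = rng or random.Random()
    target = max(min_length, max(len(seq) for seq in encoded))
    padded = []
    for seq in encoded:
        n_zeros = target - len(seq)
        # With random padding, a random share of the zeros goes in front;
        # otherwise all zeros are appended as a tail.
        n_front = rng.randint(0, n_zeros) if random_padding else 0
        padded.append([0] * n_front + list(seq) + [0] * (n_zeros - n_front))
    return padded


# Two short encoded sequences; 0 is reserved for padding.
batch = [[1, 2, 3], [4, 5, 6, 7, 8]]
tail_padded = pad_batch(batch, min_length=8)
randomly_padded = pad_batch(
    batch, min_length=8, random_padding=True, rng=random.Random(0)
)
```

In both modes every sequence ends up with the same length; only the position of the zeros differs.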

deepnog.data.dataset.gen_amino_acid_vocab(alphabet=None)[source]

Create vocabulary for protein sequences.

A vocabulary is defined as a mapping from the amino-acid letters in the alphabet to numbers. As this mapping is aware of zero-padding, it maps the first letter in the alphabet to 1 instead of 0.

Parameters

alphabet (str) – Alphabet to use for vocabulary. If None, use ‘ACDEFGHIKLMNPQRSTVWYBXZJUO’ (equivalent to Biopython’s deprecated ExtendedIUPACProtein).

Returns

vocab – Mapping of amino acid characters to numbers.

Return type

dict
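A minimal sketch of such a padding-aware mapping follows. It is not the gen_amino_acid_vocab implementation; make_vocab is a hypothetical stand-in that handles the given letters only, whereas the library function may treat case or unknown letters differently.

```python
DEFAULT_ALPHABET = "ACDEFGHIKLMNPQRSTVWYBXZJUO"


def make_vocab(alphabet=None):
    """Hypothetical helper: map each letter to a 1-based index."""
    if alphabet is None:
        alphabet = DEFAULT_ALPHABET
    # Start at 1 so that 0 stays reserved for zero-padding.
    return {letter: index for index, letter in enumerate(alphabet, start=1)}


vocab = make_vocab()
encoded = [vocab[aa] for aa in "ACDY"]
```

Because no letter maps to 0, padded positions can be told apart from real residues after encoding.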

deepnog.data.split module

class deepnog.data.split.DataSplit(X_train: pandas.DataFrame, X_val: pandas.DataFrame, X_test: pandas.DataFrame, y_train: pandas.DataFrame, y_val: pandas.DataFrame, y_test: pandas.DataFrame, uniref_train: Optional[pandas.DataFrame], uniref_val: Optional[pandas.DataFrame], uniref_test: Optional[pandas.DataFrame])[source]

Bases: object

Class for returned data, labels, and groups after train/val/test split.

X_test: pandas.DataFrame
X_train: pandas.DataFrame
X_val: pandas.DataFrame
uniref_test: Optional[pandas.DataFrame]
uniref_train: Optional[pandas.DataFrame]
uniref_val: Optional[pandas.DataFrame]
y_test: pandas.DataFrame
y_train: pandas.DataFrame
y_val: pandas.DataFrame
deepnog.data.split.group_train_val_test_split(df: pandas.DataFrame, train_ratio: float = 0.96, validation_ratio: float = 0.02, test_ratio: float = 0.02, random_state: int = 123, with_replacement: bool = True, verbose: int = 0) → deepnog.data.split.DataSplit[source]

Create training/validation/test split for deepnog experiments.

Takes UniRef cluster IDs into account, that is, makes sure that sequences from the same cluster go into the same set. In other words, training, validation, and test sets are disjoint in terms of UniRef clusters.

Parameters
  • df (pandas DataFrame) – Must contain ‘string_id’, ‘eggnog_id’, ‘uniref_id’ columns

  • train_ratio (float) – Fraction of total sequences for training set

  • validation_ratio (float) – Fraction of total sequences for validation set

  • test_ratio (float) – Fraction of total sequences for test set

  • random_state (int) – Set random state for reproducible results

  • with_replacement (bool) – By default, scikit-learn GroupShuffleSplit samples objects with replacement. Disabling replacement removes

  • verbose (int) – Level of logging verbosity

Returns

data_split – Split X, y, groups

Return type

DataSplit
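The idea of keeping whole clusters together can be sketched in plain Python. This is not how group_train_val_test_split works internally (the parameter description above refers to scikit-learn's GroupShuffleSplit); group_split is a hypothetical helper, the cluster IDs are made up, and the resulting set sizes only approximate the requested ratios.

```python
import random
from collections import Counter


def group_split(groups, train_ratio=0.96, validation_ratio=0.02, seed=123):
    """Hypothetical helper: assign whole groups to 'train', 'val' or 'test'.

    groups holds one cluster ID per sample; the result holds one set
    name per sample, in the same order.
    """
    sizes = Counter(groups)
    unique = sorted(sizes)
    random.Random(seed).shuffle(unique)
    total = len(groups)
    assignment, seen = {}, 0
    for group in unique:
        # Decide by the cumulative share of samples placed so far.
        share = seen / total
        if share < train_ratio:
            assignment[group] = "train"
        elif share < train_ratio + validation_ratio:
            assignment[group] = "val"
        else:
            assignment[group] = "test"
        seen += sizes[group]
    return [assignment[group] for group in groups]


# Ten samples in six made-up clusters.
groups = ["c1", "c1", "c2", "c3", "c3", "c3", "c4", "c5", "c6", "c6"]
set_names = group_split(groups, train_ratio=0.6, validation_ratio=0.2)
```

Since the assignment is made per cluster rather than per sample, no cluster can end up in more than one set.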

deepnog.data.split.train_val_test_split(df: pandas.DataFrame, train_ratio: float = 0.96, validation_ratio: float = 0.02, test_ratio: float = 0.02, stratify: bool = True, shuffle: bool = True, random_state: int = 123, verbose: int = 0) → deepnog.data.split.DataSplit[source]

Create training/validation/test split for deepnog experiments.

Does not take UniRef clusters into account. Do not use for UniRef50/90 experiments.

Parameters
  • df (pandas DataFrame) – Must contain ‘string_id’, ‘eggnog_id’ columns

  • train_ratio (float) – Fraction of total sequences for training set

  • validation_ratio (float) – Fraction of total sequences for validation set

  • test_ratio (float) – Fraction of total sequences for test set

  • stratify (bool) – Stratify the splits according to the orthology labels

  • shuffle (bool) – Shuffle the sequences

  • random_state (int) – Set random state for reproducible results

  • verbose (int) – Level of logging verbosity

Returns

data_split – Split X, y, groups

Return type

DataSplit
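What stratification by orthology label means can be sketched in plain Python. This is an illustration only, not the train_val_test_split implementation: stratified_split is a hypothetical helper, the labels are made up, and it returns index lists rather than a DataSplit.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_ratio=0.96, validation_ratio=0.02, seed=123):
    """Hypothetical helper: split sample indices so that each label
    is represented in roughly the requested proportions."""
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)
    rng = random.Random(seed)
    split = {"train": [], "val": [], "test": []}
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        n_train = round(len(indices) * train_ratio)
        n_val = round(len(indices) * validation_ratio)
        split["train"] += indices[:n_train]
        split["val"] += indices[n_train:n_train + n_val]
        # Whatever remains for this label goes to the test set.
        split["test"] += indices[n_train + n_val:]
    return split


# Thirty samples with two made-up orthology labels.
labels = ["COG1"] * 10 + ["COG2"] * 20
split = stratified_split(labels, train_ratio=0.8, validation_ratio=0.1)
```

Each label is split separately, so rare labels are not left out of the validation or test set by chance, provided they have enough samples.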