deepnog.data package

deepnog.data.dataset module

Author: Lukas Gosch

Date: 2019-10-03

Description:

Dataset classes and helper functions for usage with deep network models written in PyTorch.

class deepnog.data.dataset.ProteinDataset(*args: Any, **kwargs: Any)[source]

Bases: torch.utils.data.Dataset

Protein dataset with sequences and labels for training.

If sequences and labels are provided as files rather than objects, all proteins are loaded from the input files and stored during construction. This causes some delay at start-up, but it allows the complete dataset to be truly shuffled during training.

Parameters
  • sequences (list, str, Path) – Protein sequences as list of Biopython Seq, or path to fasta file containing the sequences.

  • labels (DataFrame, str, Path, optional) – Protein orthologous group labels as DataFrame, or path to a CSV file containing such a dataframe. This is required for training, and ignored during inference. Must be in CSV format with header line and index column, that is, readable with pandas.read_csv(…, index_col=0). The labels are expected in a column named “eggnog_id” or in the last column, and sequence IDs in a column “protein_id”.

  • f_format (str, optional) – File format in which to expect the protein sequences. Must be supported by Biopython’s Bio.SeqIO class.

  • label_encoder (LabelEncoder, optional) – The label encoder maps str class names to numerical labels. Provide a label encoder during validation.

  • verbose (int, optional) – Control verbosity of logging.

class deepnog.data.dataset.ProteinIterableDataset(*args: Any, **kwargs: Any)[source]

Bases: torch.utils.data.IterableDataset

Protein dataset holding the proteins to classify.

Does not load and store all proteins from a given sequence file but only holds an iterator to the next sequence to load.

Thread safe class allowing for multi-worker loading of sequences from a given datafile.

Parameters
  • file (str) – Path to file storing the protein sequences.

  • labels_file (str, optional) – Path to file storing labels associated with the sequences. This is required for training, and ignored during inference. Must be in CSV format with header line and index column, that is, readable with pandas.read_csv(…, index_col=0). The labels are expected in a column named “eggnog_id” or in the last column.

  • f_format (str) – File format in which to expect the protein sequences. Must be supported by Biopython’s Bio.SeqIO class.

  • label_encoder (LabelEncoder, optional) – The label encoder maps str class names to numerical labels. Provide a label encoder during validation.

class deepnog.data.dataset.ProteinIterator(file_, labels: pandas.DataFrame, aa_vocab, f_format, n_skipped: Union[int, deepnog.utils.sync.SynchronizedCounter] = 0, num_workers=1, worker_id=0)[source]

Bases: object

Iterator allowing for multiprocess data loading of a sequence file.

ProteinIterator is a wrapper for the iterator returned by Biopython’s Bio.SeqIO class when parsing a sequence file. It defines a custom __next__() method to support single- and multi-process data loading.

In the single-process loading case, nothing special happens: the ProteinIterator iterates sequentially over the data file. At the end, it informs the main module about the number of skipped sequences (due to empty IDs) by setting a global variable in the main module.

In the multi-process loading case, each ProteinIterator loads a sequence and then skips the next few sequences dedicated to the other workers. To do so, each worker skips num_workers - 1 data samples on each call to __next__(). Furthermore, each worker skips worker_id data samples during initialization.

The ProteinIterator class also makes sure that a unique ID is set for each SeqRecord obtained from the data-iterator. This allows unambiguous handling of large protein datasets which may have duplicate IDs from merging multiple sources or may have no IDs at all. For easy and efficient sorting of batches of sequences as well as for direct access to the original IDs, the index is stored separately.

Parameters
  • file_ (str) – Path to sequence file, from which an iterator over the sequences will be created with Biopython’s Bio.SeqIO.parse() function.

  • labels (pd.DataFrame) – Dataframe storing labels associated with the sequences. This is required for training, and ignored during inference. Must contain ‘protein_id’ and ‘label_num’ columns providing identifiers and numerical labels.

  • aa_vocab (dict) – Amino-acid vocabulary mapping letters to integers

  • f_format (str) – File format in which to expect the protein sequences. Must be supported by Biopython’s Bio.SeqIO class.

  • num_workers (int) – Number of workers set in the DataLoader, or one if no workers are set. If greater than or equal to two, multi-process loading is used.

  • worker_id (int) – ID of worker this iterator belongs to
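
The skipping scheme described above can be sketched in plain Python. This is only an illustration of the striding logic, not the DeepNOG implementation: the function name strided_records is hypothetical, and ID assignment, label lookup, and the skipped-sequence counter are left out.

```python
from itertools import islice


def strided_records(records, num_workers=1, worker_id=0):
    """Yield the records that belong to one worker (hypothetical helper)."""
    it = iter(records)
    # Initialization: skip the first worker_id records.
    next(islice(it, worker_id, worker_id), None)
    while True:
        try:
            record = next(it)
        except StopIteration:
            return
        yield record
        # Skip the num_workers - 1 records dedicated to the other workers.
        next(islice(it, num_workers - 1, num_workers - 1), None)


records = list(range(10))
# Worker 0 of 3 sees records 0, 3, 6, 9; worker 1 sees 1, 4, 7; worker 2 sees 2, 5, 8.
per_worker = [list(strided_records(records, 3, w)) for w in range(3)]
```

Under these assumptions, every record is read by exactly one worker, and the result matches itertools.islice(records, worker_id, None, num_workers).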

class deepnog.data.dataset.ShuffledProteinIterableDataset(*args: Any, **kwargs: Any)[source]

Bases: deepnog.data.dataset.ProteinIterableDataset

Shuffle an iterable ProteinDataset by introducing a shuffle buffer.

Parameters
  • file (str) – Path to file storing the protein sequences.

  • labels_file (str, optional) – Path to file storing labels associated with the sequences. This is required for training, and ignored during inference. Must be in CSV format with header line and index column, that is, readable with pandas.read_csv(…, index_col=0). The labels are expected in a column named “eggnog_id” or in the last column.

  • f_format (str) – File format in which to expect the protein sequences. Must be supported by Biopython’s Bio.SeqIO class.

  • label_encoder (LabelEncoder, optional) – The label encoder maps str class names to numerical labels. Provide a label encoder during validation.

  • buffer_size (int) – How many objects will be buffered, i.e. are available to choose from.

References

Adapted from code by Sharvil Nanavati, see https://discuss.pytorch.org/t/how-to-shuffle-an-iterable-dataset/64130/5
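
The idea of a shuffle buffer can be sketched as a small generator. This is a generic illustration under stated assumptions, not the code used by ShuffledProteinIterableDataset: the name shuffle_buffer and the seed parameter are hypothetical.

```python
import random


def shuffle_buffer(iterable, buffer_size, seed=None):
    """Yield items in approximately shuffled order (hypothetical helper)."""
    rng = random.Random(seed)
    it = iter(iterable)
    buffer = []
    # Fill the buffer with the first buffer_size items.
    for item in it:
        buffer.append(item)
        if len(buffer) >= buffer_size:
            break
    # Emit a randomly chosen buffered item and replace it with the new one.
    for item in it:
        idx = rng.randrange(len(buffer))
        yield buffer[idx]
        buffer[idx] = item
    # The stream is exhausted: emit the remaining items in random order.
    rng.shuffle(buffer)
    yield from buffer


shuffled = list(shuffle_buffer(range(20), buffer_size=5, seed=0))
```

In this sketch, each emitted item is drawn from only buffer_size candidates, so a larger buffer gives a better approximation of a full shuffle at the cost of memory.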

deepnog.data.dataset.collate_sequences(batch: Union[List[deepnog.data.dataset.sequence], deepnog.data.dataset.sequence], zero_padding: bool = True, min_length: int = 36, random_padding: bool = False) deepnog.data.dataset.collated_sequences[source]

Collate and zero-pad encoded sequences.

Parameters
  • batch (namedtuple, or list of namedtuples) – Batch of protein sequences to classify stored as a namedtuple sequence.

  • zero_padding (bool) – Zero-pad protein sequences, that is, append zeros until every sequence is as long as the longest sequence in the batch. NOTE: currently unused. Zero-padding is always performed.

  • min_length (int, optional) – Zero-pad sequences to at least min_length. By default, this is set to 36, which is the largest kernel size in the default DeepNOG architecture.

  • random_padding (bool, optional) – Zero-pad sequences by both prepending and appending zeros, with the fraction on each side determined randomly. This may counter detrimental effects that could otherwise arise when short sequences always have long zero-tails.

Returns

batch – Input batch zero-padded and stored in namedtuple collated_sequences.

Return type

NamedTuple
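
The padding behaviour described by min_length and random_padding can be sketched with plain lists. This is an assumption-laden illustration: the helper pad_encoded and its rng parameter are hypothetical, and the real function works on namedtuples and returns collated_sequences rather than nested lists.

```python
import random


def pad_encoded(sequences, min_length=36, random_padding=False, rng=None):
    """Zero-pad integer-encoded sequences to a common length (hypothetical helper)."""
    rng = rng or random.Random()
    # Pad to the longest sequence in the batch, but at least to min_length.
    target = max([min_length, *(len(seq) for seq in sequences)])
    padded = []
    for seq in sequences:
        n_pad = target - len(seq)
        # With random padding, a random share of the zeros goes in front.
        n_front = rng.randint(0, n_pad) if random_padding else 0
        padded.append([0] * n_front + list(seq) + [0] * (n_pad - n_front))
    return padded


batch = pad_encoded([[1, 2, 3], [4, 5]], min_length=4)
# batch is [[1, 2, 3, 0], [4, 5, 0, 0]]
```

Because the vocabulary reserves 0 for padding, the padded positions cannot be confused with amino acids.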

deepnog.data.dataset.gen_amino_acid_vocab(alphabet=None)[source]

Create vocabulary for protein sequences.

A vocabulary is defined as a mapping from the amino-acid letters in the alphabet to numbers. As this mapping is aware of zero-padding, it maps the first letter in the alphabet to 1 instead of 0.

Parameters

alphabet (str) – Alphabet to use for vocabulary. If None, use ‘ACDEFGHIKLMNPQRSTVWYBXZJUO’ (equivalent to Biopython’s deprecated ExtendedIUPACProtein).

Returns

vocab – Mapping of amino acid characters to numbers.

Return type

dict
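
A minimal sketch of such a padding-aware vocabulary, assuming a simple one-to-one mapping of the listed letters. The names make_vocab and encode are hypothetical, and the actual function may handle further cases (for example lowercase letters) that are not shown here.

```python
DEFAULT_ALPHABET = "ACDEFGHIKLMNPQRSTVWYBXZJUO"


def make_vocab(alphabet=None):
    """Map each letter to a positive integer; 0 stays free for zero-padding."""
    if alphabet is None:
        alphabet = DEFAULT_ALPHABET
    return {letter: index for index, letter in enumerate(alphabet, start=1)}


def encode(sequence, vocab):
    """Translate a protein sequence string into a list of integers."""
    return [vocab[letter] for letter in sequence]


vocab = make_vocab()
encoded = encode("MKV", vocab)
```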

deepnog.data.split module

class deepnog.data.split.DataSplit(X_train: pandas.DataFrame, X_val: pandas.DataFrame, X_test: pandas.DataFrame, y_train: pandas.DataFrame, y_val: pandas.DataFrame, y_test: pandas.DataFrame, uniref_train: Optional[pandas.DataFrame], uniref_val: Optional[pandas.DataFrame], uniref_test: Optional[pandas.DataFrame])[source]

Bases: object

Class for returned data, labels, and groups after train/val/test split.

X_test: pandas.DataFrame
X_train: pandas.DataFrame
X_val: pandas.DataFrame
uniref_test: Optional[pandas.DataFrame]
uniref_train: Optional[pandas.DataFrame]
uniref_val: Optional[pandas.DataFrame]
y_test: pandas.DataFrame
y_train: pandas.DataFrame
y_val: pandas.DataFrame

deepnog.data.split.group_train_val_test_split(df: pandas.DataFrame, train_ratio: float = 0.96, validation_ratio: float = 0.02, test_ratio: float = 0.02, random_state: int = 123, with_replacement: bool = True, verbose: int = 0) deepnog.data.split.DataSplit[source]

Create training/validation/test split for deepnog experiments.

Takes UniRef cluster IDs into account, that is, makes sure that sequences from the same cluster go into the same set. In other words, training, validation, and test sets are disjoint in terms of UniRef clusters.

Parameters
  • df (pandas DataFrame) – Must contain ‘string_id’, ‘eggnog_id’, ‘uniref_id’ columns

  • train_ratio (float) – Fraction of total sequences for training set

  • validation_ratio (float) – Fraction of total sequences for validation set

  • test_ratio (float) – Fraction of total sequences for test set

  • random_state (int) – Set random state for reproducible results

  • with_replacement (bool) – By default, scikit-learn GroupShuffleSplit samples objects with replacement. Disabling replacement removes

  • verbose (int) – Level of logging verbosity

Returns

data_split – Split X, y, groups

Return type

DataSplit
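
The constraint that whole clusters go into a single set can be illustrated with a greedy stand-in. This sketch does not use scikit-learn's GroupShuffleSplit and is not the DeepNOG implementation: the function group_split, its return format, and the greedy filling strategy are assumptions made for illustration.

```python
import random
from collections import defaultdict


def group_split(group_ids, train_ratio=0.96, validation_ratio=0.02, seed=123):
    """Assign whole groups to train/val/test and return row indices per set."""
    members = defaultdict(list)
    for row, group in enumerate(group_ids):
        members[group].append(row)
    groups = sorted(members)
    random.Random(seed).shuffle(groups)
    n_total = len(group_ids)
    split = {"train": [], "val": [], "test": []}
    for group in groups:
        # Fill the training set first, then validation; the rest is test.
        if len(split["train"]) < train_ratio * n_total:
            target = "train"
        elif len(split["val"]) < validation_ratio * n_total:
            target = "val"
        else:
            target = "test"
        # All rows of a group always land in the same set.
        split[target].extend(members[group])
    return split


uniref_ids = ["c1", "c1", "c2", "c3", "c3", "c3", "c4", "c5", "c5", "c6"]
split = group_split(uniref_ids, train_ratio=0.6, validation_ratio=0.2)
```

Since groups are never divided, the resulting set sizes only approximate the requested ratios.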

deepnog.data.split.train_val_test_split(df: pandas.DataFrame, train_ratio: float = 0.96, validation_ratio: float = 0.02, test_ratio: float = 0.02, stratify: bool = True, shuffle: bool = True, random_state: int = 123, verbose: int = 0) deepnog.data.split.DataSplit[source]

Create training/validation/test split for deepnog experiments.

Does not take UniRef clusters into account. Do not use for UniRef50/90 experiments.

Parameters
  • df (pandas DataFrame) – Must contain ‘string_id’, ‘eggnog_id’ columns

  • train_ratio (float) – Fraction of total sequences for training set

  • validation_ratio (float) – Fraction of total sequences for validation set

  • test_ratio (float) – Fraction of total sequences for test set

  • stratify (bool) – Stratify the splits according to the orthology labels

  • shuffle (bool) – Shuffle the sequences

  • random_state (int) – Set random state for reproducible results

  • verbose (int) – Level of logging verbosity

Returns

data_split – Split X, y, groups

Return type

DataSplit