deepnog.dataset

Author: Lukas Gosch

Date: 2019-10-03

Description:

Dataset classes and helper functions for usage with deep network models written in PyTorch.

class deepnog.dataset.ProteinDataset(file, f_format='fasta')

Bases: torch.utils.data.dataset.IterableDataset

Protein dataset holding the proteins to classify.

Does not load and store all proteins from a given sequence file but only holds an iterator to the next sequence to load.

Thread-safe class that allows multi-worker loading of sequences from a given data file.

Parameters
  • file (str) – Path to file storing the protein sequences.

  • f_format (str) – File format in which to expect the protein sequences. Must be supported by Biopython’s Bio.SeqIO module.

class deepnog.dataset.ProteinIterator(file_, aa_vocab, f_format, n_skipped: Union[int, deepnog.sync.SynchronizedCounter] = 0, num_workers=1, worker_id=0)

Bases: object

Iterator allowing for multiprocess data loading of a sequence file.

ProteinIterator is a wrapper for the iterator returned by Biopython’s Bio.SeqIO module when parsing a sequence file. It defines a custom __next__() method to support single- and multi-process data loading.

In the single-process loading case, the ProteinIterator simply iterates over the data file sequentially. At the end, it informs the main module about the number of skipped sequences (due to empty ids) by setting a global variable in the main module.

In the multi-process loading case, each ProteinIterator loads a sequence and then skips the sequences assigned to the other workers: every call to __next__() skips num_workers - 1 data samples, and each worker additionally skips worker_id data samples during initialization. At the end of a worker’s lifetime, it sends the number of skipped sequences back to the main process through a pipe created by the main process.
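The skipping scheme for the multi-process case can be illustrated with a small, standard-library-only sketch. It is not deepnog's implementation: the function name shard and the string records are hypothetical stand-ins, and the counting of skipped sequences is left out.

```python
from itertools import islice


def shard(records, num_workers=1, worker_id=0):
    """Yield every num_workers-th record, starting at offset worker_id.

    Hypothetical helper for illustration; not part of deepnog.
    """
    iterator = iter(records)
    # Initialization: skip the worker_id records that belong to other workers.
    next(islice(iterator, worker_id, worker_id), None)
    for record in iterator:
        yield record
        # After each record, skip the num_workers - 1 records of the other workers.
        next(islice(iterator, num_workers - 1, num_workers - 1), None)


records = ["seq0", "seq1", "seq2", "seq3", "seq4", "seq5", "seq6"]
shards = [list(shard(records, num_workers=3, worker_id=w)) for w in range(3)]
for worker_id, worker_records in enumerate(shards):
    print(worker_id, worker_records)
```

With this scheme every record is read by exactly one worker, at the cost of each worker parsing the whole file.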

The ProteinIterator class also makes sure that a unique ID is set for each SeqRecord obtained from the data-iterator. This allows unambiguous handling of large protein datasets which may have duplicate IDs from merging multiple sources or may have no IDs at all. For easy and efficient sorting of batches of sequences as well as for direct access to the original IDs, the index is stored separately.
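Keeping a unique index alongside the original ID might look like the sketch below. The record type IndexedRecord, its field names, and the use of a running integer as the unique key are assumptions made for this example; the actual class may derive and store its index differently.

```python
from collections import namedtuple

# Hypothetical record type; the field names are chosen for this example only.
IndexedRecord = namedtuple("IndexedRecord", ["index", "id", "sequence"])


def with_unique_index(records):
    """Pair each (id, sequence) record with a running integer index."""
    for index, (original_id, sequence) in enumerate(records):
        yield IndexedRecord(index, original_id, sequence)


# Two records share the ID "P1"; the index still tells them apart.
raw = [("P1", "MKV"), ("P1", "MKL"), ("P2", "ACD")]
indexed = list(with_unique_index(raw))

# The original IDs stay directly accessible through the index.
id_by_index = {record.index: record.id for record in indexed}
```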

Parameters
  • file_ (str) – Path to sequence file, from which an iterator over the sequences will be created with Biopython’s Bio.SeqIO.parse() function.

  • aa_vocab (dict) – Amino-acid vocabulary mapping letters to integers

  • f_format (str) – File format in which to expect the protein sequences. Must be supported by Biopython’s Bio.SeqIO module.

  • num_workers (int) – Number of workers set in DataLoader, or one if no workers are set. If greater than or equal to two, the multi-process loading case applies.

  • worker_id (int) – ID of worker this iterator belongs to

deepnog.dataset.collate_sequences(batch, zero_padding=True)

Collate and zero-pad encoded sequences.

Parameters
  • batch (list[namedtuple] or namedtuple) – Batch of protein sequences to classify stored as a namedtuple-class sequence (see ProteinDataset).

  • zero_padding (bool) – If True, zero-pads the protein sequences by appending zeros until every sequence is as long as the longest sequence in the batch. If False, raises NotImplementedError.

Returns

batch – Input batch zero-padded and stored in namedtuple-class collated_sequences.

Return type

namedtuple
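The padding behaviour described above can be sketched with plain Python lists. This is an illustration only: the input record type sequence and its field names are hypothetical, collate_sketch is not the library function, and the real function presumably works on tensors rather than lists.

```python
from collections import namedtuple

# Stand-ins for the library's namedtuples. The output fields follow the
# documented signature collated_sequences(indices, ids, sequences); the
# input record type and its field names are hypothetical.
sequence = namedtuple("sequence", ["index", "id", "encoded"])
collated_sequences = namedtuple("collated_sequences", ["indices", "ids", "sequences"])


def collate_sketch(batch, zero_padding=True):
    """Zero-pad all encoded sequences to the length of the longest one."""
    if not zero_padding:
        raise NotImplementedError("Only zero-padding is sketched here.")
    if isinstance(batch, sequence):
        batch = [batch]  # accept a single record as well as a list
    max_len = max(len(item.encoded) for item in batch)
    padded = [
        list(item.encoded) + [0] * (max_len - len(item.encoded))
        for item in batch
    ]
    return collated_sequences(
        indices=[item.index for item in batch],
        ids=[item.id for item in batch],
        sequences=padded,
    )


batch = [sequence(0, "P1", [11, 9, 18]), sequence(1, "P2", [1, 2])]
result = collate_sketch(batch)
print(result.sequences)
```

Padding with zeros is unambiguous only because the vocabulary never maps a letter to 0.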

class deepnog.dataset.collated_sequences(indices, ids, sequences)

Bases: tuple

count(value, /)

Return number of occurrences of value.

property ids

Alias for field number 1

index(value, start=0, stop=9223372036854775807, /)

Return first index of value.

Raises ValueError if the value is not present.

property indices

Alias for field number 0

property sequences

Alias for field number 2
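Because collated_sequences is a namedtuple, its fields can be read by name or by position, and it inherits count() and index() from tuple. A brief sketch, using a locally defined stand-in with the same field order rather than the class imported from deepnog:

```python
from collections import namedtuple

# Local stand-in with the documented field order; values are made up.
collated_sequences = namedtuple("collated_sequences", ["indices", "ids", "sequences"])
batch = collated_sequences(indices=[0, 1], ids=["P1", "P2"], sequences=[[1, 2], [3, 0]])

# Access by name and by position refer to the same objects.
same_indices = batch.indices is batch[0]
same_ids = batch.ids is batch[1]
same_sequences = batch.sequences is batch[2]

# Tuple unpacking follows the field order.
indices, ids, sequences = batch

# index() and count() compare against whole field values.
position_of_ids = batch.index(["P1", "P2"])
```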

deepnog.dataset.consume(iterator, n=None)

Advance the iterator n steps ahead. If n is None, consume it entirely.

Function taken from the Itertools Recipes in the official Python 3.7.4 documentation.
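For reference, the recipe from the itertools documentation is reproduced below. That deepnog's copy matches it line for line is an assumption.

```python
from collections import deque
from itertools import islice


def consume(iterator, n=None):
    """Advance the iterator n steps ahead. If n is None, consume entirely."""
    if n is None:
        # Feed the entire iterator into a zero-length deque.
        deque(iterator, maxlen=0)
    else:
        # Advance to the empty slice starting at position n.
        next(islice(iterator, n, n), None)


numbers = iter(range(10))
consume(numbers, 3)        # discards 0, 1, 2
after_skip = next(numbers)
consume(numbers)           # discards everything that is left
exhausted = next(numbers, None) is None
```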

deepnog.dataset.gen_amino_acid_vocab(alphabet=None)

Create vocabulary for protein sequences.

A vocabulary is defined as a mapping from the amino-acid letters in the alphabet to numbers. As this mapping is aware of zero-padding, it maps the first letter in the alphabet to 1 instead of 0.

Parameters

alphabet (str) – Alphabet to use for vocabulary. If None, use ‘ACDEFGHIKLMNPQRSTVWYBXZJUO’ (equivalent to Biopython’s deprecated ExtendedIUPACProtein alphabet).

Returns

vocab – Mapping of amino acid characters to numbers.

Return type

dict
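A minimal sketch of such a mapping is shown below. The name gen_vocab_sketch is hypothetical, and the sketch covers only what is described here: letters are numbered from 1 in alphabet order, so 0 remains free for padding. Whether the library function also handles lower-case letters is not stated and is not modelled.

```python
def gen_vocab_sketch(alphabet=None):
    """Map each letter of the alphabet to its 1-based position."""
    if alphabet is None:
        alphabet = "ACDEFGHIKLMNPQRSTVWYBXZJUO"
    # Start at 1 so that 0 stays reserved for zero-padding.
    return {letter: number for number, letter in enumerate(alphabet, start=1)}


vocab = gen_vocab_sketch()
encoded = [vocab[amino_acid] for amino_acid in "MKV"]
print(encoded)  # [11, 9, 18]
```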