deepnog.learning package

deepnog.learning.inference module

Author: Roman Feldbauer

Date: 2020-02-19

Description:

Predict orthologous groups of protein sequences.

deepnog.learning.inference.predict(model, dataset, device='cpu', batch_size=16, num_workers=4, verbose=3)

Use model to predict zero-indexed labels of dataset.

Also communicates with the ProteinIterators used to load the data, in order to log how many sequences were skipped because their sequence IDs were empty.

Parameters
  • model (nn.Module) – Trained neural network model.

  • dataset (ProteinIterableDataset) – Data to predict protein families for.

  • device (str, torch.device) – Device on which the model is located.

  • batch_size (int) – Number of proteins forwarded through the neural network at once.

  • num_workers (int) – Number of workers for data loading.

  • verbose (int) – Define verbosity.

Returns

  • preds (torch.Tensor, shape (n_samples,)) – Stores the index of the output-node with the highest activation

  • confs (torch.Tensor, shape (n_samples,)) – Stores the confidence in the prediction

  • ids (list[str]) – Stores the (possibly empty) protein labels extracted from the data file.

  • indices (list[int]) – Stores the unique index of each sequence, which maps it to its position in the file.
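
The relationship between preds and confs can be illustrated with a minimal, standard-library sketch. It assumes that the confidence is the softmax probability of the winning output node; this page does not state how predict computes confidence, so treat that as an assumption. The helper name argmax_with_confidence and the class names in class_labels are hypothetical and not part of deepnog.

```python
import math


def argmax_with_confidence(activations):
    """Return (index, confidence) for one row of output activations.

    Assumption: confidence is the softmax probability of the winning node.
    """
    if not activations:
        raise ValueError("activations must not be empty")
    # Numerically stable softmax: subtract the maximum before exponentiating.
    peak = max(activations)
    exps = [math.exp(a - peak) for a in activations]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Zero-indexed position of the output node with the highest activation.
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs[best]


# Hypothetical class names; a zero-indexed prediction selects one of them.
class_labels = ["OG_A", "OG_B", "OG_C"]

index, confidence = argmax_with_confidence([0.1, 2.0, -1.0])
predicted_label = class_labels[index]
```

Because the labels are zero-indexed, a prediction can be used directly as a position in a list of class names, as in the last line of the sketch.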

deepnog.learning.training module

Author: Roman Feldbauer

Date: 2020-06-03

Description:

Training deep networks for protein orthologous group prediction.

deepnog.learning.training.fit(architecture, module, cls, training_sequences, validation_sequences, training_labels, validation_labels, *, data_loader_params: Optional[dict] = None, iterable_dataset: bool = False, n_epochs: int = 15, shuffle: bool = False, learning_rate: float = 0.01, learning_rate_params: Optional[dict] = None, l2_coeff: Optional[float] = None, optimizer_cls=torch.optim.Adam, device: Union[str, torch.device] = 'auto', tensorboard_dir: Union[None, str] = 'auto', log_interval: int = 100, random_seed: Optional[int] = None, save_each_epoch: bool = True, out_dir: Optional[pathlib.Path] = None, experiment_name: Optional[str] = None, config_file: Optional[str] = None, verbose: int = 2) → deepnog.learning.training.train_val_result

Perform training and validation of a given model, data, and hyperparameters.

Parameters
  • architecture (str) – Network architecture, must be available in deepnog/models

  • module (str) – Python module containing the network definition (inside deepnog/models/).

  • cls (str) – Python class name of the network (inside deepnog/models/{module}.py).

  • training_sequences (str, Path) – File with training set sequences

  • validation_sequences (str, Path) – File with validation set sequences

  • training_labels (str, Path) – File with class labels (orthologous groups) of training sequences

  • validation_labels (str, Path) – File with class labels (orthologous groups) of validation sequences

  • data_loader_params (dict) – Parameters passed to PyTorch DataLoader construction

  • iterable_dataset (bool, default False) – Use an iterable dataset that does not load all sequences in advance. This saves memory and avoids the loading delay at start, but it impairs random sampling and requires a shuffle buffer.

  • n_epochs (int) – Number of training passes over the complete training set

  • shuffle (bool) – Shuffle the training data. This does NOT shuffle the complete data set, which would require having all sequences in memory. Instead, sequences are drawn from a shuffle buffer (default size: 2**16).

  • learning_rate (float) – Learning rate, the central hyperparameter of deep network training. Values that are too high may lead to diverging solutions, while values that are too low result in slow learning.

  • learning_rate_params (dict) – Parameters passed to the learning rate scheduler.

  • l2_coeff (float) – If not None, regularize training by L2 norm of network weights

  • optimizer_cls – Class of PyTorch optimizer

  • device (str, torch.device) – Use either 'cpu' or 'cuda' (GPU) for training/validation.

  • tensorboard_dir (str) – Save online learning statistics for TensorBoard in this directory.

  • log_interval (int, optional) – Print intermediary results after log_interval minibatches

  • random_seed (int) – Set a random seed for numpy/pytorch for reproducible results.

  • save_each_epoch (bool) – Save the network after each training epoch

  • out_dir (Path) – Path to the output directory used to save models during training

  • experiment_name (str) – Prefix of model files saved during training

  • config_file (str) – Override path to config file, e.g. for custom models in unit tests

  • verbose (int) – Increasing levels of messages
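
The shuffle buffer mentioned for the shuffle and iterable_dataset parameters can be illustrated with a generic, minimal sketch. This is not taken from the deepnog source; the function name shuffle_buffer and its arguments are hypothetical. It shows why such a buffer only approximates a full shuffle: an item can only swap places with items that are in the buffer at the same time.

```python
import random


def shuffle_buffer(iterable, buffer_size, seed=None):
    """Yield the items of ``iterable`` in approximately random order.

    Only ``buffer_size`` items are held in memory at any time, so the
    stream never has to be loaded completely.
    """
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)
    buffer = []
    for item in iterable:
        if len(buffer) < buffer_size:
            # Fill phase: collect items until the buffer is full.
            buffer.append(item)
            continue
        # Buffer full: emit a random element and reuse its slot.
        slot = rng.randrange(buffer_size)
        yield buffer[slot]
        buffer[slot] = item
    # Input exhausted: emit the remaining items in random order.
    rng.shuffle(buffer)
    yield from buffer


# A small buffer over 100 items: every item appears exactly once.
shuffled = list(shuffle_buffer(range(100), buffer_size=8, seed=0))
```

With a buffer of size 8, an item can move forward in the stream by at most 8 positions, which is why larger buffers give a result closer to a complete shuffle.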

Returns

results

A namedtuple containing:
  • the trained deep network model

  • training dataset

  • evaluation statistics

  • the ground truth labels (y_true)

  • the predicted labels (y_pred).

Return type

namedtuple
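
As a rough illustration of how the returned y_true and y_pred might be consumed, the sketch below builds a stand-in namedtuple and computes accuracy from the two label lists. The type name ResultSketch, its field names, and the accuracy helper are assumptions made for this example; the actual field names of train_val_result are not listed on this page.

```python
from collections import namedtuple

# Hypothetical stand-in for the namedtuple returned by fit(); the field
# names below are assumptions, not the library's documented names.
ResultSketch = namedtuple(
    "ResultSketch",
    ["model", "training_dataset", "evaluation", "y_true", "y_pred"],
)


def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy of empty label lists")
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


# Placeholder values stand in for the real model, dataset and statistics.
result = ResultSketch(
    model=None,
    training_dataset=None,
    evaluation=None,
    y_true=[0, 2, 1, 1],
    y_pred=[0, 2, 2, 1],
)

validation_accuracy = accuracy(result.y_true, result.y_pred)
```

A namedtuple can be read by field name, as above, or unpacked positionally like an ordinary tuple.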