Concepts

DeepNOG is a command-line tool written in Python 3. It uses trained neural networks for extremely fast protein homology predictions. In its current form, it is based on a neural network architecture called DeepEncoding, trained on the root and bacterial levels of the eggNOG 5.0 database (Huerta-Cepas et al., 2019).

Input Data

As input, DeepNOG expects a protein sequence file, which may be gzip-compressed. It has been tested with the FASTA file format, but should in general support all file formats supported by the Bio.SeqIO module of Biopython. Following conventions in the bioinformatics field, protein sequences that have no ID in the input file are skipped and excluded from the prediction phase. Furthermore, if two or more sequences in the input file share the same ID, only the first one encountered is kept, and all others are discarded before the output file is created. The user is notified whenever such cases are encountered.
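The filtering rules described above can be illustrated with a minimal sketch. This is not DeepNOG's actual implementation (which relies on Bio.SeqIO); it uses only the Python standard library, and the function names open_maybe_gzip, read_fasta and filter_records are hypothetical names chosen for this illustration.

```python
import gzip
from typing import Dict, Iterable, Iterator, TextIO, Tuple


def open_maybe_gzip(path: str) -> TextIO:
    """Open a plain or gzip-compressed text file (detected via the gzip magic bytes)."""
    with open(path, "rb") as fh:
        is_gzip = fh.read(2) == b"\x1f\x8b"
    return gzip.open(path, "rt") if is_gzip else open(path, "rt")


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (sequence_id, sequence) pairs from FASTA-formatted lines.

    The ID is the first whitespace-delimited word of the header line;
    a header without any text yields an empty ID.
    """
    seq_id, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            header = line[1:].split()
            seq_id = header[0] if header else ""
            chunks = []
        elif seq_id is not None:
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)


def filter_records(records: Iterable[Tuple[str, str]]) -> Tuple[Dict[str, str], int, int]:
    """Skip records without an ID and keep only the first record per ID.

    Returns the kept records plus the number of skipped (no ID) and
    discarded (duplicate ID) records, so the caller can notify the user.
    """
    kept: Dict[str, str] = {}
    n_missing = n_duplicates = 0
    for seq_id, sequence in records:
        if not seq_id:
            n_missing += 1
        elif seq_id in kept:
            n_duplicates += 1
        else:
            kept[seq_id] = sequence
    return kept, n_missing, n_duplicates
```

For a file on disk, the three pieces could be combined as filter_records(read_fasta(open_maybe_gzip(path))); the two counters can then be reported to the user as a warning.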

Prediction Phase

In the prediction phase, DeepNOG loads a predefined neural network and the corresponding trained weights (by default, DeepEncoding trained on the bacterial level of eggNOG 5.0). It then performs the prediction by forwarding the input sequences through the network, running the calculations on either a CPU or a GPU. DeepNOG offers single-process data loading, intended for calculations on a single CPU core with as little overhead as possible. It also offers parallel multi-process data loading, intended for very fast GPU calculations: the aim is to supply the GPU with new data quickly enough after each forward pass that it never sits idle. In its default parametrization, DeepNOG is optimized for single-core CPU calculations. For details on how to best exploit GPUs for orthologous group predictions with DeepNOG, see the advanced Section 3.3 of this user’s guide.

Output Data

As output, DeepNOG generates a CSV file with three columns: the first holds the unique names or IDs of the proteins extracted from the sequence file, the second holds the predicted orthologous groups (OGs), and the third holds the confidence of the neural network in each prediction. Each neural network model can define a prediction confidence threshold below which the network's output is treated as a prediction that the input protein sequence is not associated with any orthologous group in the model. Therefore, if the highest prediction confidence over all OGs for a given input protein sequence is below this threshold, the prediction is left empty. By default, using DeepEncoding on eggNOG 5.0, the prediction confidence threshold is set to a strict 99%. The user can adjust this threshold.
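The thresholding rule can be sketched as follows. This is an illustration under stated assumptions, not DeepNOG's own code: the function names, the column headers, and the OG labels "OG_A" and "OG_B" are hypothetical placeholders. The sketch also assumes that the confidence is still written when the prediction is left empty.

```python
import csv
from typing import List, Sequence, TextIO, Tuple


def select_predictions(
    ids: Sequence[str],
    confidences: Sequence[Sequence[float]],
    og_labels: Sequence[str],
    threshold: float = 0.99,
) -> List[Tuple[str, str, float]]:
    """Pick the most confident OG per sequence; leave it empty below the threshold."""
    rows = []
    for seq_id, scores in zip(ids, confidences):
        # Index of the OG with the highest confidence for this sequence.
        best = max(range(len(scores)), key=scores.__getitem__)
        confidence = scores[best]
        # Assumption: a confidence exactly at the threshold still counts as a prediction.
        prediction = og_labels[best] if confidence >= threshold else ""
        rows.append((seq_id, prediction, confidence))
    return rows


def write_csv(rows: Sequence[Tuple[str, str, float]], handle: TextIO) -> None:
    """Write the three-column table; the header names are hypothetical."""
    writer = csv.writer(handle)
    writer.writerow(["sequence_id", "prediction", "confidence"])
    writer.writerows(rows)
```

With the default threshold of 0.99, a sequence whose best score is 0.995 keeps its OG label, whereas one whose best score is 0.6 gets an empty prediction field.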