DeepNOG: fast and accurate protein orthologous group assignment¶
deepnog is a Python package for assigning proteins to orthologous groups
(eggNOG 5) with deep networks.
Installation¶
Installation from PyPI¶
The current release of deepnog can be installed from PyPI:
pip install deepnog
For typical use cases, and quick start, this is sufficient. Note that this guide assumes Linux, and may work under macOS. We currently don’t provide detailed instructions for Windows.
Dependencies and model files¶
All package dependencies of deepnog are automatically installed
by pip. We also require model files (= network parameters/weights),
which are too large for GitHub/PyPI. These are hosted on separate servers,
and downloaded automatically by deepnog, when required. By default,
models are cached in $HOME/deepnog_data/.
You can change this path by setting the DEEPNOG_DATA environment variable. Choose among the following options to do so:
# Set data path temporarily
DEEPNOG_DATA="/custom/path/models" deepnog infer sequences.fa
# Set data path for the current shell
export DEEPNOG_DATA="/custom/path/models"
# Set data path permanently
printf "\n# Set path to DeepNOG models\nexport DEEPNOG_DATA=\"/custom/path/models\"\n" >> ~/.bashrc
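The lookup described above can be sketched in Python. This is a minimal illustration, not deepnog's own code: the helper name resolve_data_home and the precedence order (explicit path, then DEEPNOG_DATA, then the default folder) are assumptions made for the example.

```python
import os
from pathlib import Path


def resolve_data_home(explicit_path=None, environ=os.environ):
    """Return the model cache directory (hypothetical helper, not deepnog's code)."""
    # Assumed precedence: explicit argument > DEEPNOG_DATA > default folder
    if explicit_path is not None:
        return Path(explicit_path).expanduser()
    from_env = environ.get("DEEPNOG_DATA")
    if from_env:
        return Path(from_env).expanduser()
    # Documented default: a folder named deepnog_data in the user's home
    return Path.home() / "deepnog_data"


print(resolve_data_home(environ={"DEEPNOG_DATA": "/custom/path/models"}))
print(resolve_data_home(environ={}))
```

Passing the environment as a parameter keeps the sketch easy to try out without changing the real shell environment.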
Installation from source¶
You can always grab the latest version of deepnog directly from GitHub:
cd install_dir
git clone git@github.com:univieCUBE/deepnog.git
cd deepnog
pip install -e .
This is the recommended approach, if you want to contribute
to the development of deepnog.
Quick Start Example¶
The following example shows how to predict protein orthologous groups
with the command line interface of deepnog as well as using the Python API.
Please make sure you have installed deepnog (installation instructions).
CLI Usage Example¶
Using deepnog from the command line is the simplest and preferred way of interacting with the
deepnog package.
Here, we assign orthologous groups (OGs) of proteins using a model trained on the eggNOG 5.0 database and using only bacterial OGs (default settings), and redirect the output from stdout to a file:
deepnog infer input.fa > assignments.csv
Alternatively, the output file and other settings can be specified explicitly like so:
deepnog infer input.fa --out prediction.csv -db eggNOG5 --tax 2
For a detailed explanation of flags and further settings, please consult the User Guide.
Note that deepnog masks predictions below a certain confidence threshold.
The default confidence threshold of 0.99 baked into the model
can be overridden from the command line interface:
deepnog infer input.fa --confidence-threshold 0.8 > assignments.csv
The output comma-separated values (CSV) file assignments.csv then looks something like:
sequence_id,prediction,confidence
WP_004995615.1,COG5449,0.99999964
WP_004995619.1,COG0340,1.0
WP_004995637.1,COG4285,1.0
WP_004995655.1,COG4118,1.0
WP_004995678.1,COG0184,1.0
WP_004995684.1,COG1137,1.0
WP_004995690.1,COG0208,1.0
WP_004995697.1,,
WP_004995703.1,COG0190,1.0
The file contains a single line for each protein in the input sequence file, and the following fields:
sequence_id, the name of the input protein sequence.
prediction, the name of the predicted protein OG. Empty if masked by confidence threshold.
confidence, the confidence value (0-1 inclusive) that deepnog ascribes to this assignment. Empty if masked by confidence threshold.
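For downstream processing, the CSV output can be read with the Python standard library. The following sketch uses a few rows copied from the example above; the function name split_assignments is made up for illustration and is not part of deepnog.

```python
import csv
import io

# Sample rows copied from the example output above
SAMPLE = """sequence_id,prediction,confidence
WP_004995615.1,COG5449,0.99999964
WP_004995619.1,COG0340,1.0
WP_004995697.1,,
"""


def split_assignments(handle):
    """Separate assigned proteins from those masked by the confidence threshold."""
    assigned, masked = {}, []
    for row in csv.DictReader(handle):
        if row["prediction"]:
            assigned[row["sequence_id"]] = (row["prediction"], float(row["confidence"]))
        else:
            # Empty prediction and confidence fields mark a masked assignment
            masked.append(row["sequence_id"])
    return assigned, masked


assigned, masked = split_assignments(io.StringIO(SAMPLE))
print(assigned)
print(masked)
```

To read a real result file, pass an open file handle (for example open("assignments.csv", newline="")) instead of the in-memory sample.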
API Example Usage¶
import torch
from deepnog.data import ProteinIterableDataset
from deepnog.inference import predict
from deepnog.utils import create_df, get_config, get_weights_path, load_nn, set_device
PROTEIN_FILE = '/path/to/file.faa'
DATABASE = 'eggNOG5'
TAX = 2
ARCH = 'deepnog'
CONF_THRESH = 0.99
# load protein sequence file into a ProteinIterableDataset
dataset = ProteinIterableDataset(PROTEIN_FILE, f_format='fasta')
# Construct path to the saved parameters of the deepnog model.
weights_path = get_weights_path(
    database=DATABASE,
    level=str(TAX),
    architecture=ARCH,
)
# Set up device for prediction
device = set_device('auto')
torch.set_num_threads(1)
# Load neural network parameters
model_dict = torch.load(weights_path, map_location=device)
# Lookup where to find the chosen network
config = get_config()
module = config['architecture'][ARCH]['module']
cls = config['architecture'][ARCH]['class']
# Load neural network model and class names
model = load_nn((module, cls), model_dict, device=device)
class_labels = model_dict['classes']
# perform prediction
preds, confs, ids, indices = predict(
    model=model,
    dataset=dataset,
    device=device,
    batch_size=1,
    num_workers=1,
    verbose=3
)
# Construct results (a pandas DataFrame)
df = create_df(
    class_labels=class_labels,
    preds=preds,
    confs=confs,
    ids=ids,
    indices=indices,
    threshold=CONF_THRESH,
    verbose=3
)
User Guide¶
Concepts¶
DeepNOG is a command line tool written in Python 3. It uses trained neural networks for extremely fast protein homology predictions. In its current installation, it is based upon a neural network architecture called DeepEncoding trained on the root and bacterial level of the eggNOG 5.0 database (Huerta-Cepas et al. (2019)).
Input Data¶
As input, DeepNOG expects a protein sequence file, which may be gzipped. It is tested with the FASTA file format, but should in general support all file formats supported by the Bio.SeqIO module of Biopython. Following conventions in bioinformatics, protein sequences without an ID in the input file are skipped and not used in the prediction phase. Furthermore, if two sequences in the input file share the same ID, only the sequence encountered first is kept, and all others are discarded before the output file is created. The user is notified if such cases are encountered.
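The ID handling described above can be sketched as follows. This is a simplified illustration under stated assumptions: records are plain (sequence_id, sequence) pairs rather than Biopython objects, the function name filter_records is hypothetical, and both rules are applied in a single pass, whereas deepnog itself removes duplicates before the output file is created.

```python
def filter_records(records):
    """Drop records without an ID and keep only the first record per ID.

    `records` is an iterable of (sequence_id, sequence) pairs; this is a
    hypothetical stand-in for the parsed input, not deepnog's internal type.
    """
    kept, seen = [], set()
    n_missing_id = n_duplicates = 0
    for seq_id, sequence in records:
        if not seq_id:
            n_missing_id += 1      # no ID: skipped entirely
        elif seq_id in seen:
            n_duplicates += 1      # later duplicate: discarded
        else:
            seen.add(seq_id)
            kept.append((seq_id, sequence))
    return kept, n_missing_id, n_duplicates


records = [("p1", "MKV"), ("", "MAA"), ("p2", "MLL"), ("p1", "MGG")]
kept, n_missing_id, n_duplicates = filter_records(records)
print(kept, n_missing_id, n_duplicates)
```

The two counters correspond to the cases the user is notified about.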
Prediction Phase¶
In the prediction phase, DeepNOG loads a predefined neural network and the corresponding trained weights (by default, DeepEncoding trained on the bacterial level of eggNOG 5.0). It then performs the prediction by forwarding the input sequences through the network, with calculations running on either a CPU or a GPU. DeepNOG offers single-process data loading, aimed at calculations on a single CPU core with as little overhead as possible. Additionally, it offers parallel multiprocess data loading aimed at very fast GPU calculations: the next batch must be ready as soon as the previous forward pass finishes, so that the GPU does not sit idle. In its default parametrization, DeepNOG is optimized for single-core CPU calculations. For details on how to best exploit GPUs for orthologous group predictions using DeepNOG, the reader is referred to the advanced Section 3.3 in this user’s guide.
Output Data¶
As output, DeepNOG generates a CSV file with three columns: the first contains the unique names or IDs of the proteins extracted from the sequence file, the second the OG predictions, and the third the confidence of the neural network in each prediction. Each neural network model can define a prediction confidence threshold below which the network is treated as having predicted that the input protein sequence is not associated with any orthologous group in the model. Therefore, if the highest prediction confidence for any OG for a given input protein sequence is below this threshold, the prediction is left empty. By default, using DeepEncoding on eggNOG 5.0, the prediction confidence threshold is set to a strict 99%. This threshold can be adjusted by the user.
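The masking rule can be illustrated with a few lines of plain Python. This is a sketch only: the function name mask_low_confidence and the example confidences are made up, and deepnog applies the rule to tensors and data frames rather than lists.

```python
def mask_low_confidence(labels, confidences, threshold=0.99):
    """Blank out predictions whose confidence is below `threshold`."""
    masked = []
    for label, conf in zip(labels, confidences):
        if conf >= threshold:
            masked.append((label, conf))
        else:
            masked.append(("", ""))  # empty fields, as in the CSV output
    return masked


rows = mask_low_confidence(
    ["COG5449", "COG0340", "COG0190"],
    [0.99999964, 0.42, 0.99],
)
print(rows)
```

A prediction whose confidence equals the threshold is kept; only values strictly below it are masked.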
Deepnog CLI Documentation¶
Invocation:
deepnog SEQUENCE_FILE [options] > predictions.csv
Basic Commands¶
These options may be commonly tuned for a basic invocation for OG prediction.
positional arguments:
SEQUENCE_FILE File containing protein sequences for classification.
optional arguments:
-h, --help show this help message and exit
--version show program's version number and exit
-o FILE, --out FILE Store orthologous group predictions to outputfile. Per
default, write predictions to stdout. (default: None)
-c FLOAT, --confidence-threshold FLOAT
The confidence value below which predictions are
masked by deepnog. By default, apply the confidence
threshold saved in the model if one exists, and else
do not apply a confidence threshold. (default: None)
Advanced Commands¶
These options are unlikely to require manual tuning for the average user.
--verbose INT Define verbosity of DeepNOGs output written to stdout
or stderr. 0 only writes errors to stderr which cause
DeepNOG to abort and exit. 1 also writes warnings to
stderr if e.g. a protein without an ID was found and
skipped. 2 additionally writes general progress
messages to stdout. 3 includes a dynamic progress bar
of the prediction stage using tqdm. (default: 3)
-ff STR, --fformat STR
File format of protein sequences. Must be supported by
Biopythons Bio.SeqIO class. (default: fasta)
-of {csv,tsv,legacy}, --outformat {csv,tsv,legacy}
The file format of the output file produced by
deepnog. (default: csv)
-d {auto,cpu,gpu}, --device {auto,cpu,gpu}
Define device for calculating protein sequence
classification. Auto chooses GPU if available,
otherwise CPU. (default: auto)
-db {eggNOG5}, --database {eggNOG5}
Orthologous group/family database to use. (default:
eggNOG5)
-t {1,2}, --tax {1,2}
Taxonomic level to use in specified database
(1 = root, 2 = bacteria) (default: 2)
-nw INT, --num-workers INT
Number of subprocesses (workers) to use for data
loading. Set to a value <= 0 to use single-process
data loading. Note: Only use multi-process data
loading if you are calculating on a gpu (otherwise
inefficient)! (default: 0)
-a {deepencoding}, --architecture {deepencoding}
Network architecture to use for classification.
(default: deepencoding)
-w FILE, --weights FILE
Custom weights file path (optional) (default: None)
-bs INT, --batch-size INT
Batch size used for prediction. Defines how many
sequences should be forwarded in the network at once.
With a batch size of one, the protein sequences are
sequentially classified by the network without
leveraging parallelism. Higher batch-sizes than the
default can speed up the prediction significantly if
on a gpu. On a cpu, however, they can be slower than
smaller ones due to the increased average sequence
length in the convolution step due to zero-padding
every sequence in each batch. (default: 1)
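The effect of zero-padding on the amount of work per batch can be estimated with a small calculation. The sketch below assumes that every sequence in a batch is padded to the length of the longest sequence in that batch (a common scheme, assumed here rather than taken from the deepnog source); the protein lengths are made up.

```python
def padded_positions(lengths, batch_size):
    """Total sequence positions processed when each batch is zero-padded
    to the length of its longest member."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total += max(batch) * len(batch)
    return total


lengths = [100, 400, 120, 900]  # made-up protein lengths
print(padded_positions(lengths, batch_size=1))  # 1520: no padding at all
print(padded_positions(lengths, batch_size=2))  # 400*2 + 900*2 = 2600
print(padded_positions(lengths, batch_size=4))  # 900*4 = 3600
```

On a GPU the extra positions are processed in parallel and cost little, while on a CPU they translate more directly into additional computation.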
API Documentation¶
This is the API documentation for deepnog.
DeepNOG¶
DeepNOG is a deep learning based command line tool to infer orthologous groups of given protein sequences. It provides a number of models for eggNOG orthologous groups, and allows training additional models for eggNOG or other databases.
deepnog.dataset¶
deepnog.inference¶
deepnog.io¶
deepnog.sync¶
deepnog.tests¶
Description:
Helpers for deepnog tests.
- Including:
test data
test network weights (parameters)
some helper functions
Individual tests are located within the respective deepnog subpackages.
deepnog.utils¶
class deepnog.utils.SynchronizedCounter(init: int = 0)
Bases: object
A multiprocessing-safe counter.
- Parameters
init (int, optional) – Counter starts at init (default: 0)
increment(n=1)
Obtain a lock before incrementing, since += isn’t atomic.
- Parameters
n (int, optional) – Increment counter by n (default: 1)
increment_and_get_value(n=1) → int
Obtain a lock before incrementing, since += isn’t atomic.
- Parameters
n (int, optional) – Increment counter by n (default: 1)
property value
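A counter with this interface could be built on top of multiprocessing.Value, as in the following sketch. The class name LockedCounter is hypothetical and the code is an illustration of the locking pattern, not the implementation shipped with deepnog.

```python
import multiprocessing


class LockedCounter:
    """Minimal multiprocessing-safe counter (illustrative, not deepnog's code)."""

    def __init__(self, init=0):
        # A shared C int; Value carries its own recursive lock by default
        self._val = multiprocessing.Value("i", init)

    def increment(self, n=1):
        # Read-modify-write must happen under the lock
        with self._val.get_lock():
            self._val.value += n

    def increment_and_get_value(self, n=1):
        # Increment and read in one critical section
        with self._val.get_lock():
            self._val.value += n
            return self._val.value

    @property
    def value(self):
        with self._val.get_lock():
            return self._val.value


counter = LockedCounter(init=5)
counter.increment()
print(counter.increment_and_get_value(3))
```

Returning the value inside the same critical section avoids a race in which another process increments the counter between the update and the read.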
deepnog.utils.count_parameters(model, tunable_only: bool = True) → int
Count the number of parameters in the given model.
- Parameters
model (torch.nn.Module) – PyTorch model (deep network)
tunable_only (bool, optional) – Count only tunable network parameters
References
https://stackoverflow.com/questions/49201236/check-the-total-number-of-parameters-in-a-pytorch-model
deepnog.utils.create_df(class_labels: list, preds: torch.Tensor, confs: torch.Tensor, ids: List[str], indices: List[int], threshold: float = None)
Creates one dataframe storing all relevant prediction information.
The rows in the returned dataframe have the same order as the original sequences in the data file. First column of the dataframe represents the position of the sequence in the datafile.
- Parameters
class_labels (list) – Store class name corresponding to an output node of the network.
preds (torch.Tensor, shape (n_samples,)) – Stores the index of the output-node with the highest activation
confs (torch.Tensor, shape (n_samples,)) – Stores the confidence in the prediction
ids (list[str]) – Stores the (possibly empty) protein labels extracted from data file.
indices (list[int]) – Stores the unique indices of sequences mapping to their position in the file
threshold (float) – If given, prediction labels and confidences are set to ‘’ if confidence in prediction is not at least threshold.
- Returns
df – Stores prediction information about the input protein sequences. Duplicates (defined by their sequence_id) have been removed from df.
- Return type
pandas.DataFrame
deepnog.utils.get_config(config_file: Optional[Union[pathlib.Path, str]] = None) → Dict
Get a config dictionary.
If no file is provided, look in the DEEPNOG_CONFIG env variable for the path. If this fails, load a default config file (lacking any user customization).
This contains the available models (databases, levels). Additional config may be added in future releases.
deepnog.utils.get_data_home(data_home: str = None, verbose: int = 0) → pathlib.Path
Return the path of the deepnog data dir.
This folder is used for large files that cannot go into the Python package on PyPI etc. For example, the network parameters (weights) files may be larger than 100MiB. By default the data dir is set to a folder named ‘deepnog_data’ in the user home folder. Alternatively, it can be set by the ‘DEEPNOG_DATA’ environment variable or programmatically by giving an explicit folder path. If the folder does not already exist, it is automatically created.
- Parameters
data_home (str | None) – The path to deepnog data dir.
verbose (int) – Log or not.
Notes
Adapted from SKLEARN_DATAHOME.
deepnog.utils.get_logger(initname: str = 'deepnog', verbose: int = 0) → logging.Logger
This function provides a nicely formatted logger.
- Parameters
initname (str) – The name of the logger to show up in log.
verbose (int) – Increasing levels of verbosity
References
Shamelessly stolen from phenotrex
deepnog.utils.get_weights_path(database: str, level: str, architecture: str, data_home: str = None, download_if_missing: bool = True, verbose: int = 0) → pathlib.Path
Get path to neural network weights.
This is a path on local storage. If the corresponding files are not present, download from remote storage. The default remote URL can be overridden by setting the environment variable DEEPNOG_REMOTE.
- Parameters
database (str) – The orthologous groups database. Example: eggNOG5
level (str) – The taxonomic level within the database. Example: 2 (for bacteria)
architecture (str) – Network architecture. Example: deepencoding
data_home (str, optional) – Specify another download and cache folder for the weights. By default all deepnog data is stored in ‘$HOME/deepnog_data’ subfolders.
download_if_missing (boolean, default=True) – If False, raise an IOError if the data is not locally available instead of trying to download the data from the source site.
verbose (int) – Log or not
- Returns
weights_path – Path to file of network weights
- Return type
Path
deepnog.utils.load_nn(architecture: Union[str, Sequence[str]], model_dict: dict = None, phase: str = 'eval', device: Union[torch.device, str] = 'cpu', verbose: int = 0)
Import NN architecture and set loaded parameters.
- Parameters
architecture (str or list-like of two str) – If single string: name of neural network module and class to import. E.g. ‘deepencoding’ will load deepnog.models.deepencoding.deepencoding. Otherwise, separate module and class name of deep network to import. E.g. (‘deepthought’, ‘DeepNettigkeit’) will load deepnog.models.deepthought.DeepNettigkeit.
model_dict (dict, optional) – Dictionary holding all parameters and hyper-parameters of the model. Required during inference, optional for training.
phase (['train', 'infer', 'eval']) – Set network in training or inference=evaluation mode with effects on storing gradients, dropout, etc.
device ([str, torch.device]) – Device to load the model into.
verbose (int) – Increasingly verbose logging
- Returns
model – Neural network object of type architecture with parameters loaded from model_dict and moved to device.
- Return type
torch.nn.Module
deepnog.utils.parse(p: pathlib.Path, fformat: str = 'fasta', alphabet=None) → Iterator
Parse a possibly compressed sequence file.
- Parameters
p (Path or str) – Path to sequence file
fformat (str) – File format supported by Bio.SeqIO.parse, e.g. “fasta”
alphabet (any) – Pass alphabet to SeqIO.parse
- Returns
it – The SeqIO.parse iterator yielding SeqRecords
- Return type
Iterator
deepnog.utils.set_device(device: Union[str, torch.device]) → torch.device
Set device (CPU/GPU) depending on user choice and availability.
- Parameters
device ([str, torch.device]) – Device set by user as an argument to DeepNOG call.
- Returns
device – Object containing the device type to be used for prediction calculations.
- Return type
torch.device
Deepnog New Models and Architectures¶
deepnog is developed with extensibility in mind,
and allows plugging in additional models (for different taxonomic levels,
or different orthology databases).
It also supports addition of new network architectures.
In order to register a new network architecture, we recommend an editable installation with pip, as described in Installation from Source.
Register new network architectures¶
Create a Python module under deepnog/models/<my_network.py>.
You can use deepencoding.py as a template.
When the new module is in place, also edit deepnog/client.py
to expose the new network to the user:
parser.add_argument("-a", "--architecture",
                    default='deepencoding',
                    choices=['deepencoding',
                             'my_network',
                             ],
                    help="Network architecture to use for classification.")
Register new models¶
New models for additional taxonomic levels or even different orthology databases
using existing network architectures must be placed in the deepnog data directory
as specified by the DEEPNOG_DATA environment variable (default: $HOME/deepnog_data).
The directory looks like this:
| deepnog_data
| ├── eggNOG5
| │   ├── 1
| │   │   └── deepencoding.pth
| │   └── 2
| │       └── deepencoding.pth
| ├── ...
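Read as a path, the layout above follows the pattern data directory / database / taxonomic level / architecture file. The helper below is a hypothetical illustration of that pattern and is not part of the deepnog API.

```python
from pathlib import Path


def expected_weights_file(data_home, database, level, architecture):
    """Build <data_home>/<database>/<level>/<architecture>.pth (hypothetical helper)."""
    return Path(data_home) / database / str(level) / f"{architecture}.pth"


path = expected_weights_file("deepnog_data", "eggNOG5", 2, "deepencoding")
print(path.as_posix())
```

A new model file is found only if its database, level, and architecture names match the values passed on the command line.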
In order to add a root level model for “MyOrthologyDB”, we place the serialized PyTorch parameters like this:
| deepnog_data
| ├── eggNOG5
| │   ├── 1
| │   │   └── deepencoding.pth
| │   └── 2
| │       └── deepencoding.pth
| ├── MyOrthologyDB
| │   └── 1
| │       └── deepencoding.pth
| ├── ...
Assuming we want to compare deepencoding to my_network,
we add the trained network parameters like this:
| deepnog_data
| ├── eggNOG5
| │   ├── 1
| │   │   ├── deepencoding.pth
| │   │   └── my_network.pth
| │   └── 2
| │       ├── deepencoding.pth
| │       └── my_network.pth
| ├── MyOrthologyDB
| │   └── 1
| │       ├── deepencoding.pth
| │       └── my_network.pth
| ├── ...
Finally, expose the new models to the user by modifying deepnog/client.py
again. The relevant section is argument parsing for --database,
and --tax, if new taxonomic levels are introduced as well.
parser.add_argument("-db", "--database",
                    type=str,
                    choices=['eggNOG5',
                             'MyOrthologyDB',
                             ],
                    default='eggNOG5',
                    help="Orthologous group/family database to use.")
parser.add_argument("-t", "--tax",
                    type=int,
                    choices=[1, 2, ],
                    default=2,
                    help="Taxonomic level to use in specified database "
                         "(1 = root, 2 = bacteria)")
Training scripts¶
Please note that no training scripts are currently shipped with
deepnog, as the scripts used for the available models rely on in-house
software libraries and databases, such as SIMAP2.
We are currently working on standalone training scripts
that will be made public as soon as possible.
Contributing¶
deepnog is free open source software. Contributions from the community are highly appreciated. Even small contributions improve the software’s quality.
Even if you are not familiar with programming languages and tools, you may contribute by filing bugs or any problems as a GitHub issue.
Git and branching model¶
We use git for version control (VCS), as do most projects nowadays. If you are not familiar with git, there are lots of tutorials on GitHub Guide. All the important basics are covered in the GitHub Git handbook.
Development of deepnog (mostly) follows this git branching model. We currently use one main branch: master. For any changes, a new branch should be created. This includes new features, noncritical or critical bug fixes, etc.
Workflow¶
In case of large changes to the software, please first get in contact with the authors for coordination, for example by filing an issue. If you want to fix small issues (typos in the docs, obvious errors, etc.) you can - of course - directly submit a pull request (PR).
- Create a fork of deepnog in your GitHub account.
Simply click the “Fork” button on https://github.com/univieCUBE/deepnog.
- Clone your fork on your computer.
$ git clone git@github.com:YOUR-ACCOUNT-GOES-HERE/deepnog.git && cd deepnog
- Add remote upstream.
$ git remote add upstream git@github.com:univieCUBE/deepnog.git
- Create feature/bugfix branch.
$ git checkout -b bugfix123 master
- Implement feature/fix bug/fix typo/…
Happy coding!
- Create a commit with a meaningful message
If you only modified existing files, simply
$ git commit -am "descriptive message what this commit does (in present tense) here"
- Push to GitHub
e.g. $ git push origin bugfix123
- Create pull request (PR)
Git will likely provide a link to directly create the PR. If not, click “New pull request” on your fork on GitHub.
- Wait…
Several devops checks will be performed automatically (e.g. continuous integration (CI) with Travis, AppVeyor).
The authors will get in contact with you, and may ask for changes.
- Respond to code review.
If there were issues with continuous integration, or the authors asked for changes, please create a new commit locally, and simply push again to GitHub as you did before. The PR will be updated automatically.
- Maintainers merge PR, when all issues are resolved.
Thanks a lot for your contribution!
Code style and further guidelines¶
Please make sure all code complies with PEP 8.
All code should be documented sufficiently (functions, classes, etc. must have docstrings with general description, parameters, ideally return values, raised exceptions, notes, etc.)
Documentation style is NumPy format.
New code must be covered by unit tests using pytest.
If you fix a bug, please provide regression tests (fail on old code, succeed on new code).
It may be helpful to install deepnog in editable mode for development. When you have already cloned the package, switch into the corresponding directory, and
pip install -e .
(don’t omit the trailing period). This way, any changes to the code are reflected immediately. That is, you don’t need to install the package each and every time, when you make changes while developing code.
Testing¶
In deepnog, we aim for high code coverage. As of Feb 2020, more than 95% of all code lines are visited at least once when running the complete test suite. This is primarily to ensure:
correctness of the code (to some extent) and
maintainability (new changes don’t break old code).
When creating a new PR, ideally all new code is covered by tests. Sometimes, this is not feasible, or only feasible with large effort. Pull requests will likely be accepted if the overall code coverage at least does not decrease.
Unit tests are automatically performed for each PR using CI tools online. This may take some time, however. To run the tests locally, you need pytest installed. From the deepnog directory, call
pytest deepnog/
to run all the tests. You can also restrict the tests to the subpackage you are working on, down to single tests. For example
pytest deepnog/tests/test_dataset.py --showlocals -v
only runs tests about datasets.
In order to check code coverage locally, you need the pytest-cov plugin.
pytest deepnog --cov=deepnog
Changelog¶
1.1.0 - 2020-02-28¶
Added¶
EggNOG5 root (tax 1) prediction
Changed¶
Package structure changed for higher modularity. This will require changes in downstream usages.
Remove network weights from the repository, because files are too large for GitHub and/or PyPI. deepnog automatically downloads these from CUBE servers, and caches them locally.
More robust inter-process communication in data loading
Fixes¶
Fix error on very short amino acid sequences
Fix error on unrecognized symbols in sequences (stop codons etc.)
Fix multiprocess data loading from gzipped files
Fix type mismatch in deepencoding embedding layer (Windows only)
Getting started¶
Get started with deepnog in a breeze.
Find how to install the package and
see all core functionality applied in a single
quick start example.
User Guide¶
The User Guide introduces the main concepts
of deepnog.
It also contains complete CLI
and API documentation of the package.
Development¶
There are several possibilities to contribute to this free open source software. We highly appreciate all input from the community, be it bug reports or code contributions.
Source code, issue tracking, discussion, and continuous integration appear on our GitHub page.
What’s new¶
To see what’s new in the latest version of deepnog,
have a look at the changelog.