DeepNOG: fast and accurate protein orthologous group prediction¶
deepnog is a Python package for predicting protein orthologous groups with deep networks.
Installation¶
Installation from PyPI¶
The current release of deepnog can be installed from PyPI:
pip install deepnog
For typical use cases and the quick start, this is sufficient.
Dependencies and model files¶
All package dependencies of deepnog are automatically installed by pip. We also require model files (network parameters/weights), which are too large for GitHub/PyPI. These are hosted on separate servers and downloaded automatically by deepnog when required. By default, models are cached in $HOME/deepnog_data/.
You can change this path by setting the DEEPNOG_DATA environment variable:
DEEPNOG_DATA="/custom/path/models" deepnog sequences.fa
Installation from source¶
You can always grab the latest version of deepnog directly from GitHub:
cd install_dir
git clone git@github.com:VarIr/deepnog.git
cd deepnog
pip install -e .
This is the recommended approach if you want to contribute to the development of deepnog.
Quick start example¶
Users of deepnog typically want to
…
…
…
The following example shows all these steps for an example dataset.
Please make sure you have installed deepnog (installation instructions).
First, we load the dataset and inspect its size.
from deepnog import predict
...
deepnog input.fa --out prediction.csv -db eggNOG5 --tax 2
User guide¶
Welcome to deepnog!
Here we describe the core functionality of the package (…), and provide several usage examples.
API Documentation¶
This is the API documentation for deepnog.
DeepNOG¶
DeepNOG is a deep-learning-based command line tool that predicts the protein families of given protein sequences using pretrained neural networks.
The main module of this tool is defined in deepnog.py. For details about the usage of the tool, the reader is referred to the documentation as well as deepnog.py.
deepnog.client¶
Author: Lukas Gosch
Date: 2019-10-18
Usage: python client.py --help
Description:
Provides the deepnog command line client and entry point for users. DeepNOG predicts protein families/orthologous groups of given protein sequences with deep learning.
File formats supported: FASTA (preferred). DeepNOG supports protein sequences stored in all file formats listed in https://biopython.org/wiki/SeqIO but is tested for the FASTA file format only.
Architectures supported:
Databases supported:
- eggNOG 5.0, taxonomic level 1 (root)
- eggNOG 5.0, taxonomic level 2 (bacteria)
deepnog.dataset¶
Author: Lukas Gosch
Date: 2019-10-03
Description:
Dataset classes and helper functions for usage with deep network models written in PyTorch.
class deepnog.dataset.ProteinDataset(file, f_format='fasta')¶
Bases: torch.utils.data.dataset.IterableDataset
Protein dataset holding the proteins to classify.
Does not load and store all proteins from a given sequence file but only holds an iterator to the next sequence to load.
Thread-safe class allowing for multi-worker loading of sequences from a given data file.
- Parameters
file (str) – Path to file storing the protein sequences.
f_format (str) – File format in which to expect the protein sequences. Must be supported by Biopython’s Bio.SeqIO class.
class deepnog.dataset.ProteinIterator(file_, aa_vocab, f_format, n_skipped: Union[int, deepnog.sync.SynchronizedCounter] = 0, num_workers=1, worker_id=0)¶
Bases: object
Iterator allowing for multiprocess data loading of a sequence file.
ProteinIterator is a wrapper for the iterator returned by Biopython’s Bio.SeqIO class when parsing a sequence file. It specifies a custom __next__() method to support single- and multi-process data loading.
In the single-process loading case, nothing special happens: the ProteinIterator sequentially iterates over the data file. At the end, it informs the main module about the number of skipped sequences (due to empty IDs) by setting a global variable in the main module.
In the multi-process loading case, each ProteinIterator loads a sequence and then skips the next few sequences dedicated to the other workers. This works by each worker skipping num_workers - 1 data samples for each call to __next__(). Furthermore, each worker skips worker_id data samples during initialization. At the end of the worker’s lifetime, it sends the number of skipped sequences back to the main process through a pipe that the main process created.
The ProteinIterator class also makes sure that a unique ID is set for each SeqRecord obtained from the data-iterator. This allows unambiguous handling of large protein datasets which may have duplicate IDs from merging multiple sources or may have no IDs at all. For easy and efficient sorting of batches of sequences as well as for direct access to the original IDs, the index is stored separately.
- Parameters
file (str) – Path to sequence file, from which an iterator over the sequences will be created with Biopython’s Bio.SeqIO.parse() function.
aa_vocab (dict) – Amino-acid vocabulary mapping letters to integers
f_format (str) – File format in which to expect the protein sequences. Must be supported by Biopython’s Bio.SeqIO class.
num_workers (int) – Number of workers set in DataLoader, or one if no workers are set. If greater than or equal to two, multi-process loading is used.
worker_id (int) – ID of worker this iterator belongs to
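The skipping scheme described above can be illustrated with a minimal sketch. It is not the package's implementation: `strided` is a hypothetical helper that only mimics the documented behaviour (skip worker_id records at initialization, then num_workers - 1 records after every record returned) on a plain Python iterable instead of a Bio.SeqIO iterator.

```python
from itertools import islice


def strided(records, num_workers=1, worker_id=0):
    """Yield every num_workers-th record, starting at offset worker_id.

    Hypothetical helper for illustration only; not part of deepnog.
    """
    iterator = iter(records)
    # Skip the records that belong to workers with a lower ID.
    next(islice(iterator, worker_id, worker_id), None)
    for record in iterator:
        yield record
        # Skip the records dedicated to the other workers.
        next(islice(iterator, num_workers - 1, num_workers - 1), None)


sequences = ["seq0", "seq1", "seq2", "seq3", "seq4", "seq5", "seq6"]
# One shard per worker; together the shards cover every record exactly once.
shards = [list(strided(sequences, num_workers=3, worker_id=w)) for w in range(3)]
```

With three workers, each record ends up in exactly one shard, so no sequence is classified twice and none is lost.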
deepnog.dataset.collate_sequences(batch, zero_padding=True)¶
Collate and zero-pad encoded sequences.
- Parameters
batch (list[namedtuple] or namedtuple) – Batch of protein sequences to classify stored as a namedtuple-class sequence (see ProteinDataset).
zero_padding (bool) – If True, zero-pads protein sequences by appending zeros until every sequence is as long as the longest sequence in the batch. If False, raises NotImplementedError.
- Returns
batch – Input batch zero-padded and stored in namedtuple-class collated_sequences.
- Return type
namedtuple
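As a rough illustration of zero-padding, the following sketch pads integer-encoded sequences to the length of the longest one in the batch. It uses plain Python lists rather than tensors, and the `Encoded` record and the `collate` function are hypothetical stand-ins; only the field names of the result (indices, ids, sequences) are taken from the documented collated_sequences class.

```python
from collections import namedtuple

# Field names follow the documented collated_sequences namedtuple.
Collated = namedtuple("Collated", ["indices", "ids", "sequences"])
# Hypothetical per-sequence record; the real one is defined by ProteinDataset.
Encoded = namedtuple("Encoded", ["index", "id", "encoded"])


def collate(batch):
    """Right-pad integer-encoded sequences with zeros to a common length."""
    max_len = max(len(item.encoded) for item in batch)
    padded = [
        list(item.encoded) + [0] * (max_len - len(item.encoded))
        for item in batch
    ]
    return Collated(
        indices=[item.index for item in batch],
        ids=[item.id for item in batch],
        sequences=padded,
    )


batch = [Encoded(1, "protA", [11, 1, 4]), Encoded(2, "protB", [11, 8])]
result = collate(batch)
```

Because zero is reserved for padding, the amino-acid vocabulary has to start at 1, which is what gen_amino_acid_vocab documents.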
class deepnog.dataset.collated_sequences(indices, ids, sequences)¶
Bases: tuple
count(value, /)¶ Return number of occurrences of value.
property ids¶ Alias for field number 1
index(value, start=0, stop=9223372036854775807, /)¶ Return first index of value. Raises ValueError if the value is not present.
property indices¶ Alias for field number 0
property sequences¶ Alias for field number 2
deepnog.dataset.consume(iterator, n=None)¶
Advance the iterator n steps ahead. If n is None, consume entirely.
Function from the itertools recipes in the official Python 3.7.4 docs.
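For reference, the recipe from the itertools documentation looks as follows. This is the published standard-library recipe; whether deepnog copies it verbatim is an assumption.

```python
from collections import deque
from itertools import islice


def consume(iterator, n=None):
    """Advance the iterator n steps ahead. If n is None, consume entirely."""
    if n is None:
        # Feed the entire iterator into a zero-length deque.
        deque(iterator, maxlen=0)
    else:
        # Advance to the empty slice starting at position n.
        next(islice(iterator, n, n), None)


numbers = iter(range(10))
consume(numbers, 3)          # drops 0, 1, 2
following = next(numbers)    # the next element after the skipped ones
```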
deepnog.dataset.gen_amino_acid_vocab(alphabet=None)¶
Create vocabulary for protein sequences.
A vocabulary is defined as a mapping from the amino-acid letters in the alphabet to numbers. As this mapping is aware of zero-padding, it maps the first letter in the alphabet to 1 instead of 0.
- Parameters
alphabet (str) – Alphabet to use for vocabulary. If None, use ‘ACDEFGHIKLMNPQRSTVWYBXZJUO’ (equivalent to Biopython’s deprecated ExtendedIUPACProtein).
- Returns
vocab – Mapping of amino acid characters to numbers.
- Return type
dict
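A minimal sketch of such a vocabulary is shown below. `make_vocab` is a hypothetical name, and the sketch only covers the documented behaviour (letters mapped to consecutive integers starting at 1, so that 0 stays free for padding); any additional handling in the real function, such as lowercase letters, is not shown.

```python
DEFAULT_ALPHABET = "ACDEFGHIKLMNPQRSTVWYBXZJUO"


def make_vocab(alphabet=None):
    """Map each letter of the alphabet to an integer, starting at 1."""
    if alphabet is None:
        alphabet = DEFAULT_ALPHABET
    # Index 0 is reserved for zero-padding, so enumeration starts at 1.
    return {letter: number for number, letter in enumerate(alphabet, start=1)}


vocab = make_vocab()
encoded = [vocab[amino_acid] for amino_acid in "MKV"]
```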
deepnog.inference¶
Author: Roman Feldbauer
Date: 2020-02-19
Description:
Predict orthologous groups of protein sequences.
deepnog.inference.load_nn(architecture, model_dict, device='cpu')¶
Import NN architecture and set loaded parameters.
- Parameters
architecture (str) – Name of neural network module and class to import.
model_dict (dict) – Dictionary holding all parameters and hyper-parameters of the model.
device ([str, torch.device]) – Device to load the model into.
- Returns
model – Neural network object of type architecture with parameters loaded from model_dict and moved to device.
- Return type
torch.nn.Module
deepnog.inference.predict(model, dataset, device='cpu', batch_size=16, num_workers=4, verbose=3)¶
Use model to predict zero-indexed labels of dataset.
Also handles communication with ProteinIterators used to load data to log how many sequences have been skipped due to having empty sequence ids.
- Parameters
model (nn.Module) – Trained neural network model.
dataset (ProteinDataset) – Data to predict protein families for.
device ([str, torch.device]) – Device of model.
batch_size (int) – Forward batch_size proteins through neural network at once.
num_workers (int) – Number of workers for data loading.
verbose (int) – Define verbosity.
- Returns
preds (torch.Tensor, shape (n_samples,)) – Stores the index of the output-node with the highest activation
confs (torch.Tensor, shape (n_samples,)) – Stores the confidence in the prediction
ids (list[str]) – Stores the (possibly empty) protein labels extracted from the data file.
indices (list[int]) – Stores the unique indices of sequences mapping to their position in the file
deepnog.io¶
Author: Roman Feldbauer
Date: 2020-02-19
Description:
Input/output helper functions
deepnog.io.create_df(class_labels, preds, confs, ids, indices, threshold=None, verbose=3)¶
Creates one dataframe storing all relevant prediction information.
The rows in the returned dataframe have the same order as the original sequences in the data file. First column of the dataframe represents the position of the sequence in the datafile.
- Parameters
class_labels (list) – Store class name corresponding to an output node of the network.
preds (torch.Tensor, shape (n_samples,)) – Stores the index of the output-node with the highest activation
confs (torch.Tensor, shape (n_samples,)) – Stores the confidence in the prediction
ids (list[str]) – Stores the (possibly empty) protein labels extracted from the data file.
indices (list[int]) – Stores the unique indices of sequences mapping to their position in the file
threshold (int) – If given, prediction labels and confidences are set to ‘’ if confidence in prediction is not at least threshold.
verbose (int) – If greater than 0, outputs a warning if duplicates are detected.
- Returns
df – Stores prediction information about the input protein sequences. Duplicates (defined by their sequence_id) have been removed from df.
- Return type
pandas.DataFrame
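The thresholding and ordering described above can be sketched without pandas. `build_rows` and the column names other than sequence_id are hypothetical, the class labels are made-up placeholders, and duplicate removal is deliberately left out because its exact rule is not specified here.

```python
def build_rows(class_labels, preds, confs, ids, indices, threshold=None):
    """Combine predictions into rows ordered by position in the data file.

    Illustrative stand-in for the dataframe construction; not deepnog code.
    """
    rows = []
    for pred, conf, seq_id, index in zip(preds, confs, ids, indices):
        label = class_labels[pred]
        # Blank out label and confidence if the confidence is below threshold.
        if threshold is not None and conf < threshold:
            label, conf = "", ""
        rows.append({"index": index, "sequence_id": seq_id,
                     "prediction": label, "confidence": conf})
    # Restore the original order of the sequences in the data file.
    rows.sort(key=lambda row: row["index"])
    return rows


rows = build_rows(
    class_labels=["GROUP_A", "GROUP_B"],   # placeholder class names
    preds=[1, 0, 1],
    confs=[0.9, 0.4, 0.75],
    ids=["b", "a", "c"],
    indices=[2, 1, 3],
    threshold=0.5,
)
```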
deepnog.io.get_data_home(data_home: str = None) → pathlib.Path¶
Return the path of the deepnog data dir.
This folder is used for large files that cannot go into the Python package on PyPI etc. For example, the network parameters (weights) files may be larger than 100MiB. By default the data dir is set to a folder named ‘deepnog_data’ in the user home folder. Alternatively, it can be set by the ‘DEEPNOG_DATA’ environment variable or programmatically by giving an explicit folder path. If the folder does not already exist, it is automatically created.
- Parameters
data_home (str | None) – The path to deepnog data dir.
Notes
Adapted from SKLEARN_DATAHOME.
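The lookup described above might be sketched as follows. `resolve_data_home` is a hypothetical re-implementation, and the precedence it uses (explicit argument first, then the DEEPNOG_DATA environment variable, then ~/deepnog_data) is an assumption modelled on the scikit-learn pattern mentioned in the notes.

```python
import os
import tempfile
from pathlib import Path


def resolve_data_home(data_home=None):
    """Return the data directory as a Path, creating it if necessary.

    Hypothetical sketch; the precedence of the three sources is assumed.
    """
    if data_home is None:
        data_home = os.environ.get("DEEPNOG_DATA", Path.home() / "deepnog_data")
    path = Path(data_home).expanduser()
    # Create the folder (and any missing parents) on first use.
    path.mkdir(parents=True, exist_ok=True)
    return path


# Usage with an explicit folder inside a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    models_dir = resolve_data_home(Path(tmp) / "models")
    created = models_dir.is_dir()
```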
deepnog.io.get_weights_path(database: str, level: str, architecture: str, data_home=None, download_if_missing=True) → pathlib.Path¶
Get path to neural network weights.
This is a path on local storage. If the corresponding files are not present, download from remote storage. The default remote URL can be overridden by setting the environment variable DEEPNOG_REMOTE.
- Parameters
database (str) – The orthologous groups database. Example: eggNOG5
level (str) – The taxonomic level within the database. Example: 2 (for bacteria)
architecture (str) – Network architecture. Example: deepencoding
data_home (string, optional) – Specify another download and cache folder for the weights. By default all deepnog data is stored in ‘~/deepnog_data’ subfolders.
download_if_missing (boolean, default=True) – If False, raise an IOError if the data is not locally available instead of trying to download the data from the source site.
- Returns
weights_path – Path to file of network weights
- Return type
Path
deepnog.sync¶
Author: Roman Feldbauer
Date: 2020-02-19
Description:
Parallel processing helpers
class deepnog.sync.SynchronizedCounter(init: int = 0)¶
Bases: object
A multiprocessing-safe counter.
- Parameters
init (int, optional) – Counter starts at init (default: 0)
increment(n=1)¶
Obtain a lock before incrementing, since += isn’t atomic.
- Parameters
n (int, optional) – Increment counter by n (default: 1)
increment_and_get_value(n=1) → int¶
Obtain a lock before incrementing, since += isn’t atomic.
- Parameters
n (int, optional) – Increment counter by n (default: 1)
property value¶
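One common way to build such a counter is a shared multiprocessing.Value guarded by its own lock. The sketch below follows the documented interface (increment, increment_and_get_value, value), but `SketchCounter` is a hypothetical class and the use of multiprocessing.Value is an assumption about the approach, not a description of the actual implementation.

```python
import multiprocessing as mp


class SketchCounter:
    """Process-safe integer counter backed by shared memory (illustrative)."""

    def __init__(self, init=0):
        # 'i' is a signed C int; Value() attaches a lock by default.
        self._shared = mp.Value("i", init)

    def increment(self, n=1):
        # += is a read-modify-write sequence, so it must happen under the lock.
        with self._shared.get_lock():
            self._shared.value += n

    def increment_and_get_value(self, n=1):
        # Increment and read in one critical section to avoid a stale read.
        with self._shared.get_lock():
            self._shared.value += n
            return self._shared.value

    @property
    def value(self):
        with self._shared.get_lock():
            return self._shared.value


counter = SketchCounter(init=5)
counter.increment()
total = counter.increment_and_get_value(4)
```

Returning the value inside the same locked section is the reason for a combined increment_and_get_value method: reading the counter in a separate call could observe increments made by other processes in between.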
Contributing¶
deepnog is free open source software. Contributions from the community are highly appreciated. Even small contributions improve the software’s quality.
Even if you are not familiar with programming languages and tools, you may contribute by filing bugs or any problems as a GitHub issue.
Git and branching model¶
We use git for version control (VCS), as do most projects nowadays. If you are not familiar with git, there are lots of tutorials on GitHub Guide. All the important basics are covered in the GitHub Git handbook.
Development of deepnog (mostly) follows this git branching model. We currently use one main branch: master. For any changes, a new branch should be created. This includes new features, noncritical or critical bug fixes, etc.
Workflow¶
In case of large changes to the software, please first get in contact with the authors for coordination, for example by filing an issue. If you want to fix small issues (typos in the docs, obvious errors, etc.) you can - of course - directly submit a pull request (PR).
- Create a fork of deepnog in your GitHub account.
Simply click the “Fork” button on https://github.com/VarIr/deepnog.
- Clone your fork on your computer.
$ git clone git@github.com:YOUR-ACCOUNT-GOES-HERE/deepnog.git && cd deepnog
- Add remote upstream.
$ git remote add upstream git@github.com:VarIr/deepnog.git
- Create feature/bugfix branch.
$ git checkout -b bugfix123 master
- Implement feature/fix bug/fix typo/…
Happy coding!
- Create a commit with a meaningful message
If you only modified existing files, simply
$ git commit -am "descriptive message what this commit does (in present tense) here"
- Push to GitHub
e.g. $ git push origin featureXYZ
- Create pull request (PR)
Git will likely provide a link to directly create the PR. If not, click “New pull request” on your fork on GitHub.
- Wait…
Several devops checks will be performed automatically (e.g. continuous integration (CI) with Travis, AppVeyor).
The authors will get in contact with you, and may ask for changes.
- Respond to code review.
If there were issues with continuous integration, or the authors asked for changes, please create a new commit locally, and simply push again to GitHub as you did before. The PR will be updated automatically.
- Maintainers merge PR, when all issues are resolved.
Thanks a lot for your contribution!
Code style and further guidelines¶
Please make sure all code complies with PEP 8
All code should be documented sufficiently (functions, classes, etc. must have docstrings with general description, parameters, ideally return values, raised exceptions, notes, etc.)
Documentation style is NumPy format.
New code must be covered by unit tests using pytest.
If you fix a bug, please provide regression tests (fail on old code, succeed on new code).
It may be helpful to install deepnog in editable mode for development. When you have already cloned the package, switch into the corresponding directory, and
pip install -e .
(don’t omit the trailing period). This way, any changes to the code are reflected immediately. That is, you don’t need to reinstall the package every time you change the code during development.
Testing¶
In deepnog, we aim for high code coverage. As of Feb 2020, more than 95% of all code lines are visited at least once when running the complete test suite. This is primarily to ensure:
correctness of the code (to some extent) and
maintainability (new changes don’t break old code).
When you create a new PR, ideally all new code is covered by tests. Sometimes this is not feasible, or only with great effort. Pull requests will likely be accepted if the overall code coverage at least does not decrease.
Unit tests are automatically performed for each PR using CI tools online. This may take some time, however. To run the tests locally, you need pytest installed. From the deepnog directory, call
pytest deepnog/
to run all the tests. You can also restrict the tests to the subpackage you are working on, down to single tests. For example
pytest deepnog/tests/test_dataset.py --showlocals -v
only runs tests about datasets.
In order to check code coverage locally, you need the pytest-cov plugin.
pytest deepnog --cov=deepnog
Changelog¶
1.1.0 - 2020-02-28¶
Added¶
EggNOG5 root (tax 1) prediction
Changed¶
Package structure changed for higher modularity. This will require changes in downstream usages.
Remove network weights from the repository, because files are too large for GitHub and/or PyPI. deepnog automatically downloads these from CUBE servers, and caches them locally.
More robust inter-process communication in data loading
Fixes¶
Fix error on very short amino acid sequences
Fix error on unrecognized symbols in sequences (stop codons etc.)
Fix multiprocess data loading from gzipped files
Fix type mismatch in deepencoding embedding layer (Windows only)
Getting started¶
Get started with deepnog in a breeze. Find out how to install the package and see all core functionality applied in a single quick start example.
User Guide¶
The User Guide introduces the main concepts of deepnog.
API Documentation¶
The API Documentation provides detailed information on the implemented methods. This information includes method descriptions, parameters, references, examples, etc. Find all the information about specific modules and functions of deepnog in this section.
Development¶
There are several possibilities to contribute to this free open source software. We highly appreciate all input from the community, be it bug reports or code contributions.
Source code, issue tracking, discussion, and continuous integration appear on our GitHub page.
What’s new¶
To see what’s new in the latest version of deepnog, have a look at the changelog.