doubletrouble

R packages
bioconductor
evolutionary genomics
comparative genomics
genome duplication

An R/Bioconductor package to identify and classify duplicated genes from whole-genome protein sequence data

Author

Fabrício Almeida-Silva

Published

December 27, 2022

Summary

The major goal of doubletrouble is to identify duplicated genes from whole-genome protein sequences and classify them based on their modes of duplication. The simplest classification scheme has two duplication modes:

  1. Whole-genome duplication (WGD);
  2. Small-scale duplication (SSD)

For a more detailed view of the duplication modes, users can also choose to split SSD into subcategories, so the available duplication modes will be:

  1. Whole-genome duplication (WGD);
  2. Tandem duplication (TD);
  3. Proximal duplication (PD);
  4. Transposed duplication (TRD);
  5. Dispersed duplication (DD).

Besides classifying gene pairs, users can also classify genes, so that each gene is assigned a unique mode of duplication.

Users can also calculate substitution rates per substitution site (i.e., Ka and Ks) from duplicate pairs, find peaks in Ks distributions with Gaussian Mixture Models (GMMs), and classify gene pairs into age groups based on Ks peaks.