R packages
evolutionary genomics
comparative genomics
genome duplication

An R/Bioconductor package to identify and classify duplicated genes from whole-genome protein sequence data


Fabrício Almeida-Silva


December 27, 2022


The major goal of doubletrouble is to identify duplicated genes from whole-genome protein sequences and classify them based on their modes of duplication. The simplest classification scheme has two duplication modes:

  1. Whole-genome duplication (WGD);
  2. Small-scale duplication (SSD).

For a more detailed view, users can split SSD into subcategories, which gives five duplication modes:

  1. Whole-genome duplication (WGD);
  2. Tandem duplication (TD);
  3. Proximal duplication (PD);
  4. Transposed duplication (TRD);
  5. Dispersed duplication (DD).
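
To make the two schemes concrete, here is a minimal base R sketch of how duplicate pairs could be assigned to these modes in a fixed order of precedence. It does not use the doubletrouble API. The function name `classify_pairs_sketch`, the input columns, the toy data, the `proximal_max` threshold, and the decision rules themselves (pairs in syntenic blocks as WGD, adjacent genes as TD, nearby genes as PD, pairs with one copy at an ancestral position as TRD, everything else as DD) are assumptions made for illustration only.

```r
# Toy duplicate pairs; all columns and values are hypothetical.
pairs <- data.frame(
    dup1 = c("g1", "g3", "g5", "g8", "g9"),
    dup2 = c("g2", "g4", "g7", "g20", "g30"),
    same_chr = c(FALSE, TRUE, TRUE, FALSE, TRUE),
    # Difference in gene rank along the chromosome (NA if on different chromosomes)
    order_dist = c(NA, 1, 4, NA, 150),
    in_synteny_block = c(TRUE, FALSE, FALSE, FALSE, FALSE),
    one_copy_ancestral = c(FALSE, FALSE, FALSE, TRUE, FALSE),
    stringsAsFactors = FALSE
)

classify_pairs_sketch <- function(pairs, proximal_max = 10) {
    # Default mode: dispersed duplication
    type <- rep("DD", nrow(pairs))
    is_near <- pairs$same_chr & !is.na(pairs$order_dist)

    # Apply rules from lowest to highest priority, so later rules overwrite earlier ones
    type[which(pairs$one_copy_ancestral)] <- "TRD"
    type[which(is_near & pairs$order_dist <= proximal_max)] <- "PD"
    type[which(is_near & pairs$order_dist == 1)] <- "TD"
    type[which(pairs$in_synteny_block)] <- "WGD"

    pairs$type <- type
    pairs
}

classified <- classify_pairs_sketch(pairs)

# The two-mode scheme collapses every non-WGD mode into SSD
classified$binary <- ifelse(classified$type == "WGD", "WGD", "SSD")
classified[, c("dup1", "dup2", "type", "binary")]
```

Assigning the rules from lowest to highest priority keeps the function short: each pair ends up with the highest-priority mode whose condition it satisfies.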

Besides classifying gene pairs, users can also classify individual genes, so that each gene is assigned a single mode of duplication.
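
A gene can appear in several pairs with different modes, so gene-level classification needs a rule for picking one. The base R sketch below resolves conflicts with a priority order (WGD > TD > PD > TRD > DD). Both the priority order and the toy data are assumptions for illustration, and the code does not call any doubletrouble function.

```r
# Toy classified pairs (hypothetical data): g1 and g3 each occur in two pairs
pairs <- data.frame(
    dup1 = c("g1", "g1", "g3", "g5"),
    dup2 = c("g2", "g3", "g4", "g6"),
    type = c("DD", "TD", "WGD", "PD"),
    stringsAsFactors = FALSE
)

# Assumed priority, from highest to lowest
priority <- c("WGD", "TD", "PD", "TRD", "DD")

# Reshape to one row per (gene, mode) observation
long <- data.frame(
    gene = c(pairs$dup1, pairs$dup2),
    type = factor(rep(pairs$type, 2), levels = priority),
    stringsAsFactors = FALSE
)

# Sort by gene, then by priority, and keep the first (highest-priority) row per gene
long <- long[order(long$gene, long$type), ]
gene_modes <- long[!duplicated(long$gene), ]
gene_modes$type <- as.character(gene_modes$type)
rownames(gene_modes) <- NULL
gene_modes
```

In this toy example, g1 is part of a DD pair and a TD pair and is therefore labeled TD, while g3 is part of a TD pair and a WGD pair and is labeled WGD.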

Users can also calculate substitution rates for duplicate pairs (Ka, nonsynonymous substitutions per nonsynonymous site, and Ks, synonymous substitutions per synonymous site), find peaks in Ks distributions with Gaussian Mixture Models (GMMs), and classify gene pairs into age groups based on Ks peaks.
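
The idea behind peak finding and age-group assignment can be sketched in base R. The example below fits a two-component Gaussian mixture to simulated Ks values with a small hand-written expectation-maximization (EM) loop, then assigns each pair to the component with the highest posterior probability. This is a stand-in for illustration, not the package's implementation: the helper `fit_gmm2`, the fixed number of components, the initialization, and the simulated data are all assumptions.

```r
set.seed(123)

# Simulated Ks values from two hypothetical duplication events (not real data)
ks <- c(rnorm(300, mean = 0.3, sd = 0.08), rnorm(300, mean = 1.5, sd = 0.25))
ks <- ks[ks > 0]

# Hypothetical helper: EM for a 1-D mixture of two Gaussians
fit_gmm2 <- function(x, n_iter = 200) {
    # Crude initialization: split the data at the median
    mu <- c(mean(x[x <= median(x)]), mean(x[x > median(x)]))
    sigma <- rep(sd(x) / 2, 2)
    lambda <- c(0.5, 0.5)

    for (i in seq_len(n_iter)) {
        # E-step: posterior probability that each value belongs to each component
        d1 <- lambda[1] * dnorm(x, mu[1], sigma[1])
        d2 <- lambda[2] * dnorm(x, mu[2], sigma[2])
        r1 <- d1 / (d1 + d2)
        r2 <- 1 - r1

        # M-step: update mixing proportions, means, and standard deviations
        lambda <- c(mean(r1), mean(r2))
        mu <- c(sum(r1 * x) / sum(r1), sum(r2 * x) / sum(r2))
        sigma <- c(
            sqrt(sum(r1 * (x - mu[1])^2) / sum(r1)),
            sqrt(sum(r2 * (x - mu[2])^2) / sum(r2))
        )
    }
    list(mean = mu, sd = sigma, lambda = lambda, posterior = cbind(r1, r2))
}

fit <- fit_gmm2(ks)

# Component means are the estimated Ks peaks
fit$mean

# Age group = component with the highest posterior probability
age_group <- max.col(fit$posterior, ties.method = "first")
table(age_group)
```

In practice the number of components is usually not known in advance and is chosen by a model selection criterion, which this sketch leaves out by fixing it at two.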