Classify duplicate gene pairs based on their modes of duplication


classify_gene_pairs(
  annotation = NULL,
  blast_list = NULL,
  scheme = "standard",
  blast_inter = NULL,
  intron_counts,
  evalue = 1e-10,
  anchors = 5,
  max_gaps = 25,
  proximal_max = 10,
  collinearity_dir = NULL
)



annotation: A processed GRangesList or CompressedGRangesList object as returned by syntenet::process_input().


blast_list: A list of data frames containing BLAST tabular output for intraspecies comparisons. Each list element corresponds to the BLAST output for a given species, and names of list elements must match the names of list elements in annotation. BLASTp, DIAMOND, or similar programs must be run on processed sequence data as returned by process_input().


scheme: Character indicating which classification scheme to use. One of "binary", "standard", "extended", or "full". See details below for information on what each scheme means. Default: "standard".


blast_inter: (Only used if scheme is "extended" or "full"). A list of data frames containing BLAST tabular output for the comparison between target species and outgroups. Names of list elements must match the names of list elements in annotation. BLASTp, DIAMOND, or similar programs must be run on processed sequence data as returned by process_input().


intron_counts: (Only used if scheme is "full"). A list of 2-column data frames with the number of introns per gene as returned by get_intron_counts(). Names of list elements must match names of annotation.


evalue: Numeric scalar indicating the E-value threshold. Default: 1e-10.


anchors: Numeric indicating the minimum number of genes required to call a syntenic block, as in syntenet::infer_syntenet. Default: 5.


max_gaps: Numeric indicating the number of upstream and downstream genes to search for anchors, as in syntenet::infer_syntenet. Default: 25.


proximal_max: Numeric scalar with the maximum distance (in number of genes) between two genes to consider them as proximal duplicates. Default: 10.


collinearity_dir: Character indicating the path to the directory where .collinearity files will be stored. If NULL, files will be stored in a subdirectory of tempdir(). Default: NULL.
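
The list inputs share one requirement: the names of their elements must match the names of annotation. The following is a minimal sketch, in base R, of how such a list might be assembled from BLAST tabular (outfmt 6) files. The helper read_blast_tab(), the species name "Scerevisiae", the column names, and the toy file contents are all hypothetical and only serve the illustration. The E-value filter mirrors the evalue argument, but it is not necessarily how the function applies the threshold internally.

```r
# Hypothetical helper: read a 12-column BLAST/DIAMOND tabular file (outfmt 6)
read_blast_tab <- function(path) {
    cols <- c(
        "query", "db", "perc_identity", "length", "mismatches", "gap_open",
        "qstart", "qend", "tstart", "tend", "evalue", "bitscore"
    )
    read.table(
        path, sep = "\t", header = FALSE, col.names = cols,
        stringsAsFactors = FALSE
    )
}

# Toy file standing in for real DIAMOND output (made-up values)
tab_file <- tempfile(fileext = ".tsv")
writeLines(c(
    "geneA\tgeneB\t95.0\t300\t15\t0\t1\t300\t1\t300\t1e-150\t550",
    "geneA\tgeneC\t30.0\t80\t56\t2\t10\t90\t5\t85\t1e-03\t40"
), tab_file)

# Names of list elements must match names(annotation);
# "Scerevisiae" is an assumed species name
blast_list <- list(Scerevisiae = read_blast_tab(tab_file))

# Keep hits at or below the E-value threshold (column 11 in outfmt 6)
evalue <- 1e-10
filtered <- lapply(blast_list, function(df) df[df$evalue <= evalue, ])
```

With the toy data, only the geneA/geneB hit passes the default threshold of 1e-10.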


Returns a list of 3-column data frames: columns 1 and 2 hold the duplicated gene pair, and column 3 holds the mode of duplication.
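
As a rough illustration of working with a value of this shape, the sketch below builds a mock list and counts gene pairs per duplication mode. The species name, gene IDs, and column names are invented for the example; the third column is addressed by position so that no particular column name is assumed.

```r
# Mock result with the documented shape: a named list of 3-column data frames
# (species name, gene IDs, and column names are invented for illustration)
dup_class <- list(
    Scerevisiae = data.frame(
        dup1 = c("g1", "g2", "g3", "g4"),
        dup2 = c("g5", "g6", "g7", "g8"),
        type = c("SD", "TD", "DD", "DD")
    )
)

# Count gene pairs per duplication mode for every list element;
# column 3 is selected by position rather than by name
counts <- lapply(dup_class, function(df) table(df[[3]]))
counts$Scerevisiae
```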


The classification schemes increase in complexity (number of classes) in the order 'binary', 'standard', 'extended', and 'full'.

For classification scheme "binary", duplicates are classified into either 'SD' (segmental duplication) or 'SSD' (small-scale duplication).

For classification scheme "standard" (default), duplicates are classified into 'SD' (segmental duplication), 'TD' (tandem duplication), 'PD' (proximal duplication), and 'DD' (dispersed duplication).

For classification scheme "extended", duplicates are classified into 'SD' (segmental duplication), 'TD' (tandem duplication), 'PD' (proximal duplication), 'TRD' (transposon-derived duplication), and 'DD' (dispersed duplication).

Finally, for classification scheme "full", duplicates are classified into 'SD' (segmental duplication), 'TD' (tandem duplication), 'PD' (proximal duplication), 'rTRD' (retrotransposon-derived duplication), 'dTRD' (DNA transposon-derived duplication), and 'DD' (dispersed duplication).
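
One way to picture how the schemes nest is to collapse "full"-scheme labels into the coarser schemes. The sketch below does this with a hypothetical helper, collapse_scheme(), which is not part of the package. The mappings are assumptions inferred from the class lists above: that 'rTRD' and 'dTRD' merge into 'TRD', that pairs labeled 'TRD' would count as 'DD' when no transposon-derived class exists, and that every class other than 'SD' is a small-scale duplication.

```r
# Hypothetical helper: collapse "full"-scheme labels into a coarser scheme.
# The mappings are assumptions inferred from the documented class lists.
collapse_scheme <- function(type,
                            scheme = c("extended", "standard", "binary")) {
    scheme <- match.arg(scheme)
    out <- as.character(type)

    # 'rTRD' and 'dTRD' are both transposon-derived duplications
    out[out %in% c("rTRD", "dTRD")] <- "TRD"
    if (scheme == "extended") return(out)

    # Assumed: without a TRD class, such pairs are counted as dispersed
    out[out == "TRD"] <- "DD"
    if (scheme == "standard") return(out)

    # Assumed: every class other than 'SD' is a small-scale duplication
    ifelse(out == "SD", "SD", "SSD")
}

full_types <- c("SD", "TD", "PD", "rTRD", "dTRD", "DD")
collapse_scheme(full_types, "extended")
collapse_scheme(full_types, "standard")
collapse_scheme(full_types, "binary")
```

This only relabels existing classes. Running classify_gene_pairs() with a simpler scheme avoids the extra inputs (blast_inter, intron_counts) altogether.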


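# Load example data (objects assumed to be bundled with the package)
data(diamond_intra)
data(diamond_inter)
data(yeast_seq)
data(yeast_annot)

# Get processed annotation data
annotation <- syntenet::process_input(yeast_seq, yeast_annot)$annotation

# Get intron counts per gene
# (`txdb_list` is assumed to be a list of TxDb objects, one per species,
# created beforehand; its construction is not shown here)
intron_counts <- lapply(txdb_list, get_intron_counts)

# Classify duplicates - full scheme
dup_class <- classify_gene_pairs(
    annotation = annotation,
    blast_list = diamond_intra,
    scheme = "full",
    blast_inter = diamond_inter,
    intron_counts = intron_counts
)

# Check number of gene pairs per class
# (first list element; column 3 holds the mode of duplication)
table(dup_class[[1]][[3]])
#>   SD   TD   PD rTRD dTRD   DD 
#>  342   42   80   52  963 2109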