Classify duplicate gene pairs based on their modes of duplication

Usage

classify_gene_pairs(
  annotation = NULL,
  blast_list = NULL,
  scheme = "standard",
  blast_inter = NULL,
  intron_counts,
  evalue = 1e-10,
  anchors = 5,
  max_gaps = 25,
  proximal_max = 10,
  collinearity_dir = NULL,
  outgroup_coverage = 70
)

Arguments

annotation: A processed GRangesList or CompressedGRangesList object as returned by syntenet::process_input().
blast_list: A list of data frames containing BLAST tabular output for intraspecies comparisons. Each list element corresponds to the BLAST output for a given species, and names of list elements must match the names of list elements in annotation. BLASTp, DIAMOND or simular programs must be run on processed sequence data as returned by process_input().
scheme: Character indicating which classification scheme to use. One of "binary", "standard", "extended", or "full". See details below for information on what each scheme means. Default: "standard".
blast_inter: (Only valid if scheme == "extended" or "full"). A list of data frames containing BLAST tabular output for the comparison between target species and outgroups. Names of list elements must match the names of list elements in annotation. BLASTp, DIAMOND or simular programs must be run on processed sequence data as returned by process_input().
intron_counts: (Only valid if scheme == "full"). A list of 2-column data frames with the number of introns per gene as returned by get_intron_counts(). Names of list elements must match names of annotation.
evalue: Numeric scalar indicating the E-value threshold. Default: 1e-10.
anchors: Numeric indicating the minimum required number of genes to call a syntenic block, as in syntenet::infer_syntenet. Default: 5.
max_gaps: Numeric indicating the number of upstream and downstream genes to search for anchors, as in syntenet::infer_syntenet. Default: 25.
proximal_max: Numeric scalar with the maximum distance (in number of genes) between two genes to consider them as proximal duplicates. Default: 10.
collinearity_dir: Character indicating the path to the directory where .collinearity files will be stored. If NULL, files will be stored in a subdirectory of tempdir(). Default: NULL.
outgroup_coverage: Numeric indicating the minimum percentage of outgroup species to use to consider genes as transposed duplicates. Only valid if multiple outgroup species are present (see details below). Values should range from 0 to 100. Default: 70.

Value

A list of 3-column data frames of duplicated gene pairs (columns 1 and 2), and their modes of duplication (column 3).

Details

The classification schemes increase in complexity (number of classes) in the order 'binary', 'standard', 'extended', and 'full'.

For classification scheme "binary", duplicates are classified into one of 'SD' (segmental duplications) or 'SSD' (small-scale duplications).

For classification scheme "standard" (default), duplicates are classified into 'SD' (segmental duplication), 'TD' (tandem duplication), 'PD' (proximal duplication), and 'DD' (dispersed duplication).

For classification scheme "extended", duplicates are classified into 'SD' (segmental duplication), 'TD' (tandem duplication), 'PD' (proximal duplication), 'TRD' (transposon-derived duplication), and 'DD' (dispersed duplication).

Finally, for classification scheme "full", duplicates are classified into 'SD' (segmental duplication), 'TD' (tandem duplication), 'PD' (proximal duplication), 'rTRD' (retrotransposon-derived duplication), 'dTRD' (DNA transposon-derived duplication), and 'DD' (dispersed duplication).

Examples

# Load example data
data(diamond_intra)
data(diamond_inter)
data(yeast_annot)
data(yeast_seq)

# Get processed annotation data
annotation <- syntenet::process_input(yeast_seq, yeast_annot)$annotation

# Get collapsed DIAMOND inter
blast_inter <- syntenet::collapse_bidirectional_hits(
    diamond_inter,
    data.frame("Scerevisiae", "Cglabrata")
)

# Get list of intron counts
library(txdbmaker)
#> Loading required package: BiocGenerics
#> Loading required package: generics
#> 
#> Attaching package: ‘generics’
#> The following objects are masked from ‘package:base’:
#> 
#>     as.difftime, as.factor, as.ordered, intersect, is.element, setdiff,
#>     setequal, union
#> 
#> Attaching package: ‘BiocGenerics’
#> The following objects are masked from ‘package:stats’:
#> 
#>     IQR, mad, sd, var, xtabs
#> The following objects are masked from ‘package:base’:
#> 
#>     Filter, Find, Map, Position, Reduce, anyDuplicated, aperm, append,
#>     as.data.frame, basename, cbind, colnames, dirname, do.call,
#>     duplicated, eval, evalq, get, grep, grepl, is.unsorted, lapply,
#>     mapply, match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
#>     rank, rbind, rownames, sapply, saveRDS, table, tapply, unique,
#>     unsplit, which.max, which.min
#> Loading required package: S4Vectors
#> Loading required package: stats4
#> 
#> Attaching package: ‘S4Vectors’
#> The following object is masked from ‘package:utils’:
#> 
#>     findMatches
#> The following objects are masked from ‘package:base’:
#> 
#>     I, expand.grid, unname
#> Loading required package: Seqinfo
#> Loading required package: GenomicRanges
#> Loading required package: IRanges
#> Loading required package: GenomicFeatures
#> Loading required package: AnnotationDbi
#> Loading required package: Biobase
#> Welcome to Bioconductor
#> 
#>     Vignettes contain introductory material; view with
#>     'browseVignettes()'. To cite Bioconductor, see
#>     'citation("Biobase")', and for packages 'citation("pkgname")'.
#> 
#> Attaching package: ‘txdbmaker’
#> The following objects are masked from ‘package:GenomicFeatures’:
#> 
#>     UCSCFeatureDbTableSchema, browseUCSCtrack, getChromInfoFromBiomart,
#>     makeFDbPackageFromUCSC, makeFeatureDbFromUCSC, makePackageName,
#>     makeTxDb, makeTxDbFromBiomart, makeTxDbFromEnsembl,
#>     makeTxDbFromGFF, makeTxDbFromGRanges, makeTxDbFromUCSC,
#>     makeTxDbPackage, makeTxDbPackageFromBiomart,
#>     makeTxDbPackageFromUCSC, supportedMiRBaseBuildValues,
#>     supportedUCSCFeatureDbTables, supportedUCSCFeatureDbTracks,
#>     supportedUCSCtables
txdb_list <- lapply(yeast_annot, txdbmaker::makeTxDbFromGRanges)
#> Warning: genome version information is not available for this TxDb object
#> Warning: genome version information is not available for this TxDb object
intron_counts <- lapply(txdb_list, get_intron_counts)

# Classify duplicates - full scheme
dup_class <- classify_gene_pairs(
    annotation = annotation, 
    blast_list = diamond_intra, 
    scheme = "full",
    blast_inter = blast_inter, 
    intron_counts = intron_counts
)

# Check number of gene pairs per class
table(dup_class$Scerevisiae$type)
#> 
#>   SD   TD   PD rTRD dTRD   DD 
#>  342   42   80   50  963 2111