Classify duplicate gene pairs based on their modes of duplication
Source:R/duplicate_classification.R
classify_gene_pairs.Rd
Classify duplicate gene pairs based on their modes of duplication
Usage
classify_gene_pairs(
annotation = NULL,
blast_list = NULL,
scheme = "standard",
blast_inter = NULL,
intron_counts,
evalue = 1e-10,
anchors = 5,
max_gaps = 25,
proximal_max = 10,
collinearity_dir = NULL,
outgroup_coverage = 70
)
Arguments
- annotation
A processed GRangesList or CompressedGRangesList object as returned by
syntenet::process_input()
.- blast_list
A list of data frames containing BLAST tabular output for intraspecies comparisons. Each list element corresponds to the BLAST output for a given species, and names of list elements must match the names of list elements in annotation. BLASTp, DIAMOND or simular programs must be run on processed sequence data as returned by
process_input()
.- scheme
Character indicating which classification scheme to use. One of "binary", "standard", "extended", or "full". See details below for information on what each scheme means. Default: "standard".
- blast_inter
(Only valid if
scheme == "extended" or "full"
). A list of data frames containing BLAST tabular output for the comparison between target species and outgroups. Names of list elements must match the names of list elements inannotation
. BLASTp, DIAMOND or simular programs must be run on processed sequence data as returned byprocess_input()
.- intron_counts
(Only valid if
scheme == "full"
). A list of 2-column data frames with the number of introns per gene as returned byget_intron_counts()
. Names of list elements must match names of annotation.- evalue
Numeric scalar indicating the E-value threshold. Default: 1e-10.
- anchors
Numeric indicating the minimum required number of genes to call a syntenic block, as in
syntenet::infer_syntenet
. Default: 5.- max_gaps
Numeric indicating the number of upstream and downstream genes to search for anchors, as in
syntenet::infer_syntenet
. Default: 25.- proximal_max
Numeric scalar with the maximum distance (in number of genes) between two genes to consider them as proximal duplicates. Default: 10.
- collinearity_dir
Character indicating the path to the directory where .collinearity files will be stored. If NULL, files will be stored in a subdirectory of
tempdir()
. Default: NULL.- outgroup_coverage
Numeric indicating the minimum percentage of outgroup species to use to consider genes as transposed duplicates. Only valid if multiple outgroup species are present (see details below). Values should range from 0 to 100. Default: 70.
Value
A list of 3-column data frames of duplicated gene pairs (columns 1 and 2), and their modes of duplication (column 3).
Details
The classification schemes increase in complexity (number of classes) in the order 'binary', 'standard', 'extended', and 'full'.
For classification scheme "binary", duplicates are classified into one of 'SD' (segmental duplications) or 'SSD' (small-scale duplications).
For classification scheme "standard" (default), duplicates are classified into 'SD' (segmental duplication), 'TD' (tandem duplication), 'PD' (proximal duplication), and 'DD' (dispersed duplication).
For classification scheme "extended", duplicates are classified into 'SD' (segmental duplication), 'TD' (tandem duplication), 'PD' (proximal duplication), 'TRD' (transposon-derived duplication), and 'DD' (dispersed duplication).
Finally, for classification scheme "full", duplicates are classified into 'SD' (segmental duplication), 'TD' (tandem duplication), 'PD' (proximal duplication), 'rTRD' (retrotransposon-derived duplication), 'dTRD' (DNA transposon-derived duplication), and 'DD' (dispersed duplication).
Examples
# Load example data
data(diamond_intra)
data(diamond_inter)
data(yeast_annot)
data(yeast_seq)
# Get processed annotation data
annotation <- syntenet::process_input(yeast_seq, yeast_annot)$annotation
# Get list of intron counts
library(txdbmaker)
#> Loading required package: BiocGenerics
#>
#> Attaching package: ‘BiocGenerics’
#> The following objects are masked from ‘package:stats’:
#>
#> IQR, mad, sd, var, xtabs
#> The following objects are masked from ‘package:base’:
#>
#> Filter, Find, Map, Position, Reduce, anyDuplicated, aperm, append,
#> as.data.frame, basename, cbind, colnames, dirname, do.call,
#> duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted,
#> lapply, mapply, match, mget, order, paste, pmax, pmax.int, pmin,
#> pmin.int, rank, rbind, rownames, sapply, saveRDS, setdiff, table,
#> tapply, union, unique, unsplit, which.max, which.min
#> Loading required package: S4Vectors
#> Loading required package: stats4
#>
#> Attaching package: ‘S4Vectors’
#> The following object is masked from ‘package:utils’:
#>
#> findMatches
#> The following objects are masked from ‘package:base’:
#>
#> I, expand.grid, unname
#> Loading required package: GenomeInfoDb
#> Loading required package: IRanges
#> Loading required package: GenomicRanges
#> Loading required package: GenomicFeatures
#> Loading required package: AnnotationDbi
#> Loading required package: Biobase
#> Welcome to Bioconductor
#>
#> Vignettes contain introductory material; view with
#> 'browseVignettes()'. To cite Bioconductor, see
#> 'citation("Biobase")', and for packages 'citation("pkgname")'.
#>
#> Attaching package: ‘txdbmaker’
#> The following objects are masked from ‘package:GenomicFeatures’:
#>
#> UCSCFeatureDbTableSchema, browseUCSCtrack, getChromInfoFromBiomart,
#> makeFDbPackageFromUCSC, makeFeatureDbFromUCSC, makePackageName,
#> makeTxDb, makeTxDbFromBiomart, makeTxDbFromEnsembl,
#> makeTxDbFromGFF, makeTxDbFromGRanges, makeTxDbFromUCSC,
#> makeTxDbPackage, makeTxDbPackageFromBiomart,
#> makeTxDbPackageFromUCSC, supportedMiRBaseBuildValues,
#> supportedUCSCFeatureDbTables, supportedUCSCFeatureDbTracks,
#> supportedUCSCtables
txdb_list <- lapply(yeast_annot, txdbmaker::makeTxDbFromGRanges)
intron_counts <- lapply(txdb_list, get_intron_counts)
# Classify duplicates - full scheme
dup_class <- classify_gene_pairs(
annotation = annotation,
blast_list = diamond_intra,
scheme = "full",
blast_inter = diamond_inter,
intron_counts = intron_counts
)
# Check number of gene pairs per class
table(dup_class$Scerevisiae$type)
#>
#> SD TD PD rTRD dTRD DD
#> 342 42 80 52 963 2109