
syntenet as a synteny detection tool
Fabricio Almeida-Silva
VIB-UGent Center for Plant Systems Biology, Ghent, BelgiumDepartment of Plant Biotechnology and Bioinformatics, Ghent University, Ghent, BelgiumTao Zhao
State Key Laboratory of Crop Stress Biology for Arid Areas/Shaanxi Key Laboratory of Apple, College of Horticulture, Northwest A&F University, Yangling, ChinaKristian K Ullrich
Department of Evolutionary Biology, Max Planck Institute For Evolutionary Biology, Ploen, GermanyYves Van de Peer
VIB-UGent Center for Plant Systems Biology, Ghent, BelgiumDepartment of Plant Biotechnology and Bioinformatics, Ghent University, Ghent, BelgiumCollege of Horticulture, Academy for Advanced Interdisciplinary Studies, Nanjing Agricultural University, Nanjing, ChinaCenter for Microbial Ecology and Genomics, Department of Biochemistry, Genetics and Microbiology, University of Pretoria, Pretoria, South AfricaSource:
vignettes/vignette_02_synteny_detection_with_syntenet.Rmd
vignette_02_synteny_detection_with_syntenet.Rmd
Introduction
Although syntenet was designed for large-scale synteny analyses (i.e., with tens of genomes), users might also be interested in using syntenet as a simple synteny detection tool. For example, one can use syntenet to identify syntenic regions in a genome (intraspecies synteny), or to identify syntenic regions between the genomes of two species (interspecies synteny). To detect synteny, syntenet uses its own implementation of the MCScanX algorithm (Wang et al. 2012), ported to R thanks to the Rcpp package for R and C++ integration. In this vignette, we will guide you on how to perform intra and interspecies synteny with syntenet.
Detecting intragenome synteny
To detect intragenome (or intraspecies) synteny, you will use the
function intraspecies_synteny
. As input, you will need:
A
GRangesList
object containing the processed annotation for your species of interest, as returned byprocess_input()
.A list of data frames with the output of similarity search programs (e.g., DIAMOND, BLAST, etc). Only intragenome comparisons must be included. This can be obtained with
run_diamond(seq, compare = "intraspecies")
.
To demonstrate the usage of intraspecies_synteny()
,
let’s identify syntenic regions in the genome of Saccharomyces
cerevisiae. The processed annotation and DIAMOND output are stored
in the example data sets scerevisiae_annot
and
scerevisiae_diamond
.
# Load example data sets
data(scerevisiae_annot)
head(scerevisiae_annot)
#> $Scerevisiae
#> GRanges object with 6600 ranges and 1 metadata column:
#> seqnames ranges strand | gene
#> <Rle> <IRanges> <Rle> | <character>
#> [1] Sce_I 335-649 * | Sce_YAL069W
#> [2] Sce_I 538-792 * | Sce_YAL068W-A
#> [3] Sce_I 1807-2169 * | Sce_YAL068C
#> [4] Sce_I 2480-2707 * | Sce_YAL067W-A
#> [5] Sce_I 7235-9016 * | Sce_YAL067C
#> ... ... ... ... . ...
#> [6596] Sce_XVI 939922-941136 * | Sce_YPR201W
#> [6597] Sce_XVI 943032-943896 * | Sce_YPR202W
#> [6598] Sce_XVI 943880-944188 * | Sce_YPR203W
#> [6599] Sce_XVI 944603-947701 * | Sce_YPR204W
#> [6600] Sce_XVI 946856-947338 * | Sce_YPR204C-A
#> -------
#> seqinfo: 17 sequences from an unspecified genome; no seqlengths
data(scerevisiae_diamond)
names(scerevisiae_diamond)
#> [1] "Scerevisiae_Scerevisiae"
head(scerevisiae_diamond$Scerevisiae_Scerevisiae)
#> query db perc_identity length mismatches gap_open qstart qend
#> 1 Sce_YLR106C Sce_YLR106C 100.0 4910 0 0 1 4910
#> 2 Sce_YKR054C Sce_YKR054C 100.0 4092 0 0 1 4092
#> 3 Sce_YHR099W Sce_YHR099W 100.0 3744 0 0 1 3744
#> 4 Sce_YDR457W Sce_YDR457W 100.0 3268 0 0 1 3268
#> 5 Sce_YDR457W Sce_YER125W 44.1 354 195 3 2913 3266
#> 6 Sce_YDR457W Sce_YJR036C 30.7 378 228 12 2913 3266
#> tstart tend evalue bitscore
#> 1 1 4910 0.00e+00 9095
#> 2 1 4092 0.00e+00 7940
#> 3 1 3744 0.00e+00 7334
#> 4 1 3268 0.00e+00 6170
#> 5 457 807 2.18e-91 315
#> 6 523 890 7.99e-44 172
Once we have the data, we can detect synteny with
intraspecies_synteny()
. This function returns the path to
the .collinearity files generated by MCScanX (Wang et al. 2012), which can be read and parsed
with the parse_collinearity()
function.
# Detect intragenome synteny
intra_syn <- intraspecies_synteny(
scerevisiae_diamond, scerevisiae_annot
)
intra_syn # see where the .collinearity file is
#> [1] "/tmp/Rtmp4D0cbF/intra/Scerevisiae.collinearity"
# Get anchor pairs from .collinearity file
anchors <- parse_collinearity(intra_syn)
head(anchors)
#> Anchor1 Anchor2
#> 1 Sce_YAR050W Sce_YHR211W
#> 2 Sce_YAR060C Sce_YHR212C
#> 3 Sce_YAR064W Sce_YHR213W-B
#> 4 Sce_YAR066W Sce_YHR214W
#> 5 Sce_YAR068W Sce_YHR214W-A
#> 6 Sce_YAR069C Sce_YHR214C-D
By default, the output of parse_collinearity()
is a
2-column data frame with anchor pairs in syntenic regions. You can also
extract data on each synteny block (or synteny region) by specifying
as = 'blocks'
.
# Get synteny block information with `parse_collinearity()`
blocks <- parse_collinearity(intra_syn, as = "blocks")
head(blocks)
#> Block Block_score Chr Orientation
#> 1 0 446 Sce_I&Sce_VIII plus
#> 2 1 528 Sce_I&Sce_XV minus
#> 3 2 572 Sce_II&Sce_IV plus
#> 4 3 643 Sce_II&Sce_V minus
#> 5 4 446 Sce_II&Sce_XVI plus
#> 6 5 422 Sce_III&Sce_IV minus
You can even extract both anchor pairs and synteny block data at once in a single data frame.
# Get anchors and block data with `parse_collinearity()`
intrasyn_all <- parse_collinearity(intra_syn, as = "all")
head(intrasyn_all)
#> Block Block_score Chr Orientation Anchor1 Anchor2
#> 1 0 446 Sce_I&Sce_VIII plus Sce_YAR050W Sce_YHR211W
#> 2 0 446 Sce_I&Sce_VIII plus Sce_YAR060C Sce_YHR212C
#> 3 0 446 Sce_I&Sce_VIII plus Sce_YAR064W Sce_YHR213W-B
#> 4 0 446 Sce_I&Sce_VIII plus Sce_YAR066W Sce_YHR214W
#> 5 0 446 Sce_I&Sce_VIII plus Sce_YAR068W Sce_YHR214W-A
#> 6 0 446 Sce_I&Sce_VIII plus Sce_YAR069C Sce_YHR214C-D
Detecting intergenome synteny
To detect intergenome (or interspecies) synteny, you will use the
function interspecies_synteny
, which works very similarly
to intraspecies_synteny()
. As input, you will need:
A
GRangesList
object containing the processed annotation for your species of interest (2+ species), as returned byprocess_input()
.A list of data frames with the output of similarity search programs (e.g., DIAMOND, BLAST, etc). Only intergenome comparisons must be included. This can be obtained with
run_diamond(seq, compare = compare_df)
.
Importantly, to detect intergenome synteny, the MCScanX algorithm
requires bidirectional BLAST hits. For instance, if you’re trying to
detect synteny between spA
and spB
, you need
to perform DIAMOND/BLAST searches in both directions:
spA_spB
and spB_spA
. You can create a data
frame specifying these comparisons as follows:
# Create a data frame with comparisons to be made
comp <- data.frame(
query = "spA",
db = "spB"
)
comp
#> query db
#> 1 spA spB
# Make comparisons bidirectional
comp_bi <- make_bidirectional(comp)
comp_bi
#> query db
#> 1 spA spB
#> 2 spB spA
Then, you can peform similarity searches for these comparisons with:
# NOTE: Not executed because object `seq` doesn't exist; for demo only
dmd_inter <- run_diamond(seq, compare = comp_bi)
The code above would generate a list of two data frames with DIAMOND
tables: one named spA_spB
, and another one named
spB_spA
.
To demonstrate what this looks like, we will load the example data
set we will use to detect intergenome synteny. Here, we will detect
syntenic regions between the genomes of the algae Ostreococcus
lucimarinus and Ostreococcus sp RCC809. The list of
DIAMOND tables is stored in object blast_list
, and we will
create the processed annotation from objects proteomes
and
annotation
.
# Load list of DIAMOND tables
data(blast_list)
names(blast_list)
#> [1] "Olucimarinus_Olucimarinus" "Olucimarinus_OspRCC809"
#> [3] "OspRCC809_Olucimarinus" "OspRCC809_OspRCC809"
algae_inter <- blast_list[c(2,3)] # keep only intergenome comparisons
names(algae_inter)
#> [1] "Olucimarinus_OspRCC809" "OspRCC809_Olucimarinus"
# Get processed annotation
data(proteomes) # A list of `AAStringSet` objects
data(annotation) # A `GRangesList` object
pdata <- process_input(proteomes, annotation)
names(pdata$annotation)
#> [1] "Olucimarinus" "OspRCC809"
Now that we have a list of DIAMOND tables with bidirectional
comparisons between Olucimarinus and OspRCC809, as
well as annotation for these two genomes, we can detect synteny with
interspecies_synteny()
. Like
intraspecies_synteny()
, this function will return the path
to a .collinearity file generated by MCScanX, and we can parse
this file with parse_collinearity()
.
# Detect interspecies synteny
intersyn <- interspecies_synteny(algae_inter, pdata$annotation)
intersyn # see where the .collinearity file is
#> [1] "/tmp/Rtmp4D0cbF/inter/Olucimarinus_OspRCC809.collinearity"
# Parse collinearity file
## 1) Get anchor pairs
algae_anchors <- parse_collinearity(intersyn)
head(algae_anchors)
#> Anchor1 Anchor2
#> 1 Olu_OL01G00100 Osp_ORCC809_01G06480
#> 2 Olu_OL01G00130 Osp_ORCC809_01G06440
#> 3 Olu_OL01G00150 Osp_ORCC809_01G06420
#> 4 Olu_OL01G00160 Osp_ORCC809_01G06410
#> 5 Olu_OL01G00170 Osp_ORCC809_01G06400
#> 6 Olu_OL01G00180 Osp_ORCC809_01G06390
## 2) Get synteny blocks
algae_blocks <- parse_collinearity(intersyn, as = "blocks")
head(algae_blocks)
#> Block Block_score Chr Orientation
#> 1 0 27234 Olu_Chr_1&Osp_chr_1 minus
#> 2 1 299 Olu_Chr_2&Osp_chr_2 plus
#> 3 2 296 Olu_Chr_2&Osp_chr_2 plus
#> 4 3 287 Olu_Chr_2&Osp_chr_2 plus
#> 5 4 7548 Olu_Chr_2&Osp_chr_2 minus
#> 6 5 4578 Olu_Chr_2&Osp_chr_2 minus
## 3) Get all data combined (blocks + anchor pairs)
algae_all <- parse_collinearity(intersyn, as = "all")
head(algae_all)
#> Block Block_score Chr Orientation Anchor1
#> 1 0 27234 Olu_Chr_1&Osp_chr_1 minus Olu_OL01G00100
#> 2 0 27234 Olu_Chr_1&Osp_chr_1 minus Olu_OL01G00130
#> 3 0 27234 Olu_Chr_1&Osp_chr_1 minus Olu_OL01G00150
#> 4 0 27234 Olu_Chr_1&Osp_chr_1 minus Olu_OL01G00160
#> 5 0 27234 Olu_Chr_1&Osp_chr_1 minus Olu_OL01G00170
#> 6 0 27234 Olu_Chr_1&Osp_chr_1 minus Olu_OL01G00180
#> Anchor2
#> 1 Osp_ORCC809_01G06480
#> 2 Osp_ORCC809_01G06440
#> 3 Osp_ORCC809_01G06420
#> 4 Osp_ORCC809_01G06410
#> 5 Osp_ORCC809_01G06400
#> 6 Osp_ORCC809_01G06390
Session information
This document was created under the following conditions:
sessionInfo()
#> R version 4.5.1 (2025-06-13)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.2 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] syntenet_1.11.2 BiocStyle_2.37.0
#>
#> loaded via a namespace (and not attached):
#> [1] sass_0.4.10 generics_0.1.4 lattice_0.22-7
#> [4] digest_0.6.37 magrittr_2.0.3 statnet.common_4.12.0
#> [7] intergraph_2.0-4 evaluate_1.0.4 grid_4.5.1
#> [10] RColorBrewer_1.1-3 bookdown_0.43 fastmap_1.2.0
#> [13] jsonlite_2.0.0 ggnetwork_0.5.13 network_1.19.0
#> [16] BiocManager_1.30.26 scales_1.4.0 Biostrings_2.77.2
#> [19] codetools_0.2-20 textshaping_1.0.1 jquerylib_0.1.4
#> [22] cli_3.6.5 rlang_1.1.6 crayon_1.5.3
#> [25] XVector_0.49.0 cachem_1.1.0 yaml_2.3.10
#> [28] tools_4.5.1 parallel_4.5.1 BiocParallel_1.43.4
#> [31] coda_0.19-4.1 dplyr_1.1.4 ggplot2_3.5.2
#> [34] BiocGenerics_0.55.0 vctrs_0.6.5 R6_2.6.1
#> [37] stats4_4.5.1 lifecycle_1.0.4 Seqinfo_0.99.1
#> [40] S4Vectors_0.47.0 fs_1.6.6 htmlwidgets_1.6.4
#> [43] IRanges_2.43.0 ragg_1.4.0 pkgconfig_2.0.3
#> [46] desc_1.4.3 gtable_0.3.6 pkgdown_2.1.3
#> [49] pillar_1.11.0 bslib_0.9.0 glue_1.8.0
#> [52] Rcpp_1.1.0 systemfonts_1.2.3 tidyselect_1.2.1
#> [55] xfun_0.52 tibble_3.3.0 GenomicRanges_1.61.1
#> [58] knitr_1.50 farver_2.1.2 igraph_2.1.4
#> [61] htmltools_0.5.8.1 rmarkdown_2.29 pheatmap_1.0.13
#> [64] compiler_4.5.1