syntenet as a synteny detection tool

Fabricio Almeida-Silva

VIB-UGent Center for Plant Systems Biology, Ghent, Belgium; Department of Plant Biotechnology and Bioinformatics, Ghent University, Ghent, Belgium

Tao Zhao

State Key Laboratory of Crop Stress Biology for Arid Areas/Shaanxi Key Laboratory of Apple, College of Horticulture, Northwest A&F University, Yangling, China

Kristian K Ullrich

Department of Evolutionary Biology, Max Planck Institute For Evolutionary Biology, Ploen, Germany

Yves Van de Peer

VIB-UGent Center for Plant Systems Biology, Ghent, Belgium; Department of Plant Biotechnology and Bioinformatics, Ghent University, Ghent, Belgium; College of Horticulture, Academy for Advanced Interdisciplinary Studies, Nanjing Agricultural University, Nanjing, China; Center for Microbial Ecology and Genomics, Department of Biochemistry, Genetics and Microbiology, University of Pretoria, Pretoria, South Africa
Source: vignettes/vignette_02_synteny_detection_with_syntenet.Rmd

Introduction

Although syntenet was designed for large-scale synteny analyses (i.e., with tens of genomes), it can also serve as a simple synteny detection tool. For example, you can use syntenet to identify syntenic regions within a genome (intraspecies synteny) or between the genomes of two species (interspecies synteny). To detect synteny, syntenet uses its own implementation of the MCScanX algorithm (Wang et al. 2012), ported to R with the Rcpp package for R and C++ integration. This vignette shows how to detect intra- and interspecies synteny with syntenet.

library(syntenet)

Detecting intragenome synteny

To detect intragenome (or intraspecies) synteny, you will use the function intraspecies_synteny(). As input, you will need:

A GRangesList object containing the processed annotation for your species of interest, as returned by process_input().
A list of data frames with the output of a similarity search program (e.g., DIAMOND or BLAST). The list must contain only intragenome comparisons. It can be obtained with run_diamond(seq, compare = "intraspecies").

To demonstrate the usage of intraspecies_synteny(), let’s identify syntenic regions in the genome of Saccharomyces cerevisiae. The processed annotation and DIAMOND output are stored in the example data sets scerevisiae_annot and scerevisiae_diamond.

# Load example data sets
data(scerevisiae_annot)
head(scerevisiae_annot)
#> $Scerevisiae
#> GRanges object with 6600 ranges and 1 metadata column:
#>          seqnames        ranges strand |          gene
#>             <Rle>     <IRanges>  <Rle> |   <character>
#>      [1]    Sce_I       335-649      * |   Sce_YAL069W
#>      [2]    Sce_I       538-792      * | Sce_YAL068W-A
#>      [3]    Sce_I     1807-2169      * |   Sce_YAL068C
#>      [4]    Sce_I     2480-2707      * | Sce_YAL067W-A
#>      [5]    Sce_I     7235-9016      * |   Sce_YAL067C
#>      ...      ...           ...    ... .           ...
#>   [6596]  Sce_XVI 939922-941136      * |   Sce_YPR201W
#>   [6597]  Sce_XVI 943032-943896      * |   Sce_YPR202W
#>   [6598]  Sce_XVI 943880-944188      * |   Sce_YPR203W
#>   [6599]  Sce_XVI 944603-947701      * |   Sce_YPR204W
#>   [6600]  Sce_XVI 946856-947338      * | Sce_YPR204C-A
#>   -------
#>   seqinfo: 17 sequences from an unspecified genome; no seqlengths

data(scerevisiae_diamond)
names(scerevisiae_diamond)
#> [1] "Scerevisiae_Scerevisiae"
head(scerevisiae_diamond$Scerevisiae_Scerevisiae)
#>         query          db perc_identity length mismatches gap_open qstart qend
#> 1 Sce_YLR106C Sce_YLR106C         100.0   4910          0        0      1 4910
#> 2 Sce_YKR054C Sce_YKR054C         100.0   4092          0        0      1 4092
#> 3 Sce_YHR099W Sce_YHR099W         100.0   3744          0        0      1 3744
#> 4 Sce_YDR457W Sce_YDR457W         100.0   3268          0        0      1 3268
#> 5 Sce_YDR457W Sce_YER125W          44.1    354        195        3   2913 3266
#> 6 Sce_YDR457W Sce_YJR036C          30.7    378        228       12   2913 3266
#>   tstart tend   evalue bitscore
#> 1      1 4910 0.00e+00     9095
#> 2      1 4092 0.00e+00     7940
#> 3      1 3744 0.00e+00     7334
#> 4      1 3268 0.00e+00     6170
#> 5    457  807 2.18e-91      315
#> 6    523  890 7.99e-44      172
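The first rows of the table above are self-hits (query and db are the same gene), which carry no information about duplicated genes. As an optional exploration step, the sketch below shows one way to look only at hits between different genes. It uses a small toy data frame, `toy_dmd`, that copies a subset of the columns and values printed above; the object name and the filtering step are illustrative assumptions and are not required by intraspecies_synteny().

```r
# Toy table with a subset of the columns shown above (values copied from
# the printed output; `toy_dmd` is a hypothetical name used for illustration)
toy_dmd <- data.frame(
    query = c("Sce_YLR106C", "Sce_YDR457W", "Sce_YDR457W", "Sce_YDR457W"),
    db = c("Sce_YLR106C", "Sce_YDR457W", "Sce_YER125W", "Sce_YJR036C"),
    perc_identity = c(100, 100, 44.1, 30.7),
    evalue = c(0, 0, 2.18e-91, 7.99e-44),
    bitscore = c(9095, 6170, 315, 172)
)

# Keep only hits between two different genes (i.e., drop self-hits)
non_self <- toy_dmd[toy_dmd$query != toy_dmd$db, ]
non_self
```

This is only meant for inspecting the table; the full, unfiltered list is what gets passed to intraspecies_synteny() in the next step.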

Once we have the data, we can detect synteny with intraspecies_synteny(). This function returns the path to the .collinearity files generated by MCScanX (Wang et al. 2012), which can be read and parsed with the parse_collinearity() function.

# Detect intragenome synteny
intra_syn <- intraspecies_synteny(
    scerevisiae_diamond, scerevisiae_annot
)

intra_syn # see where the .collinearity file is
#> [1] "/tmp/Rtmp4D0cbF/intra/Scerevisiae.collinearity"

# Get anchor pairs from .collinearity file
anchors <- parse_collinearity(intra_syn)
head(anchors)
#>       Anchor1       Anchor2
#> 1 Sce_YAR050W   Sce_YHR211W
#> 2 Sce_YAR060C   Sce_YHR212C
#> 3 Sce_YAR064W Sce_YHR213W-B
#> 4 Sce_YAR066W   Sce_YHR214W
#> 5 Sce_YAR068W Sce_YHR214W-A
#> 6 Sce_YAR069C Sce_YHR214C-D

By default, the output of parse_collinearity() is a 2-column data frame with anchor pairs in syntenic regions. You can also extract data on each synteny block (or synteny region) by specifying as = 'blocks'.

# Get synteny block information with `parse_collinearity()`
blocks <- parse_collinearity(intra_syn, as = "blocks")
head(blocks)
#>   Block Block_score            Chr Orientation
#> 1     0         446 Sce_I&Sce_VIII        plus
#> 2     1         528   Sce_I&Sce_XV       minus
#> 3     2         572  Sce_II&Sce_IV        plus
#> 4     3         643   Sce_II&Sce_V       minus
#> 5     4         446 Sce_II&Sce_XVI        plus
#> 6     5         422 Sce_III&Sce_IV       minus
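In the output above, the Chr column stores both chromosomes of a block in a single string separated by `&`. If you want to filter blocks by chromosome, it may help to split that column in two. The sketch below does this in base R on a toy data frame that copies the first three rows shown above; it assumes chromosome names never contain `&`, and the names `toy_blocks`, `Chr1`, and `Chr2` are hypothetical.

```r
# Toy block table copying the first rows of the output above
toy_blocks <- data.frame(
    Block = 0:2,
    Block_score = c(446, 528, 572),
    Chr = c("Sce_I&Sce_VIII", "Sce_I&Sce_XV", "Sce_II&Sce_IV"),
    Orientation = c("plus", "minus", "plus")
)

# Split "chrA&chrB" into a 2-column character matrix
# (assumption: chromosome names never contain "&")
chr_pairs <- do.call(rbind, strsplit(toy_blocks$Chr, "&", fixed = TRUE))
toy_blocks$Chr1 <- chr_pairs[, 1]
toy_blocks$Chr2 <- chr_pairs[, 2]

# Example: blocks whose first chromosome is Sce_I
toy_blocks[toy_blocks$Chr1 == "Sce_I", ]
```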

You can even extract both anchor pairs and synteny block data at once in a single data frame.

# Get anchors and block data with `parse_collinearity()`
intrasyn_all <- parse_collinearity(intra_syn, as = "all")
head(intrasyn_all)
#>   Block Block_score            Chr Orientation     Anchor1       Anchor2
#> 1     0         446 Sce_I&Sce_VIII        plus Sce_YAR050W   Sce_YHR211W
#> 2     0         446 Sce_I&Sce_VIII        plus Sce_YAR060C   Sce_YHR212C
#> 3     0         446 Sce_I&Sce_VIII        plus Sce_YAR064W Sce_YHR213W-B
#> 4     0         446 Sce_I&Sce_VIII        plus Sce_YAR066W   Sce_YHR214W
#> 5     0         446 Sce_I&Sce_VIII        plus Sce_YAR068W Sce_YHR214W-A
#> 6     0         446 Sce_I&Sce_VIII        plus Sce_YAR069C Sce_YHR214C-D
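Because the combined table has one row per anchor pair, counting rows per block gives the number of anchor pairs in each block. The sketch below illustrates this with base R on a toy data frame in the same layout; the gene IDs and the object names `toy_all` and `anchors_per_block` are made up for illustration.

```r
# Toy table in the same layout as the `as = "all"` output
# (gene IDs are placeholders, not real yeast genes)
toy_all <- data.frame(
    Block = c(0, 0, 0, 1, 1),
    Block_score = c(446, 446, 446, 528, 528),
    Chr = c(rep("Sce_I&Sce_VIII", 3), rep("Sce_I&Sce_XV", 2)),
    Orientation = c(rep("plus", 3), rep("minus", 2)),
    Anchor1 = paste0("geneA", 1:5),
    Anchor2 = paste0("geneB", 1:5)
)

# Count anchor pairs (rows) per synteny block
anchors_per_block <- as.data.frame(
    table(Block = toy_all$Block),
    responseName = "n_anchors"
)
anchors_per_block
```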

Detecting intergenome synteny

To detect intergenome (or interspecies) synteny, you will use the function interspecies_synteny(), which works very similarly to intraspecies_synteny(). As input, you will need:

A GRangesList object containing the processed annotation for your species of interest (2+ species), as returned by process_input().
A list of data frames with the output of a similarity search program (e.g., DIAMOND or BLAST). The list must contain only intergenome comparisons. It can be obtained with run_diamond(seq, compare = compare_df).

Importantly, to detect intergenome synteny, the MCScanX algorithm requires bidirectional BLAST hits. For instance, if you’re trying to detect synteny between spA and spB, you need to perform DIAMOND/BLAST searches in both directions: spA_spB and spB_spA. You can create a data frame specifying these comparisons as follows:

# Create a data frame with comparisons to be made
comp <- data.frame(
    query = "spA",
    db = "spB"
)
comp
#>   query  db
#> 1   spA spB

# Make comparisons bidirectional
comp_bi <- make_bidirectional(comp)
comp_bi
#>   query  db
#> 1   spA spB
#> 2   spB spA
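To make the idea of bidirectional comparisons concrete, here is a minimal base R sketch of the same operation: append each comparison with query and db swapped, then drop duplicates. The helper `add_reverse()` is hypothetical and written only for illustration. It is not the implementation of make_bidirectional(), and whether that function deduplicates or orders rows in the same way is an assumption.

```r
# Hypothetical helper, for illustration only (not part of syntenet)
add_reverse <- function(comp) {
    # Swap query and db to get the reverse comparisons
    rev_comp <- data.frame(query = comp$db, db = comp$query)
    both <- rbind(comp, rev_comp)
    # Drop comparisons that were already present in both directions
    both <- both[!duplicated(both), , drop = FALSE]
    rownames(both) <- NULL
    both
}

# Toy comparisons: spA against spB and spC
toy_comp <- data.frame(query = c("spA", "spA"), db = c("spB", "spC"))
toy_bi <- add_reverse(toy_comp)
toy_bi
```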

Then, you can perform similarity searches for these comparisons with:

# NOTE: Not executed because object `seq` doesn't exist; for demo only
dmd_inter <- run_diamond(seq, compare = comp_bi)

The code above would generate a list of two data frames with DIAMOND tables: one named spA_spB, and another one named spB_spA.

To demonstrate what this looks like, we will load the example data set we will use to detect intergenome synteny. Here, we will detect syntenic regions between the genomes of the algae Ostreococcus lucimarinus and Ostreococcus sp. RCC809. The list of DIAMOND tables is stored in the object blast_list, and we will create the processed annotation from the objects proteomes and annotation.

# Load list of DIAMOND tables
data(blast_list)
names(blast_list)
#> [1] "Olucimarinus_Olucimarinus" "Olucimarinus_OspRCC809"   
#> [3] "OspRCC809_Olucimarinus"    "OspRCC809_OspRCC809"

algae_inter <- blast_list[c(2,3)] # keep only intergenome comparisons
names(algae_inter)
#> [1] "Olucimarinus_OspRCC809" "OspRCC809_Olucimarinus"
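Selecting by position works here, but it depends on the order of the list. As a hedged alternative, the sketch below selects intergenome comparisons by name, assuming list names follow the `<query>_<db>` pattern seen above. It uses a toy list of empty data frames, and the object names `toy_list`, `pairs`, and `inter_names` are hypothetical.

```r
# Toy list with the same names as `blast_list` (contents left empty)
toy_list <- list(
    Olucimarinus_Olucimarinus = data.frame(),
    Olucimarinus_OspRCC809 = data.frame(),
    OspRCC809_Olucimarinus = data.frame(),
    OspRCC809_OspRCC809 = data.frame()
)

# Build all query-db pairs, then keep those between different species
species <- c("Olucimarinus", "OspRCC809")
pairs <- expand.grid(query = species, db = species, stringsAsFactors = FALSE)
inter <- pairs[pairs$query != pairs$db, ]

# Assumption: list elements are named "<query>_<db>"
inter_names <- paste(inter$query, inter$db, sep = "_")
toy_inter <- toy_list[inter_names]
names(toy_inter)
```

The order of the selected elements follows `inter_names`, so it may differ from the order in the original list.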


# Get processed annotation
data(proteomes)     # A list of `AAStringSet` objects
data(annotation)    # A `GRangesList` object

pdata <- process_input(proteomes, annotation)
names(pdata$annotation)
#> [1] "Olucimarinus" "OspRCC809"

Now that we have a list of DIAMOND tables with bidirectional comparisons between Olucimarinus and OspRCC809, as well as annotation for these two genomes, we can detect synteny with interspecies_synteny(). Like intraspecies_synteny(), this function will return the path to a .collinearity file generated by MCScanX, and we can parse this file with parse_collinearity().

# Detect interspecies synteny
intersyn <- interspecies_synteny(algae_inter, pdata$annotation)

intersyn # see where the .collinearity file is
#> [1] "/tmp/Rtmp4D0cbF/inter/Olucimarinus_OspRCC809.collinearity"

# Parse collinearity file
## 1) Get anchor pairs
algae_anchors <- parse_collinearity(intersyn)
head(algae_anchors)
#>          Anchor1              Anchor2
#> 1 Olu_OL01G00100 Osp_ORCC809_01G06480
#> 2 Olu_OL01G00130 Osp_ORCC809_01G06440
#> 3 Olu_OL01G00150 Osp_ORCC809_01G06420
#> 4 Olu_OL01G00160 Osp_ORCC809_01G06410
#> 5 Olu_OL01G00170 Osp_ORCC809_01G06400
#> 6 Olu_OL01G00180 Osp_ORCC809_01G06390

## 2) Get synteny blocks
algae_blocks <- parse_collinearity(intersyn, as = "blocks")
head(algae_blocks)
#>   Block Block_score                 Chr Orientation
#> 1     0       27234 Olu_Chr_1&Osp_chr_1       minus
#> 2     1         299 Olu_Chr_2&Osp_chr_2        plus
#> 3     2         296 Olu_Chr_2&Osp_chr_2        plus
#> 4     3         287 Olu_Chr_2&Osp_chr_2        plus
#> 5     4        7548 Olu_Chr_2&Osp_chr_2       minus
#> 6     5        4578 Olu_Chr_2&Osp_chr_2       minus

## 3) Get all data combined (blocks + anchor pairs)
algae_all <- parse_collinearity(intersyn, as = "all")
head(algae_all)
#>   Block Block_score                 Chr Orientation        Anchor1
#> 1     0       27234 Olu_Chr_1&Osp_chr_1       minus Olu_OL01G00100
#> 2     0       27234 Olu_Chr_1&Osp_chr_1       minus Olu_OL01G00130
#> 3     0       27234 Olu_Chr_1&Osp_chr_1       minus Olu_OL01G00150
#> 4     0       27234 Olu_Chr_1&Osp_chr_1       minus Olu_OL01G00160
#> 5     0       27234 Olu_Chr_1&Osp_chr_1       minus Olu_OL01G00170
#> 6     0       27234 Olu_Chr_1&Osp_chr_1       minus Olu_OL01G00180
#>                Anchor2
#> 1 Osp_ORCC809_01G06480
#> 2 Osp_ORCC809_01G06440
#> 3 Osp_ORCC809_01G06420
#> 4 Osp_ORCC809_01G06410
#> 5 Osp_ORCC809_01G06400
#> 6 Osp_ORCC809_01G06390
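Block scores in the output above span a wide range, so a natural follow-up is to rank blocks by score and tally their orientations. The sketch below does this in base R on a toy data frame that copies the six blocks printed above; the object names are hypothetical, and nothing here is specific to syntenet.

```r
# Toy block table copying the six rows printed above
toy_algae_blocks <- data.frame(
    Block = 0:5,
    Block_score = c(27234, 299, 296, 287, 7548, 4578),
    Chr = c("Olu_Chr_1&Osp_chr_1", rep("Olu_Chr_2&Osp_chr_2", 5)),
    Orientation = c("minus", "plus", "plus", "plus", "minus", "minus")
)

# Rank blocks from highest to lowest score
ranked <- toy_algae_blocks[
    order(toy_algae_blocks$Block_score, decreasing = TRUE),
]
ranked

# Number of blocks in each orientation
orientation_counts <- table(ranked$Orientation)
orientation_counts
```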

Session information

This document was created under the following conditions:

sessionInfo()
#> R version 4.5.1 (2025-06-13)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.2 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] syntenet_1.11.2  BiocStyle_2.37.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] sass_0.4.10           generics_0.1.4        lattice_0.22-7       
#>  [4] digest_0.6.37         magrittr_2.0.3        statnet.common_4.12.0
#>  [7] intergraph_2.0-4      evaluate_1.0.4        grid_4.5.1           
#> [10] RColorBrewer_1.1-3    bookdown_0.43         fastmap_1.2.0        
#> [13] jsonlite_2.0.0        ggnetwork_0.5.13      network_1.19.0       
#> [16] BiocManager_1.30.26   scales_1.4.0          Biostrings_2.77.2    
#> [19] codetools_0.2-20      textshaping_1.0.1     jquerylib_0.1.4      
#> [22] cli_3.6.5             rlang_1.1.6           crayon_1.5.3         
#> [25] XVector_0.49.0        cachem_1.1.0          yaml_2.3.10          
#> [28] tools_4.5.1           parallel_4.5.1        BiocParallel_1.43.4  
#> [31] coda_0.19-4.1         dplyr_1.1.4           ggplot2_3.5.2        
#> [34] BiocGenerics_0.55.0   vctrs_0.6.5           R6_2.6.1             
#> [37] stats4_4.5.1          lifecycle_1.0.4       Seqinfo_0.99.1       
#> [40] S4Vectors_0.47.0      fs_1.6.6              htmlwidgets_1.6.4    
#> [43] IRanges_2.43.0        ragg_1.4.0            pkgconfig_2.0.3      
#> [46] desc_1.4.3            gtable_0.3.6          pkgdown_2.1.3        
#> [49] pillar_1.11.0         bslib_0.9.0           glue_1.8.0           
#> [52] Rcpp_1.1.0            systemfonts_1.2.3     tidyselect_1.2.1     
#> [55] xfun_0.52             tibble_3.3.0          GenomicRanges_1.61.1 
#> [58] knitr_1.50            farver_2.1.2          igraph_2.1.4         
#> [61] htmltools_0.5.8.1     rmarkdown_2.29        pheatmap_1.0.13      
#> [64] compiler_4.5.1

References

Wang, Yupeng, Haibao Tang, Jeremy D DeBarry, Xu Tan, Jingping Li, Xiyin Wang, Tae-ho Lee, et al. 2012. “MCScanX: A Toolkit for Detection and Evolutionary Analysis of Gene Synteny and Collinearity.” Nucleic Acids Research 40 (7): e49.