Accessing data from the PLAZA database

Introduction

PLAZA (Van Bel et al. 2022) is a database for plant comparative and functional genomics, centralizing data produced by different genome sequencing initiatives. Like Ensembl (Dyer et al. 2025), PLAZA organizes data in instances based on clades of the (green) tree of life (PLAZA Dicots, PLAZA Monocots, PLAZA Diatoms, and pico-PLAZA). Besides its rich web server where users can interactively search and explore plant genomic data, PLAZA offers bulk download of data, including whole-genome and locus sequences, gene annotations, functional annotations, homology, etc. rplaza provides an R interface to the PLAZA database, allowing the direct import of PLAZA data to the R session using standard R/Bioconductor classes, allowing seamless interoperability and integration with Bioconductor packages.

Data types provided by PLAZA and how they are stored for interoperability with R/Bioconductor packages are summarized below:

Data	File format	R class
Sequences	FASTA	AAStringSet/DNAStringSet
Ranges	GFF3	GRanges
Functions and homology	TSV	data.frame

Installation

rplaza can be installed from Bioconductor with the following code:

if(!requireNamespace('BiocManager', quietly = TRUE)) {
    install.packages('BiocManager')
}

BiocManager::install("rplaza")

# Load package after installation
library(rplaza)

set.seed(123) # for reproducibility

PLAZA instances and species (or fantastic plants and where to find them)

The PLAZA database is divided in four instances that are specialized in different clades:

PLAZA Dicots: focus on eudicot plants (N = 98 species);
PLAZA Monocots: focus on monocot plants (N = 52 species);
pico-PLAZA: focus on eukaryotic micro-organisms (N = 39 species);
PLAZA Diatoms: focus on diatoms (N = 26 species).

To access data on what species can be found in which instance, as well as associated metadata (taxonomic information, data sources, etc), you can inspect the object plaza_metadata, which comes with rplaza. This object is a list of data frames, one for each PLAZA instance.

# Load PLAZA metadata
data("plaza_metadata")

names(plaza_metadata)
#> [1] "Dicots"   "Monocots" "Diatoms"  "Pico"

As you can see, the list has four data frames. If you want to have a look at the species in PLAZA Dicots, for instance, you need to extract the corresponding data frame from the list.

# Look at metadata for 
head(plaza_metadata$Dicots)
#>   PLAZA_ID         common_name           NCBI_name NCBI_taxid          family
#> 1      aag Anthoceros agrestis Anthoceros agrestis      41834 Anthocerotaceae
#> 2      aar Aethionema arabicum Aethionema arabicum     228871    Brassicaceae
#> 3   acertr      Acer truncatum      Acer truncatum      47965     Sapindaceae
#> 4      ach Actinidia chinensis Actinidia chinensis       3625   Actinidiaceae
#> 5      aly  Arabidopsis lyrata  Arabidopsis lyrata      59689    Brassicaceae
#> 6      ama    Avicennia marina    Avicennia marina      82927     Acanthaceae
#>            order            class        source
#> 1 Anthocerotales Anthocerotopsida          v1.1
#> 2    Brassicales    Magnoliopsida          v3.1
#> 3     Sapindales    Magnoliopsida          v2.0
#> 4       Ericales    Magnoliopsida          V3.0
#> 5    Brassicales    Magnoliopsida      JGI v2.1
#> 6       Lamiales    Magnoliopsida v1.1 20200527
#>                                                                    data_provider
#> 1                                           https://www.hornworts.uzh.ch/en.html
#> 2                        https://genomevolution.org/coge/GenomeInfo.pl?gid=37218
#> 3                                https://doi.org/10.6084/m9.figshare.12986237.v2
#> 4 https://figshare.com/articles/dataset/Actinidia_chinensis_genome_data/10046558
#> 5           https://phytozome.jgi.doe.gov/pz/portal.html#!info?alias=Org_Alyrata
#> 6                https://datadryad.org/stash/dataset/doi:10.5061/dryad.3j9kd51f5
#>   pubmed_id
#> 1  32170292
#> 2  31554715
#> 3  32772482
#> 4  31645971
#> 5  26382944
#> 6  33561229

Note that PLAZA gives unique IDs to each species, and these are available in column PLAZA_ID of each data frame. These PLAZA species IDs are essential to work with rplaza, as most functions in this package require specifying species using PLAZA IDs.

It is also important to keep in mind that some species are present in multiple PLAZA instances (e.g., Arabidopsis thaliana), so if you’re combining these data frames in a single data frame, you need to remove duplicate rows. For example:

# Combine all data frames together and remove redundancy
plaza_all <- dplyr::bind_rows(plaza_metadata, .id = "instance") |>
    dplyr::distinct(PLAZA_ID, .keep_all = TRUE)

head(plaza_all)
#>   instance PLAZA_ID         common_name           NCBI_name NCBI_taxid
#> 1   Dicots      aag Anthoceros agrestis Anthoceros agrestis      41834
#> 2   Dicots      aar Aethionema arabicum Aethionema arabicum     228871
#> 3   Dicots   acertr      Acer truncatum      Acer truncatum      47965
#> 4   Dicots      ach Actinidia chinensis Actinidia chinensis       3625
#> 5   Dicots      aly  Arabidopsis lyrata  Arabidopsis lyrata      59689
#> 6   Dicots      ama    Avicennia marina    Avicennia marina      82927
#>            family          order            class        source
#> 1 Anthocerotaceae Anthocerotales Anthocerotopsida          v1.1
#> 2    Brassicaceae    Brassicales    Magnoliopsida          v3.1
#> 3     Sapindaceae     Sapindales    Magnoliopsida          v2.0
#> 4   Actinidiaceae       Ericales    Magnoliopsida          V3.0
#> 5    Brassicaceae    Brassicales    Magnoliopsida      JGI v2.1
#> 6     Acanthaceae       Lamiales    Magnoliopsida v1.1 20200527
#>                                                                    data_provider
#> 1                                           https://www.hornworts.uzh.ch/en.html
#> 2                        https://genomevolution.org/coge/GenomeInfo.pl?gid=37218
#> 3                                https://doi.org/10.6084/m9.figshare.12986237.v2
#> 4 https://figshare.com/articles/dataset/Actinidia_chinensis_genome_data/10046558
#> 5           https://phytozome.jgi.doe.gov/pz/portal.html#!info?alias=Org_Alyrata
#> 6                https://datadryad.org/stash/dataset/doi:10.5061/dryad.3j9kd51f5
#>   pubmed_id
#> 1  32170292
#> 2  31554715
#> 3  32772482
#> 4  31645971
#> 5  26382944
#> 6  33561229
nrow(plaza_all)
#> [1] 171

After combining all data frames together and removing duplicates, we can see that there are 171 unique species across all PLAZA instances.

A note on software design

rplaza uses BiocFileCache to create a temporary local cache of remote resources per R session. This way, if you try to retrieve the same data multiple times in a single R session, you will only download data once. Caching is especially important in two common scenarios:

You downloaded data for species A and B, but later decide to include species C and D as well. If you re-run the code to obtain data for species A, B, C, and D, you will only download data for species C and D, as data for species A and B are already stored in a local cache.
You tried to download data for several species, but after some time your download failed for some reason (internet issues, timeout, etc). If you re-run the code to download data, your download will start from where it failed, so it will finish much faster.

Obtaining sequence data

PLAZA provides sequences for whole genomes and whole-genome loci (all proteins, all transcripts, and all CDS in a genome). These are available as FASTA files, which rplaza reads as DNAStringSet objects (or AAStringSet objects for proteins).

Genomes

To extract whole-genome sequences for a particular species, you will use the function get_genome(). For a single species, get_genome() takes a species ID and returns a DNAStringSet object. To demonstrate it, let’s obtain the whole genome of Arabidopsis thaliana.

# Retrieve genome of Arabidopsis thaliana
ath_genome <- get_genome("ath")

ath_genome
#> DNAStringSet object of length 7:
#>        width seq                                            names               
#> [1] 30427671 CCCTAAACCCTAAACCCTAAAC...TTTAGGGTTTAGGGTTTAGGG Chr1
#> [2] 19698289 NNNNNNNNNNNNNNNNNNNNNN...TTTAGGGTTTAGGGTTTAGGG Chr2
#> [3] 23459830 NNNNNNNNNNNNNNNNNNNNNN...TAAACCCTAAACCCTAAACCC Chr3
#> [4] 18585056 NNNNNNNNNNNNNNNNNNNNNN...GTTTAGGGTTTAGGGTTTAGG Chr4
#> [5] 26975502 TATACCATGTACCCTCAACCTT...AGGATTTAGGGTTTTTAGATC Chr5
#> [6]   154478 ATGGGCGAACGACGGGAATTGA...AATAACTTGGTCCCGGGCATC ChrC
#> [7]   366924 GGATCCGTTCGAAACAGGTTAG...AGAATGGAAACAAACCGGATT ChrM

For multiple species, get_genome() takes a vector of species IDs and returns a list of DNAStringSet objects.

# Retrieve genome of A. thaliana and A. lyrata
arabidopsis_genomes <- get_genome(c("ath", "aly"))

arabidopsis_genomes
#> $ath
#> DNAStringSet object of length 7:
#>        width seq                                            names               
#> [1] 30427671 CCCTAAACCCTAAACCCTAAAC...TTTAGGGTTTAGGGTTTAGGG Chr1
#> [2] 19698289 NNNNNNNNNNNNNNNNNNNNNN...TTTAGGGTTTAGGGTTTAGGG Chr2
#> [3] 23459830 NNNNNNNNNNNNNNNNNNNNNN...TAAACCCTAAACCCTAAACCC Chr3
#> [4] 18585056 NNNNNNNNNNNNNNNNNNNNNN...GTTTAGGGTTTAGGGTTTAGG Chr4
#> [5] 26975502 TATACCATGTACCCTCAACCTT...AGGATTTAGGGTTTTTAGATC Chr5
#> [6]   154478 ATGGGCGAACGACGGGAATTGA...AATAACTTGGTCCCGGGCATC ChrC
#> [7]   366924 GGATCCGTTCGAAACAGGTTAG...AGAATGGAAACAAACCGGATT ChrM
#> 
#> $aly
#> DNAStringSet object of length 695:
#>          width seq                                          names               
#>   [1] 33132539 TCACTTTCACTTCTGGATGCT...GTCTAAGCAACTTTCCATAA scaffold_1
#>   [2] 19320864 CTAAACCCTAAACCCTAAACC...TGACAAGTGTAATATCTTGA scaffold_2
#>   [3] 24464547 AAAAAAGGGGGTTCGGATAGA...TATTGGATGAGTTTCAGTAC scaffold_3
#>   [4] 23328337 CGATGGAGTTTTTCAGCAGTT...TTTAGGGTTTAGGGTTTAGG scaffold_4
#>   [5] 21221946 CCCTGAAGCCTGAACCCTGAA...TTAAGGGTTAGGGGTTAGGG scaffold_5
#>   ...      ... ...
#> [691]     1047 GTGGTGGGCGAACAAAACAAG...TGTGCCAAAATAAATATACT scaffold_1112
#> [692]     1037 TATAAGAGCTTTTTGGAGCTT...TTGTAAGGGTTCCCAAAGCC scaffold_1113
#> [693]     1032 CAAATCATCTGCATGTGGCTT...TCTATAAATAGTTTATATAT scaffold_1116
#> [694]     1017 CAATGAACACATAAAATGCAA...AGACAGCACCCCTCCGTGGG scaffold_1117
#> [695]     1005 CTTTGTTCTTTTTATTTTCAA...TCCCAACTATCCGAATTGAG scaffold_1118

To learn more about how to work with DNAStringSet objects, read the vignette of the Biostrings package.

Loci

To retrieve sequences for all proteins, transcripts, or CDS in the genome of a particular species, you will use the function get_sequences(). Besides specifying one or multiple species IDs, get_sequences() also requires the locus type (one of ‘CDS’, ‘transcript’, or ‘protein’), and what transcripts to extract (one of ‘all’ or ‘longest’).

For example, researchers in evolutionary genomics often want to obtain protein sequences of the longest transcripts of each gene, so that they can compare genomes. These are usually called ‘proteomes’ by software tools such as OrthoFinder (Emms and Kelly 2019) and syntenet (Almeida-Silva et al. 2023). To do that, you’d run:

# Retrieve protein sequences of the longest transcripts (one per gene)
proteome <- get_sequences("ath", type = "protein", transcripts = "longest")

proteome
#> AAStringSet object of length 27615:
#>         width seq                                           names               
#>     [1]   430 MEDQVGFGFRPNDEELVGHYL...NMIIGVLLFISVISWIILVG* AT1G01010.1 | AT1...
#>     [2]   199 MIIFIDLILHRPKVYRHVLYN...IFLIQIGSLLQYMSYFFRIV* AT1G01020.5 | AT1...
#>     [3]   359 MDLSLAPTTTTSSDQEQDRDQ...GDDDQFAKKGKSSLSLNFNP* AT1G01030.1 | AT1...
#>     [4]  1910 MVMEDEPREATIKPSYWLDAC...KKAKDSAAVLLLELLNKTFS* AT1G01040.1 | AT1...
#>     [5]   213 MSEETKDNQRLQRPAPRLNER...EAIQYSMDLYAEYILHTLRR* AT1G01050.1 | AT1...
#>     ...   ... ...
#> [27611]   146 MTKREYNSQPEMKEEVLAYLL...CFYPIVQTFLFLMGGGGGPA* ATMG01350.1 | ATM...
#> [27612]   528 MKNLVRWLFSTNHKDIGTLYF...PAFHTFGELPAIKETKSYVK* ATMG01360.1 | ATM...
#> [27613]   112 MKISYFIRRGKKTSRRSHFIK...HPSIVHFDCLLFFLDTTPCL* ATMG01370.1 | ATM...
#> [27614]   106 MKYHFSSMEPWWKREFSFCIP...VERSRSNSLNSGPNPLENAT* ATMG01400.1 | ATM...
#> [27615]   205 MFGGRRRRLPQDGTFNQTQPF...REIFPRESGVLSLMSVPSLA* ATMG01410.1 | ATM...

Alternatively, one might be interested in all transcripts produced by a particular genome. In transcriptomics, this is often called a ‘reference transcriptome’, and it is required by software tools that quantify transcript abundances, such as salmon (Patro et al. 2017). To do that, you would run:

# Obtain all transcripts produced by a genome (aka 'reference transcriptome')
ref_tx <- get_sequences("ath", type = "transcript", transcripts = "all")

ref_tx
#> DNAStringSet object of length 91393:
#>         width seq                                           names               
#>     [1]  1688 AAATTATTAGATATACCAAAC...TGATATATAGTCTTGTATAAG AT1G01010.1 | AT1...
#>     [2]  1306 AATTTATTTTCTTATAAACTT...AATTGAATTAGACCTTACTAA AT1G01020.5 | AT1...
#>     [3]  1905 ATATCATTCATGGGCCCCTAC...TCTTTTCAATTTTGTATATAA AT1G01030.1 | AT1...
#>     [4]  6276 GTGGAAAACAGACCAGAAGAG...AAACCAAAAATTTCAATGCCA AT1G01040.1 | AT1...
#>     [5]   207 TACTTTTCTAATATCACGAGG...CTTCTATAATGTTATTTATTA AT1G01046.1 | AT1...
#>     ...   ... ...
#> [91389]   936 CAAGCCCAAAATAGACAACAT...CTGCATCTGTTTTCAACTTAA AT5G67520.3 | AT5...
#> [91390]  1945 CCGAGAGAAAGCAGAAAAAAA...TATACAAATGCTTTTGTGACT AT5G67540.1 | AT5...
#> [91391]  2024 AGAAAAAAAAAAGAAAAAAAA...AGTAAATGGCGGCTGCCTTAA AT5G67540.3 | AT5...
#> [91392]  1370 CTTAGCTTGAATACAAAAGAA...TGTGGATGATCTGTTTAGTAA AT5G67580.2 | AT5...
#> [91393]  1814 CTGAAACCCACAAGGAAGAAA...TCTTTTAGTATGGATATTAAT AT5G67610.2 | AT5...

Note that the number of sequences in ref_tx is much larger than the number of sequences in proteome, since the former contains sequences for all transcripts (including multiple isoforms per gene).

To learn more about how to work with AAStringSet and DNAStringSet objects, read the vignette of the Biostrings package.

Obtaining gene annotations (a.k.a. ‘ranges’)

The obtain the genomic locations (ranges, with start and end coordinates) of all genes in the genome of a particular species, you will use the function get_annotation(). This function returns a GRanges object (or GRangesList for multiple species) with genomic coordinates and associated metadata. As input, you need to specify the PLAZA species ID of your species of interest (one or multiple), what transcripts you want to extract (‘all’ or ‘longest’, exactly as in get_sequences()), and what features to extract per transcript (‘all’ - including exons, CDS, UTRs, etc. - or ‘exons’).

For example, if you want to extract ranges for the complete genome annotation (i.e., all transcripts, all features), you’d run:

# Extract ranges for all features (exons, CDS, UTRs, etc) of all transcripts
annotation_all <- get_annotation("ath", transcripts = "all", features = "all")

annotation_all
#> GRanges object with 835565 ranges and 14 metadata columns:
#>            seqnames        ranges strand |    source           type     score
#>               <Rle>     <IRanges>  <Rle> |  <factor>       <factor> <numeric>
#>        [1]     Chr1     3631-5899      + | Araport11 gene                  NA
#>        [2]     Chr1     3631-5899      + | Araport11 mRNA                  NA
#>        [3]     Chr1     3631-3913      + | Araport11 exon                  NA
#>        [4]     Chr1     3631-3759      + | Araport11 five_prime_UTR        NA
#>        [5]     Chr1     3760-3913      + | Araport11 CDS                   NA
#>        ...      ...           ...    ... .       ...            ...       ...
#>   [835561]     ChrM 363725-364042      + | Araport11           CDS         NA
#>   [835562]     ChrM 366086-366700      - | Araport11           gene        NA
#>   [835563]     ChrM 366086-366700      - | Araport11           mRNA        NA
#>   [835564]     ChrM 366086-366700      - | Araport11           exon        NA
#>   [835565]     ChrM 366086-366700      - | Araport11           CDS         NA
#>                phase                     ID     uniprot      symbol         tid
#>            <integer>            <character> <character> <character> <character>
#>        [1]      <NA>              AT1G01010      Q0WV96      NAC001 AT1G01010.1
#>        [2]      <NA>            AT1G01010.1        <NA>        <NA>        <NA>
#>        [3]      <NA>     AT1G01010.1:exon:1        <NA>        <NA>        <NA>
#>        [4]      <NA> AT1G01010.1:five_pri..        <NA>        <NA>        <NA>
#>        [5]         0      AT1G01010.1:CDS:1        <NA>        <NA>        <NA>
#>        ...       ...                    ...         ...         ...         ...
#>   [835561]         0      ATMG01400.1:CDS:1        <NA>        <NA>        <NA>
#>   [835562]      <NA>              ATMG01410      P92567      ORF204 ATMG01410.1
#>   [835563]      <NA>            ATMG01410.1        <NA>        <NA>        <NA>
#>   [835564]      <NA>     ATMG01410.1:exon:1        <NA>        <NA>        <NA>
#>   [835565]         0      ATMG01410.1:CDS:1        <NA>        <NA>        <NA>
#>                                     Alias        Name     gene_id
#>                           <CharacterList> <character> <character>
#>        [1] ANAC001,NAC domain containin..   AT1G01010   AT1G01010
#>        [2]                                  AT1G01010   AT1G01010
#>        [3]                                  AT1G01010   AT1G01010
#>        [4]                                  AT1G01010   AT1G01010
#>        [5]                                  AT1G01010   AT1G01010
#>        ...                            ...         ...         ...
#>   [835561]                                  ATMG01400   ATMG01400
#>   [835562]                                  ATMG01410   ATMG01410
#>   [835563]                                  ATMG01410   ATMG01410
#>   [835564]                                  ATMG01410   ATMG01410
#>   [835565]                                  ATMG01410   ATMG01410
#>                     Parent   full_name            name
#>            <CharacterList> <character> <CharacterList>
#>        [1]                        <NA>                
#>        [2]       AT1G01010        <NA>                
#>        [3]     AT1G01010.1        <NA>                
#>        [4]     AT1G01010.1        <NA>                
#>        [5]     AT1G01010.1        <NA>                
#>        ...             ...         ...             ...
#>   [835561]     ATMG01400.1        <NA>                
#>   [835562]                        <NA>                
#>   [835563]       ATMG01410        <NA>                
#>   [835564]     ATMG01410.1        <NA>                
#>   [835565]     ATMG01410.1        <NA>                
#>   -------
#>   seqinfo: 7 sequences from an unspecified genome; no seqlengths

To learn more about how to work with GRanges objects, read the vignette of the GenomicRanges package.

Obtaining functional annotations

PLAZA offers functional annotation for all genes in all species using GO terms, MapMan pathways, and InterPro domains. To retrieve functional annotation for all genes of a particular species, you will use the function get_functional_annotation(). As input, this function takes one or multiple PLAZA species IDS and the ontology to use (one of ‘GO’, ‘MapMan’, or ‘InterPro’), and it returns a data frame (or list of data frames, in case of multiple species) with terms associated with each gene.

For example, to obtain InterPro domains annotated to each gene of Arabidopsis thaliana, you’d use:

# Get InterPro domains
ath_interpro <- get_functional_annotation("ath", ontology = "InterPro")

head(ath_interpro)
#>     gene_id species  motif_id            description start stop       score
#> 1 AT1G01010     ath IPR003441             NAC domain     5  137 4.60000e-26
#> 2 AT1G01010     ath IPR036093 NAC domain superfamily     5  156 4.84000e-37
#> 3 AT1G01020     ath IPR007290           Arv1 protein     1  146 1.50000e-21
#> 4 AT1G01020     ath IPR007290           Arv1 protein     1  176 9.20000e-68
#> 5 AT1G01030     ath IPR003340  B3 DNA binding domain    56  162 6.30000e-22
#> 6 AT1G01030     ath IPR003340  B3 DNA binding domain    54  150 2.13236e-27
#>                                 comment
#> 1          source=Pfam,domainId=PF02365
#> 2 source=SUPERFAMILY,domainId=SSF101941
#> 3          source=Pfam,domainId=PF04161
#> 4     source=PANTHER,domainId=PTHR14467
#> 5         source=SMART,domainId=SM01019
#> 6           source=CDD,domainId=cd10017

As another example, to obtain GO annotations for the Chlorophyta algae Chlamydomonas reinhardtii and Micromonas commoda, you’d use:

# Obtain functional annotation for Chlorophyta algae
algae_go <- get_functional_annotation(c("cre", "mco"), ontology = "GO")

names(algae_go)
#> [1] "cre" "mco"
head(algae_go$cre)
#>         gene_id species         go evidence go_source     provider
#> 1 Cre01.g000050     cre GO:0003674      IEA   primary interproscan
#> 2 Cre01.g000050     cre GO:0140110      IEA   primary interproscan
#> 3 Cre01.g000050     cre GO:0006355      IEA   primary          goa
#> 4 Cre01.g000050     cre GO:0006139      IEA   primary          goa
#> 5 Cre01.g000050     cre GO:0006351      IEA   primary          goa
#> 6 Cre01.g000050     cre GO:0006725      IEA   primary          goa
#>                                       comment
#> 1                                            
#> 2                                            
#> 3 assigned_by=GOC;db_reference=GO_REF:0000108
#> 4 assigned_by=GOC;db_reference=GO_REF:0000108
#> 5 assigned_by=GOC;db_reference=GO_REF:0000108
#> 6 assigned_by=GOC;db_reference=GO_REF:0000108
#>                                        description propagated_from_child
#> 1                               molecular_function                     1
#> 2                 transcription regulator activity                     1
#> 3       regulation of transcription, DNA-templated                     0
#> 4 nucleobase-containing compound metabolic process                     1
#> 5                     transcription, DNA-templated                     1
#> 6     cellular aromatic compound metabolic process                     1

Lastly, to obtain MapMan annotations for all A. thaliana genes, you’d use:

# Retrieve MapMan annotations for all *A. thaliana* genes
ath_mapman <- get_functional_annotation("ath", ontology = "MapMan")

head(ath_mapman)
#>        id species   gene_id  mapman_id
#> 1 3508586     ath AT1G01020       35.1
#> 2 3499151     ath AT1G01030   15.5.5.3
#> 3 3500934     ath AT1G01040 16.9.2.1.1
#> 4 3508292     ath AT1G01050     27.6.1
#> 5 3508030     ath AT1G01060     27.1.1
#> 6 3506530     ath AT1G01070   24.2.1.5
#>                                                                                                   mapman_description
#> 1                                                                                             not assigned.annotated
#> 2 RNA biosynthesis.transcriptional regulation.B3  transcription factor superfamily.transcription factor (RAV/NGATHA)
#> 3     RNA processing.mRNA silencing.miRNA pathway.DCL1-HYL1 miRNA biogenesis complex.endoribonuclease component DCL1
#> 4                                       Multi-process regulation.pyrophosphate homeostasis.cytosolic pyrophosphatase
#> 5                                 Multi-process regulation.circadian clock system.core oscillator protein (LHY/CCA1)
#> 6                            Solute transport.carrier-mediated transport.DMT superfamily.solute transporter (UmamiT)

Obtaining gene IDs and descriptions

Researchers often have transcript IDs for a particular genome and want to obtain their corresponding gene IDs. PLAZA provides correspondence tables of transcript-to-gene ID mappings for all species, which can be very helpful in this context. These can be retrieved with get_tx2gene, which takes as input PLAZA species IDs and what transcripts to retrieve (one of ‘all’ or ‘longest’).

For example, to obtain transcript-to-gene ID mappings for Arabidopsis thaliana, you’d run:

# Get tx2gene mappings for *A. thaliana*
ath_tx2gene <- get_tx2gene("ath", transcripts = "all")

head(ath_tx2gene)
#>   transcript_id   gene_id
#> 1   AT1G01020.5 AT1G01020
#> 2   AT1G01030.1 AT1G01030
#> 3   AT1G01040.1 AT1G01040
#> 4   AT1G01046.1 AT1G01046
#> 5   AT1G01050.1 AT1G01050
#> 6   AT1G01060.6 AT1G01060

A similar problem occurs when researchers have gene symbols (e.g., FLC and MAF2 in A. thaliana) and want to extract their gene IDs. To help in this case, PLAZA offers ID conversion tables, which can be accessed with the function get_id_conversions(), as demonstrated below:

# Get ID conversion table for *A. thaliana*
ath_ids <- get_id_conversions("ath")

head(ath_ids)
#>   original_id alt_id_type      alt_id
#> 1   AT1G01010      symbol      NAC001
#> 2   AT1G01010         tid AT1G01010.1
#> 3   AT1G01010     uniprot      Q0WV96
#> 4   AT1G01020      symbol        ARV1
#> 5   AT1G01020         tid AT1G01020.5
#> 6   AT1G01020     uniprot      Q5MK24

Lastly, researchers are often interested in short (usually one sentence) descriptions for genes to help them interpret their results. These can be obtained from PLAZA with the function get_descriptions(), which takes one or multiple PLAZA species IDs as input.

For example, to obtain gene descriptions for all A. thaliana genes, you’d run:

# Get gene descriptions for *A. thaliana*
ath_descriptions <- get_descriptions("ath")

head(ath_descriptions)
#>        gene                                       description
#> 1 AT1G01020                               ARV1 family protein
#> 2 AT1G01030 AP2/B3-like transcriptional factor family protein
#> 3 AT1G01040                                      dicer-like 1
#> 4 AT1G01046                     microRNA ath-MIR838 precursor
#> 5 AT1G01050                               pyrophosphorylase 1
#> 6 AT1G01060              Homeodomain-like superfamily protein

Obtaining homology data

For each PLAZA instance, all (coding) genes in all species are grouped in homologous and orthologous gene families. These gene families, sometimes referred to as ‘orthogroups’ or ‘hierarchical orthologous groups (HOGs)’, are the basis for comparative analyses in evolutionary genomics. To obtain gene family assignments for all genes in a PLAZA instance, you will use the function get_family_assignments(). As input, you need to pass the PLAZA instance (one of ‘Dicots’, ‘Monocots’, ‘Diatoms’, or ‘Pico’), and the gene family type (one of ‘homo’ for homologous, or ‘ortho’ for orthologous).

For example, to retrieve all genes and their homologous family assignments in PLAZA Dicots, you’d run:

# Get family assignments from PLAZA Dicots
dicots_fams <- get_family_assignments(instance = "Dicots", type = "homo")

head(dicots_fams)
#>      family_id species                                gene_id
#> 1 HOM05D000001     aag       AagrBONN_evm.TU.Sc2ySwM_344.5580
#> 2 HOM05D000001     aag       AagrBONN_evm.TU.Sc2ySwM_344.5581
#> 3 HOM05D000001     aag AagrBONN_evm.TU.Sc2ySwM_344.5582__5583
#> 4 HOM05D000001     aag       AagrBONN_evm.TU.Sc2ySwM_344.5586
#> 5 HOM05D000001     aag       AagrBONN_evm.TU.Sc2ySwM_344.5587
#> 6 HOM05D000001     aag       AagrBONN_evm.TU.Sc2ySwM_344.5588

Once you have some homologous or orthologous gene families of interest (e.g., families that expanded in a particular clade), you would probably want to know the functions of the genes in these families. Luckily for you, PLAZA stores data on overrepresented functions in its gene families. These can be accessed with the function get_family_functions(), but they are only available for PLAZA Dicots and PLAZA Monocots.

For example, to retrieve overrepresented functions in all homologous gene families in PLAZA Dicots, you’d run:

# Get overrepresented functions in PLAZA Dicots families
dicots_fams_func <- get_family_functions(instance = "Dicots", type = "homo")

head(dicots_fams_func)
#>      family_id    term_id                                    term_description
#> 1 HOM05D000001 GO:0005488                                             binding
#> 2 HOM05D000001  IPR011990   Tetratricopeptide-like helical domain superfamily
#> 3 HOM05D000001  IPR002885                            Pentatricopeptide repeat
#> 4 HOM05D000002  IPR013103 Reverse transcriptase, RNA-dependent DNA polymerase
#> 5 HOM05D000002 GO:0003676                                nucleic acid binding
#> 6 HOM05D000002 GO:0097159                     organic cyclic compound binding
#>   ontology      score
#> 1       go   28.33542
#> 2 interpro  420.04453
#> 3 interpro  594.17669
#> 4 interpro 1000.07958
#> 5       go   33.32392
#> 6       go   16.16927

Lastly, PLAZA also stores data on block duplicates (anchor pairs in syntenic regions, derived from segmental or whole-genome duplications) for all its genomes. These data can be retrieved with the function get_block_duplicates(), which takes one or multiple PLAZA species IDs as input.

For example, to retrieve all A. thaliana block duplicates, you’d run:

# Get block duplicates for *A. thaliana*
ath_blocks <- get_block_duplicates("ath")

head(ath_blocks)
#>     anchor1   anchor2 multiplicon_id multiplicon_level Ks
#> 1 AT2G41290 AT3G57010           4569                 2 NA
#> 2 AT2G41310 AT3G57040           4569                 2 NA
#> 3 AT2G41451 AT3G57200           4569                 2 NA
#> 4 AT2G41490 AT3G57220           4569                 2 NA
#> 5 AT2G41560 AT3G57330           4569                 2 NA
#> 6 AT2G41620 AT3G57350           4569                 2 NA

Note that the Ks column is empty, meaning that synonymous substitution rates ( $K_s$ ) were not calculated for this genome.

Session information

This document was created under the following conditions:

#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.5.1 (2025-06-13)
#>  os       Ubuntu 24.04.2 LTS
#>  system   x86_64, linux-gnu
#>  ui       X11
#>  language en
#>  collate  en_US.UTF-8
#>  ctype    en_US.UTF-8
#>  tz       UTC
#>  date     2025-07-28
#>  pandoc   3.7.0.2 @ /usr/bin/ (via rmarkdown)
#>  quarto   1.6.42 @ /usr/local/bin/quarto
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package              * version   date (UTC) lib source
#>  abind                  1.4-8     2024-09-12 [1] RSPM (R 4.5.0)
#>  Biobase                2.69.0    2025-04-15 [1] Bioconductor 3.22 (R 4.5.0)
#>  BiocFileCache          2.99.5    2025-05-18 [1] Bioconductor 3.22 (R 4.5.0)
#>  BiocGenerics           0.55.0    2025-04-15 [1] Bioconductor 3.22 (R 4.5.0)
#>  BiocIO                 1.19.0    2025-04-15 [1] Bioconductor 3.22 (R 4.5.0)
#>  BiocManager            1.30.26   2025-06-05 [2] CRAN (R 4.5.1)
#>  BiocParallel           1.43.4    2025-06-16 [1] Bioconductor 3.22 (R 4.5.1)
#>  BiocStyle            * 2.37.0    2025-04-15 [1] Bioconductor 3.22 (R 4.5.0)
#>  Biostrings             2.77.2    2025-06-22 [1] Bioconductor 3.22 (R 4.5.1)
#>  bit                    4.6.0     2025-03-06 [1] RSPM (R 4.5.0)
#>  bit64                  4.6.0-1   2025-01-16 [1] RSPM (R 4.5.0)
#>  bitops                 1.0-9     2024-10-03 [1] RSPM (R 4.5.0)
#>  blob                   1.2.4     2023-03-17 [1] RSPM (R 4.5.0)
#>  bookdown               0.43      2025-04-15 [1] RSPM (R 4.5.0)
#>  bslib                  0.9.0     2025-01-30 [2] RSPM (R 4.5.0)
#>  cachem                 1.1.0     2024-05-16 [2] RSPM (R 4.5.0)
#>  cli                    3.6.5     2025-04-23 [2] RSPM (R 4.5.0)
#>  codetools              0.2-20    2024-03-31 [3] CRAN (R 4.5.1)
#>  crayon                 1.5.3     2024-06-20 [2] RSPM (R 4.5.0)
#>  curl                   6.4.0     2025-06-22 [2] RSPM (R 4.5.0)
#>  DBI                    1.2.3     2024-06-02 [1] RSPM (R 4.5.0)
#>  dbplyr                 2.5.0     2024-03-19 [1] RSPM (R 4.5.0)
#>  DelayedArray           0.35.2    2025-06-18 [1] Bioconductor 3.22 (R 4.5.1)
#>  desc                   1.4.3     2023-12-10 [2] RSPM (R 4.5.0)
#>  digest                 0.6.37    2024-08-19 [2] RSPM (R 4.5.0)
#>  dplyr                  1.1.4     2023-11-17 [1] RSPM (R 4.5.0)
#>  evaluate               1.0.4     2025-06-18 [2] RSPM (R 4.5.0)
#>  fastmap                1.2.0     2024-05-15 [2] RSPM (R 4.5.0)
#>  filelock               1.0.3     2023-12-11 [1] RSPM (R 4.5.0)
#>  fs                     1.6.6     2025-04-12 [2] RSPM (R 4.5.0)
#>  generics               0.1.4     2025-05-09 [1] RSPM (R 4.5.0)
#>  GenomicAlignments      1.45.1    2025-06-22 [1] Bioconductor 3.22 (R 4.5.1)
#>  GenomicRanges          1.61.1    2025-06-22 [1] Bioconductor 3.22 (R 4.5.1)
#>  glue                   1.8.0     2024-09-30 [2] RSPM (R 4.5.0)
#>  htmltools              0.5.8.1   2024-04-04 [2] RSPM (R 4.5.0)
#>  htmlwidgets            1.6.4     2023-12-06 [2] RSPM (R 4.5.0)
#>  httr                   1.4.7     2023-08-15 [1] RSPM (R 4.5.0)
#>  httr2                  1.2.1     2025-07-22 [2] RSPM (R 4.5.0)
#>  IRanges                2.43.0    2025-04-15 [1] Bioconductor 3.22 (R 4.5.0)
#>  jquerylib              0.1.4     2021-04-26 [2] RSPM (R 4.5.0)
#>  jsonlite               2.0.0     2025-03-27 [2] RSPM (R 4.5.0)
#>  knitr                  1.50      2025-03-16 [2] RSPM (R 4.5.0)
#>  lattice                0.22-7    2025-04-02 [3] CRAN (R 4.5.1)
#>  lifecycle              1.0.4     2023-11-07 [2] RSPM (R 4.5.0)
#>  magrittr               2.0.3     2022-03-30 [2] RSPM (R 4.5.0)
#>  Matrix                 1.7-3     2025-03-11 [3] CRAN (R 4.5.1)
#>  MatrixGenerics         1.21.0    2025-04-15 [1] Bioconductor 3.22 (R 4.5.0)
#>  matrixStats            1.5.0     2025-01-07 [1] RSPM (R 4.5.0)
#>  memoise                2.0.1     2021-11-26 [2] RSPM (R 4.5.0)
#>  pillar                 1.11.0    2025-07-04 [2] RSPM (R 4.5.0)
#>  pkgconfig              2.0.3     2019-09-22 [2] RSPM (R 4.5.0)
#>  pkgdown                2.1.3     2025-05-25 [2] RSPM (R 4.5.0)
#>  purrr                  1.1.0     2025-07-10 [2] RSPM (R 4.5.0)
#>  R6                     2.6.1     2025-02-15 [2] RSPM (R 4.5.0)
#>  ragg                   1.4.0     2025-04-10 [2] RSPM (R 4.5.0)
#>  rappdirs               0.3.3     2021-01-31 [2] RSPM (R 4.5.0)
#>  RCurl                  1.98-1.17 2025-03-22 [1] RSPM (R 4.5.0)
#>  restfulr               0.0.16    2025-06-27 [1] RSPM (R 4.5.1)
#>  rjson                  0.2.23    2024-09-16 [1] RSPM (R 4.5.0)
#>  rlang                  1.1.6     2025-04-11 [2] RSPM (R 4.5.0)
#>  rmarkdown              2.29      2024-11-04 [2] RSPM (R 4.5.0)
#>  rplaza               * 0.99.0    2025-07-28 [1] Bioconductor
#>  Rsamtools              2.25.1    2025-06-22 [1] Bioconductor 3.22 (R 4.5.1)
#>  RSQLite                2.4.2     2025-07-18 [1] RSPM (R 4.5.0)
#>  rtracklayer            1.69.1    2025-06-22 [1] Bioconductor 3.22 (R 4.5.1)
#>  S4Arrays               1.9.1     2025-05-29 [1] Bioconductor 3.22 (R 4.5.1)
#>  S4Vectors              0.47.0    2025-04-15 [1] Bioconductor 3.22 (R 4.5.0)
#>  sass                   0.4.10    2025-04-11 [2] RSPM (R 4.5.0)
#>  Seqinfo                0.99.2    2025-07-22 [1] Bioconductor 3.22 (R 4.5.1)
#>  sessioninfo            1.2.3     2025-02-05 [2] RSPM (R 4.5.0)
#>  SparseArray            1.9.1     2025-07-18 [1] Bioconductor 3.22 (R 4.5.1)
#>  SummarizedExperiment   1.39.1    2025-06-22 [1] Bioconductor 3.22 (R 4.5.1)
#>  systemfonts            1.2.3     2025-04-30 [2] RSPM (R 4.5.0)
#>  textshaping            1.0.1     2025-05-01 [2] RSPM (R 4.5.0)
#>  tibble                 3.3.0     2025-06-08 [2] RSPM (R 4.5.0)
#>  tidyselect             1.2.1     2024-03-11 [1] RSPM (R 4.5.0)
#>  vctrs                  0.6.5     2023-12-01 [2] RSPM (R 4.5.0)
#>  withr                  3.0.2     2024-10-28 [2] RSPM (R 4.5.0)
#>  xfun                   0.52      2025-04-02 [2] RSPM (R 4.5.0)
#>  XML                    3.99-0.18 2025-01-01 [1] RSPM (R 4.5.0)
#>  XVector                0.49.0    2025-04-15 [1] Bioconductor 3.22 (R 4.5.0)
#>  yaml                   2.3.10    2024-07-26 [2] RSPM (R 4.5.0)
#> 
#>  [1] /__w/_temp/Library
#>  [2] /usr/local/lib/R/site-library
#>  [3] /usr/local/lib/R/library
#>  * ── Packages attached to the search path.
#> 
#> ──────────────────────────────────────────────────────────────────────────────

References

Almeida-Silva, Fabricio, Tao Zhao, Kristian K Ullrich, M Eric Schranz, and Yves Van de Peer. 2023. “Syntenet: An r/Bioconductor Package for the Inference and Analysis of Synteny Networks.” Bioinformatics 39 (1): btac806.

Dyer, Sarah C, Olanrewaju Austine-Orimoloye, Andrey G Azov, Matthieu Barba, If Barnes, Vianey Paola Barrera-Enriquez, Arne Becker, et al. 2025. “Ensembl 2025.” Nucleic Acids Research 53 (D1): D948–57.

Emms, David M, and Steven Kelly. 2019. “OrthoFinder: Phylogenetic Orthology Inference for Comparative Genomics.” Genome Biology 20: 1–14.

Patro, Rob, Geet Duggal, Michael I Love, Rafael A Irizarry, and Carl Kingsford. 2017. “Salmon Provides Fast and Bias-Aware Quantification of Transcript Expression.” Nature Methods 14 (4): 417–19.

Van Bel, Michiel, Francesca Silvestri, Eric M Weitz, Lukasz Kreft, Alexander Botzki, Frederik Coppens, and Klaas Vandepoele. 2022. “PLAZA 5.0: Extending the Scope and Power of Comparative and Functional Genomics in Plants.” Nucleic Acids Research 50 (D1): D1468–74.

Fabricio Almeida-Silva

Yves Van de Peer