
Accessing data from the PLAZA database
Fabricio Almeida-Silva
VIB-UGent Center for Plant Systems Biology, Ghent, BelgiumYves Van de Peer
VIB-UGent Center for Plant Systems Biology, Ghent, BelgiumSource:
vignettes/rplaza.Rmd
rplaza.Rmd
Introduction
PLAZA (Van Bel et al. 2022) is a database for plant comparative and functional genomics, centralizing data produced by different genome sequencing initiatives. Like Ensembl (Dyer et al. 2025), PLAZA organizes data in instances based on clades of the (green) tree of life (PLAZA Dicots, PLAZA Monocots, PLAZA Diatoms, and pico-PLAZA). Besides its rich web server where users can interactively search and explore plant genomic data, PLAZA offers bulk download of data, including whole-genome and locus sequences, gene annotations, functional annotations, homology, etc. rplaza provides an R interface to the PLAZA database, allowing the direct import of PLAZA data to the R session using standard R/Bioconductor classes, allowing seamless interoperability and integration with Bioconductor packages.
Data types provided by PLAZA and how they are stored for interoperability with R/Bioconductor packages are summarized below:
Data | File format | R class |
---|---|---|
Sequences | FASTA | AAStringSet/DNAStringSet |
Ranges | GFF3 | GRanges |
Functions and homology | TSV | data.frame |
Installation
rplaza can be installed from Bioconductor with the following code:
if(!requireNamespace('BiocManager', quietly = TRUE)) {
install.packages('BiocManager')
}
BiocManager::install("rplaza")
PLAZA instances and species (or fantastic plants and where to find them)
The PLAZA database is divided in four instances that are specialized in different clades:
- PLAZA Dicots: focus on eudicot plants (N = 98 species);
- PLAZA Monocots: focus on monocot plants (N = 52 species);
- pico-PLAZA: focus on eukaryotic micro-organisms (N = 39 species);
- PLAZA Diatoms: focus on diatoms (N = 26 species).
To access data on what species can be found in which instance, as
well as associated metadata (taxonomic information, data sources, etc),
you can inspect the object plaza_metadata
, which comes with
rplaza. This object is
a list of data frames, one for each PLAZA instance.
# Load PLAZA metadata
data("plaza_metadata")
names(plaza_metadata)
#> [1] "Dicots" "Monocots" "Diatoms" "Pico"
As you can see, the list has four data frames. If you want to have a look at the species in PLAZA Dicots, for instance, you need to extract the corresponding data frame from the list.
# Look at metadata for
head(plaza_metadata$Dicots)
#> PLAZA_ID common_name NCBI_name NCBI_taxid family
#> 1 aag Anthoceros agrestis Anthoceros agrestis 41834 Anthocerotaceae
#> 2 aar Aethionema arabicum Aethionema arabicum 228871 Brassicaceae
#> 3 acertr Acer truncatum Acer truncatum 47965 Sapindaceae
#> 4 ach Actinidia chinensis Actinidia chinensis 3625 Actinidiaceae
#> 5 aly Arabidopsis lyrata Arabidopsis lyrata 59689 Brassicaceae
#> 6 ama Avicennia marina Avicennia marina 82927 Acanthaceae
#> order class source
#> 1 Anthocerotales Anthocerotopsida v1.1
#> 2 Brassicales Magnoliopsida v3.1
#> 3 Sapindales Magnoliopsida v2.0
#> 4 Ericales Magnoliopsida V3.0
#> 5 Brassicales Magnoliopsida JGI v2.1
#> 6 Lamiales Magnoliopsida v1.1 20200527
#> data_provider
#> 1 https://www.hornworts.uzh.ch/en.html
#> 2 https://genomevolution.org/coge/GenomeInfo.pl?gid=37218
#> 3 https://doi.org/10.6084/m9.figshare.12986237.v2
#> 4 https://figshare.com/articles/dataset/Actinidia_chinensis_genome_data/10046558
#> 5 https://phytozome.jgi.doe.gov/pz/portal.html#!info?alias=Org_Alyrata
#> 6 https://datadryad.org/stash/dataset/doi:10.5061/dryad.3j9kd51f5
#> pubmed_id
#> 1 32170292
#> 2 31554715
#> 3 32772482
#> 4 31645971
#> 5 26382944
#> 6 33561229
Note that PLAZA gives unique IDs to each species, and these are
available in column PLAZA_ID
of each data frame. These
PLAZA species IDs are essential to work with rplaza, as most functions in
this package require specifying species using PLAZA IDs.
It is also important to keep in mind that some species are present in multiple PLAZA instances (e.g., Arabidopsis thaliana), so if you’re combining these data frames in a single data frame, you need to remove duplicate rows. For example:
# Combine all data frames together and remove redundancy
plaza_all <- dplyr::bind_rows(plaza_metadata, .id = "instance") |>
dplyr::distinct(PLAZA_ID, .keep_all = TRUE)
head(plaza_all)
#> instance PLAZA_ID common_name NCBI_name NCBI_taxid
#> 1 Dicots aag Anthoceros agrestis Anthoceros agrestis 41834
#> 2 Dicots aar Aethionema arabicum Aethionema arabicum 228871
#> 3 Dicots acertr Acer truncatum Acer truncatum 47965
#> 4 Dicots ach Actinidia chinensis Actinidia chinensis 3625
#> 5 Dicots aly Arabidopsis lyrata Arabidopsis lyrata 59689
#> 6 Dicots ama Avicennia marina Avicennia marina 82927
#> family order class source
#> 1 Anthocerotaceae Anthocerotales Anthocerotopsida v1.1
#> 2 Brassicaceae Brassicales Magnoliopsida v3.1
#> 3 Sapindaceae Sapindales Magnoliopsida v2.0
#> 4 Actinidiaceae Ericales Magnoliopsida V3.0
#> 5 Brassicaceae Brassicales Magnoliopsida JGI v2.1
#> 6 Acanthaceae Lamiales Magnoliopsida v1.1 20200527
#> data_provider
#> 1 https://www.hornworts.uzh.ch/en.html
#> 2 https://genomevolution.org/coge/GenomeInfo.pl?gid=37218
#> 3 https://doi.org/10.6084/m9.figshare.12986237.v2
#> 4 https://figshare.com/articles/dataset/Actinidia_chinensis_genome_data/10046558
#> 5 https://phytozome.jgi.doe.gov/pz/portal.html#!info?alias=Org_Alyrata
#> 6 https://datadryad.org/stash/dataset/doi:10.5061/dryad.3j9kd51f5
#> pubmed_id
#> 1 32170292
#> 2 31554715
#> 3 32772482
#> 4 31645971
#> 5 26382944
#> 6 33561229
nrow(plaza_all)
#> [1] 171
After combining all data frames together and removing duplicates, we can see that there are 171 unique species across all PLAZA instances.
A note on software design
rplaza uses BiocFileCache to create a temporary local cache of remote resources per R session. This way, if you try to retrieve the same data multiple times in a single R session, you will only download data once. Caching is especially important in two common scenarios:
You downloaded data for species A and B, but later decide to include species C and D as well. If you re-run the code to obtain data for species A, B, C, and D, you will only download data for species C and D, as data for species A and B are already stored in a local cache.
You tried to download data for several species, but after some time your download failed for some reason (internet issues, timeout, etc). If you re-run the code to download data, your download will start from where it failed, so it will finish much faster.
Obtaining sequence data
PLAZA provides sequences for whole genomes and whole-genome loci (all
proteins, all transcripts, and all CDS in a genome). These are available
as FASTA files, which rplaza reads as
DNAStringSet
objects (or AAStringSet
objects
for proteins).
Genomes
To extract whole-genome sequences for a particular species, you will
use the function get_genome()
. For a single species,
get_genome()
takes a species ID and returns a
DNAStringSet
object. To demonstrate it, let’s obtain the
whole genome of Arabidopsis thaliana.
# Retrieve genome of Arabidopsis thaliana
ath_genome <- get_genome("ath")
ath_genome
#> DNAStringSet object of length 7:
#> width seq names
#> [1] 30427671 CCCTAAACCCTAAACCCTAAAC...TTTAGGGTTTAGGGTTTAGGG Chr1
#> [2] 19698289 NNNNNNNNNNNNNNNNNNNNNN...TTTAGGGTTTAGGGTTTAGGG Chr2
#> [3] 23459830 NNNNNNNNNNNNNNNNNNNNNN...TAAACCCTAAACCCTAAACCC Chr3
#> [4] 18585056 NNNNNNNNNNNNNNNNNNNNNN...GTTTAGGGTTTAGGGTTTAGG Chr4
#> [5] 26975502 TATACCATGTACCCTCAACCTT...AGGATTTAGGGTTTTTAGATC Chr5
#> [6] 154478 ATGGGCGAACGACGGGAATTGA...AATAACTTGGTCCCGGGCATC ChrC
#> [7] 366924 GGATCCGTTCGAAACAGGTTAG...AGAATGGAAACAAACCGGATT ChrM
For multiple species, get_genome()
takes a vector of
species IDs and returns a list of DNAStringSet
objects.
# Retrieve genome of A. thaliana and A. lyrata
arabidopsis_genomes <- get_genome(c("ath", "aly"))
arabidopsis_genomes
#> $ath
#> DNAStringSet object of length 7:
#> width seq names
#> [1] 30427671 CCCTAAACCCTAAACCCTAAAC...TTTAGGGTTTAGGGTTTAGGG Chr1
#> [2] 19698289 NNNNNNNNNNNNNNNNNNNNNN...TTTAGGGTTTAGGGTTTAGGG Chr2
#> [3] 23459830 NNNNNNNNNNNNNNNNNNNNNN...TAAACCCTAAACCCTAAACCC Chr3
#> [4] 18585056 NNNNNNNNNNNNNNNNNNNNNN...GTTTAGGGTTTAGGGTTTAGG Chr4
#> [5] 26975502 TATACCATGTACCCTCAACCTT...AGGATTTAGGGTTTTTAGATC Chr5
#> [6] 154478 ATGGGCGAACGACGGGAATTGA...AATAACTTGGTCCCGGGCATC ChrC
#> [7] 366924 GGATCCGTTCGAAACAGGTTAG...AGAATGGAAACAAACCGGATT ChrM
#>
#> $aly
#> DNAStringSet object of length 695:
#> width seq names
#> [1] 33132539 TCACTTTCACTTCTGGATGCT...GTCTAAGCAACTTTCCATAA scaffold_1
#> [2] 19320864 CTAAACCCTAAACCCTAAACC...TGACAAGTGTAATATCTTGA scaffold_2
#> [3] 24464547 AAAAAAGGGGGTTCGGATAGA...TATTGGATGAGTTTCAGTAC scaffold_3
#> [4] 23328337 CGATGGAGTTTTTCAGCAGTT...TTTAGGGTTTAGGGTTTAGG scaffold_4
#> [5] 21221946 CCCTGAAGCCTGAACCCTGAA...TTAAGGGTTAGGGGTTAGGG scaffold_5
#> ... ... ...
#> [691] 1047 GTGGTGGGCGAACAAAACAAG...TGTGCCAAAATAAATATACT scaffold_1112
#> [692] 1037 TATAAGAGCTTTTTGGAGCTT...TTGTAAGGGTTCCCAAAGCC scaffold_1113
#> [693] 1032 CAAATCATCTGCATGTGGCTT...TCTATAAATAGTTTATATAT scaffold_1116
#> [694] 1017 CAATGAACACATAAAATGCAA...AGACAGCACCCCTCCGTGGG scaffold_1117
#> [695] 1005 CTTTGTTCTTTTTATTTTCAA...TCCCAACTATCCGAATTGAG scaffold_1118
To learn more about how to work with DNAStringSet
objects, read the vignette of the Biostrings
package.
Loci
To retrieve sequences for all proteins, transcripts, or CDS in the
genome of a particular species, you will use the function
get_sequences()
. Besides specifying one or multiple species
IDs, get_sequences()
also requires the locus
type
(one of ‘CDS’, ‘transcript’, or ‘protein’), and what
transcripts
to extract (one of ‘all’ or ‘longest’).
For example, researchers in evolutionary genomics often want to obtain protein sequences of the longest transcripts of each gene, so that they can compare genomes. These are usually called ‘proteomes’ by software tools such as OrthoFinder (Emms and Kelly 2019) and syntenet (Almeida-Silva et al. 2023). To do that, you’d run:
# Retrieve protein sequences of the longest transcripts (one per gene)
proteome <- get_sequences("ath", type = "protein", transcripts = "longest")
proteome
#> AAStringSet object of length 27615:
#> width seq names
#> [1] 430 MEDQVGFGFRPNDEELVGHYL...NMIIGVLLFISVISWIILVG* AT1G01010.1 | AT1...
#> [2] 199 MIIFIDLILHRPKVYRHVLYN...IFLIQIGSLLQYMSYFFRIV* AT1G01020.5 | AT1...
#> [3] 359 MDLSLAPTTTTSSDQEQDRDQ...GDDDQFAKKGKSSLSLNFNP* AT1G01030.1 | AT1...
#> [4] 1910 MVMEDEPREATIKPSYWLDAC...KKAKDSAAVLLLELLNKTFS* AT1G01040.1 | AT1...
#> [5] 213 MSEETKDNQRLQRPAPRLNER...EAIQYSMDLYAEYILHTLRR* AT1G01050.1 | AT1...
#> ... ... ...
#> [27611] 146 MTKREYNSQPEMKEEVLAYLL...CFYPIVQTFLFLMGGGGGPA* ATMG01350.1 | ATM...
#> [27612] 528 MKNLVRWLFSTNHKDIGTLYF...PAFHTFGELPAIKETKSYVK* ATMG01360.1 | ATM...
#> [27613] 112 MKISYFIRRGKKTSRRSHFIK...HPSIVHFDCLLFFLDTTPCL* ATMG01370.1 | ATM...
#> [27614] 106 MKYHFSSMEPWWKREFSFCIP...VERSRSNSLNSGPNPLENAT* ATMG01400.1 | ATM...
#> [27615] 205 MFGGRRRRLPQDGTFNQTQPF...REIFPRESGVLSLMSVPSLA* ATMG01410.1 | ATM...
Alternatively, one might be interested in all transcripts produced by a particular genome. In transcriptomics, this is often called a ‘reference transcriptome’, and it is required by software tools that quantify transcript abundances, such as salmon (Patro et al. 2017). To do that, you would run:
# Obtain all transcripts produced by a genome (aka 'reference transcriptome')
ref_tx <- get_sequences("ath", type = "transcript", transcripts = "all")
ref_tx
#> DNAStringSet object of length 91393:
#> width seq names
#> [1] 1688 AAATTATTAGATATACCAAAC...TGATATATAGTCTTGTATAAG AT1G01010.1 | AT1...
#> [2] 1306 AATTTATTTTCTTATAAACTT...AATTGAATTAGACCTTACTAA AT1G01020.5 | AT1...
#> [3] 1905 ATATCATTCATGGGCCCCTAC...TCTTTTCAATTTTGTATATAA AT1G01030.1 | AT1...
#> [4] 6276 GTGGAAAACAGACCAGAAGAG...AAACCAAAAATTTCAATGCCA AT1G01040.1 | AT1...
#> [5] 207 TACTTTTCTAATATCACGAGG...CTTCTATAATGTTATTTATTA AT1G01046.1 | AT1...
#> ... ... ...
#> [91389] 936 CAAGCCCAAAATAGACAACAT...CTGCATCTGTTTTCAACTTAA AT5G67520.3 | AT5...
#> [91390] 1945 CCGAGAGAAAGCAGAAAAAAA...TATACAAATGCTTTTGTGACT AT5G67540.1 | AT5...
#> [91391] 2024 AGAAAAAAAAAAGAAAAAAAA...AGTAAATGGCGGCTGCCTTAA AT5G67540.3 | AT5...
#> [91392] 1370 CTTAGCTTGAATACAAAAGAA...TGTGGATGATCTGTTTAGTAA AT5G67580.2 | AT5...
#> [91393] 1814 CTGAAACCCACAAGGAAGAAA...TCTTTTAGTATGGATATTAAT AT5G67610.2 | AT5...
Note that the number of sequences in ref_tx
is much
larger than the number of sequences in proteome
, since the
former contains sequences for all transcripts (including multiple
isoforms per gene).
To learn more about how to work with AAStringSet
and
DNAStringSet
objects, read the vignette of the Biostrings
package.
Obtaining gene annotations (a.k.a. ‘ranges’)
The obtain the genomic locations (ranges, with start and end
coordinates) of all genes in the genome of a particular species, you
will use the function get_annotation()
. This function
returns a GRanges
object (or GRangesList
for
multiple species) with genomic coordinates and associated metadata. As
input, you need to specify the PLAZA species ID of your species of
interest (one or multiple), what transcripts
you want to
extract (‘all’ or ‘longest’, exactly as in
get_sequences()
), and what features
to extract
per transcript (‘all’ - including exons, CDS, UTRs, etc. - or
‘exons’).
For example, if you want to extract ranges for the complete genome annotation (i.e., all transcripts, all features), you’d run:
# Extract ranges for all features (exons, CDS, UTRs, etc) of all transcripts
annotation_all <- get_annotation("ath", transcripts = "all", features = "all")
annotation_all
#> GRanges object with 835565 ranges and 14 metadata columns:
#> seqnames ranges strand | source type score
#> <Rle> <IRanges> <Rle> | <factor> <factor> <numeric>
#> [1] Chr1 3631-5899 + | Araport11 gene NA
#> [2] Chr1 3631-5899 + | Araport11 mRNA NA
#> [3] Chr1 3631-3913 + | Araport11 exon NA
#> [4] Chr1 3631-3759 + | Araport11 five_prime_UTR NA
#> [5] Chr1 3760-3913 + | Araport11 CDS NA
#> ... ... ... ... . ... ... ...
#> [835561] ChrM 363725-364042 + | Araport11 CDS NA
#> [835562] ChrM 366086-366700 - | Araport11 gene NA
#> [835563] ChrM 366086-366700 - | Araport11 mRNA NA
#> [835564] ChrM 366086-366700 - | Araport11 exon NA
#> [835565] ChrM 366086-366700 - | Araport11 CDS NA
#> phase ID uniprot symbol tid
#> <integer> <character> <character> <character> <character>
#> [1] <NA> AT1G01010 Q0WV96 NAC001 AT1G01010.1
#> [2] <NA> AT1G01010.1 <NA> <NA> <NA>
#> [3] <NA> AT1G01010.1:exon:1 <NA> <NA> <NA>
#> [4] <NA> AT1G01010.1:five_pri.. <NA> <NA> <NA>
#> [5] 0 AT1G01010.1:CDS:1 <NA> <NA> <NA>
#> ... ... ... ... ... ...
#> [835561] 0 ATMG01400.1:CDS:1 <NA> <NA> <NA>
#> [835562] <NA> ATMG01410 P92567 ORF204 ATMG01410.1
#> [835563] <NA> ATMG01410.1 <NA> <NA> <NA>
#> [835564] <NA> ATMG01410.1:exon:1 <NA> <NA> <NA>
#> [835565] 0 ATMG01410.1:CDS:1 <NA> <NA> <NA>
#> Alias Name gene_id
#> <CharacterList> <character> <character>
#> [1] ANAC001,NAC domain containin.. AT1G01010 AT1G01010
#> [2] AT1G01010 AT1G01010
#> [3] AT1G01010 AT1G01010
#> [4] AT1G01010 AT1G01010
#> [5] AT1G01010 AT1G01010
#> ... ... ... ...
#> [835561] ATMG01400 ATMG01400
#> [835562] ATMG01410 ATMG01410
#> [835563] ATMG01410 ATMG01410
#> [835564] ATMG01410 ATMG01410
#> [835565] ATMG01410 ATMG01410
#> Parent full_name name
#> <CharacterList> <character> <CharacterList>
#> [1] <NA>
#> [2] AT1G01010 <NA>
#> [3] AT1G01010.1 <NA>
#> [4] AT1G01010.1 <NA>
#> [5] AT1G01010.1 <NA>
#> ... ... ... ...
#> [835561] ATMG01400.1 <NA>
#> [835562] <NA>
#> [835563] ATMG01410 <NA>
#> [835564] ATMG01410.1 <NA>
#> [835565] ATMG01410.1 <NA>
#> -------
#> seqinfo: 7 sequences from an unspecified genome; no seqlengths
To learn more about how to work with GRanges
objects,
read the vignette of the GenomicRanges
package.
Obtaining functional annotations
PLAZA offers functional annotation for all genes in all species using
GO terms, MapMan pathways, and InterPro domains. To retrieve
functional annotation for all genes of a particular species, you will
use the function get_functional_annotation()
. As input,
this function takes one or multiple PLAZA species IDS and the
ontology
to use (one of ‘GO’, ‘MapMan’, or ‘InterPro’), and
it returns a data frame (or list of data frames, in case of multiple
species) with terms associated with each gene.
For example, to obtain InterPro domains annotated to each gene of Arabidopsis thaliana, you’d use:
# Get InterPro domains
ath_interpro <- get_functional_annotation("ath", ontology = "InterPro")
head(ath_interpro)
#> gene_id species motif_id description start stop score
#> 1 AT1G01010 ath IPR003441 NAC domain 5 137 4.60000e-26
#> 2 AT1G01010 ath IPR036093 NAC domain superfamily 5 156 4.84000e-37
#> 3 AT1G01020 ath IPR007290 Arv1 protein 1 146 1.50000e-21
#> 4 AT1G01020 ath IPR007290 Arv1 protein 1 176 9.20000e-68
#> 5 AT1G01030 ath IPR003340 B3 DNA binding domain 56 162 6.30000e-22
#> 6 AT1G01030 ath IPR003340 B3 DNA binding domain 54 150 2.13236e-27
#> comment
#> 1 source=Pfam,domainId=PF02365
#> 2 source=SUPERFAMILY,domainId=SSF101941
#> 3 source=Pfam,domainId=PF04161
#> 4 source=PANTHER,domainId=PTHR14467
#> 5 source=SMART,domainId=SM01019
#> 6 source=CDD,domainId=cd10017
As another example, to obtain GO annotations for the Chlorophyta algae Chlamydomonas reinhardtii and Micromonas commoda, you’d use:
# Obtain functional annotation for Chlorophyta algae
algae_go <- get_functional_annotation(c("cre", "mco"), ontology = "GO")
names(algae_go)
#> [1] "cre" "mco"
head(algae_go$cre)
#> gene_id species go evidence go_source provider
#> 1 Cre01.g000050 cre GO:0003674 IEA primary interproscan
#> 2 Cre01.g000050 cre GO:0140110 IEA primary interproscan
#> 3 Cre01.g000050 cre GO:0006355 IEA primary goa
#> 4 Cre01.g000050 cre GO:0006139 IEA primary goa
#> 5 Cre01.g000050 cre GO:0006351 IEA primary goa
#> 6 Cre01.g000050 cre GO:0006725 IEA primary goa
#> comment
#> 1
#> 2
#> 3 assigned_by=GOC;db_reference=GO_REF:0000108
#> 4 assigned_by=GOC;db_reference=GO_REF:0000108
#> 5 assigned_by=GOC;db_reference=GO_REF:0000108
#> 6 assigned_by=GOC;db_reference=GO_REF:0000108
#> description propagated_from_child
#> 1 molecular_function 1
#> 2 transcription regulator activity 1
#> 3 regulation of transcription, DNA-templated 0
#> 4 nucleobase-containing compound metabolic process 1
#> 5 transcription, DNA-templated 1
#> 6 cellular aromatic compound metabolic process 1
Lastly, to obtain MapMan annotations for all A. thaliana genes, you’d use:
# Retrieve MapMan annotations for all *A. thaliana* genes
ath_mapman <- get_functional_annotation("ath", ontology = "MapMan")
head(ath_mapman)
#> id species gene_id mapman_id
#> 1 3508586 ath AT1G01020 35.1
#> 2 3499151 ath AT1G01030 15.5.5.3
#> 3 3500934 ath AT1G01040 16.9.2.1.1
#> 4 3508292 ath AT1G01050 27.6.1
#> 5 3508030 ath AT1G01060 27.1.1
#> 6 3506530 ath AT1G01070 24.2.1.5
#> mapman_description
#> 1 not assigned.annotated
#> 2 RNA biosynthesis.transcriptional regulation.B3 transcription factor superfamily.transcription factor (RAV/NGATHA)
#> 3 RNA processing.mRNA silencing.miRNA pathway.DCL1-HYL1 miRNA biogenesis complex.endoribonuclease component DCL1
#> 4 Multi-process regulation.pyrophosphate homeostasis.cytosolic pyrophosphatase
#> 5 Multi-process regulation.circadian clock system.core oscillator protein (LHY/CCA1)
#> 6 Solute transport.carrier-mediated transport.DMT superfamily.solute transporter (UmamiT)
Obtaining gene IDs and descriptions
Researchers often have transcript IDs for a particular genome and
want to obtain their corresponding gene IDs. PLAZA provides
correspondence tables of transcript-to-gene ID mappings for all species,
which can be very helpful in this context. These can be retrieved with
get_tx2gene
, which takes as input PLAZA species IDs and
what transcripts
to retrieve (one of ‘all’ or
‘longest’).
For example, to obtain transcript-to-gene ID mappings for Arabidopsis thaliana, you’d run:
# Get tx2gene mappings for *A. thaliana*
ath_tx2gene <- get_tx2gene("ath", transcripts = "all")
head(ath_tx2gene)
#> transcript_id gene_id
#> 1 AT1G01020.5 AT1G01020
#> 2 AT1G01030.1 AT1G01030
#> 3 AT1G01040.1 AT1G01040
#> 4 AT1G01046.1 AT1G01046
#> 5 AT1G01050.1 AT1G01050
#> 6 AT1G01060.6 AT1G01060
A similar problem occurs when researchers have gene symbols (e.g.,
FLC and MAF2 in A. thaliana) and want to
extract their gene IDs. To help in this case, PLAZA offers ID conversion
tables, which can be accessed with the function
get_id_conversions()
, as demonstrated below:
# Get ID conversion table for *A. thaliana*
ath_ids <- get_id_conversions("ath")
head(ath_ids)
#> original_id alt_id_type alt_id
#> 1 AT1G01010 symbol NAC001
#> 2 AT1G01010 tid AT1G01010.1
#> 3 AT1G01010 uniprot Q0WV96
#> 4 AT1G01020 symbol ARV1
#> 5 AT1G01020 tid AT1G01020.5
#> 6 AT1G01020 uniprot Q5MK24
Lastly, researchers are often interested in short (usually one
sentence) descriptions for genes to help them interpret their results.
These can be obtained from PLAZA with the function
get_descriptions()
, which takes one or multiple PLAZA
species IDs as input.
For example, to obtain gene descriptions for all A. thaliana genes, you’d run:
# Get gene descriptions for *A. thaliana*
ath_descriptions <- get_descriptions("ath")
head(ath_descriptions)
#> gene description
#> 1 AT1G01020 ARV1 family protein
#> 2 AT1G01030 AP2/B3-like transcriptional factor family protein
#> 3 AT1G01040 dicer-like 1
#> 4 AT1G01046 microRNA ath-MIR838 precursor
#> 5 AT1G01050 pyrophosphorylase 1
#> 6 AT1G01060 Homeodomain-like superfamily protein
Obtaining homology data
For each PLAZA instance, all (coding) genes in all species are
grouped in homologous and orthologous gene families. These gene
families, sometimes referred to as ‘orthogroups’ or ‘hierarchical
orthologous groups (HOGs)’, are the basis for comparative analyses in
evolutionary genomics. To obtain gene family assignments for all genes
in a PLAZA instance, you will use the function
get_family_assignments()
. As input, you need to pass the
PLAZA instance
(one of ‘Dicots’, ‘Monocots’, ‘Diatoms’, or
‘Pico’), and the gene family type
(one of ‘homo’ for
homologous, or ‘ortho’ for orthologous).
For example, to retrieve all genes and their homologous family assignments in PLAZA Dicots, you’d run:
# Get family assignments from PLAZA Dicots
dicots_fams <- get_family_assignments(instance = "Dicots", type = "homo")
head(dicots_fams)
#> family_id species gene_id
#> 1 HOM05D000001 aag AagrBONN_evm.TU.Sc2ySwM_344.5580
#> 2 HOM05D000001 aag AagrBONN_evm.TU.Sc2ySwM_344.5581
#> 3 HOM05D000001 aag AagrBONN_evm.TU.Sc2ySwM_344.5582__5583
#> 4 HOM05D000001 aag AagrBONN_evm.TU.Sc2ySwM_344.5586
#> 5 HOM05D000001 aag AagrBONN_evm.TU.Sc2ySwM_344.5587
#> 6 HOM05D000001 aag AagrBONN_evm.TU.Sc2ySwM_344.5588
Once you have some homologous or orthologous gene families of
interest (e.g., families that expanded in a particular clade), you would
probably want to know the functions of the genes in these families.
Luckily for you, PLAZA stores data on overrepresented functions in its
gene families. These can be accessed with the function
get_family_functions()
, but they are only available for
PLAZA Dicots and PLAZA Monocots.
For example, to retrieve overrepresented functions in all homologous gene families in PLAZA Dicots, you’d run:
# Get overrepresented functions in PLAZA Dicots families
dicots_fams_func <- get_family_functions(instance = "Dicots", type = "homo")
head(dicots_fams_func)
#> family_id term_id term_description
#> 1 HOM05D000001 GO:0005488 binding
#> 2 HOM05D000001 IPR011990 Tetratricopeptide-like helical domain superfamily
#> 3 HOM05D000001 IPR002885 Pentatricopeptide repeat
#> 4 HOM05D000002 IPR013103 Reverse transcriptase, RNA-dependent DNA polymerase
#> 5 HOM05D000002 GO:0003676 nucleic acid binding
#> 6 HOM05D000002 GO:0097159 organic cyclic compound binding
#> ontology score
#> 1 go 28.33542
#> 2 interpro 420.04453
#> 3 interpro 594.17669
#> 4 interpro 1000.07958
#> 5 go 33.32392
#> 6 go 16.16927
Lastly, PLAZA also stores data on block duplicates (anchor pairs in
syntenic regions, derived from segmental or whole-genome duplications)
for all its genomes. These data can be retrieved with the function
get_block_duplicates()
, which takes one or multiple PLAZA
species IDs as input.
For example, to retrieve all A. thaliana block duplicates, you’d run:
# Get block duplicates for *A. thaliana*
ath_blocks <- get_block_duplicates("ath")
head(ath_blocks)
#> anchor1 anchor2 multiplicon_id multiplicon_level Ks
#> 1 AT2G41290 AT3G57010 4569 2 NA
#> 2 AT2G41310 AT3G57040 4569 2 NA
#> 3 AT2G41451 AT3G57200 4569 2 NA
#> 4 AT2G41490 AT3G57220 4569 2 NA
#> 5 AT2G41560 AT3G57330 4569 2 NA
#> 6 AT2G41620 AT3G57350 4569 2 NA
Note that the Ks
column is empty, meaning that
synonymous substitution rates
()
were not calculated for this genome.
Session information
This document was created under the following conditions:
#> ─ Session info ───────────────────────────────────────────────────────────────
#> setting value
#> version R version 4.5.1 (2025-06-13)
#> os Ubuntu 24.04.2 LTS
#> system x86_64, linux-gnu
#> ui X11
#> language en
#> collate en_US.UTF-8
#> ctype en_US.UTF-8
#> tz UTC
#> date 2025-06-24
#> pandoc 3.7.0.2 @ /usr/bin/ (via rmarkdown)
#> quarto 1.6.42 @ /usr/local/bin/quarto
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#> package * version date (UTC) lib source
#> abind 1.4-8 2024-09-12 [1] RSPM (R 4.5.0)
#> Biobase 2.69.0 2025-04-15 [1] Bioconductor 3.22 (R 4.5.0)
#> BiocFileCache 2.99.5 2025-05-18 [1] Bioconductor 3.22 (R 4.5.0)
#> BiocGenerics 0.55.0 2025-04-15 [1] Bioconductor 3.22 (R 4.5.0)
#> BiocIO 1.19.0 2025-04-15 [1] Bioconductor 3.22 (R 4.5.0)
#> BiocManager 1.30.26 2025-06-05 [2] CRAN (R 4.5.1)
#> BiocParallel 1.43.4 2025-06-16 [1] Bioconductor 3.22 (R 4.5.1)
#> BiocStyle * 2.37.0 2025-04-15 [1] Bioconductor 3.22 (R 4.5.0)
#> Biostrings 2.77.2 2025-06-22 [1] Bioconductor 3.22 (R 4.5.1)
#> bit 4.6.0 2025-03-06 [1] RSPM (R 4.5.0)
#> bit64 4.6.0-1 2025-01-16 [1] RSPM (R 4.5.0)
#> bitops 1.0-9 2024-10-03 [1] RSPM (R 4.5.0)
#> blob 1.2.4 2023-03-17 [1] RSPM (R 4.5.0)
#> bookdown 0.43 2025-04-15 [1] RSPM (R 4.5.0)
#> bslib 0.9.0 2025-01-30 [2] RSPM (R 4.5.0)
#> cachem 1.1.0 2024-05-16 [2] RSPM (R 4.5.0)
#> cli 3.6.5 2025-04-23 [2] RSPM (R 4.5.0)
#> codetools 0.2-20 2024-03-31 [3] CRAN (R 4.5.1)
#> crayon 1.5.3 2024-06-20 [2] RSPM (R 4.5.0)
#> curl 6.4.0 2025-06-22 [2] RSPM (R 4.5.0)
#> DBI 1.2.3 2024-06-02 [1] RSPM (R 4.5.0)
#> dbplyr 2.5.0 2024-03-19 [1] RSPM (R 4.5.0)
#> DelayedArray 0.35.2 2025-06-18 [1] Bioconductor 3.22 (R 4.5.1)
#> desc 1.4.3 2023-12-10 [2] RSPM (R 4.5.0)
#> digest 0.6.37 2024-08-19 [2] RSPM (R 4.5.0)
#> dplyr 1.1.4 2023-11-17 [1] RSPM (R 4.5.0)
#> evaluate 1.0.4 2025-06-18 [2] RSPM (R 4.5.0)
#> fastmap 1.2.0 2024-05-15 [2] RSPM (R 4.5.0)
#> filelock 1.0.3 2023-12-11 [1] RSPM (R 4.5.0)
#> fs 1.6.6 2025-04-12 [2] RSPM (R 4.5.0)
#> generics 0.1.4 2025-05-09 [1] RSPM (R 4.5.0)
#> GenomicAlignments 1.45.1 2025-06-22 [1] Bioconductor 3.22 (R 4.5.1)
#> GenomicRanges 1.61.1 2025-06-22 [1] Bioconductor 3.22 (R 4.5.1)
#> glue 1.8.0 2024-09-30 [2] RSPM (R 4.5.0)
#> htmltools 0.5.8.1 2024-04-04 [2] RSPM (R 4.5.0)
#> htmlwidgets 1.6.4 2023-12-06 [2] RSPM (R 4.5.0)
#> httr 1.4.7 2023-08-15 [1] RSPM (R 4.5.0)
#> httr2 1.1.2 2025-03-26 [2] RSPM (R 4.5.0)
#> IRanges 2.43.0 2025-04-15 [1] Bioconductor 3.22 (R 4.5.0)
#> jquerylib 0.1.4 2021-04-26 [2] RSPM (R 4.5.0)
#> jsonlite 2.0.0 2025-03-27 [2] RSPM (R 4.5.0)
#> knitr 1.50 2025-03-16 [2] RSPM (R 4.5.0)
#> lattice 0.22-7 2025-04-02 [3] CRAN (R 4.5.1)
#> lifecycle 1.0.4 2023-11-07 [2] RSPM (R 4.5.0)
#> magrittr 2.0.3 2022-03-30 [2] RSPM (R 4.5.0)
#> Matrix 1.7-3 2025-03-11 [3] CRAN (R 4.5.1)
#> MatrixGenerics 1.21.0 2025-04-15 [1] Bioconductor 3.22 (R 4.5.0)
#> matrixStats 1.5.0 2025-01-07 [1] RSPM (R 4.5.0)
#> memoise 2.0.1 2021-11-26 [2] RSPM (R 4.5.0)
#> pillar 1.10.2 2025-04-05 [2] RSPM (R 4.5.0)
#> pkgconfig 2.0.3 2019-09-22 [2] RSPM (R 4.5.0)
#> pkgdown 2.1.3 2025-05-25 [2] RSPM (R 4.5.0)
#> purrr 1.0.4 2025-02-05 [2] RSPM (R 4.5.0)
#> R6 2.6.1 2025-02-15 [2] RSPM (R 4.5.0)
#> ragg 1.4.0 2025-04-10 [2] RSPM (R 4.5.0)
#> rappdirs 0.3.3 2021-01-31 [2] RSPM (R 4.5.0)
#> RCurl 1.98-1.17 2025-03-22 [1] RSPM (R 4.5.0)
#> restfulr 0.0.15 2022-06-16 [1] RSPM (R 4.5.0)
#> rjson 0.2.23 2024-09-16 [1] RSPM (R 4.5.0)
#> rlang 1.1.6 2025-04-11 [2] RSPM (R 4.5.0)
#> rmarkdown 2.29 2024-11-04 [2] RSPM (R 4.5.0)
#> rplaza * 0.99.0 2025-06-24 [1] Bioconductor
#> Rsamtools 2.25.1 2025-06-22 [1] Bioconductor 3.22 (R 4.5.1)
#> RSQLite 2.4.1 2025-06-08 [1] RSPM (R 4.5.0)
#> rtracklayer 1.69.1 2025-06-22 [1] Bioconductor 3.22 (R 4.5.1)
#> S4Arrays 1.9.1 2025-05-29 [1] Bioconductor 3.22 (R 4.5.1)
#> S4Vectors 0.47.0 2025-04-15 [1] Bioconductor 3.22 (R 4.5.0)
#> sass 0.4.10 2025-04-11 [2] RSPM (R 4.5.0)
#> Seqinfo 0.99.1 2025-06-22 [1] Bioconductor 3.22 (R 4.5.1)
#> sessioninfo 1.2.3 2025-02-05 [2] RSPM (R 4.5.0)
#> SparseArray 1.9.0 2025-04-15 [1] Bioconductor 3.22 (R 4.5.0)
#> SummarizedExperiment 1.39.1 2025-06-22 [1] Bioconductor 3.22 (R 4.5.1)
#> systemfonts 1.2.3 2025-04-30 [2] RSPM (R 4.5.0)
#> textshaping 1.0.1 2025-05-01 [2] RSPM (R 4.5.0)
#> tibble 3.3.0 2025-06-08 [2] RSPM (R 4.5.0)
#> tidyselect 1.2.1 2024-03-11 [1] RSPM (R 4.5.0)
#> vctrs 0.6.5 2023-12-01 [2] RSPM (R 4.5.0)
#> withr 3.0.2 2024-10-28 [2] RSPM (R 4.5.0)
#> xfun 0.52 2025-04-02 [2] RSPM (R 4.5.0)
#> XML 3.99-0.18 2025-01-01 [1] RSPM (R 4.5.0)
#> XVector 0.49.0 2025-04-15 [1] Bioconductor 3.22 (R 4.5.0)
#> yaml 2.3.10 2024-07-26 [2] RSPM (R 4.5.0)
#>
#> [1] /__w/_temp/Library
#> [2] /usr/local/lib/R/site-library
#> [3] /usr/local/lib/R/library
#> * ── Packages attached to the search path.
#>
#> ──────────────────────────────────────────────────────────────────────────────