Comparative transcriptomic analysis of hybrids and their progenitors
Fabricio Almeida-Silva
VIB-UGent Center for Plant Systems Biology, Ghent, BelgiumLucas Prost-Boxoen
VIB-UGent Center for Plant Systems Biology, Ghent, BelgiumYves Van de Peer
VIB-UGent Center for Plant Systems Biology, Ghent, BelgiumSource:
vignettes/HybridExpress.Rmd
HybridExpress.Rmd
Introduction
The formation of hybrids through the fusion of distinct genomes and subsequent genome duplication (in cases of allopolyploidy) represents a significant evolutionary event with complex effects on cellular biology, particularly gene expression. The impact of such genome mergings and duplications on transcription remain incompletely understood. To bridge this gap, we introduce HybridExpress, a comprehensive package designed to facilitate the comparative transcriptomic analysis of hybrids and their progenitor species. HybridExpress is tailored for RNA-Seq data derived from a ‘hybrid triplet’: the hybrid organism and its two parental species. This package offers a suite of intuitive functions enabling researchers to perform differential expression analysis with ease, generate principal component analysis (PCA) plots to visualize sample grouping, categorize genes into 12 distinct expression pattern groups (as in Rapp, Udall, and Wendel (2009)), and conduct functional analyses. Acknowledging the potential variability in cell and transcriptome size across species and ploidy levels, HybridExpress incorporates features for rigorous normalization of count data. Specifically, it allows for the integration of spike-in controls directly into the normalization process, ensuring accurate transcriptome size adjustments when these standards are present in the RNA-Seq count data (see full methodology in Coate (2023)). By offering these capabilities, HybridExpress provides a robust tool set for unraveling the intricate effects of genome doubling and merging on gene expression, paving the way for novel insights into the cellular biology of hybrid organisms.
Installation
HybridExpress can be installed from Bioconductor with the following code:
if(!requireNamespace('BiocManager', quietly = TRUE))
install.packages('BiocManager')
BiocManager::install("HybridExpress")
# Load package after installation
library(HybridExpress)
set.seed(123) # for reproducibility
Data description
For this vignette, we will use an example data set that comprises
(unpublished) gene expression data (in counts) for Chlamydomonas
reinhardtii. In our lab, we crossed a diploid line of C.
reinhardtii (hereafter “P1”) with a haploid line (hereafter “P2”),
thus generating a triploid line through the merging of the two parental
genomes. The count matrix and sample metadata are stored in
SummarizedExperiment
objects, which is the standard
Bioconductor data structure required by HybridExpress. For
instructions on how to create a SummarizedExperiment
object, check the FAQ section of this vignette.
Let’s load the example data and take a quick look at it:
library(SummarizedExperiment)
# Load data
data(se_chlamy)
# Inspect the `SummarizedExperiment` object
se_chlamy
#> class: SummarizedExperiment
#> dim: 13058 18
#> metadata(0):
#> assays(1): counts
#> rownames(13058): Cre01.g000050 Cre01.g000150 ... ERCC-00170 ERCC-00171
#> rowData names(0):
#> colnames(18): S1 S2 ... S17 S18
#> colData names(2): Ploidy Generation
## Take a look at the colData and count matrix
colData(se_chlamy)
#> DataFrame with 18 rows and 2 columns
#> Ploidy Generation
#> <character> <factor>
#> S1 diploid P1
#> S2 diploid P1
#> S3 diploid P1
#> S4 diploid P1
#> S5 diploid P1
#> ... ... ...
#> S14 triploid F1
#> S15 triploid F1
#> S16 triploid F1
#> S17 triploid F1
#> S18 triploid F1
assay(se_chlamy) |> head()
#> S1 S2 S3 S4 S5 S6 S7 S8 S9 S10 S11 S12 S13 S14
#> Cre01.g000050 31 21 26 33 19 48 17 6 13 6 10 7 31 22
#> Cre01.g000150 50 29 35 99 11 58 51 46 41 33 28 14 53 25
#> Cre01.g000200 36 26 24 29 17 36 14 15 12 8 14 7 25 24
#> Cre01.g000250 440 272 394 332 283 585 272 274 255 160 225 235 405 391
#> Cre01.g000300 1242 839 1216 1251 811 1785 1341 1306 1122 877 844 1082 1704 1739
#> Cre01.g000350 412 264 294 336 252 478 233 221 195 155 190 201 299 272
#> S15 S16 S17 S18
#> Cre01.g000050 29 22 21 16
#> Cre01.g000150 37 17 24 21
#> Cre01.g000200 24 26 18 21
#> Cre01.g000250 358 339 340 332
#> Cre01.g000300 1524 1720 1517 1243
#> Cre01.g000350 276 324 246 275
table(se_chlamy$Ploidy, se_chlamy$Generation)
#>
#> F1 P1 P2
#> diploid 0 6 0
#> haploid 0 0 6
#> triploid 6 0 0
As you can see, the count matrix contains 13058 genes and 18 samples, with 6 replicates for parent 1 (P1, diploid), 6 replicates for parent 2 (P2, haploid), and 6 replicates for the progeny (F1, triploid).
Adding midparent expression values
First of all, you’d want to add in your count matrix in
silico samples that contain the expression values of the midparent.
This can be done with the function
add_midparent_expression()
, which takes a random sample
pair (one sample from each parent, sampling without replacement) and
calculates the midparent expression value in one of three ways:
- Mean (default): get the mean expression of the two samples.
- Sum: get the sum of the two samples.
- Weighted mean: get the weighted mean of the two samples by multiplying the expression value of each parent by a weight. Typically, this can be used if the two parents have different ploidy levels, and the weights would correspond to the ploidy level of each parent.
For this function, besides specifying the method to obtain the
midparent expression values (i.e., “mean”, “sum”, or “weightedmean”),
users must also specify the name of the column in colData
that contains information about the generations (default:
“Generation”), as well as the levels corresponding to each
parent (default: “P1” and “P2” for parents 1 and 2,
respectively).
# Add midparent expression using the mean of the counts
se <- add_midparent_expression(se_chlamy)
head(assay(se))
#> S1 S2 S3 S4 S5 S6 S7 S8 S9 S10 S11 S12 S13 S14
#> Cre01.g000050 31 21 26 33 19 48 17 6 13 6 10 7 31 22
#> Cre01.g000150 50 29 35 99 11 58 51 46 41 33 28 14 53 25
#> Cre01.g000200 36 26 24 29 17 36 14 15 12 8 14 7 25 24
#> Cre01.g000250 440 272 394 332 283 585 272 274 255 160 225 235 405 391
#> Cre01.g000300 1242 839 1216 1251 811 1785 1341 1306 1122 877 844 1082 1704 1739
#> Cre01.g000350 412 264 294 336 252 478 233 221 195 155 190 201 299 272
#> S15 S16 S17 S18 midparent1 midparent2 midparent3 midparent4
#> Cre01.g000050 29 22 21 16 18 27 14 20
#> Cre01.g000150 37 17 24 21 32 46 38 56
#> Cre01.g000200 24 26 18 21 19 22 20 18
#> Cre01.g000250 358 339 340 332 310 372 273 284
#> Cre01.g000300 1524 1720 1517 1243 1030 1331 1072 1166
#> Cre01.g000350 276 324 246 275 242 316 242 268
#> midparent5 midparent6
#> Cre01.g000050 18 22
#> Cre01.g000150 31 46
#> Cre01.g000200 16 24
#> Cre01.g000250 278 348
#> Cre01.g000300 1076 1182
#> Cre01.g000350 242 304
# Alternative 1: using the sum of the counts
add_midparent_expression(se_chlamy, method = "sum") |>
assay() |>
head()
#> S1 S2 S3 S4 S5 S6 S7 S8 S9 S10 S11 S12 S13 S14
#> Cre01.g000050 31 21 26 33 19 48 17 6 13 6 10 7 31 22
#> Cre01.g000150 50 29 35 99 11 58 51 46 41 33 28 14 53 25
#> Cre01.g000200 36 26 24 29 17 36 14 15 12 8 14 7 25 24
#> Cre01.g000250 440 272 394 332 283 585 272 274 255 160 225 235 405 391
#> Cre01.g000300 1242 839 1216 1251 811 1785 1341 1306 1122 877 844 1082 1704 1739
#> Cre01.g000350 412 264 294 336 252 478 233 221 195 155 190 201 299 272
#> S15 S16 S17 S18 midparent1 midparent2 midparent3 midparent4
#> Cre01.g000050 29 22 21 16 43 26 58 46
#> Cre01.g000150 37 17 24 21 86 25 86 140
#> Cre01.g000200 24 26 18 21 38 24 50 41
#> Cre01.g000250 358 339 340 332 666 518 810 587
#> Cre01.g000300 1524 1720 1517 1243 2557 1893 2629 2373
#> Cre01.g000350 276 324 246 275 527 453 668 531
#> midparent5 midparent6
#> Cre01.g000050 37 27
#> Cre01.g000150 96 62
#> Cre01.g000200 51 34
#> Cre01.g000250 714 432
#> Cre01.g000300 2548 1716
#> Cre01.g000350 633 419
# Alternative 2: using the weighted mean of the counts (weights = ploidy)
w <- c(2, 1) # P1 = diploid; P2 = haploid
add_midparent_expression(se_chlamy, method = "weightedmean", weights = w) |>
assay() |>
head()
#> S1 S2 S3 S4 S5 S6 S7 S8 S9 S10 S11 S12 S13 S14
#> Cre01.g000050 31 21 26 33 19 48 17 6 13 6 10 7 31 22
#> Cre01.g000150 50 29 35 99 11 58 51 46 41 33 28 14 53 25
#> Cre01.g000200 36 26 24 29 17 36 14 15 12 8 14 7 25 24
#> Cre01.g000250 440 272 394 332 283 585 272 274 255 160 225 235 405 391
#> Cre01.g000300 1242 839 1216 1251 811 1785 1341 1306 1122 877 844 1082 1704 1739
#> Cre01.g000350 412 264 294 336 252 478 233 221 195 155 190 201 299 272
#> S15 S16 S17 S18 midparent1 midparent2 midparent3 midparent4
#> Cre01.g000050 29 22 21 16 25 29 39 22
#> Cre01.g000150 37 17 24 21 27 30 29 16
#> Cre01.g000200 24 26 18 21 27 32 33 21
#> Cre01.g000250 358 339 340 332 181 232 270 210
#> Cre01.g000300 1524 1720 1517 1243 1453 1576 1753 1532
#> Cre01.g000350 276 324 246 275 166 202 223 165
#> midparent5 midparent6
#> Cre01.g000050 26 17
#> Cre01.g000150 48 15
#> Cre01.g000200 29 17
#> Cre01.g000250 202 148
#> Cre01.g000300 1705 1125
#> Cre01.g000350 186 136
We will proceed our analyses with the midparent expression values
obtained from the mean of the counts, stored in the se
object.
Normalizing count data
To normalize count data, the function add_size_factors()
calculates size factors (used by DESeq2 for
normalization) and adds them as an extra column in the colData slot of
your SummarizedExperiment
object. Such size factors are
calculated using one of two methods:
By library size (default) using the ‘median of ratios’ method implemented in DESeq2.
By cell size/biomass using spike-in controls (if available). If spike-in controls are present in the count matrix, you can use them for normalization by setting
spikein = TRUE
and specifying the pattern used to indicate rows that contain spike-ins (usually they start with ERCC)1. Normalization with spike-in controls is particularly useful if the amount of mRNA per cell is not equal between generations (due to, for instance, different ploidy levels, which in turn can lead to different cell sizes and/or biomass).
In our example data set, spike-in controls are available in the last rows of the count matrix. Let’s take a look at them.
# Show last rows of the count matrix
assay(se) |>
tail()
#> S1 S2 S3 S4 S5 S6 S7 S8 S9 S10 S11 S12 S13
#> ERCC-00163 74 75 55 77 51 84 132 127 93 108 79 102 66
#> ERCC-00164 0 4 0 1 4 1 6 4 1 4 3 2 5
#> ERCC-00165 147 139 87 165 118 179 236 246 139 218 176 145 118
#> ERCC-00168 2 5 2 2 2 6 5 1 1 5 0 3 4
#> ERCC-00170 97 95 73 101 70 118 186 148 110 167 110 103 72
#> ERCC-00171 4644 4959 3554 5357 4170 5946 8207 7915 4992 7843 6455 5884 4537
#> S14 S15 S16 S17 S18 midparent1 midparent2 midparent3 midparent4
#> ERCC-00163 67 62 80 88 47 67 96 101 90
#> ERCC-00164 4 3 5 4 2 2 2 4 2
#> ERCC-00165 143 115 185 192 136 132 198 192 155
#> ERCC-00168 6 2 11 5 2 1 6 3 2
#> ERCC-00170 99 64 121 103 84 92 142 122 102
#> ERCC-00171 5665 4598 6168 5771 4747 5004 6894 6437 5620
#> midparent5 midparent6
#> ERCC-00163 92 84
#> ERCC-00164 5 0
#> ERCC-00165 177 143
#> ERCC-00168 4 2
#> ERCC-00170 128 104
#> ERCC-00171 6188 4818
As you can see, rows with spike-in counts start with “ERCC”. Let’s
add size factors to our SummarizedExperiment
object using
spike-in controls.
# Take a look at the original colData
colData(se)
#> DataFrame with 24 rows and 2 columns
#> Ploidy Generation
#> <character> <factor>
#> S1 diploid P1
#> S2 diploid P1
#> S3 diploid P1
#> S4 diploid P1
#> S5 diploid P1
#> ... ... ...
#> midparent2 NA midparent
#> midparent3 NA midparent
#> midparent4 NA midparent
#> midparent5 NA midparent
#> midparent6 NA midparent
# Add size factors estimated from spike-in controls
se <- add_size_factors(se, spikein = TRUE, spikein_pattern = "ERCC")
#> converting counts to integer mode
# Take a look at the new colData
colData(se)
#> DataFrame with 24 rows and 3 columns
#> Ploidy Generation sizeFactor
#> <character> <factor> <numeric>
#> S1 diploid P1 0.856262
#> S2 diploid P1 0.938417
#> S3 diploid P1 0.648653
#> S4 diploid P1 0.969369
#> S5 diploid P1 0.718761
#> ... ... ... ...
#> midparent2 NA midparent 1.278078
#> midparent3 NA midparent 1.201072
#> midparent4 NA midparent 1.000892
#> midparent5 NA midparent 1.186773
#> midparent6 NA midparent 0.871889
Exploratory data analyses
Next, you can perform exploratory analyses of sample clustering to verify if samples group as expected. With HybridExpress, this can be performed using two functions:
-
pca_plot()
: creates principal component analysis (PCA) plots, with colors and shapes (optional) mapped to levels ofcolData
variables; -
plot_samplecor()
: plots a heatmap of hierarchically clustered pairwise sample correlations.
Let’s start with the PCA plot:
# For colData rows with missing values (midparent samples), add "midparent"
se$Ploidy[is.na(se$Ploidy)] <- "midparent"
se$Generation[is.na(se$Generation)] <- "midparent"
# Create PCA plot
pca_plot(se, color_by = "Generation", shape_by = "Ploidy", add_mean = TRUE)
In the plot above, each data point corresponds to a sample, and
colors and shapes are mapped to levels of the variables specified in the
arguments color_by
and shape_by
, respectively.
Besides, by specifying add_mean = TRUE
, we added a diamond
shape indicating the mean PC coordinates based on the variable in
color_by
(here, “Generation”).
Now, let’s plot the heatmap of sample correlations:
# Plot a heatmap of sample correlations
plot_samplecor(se, coldata_cols = c("Generation", "Ploidy"))
We can see that samples group well together both in the PCA plot and in the correlation heatmap.
Of note, both pca_plot()
and
plot_samplecor()
use only the top 500 genes with the
highest variances to create the plot. This is because genes with low
variances (i.e., genes that do not vary much across samples) are
uninformative and typically only add noise. You can change this number
(to use more or less genes) in the ntop
argument of both
functions.
Identifying differentially expressed genes between hybrids and their parents
To compare gene expression levels of hybrids to their progenitor
species, you can use the function get_deg_list()
. This
function performs differential expression analyses using DESeq2 and
returns a list of data frames with gene-wise test statistics for the
following contrasts:
-
P2_vs_P1
: parent 2 (numerator) versus parent 1 (denominator). -
F1_vs_P1
: hybrid (numerator) versus parent 1 (denominator). -
F1_vs_P2
: hybrid (numerator) versus parent 2 (denominator). -
F1_vs_midparent
: hybrid (numerator) vs midparent (denominator).
The size factors estimated with add_size_factors()
are
used for normalization before differential expression testing. Let’s use
get_deg_list()
to get differentially expressed genes (DEGs)
for each contrast.
# Get a list of differentially expressed genes for each contrast
deg_list <- get_deg_list(se)
#> using pre-existing size factors
#> estimating dispersions
#> gene-wise dispersion estimates
#> mean-dispersion relationship
#> final dispersion estimates
#> fitting model and testing
# Inspecting the output
## Getting contrast names
names(deg_list)
#> [1] "P2_vs_P1" "F1_vs_P1" "F1_vs_P2" "F1_vs_midparent"
## Accessing gene-wise test statistics for contrast `F1_vs_P1`
head(deg_list$F1_vs_P1)
#> baseMean log2FoldChange lfcSE stat pvalue
#> Cre01.g000450 39.59273 -1.1632346 0.2165407 -5.371898 7.791211e-08
#> Cre01.g000750 103.10296 0.8807307 0.1842856 4.779162 1.760270e-06
#> Cre01.g000900 471.44084 0.6807925 0.1980730 3.437079 5.880239e-04
#> Cre01.g001000 10.34574 -6.6304915 0.9631384 -6.884256 5.809030e-12
#> Cre01.g001100 453.87151 -0.6457115 0.1506903 -4.285025 1.827187e-05
#> Cre01.g001200 159.03643 0.7231060 0.1946090 3.715685 2.026536e-04
#> padj
#> Cre01.g000450 4.653531e-07
#> Cre01.g000750 8.329159e-06
#> Cre01.g000900 1.707442e-03
#> Cre01.g001000 6.659069e-11
#> Cre01.g001100 7.200448e-05
#> Cre01.g001200 6.453951e-04
## Counting the number of DEGs per contrast
sapply(deg_list, nrow)
#> P2_vs_P1 F1_vs_P1 F1_vs_P2 F1_vs_midparent
#> 8698 5476 7350 4348
To summarize the frequencies of up- and down-regulated genes per
contrast in a single data frame, use the function
get_deg_counts()
.
# Get a data frame with DEG frequencies for each contrast
deg_counts <- get_deg_counts(deg_list)
deg_counts
#> contrast up down total perc_up perc_down perc_total
#> 1 P2_vs_P1 828 7870 8698 6.4 60.7 67.1
#> 2 F1_vs_P1 1471 4005 5476 11.3 30.9 42.2
#> 3 F1_vs_P2 6487 863 7350 50.0 6.7 56.7
#> 4 F1_vs_midparent 2620 1728 4348 20.2 13.3 33.5
It is important to note that the columns perc_up
,
perc_down
, and perc_total
show the percentages
of up-regulated, down-regulated, and all differentially expressed genes
relative to the total number of genes in the count matrix. The total
number of genes in the count matrix is stored in the ngenes
attribute of the list returned by get_deg_list()
:
attributes(deg_list)$ngenes
#> [1] 12966
However, since the count matrix usually does not include all genes in
the genome (e.g., lowly expressed genes and genes with low variance are
usually filtered out), the percentages in perc_up
,
perc_down
, and perc_total
are not relative to
the total number of genes in the genome. To use the total number of
genes in the genome as the reference, we need to update the
ngenes
attribute of the DEG list with the appropriate
number as follows:
# Total number of genes in the C. reinhardtii genome (v6.1): 16883
attributes(deg_list)$ngenes <- 16883
Then, we can run get_deg_counts()
again to get the
percentages relative to the total number of genes in the genome.
deg_counts <- get_deg_counts(deg_list)
deg_counts
#> contrast up down total perc_up perc_down perc_total
#> 1 P2_vs_P1 828 7870 8698 4.9 46.6 51.5
#> 2 F1_vs_P1 1471 4005 5476 8.7 23.7 32.4
#> 3 F1_vs_P2 6487 863 7350 38.4 5.1 43.5
#> 4 F1_vs_midparent 2620 1728 4348 15.5 10.2 25.8
Finally, we can summarize everything in a single publication-ready
figure using the plot plot_expression_triangle()
, which
shows the ‘experimental trio’ (i.e., hybrid and its progenitors) as a
triangle, with the frequencies of DEGs indicated.
# Plot expression triangle
plot_expression_triangle(deg_counts)
This figure is commonly used in publications, and it was inspired by Rapp, Udall, and Wendel (2009). For each edge (line), numbers in the middle (in bold) indicate the frequency of all DEGs, and numbers at the ends (close to boxes) indicate the frequency of up-regulated genes in each generation. For instance, the figure above shows that, for the contrast between F1 and P1, there are 5476 DEGs (32.4% of the genome), of which 1471 are up-regulated in F1, and 4005 are up-regulated in P1.
For a custom figure, you can also specify your own color palette and labels for the boxes. For example:
# Create vectors (length 4) of colors and box labels
pal <- c("springgreen4", "darkorange3", "mediumpurple4", "mediumpurple3")
labels <- c("Parent 1\n(2n)", "Parent 2\n(n)", "Progeny\n(3n)", "Midparent")
plot_expression_triangle(deg_counts, palette = pal, box_labels = labels)
Expression-based gene classification
After identifying DEGs for different contrasts, you’d typically want
to classify your genes into expression partitions based on their
expression patterns. This can be performed with the function
expression_partitioning()
, which classifies genes into one
of the 12 categories as in Rapp,
Udall, and Wendel (2009), and into 5 major
classes that summarize the 12 categories. The five
classes are:
- Transgressive up-regulation (UP): gene is up-regulated in the hybrid as compared to both parents.
- Transgressive down-regulation (DOWN): gene is down-regulated in the hybrid as compared to both parents.
- Additivity (ADD): gene expression in the hybrid is the mean of both parents (additive effect).
- Expression-level dominance toward parent 1 (ELD_P1): gene expression in the hybrid is the same as in parent 1, but different from parent 2.
- Expression-level dominance toward parent 2 (ELD_P2): gene expression in the hybrid is the same as in parent 2, but different from parent 1.
# Classify genes in expression partitions
exp_partitions <- expression_partitioning(deg_list)
# Inspect the output
head(exp_partitions)
#> Gene Category Class lFC_F1_vs_P1 lFC_F1_vs_P2
#> 1 Cre01.g003650 1 ADD 0.7838125 -1.0484464
#> 2 Cre01.g005150 1 ADD 1.0473362 -0.5041601
#> 3 Cre01.g008600 1 ADD 5.0518384 -1.4840829
#> 4 Cre01.g013500 1 ADD 2.1099265 -1.5329846
#> 5 Cre01.g034850 1 ADD 1.5838851 -0.7611868
#> 6 Cre01.g800005 1 ADD 1.4928449 -0.9315119
# Count number of genes per category
table(exp_partitions$Category)
#>
#> 1 2 3 4 5 6 7 8 9 10 11 12
#> 70 262 66 3283 77 287 214 651 280 147 1760 1666
# Count number of genes per class
table(exp_partitions$Class)
#>
#> UP DOWN ADD ELD_P1 ELD_P2
#> 1015 427 1736 3563 2022
To visualize the expression partitions as a scatter plot of
expression divergences, you can use the function
plot_expression_partitions()
.
# Plot partitions as a scatter plot of divergences
plot_expression_partitions(exp_partitions, group_by = "Category")
By default, genes are grouped by Category
. However, you
can also group genes by Class
as follows:
# Group by `Class`
plot_expression_partitions(exp_partitions, group_by = "Class")
You can also visualize the frequencies of genes in each partition
with the function plot_partition_frequencies()
.
# Visualize frequency of genes in each partition
## By `Category` (default)
plot_partition_frequencies(exp_partitions)
## By `Class`
plot_partition_frequencies(exp_partitions, group_by = "Class")
Overrepresentation analysis of functional terms
Lastly, you’d want to explore whether gene sets of interest (e.g.,
up-regulated genes in each contrast) are enriched in any particular GO
term, pathway, protein domain, etc. For that, you will use the function
ora()
, which performs overrepresentation analysis for a
gene set given a data frame of functional annotation for each gene.
Here, we will use an example data set with GO annotation for C. reinhardtii genes. This data set illustrates how the annotation data frame must be shaped: gene ID in the first column, and functional annotations in other columns.
# Load example functional annotation (GO terms)
data(go_chlamy)
head(go_chlamy)
#> gene GO
#> 1 Cre01.g000050 <NA>
#> 2 Cre01.g000100 <NA>
#> 3 Cre01.g000150 membrane
#> 4 Cre01.g000150 metal ion transport
#> 5 Cre01.g000150 metal ion transmembrane transporter activity
#> 6 Cre01.g000150 transmembrane transport
To demonstrate the usage of ora()
, let’s check if we can
find enrichment for any GO term among genes assigned to class “ADD”.
# Get a vector of genes in class "ADD"
genes_add <- exp_partitions$Gene[exp_partitions$Class == "ADD"]
head(genes_add)
#> [1] "Cre01.g003650" "Cre01.g005150" "Cre01.g008600" "Cre01.g013500"
#> [5] "Cre01.g034850" "Cre01.g800005"
# Get background genes - genes in the count matrix
bg <- rownames(se)
# Perform overrepresentation analysis
ora_add <- ora(genes_add, go_chlamy, background = bg)
# Inspect results
head(ora_add)
#> term genes all pval
#> 20 aminoacyl-tRNA ligase activity 14 33 3.811521e-05
#> 84 cytoplasm 43 151 7.379762e-07
#> 97 DNA replication 17 37 1.471956e-06
#> 114 endopeptidase activity 6 10 7.400151e-04
#> 115 endoplasmic reticulum 9 17 1.193854e-04
#> 118 eukaryotic translation initiation factor 3 complex 7 9 2.144087e-05
#> padj category
#> 20 1.766005e-03 GO
#> 84 7.693402e-05 GO
#> 97 1.227611e-04 GO
#> 114 2.204188e-02 GO
#> 115 4.978369e-03 GO
#> 118 1.277263e-03 GO
Example 1: overrepresentation analyses for all expression-based classes
In the example above, we performed an overrepresentation analysis in genes associated with class “ADD”. What if we wanted to do the same for all classes?
In that case, you could run ora()
multiple times by
looping over each class. In details, you would do the following:
- For each class, create a vector of genes associated with it;
- Run
ora()
to get a data frame with ORA results.
Below you can find an example on how to do it using
lapply()
. Results for each class will be stored in elements
of a list object.
# Create a list of genes in each class
genes_by_class <- split(exp_partitions$Gene, exp_partitions$Class)
names(genes_by_class)
#> [1] "UP" "DOWN" "ADD" "ELD_P1" "ELD_P2"
head(genes_by_class$UP)
#> [1] "Cre01.g029250" "Cre01.g034380" "Cre01.g036700" "Cre01.g049400"
#> [5] "Cre02.g085500" "Cre02.g087150"
# Iterate through each class and perform ORA
ora_classes <- lapply(
genes_by_class, # list through which we will iterate
ora, # function we will apply to each element of the list
annotation = go_chlamy, background = bg # additional arguments to function
)
# Inspect output
ora_classes
#> $UP
#> term genes all pval padj
#> 29 ATP hydrolysis activity 21 119 3.380614e-04 0.0469905310
#> 109 dynein complex 8 13 1.237752e-06 0.0005161427
#> 127 flavin adenine dinucleotide binding 11 42 2.857201e-04 0.0469905310
#> category
#> 29 GO
#> 109 GO
#> 127 GO
#>
#> $DOWN
#> term genes all pval padj
#> 242 oxidoreductase activity 21 274 2.931239e-04 2.444654e-02
#> 271 photosynthesis 13 58 3.656759e-08 7.624343e-06
#> 272 photosynthesis, light harvesting 18 24 1.638332e-22 6.831844e-20
#> 276 photosystem I 5 15 8.644389e-05 9.011776e-03
#> 376 tetrapyrrole biosynthetic process 5 8 1.952599e-06 2.714112e-04
#> category
#> 242 GO
#> 271 GO
#> 272 GO
#> 276 GO
#> 376 GO
#>
#> $ADD
#> term
#> 20 aminoacyl-tRNA ligase activity
#> 84 cytoplasm
#> 97 DNA replication
#> 114 endopeptidase activity
#> 115 endoplasmic reticulum
#> 118 eukaryotic translation initiation factor 3 complex
#> 165 intracellular protein transport
#> 227 nuclear pore
#> 250 oxidoreductase activity, acting on the CH-OH group of donors, NAD or NADP as acceptor
#> 284 prefoldin complex
#> 287 proteasome core complex
#> 293 protein folding
#> 294 protein glycosylation
#> 304 protein refolding
#> 390 translation initiation factor activity
#> 410 unfolded protein binding
#> 414 vesicle-mediated transport
#> genes all pval padj category
#> 20 14 33 3.811521e-05 1.766005e-03 GO
#> 84 43 151 7.379762e-07 7.693402e-05 GO
#> 97 17 37 1.471956e-06 1.227611e-04 GO
#> 114 6 10 7.400151e-04 2.204188e-02 GO
#> 115 9 17 1.193854e-04 4.978369e-03 GO
#> 118 7 9 2.144087e-05 1.277263e-03 GO
#> 165 25 73 4.442015e-06 3.087201e-04 GO
#> 227 6 9 3.339022e-04 1.160310e-02 GO
#> 250 10 27 1.739601e-03 4.533836e-02 GO
#> 284 4 4 3.203862e-04 1.160310e-02 GO
#> 287 9 15 3.165895e-05 1.650223e-03 GO
#> 293 29 81 2.663812e-07 3.702698e-05 GO
#> 294 6 10 7.400151e-04 2.204188e-02 GO
#> 304 10 10 1.809949e-09 7.547489e-07 GO
#> 390 9 21 8.722652e-04 2.424897e-02 GO
#> 410 19 38 6.478683e-08 1.350805e-05 GO
#> 414 13 41 1.955735e-03 4.797302e-02 GO
#>
#> $ELD_P1
#> term genes all pval padj category
#> 38 binding 88 208 2.494608e-06 0.001040251 GO
#> 239 nucleus 77 193 1.141172e-04 0.023793431 GO
#> 345 rRNA processing 15 24 3.464517e-04 0.048156786 GO
#>
#> $ELD_P2
#> term genes all
#> 28 aspartic-type endopeptidase activity 6 8
#> 170 iron ion binding 30 93
#> 242 oxidoreductase activity 71 274
#> 271 photosynthesis 26 58
#> 278 photosystem II 17 32
#> 320 proton-transporting two-sector ATPase complex, catalytic domain 6 8
#> pval padj category
#> 28 3.007465e-04 2.090188e-02 GO
#> 170 4.577693e-05 4.772245e-03 GO
#> 242 6.069186e-06 8.436168e-04 GO
#> 271 1.191954e-07 4.970448e-05 GO
#> 278 9.602908e-07 2.002206e-04 GO
#> 320 3.007465e-04 2.090188e-02 GO
To do the same for each expression-based category (not class), you’d
need to replace Class
with Category
in the
split()
function (see example above).
Example 2: overrepresentation analyses for differentially expressed genes
The same you can use lapply()
to loop through each
expression class, you can also loop through each contrast in the list
returned by get_deg_list()
, and perform ORA for up- and
down-regulated genes. Below you can find an example:
# Get up-regulated genes for each contrast
up_genes <- lapply(deg_list, function(x) rownames(x[x$log2FoldChange > 0, ]))
names(up_genes)
#> [1] "P2_vs_P1" "F1_vs_P1" "F1_vs_P2" "F1_vs_midparent"
head(up_genes$F1_vs_P1)
#> [1] "Cre01.g000750" "Cre01.g000900" "Cre01.g001200" "Cre01.g002750"
#> [5] "Cre01.g003524" "Cre01.g003650"
# Perform ORA
ora_up <- lapply(
up_genes,
ora,
annotation = go_chlamy, background = bg
)
ora_up
#> $P2_vs_P1
#> [1] term genes all pval padj category
#> <0 rows> (or 0-length row.names)
#>
#> $F1_vs_P1
#> term genes all pval padj category
#> 109 dynein complex 8 13 2.049704e-05 0.008547264 GO
#>
#> $F1_vs_P2
#> term genes all pval
#> 38 binding 138 208 1.214715e-06
#> 96 DNA repair 56 83 9.520576e-04
#> 144 GTP binding 72 111 1.095054e-03
#> 169 ion transport 33 45 1.225270e-03
#> 209 microtubule-based process 10 10 9.792020e-04
#> 239 nucleus 125 193 2.252487e-05
#> 303 protein polymerization 10 10 9.792020e-04
#> 304 protein refolding 10 10 9.792020e-04
#> 307 protein serine/threonine phosphatase activity 20 24 7.724253e-04
#> 338 RNA binding 101 157 1.948736e-04
#> 384 transcription, DNA-templated 28 36 5.951883e-04
#> padj category
#> 38 0.000506536 GO
#> 96 0.045369693 GO
#> 144 0.045663753 GO
#> 169 0.046448881 GO
#> 209 0.045369693 GO
#> 239 0.004696435 GO
#> 303 0.045369693 GO
#> 304 0.045369693 GO
#> 307 0.045369693 GO
#> 338 0.027087430 GO
#> 384 0.045369693 GO
#>
#> $F1_vs_midparent
#> term genes all pval
#> 80 cyclic nucleotide biosynthetic process 38 108 1.983329e-04
#> 109 dynein complex 9 13 1.789139e-04
#> 166 intracellular signal transduction 38 110 3.036659e-04
#> 168 ion channel activity 20 48 5.611617e-04
#> 169 ion transport 21 45 5.765412e-05
#> 290 protein dephosphorylation 21 44 3.792684e-05
#> 307 protein serine/threonine phosphatase activity 15 24 7.437911e-06
#> padj category
#> 80 0.016540962 GO
#> 109 0.016540962 GO
#> 166 0.021104781 GO
#> 168 0.033429202 GO
#> 169 0.008013923 GO
#> 290 0.007907747 GO
#> 307 0.003101609 GO
Likewise, for down-regulated genes, you need to replace the
>
symbol with a <
symbol in the
anonymous function to subset rows.
FAQ
- How do I create a
SummarizedExperiment
object?
A SummarizedExperiment
is a data structure (an S4 class)
that can be used to store, in a single object, the following
elements:
- assay: A quantitative matrix with features in rows and samples in columns. In the context of HybridExpress, this would be a gene expression matrix (in counts) with genes in rows and samples in columns.
- colData: A data frame of sample metadata with samples in rows and variables that describe samples (e.g., tissue, treatment, and other covariates) in columns.
- rowData: A data frame of gene metadata with genes in rows and variables that describe genes (e.g., chromosome name, alternative IDs, functional information, etc) in columns.
For this package, you must have an assay containing the count matrix and a colData slot with sample metadata. rowData can be present, but it is not required.
To demonstrate how to create a SummarizedExperiment
object, we will extract the assay and colData from the
example object se_chlamy
that comes with this package.
# Get count matrix
count_matrix <- assay(se_chlamy)
head(count_matrix)
#> S1 S2 S3 S4 S5 S6 S7 S8 S9 S10 S11 S12 S13 S14
#> Cre01.g000050 31 21 26 33 19 48 17 6 13 6 10 7 31 22
#> Cre01.g000150 50 29 35 99 11 58 51 46 41 33 28 14 53 25
#> Cre01.g000200 36 26 24 29 17 36 14 15 12 8 14 7 25 24
#> Cre01.g000250 440 272 394 332 283 585 272 274 255 160 225 235 405 391
#> Cre01.g000300 1242 839 1216 1251 811 1785 1341 1306 1122 877 844 1082 1704 1739
#> Cre01.g000350 412 264 294 336 252 478 233 221 195 155 190 201 299 272
#> S15 S16 S17 S18
#> Cre01.g000050 29 22 21 16
#> Cre01.g000150 37 17 24 21
#> Cre01.g000200 24 26 18 21
#> Cre01.g000250 358 339 340 332
#> Cre01.g000300 1524 1720 1517 1243
#> Cre01.g000350 276 324 246 275
# Get colData (data frame of sample metadata)
coldata <- colData(se_chlamy)
head(coldata)
#> DataFrame with 6 rows and 2 columns
#> Ploidy Generation
#> <character> <factor>
#> S1 diploid P1
#> S2 diploid P1
#> S3 diploid P1
#> S4 diploid P1
#> S5 diploid P1
#> S6 diploid P1
Once you have these two objects, you can create a
SummarizedExperiment
object with:
# Create a SummarizedExperiment object
new_se <- SummarizedExperiment(
assays = list(counts = count_matrix),
colData = coldata
)
new_se
#> class: SummarizedExperiment
#> dim: 13058 18
#> metadata(0):
#> assays(1): counts
#> rownames(13058): Cre01.g000050 Cre01.g000150 ... ERCC-00170 ERCC-00171
#> rowData names(0):
#> colnames(18): S1 S2 ... S17 S18
#> colData names(2): Ploidy Generation
For more details on this data structure, read the vignette of the SummarizedExperiment package.
Session information
This document was created under the following conditions:
#> ─ Session info ───────────────────────────────────────────────────────────────
#> setting value
#> version R version 4.4.1 (2024-06-14)
#> os Ubuntu 22.04.4 LTS
#> system x86_64, linux-gnu
#> ui X11
#> language en
#> collate en_US.UTF-8
#> ctype en_US.UTF-8
#> tz UTC
#> date 2024-09-05
#> pandoc 3.3 @ /usr/bin/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#> package * version date (UTC) lib source
#> abind 1.4-5 2016-07-21 [1] RSPM (R 4.4.0)
#> Biobase * 2.65.1 2024-08-28 [1] Bioconductor 3.20 (R 4.4.1)
#> BiocGenerics * 0.51.1 2024-09-03 [1] Bioconductor 3.20 (R 4.4.1)
#> BiocManager 1.30.25 2024-08-28 [1] RSPM (R 4.4.0)
#> BiocParallel 1.39.0 2024-05-01 [1] Bioconductor 3.20 (R 4.4.0)
#> BiocStyle * 2.33.1 2024-06-12 [1] Bioconductor 3.20 (R 4.4.0)
#> bookdown 0.40 2024-07-02 [1] RSPM (R 4.4.0)
#> bslib 0.8.0 2024-07-29 [2] RSPM (R 4.4.0)
#> cachem 1.1.0 2024-05-16 [2] RSPM (R 4.4.0)
#> circlize 0.4.16 2024-02-20 [1] RSPM (R 4.4.0)
#> cli 3.6.3 2024-06-21 [2] RSPM (R 4.4.0)
#> clue 0.3-65 2023-09-23 [1] RSPM (R 4.4.0)
#> cluster 2.1.6 2023-12-01 [3] CRAN (R 4.4.1)
#> codetools 0.2-20 2024-03-31 [3] CRAN (R 4.4.1)
#> colorspace 2.1-1 2024-07-26 [1] RSPM (R 4.4.0)
#> ComplexHeatmap 2.21.0 2024-05-01 [1] Bioconductor 3.20 (R 4.4.0)
#> crayon 1.5.3 2024-06-20 [2] RSPM (R 4.4.0)
#> DelayedArray 0.31.11 2024-08-04 [1] Bioconductor 3.20 (R 4.4.1)
#> desc 1.4.3 2023-12-10 [2] RSPM (R 4.4.0)
#> DESeq2 1.45.3 2024-07-24 [1] Bioconductor 3.20 (R 4.4.1)
#> digest 0.6.37 2024-08-19 [2] RSPM (R 4.4.0)
#> doParallel 1.0.17 2022-02-07 [1] RSPM (R 4.4.0)
#> dplyr 1.1.4 2023-11-17 [1] RSPM (R 4.4.0)
#> evaluate 0.24.0 2024-06-10 [2] RSPM (R 4.4.0)
#> fansi 1.0.6 2023-12-08 [2] RSPM (R 4.4.0)
#> farver 2.1.2 2024-05-13 [1] RSPM (R 4.4.0)
#> fastmap 1.2.0 2024-05-15 [2] RSPM (R 4.4.0)
#> foreach 1.5.2 2022-02-02 [1] RSPM (R 4.4.0)
#> fs 1.6.4 2024-04-25 [2] RSPM (R 4.4.0)
#> generics 0.1.3 2022-07-05 [1] RSPM (R 4.4.0)
#> GenomeInfoDb * 1.41.1 2024-05-24 [1] Bioconductor 3.20 (R 4.4.0)
#> GenomeInfoDbData 1.2.12 2024-06-24 [1] Bioconductor
#> GenomicRanges * 1.57.1 2024-06-12 [1] Bioconductor 3.20 (R 4.4.0)
#> GetoptLong 1.0.5 2020-12-15 [1] RSPM (R 4.4.0)
#> ggplot2 3.5.1 2024-04-23 [1] RSPM (R 4.4.0)
#> GlobalOptions 0.1.2 2020-06-10 [1] RSPM (R 4.4.0)
#> glue 1.7.0 2024-01-09 [2] RSPM (R 4.4.0)
#> gtable 0.3.5 2024-04-22 [1] RSPM (R 4.4.0)
#> highr 0.11 2024-05-26 [2] RSPM (R 4.4.0)
#> htmltools 0.5.8.1 2024-04-04 [2] RSPM (R 4.4.0)
#> htmlwidgets 1.6.4 2023-12-06 [2] RSPM (R 4.4.0)
#> httr 1.4.7 2023-08-15 [1] RSPM (R 4.4.0)
#> HybridExpress * 1.1.1 2024-09-05 [1] Bioconductor
#> IRanges * 2.39.2 2024-07-17 [1] Bioconductor 3.20 (R 4.4.1)
#> iterators 1.0.14 2022-02-05 [1] RSPM (R 4.4.0)
#> jquerylib 0.1.4 2021-04-26 [2] RSPM (R 4.4.0)
#> jsonlite 1.8.8 2023-12-04 [2] RSPM (R 4.4.0)
#> knitr 1.48 2024-07-07 [2] RSPM (R 4.4.0)
#> labeling 0.4.3 2023-08-29 [1] RSPM (R 4.4.0)
#> lattice 0.22-6 2024-03-20 [3] CRAN (R 4.4.1)
#> lifecycle 1.0.4 2023-11-07 [2] RSPM (R 4.4.0)
#> locfit 1.5-9.10 2024-06-24 [1] RSPM (R 4.4.0)
#> magrittr 2.0.3 2022-03-30 [2] RSPM (R 4.4.0)
#> Matrix 1.7-0 2024-04-26 [3] CRAN (R 4.4.1)
#> MatrixGenerics * 1.17.0 2024-05-01 [1] Bioconductor 3.20 (R 4.4.0)
#> matrixStats * 1.3.0 2024-04-11 [1] RSPM (R 4.4.0)
#> munsell 0.5.1 2024-04-01 [1] RSPM (R 4.4.0)
#> patchwork 1.2.0 2024-01-08 [1] RSPM (R 4.4.0)
#> pillar 1.9.0 2023-03-22 [2] RSPM (R 4.4.0)
#> pkgconfig 2.0.3 2019-09-22 [2] RSPM (R 4.4.0)
#> pkgdown 2.1.0 2024-07-06 [1] RSPM (R 4.4.0)
#> png 0.1-8 2022-11-29 [1] RSPM (R 4.4.0)
#> R6 2.5.1 2021-08-19 [2] RSPM (R 4.4.0)
#> ragg 1.3.2 2024-05-15 [2] RSPM (R 4.4.0)
#> RColorBrewer 1.1-3 2022-04-03 [1] RSPM (R 4.4.0)
#> Rcpp 1.0.13 2024-07-17 [2] RSPM (R 4.4.0)
#> rjson 0.2.22 2024-08-20 [1] RSPM (R 4.4.0)
#> rlang 1.1.4 2024-06-04 [2] RSPM (R 4.4.0)
#> rmarkdown 2.28 2024-08-17 [1] RSPM (R 4.4.0)
#> S4Arrays 1.5.7 2024-08-06 [1] Bioconductor 3.20 (R 4.4.1)
#> S4Vectors * 0.43.2 2024-07-17 [1] Bioconductor 3.20 (R 4.4.1)
#> sass 0.4.9 2024-03-15 [2] RSPM (R 4.4.0)
#> scales 1.3.0 2023-11-28 [1] RSPM (R 4.4.0)
#> sessioninfo 1.2.2 2021-12-06 [2] RSPM (R 4.4.0)
#> shape 1.4.6.1 2024-02-23 [1] RSPM (R 4.4.0)
#> SparseArray 1.5.31 2024-08-04 [1] Bioconductor 3.20 (R 4.4.1)
#> SummarizedExperiment * 1.35.1 2024-06-28 [1] Bioconductor 3.20 (R 4.4.1)
#> systemfonts 1.1.0 2024-05-15 [2] RSPM (R 4.4.0)
#> textshaping 0.4.0 2024-05-24 [2] RSPM (R 4.4.0)
#> tibble 3.2.1 2023-03-20 [2] RSPM (R 4.4.0)
#> tidyselect 1.2.1 2024-03-11 [1] RSPM (R 4.4.0)
#> UCSC.utils 1.1.0 2024-05-01 [1] Bioconductor 3.20 (R 4.4.0)
#> utf8 1.2.4 2023-10-22 [2] RSPM (R 4.4.0)
#> vctrs 0.6.5 2023-12-01 [2] RSPM (R 4.4.0)
#> withr 3.0.1 2024-07-31 [2] RSPM (R 4.4.0)
#> xfun 0.47 2024-08-17 [2] RSPM (R 4.4.0)
#> XVector 0.45.0 2024-05-01 [1] Bioconductor 3.20 (R 4.4.0)
#> yaml 2.3.10 2024-07-26 [2] RSPM (R 4.4.0)
#> zlibbioc 1.51.1 2024-06-05 [1] Bioconductor 3.20 (R 4.4.0)
#>
#> [1] /__w/_temp/Library
#> [2] /usr/local/lib/R/site-library
#> [3] /usr/local/lib/R/library
#>
#> ──────────────────────────────────────────────────────────────────────────────