# Load packages
library(Biostrings)
library(GenomicRanges)
library(SummarizedExperiment)
library(tidyverse)
library(here)
4 Analyzing biological data with the Bioconductor ecosystem
The Bioconductor project was created to develop, support, and disseminate free open source software to analyze biological data. It works as a repository of R packages for biological data analysis, with both “core packages” (developed and maintained by the Bioconductor Core Team), and community-contributed packages. We usually refer to Bioconductor as a “package ecosystem” because its packages are tightly integrated and designed for easy interoperation, such that different packages can be used together with minimal effort. An important facilitator of such interoperability is the existence of standardized data structures, such as GRanges
objects for genomic coordinates, DNA/RNA/AAStringSet
objects for sequences, and SummarizedExperiment
objects for quantitative data.
Let’s start by loading the required packages and data:
4.1 Goals of this lesson
At the end of this lesson, you will be able to:
- read and process sequence data with
Biostrings
- read and process genomic coordinates with
GenomicRanges
- read and process quantitative data (e.g., gene expression) with
SummarizedExperiment
4.2 Working with sequences: the Biostrings
package
The Biostrings
package allows efficient manipulation of biological sequences, and its data classes are standard for all Bioconductor packages that involve sequence analyses. The data classes in Biostrings to represent sets of biological sequences are:
DNAStringSet
: for sets of DNA strings.RNAStringSet
: for sets of RNA strings.AAStringSet
: for sets of amino acid strings.
For a single sequence, the each of the above has a correspoding XString
class (i.e., DNAString
, RNAString
, AAString
). For example, let’s create some XString
objects manually:
# Create `DNAString` object
<- DNAString("ATGGCCGACTCA")
dna_seq dna_seq
12-letter DNAString object
seq: ATGGCCGACTCA
# Convert `DNAString` to `RNAString`
<- RNAString(dna_seq)
rna_seq rna_seq
12-letter RNAString object
seq: AUGGCCGACUCA
# Translate `RNAString` to create `AAString`
<- translate(rna_seq)
aa_seq aa_seq
4-letter AAString object
seq: MADS
In real-world data analyses, we would work with multiple sequences (e.g., a whole genome or a whole proteome). Multiple sequences are stored in XStringSet
objects, which can be read from FASTA files with readXStringSet()
functions, namely:
readDNAStringSet()
: read FASTA file containing multiple DNA sequences.readRNAStringSet()
: read FASTA file containing multiple RNA sequences.readAAStringSet()
: read FASTA file containing multiple AA sequences.
For example, let’s read an example FASTA file that come with the Biostrings package.
# Path to FASTA file containing an ORF
<- system.file("extdata", "someORF.fa", package = "Biostrings")
fasta_file fasta_file
[1] "/home/faalm/R/x86_64-pc-linux-gnu-library/4.3/Biostrings/extdata/someORF.fa"
# Read FASTA file as a DNAStringSet object
<- readDNAStringSet(fasta_file)
dna_seqs dna_seqs
DNAStringSet object of length 7:
width seq names
[1] 5573 ACTTGTAAATATATCTTTTATTT...CTTATCGACCTTATTGTTGATAT YAL001C TFC3 SGDI...
[2] 5825 TTCCAAGGCCGATGAATTCGACT...AGTAAATTTTTTTCTATTCTCTT YAL002W VPS8 SGDI...
[3] 2987 CTTCATGTCAGCCTGCACTTCTG...TGGTACTCATGTAGCTGCCTCAT YAL003W EFB1 SGDI...
[4] 3929 CACTCATATCGGGGGTCTTACTT...TGTCCCGAAACACGAAAAAGTAC YAL005C SSA1 SGDI...
[5] 2648 AGAGAAAGAGTTTCACTTCTTGA...ATATAATTTATGTGTGAACATAG YAL007C ERP2 SGDI...
[6] 2597 GTGTCCGGGCCTCGCAGGCGTTC...AAGTTTTGGCAGAATGTACTTTT YAL008W FUN14 SGD...
[7] 2780 CAAGATAATGTCAAAGTTAGTGG...GCTAAGGAAGAAAAAAAAATCAC YAL009W SPO7 SGDI...
Some examples of what you can do to explore XStringSet
objects include:
# width(): get sequence lengths
width(dna_seqs)
[1] 5573 5825 2987 3929 2648 2597 2780
# names(): get sequence names
names(dna_seqs)
[1] "YAL001C TFC3 SGDID:S0000001, Chr I from 152168-146596, reverse complement, Verified ORF"
[2] "YAL002W VPS8 SGDID:S0000002, Chr I from 142709-148533, Verified ORF"
[3] "YAL003W EFB1 SGDID:S0000003, Chr I from 141176-144162, Verified ORF"
[4] "YAL005C SSA1 SGDID:S0000004, Chr I from 142433-138505, reverse complement, Verified ORF"
[5] "YAL007C ERP2 SGDID:S0000005, Chr I from 139347-136700, reverse complement, Verified ORF"
[6] "YAL008W FUN14 SGDID:S0000006, Chr I from 135916-138512, Verified ORF"
[7] "YAL009W SPO7 SGDID:S0000007, Chr I from 134856-137635, Verified ORF"
# subset DNAStringSet and create a DNAStringSet of length 1
1] dna_seqs[
DNAStringSet object of length 1:
width seq names
[1] 5573 ACTTGTAAATATATCTTTTATTT...CTTATCGACCTTATTGTTGATAT YAL001C TFC3 SGDI...
# subset DNAStringSet and create a DNAString object
1]] dna_seqs[[
5573-letter DNAString object
seq: ACTTGTAAATATATCTTTTATTTTCCGAGAGGAAAA...AATTTCTTAAACGCTTATCGACCTTATTGTTGATAT
All functions above would also work for AAStringSet
objects and RNAStringSet
objects.
Use the dna_seqs
object created above to answer the following questions:
- What is the length of the 3rd sequence?
- What is the name of the 5th sequence?
- How many times does the TTCC sequence string occur in each sequence? Hint: use the function
vcountPattern()
. - What are the first 3 nucleotides of each sequence? Hint: use the
subseq()
function.
4.3 Working with genomic ranges: the GenomicRanges
package
The GenomicRanges
package implements S4 classes to represent genomic ranges as S4 objects.
Specifically, the GRanges
class is designed to store a set of intervals including the name of the sequence where features are located as well as the range of integer coordinates spanned by the feature in that sequence.
More generally, the IRanges
class is designed to store a set of intervals over a range of integer coordinates, without the notion of sequence names. As such, a GRanges
object is merely the combination of an IRanges
object and a vector of sequence names.
Those S4 classes provide automatic validity-checking functionality, and a range of methods implementing common operations on integer intervals and genomic ranges, from the calculation of distance between pairs of intervals to the identification of overlapping genomic ranges.
4.3.1 The IRanges
class
While the genomic space of many organisms is subdivided into multiple sequences (e.g., chromosomes), many operations on genomic ranges take place within individual sequences, where only integer positions matter. The IRanges
class provides a container for such “simple” ranges that are defined by two out of three pieces of information:
- the start position of the range
- the width of the range
- the end position of the range
The IRanges()
constructor function accepts those three pieces of information in the arguments start=
, width=
, and end=
. For instance, we create two integer ranges from their start position and width:
- one range starts at position 10 and has width 10
- one range starts at position 15 and has width 5
<- IRanges(start = c(10, 15), width = c(10, 5))
demo_iranges demo_iranges
IRanges object with 2 ranges and 0 metadata columns:
start end width
<integer> <integer> <integer>
[1] 10 19 10
[2] 15 19 5
We note how the object displays not only the start and width information that we requested for each range, but also the end position that is naturally computed from the other two pieces of information.
Create the same two ranges as above, using the arguments start=
and end=
of the IRanges()
constructor function.
The start and end positions as well as the width of every interval can be extracted as numeric vector using the functions start()
, end()
and width()
, respectively.
start(demo_iranges)
[1] 10 15
end(demo_iranges)
[1] 19 19
width(demo_iranges)
[1] 10 5
Objects of the IRanges
family extend the Vector
class, and are handled as unidimensional vectors in terms of indexing. As such, individual ranges can be extracted by integer index like any regular vector.
1] demo_iranges[
IRanges object with 1 range and 0 metadata columns:
start end width
<integer> <integer> <integer>
[1] 10 19 10
4.3.2 Metadata on IRanges
The IRanges
class can accommodate metadata information on each range, including names - passed to the names=
argument - and miscellaneous metadata passed as named vectors.
For instance, we create two ranges named “A” and “B”. Furthermore, we define metadata fields to store an example of character values and numeric values, respectively. Both the names and the values of the metadata fields are completely arbitrary in this example.
<- IRanges(
demo_with_metadata start = c(10, 15),
end = c(19, 19),
names = c("A", "B"),
character_metadata = c("control", "target"),
numeric_metadata = c(100, 200)
)
demo_with_metadata
IRanges object with 2 ranges and 2 metadata columns:
start end width | character_metadata numeric_metadata
<integer> <integer> <integer> | <character> <numeric>
A 10 19 10 | control 100
B 15 19 5 | target 200
The metadata columns can be extracted as a DataFrame
using the function mcols()
(short for “metadata columns”).
mcols(demo_with_metadata)
DataFrame with 2 rows and 2 columns
character_metadata numeric_metadata
<character> <numeric>
A control 100
B target 200
The character vector of names can be extracted using the function names()
.
names(demo_with_metadata)
[1] "A" "B"
Similarly to named vector of base data types, individual ranges can be extracted by name.
"A"] demo_with_metadata[
IRanges object with 1 range and 2 metadata columns:
start end width | character_metadata numeric_metadata
<integer> <integer> <integer> | <character> <numeric>
A 10 19 10 | control 100
4.3.3 The GRanges
class
Having defined integer ranges, the only additional information necessary to define genomic ranges is the name of the genomic sequence on which each range is located.
For instance, we define two genomic ranges, as follows:
- one genomic range on chromosome 1 (abbreviated “chr1”), from position 10 to 25
- one genomic range on chromosome 2 (abbreviated “chr2”), from position 20 to 35
To do so, we use the GRanges()
constructor function. We provide the sequence names as a character vector to the argument seqnames=
, and we provide both the start and end position to the argument ranges=
as an IRanges
object.
<- GRanges(
demo_granges seqnames = c("chr1", "chr2"),
ranges = IRanges(
start = c(10, 20),
end = c(25, 35))
)
demo_granges
GRanges object with 2 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr1 10-25 *
[2] chr2 20-35 *
-------
seqinfo: 2 sequences from an unspecified genome; no seqlengths
In the console, the object displays the sequence names in the seqnames
component, and the ranges in the form start-end
in the ranges
component. Furthermore, the example above also demonstrate that GRanges
objects possess a component called strand
; the symbol *
indicates unstranded genomic ranges, as we have not provided that information.
The strand information can be supplied to the strand=
argument, for instance:
<- GRanges(
demo_granges2 seqnames = c("chr1", "chr2"),
ranges = IRanges(
start = c(10, 20),
end = c(25, 35)),
strand = c("+", "-")
)
demo_granges2
GRanges object with 2 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr1 10-25 +
[2] chr2 20-35 -
-------
seqinfo: 2 sequences from an unspecified genome; no seqlengths
Finally, the examples above also demonstrate that GRanges
objects include a component called seqinfo
, which is occasionally used to store information about each sequence that may be represented in the seqnames
component. In the latest example above, we have not provide any information about any sequence. As such, the seqinfo
component was automatically populated with the names of the sequences that we used to create the object, while the remaining pieces of information were left unspecified, as NA
.
seqinfo(demo_granges2)
Seqinfo object with 2 sequences from an unspecified genome; no seqlengths:
seqnames seqlengths isCircular genome
chr1 NA NA <NA>
chr2 NA NA <NA>
The example above reveals that information about sequences include not only their respective name and length, but also whether they represent a circular polymer (e.g., plasmid), and the name of the genome that they are part of.
4.3.4 Metadata on GRanges
Similarly to IRanges
, metadata can be passed directly to the GRanges
constructor function. For instance:
<- GRanges(
demo_granges3 seqnames = c("chr1", "chr2"),
ranges = IRanges(
start = c(10, 20),
end = c(25, 35)),
metadata1 = c("control", "target"),
metadata2 = c(1, 2)
)
demo_granges3
GRanges object with 2 ranges and 2 metadata columns:
seqnames ranges strand | metadata1 metadata2
<Rle> <IRanges> <Rle> | <character> <numeric>
[1] chr1 10-25 * | control 1
[2] chr2 20-35 * | target 2
-------
seqinfo: 2 sequences from an unspecified genome; no seqlengths
4.3.5 Importing genomic ranges from files
Frequently, large collections of genomic ranges are imported from files rather than described in manually written code. In particular, genome-wide annotations of known gene features are distributed as files on websites such as Ensembl.
Various file formats are commonly used to store genomic ranges in bioinformatics workflows. For instance, the BED (Browser Extensible Data) format is commonly found in Chromatin Immunoprecipitation Sequencing (ChIP-Seq), while GTF (Gene Transfer Format, GTF2.2) is the de facto standard file format to describe genomic features such as exons, transcripts, and genes.
In the following example, we import the gene model for Actin Beta (ACTB) from a small GTF file as a set of genomic ranges. The example file represents a subset of a GTF file for the Homo sapiens species, downloaded from Ensembl. The original file contains more than 3 millions lines and 22 metadata fields, from which a subset was extracted into a smaller file for this lesson.
In particular, we use the import()
generic defined in the BiocIO package - with methods implemented in the rtracklayer package - as a versatile function that is capable of recognising common file extensions and associating them with the appropriate method for parsing each particular file format.
library(rtracklayer)
<- rtracklayer::import(here("data", "actb.gtf"))
actb_gtf_data actb_gtf_data
GRanges object with 267 ranges and 7 metadata columns:
seqnames ranges strand | source type score
<Rle> <IRanges> <Rle> | <factor> <factor> <numeric>
[1] 7 5526409-5563902 - | rtracklayer gene NA
[2] 7 5526409-5530601 - | rtracklayer transcript NA
[3] 7 5530542-5530601 - | rtracklayer exon NA
[4] 7 5529535-5529684 - | rtracklayer exon NA
[5] 7 5529535-5529657 - | rtracklayer CDS NA
... ... ... ... . ... ... ...
[263] 7 5540676-5540771 - | rtracklayer five_prime_utr NA
[264] 7 5529658-5529663 - | rtracklayer five_prime_utr NA
[265] 7 5561852-5562716 - | rtracklayer transcript NA
[266] 7 5562390-5562716 - | rtracklayer exon NA
[267] 7 5561852-5561949 - | rtracklayer exon NA
phase gene_id gene_name transcript_id
<integer> <character> <character> <character>
[1] <NA> ENSG00000075624 ACTB <NA>
[2] <NA> ENSG00000075624 ACTB ENST00000674681
[3] <NA> ENSG00000075624 ACTB ENST00000674681
[4] <NA> ENSG00000075624 ACTB ENST00000674681
[5] <NA> ENSG00000075624 ACTB ENST00000674681
... ... ... ... ...
[263] <NA> ENSG00000075624 ACTB ENST00000414620
[264] <NA> ENSG00000075624 ACTB ENST00000414620
[265] <NA> ENSG00000075624 ACTB ENST00000646584
[266] <NA> ENSG00000075624 ACTB ENST00000646584
[267] <NA> ENSG00000075624 ACTB ENST00000646584
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
In the example above, the contents of the GTF file were imported into a GRanges
object. For each entry in the file, the sequence name, start and end position, and strand information were used to populate the dedicated components of the object, while all other pieces of information are stored as separate columns of metadata.
From here on, this GRanges
object can be manipulated just like any of the other GRanges
objects that we have created earlier in this episode.
4.3.6 Operations on GRanges
and the GRangesList
class
As we have demonstrated so far, GRanges
objects can be manually defined or imported from files. Those often represent genomic regions of interest, and databases of known genomic features, respectively. Either way, a number of operations are commonly applied to GRanges
objects throughout bioinformatics workflows.
4.3.6.1 Subset
For instance, the subset()
method is extremely convenient to extract a set of genomic ranges matching a condition on any component, including sequence name, start and end position, strand, or any metadata field. In the example below, we extract all the records of type transcript
that start at position 5527147
.
subset(actb_gtf_data, type == "transcript" & start == 5527147)
GRanges object with 5 ranges and 7 metadata columns:
seqnames ranges strand | source type score
<Rle> <IRanges> <Rle> | <factor> <factor> <numeric>
[1] 7 5527147-5529949 - | rtracklayer transcript NA
[2] 7 5527147-5530581 - | rtracklayer transcript NA
[3] 7 5527147-5530604 - | rtracklayer transcript NA
[4] 7 5527147-5530604 - | rtracklayer transcript NA
[5] 7 5527147-5530604 - | rtracklayer transcript NA
phase gene_id gene_name transcript_id
<integer> <character> <character> <character>
[1] <NA> ENSG00000075624 ACTB ENST00000642480
[2] <NA> ENSG00000075624 ACTB ENST00000676397
[3] <NA> ENSG00000075624 ACTB ENST00000676319
[4] <NA> ENSG00000075624 ACTB ENST00000676189
[5] <NA> ENSG00000075624 ACTB ENST00000473257
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
4.3.6.2 Split
Separately, the split()
method is useful to divide a set of genomic ranges initially stored in a single GRanges
object into groups that are stored in a named list of GRanges
objects. Conveniently, the GRangesList
class provides a container for efficiently displaying and processing lists of GRanges
objects.
In the example below, we first extract the subset of entries that represent exons, before separating those exons by transcript identifier, yielding the result as a GRangesList
object.
<- subset(actb_gtf_data, type == "exon")
actb_exons <- split(actb_exons, actb_exons$transcript_id)
actb_exons_by_transcript actb_exons_by_transcript
GRangesList object of length 23:
$ENST00000414620
GRanges object with 4 ranges and 7 metadata columns:
seqnames ranges strand | source type score
<Rle> <IRanges> <Rle> | <factor> <factor> <numeric>
[1] 7 5562574-5562790 - | rtracklayer exon NA
[2] 7 5540676-5540771 - | rtracklayer exon NA
[3] 7 5529535-5529663 - | rtracklayer exon NA
[4] 7 5529282-5529400 - | rtracklayer exon NA
phase gene_id gene_name transcript_id
<integer> <character> <character> <character>
[1] <NA> ENSG00000075624 ACTB ENST00000414620
[2] <NA> ENSG00000075624 ACTB ENST00000414620
[3] <NA> ENSG00000075624 ACTB ENST00000414620
[4] <NA> ENSG00000075624 ACTB ENST00000414620
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
$ENST00000417101
GRanges object with 3 ranges and 7 metadata columns:
seqnames ranges strand | source type score
<Rle> <IRanges> <Rle> | <factor> <factor> <numeric>
[1] 7 5529806-5529982 - | rtracklayer exon NA
[2] 7 5529535-5529663 - | rtracklayer exon NA
[3] 7 5529235-5529400 - | rtracklayer exon NA
phase gene_id gene_name transcript_id
<integer> <character> <character> <character>
[1] <NA> ENSG00000075624 ACTB ENST00000417101
[2] <NA> ENSG00000075624 ACTB ENST00000417101
[3] <NA> ENSG00000075624 ACTB ENST00000417101
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
$ENST00000425660
GRanges object with 7 ranges and 7 metadata columns:
seqnames ranges strand | source type score
<Rle> <IRanges> <Rle> | <factor> <factor> <numeric>
[1] 7 5530524-5530601 - | rtracklayer exon NA
[2] 7 5529535-5529663 - | rtracklayer exon NA
[3] 7 5529161-5529400 - | rtracklayer exon NA
[4] 7 5529019-5529059 - | rtracklayer exon NA
[5] 7 5528281-5528719 - | rtracklayer exon NA
[6] 7 5528004-5528185 - | rtracklayer exon NA
[7] 7 5527156-5527891 - | rtracklayer exon NA
phase gene_id gene_name transcript_id
<integer> <character> <character> <character>
[1] <NA> ENSG00000075624 ACTB ENST00000425660
[2] <NA> ENSG00000075624 ACTB ENST00000425660
[3] <NA> ENSG00000075624 ACTB ENST00000425660
[4] <NA> ENSG00000075624 ACTB ENST00000425660
[5] <NA> ENSG00000075624 ACTB ENST00000425660
[6] <NA> ENSG00000075624 ACTB ENST00000425660
[7] <NA> ENSG00000075624 ACTB ENST00000425660
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
...
<20 more elements>
When printing the object above in the console, the first line confirms the class of the object as GRrangesList
, while each named GRanges
in that list is introduced by the dollar sign and the name of that item, just like regular named lists in base R.
4.3.6.3 Length
By nature, many of the methods applicable to list
objects can be directly applied to GRangesList
objects. For instance, the lengths()
function can be used on GRangesList
to display the length of each GRanges
object in the list as an integer vector.
In the latest example above, we can compute the number of exons in each transcript as the length of each GRanges
object within the GRangesList
:
lengths(actb_exons_by_transcript)
ENST00000414620 ENST00000417101 ENST00000425660 ENST00000432588 ENST00000443528
4 3 7 5 3
ENST00000462494 ENST00000464611 ENST00000473257 ENST00000477812 ENST00000480301
5 3 5 5 2
ENST00000484841 ENST00000493945 ENST00000642480 ENST00000645025 ENST00000645576
5 6 5 4 5
ENST00000646584 ENST00000646664 ENST00000647275 ENST00000674681 ENST00000675515
2 6 3 6 6
ENST00000676189 ENST00000676319 ENST00000676397
6 3 6
Importantly, the function lengths()
(with a final s
) demonstrated above is different from the function length()
(without s
). The former is meant to be used on list objects, while the latter is meant to be used on vectors.
What does length(actb_exons_by_transcript)
return, and why?
4.3.7 Subset by overlap
Possibly one of the most common operations when working with genomic ranges is to subset arbitrarily large collections of genomic ranges to those located in a specific region of the genome; for instance, when visualising information as tracks in a genome browser.
To demonstrate, we manually define a new GRanges
representing a region of interest that we will use to extract all of the genomic ranges imported earlier from the GTF file which overlap that region of interest.
<- GRanges(
region_of_interest seqnames = "7",
ranges = IRanges(start = 5525830, end = 5531239)
)
<- subsetByOverlaps(x = actb_gtf_data, ranges = region_of_interest)
actb_in_region actb_in_region
GRanges object with 256 ranges and 7 metadata columns:
seqnames ranges strand | source type score
<Rle> <IRanges> <Rle> | <factor> <factor> <numeric>
[1] 7 5526409-5563902 - | rtracklayer gene NA
[2] 7 5526409-5530601 - | rtracklayer transcript NA
[3] 7 5530542-5530601 - | rtracklayer exon NA
[4] 7 5529535-5529684 - | rtracklayer exon NA
[5] 7 5529535-5529657 - | rtracklayer CDS NA
... ... ... ... . ... ... ...
[252] 7 5529535-5529657 - | rtracklayer CDS NA
[253] 7 5529655-5529657 - | rtracklayer start_codon NA
[254] 7 5529282-5529400 - | rtracklayer exon NA
[255] 7 5529282-5529400 - | rtracklayer CDS NA
[256] 7 5529658-5529663 - | rtracklayer five_prime_utr NA
phase gene_id gene_name transcript_id
<integer> <character> <character> <character>
[1] <NA> ENSG00000075624 ACTB <NA>
[2] <NA> ENSG00000075624 ACTB ENST00000674681
[3] <NA> ENSG00000075624 ACTB ENST00000674681
[4] <NA> ENSG00000075624 ACTB ENST00000674681
[5] <NA> ENSG00000075624 ACTB ENST00000674681
... ... ... ... ...
[252] <NA> ENSG00000075624 ACTB ENST00000414620
[253] <NA> ENSG00000075624 ACTB ENST00000414620
[254] <NA> ENSG00000075624 ACTB ENST00000414620
[255] <NA> ENSG00000075624 ACTB ENST00000414620
[256] <NA> ENSG00000075624 ACTB ENST00000414620
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
Like the subset()
method, the subsetByOverlaps()
method returns a new GRanges
object. We can visually compare the information printed in the object (256 ranges in the new subsetted object, relative to 267 ranges in the original object), or we can programmatically compare the length of the two objects to check whether the new GRanges
object is any smaller than the original GRanges
object:
length(actb_in_region) - length(actb_gtf_data)
[1] -11
In the example above, we learn that the new GRanges
object has 11 records less than the original GRanges
object.
4.4 Working with quantitative data: the SummarizedExperiment
package
The figure below represents the anatomy of the SummarizedExperiment class.
Objects of the class SummarizedExperiment
contain :
One (or more) assay(s) containing the quantitative omics data (expression data), stored as a matrix-like object. Features (genes, transcripts, proteins, …) are defined along the rows, and samples along the columns.
A sample metadata slot containing sample co-variates, stored as a data frame. Rows from this table represent samples (rows match exactly the columns of the expression data).
A feature metadata slot containing feature co-variates, stored as a data frame. The rows of this data frame match exactly the rows of the expression data.
The coordinated nature of the SummarizedExperiment
guarantees that during data manipulation, the dimensions of the different slots will always match (i.e the columns in the expression data and then rows in the sample metadata, as well as the rows in the expression data and feature metadata) during data manipulation. For example, if we had to exclude one sample from the assay, it would be automatically removed from the sample metadata in the same operation.
The metadata slots can grow additional co-variates (columns) without affecting the other structures.
4.4.1 Creating a SummarizedExperiment
object
In order to create a SummarizedExperiment
, we will create the individual components, i.e the count matrix, the sample and gene metadata from csv files. These are typically how RNA-Seq data are provided (after raw data have been processed).
- An expression matrix: we load the count matrix, specifying that the first columns contains row/gene names, and convert the
data.frame
to amatrix
.
# Read count matrix
<- read.csv(here("data", "count_matrix.csv"), row.names = 1) |>
count_matrix as.matrix()
# Inspect data
1:5, ] count_matrix[
GSM2545336 GSM2545337 GSM2545338 GSM2545339 GSM2545340 GSM2545341
Asl 1170 361 400 586 626 988
Apod 36194 10347 9173 10620 13021 29594
Cyp2d22 4060 1616 1603 1901 2171 3349
Klk6 287 629 641 578 448 195
Fcrls 85 233 244 237 180 38
GSM2545342 GSM2545343 GSM2545344 GSM2545345 GSM2545346 GSM2545347
Asl 836 535 586 597 938 1035
Apod 24959 13668 13230 15868 27769 34301
Cyp2d22 3122 2008 2254 2277 2985 3452
Klk6 186 1101 537 567 327 233
Fcrls 68 375 199 177 89 67
GSM2545348 GSM2545349 GSM2545350 GSM2545351 GSM2545352 GSM2545353
Asl 494 481 666 937 803 541
Apod 11258 11812 15816 29242 20415 13682
Cyp2d22 1883 2014 2417 3678 2920 2216
Klk6 742 881 828 250 798 710
Fcrls 300 233 231 81 303 285
GSM2545354 GSM2545362 GSM2545363 GSM2545380
Asl 473 748 576 1192
Apod 11088 15916 11166 38148
Cyp2d22 1821 2842 2011 4019
Klk6 894 501 598 259
Fcrls 248 179 184 68
dim(count_matrix)
[1] 1474 22
- A table describing the samples, often referred to as sample metadata.
# Read sample metadata
<- read.csv(here("data", "sample_metadata.csv"))
sample_metadata
# Inspect data
sample_metadata
sample organism age sex infection strain time tissue mouse
1 GSM2545336 Mus musculus 8 Female InfluenzaA C57BL/6 8 Cerebellum 14
2 GSM2545337 Mus musculus 8 Female NonInfected C57BL/6 0 Cerebellum 9
3 GSM2545338 Mus musculus 8 Female NonInfected C57BL/6 0 Cerebellum 10
4 GSM2545339 Mus musculus 8 Female InfluenzaA C57BL/6 4 Cerebellum 15
5 GSM2545340 Mus musculus 8 Male InfluenzaA C57BL/6 4 Cerebellum 18
6 GSM2545341 Mus musculus 8 Male InfluenzaA C57BL/6 8 Cerebellum 6
7 GSM2545342 Mus musculus 8 Female InfluenzaA C57BL/6 8 Cerebellum 5
8 GSM2545343 Mus musculus 8 Male NonInfected C57BL/6 0 Cerebellum 11
9 GSM2545344 Mus musculus 8 Female InfluenzaA C57BL/6 4 Cerebellum 22
10 GSM2545345 Mus musculus 8 Male InfluenzaA C57BL/6 4 Cerebellum 13
11 GSM2545346 Mus musculus 8 Male InfluenzaA C57BL/6 8 Cerebellum 23
12 GSM2545347 Mus musculus 8 Male InfluenzaA C57BL/6 8 Cerebellum 24
13 GSM2545348 Mus musculus 8 Female NonInfected C57BL/6 0 Cerebellum 8
14 GSM2545349 Mus musculus 8 Male NonInfected C57BL/6 0 Cerebellum 7
15 GSM2545350 Mus musculus 8 Male InfluenzaA C57BL/6 4 Cerebellum 1
16 GSM2545351 Mus musculus 8 Female InfluenzaA C57BL/6 8 Cerebellum 16
17 GSM2545352 Mus musculus 8 Female InfluenzaA C57BL/6 4 Cerebellum 21
18 GSM2545353 Mus musculus 8 Female NonInfected C57BL/6 0 Cerebellum 4
19 GSM2545354 Mus musculus 8 Male NonInfected C57BL/6 0 Cerebellum 2
20 GSM2545362 Mus musculus 8 Female InfluenzaA C57BL/6 4 Cerebellum 20
21 GSM2545363 Mus musculus 8 Male InfluenzaA C57BL/6 4 Cerebellum 12
22 GSM2545380 Mus musculus 8 Female InfluenzaA C57BL/6 8 Cerebellum 19
dim(sample_metadata)
[1] 22 9
- A table describing the genes, often referred to as gene metadata.
# Read gene metadata
<- read.csv(here("data", "gene_metadata.csv"))
gene_metadata
# Inspect data
1:10, 1:4] gene_metadata[
gene ENTREZID
1 Asl 109900
2 Apod 11815
3 Cyp2d22 56448
4 Klk6 19144
5 Fcrls 80891
6 Slc2a4 20528
7 Exd2 97827
8 Gjc2 118454
9 Plp1 18823
10 Gnb4 14696
product
1 argininosuccinate lyase, transcript variant X1
2 apolipoprotein D, transcript variant 3
3 cytochrome P450, family 2, subfamily d, polypeptide 22, transcript variant 2
4 kallikrein related-peptidase 6, transcript variant 2
5 Fc receptor-like S, scavenger receptor, transcript variant X1
6 solute carrier family 2 (facilitated glucose transporter), member 4
7 exonuclease 3'-5' domain containing 2
8 gap junction protein, gamma 2, transcript variant 1
9 proteolipid protein (myelin) 1, transcript variant 1
10 guanine nucleotide binding protein (G protein), beta 4, transcript variant X2
ensembl_gene_id
1 ENSMUSG00000025533
2 ENSMUSG00000022548
3 ENSMUSG00000061740
4 ENSMUSG00000050063
5 ENSMUSG00000015852
6 ENSMUSG00000018566
7 ENSMUSG00000032705
8 ENSMUSG00000043448
9 ENSMUSG00000031425
10 ENSMUSG00000027669
dim(gene_metadata)
[1] 1474 9
We will create a SummarizedExperiment
from these tables:
The count matrix that will be used as the assay
The table describing the samples will be used as the colData slot
The table describing the genes will be used as the rowData slot
To do this we can put the different parts together using the SummarizedExperiment
constructor:
# Create SummarizedExperiment object
<- SummarizedExperiment(
se assays = list(counts = count_matrix),
colData = sample_metadata,
rowData = gene_metadata
)
se
class: SummarizedExperiment
dim: 1474 22
metadata(0):
assays(1): counts
rownames(1474): Asl Apod ... Lmx1a Pbx1
rowData names(9): gene ENTREZID ... phenotype_description
hsapiens_homolog_associated_gene_name
colnames(22): GSM2545336 GSM2545337 ... GSM2545363 GSM2545380
colData names(9): sample organism ... tissue mouse
Using this data structure, we can access the expression matrix with the assay()
function:
head(assay(se))
GSM2545336 GSM2545337 GSM2545338 GSM2545339 GSM2545340 GSM2545341
Asl 1170 361 400 586 626 988
Apod 36194 10347 9173 10620 13021 29594
Cyp2d22 4060 1616 1603 1901 2171 3349
Klk6 287 629 641 578 448 195
Fcrls 85 233 244 237 180 38
Slc2a4 782 231 248 265 313 786
GSM2545342 GSM2545343 GSM2545344 GSM2545345 GSM2545346 GSM2545347
Asl 836 535 586 597 938 1035
Apod 24959 13668 13230 15868 27769 34301
Cyp2d22 3122 2008 2254 2277 2985 3452
Klk6 186 1101 537 567 327 233
Fcrls 68 375 199 177 89 67
Slc2a4 528 249 266 357 654 693
GSM2545348 GSM2545349 GSM2545350 GSM2545351 GSM2545352 GSM2545353
Asl 494 481 666 937 803 541
Apod 11258 11812 15816 29242 20415 13682
Cyp2d22 1883 2014 2417 3678 2920 2216
Klk6 742 881 828 250 798 710
Fcrls 300 233 231 81 303 285
Slc2a4 271 304 349 715 513 320
GSM2545354 GSM2545362 GSM2545363 GSM2545380
Asl 473 748 576 1192
Apod 11088 15916 11166 38148
Cyp2d22 1821 2842 2011 4019
Klk6 894 501 598 259
Fcrls 248 179 184 68
Slc2a4 248 350 317 796
dim(assay(se))
[1] 1474 22
We can access the sample metadata using the colData()
function:
colData(se)
DataFrame with 22 rows and 9 columns
sample organism age sex infection
<character> <character> <integer> <character> <character>
GSM2545336 GSM2545336 Mus musculus 8 Female InfluenzaA
GSM2545337 GSM2545337 Mus musculus 8 Female NonInfected
GSM2545338 GSM2545338 Mus musculus 8 Female NonInfected
GSM2545339 GSM2545339 Mus musculus 8 Female InfluenzaA
GSM2545340 GSM2545340 Mus musculus 8 Male InfluenzaA
... ... ... ... ... ...
GSM2545353 GSM2545353 Mus musculus 8 Female NonInfected
GSM2545354 GSM2545354 Mus musculus 8 Male NonInfected
GSM2545362 GSM2545362 Mus musculus 8 Female InfluenzaA
GSM2545363 GSM2545363 Mus musculus 8 Male InfluenzaA
GSM2545380 GSM2545380 Mus musculus 8 Female InfluenzaA
strain time tissue mouse
<character> <integer> <character> <integer>
GSM2545336 C57BL/6 8 Cerebellum 14
GSM2545337 C57BL/6 0 Cerebellum 9
GSM2545338 C57BL/6 0 Cerebellum 10
GSM2545339 C57BL/6 4 Cerebellum 15
GSM2545340 C57BL/6 4 Cerebellum 18
... ... ... ... ...
GSM2545353 C57BL/6 0 Cerebellum 4
GSM2545354 C57BL/6 0 Cerebellum 2
GSM2545362 C57BL/6 4 Cerebellum 20
GSM2545363 C57BL/6 4 Cerebellum 12
GSM2545380 C57BL/6 8 Cerebellum 19
dim(colData(se))
[1] 22 9
We can also access the feature metadata using the rowData()
function:
head(rowData(se))
DataFrame with 6 rows and 9 columns
gene ENTREZID product ensembl_gene_id
<character> <integer> <character> <character>
Asl Asl 109900 argininosuccinate ly.. ENSMUSG00000025533
Apod Apod 11815 apolipoprotein D, tr.. ENSMUSG00000022548
Cyp2d22 Cyp2d22 56448 cytochrome P450, fam.. ENSMUSG00000061740
Klk6 Klk6 19144 kallikrein related-p.. ENSMUSG00000050063
Fcrls Fcrls 80891 Fc receptor-like S, .. ENSMUSG00000015852
Slc2a4 Slc2a4 20528 solute carrier famil.. ENSMUSG00000018566
external_synonym chromosome_name gene_biotype phenotype_description
<character> <character> <character> <character>
Asl 2510006M18Rik 5 protein_coding abnormal circulating..
Apod NA 16 protein_coding abnormal lipid homeo..
Cyp2d22 2D22 15 protein_coding abnormal skin morpho..
Klk6 Bssp 7 protein_coding abnormal cytokine le..
Fcrls 2810439C17Rik 3 protein_coding decreased CD8-positi..
Slc2a4 Glut-4 11 protein_coding abnormal circulating..
hsapiens_homolog_associated_gene_name
<character>
Asl ASL
Apod APOD
Cyp2d22 CYP2D6
Klk6 KLK6
Fcrls FCRL2
Slc2a4 SLC2A4
dim(rowData(se))
[1] 1474 9
4.4.2 Subsetting a SummarizedExperiment
SummarizedExperiment
objects can be subset just like with data frames, with numerics or with characters of logicals.
Below, we create a new instance of class SummarizedExperiment that contains only the 5 first features for the 3 first samples.
<- se[1:5, 1:3]
se1 se1
class: SummarizedExperiment
dim: 5 3
metadata(0):
assays(1): counts
rownames(5): Asl Apod Cyp2d22 Klk6 Fcrls
rowData names(9): gene ENTREZID ... phenotype_description
hsapiens_homolog_associated_gene_name
colnames(3): GSM2545336 GSM2545337 GSM2545338
colData names(9): sample organism ... tissue mouse
colData(se1)
DataFrame with 3 rows and 9 columns
sample organism age sex infection
<character> <character> <integer> <character> <character>
GSM2545336 GSM2545336 Mus musculus 8 Female InfluenzaA
GSM2545337 GSM2545337 Mus musculus 8 Female NonInfected
GSM2545338 GSM2545338 Mus musculus 8 Female NonInfected
strain time tissue mouse
<character> <integer> <character> <integer>
GSM2545336 C57BL/6 8 Cerebellum 14
GSM2545337 C57BL/6 0 Cerebellum 9
GSM2545338 C57BL/6 0 Cerebellum 10
rowData(se1)
DataFrame with 5 rows and 9 columns
gene ENTREZID product ensembl_gene_id
<character> <integer> <character> <character>
Asl Asl 109900 argininosuccinate ly.. ENSMUSG00000025533
Apod Apod 11815 apolipoprotein D, tr.. ENSMUSG00000022548
Cyp2d22 Cyp2d22 56448 cytochrome P450, fam.. ENSMUSG00000061740
Klk6 Klk6 19144 kallikrein related-p.. ENSMUSG00000050063
Fcrls Fcrls 80891 Fc receptor-like S, .. ENSMUSG00000015852
external_synonym chromosome_name gene_biotype phenotype_description
<character> <character> <character> <character>
Asl 2510006M18Rik 5 protein_coding abnormal circulating..
Apod NA 16 protein_coding abnormal lipid homeo..
Cyp2d22 2D22 15 protein_coding abnormal skin morpho..
Klk6 Bssp 7 protein_coding abnormal cytokine le..
Fcrls 2810439C17Rik 3 protein_coding decreased CD8-positi..
hsapiens_homolog_associated_gene_name
<character>
Asl ASL
Apod APOD
Cyp2d22 CYP2D6
Klk6 KLK6
Fcrls FCRL2
We can also use the colData()
function to subset on something from the sample metadata or the rowData()
to subset on something from the feature metadata. For example, here we keep only miRNAs and the non infected samples:
# Subset object
<- se[rowData(se)$gene_biotype == "miRNA",
se1 colData(se)$infection == "NonInfected"]
se1
class: SummarizedExperiment
dim: 7 7
metadata(0):
assays(1): counts
rownames(7): Mir1901 Mir378a ... Mir128-1 Mir7682
rowData names(9): gene ENTREZID ... phenotype_description
hsapiens_homolog_associated_gene_name
colnames(7): GSM2545337 GSM2545338 ... GSM2545353 GSM2545354
colData names(9): sample organism ... tissue mouse
assay(se1)
GSM2545337 GSM2545338 GSM2545343 GSM2545348 GSM2545349 GSM2545353
Mir1901 45 44 74 55 68 33
Mir378a 11 7 9 4 12 4
Mir133b 4 6 5 4 6 7
Mir30c-2 10 6 16 12 8 17
Mir149 1 2 0 0 0 0
Mir128-1 4 1 2 2 1 2
Mir7682 2 0 4 1 3 5
GSM2545354
Mir1901 60
Mir378a 8
Mir133b 3
Mir30c-2 15
Mir149 2
Mir128-1 1
Mir7682 5
colData(se1)
DataFrame with 7 rows and 9 columns
sample organism age sex infection
<character> <character> <integer> <character> <character>
GSM2545337 GSM2545337 Mus musculus 8 Female NonInfected
GSM2545338 GSM2545338 Mus musculus 8 Female NonInfected
GSM2545343 GSM2545343 Mus musculus 8 Male NonInfected
GSM2545348 GSM2545348 Mus musculus 8 Female NonInfected
GSM2545349 GSM2545349 Mus musculus 8 Male NonInfected
GSM2545353 GSM2545353 Mus musculus 8 Female NonInfected
GSM2545354 GSM2545354 Mus musculus 8 Male NonInfected
strain time tissue mouse
<character> <integer> <character> <integer>
GSM2545337 C57BL/6 0 Cerebellum 9
GSM2545338 C57BL/6 0 Cerebellum 10
GSM2545343 C57BL/6 0 Cerebellum 11
GSM2545348 C57BL/6 0 Cerebellum 8
GSM2545349 C57BL/6 0 Cerebellum 7
GSM2545353 C57BL/6 0 Cerebellum 4
GSM2545354 C57BL/6 0 Cerebellum 2
rowData(se1)
DataFrame with 7 rows and 9 columns
gene ENTREZID product ensembl_gene_id
<character> <integer> <character> <character>
Mir1901 Mir1901 100316686 microRNA 1901 ENSMUSG00000084565
Mir378a Mir378a 723889 microRNA 378a ENSMUSG00000105200
Mir133b Mir133b 723817 microRNA 133b ENSMUSG00000065480
Mir30c-2 Mir30c-2 723964 microRNA 30c-2 ENSMUSG00000065567
Mir149 Mir149 387167 microRNA 149 ENSMUSG00000065470
Mir128-1 Mir128-1 387147 microRNA 128-1 ENSMUSG00000065520
Mir7682 Mir7682 102466847 microRNA 7682 ENSMUSG00000106406
external_synonym chromosome_name gene_biotype phenotype_description
<character> <character> <character> <character>
Mir1901 Mirn1901 18 miRNA NA
Mir378a Mirn378 18 miRNA abnormal mitochondri..
Mir133b mir 133b 1 miRNA no abnormal phenotyp..
Mir30c-2 mir 30c-2 1 miRNA NA
Mir149 Mirn149 1 miRNA increased circulatin..
Mir128-1 Mirn128 1 miRNA no abnormal phenotyp..
Mir7682 mmu-mir-7682 1 miRNA NA
hsapiens_homolog_associated_gene_name
<character>
Mir1901 NA
Mir378a MIR378A
Mir133b MIR133B
Mir30c-2 MIR30C2
Mir149 NA
Mir128-1 MIR128-1
Mir7682 NA
- Extract the gene expression levels of the 3 first genes in samples at time 0 and at time 8.
4.4.3 Saving data
Exporting data to a spreadsheet, as we did in a previous lesson, has several limitations, such as those described in the first chapter (possible inconsistencies with ,
and .
for decimal separators and lack of variable type definitions). Furthermore, exporting data to a spreadsheet is only relevant for rectangular data such as dataframes and matrices.
A more general way to save data, that is specific to R and is guaranteed to work on any operating system, is to use the saveRDS()
function. Saving objects like this will generate a binary representation on disk (using the rds
file extension here), which can be loaded back into R using the readRDS
function.
# Save R object as an RDS file
saveRDS(se, file = here("output", "se.rds"))
rm(se)
<- readRDS(here("output", "se.rds"))
se head(se)
class: SummarizedExperiment
dim: 6 22
metadata(0):
assays(1): counts
rownames(6): Asl Apod ... Fcrls Slc2a4
rowData names(9): gene ENTREZID ... phenotype_description
hsapiens_homolog_associated_gene_name
colnames(22): GSM2545336 GSM2545337 ... GSM2545363 GSM2545380
colData names(9): sample organism ... tissue mouse
To conclude, when it comes to saving data from R that will be loaded again in R, saving and loading with saveRDS
and readRDS
is the preferred approach. If tabular data need to be shared with somebody that is not using R, then exporting to a text-based spreadsheet is a good alternative.