4  Analyzing biological data with the Bioconductor ecosystem

The Bioconductor project was created to develop, support, and disseminate free open source software to analyze biological data. It works as a repository of R packages for biological data analysis, with both “core packages” (developed and maintained by the Bioconductor Core Team), and community-contributed packages. We usually refer to Bioconductor as a “package ecosystem” because its packages are tightly integrated and designed for easy interoperation, such that different packages can be used together with minimal effort. An important facilitator of such interoperability is the existence of standardized data structures, such as GRanges objects for genomic coordinates, DNA/RNA/AAStringSet objects for sequences, and SummarizedExperiment objects for quantitative data.

Let’s start by loading the required packages and data:

# Load packages
library(Biostrings)
library(GenomicRanges)
library(SummarizedExperiment)
library(tidyverse)
library(here)

4.1 Goals of this lesson

At the end of this lesson, you will be able to:

  • read and process sequence data with Biostrings
  • read and process genomic coordinates with GenomicRanges
  • read and process quantitative data (e.g., gene expression) with SummarizedExperiment

4.2 Working with sequences: the Biostrings package

The Biostrings package allows efficient manipulation of biological sequences, and its data classes are standard for all Bioconductor packages that involve sequence analyses. The data classes in Biostrings to represent sets of biological sequences are:

  1. DNAStringSet: for sets of DNA strings.
  2. RNAStringSet: for sets of RNA strings.
  3. AAStringSet: for sets of amino acid strings.

For a single sequence, the each of the above has a correspoding XString class (i.e., DNAString, RNAString, AAString). For example, let’s create some XString objects manually:

# Create `DNAString` object
dna_seq <- DNAString("ATGGCCGACTCA")
dna_seq
12-letter DNAString object
seq: ATGGCCGACTCA
# Convert `DNAString` to `RNAString`
rna_seq <- RNAString(dna_seq)
rna_seq
12-letter RNAString object
seq: AUGGCCGACUCA
# Translate `RNAString` to create `AAString`
aa_seq <- translate(rna_seq)
aa_seq
4-letter AAString object
seq: MADS

In real-world data analyses, we would work with multiple sequences (e.g., a whole genome or a whole proteome). Multiple sequences are stored in XStringSet objects, which can be read from FASTA files with readXStringSet() functions, namely:

  1. readDNAStringSet(): read FASTA file containing multiple DNA sequences.
  2. readRNAStringSet(): read FASTA file containing multiple RNA sequences.
  3. readAAStringSet(): read FASTA file containing multiple AA sequences.

For example, let’s read an example FASTA file that come with the Biostrings package.

# Path to FASTA file containing an ORF
fasta_file <- system.file("extdata", "someORF.fa", package = "Biostrings")
fasta_file
[1] "/home/faalm/R/x86_64-pc-linux-gnu-library/4.3/Biostrings/extdata/someORF.fa"
# Read FASTA file as a DNAStringSet object
dna_seqs <- readDNAStringSet(fasta_file)
dna_seqs
DNAStringSet object of length 7:
    width seq                                               names               
[1]  5573 ACTTGTAAATATATCTTTTATTT...CTTATCGACCTTATTGTTGATAT YAL001C TFC3 SGDI...
[2]  5825 TTCCAAGGCCGATGAATTCGACT...AGTAAATTTTTTTCTATTCTCTT YAL002W VPS8 SGDI...
[3]  2987 CTTCATGTCAGCCTGCACTTCTG...TGGTACTCATGTAGCTGCCTCAT YAL003W EFB1 SGDI...
[4]  3929 CACTCATATCGGGGGTCTTACTT...TGTCCCGAAACACGAAAAAGTAC YAL005C SSA1 SGDI...
[5]  2648 AGAGAAAGAGTTTCACTTCTTGA...ATATAATTTATGTGTGAACATAG YAL007C ERP2 SGDI...
[6]  2597 GTGTCCGGGCCTCGCAGGCGTTC...AAGTTTTGGCAGAATGTACTTTT YAL008W FUN14 SGD...
[7]  2780 CAAGATAATGTCAAAGTTAGTGG...GCTAAGGAAGAAAAAAAAATCAC YAL009W SPO7 SGDI...

Some examples of what you can do to explore XStringSet objects include:

# width(): get sequence lengths
width(dna_seqs)
[1] 5573 5825 2987 3929 2648 2597 2780
# names(): get sequence names
names(dna_seqs)
[1] "YAL001C TFC3 SGDID:S0000001, Chr I from 152168-146596, reverse complement, Verified ORF"
[2] "YAL002W VPS8 SGDID:S0000002, Chr I from 142709-148533, Verified ORF"                    
[3] "YAL003W EFB1 SGDID:S0000003, Chr I from 141176-144162, Verified ORF"                    
[4] "YAL005C SSA1 SGDID:S0000004, Chr I from 142433-138505, reverse complement, Verified ORF"
[5] "YAL007C ERP2 SGDID:S0000005, Chr I from 139347-136700, reverse complement, Verified ORF"
[6] "YAL008W FUN14 SGDID:S0000006, Chr I from 135916-138512, Verified ORF"                   
[7] "YAL009W SPO7 SGDID:S0000007, Chr I from 134856-137635, Verified ORF"                    
# subset DNAStringSet and create a DNAStringSet of length 1
dna_seqs[1]
DNAStringSet object of length 1:
    width seq                                               names               
[1]  5573 ACTTGTAAATATATCTTTTATTT...CTTATCGACCTTATTGTTGATAT YAL001C TFC3 SGDI...
# subset DNAStringSet and create a DNAString object
dna_seqs[[1]]
5573-letter DNAString object
seq: ACTTGTAAATATATCTTTTATTTTCCGAGAGGAAAA...AATTTCTTAAACGCTTATCGACCTTATTGTTGATAT

All functions above would also work for AAStringSet objects and RNAStringSet objects.

Practice

Use the dna_seqs object created above to answer the following questions:

  1. What is the length of the 3rd sequence?
  2. What is the name of the 5th sequence?
  3. How many times does the TTCC sequence string occur in each sequence? Hint: use the function vcountPattern().
  4. What are the first 3 nucleotides of each sequence? Hint: use the subseq() function.
# Q1: Length of the 3rd sequence
width(dna_seqs)[3]
[1] 2987
# Q2: Name of the 5th sequence
names(dna_seqs)[5]
[1] "YAL007C ERP2 SGDID:S0000005, Chr I from 139347-136700, reverse complement, Verified ORF"
# Q3: Minimum and maximum temperature
vcountPattern("TTCC", dna_seqs)
[1] 21 28 16 23 15  8  5
# Q4: First 3 nucleotides of each sequence
subseq(dna_seqs, start = 1, end = 3)
DNAStringSet object of length 7:
    width seq                                               names               
[1]     3 ACT                                               YAL001C TFC3 SGDI...
[2]     3 TTC                                               YAL002W VPS8 SGDI...
[3]     3 CTT                                               YAL003W EFB1 SGDI...
[4]     3 CAC                                               YAL005C SSA1 SGDI...
[5]     3 AGA                                               YAL007C ERP2 SGDI...
[6]     3 GTG                                               YAL008W FUN14 SGD...
[7]     3 CAA                                               YAL009W SPO7 SGDI...

4.3 Working with genomic ranges: the GenomicRanges package

The GenomicRanges package implements S4 classes to represent genomic ranges as S4 objects.

Specifically, the GRanges class is designed to store a set of intervals including the name of the sequence where features are located as well as the range of integer coordinates spanned by the feature in that sequence.

More generally, the IRanges class is designed to store a set of intervals over a range of integer coordinates, without the notion of sequence names. As such, a GRanges object is merely the combination of an IRanges object and a vector of sequence names.

Those S4 classes provide automatic validity-checking functionality, and a range of methods implementing common operations on integer intervals and genomic ranges, from the calculation of distance between pairs of intervals to the identification of overlapping genomic ranges.

4.3.1 The IRanges class

While the genomic space of many organisms is subdivided into multiple sequences (e.g., chromosomes), many operations on genomic ranges take place within individual sequences, where only integer positions matter. The IRanges class provides a container for such “simple” ranges that are defined by two out of three pieces of information:

  • the start position of the range
  • the width of the range
  • the end position of the range

The IRanges() constructor function accepts those three pieces of information in the arguments start=, width=, and end=. For instance, we create two integer ranges from their start position and width:

  • one range starts at position 10 and has width 10
  • one range starts at position 15 and has width 5
demo_iranges <- IRanges(start = c(10, 15), width = c(10, 5))
demo_iranges
IRanges object with 2 ranges and 0 metadata columns:
          start       end     width
      <integer> <integer> <integer>
  [1]        10        19        10
  [2]        15        19         5

We note how the object displays not only the start and width information that we requested for each range, but also the end position that is naturally computed from the other two pieces of information.

Practice

Create the same two ranges as above, using the arguments start= and end= of the IRanges() constructor function.

IRanges(start = c(10, 15), end = c(19, 19))
IRanges object with 2 ranges and 0 metadata columns:
          start       end     width
      <integer> <integer> <integer>
  [1]        10        19        10
  [2]        15        19         5

The start and end positions as well as the width of every interval can be extracted as numeric vector using the functions start(), end() and width(), respectively.

start(demo_iranges)
[1] 10 15
end(demo_iranges)
[1] 19 19
width(demo_iranges)
[1] 10  5

Objects of the IRanges family extend the Vector class, and are handled as unidimensional vectors in terms of indexing. As such, individual ranges can be extracted by integer index like any regular vector.

demo_iranges[1]
IRanges object with 1 range and 0 metadata columns:
          start       end     width
      <integer> <integer> <integer>
  [1]        10        19        10

4.3.2 Metadata on IRanges

The IRanges class can accommodate metadata information on each range, including names - passed to the names= argument - and miscellaneous metadata passed as named vectors.

For instance, we create two ranges named “A” and “B”. Furthermore, we define metadata fields to store an example of character values and numeric values, respectively. Both the names and the values of the metadata fields are completely arbitrary in this example.

demo_with_metadata <- IRanges(
    start = c(10,  15),
    end   = c(19,  19),
    names = c("A", "B"),
    character_metadata = c("control", "target"),
    numeric_metadata = c(100, 200)
)

demo_with_metadata
IRanges object with 2 ranges and 2 metadata columns:
        start       end     width | character_metadata numeric_metadata
    <integer> <integer> <integer> |        <character>        <numeric>
  A        10        19        10 |            control              100
  B        15        19         5 |             target              200

The metadata columns can be extracted as a DataFrame using the function mcols() (short for “metadata columns”).

mcols(demo_with_metadata)
DataFrame with 2 rows and 2 columns
  character_metadata numeric_metadata
         <character>        <numeric>
A            control              100
B             target              200

The character vector of names can be extracted using the function names().

names(demo_with_metadata)
[1] "A" "B"

Similarly to named vector of base data types, individual ranges can be extracted by name.

demo_with_metadata["A"]
IRanges object with 1 range and 2 metadata columns:
        start       end     width | character_metadata numeric_metadata
    <integer> <integer> <integer> |        <character>        <numeric>
  A        10        19        10 |            control              100

4.3.3 The GRanges class

Having defined integer ranges, the only additional information necessary to define genomic ranges is the name of the genomic sequence on which each range is located.

For instance, we define two genomic ranges, as follows:

  • one genomic range on chromosome 1 (abbreviated “chr1”), from position 10 to 25
  • one genomic range on chromosome 2 (abbreviated “chr2”), from position 20 to 35

To do so, we use the GRanges() constructor function. We provide the sequence names as a character vector to the argument seqnames=, and we provide both the start and end position to the argument ranges= as an IRanges object.

demo_granges <- GRanges(
    seqnames = c("chr1", "chr2"),
    ranges = IRanges(
        start = c(10, 20),
        end   = c(25, 35))
)

demo_granges
GRanges object with 2 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1     10-25      *
  [2]     chr2     20-35      *
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths

In the console, the object displays the sequence names in the seqnames component, and the ranges in the form start-end in the ranges component. Furthermore, the example above also demonstrate that GRanges objects possess a component called strand; the symbol * indicates unstranded genomic ranges, as we have not provided that information.

The strand information can be supplied to the strand= argument, for instance:

demo_granges2 <- GRanges(
    seqnames = c("chr1", "chr2"),
    ranges = IRanges(
        start = c(10, 20),
        end   = c(25, 35)),
    strand  = c("+", "-")
)

demo_granges2
GRanges object with 2 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1     10-25      +
  [2]     chr2     20-35      -
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths

Finally, the examples above also demonstrate that GRanges objects include a component called seqinfo, which is occasionally used to store information about each sequence that may be represented in the seqnames component. In the latest example above, we have not provide any information about any sequence. As such, the seqinfo component was automatically populated with the names of the sequences that we used to create the object, while the remaining pieces of information were left unspecified, as NA.

seqinfo(demo_granges2)
Seqinfo object with 2 sequences from an unspecified genome; no seqlengths:
  seqnames seqlengths isCircular genome
  chr1             NA         NA   <NA>
  chr2             NA         NA   <NA>

The example above reveals that information about sequences include not only their respective name and length, but also whether they represent a circular polymer (e.g., plasmid), and the name of the genome that they are part of.

4.3.4 Metadata on GRanges

Similarly to IRanges, metadata can be passed directly to the GRanges constructor function. For instance:

demo_granges3 <- GRanges(
    seqnames = c("chr1", "chr2"),
    ranges = IRanges(
        start = c(10, 20),
        end = c(25, 35)),
    metadata1 = c("control", "target"),
    metadata2 = c(1, 2)
)

demo_granges3
GRanges object with 2 ranges and 2 metadata columns:
      seqnames    ranges strand |   metadata1 metadata2
         <Rle> <IRanges>  <Rle> | <character> <numeric>
  [1]     chr1     10-25      * |     control         1
  [2]     chr2     20-35      * |      target         2
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths

4.3.5 Importing genomic ranges from files

Frequently, large collections of genomic ranges are imported from files rather than described in manually written code. In particular, genome-wide annotations of known gene features are distributed as files on websites such as Ensembl.

Various file formats are commonly used to store genomic ranges in bioinformatics workflows. For instance, the BED (Browser Extensible Data) format is commonly found in Chromatin Immunoprecipitation Sequencing (ChIP-Seq), while GTF (Gene Transfer Format, GTF2.2) is the de facto standard file format to describe genomic features such as exons, transcripts, and genes.

In the following example, we import the gene model for Actin Beta (ACTB) from a small GTF file as a set of genomic ranges. The example file represents a subset of a GTF file for the Homo sapiens species, downloaded from Ensembl. The original file contains more than 3 millions lines and 22 metadata fields, from which a subset was extracted into a smaller file for this lesson.

In particular, we use the import() generic defined in the BiocIO package - with methods implemented in the rtracklayer package - as a versatile function that is capable of recognising common file extensions and associating them with the appropriate method for parsing each particular file format.

library(rtracklayer)

actb_gtf_data <- rtracklayer::import(here("data", "actb.gtf"))
actb_gtf_data
GRanges object with 267 ranges and 7 metadata columns:
        seqnames          ranges strand |      source           type     score
           <Rle>       <IRanges>  <Rle> |    <factor>       <factor> <numeric>
    [1]        7 5526409-5563902      - | rtracklayer     gene              NA
    [2]        7 5526409-5530601      - | rtracklayer     transcript        NA
    [3]        7 5530542-5530601      - | rtracklayer     exon              NA
    [4]        7 5529535-5529684      - | rtracklayer     exon              NA
    [5]        7 5529535-5529657      - | rtracklayer     CDS               NA
    ...      ...             ...    ... .         ...            ...       ...
  [263]        7 5540676-5540771      - | rtracklayer five_prime_utr        NA
  [264]        7 5529658-5529663      - | rtracklayer five_prime_utr        NA
  [265]        7 5561852-5562716      - | rtracklayer transcript            NA
  [266]        7 5562390-5562716      - | rtracklayer exon                  NA
  [267]        7 5561852-5561949      - | rtracklayer exon                  NA
            phase         gene_id   gene_name   transcript_id
        <integer>     <character> <character>     <character>
    [1]      <NA> ENSG00000075624        ACTB            <NA>
    [2]      <NA> ENSG00000075624        ACTB ENST00000674681
    [3]      <NA> ENSG00000075624        ACTB ENST00000674681
    [4]      <NA> ENSG00000075624        ACTB ENST00000674681
    [5]      <NA> ENSG00000075624        ACTB ENST00000674681
    ...       ...             ...         ...             ...
  [263]      <NA> ENSG00000075624        ACTB ENST00000414620
  [264]      <NA> ENSG00000075624        ACTB ENST00000414620
  [265]      <NA> ENSG00000075624        ACTB ENST00000646584
  [266]      <NA> ENSG00000075624        ACTB ENST00000646584
  [267]      <NA> ENSG00000075624        ACTB ENST00000646584
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

In the example above, the contents of the GTF file were imported into a GRanges object. For each entry in the file, the sequence name, start and end position, and strand information were used to populate the dedicated components of the object, while all other pieces of information are stored as separate columns of metadata.

From here on, this GRanges object can be manipulated just like any of the other GRanges objects that we have created earlier in this episode.

4.3.6 Operations on GRanges and the GRangesList class

As we have demonstrated so far, GRanges objects can be manually defined or imported from files. Those often represent genomic regions of interest, and databases of known genomic features, respectively. Either way, a number of operations are commonly applied to GRanges objects throughout bioinformatics workflows.

4.3.6.1 Subset

For instance, the subset() method is extremely convenient to extract a set of genomic ranges matching a condition on any component, including sequence name, start and end position, strand, or any metadata field. In the example below, we extract all the records of type transcript that start at position 5527147.

subset(actb_gtf_data, type == "transcript" & start == 5527147)
GRanges object with 5 ranges and 7 metadata columns:
      seqnames          ranges strand |      source       type     score
         <Rle>       <IRanges>  <Rle> |    <factor>   <factor> <numeric>
  [1]        7 5527147-5529949      - | rtracklayer transcript        NA
  [2]        7 5527147-5530581      - | rtracklayer transcript        NA
  [3]        7 5527147-5530604      - | rtracklayer transcript        NA
  [4]        7 5527147-5530604      - | rtracklayer transcript        NA
  [5]        7 5527147-5530604      - | rtracklayer transcript        NA
          phase         gene_id   gene_name   transcript_id
      <integer>     <character> <character>     <character>
  [1]      <NA> ENSG00000075624        ACTB ENST00000642480
  [2]      <NA> ENSG00000075624        ACTB ENST00000676397
  [3]      <NA> ENSG00000075624        ACTB ENST00000676319
  [4]      <NA> ENSG00000075624        ACTB ENST00000676189
  [5]      <NA> ENSG00000075624        ACTB ENST00000473257
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

4.3.6.2 Split

Separately, the split() method is useful to divide a set of genomic ranges initially stored in a single GRanges object into groups that are stored in a named list of GRanges objects. Conveniently, the GRangesList class provides a container for efficiently displaying and processing lists of GRanges objects.

In the example below, we first extract the subset of entries that represent exons, before separating those exons by transcript identifier, yielding the result as a GRangesList object.

actb_exons <- subset(actb_gtf_data, type == "exon")
actb_exons_by_transcript <- split(actb_exons, actb_exons$transcript_id)
actb_exons_by_transcript
GRangesList object of length 23:
$ENST00000414620
GRanges object with 4 ranges and 7 metadata columns:
      seqnames          ranges strand |      source     type     score
         <Rle>       <IRanges>  <Rle> |    <factor> <factor> <numeric>
  [1]        7 5562574-5562790      - | rtracklayer     exon        NA
  [2]        7 5540676-5540771      - | rtracklayer     exon        NA
  [3]        7 5529535-5529663      - | rtracklayer     exon        NA
  [4]        7 5529282-5529400      - | rtracklayer     exon        NA
          phase         gene_id   gene_name   transcript_id
      <integer>     <character> <character>     <character>
  [1]      <NA> ENSG00000075624        ACTB ENST00000414620
  [2]      <NA> ENSG00000075624        ACTB ENST00000414620
  [3]      <NA> ENSG00000075624        ACTB ENST00000414620
  [4]      <NA> ENSG00000075624        ACTB ENST00000414620
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

$ENST00000417101
GRanges object with 3 ranges and 7 metadata columns:
      seqnames          ranges strand |      source     type     score
         <Rle>       <IRanges>  <Rle> |    <factor> <factor> <numeric>
  [1]        7 5529806-5529982      - | rtracklayer     exon        NA
  [2]        7 5529535-5529663      - | rtracklayer     exon        NA
  [3]        7 5529235-5529400      - | rtracklayer     exon        NA
          phase         gene_id   gene_name   transcript_id
      <integer>     <character> <character>     <character>
  [1]      <NA> ENSG00000075624        ACTB ENST00000417101
  [2]      <NA> ENSG00000075624        ACTB ENST00000417101
  [3]      <NA> ENSG00000075624        ACTB ENST00000417101
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

$ENST00000425660
GRanges object with 7 ranges and 7 metadata columns:
      seqnames          ranges strand |      source     type     score
         <Rle>       <IRanges>  <Rle> |    <factor> <factor> <numeric>
  [1]        7 5530524-5530601      - | rtracklayer     exon        NA
  [2]        7 5529535-5529663      - | rtracklayer     exon        NA
  [3]        7 5529161-5529400      - | rtracklayer     exon        NA
  [4]        7 5529019-5529059      - | rtracklayer     exon        NA
  [5]        7 5528281-5528719      - | rtracklayer     exon        NA
  [6]        7 5528004-5528185      - | rtracklayer     exon        NA
  [7]        7 5527156-5527891      - | rtracklayer     exon        NA
          phase         gene_id   gene_name   transcript_id
      <integer>     <character> <character>     <character>
  [1]      <NA> ENSG00000075624        ACTB ENST00000425660
  [2]      <NA> ENSG00000075624        ACTB ENST00000425660
  [3]      <NA> ENSG00000075624        ACTB ENST00000425660
  [4]      <NA> ENSG00000075624        ACTB ENST00000425660
  [5]      <NA> ENSG00000075624        ACTB ENST00000425660
  [6]      <NA> ENSG00000075624        ACTB ENST00000425660
  [7]      <NA> ENSG00000075624        ACTB ENST00000425660
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

...
<20 more elements>

When printing the object above in the console, the first line confirms the class of the object as GRrangesList, while each named GRanges in that list is introduced by the dollar sign and the name of that item, just like regular named lists in base R.

4.3.6.3 Length

By nature, many of the methods applicable to list objects can be directly applied to GRangesList objects. For instance, the lengths() function can be used on GRangesList to display the length of each GRanges object in the list as an integer vector.

In the latest example above, we can compute the number of exons in each transcript as the length of each GRanges object within the GRangesList:

lengths(actb_exons_by_transcript)
ENST00000414620 ENST00000417101 ENST00000425660 ENST00000432588 ENST00000443528 
              4               3               7               5               3 
ENST00000462494 ENST00000464611 ENST00000473257 ENST00000477812 ENST00000480301 
              5               3               5               5               2 
ENST00000484841 ENST00000493945 ENST00000642480 ENST00000645025 ENST00000645576 
              5               6               5               4               5 
ENST00000646584 ENST00000646664 ENST00000647275 ENST00000674681 ENST00000675515 
              2               6               3               6               6 
ENST00000676189 ENST00000676319 ENST00000676397 
              6               3               6 
Practice

Importantly, the function lengths() (with a final s) demonstrated above is different from the function length() (without s). The former is meant to be used on list objects, while the latter is meant to be used on vectors.

What does length(actb_exons_by_transcript) return, and why?

length(actb_exons_by_transcript)
[1] 23

4.3.7 Subset by overlap

Possibly one of the most common operations when working with genomic ranges is to subset arbitrarily large collections of genomic ranges to those located in a specific region of the genome; for instance, when visualising information as tracks in a genome browser.

To demonstrate, we manually define a new GRanges representing a region of interest that we will use to extract all of the genomic ranges imported earlier from the GTF file which overlap that region of interest.

region_of_interest <- GRanges(
    seqnames = "7",
    ranges = IRanges(start = 5525830, end = 5531239)
)

actb_in_region <- subsetByOverlaps(x = actb_gtf_data, ranges = region_of_interest)
actb_in_region
GRanges object with 256 ranges and 7 metadata columns:
        seqnames          ranges strand |      source           type     score
           <Rle>       <IRanges>  <Rle> |    <factor>       <factor> <numeric>
    [1]        7 5526409-5563902      - | rtracklayer     gene              NA
    [2]        7 5526409-5530601      - | rtracklayer     transcript        NA
    [3]        7 5530542-5530601      - | rtracklayer     exon              NA
    [4]        7 5529535-5529684      - | rtracklayer     exon              NA
    [5]        7 5529535-5529657      - | rtracklayer     CDS               NA
    ...      ...             ...    ... .         ...            ...       ...
  [252]        7 5529535-5529657      - | rtracklayer CDS                   NA
  [253]        7 5529655-5529657      - | rtracklayer start_codon           NA
  [254]        7 5529282-5529400      - | rtracklayer exon                  NA
  [255]        7 5529282-5529400      - | rtracklayer CDS                   NA
  [256]        7 5529658-5529663      - | rtracklayer five_prime_utr        NA
            phase         gene_id   gene_name   transcript_id
        <integer>     <character> <character>     <character>
    [1]      <NA> ENSG00000075624        ACTB            <NA>
    [2]      <NA> ENSG00000075624        ACTB ENST00000674681
    [3]      <NA> ENSG00000075624        ACTB ENST00000674681
    [4]      <NA> ENSG00000075624        ACTB ENST00000674681
    [5]      <NA> ENSG00000075624        ACTB ENST00000674681
    ...       ...             ...         ...             ...
  [252]      <NA> ENSG00000075624        ACTB ENST00000414620
  [253]      <NA> ENSG00000075624        ACTB ENST00000414620
  [254]      <NA> ENSG00000075624        ACTB ENST00000414620
  [255]      <NA> ENSG00000075624        ACTB ENST00000414620
  [256]      <NA> ENSG00000075624        ACTB ENST00000414620
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

Like the subset() method, the subsetByOverlaps() method returns a new GRanges object. We can visually compare the information printed in the object (256 ranges in the new subsetted object, relative to 267 ranges in the original object), or we can programmatically compare the length of the two objects to check whether the new GRanges object is any smaller than the original GRanges object:

length(actb_in_region) - length(actb_gtf_data)
[1] -11

In the example above, we learn that the new GRanges object has 11 records less than the original GRanges object.

4.4 Working with quantitative data: the SummarizedExperiment package

The figure below represents the anatomy of the SummarizedExperiment class.

Objects of the class SummarizedExperiment contain :

  • One (or more) assay(s) containing the quantitative omics data (expression data), stored as a matrix-like object. Features (genes, transcripts, proteins, …) are defined along the rows, and samples along the columns.

  • A sample metadata slot containing sample co-variates, stored as a data frame. Rows from this table represent samples (rows match exactly the columns of the expression data).

  • A feature metadata slot containing feature co-variates, stored as a data frame. The rows of this data frame match exactly the rows of the expression data.

The coordinated nature of the SummarizedExperiment guarantees that during data manipulation, the dimensions of the different slots will always match (i.e the columns in the expression data and then rows in the sample metadata, as well as the rows in the expression data and feature metadata) during data manipulation. For example, if we had to exclude one sample from the assay, it would be automatically removed from the sample metadata in the same operation.

The metadata slots can grow additional co-variates (columns) without affecting the other structures.

4.4.1 Creating a SummarizedExperiment object

In order to create a SummarizedExperiment, we will create the individual components, i.e the count matrix, the sample and gene metadata from csv files. These are typically how RNA-Seq data are provided (after raw data have been processed).

  • An expression matrix: we load the count matrix, specifying that the first columns contains row/gene names, and convert the data.frame to a matrix.
# Read count matrix
count_matrix <- read.csv(here("data", "count_matrix.csv"), row.names = 1) |>
    as.matrix()

# Inspect data
count_matrix[1:5, ]
        GSM2545336 GSM2545337 GSM2545338 GSM2545339 GSM2545340 GSM2545341
Asl           1170        361        400        586        626        988
Apod         36194      10347       9173      10620      13021      29594
Cyp2d22       4060       1616       1603       1901       2171       3349
Klk6           287        629        641        578        448        195
Fcrls           85        233        244        237        180         38
        GSM2545342 GSM2545343 GSM2545344 GSM2545345 GSM2545346 GSM2545347
Asl            836        535        586        597        938       1035
Apod         24959      13668      13230      15868      27769      34301
Cyp2d22       3122       2008       2254       2277       2985       3452
Klk6           186       1101        537        567        327        233
Fcrls           68        375        199        177         89         67
        GSM2545348 GSM2545349 GSM2545350 GSM2545351 GSM2545352 GSM2545353
Asl            494        481        666        937        803        541
Apod         11258      11812      15816      29242      20415      13682
Cyp2d22       1883       2014       2417       3678       2920       2216
Klk6           742        881        828        250        798        710
Fcrls          300        233        231         81        303        285
        GSM2545354 GSM2545362 GSM2545363 GSM2545380
Asl            473        748        576       1192
Apod         11088      15916      11166      38148
Cyp2d22       1821       2842       2011       4019
Klk6           894        501        598        259
Fcrls          248        179        184         68
dim(count_matrix)
[1] 1474   22
  • A table describing the samples, often referred to as sample metadata.
# Read sample metadata
sample_metadata <- read.csv(here("data", "sample_metadata.csv"))

# Inspect data
sample_metadata
       sample     organism age    sex   infection  strain time     tissue mouse
1  GSM2545336 Mus musculus   8 Female  InfluenzaA C57BL/6    8 Cerebellum    14
2  GSM2545337 Mus musculus   8 Female NonInfected C57BL/6    0 Cerebellum     9
3  GSM2545338 Mus musculus   8 Female NonInfected C57BL/6    0 Cerebellum    10
4  GSM2545339 Mus musculus   8 Female  InfluenzaA C57BL/6    4 Cerebellum    15
5  GSM2545340 Mus musculus   8   Male  InfluenzaA C57BL/6    4 Cerebellum    18
6  GSM2545341 Mus musculus   8   Male  InfluenzaA C57BL/6    8 Cerebellum     6
7  GSM2545342 Mus musculus   8 Female  InfluenzaA C57BL/6    8 Cerebellum     5
8  GSM2545343 Mus musculus   8   Male NonInfected C57BL/6    0 Cerebellum    11
9  GSM2545344 Mus musculus   8 Female  InfluenzaA C57BL/6    4 Cerebellum    22
10 GSM2545345 Mus musculus   8   Male  InfluenzaA C57BL/6    4 Cerebellum    13
11 GSM2545346 Mus musculus   8   Male  InfluenzaA C57BL/6    8 Cerebellum    23
12 GSM2545347 Mus musculus   8   Male  InfluenzaA C57BL/6    8 Cerebellum    24
13 GSM2545348 Mus musculus   8 Female NonInfected C57BL/6    0 Cerebellum     8
14 GSM2545349 Mus musculus   8   Male NonInfected C57BL/6    0 Cerebellum     7
15 GSM2545350 Mus musculus   8   Male  InfluenzaA C57BL/6    4 Cerebellum     1
16 GSM2545351 Mus musculus   8 Female  InfluenzaA C57BL/6    8 Cerebellum    16
17 GSM2545352 Mus musculus   8 Female  InfluenzaA C57BL/6    4 Cerebellum    21
18 GSM2545353 Mus musculus   8 Female NonInfected C57BL/6    0 Cerebellum     4
19 GSM2545354 Mus musculus   8   Male NonInfected C57BL/6    0 Cerebellum     2
20 GSM2545362 Mus musculus   8 Female  InfluenzaA C57BL/6    4 Cerebellum    20
21 GSM2545363 Mus musculus   8   Male  InfluenzaA C57BL/6    4 Cerebellum    12
22 GSM2545380 Mus musculus   8 Female  InfluenzaA C57BL/6    8 Cerebellum    19
dim(sample_metadata)
[1] 22  9
  • A table describing the genes, often referred to as gene metadata.
# Read gene metadata
gene_metadata <- read.csv(here("data", "gene_metadata.csv"))

# Inspect data
gene_metadata[1:10, 1:4]
      gene ENTREZID
1      Asl   109900
2     Apod    11815
3  Cyp2d22    56448
4     Klk6    19144
5    Fcrls    80891
6   Slc2a4    20528
7     Exd2    97827
8     Gjc2   118454
9     Plp1    18823
10    Gnb4    14696
                                                                         product
1                                 argininosuccinate lyase, transcript variant X1
2                                         apolipoprotein D, transcript variant 3
3   cytochrome P450, family 2, subfamily d, polypeptide 22, transcript variant 2
4                           kallikrein related-peptidase 6, transcript variant 2
5                  Fc receptor-like S, scavenger receptor, transcript variant X1
6            solute carrier family 2 (facilitated glucose transporter), member 4
7                                          exonuclease 3'-5' domain containing 2
8                            gap junction protein, gamma 2, transcript variant 1
9                           proteolipid protein (myelin) 1, transcript variant 1
10 guanine nucleotide binding protein (G protein), beta 4, transcript variant X2
      ensembl_gene_id
1  ENSMUSG00000025533
2  ENSMUSG00000022548
3  ENSMUSG00000061740
4  ENSMUSG00000050063
5  ENSMUSG00000015852
6  ENSMUSG00000018566
7  ENSMUSG00000032705
8  ENSMUSG00000043448
9  ENSMUSG00000031425
10 ENSMUSG00000027669
dim(gene_metadata)
[1] 1474    9

We will create a SummarizedExperiment from these tables:

  • The count matrix that will be used as the assay

  • The table describing the samples will be used as the colData slot

  • The table describing the genes will be used as the rowData slot

To do this we can put the different parts together using the SummarizedExperiment constructor:

# Create SummarizedExperiment object
se <- SummarizedExperiment(
    assays = list(counts = count_matrix),
    colData = sample_metadata,
    rowData = gene_metadata
)

se
class: SummarizedExperiment 
dim: 1474 22 
metadata(0):
assays(1): counts
rownames(1474): Asl Apod ... Lmx1a Pbx1
rowData names(9): gene ENTREZID ... phenotype_description
  hsapiens_homolog_associated_gene_name
colnames(22): GSM2545336 GSM2545337 ... GSM2545363 GSM2545380
colData names(9): sample organism ... tissue mouse

Using this data structure, we can access the expression matrix with the assay() function:

head(assay(se))
        GSM2545336 GSM2545337 GSM2545338 GSM2545339 GSM2545340 GSM2545341
Asl           1170        361        400        586        626        988
Apod         36194      10347       9173      10620      13021      29594
Cyp2d22       4060       1616       1603       1901       2171       3349
Klk6           287        629        641        578        448        195
Fcrls           85        233        244        237        180         38
Slc2a4         782        231        248        265        313        786
        GSM2545342 GSM2545343 GSM2545344 GSM2545345 GSM2545346 GSM2545347
Asl            836        535        586        597        938       1035
Apod         24959      13668      13230      15868      27769      34301
Cyp2d22       3122       2008       2254       2277       2985       3452
Klk6           186       1101        537        567        327        233
Fcrls           68        375        199        177         89         67
Slc2a4         528        249        266        357        654        693
        GSM2545348 GSM2545349 GSM2545350 GSM2545351 GSM2545352 GSM2545353
Asl            494        481        666        937        803        541
Apod         11258      11812      15816      29242      20415      13682
Cyp2d22       1883       2014       2417       3678       2920       2216
Klk6           742        881        828        250        798        710
Fcrls          300        233        231         81        303        285
Slc2a4         271        304        349        715        513        320
        GSM2545354 GSM2545362 GSM2545363 GSM2545380
Asl            473        748        576       1192
Apod         11088      15916      11166      38148
Cyp2d22       1821       2842       2011       4019
Klk6           894        501        598        259
Fcrls          248        179        184         68
Slc2a4         248        350        317        796
dim(assay(se))
[1] 1474   22

We can access the sample metadata using the colData() function:

colData(se)
DataFrame with 22 rows and 9 columns
                sample     organism       age         sex   infection
           <character>  <character> <integer> <character> <character>
GSM2545336  GSM2545336 Mus musculus         8      Female  InfluenzaA
GSM2545337  GSM2545337 Mus musculus         8      Female NonInfected
GSM2545338  GSM2545338 Mus musculus         8      Female NonInfected
GSM2545339  GSM2545339 Mus musculus         8      Female  InfluenzaA
GSM2545340  GSM2545340 Mus musculus         8        Male  InfluenzaA
...                ...          ...       ...         ...         ...
GSM2545353  GSM2545353 Mus musculus         8      Female NonInfected
GSM2545354  GSM2545354 Mus musculus         8        Male NonInfected
GSM2545362  GSM2545362 Mus musculus         8      Female  InfluenzaA
GSM2545363  GSM2545363 Mus musculus         8        Male  InfluenzaA
GSM2545380  GSM2545380 Mus musculus         8      Female  InfluenzaA
                strain      time      tissue     mouse
           <character> <integer> <character> <integer>
GSM2545336     C57BL/6         8  Cerebellum        14
GSM2545337     C57BL/6         0  Cerebellum         9
GSM2545338     C57BL/6         0  Cerebellum        10
GSM2545339     C57BL/6         4  Cerebellum        15
GSM2545340     C57BL/6         4  Cerebellum        18
...                ...       ...         ...       ...
GSM2545353     C57BL/6         0  Cerebellum         4
GSM2545354     C57BL/6         0  Cerebellum         2
GSM2545362     C57BL/6         4  Cerebellum        20
GSM2545363     C57BL/6         4  Cerebellum        12
GSM2545380     C57BL/6         8  Cerebellum        19
dim(colData(se))
[1] 22  9

We can also access the feature metadata using the rowData() function:

head(rowData(se))
DataFrame with 6 rows and 9 columns
               gene  ENTREZID                product    ensembl_gene_id
        <character> <integer>            <character>        <character>
Asl             Asl    109900 argininosuccinate ly.. ENSMUSG00000025533
Apod           Apod     11815 apolipoprotein D, tr.. ENSMUSG00000022548
Cyp2d22     Cyp2d22     56448 cytochrome P450, fam.. ENSMUSG00000061740
Klk6           Klk6     19144 kallikrein related-p.. ENSMUSG00000050063
Fcrls         Fcrls     80891 Fc receptor-like S, .. ENSMUSG00000015852
Slc2a4       Slc2a4     20528 solute carrier famil.. ENSMUSG00000018566
        external_synonym chromosome_name   gene_biotype  phenotype_description
             <character>     <character>    <character>            <character>
Asl        2510006M18Rik               5 protein_coding abnormal circulating..
Apod                  NA              16 protein_coding abnormal lipid homeo..
Cyp2d22             2D22              15 protein_coding abnormal skin morpho..
Klk6                Bssp               7 protein_coding abnormal cytokine le..
Fcrls      2810439C17Rik               3 protein_coding decreased CD8-positi..
Slc2a4            Glut-4              11 protein_coding abnormal circulating..
        hsapiens_homolog_associated_gene_name
                                  <character>
Asl                                       ASL
Apod                                     APOD
Cyp2d22                                CYP2D6
Klk6                                     KLK6
Fcrls                                   FCRL2
Slc2a4                                 SLC2A4
dim(rowData(se))
[1] 1474    9

4.4.2 Subsetting a SummarizedExperiment

SummarizedExperiment objects can be subset just like with data frames, with numerics or with characters of logicals.

Below, we create a new instance of class SummarizedExperiment that contains only the 5 first features for the 3 first samples.

se1 <- se[1:5, 1:3]
se1
class: SummarizedExperiment 
dim: 5 3 
metadata(0):
assays(1): counts
rownames(5): Asl Apod Cyp2d22 Klk6 Fcrls
rowData names(9): gene ENTREZID ... phenotype_description
  hsapiens_homolog_associated_gene_name
colnames(3): GSM2545336 GSM2545337 GSM2545338
colData names(9): sample organism ... tissue mouse
colData(se1)
DataFrame with 3 rows and 9 columns
                sample     organism       age         sex   infection
           <character>  <character> <integer> <character> <character>
GSM2545336  GSM2545336 Mus musculus         8      Female  InfluenzaA
GSM2545337  GSM2545337 Mus musculus         8      Female NonInfected
GSM2545338  GSM2545338 Mus musculus         8      Female NonInfected
                strain      time      tissue     mouse
           <character> <integer> <character> <integer>
GSM2545336     C57BL/6         8  Cerebellum        14
GSM2545337     C57BL/6         0  Cerebellum         9
GSM2545338     C57BL/6         0  Cerebellum        10
rowData(se1)
DataFrame with 5 rows and 9 columns
               gene  ENTREZID                product    ensembl_gene_id
        <character> <integer>            <character>        <character>
Asl             Asl    109900 argininosuccinate ly.. ENSMUSG00000025533
Apod           Apod     11815 apolipoprotein D, tr.. ENSMUSG00000022548
Cyp2d22     Cyp2d22     56448 cytochrome P450, fam.. ENSMUSG00000061740
Klk6           Klk6     19144 kallikrein related-p.. ENSMUSG00000050063
Fcrls         Fcrls     80891 Fc receptor-like S, .. ENSMUSG00000015852
        external_synonym chromosome_name   gene_biotype  phenotype_description
             <character>     <character>    <character>            <character>
Asl        2510006M18Rik               5 protein_coding abnormal circulating..
Apod                  NA              16 protein_coding abnormal lipid homeo..
Cyp2d22             2D22              15 protein_coding abnormal skin morpho..
Klk6                Bssp               7 protein_coding abnormal cytokine le..
Fcrls      2810439C17Rik               3 protein_coding decreased CD8-positi..
        hsapiens_homolog_associated_gene_name
                                  <character>
Asl                                       ASL
Apod                                     APOD
Cyp2d22                                CYP2D6
Klk6                                     KLK6
Fcrls                                   FCRL2

We can also use the colData() function to subset on something from the sample metadata or the rowData() to subset on something from the feature metadata. For example, here we keep only miRNAs and the non infected samples:

# Subset object
se1 <- se[rowData(se)$gene_biotype == "miRNA",
          colData(se)$infection == "NonInfected"]
se1
class: SummarizedExperiment 
dim: 7 7 
metadata(0):
assays(1): counts
rownames(7): Mir1901 Mir378a ... Mir128-1 Mir7682
rowData names(9): gene ENTREZID ... phenotype_description
  hsapiens_homolog_associated_gene_name
colnames(7): GSM2545337 GSM2545338 ... GSM2545353 GSM2545354
colData names(9): sample organism ... tissue mouse
assay(se1)
         GSM2545337 GSM2545338 GSM2545343 GSM2545348 GSM2545349 GSM2545353
Mir1901          45         44         74         55         68         33
Mir378a          11          7          9          4         12          4
Mir133b           4          6          5          4          6          7
Mir30c-2         10          6         16         12          8         17
Mir149            1          2          0          0          0          0
Mir128-1          4          1          2          2          1          2
Mir7682           2          0          4          1          3          5
         GSM2545354
Mir1901          60
Mir378a           8
Mir133b           3
Mir30c-2         15
Mir149            2
Mir128-1          1
Mir7682           5
colData(se1)
DataFrame with 7 rows and 9 columns
                sample     organism       age         sex   infection
           <character>  <character> <integer> <character> <character>
GSM2545337  GSM2545337 Mus musculus         8      Female NonInfected
GSM2545338  GSM2545338 Mus musculus         8      Female NonInfected
GSM2545343  GSM2545343 Mus musculus         8        Male NonInfected
GSM2545348  GSM2545348 Mus musculus         8      Female NonInfected
GSM2545349  GSM2545349 Mus musculus         8        Male NonInfected
GSM2545353  GSM2545353 Mus musculus         8      Female NonInfected
GSM2545354  GSM2545354 Mus musculus         8        Male NonInfected
                strain      time      tissue     mouse
           <character> <integer> <character> <integer>
GSM2545337     C57BL/6         0  Cerebellum         9
GSM2545338     C57BL/6         0  Cerebellum        10
GSM2545343     C57BL/6         0  Cerebellum        11
GSM2545348     C57BL/6         0  Cerebellum         8
GSM2545349     C57BL/6         0  Cerebellum         7
GSM2545353     C57BL/6         0  Cerebellum         4
GSM2545354     C57BL/6         0  Cerebellum         2
rowData(se1)
DataFrame with 7 rows and 9 columns
                gene  ENTREZID        product    ensembl_gene_id
         <character> <integer>    <character>        <character>
Mir1901      Mir1901 100316686  microRNA 1901 ENSMUSG00000084565
Mir378a      Mir378a    723889  microRNA 378a ENSMUSG00000105200
Mir133b      Mir133b    723817  microRNA 133b ENSMUSG00000065480
Mir30c-2    Mir30c-2    723964 microRNA 30c-2 ENSMUSG00000065567
Mir149        Mir149    387167   microRNA 149 ENSMUSG00000065470
Mir128-1    Mir128-1    387147 microRNA 128-1 ENSMUSG00000065520
Mir7682      Mir7682 102466847  microRNA 7682 ENSMUSG00000106406
         external_synonym chromosome_name gene_biotype  phenotype_description
              <character>     <character>  <character>            <character>
Mir1901          Mirn1901              18        miRNA                     NA
Mir378a           Mirn378              18        miRNA abnormal mitochondri..
Mir133b          mir 133b               1        miRNA no abnormal phenotyp..
Mir30c-2        mir 30c-2               1        miRNA                     NA
Mir149            Mirn149               1        miRNA increased circulatin..
Mir128-1          Mirn128               1        miRNA no abnormal phenotyp..
Mir7682      mmu-mir-7682               1        miRNA                     NA
         hsapiens_homolog_associated_gene_name
                                   <character>
Mir1901                                     NA
Mir378a                                MIR378A
Mir133b                                MIR133B
Mir30c-2                               MIR30C2
Mir149                                      NA
Mir128-1                              MIR128-1
Mir7682                                     NA
Practice
  1. Extract the gene expression levels of the 3 first genes in samples at time 0 and at time 8.
assay(se)[1:3, colData(se)$time != 4]
        GSM2545336 GSM2545337 GSM2545338 GSM2545341 GSM2545342 GSM2545343
Asl           1170        361        400        988        836        535
Apod         36194      10347       9173      29594      24959      13668
Cyp2d22       4060       1616       1603       3349       3122       2008
        GSM2545346 GSM2545347 GSM2545348 GSM2545349 GSM2545351 GSM2545353
Asl            938       1035        494        481        937        541
Apod         27769      34301      11258      11812      29242      13682
Cyp2d22       2985       3452       1883       2014       3678       2216
        GSM2545354 GSM2545380
Asl            473       1192
Apod         11088      38148
Cyp2d22       1821       4019
# Equivalent to
assay(se)[1:3, colData(se)$time == 0 | colData(se)$time == 8]
        GSM2545336 GSM2545337 GSM2545338 GSM2545341 GSM2545342 GSM2545343
Asl           1170        361        400        988        836        535
Apod         36194      10347       9173      29594      24959      13668
Cyp2d22       4060       1616       1603       3349       3122       2008
        GSM2545346 GSM2545347 GSM2545348 GSM2545349 GSM2545351 GSM2545353
Asl            938       1035        494        481        937        541
Apod         27769      34301      11258      11812      29242      13682
Cyp2d22       2985       3452       1883       2014       3678       2216
        GSM2545354 GSM2545380
Asl            473       1192
Apod         11088      38148
Cyp2d22       1821       4019

4.4.3 Saving data

Exporting data to a spreadsheet, as we did in a previous lesson, has several limitations, such as those described in the first chapter (possible inconsistencies with , and . for decimal separators and lack of variable type definitions). Furthermore, exporting data to a spreadsheet is only relevant for rectangular data such as dataframes and matrices.

A more general way to save data, that is specific to R and is guaranteed to work on any operating system, is to use the saveRDS() function. Saving objects like this will generate a binary representation on disk (using the rds file extension here), which can be loaded back into R using the readRDS function.

# Save R object as an RDS file
saveRDS(se, file = here("output", "se.rds"))
rm(se)

se <- readRDS(here("output", "se.rds"))
head(se)
class: SummarizedExperiment 
dim: 6 22 
metadata(0):
assays(1): counts
rownames(6): Asl Apod ... Fcrls Slc2a4
rowData names(9): gene ENTREZID ... phenotype_description
  hsapiens_homolog_associated_gene_name
colnames(22): GSM2545336 GSM2545337 ... GSM2545363 GSM2545380
colData names(9): sample organism ... tissue mouse

To conclude, when it comes to saving data from R that will be loaded again in R, saving and loading with saveRDS and readRDS is the preferred approach. If tabular data need to be shared with somebody that is not using R, then exporting to a text-based spreadsheet is a good alternative.