bears: building expression atlases in R

Introduction

In the past decades, there has been an exponential accumulation of RNA-seq data in public repositories. This steep increase paved the way to the creation of gene expression atlases, which consist in comprehensive collections of expression data from public databases, analyzed with a single pipeline for consistency and cross-project comparison. bears is a package that allows you to create your own gene expression atlas for a given species using public data. The package features:

Data download from NCBI’s Sequence Read Archive (SRA) or the European Nucleotide Archive (ENA).
Sequence quality check and trimming of low-quality sequences and adapters.
Removal of rRNA, which are typical problems of libraries that were prepared with the rRNA depletion protocol.
Read mapping against a reference genome.
Transcriptome assembly.
Quantification of gene- and transcript-level transcript abundance with alignment-based and alignment-free methods.

Installation and setup

To install bears, use the following code:

remotes::install_github("almeidasilvaf/bears")

Then, create a standard directory structure with create_dir_structure() to store your results. This is optional, but it will make your life much easier. The output is a list of paths to common directories that you will need to specify in several functions of this package.

library(bears)

# Create directory structure using a temporary directory as root
ds <- create_dir_structure(rootdir = tempdir())

# Look at the output
ds
#> $fastqdir
#> [1] "/tmp/RtmpGIVQm5/01_FASTQ_files"
#> 
#> $qcdir
#> [1] "/tmp/RtmpGIVQm5/QC_dir"
#> 
#> $filtdir
#> [1] "/tmp/RtmpGIVQm5/02_filtered_FASTQ"
#> 
#> $mappingdir
#> [1] "/tmp/RtmpGIVQm5/03_read_mapping"
#> 
#> $salmonindex
#> [1] "/tmp/RtmpGIVQm5/04_quantification/salmon/idx"
#> 
#> $salmondir
#> [1] "/tmp/RtmpGIVQm5/04_quantification/salmon"
#> 
#> $kallistoindex
#> [1] "/tmp/RtmpGIVQm5/04_quantification/kallisto/idx"
#> 
#> $kallistodir
#> [1] "/tmp/RtmpGIVQm5/04_quantification/kallisto"
#> 
#> $fcountsdir
#> [1] "/tmp/RtmpGIVQm5/04_quantification/featureCounts"
#> 
#> $stringtiedir
#> [1] "/tmp/RtmpGIVQm5/04_quantification/stringtie"

To run the full pipeline implemented in bears, you will need to have some external software tools installed in your machine. The names of the tools are listed below. In the Functions column, you can see the names of functions in bears that require each tool.

Software	Version	Functions
fastp	>=0.22.0	trim_reads()
kallisto	>=0.11.9	kallisto_index(), kallisto_quantify()
RSeQC	>=4.0.0	infer_strandedness()
salmon	>=1.8.0	salmon_index(), salmon_quantify()
SortMeRNA	>=4.3.4	remove_rrna()
STAR	>=2.7.10a	star_genome_index(), star_align()
StringTie	>=2.2.1	stringtie_assemble(), stringtie_quantify()
Subread	>=2.0.1	fcount()
TACO	>=0.7.3	taco_merge()

To make your life easier, we have created .yml files with Conda environments containing each of these external tools. You can create an environment for each tool and manage Conda environments from the R session with the Bioconductor package Herper (see the FAQ section for details).

Alternatively, you can see here the code we used to install all external dependencies in the Ubuntu 20.04 virtual machine provided by GitHub Actions, which is how this package is tested and how this document was created.

As a sanity check, let’s see if all external dependencies are installed.

# Test installation of external dependencies
fastp_is_installed()
#> [1] TRUE
star_is_installed()
#> [1] TRUE
sortmerna_is_installed()
#> [1] TRUE
rseqc_is_installed()
#> [1] TRUE
salmon_is_installed()
#> [1] TRUE
kallisto_is_installed()
#> [1] TRUE
subread_is_installed()
#> [1] TRUE
stringtie_is_installed()
#> [1] TRUE
taco_is_installed()
#> [1] TRUE

Retrieving sample metadata

First of all, you need to choose which samples you want to download and create a metadata data frame for your samples. To create this data frame, you will pass a search term to the function create_sample_info(). The search term has the same syntax of the SRA search term syntax. For example, you can search by:

BioSample accession - SAMN08903403[BSPL]
BioProject accession - PRJNA229998[GPRJ]
All RNA-seq samples for an organism - Glycine max[ORGN] AND RNA-seq[STRA]
And so on…

Let’s create a metadata data frame for a human RNA-seq sample that is included in the airway Bioconductor package.

# Create metadata data frame
term <- "SAMN02422669[BSPL]"
metadata <- create_sample_info(term)
metadata
#>      BioSample Experiment        Run                           Tissue   Pubmed
#> 1 SAMN02422669  SRX384345 SRR1039508 human airway smooth muscle cells 24926665
#>    BioProject          Instrument Layout Selection_method SRA_sample SRA_study
#> 1 PRJNA229998 Illumina HiSeq 2000 PAIRED             cDNA  SRS508568 SRP033351
#>   Treatment Cultivar
#> 1 Untreated     <NA>
#>                                                                          Study_title
#> 1 Human Airway Smooth Muscle Transcriptome Changes in Response to Asthma Medications
#>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   Study_abstract
#> 1 Rationale: Asthma is a chronic inflammatory airway disease. The most common medications used for its treatment are ß2-agonists and glucocorticosteroids, and one of the primary tissues that these drugs target in the treatment of asthma is the airway smooth muscle. We used RNA-Seq to characterize the human airway smooth muscle (HASM) transcriptome at baseline and under three asthma treatment conditions. Methods: The Illumina TruSeq assay was used to prepare 75bp paired-end libraries for HASM cells from four white male donors under four treatment conditions: 1) no treatment; 2) treatment with a ß2-agonist (i.e. Albuterol, 1µM for 18h); 3) treatment with a glucocorticosteroid (i.e. Dexamethasone (Dex), 1µM for 18h); 4) simultaneous treatment with a ß2-agonist and glucocorticoid, and the libraries were sequenced with an Illumina Hi-Seq 2000 instrument. The Tuxedo Suite Tools were used to align reads to the hg19 reference genome, assemble transcripts, and perform differential expression analysis using the protocol described in https://github.com/blancahimes/taffeta Overall design: mRNA profiles obtained via RNA-Seq for four primary human airway smooth muscle cell lines that were treated with dexamethasone, albuterol, dexamethasone+albuterol or were left untreated.
#>                  Date Origin
#> 1 2014-01-02 09:16:11   <NA>

Downloading FASTQ files

To download the .fastq files from ENA, you will use the function download_from_ena(). As input, you only need to give the metadata data frame and the path the output directory where .fastq files will be stored.

IMPORTANT: Note below that you must change the timeout limit, or your downloads will be interrupted after the default timeout limit of 60 seconds. You will probably need more than 60 seconds to download some samples, especially if your internet connection is not very good.

Here’s an example of how to run download_from_ena():

# Change default timeout limit
options(timeout = 6000)

# Download sample to temporary directory
download <- download_from_ena(metadata, fastqdir = ds$fastqdir)

For running time issues, we will simply copy the example FASTQ files from the extdata subdirectory of this package to ds$fastqdir. These example files contain a subset of the SAMN02422669 BioSample we mentioned before.

# For running time issues, copy example FASTQ files to ds$fastqdir
f <- list.files(
    system.file("extdata", package = "bears"), 
    pattern = ".fastq.gz", 
    full.names = TRUE
)
copy <- file.copy(f, ds$fastqdir)

After downloading FASTQ files, it’s important to check the integrity of your files. The function check_md5() does that for you by checking if the md5 checksum of the downloaded file matches the md5 checksum of the original file.

check_md5(run_accessions = metadata$Run, fastqdir = ds$fastqdir)
#>          Run Status
#> 1 SRR1039508  FALSE

Here, expectedly, the function reported an issue. As the example files we are using contain only a subset of the SAMN02422669 BioSample, the md5 checksums are different. If the md5 checksums were the same, the Status variable would be TRUE.

Read QC and filtering

After downloading the .fastq files, you need to perform some filtering steps to remove low quality bases, adapters, and rRNA. Read filtering consists in 2 steps:

trim_reads() - trim adapters and low quality bases (if any). This function runs fastp (Chen et al. 2018) and saves filtered .fastq files in a directory named filtdir.
remove_rrna() - remove rRNA (if any) from .fastq files. rRNA removal relies on the SortMeRNA (Kopylova, Noé, and Touzet 2012) program.

Trimming adapters and low quality bases

First, let’s use trim_reads() to filter reads. ¹

if(fastp_is_installed()) {

    # Trim reads
    fastp_status <- trim_reads(
        metadata, 
        fastqdir = ds$fastqdir, 
        filtdir = ds$filtdir,
        qcdir = ds$qcdir
    )
    
    fastp_status # check run status
}
#>          run status
#> 1 SRR1039508     OK

The function trim_reads() stores .json files containing fastp summary statistics for each sample in the directory specified in qcdir. You can read it and parse it into a data frame with the function summary_stats_fastp(). Let’s demonstrate how it works.

# Path to directory containing .json file from fastp
qcdir <- system.file("extdata", package = "bears")
qc_table <- summary_stats_fastp(qcdir)

qc_table
#>       Sample                         sequencing before_nreads before_nbases
#> 1 SRR1039508 paired end (63 cycles + 63 cycles)         14194        894222
#>   before_q20bases before_q30bases before_q20rate before_q30rate
#> 1          882850          866181       0.987283       0.968642
#>   before_GCcontent before_meanlength after_nreads after_nbases after_q20bases
#> 1         0.486978                63        14076       886742         878073
#>   after_q30bases after_q20rate after_q30rate after_GCcontent after_meanlength
#> 1         862056      0.990224      0.972161        0.486906               62
#>   filter_n_passed filter_n_lowquality filter_n_too_many_N filter_n_tooshort
#> 1           14076                  84                  34                 0
#>   filter_n_toolong duplication_rate
#> 1                0       0.00986332

Removing rRNA

In this vignette, we will use a small rRNA database as an example. In real-world applications, your rRNA database directory should contain all FASTA files distributed in the SortMeRNA GitHub repo. However, if you think some of these files (e.g., 5s and 5.8s rRNA) are not a concern in your data set, you don’t need to include them in the database.

# Create a directory to store the rRNA db
rrna_db_dir <- file.path(tempdir(), "rrna")
dir.create(rrna_db_dir)

# Copy the example 16S rRNA file to the db directory.
rrna_file <- system.file("extdata", "bac_16s_subset.fa", package = "bears")
copy <- file.copy(from = rrna_file, to = rrna_db_dir)

# Remove rRNA (if any)
if(sortmerna_is_installed()) {
    rrna_removal <- remove_rrna(
        metadata,
        fastqdir = ds$fastqdir,
        filtdir = ds$filtdir,
        rrna_db_dir = rrna_db_dir
    )
    
    rrna_removal # check run status
}
#>          run status
#> 1 SRR1039508     OK

Now that we have performed all quality checks, we’re good to go. ²

Quantification of transcript abundance

Quantification of transcript abundance can be done in two ways:

Alignment-based approaches, which involves mapping reads to a reference genome using STAR (Dobin et al. 2013) and quantifying the expression based on uniquely mapped reads with featureCounts (Liao, Smyth, and Shi 2014) (in raw counts) and/or StringTie (Pertea et al. 2015) (in TPM).
Alignment-free approaches, which involves pseudo-aligning or quasi-mapping reads to a reference transcriptome with kallisto (Bray et al. 2016) or salmon (Patro et al. 2017), respectively.

Below, we will describe how to perform both approaches.

Alignment-based quantification

To start with, we will need to map the reads to a reference genome. In bears, reads are mapped to the reference genome with the software tool STAR.

Read mapping

Here, for the purpose of demonstration, we will map reads to a subset of the human genome. The FASTA and GTF files corresponding to the subset of the genome are available in the extdata/ subdirectory of this package.

Before mapping reads, we need to create a genome index. This can be done with star_genome_index().

# Get paths to genome subset
genome_path <- system.file(
    "extdata", "Homo_sapiens.GRCh37.75_subset.fa", package = "bears"
)
gff_path <- system.file(
    "extdata", "Homo_sapiens.GRCh37.75_subset.gtf", package = "bears"
)

# Create genome index
if(star_is_installed()) {
    genome_idx <- star_genome_index(
        genome_path = genome_path, 
        gff_path = gff_path, 
        mappingdir = ds$mappingdir
    )
    
    genome_idx # check run status
}
#>                                    index_path status
#> 1 /tmp/RtmpGIVQm5/03_read_mapping/genomeIndex     OK

Now that we have the genome index, we can map reads to it.

# Map reads to the genome
if(star_is_installed()) {
    read_mapping <- star_align(
        metadata, 
        filtdir = ds$filtdir,
        qc_table = qc_table,
        mappingdir = ds$mappingdir,
        gff_path = gff_path
    )
    
    read_mapping # check run status
}
#>         sample status
#> 1 SAMN02422669     OK

Finally, let’s get read mapping statistics with the function summary_stats_star(). Here, we will use an example Log.final.out that STAR returns stored in the extdata/* subdirectory of this package.

# Obtaining read mapping statistics
star_dir <- system.file("extdata", package = "bears")

star_stats <- summary_stats_star(star_dir = star_dir)
star_stats
#>         Sample total_reads avg_input_read_length uniquely_mapped
#> 1 SAMN02422669        7097                   126            7088
#>   uniquely_mapped_percent avg_mapped_read_length num_splices
#> 1                   99.87                 125.64        3825
#>   num_annotated_splices num_GTAG_splices num_GCAG_splices num_ATAC_splices
#> 1                  3815             3810               15                0
#>   num_noncanonical_splices mismatch_rate deletion_rate deletion_length
#> 1                        0          0.17             0            1.38
#>   insertion_rate insertion_length multimapped multimapped_percent
#> 1           0.01             1.35           9                0.13
#>   multimapped_toomany multimapped_toomany_percent unmapped_mismatches_percent
#> 1                   0                           0                           0
#>   unmapped_tooshort_percent unmapped_other_percent unmapped_mismatches
#> 1                         0                      0                   0
#>   unmapped_tooshort unmapped_other
#> 1                 0              0

Now, let’s check if samples passed the minimum quality criteria. Here, samples are excluded if:

>=50% of the reads fail to map or;
>=40% of the reads fail to uniquely map.

The function mapping_pass() takes the metadata data frame and returns the same data frame, but only with the samples that passed the minimum criteria.

# Check if samples passed the filtering criterion
align_passed <- mapping_pass(star_stats, metadata)
align_passed # inspect data
#>      BioSample Experiment        Run                           Tissue   Pubmed
#> 1 SAMN02422669  SRX384345 SRR1039508 human airway smooth muscle cells 24926665
#>    BioProject          Instrument Layout Selection_method SRA_sample SRA_study
#> 1 PRJNA229998 Illumina HiSeq 2000 PAIRED             cDNA  SRS508568 SRP033351
#>   Treatment Cultivar
#> 1 Untreated     <NA>
#>                                                                          Study_title
#> 1 Human Airway Smooth Muscle Transcriptome Changes in Response to Asthma Medications
#>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   Study_abstract
#> 1 Rationale: Asthma is a chronic inflammatory airway disease. The most common medications used for its treatment are ß2-agonists and glucocorticosteroids, and one of the primary tissues that these drugs target in the treatment of asthma is the airway smooth muscle. We used RNA-Seq to characterize the human airway smooth muscle (HASM) transcriptome at baseline and under three asthma treatment conditions. Methods: The Illumina TruSeq assay was used to prepare 75bp paired-end libraries for HASM cells from four white male donors under four treatment conditions: 1) no treatment; 2) treatment with a ß2-agonist (i.e. Albuterol, 1µM for 18h); 3) treatment with a glucocorticosteroid (i.e. Dexamethasone (Dex), 1µM for 18h); 4) simultaneous treatment with a ß2-agonist and glucocorticoid, and the libraries were sequenced with an Illumina Hi-Seq 2000 instrument. The Tuxedo Suite Tools were used to align reads to the hg19 reference genome, assemble transcripts, and perform differential expression analysis using the protocol described in https://github.com/blancahimes/taffeta Overall design: mRNA profiles obtained via RNA-Seq for four primary human airway smooth muscle cell lines that were treated with dexamethasone, albuterol, dexamethasone+albuterol or were left untreated.
#>                  Date Origin
#> 1 2014-01-02 09:16:11   <NA>

# Compare to the original data set
nrow(metadata)
#> [1] 1
nrow(align_passed)
#> [1] 1

As you can see, the sample we used for read mapping passed the minimum quality criteria. Good, huh? We can now proceed to the next step.

Inferring library strandedness

Before quantification, we need to infer library strandedness with the RSeQC (Wang, Wang, and Li 2012) tool. The function infer_strandedness() runs RSeQC and returns the metadata data frame with an additional column named Orientation containing library strandedness information.

This function requires the annotation in BED format, not GTF/GFF. To convert from GTF/GFF to BED, use the function gff2bed().

# Convert GFF to BED
bedpath <- gff2bed(gff_path)
bedpath # check path
#> [1] "/__w/_temp/Library/bears/extdata/Homo_sapiens.GRCh37.75_subset.bed"

# Infer strandedness
if(rseqc_is_installed()) {
    new_metadata <- infer_strandedness(
        mapping_passed = align_passed,
        bedpath = bedpath,
        mappingdir = ds$mappingdir
    )
    
    new_metadata
}
#>    BioProject    BioSample Experiment        Run
#> 1 PRJNA229998 SAMN02422669  SRX384345 SRR1039508
#>                             Tissue   Pubmed          Instrument Layout
#> 1 human airway smooth muscle cells 24926665 Illumina HiSeq 2000 PAIRED
#>   Selection_method SRA_sample SRA_study Treatment Cultivar
#> 1             cDNA  SRS508568 SRP033351 Untreated     <NA>
#>                                                                          Study_title
#> 1 Human Airway Smooth Muscle Transcriptome Changes in Response to Asthma Medications
#>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   Study_abstract
#> 1 Rationale: Asthma is a chronic inflammatory airway disease. The most common medications used for its treatment are ß2-agonists and glucocorticosteroids, and one of the primary tissues that these drugs target in the treatment of asthma is the airway smooth muscle. We used RNA-Seq to characterize the human airway smooth muscle (HASM) transcriptome at baseline and under three asthma treatment conditions. Methods: The Illumina TruSeq assay was used to prepare 75bp paired-end libraries for HASM cells from four white male donors under four treatment conditions: 1) no treatment; 2) treatment with a ß2-agonist (i.e. Albuterol, 1µM for 18h); 3) treatment with a glucocorticosteroid (i.e. Dexamethasone (Dex), 1µM for 18h); 4) simultaneous treatment with a ß2-agonist and glucocorticoid, and the libraries were sequenced with an Illumina Hi-Seq 2000 instrument. The Tuxedo Suite Tools were used to align reads to the hg19 reference genome, assemble transcripts, and perform differential expression analysis using the protocol described in https://github.com/blancahimes/taffeta Overall design: mRNA profiles obtained via RNA-Seq for four primary human airway smooth muscle cell lines that were treated with dexamethasone, albuterol, dexamethasone+albuterol or were left untreated.
#>                  Date Origin Orientation
#> 1 2014-01-02 09:16:11   <NA>       first

As you can see, there is a new Orientation column with strandedness info for this sample.

Using featureCounts

Now that we have the .bam files from STAR and information on library strandedness for each sample, we can quantify the expression with featureCounts. This tool quantifies gene expression measured in raw read counts per gene.

To quantify gene expression with featureCounts, use the function feaureCounts(). This function runs featureCounts and returns a gene expression matrix with genes in rows and samples in columns.

# Get gene expression in raw read counts
if(subread_is_installed()) {
    fcounts_quant <- featureCounts(
        new_metadata, 
        mappingdir = ds$mappingdir,
        gff_path = gff_path,
        fcountsdir = ds$fcountsdir
    )
    
    # Explore the expression matrix
    head(fcounts_quant)
}
#> 
#>         ==========     _____ _    _ ____  _____  ______          _____  
#>         =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \ 
#>           =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |
#>             ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |
#>               ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
#>         ==========   |_____/ \____/|____/|_|  \_\______/_/    \_\_____/
#>        Rsubread 2.10.5
#> 
#> //========================== featureCounts setting ===========================\\
#> ||                                                                            ||
#> ||             Input files : 1 BAM file                                       ||
#> ||                                                                            ||
#> ||                           SAMN02422669Aligned.sortedByCoord.out.bam        ||
#> ||                                                                            ||
#> ||              Paired-end : yes                                              ||
#> ||        Count read pairs : yes                                              ||
#> ||              Annotation : Homo_sapiens.GRCh37.75_subset.gtf (GTF)          ||
#> ||      Dir for temp files : .                                                ||
#> ||                 Threads : 2                                                ||
#> ||                   Level : meta-feature level                               ||
#> ||      Multimapping reads : counted                                          ||
#> || Multi-overlapping reads : not counted                                      ||
#> ||   Min overlapping bases : 1                                                ||
#> ||                                                                            ||
#> \\============================================================================//
#> 
#> //================================= Running ==================================\\
#> ||                                                                            ||
#> || Load annotation file Homo_sapiens.GRCh37.75_subset.gtf ...                 ||
#> ||    Features : 20                                                           ||
#> ||    Meta-features : 2                                                       ||
#> ||    Chromosomes/contigs : 1                                                 ||
#> ||                                                                            ||
#> || Process BAM file SAMN02422669Aligned.sortedByCoord.out.bam...              ||
#> ||    Strand specific : reversely stranded                                    ||
#> ||    Paired-end reads are included.                                          ||
#> ||    Total alignments : 7147                                                 ||
#> ||    Successfully assigned alignments : 14 (0.2%)                            ||
#> ||    Running time : 0.00 minutes                                             ||
#> ||                                                                            ||
#> || Write the final count table.                                               ||
#> || Write the read assignment summary.                                         ||
#> ||                                                                            ||
#> \\============================================================================//
#>                 SAMN02422669
#> ENSG00000171819            2
#> ENSG00000120942           12

Whenever you are working with gene expression data, we recommend storing your data as SummarizedExperiment objects, so you have the expression matrix and sample metadata in a single object. If you are not familiar with SummarizedExperiment objects, take a look at the documentation of the SummarizedExperiment package.

To get a SummarizedExperiment object from featureCounts, use the function featureCounts2se().

# Create a SummarizedExperiment object from featureCounts output
fcounts_se <- featureCounts2se(
    new_metadata, fc_output = fcounts_quant
)

# Take a look at the SummarizedExperiment object
fcounts_se
#> class: SummarizedExperiment 
#> dim: 2 1 
#> metadata(0):
#> assays(1): fcounts
#> rownames(2): ENSG00000171819 ENSG00000120942
#> rowData names(0):
#> colnames(1): SAMN02422669
#> colData names(17): BioProject Experiment ... Origin Orientation

# Exploring sample metadata
SummarizedExperiment::colData(fcounts_se)
#> DataFrame with 1 row and 17 columns
#>               BioProject  Experiment         Run                 Tissue
#>              <character> <character> <character>            <character>
#> SAMN02422669 PRJNA229998   SRX384345  SRR1039508 human airway smooth ..
#>                   Pubmed          Instrument      Layout Selection_method
#>              <character>         <character> <character>      <character>
#> SAMN02422669    24926665 Illumina HiSeq 2000      PAIRED             cDNA
#>               SRA_sample   SRA_study   Treatment    Cultivar
#>              <character> <character> <character> <character>
#> SAMN02422669   SRS508568   SRP033351   Untreated          NA
#>                         Study_title         Study_abstract                Date
#>                         <character>            <character>         <character>
#> SAMN02422669 Human Airway Smooth .. Rationale: Asthma is.. 2014-01-02 09:16:11
#>                   Origin Orientation
#>              <character> <character>
#> SAMN02422669          NA       first

# Exploring gene expression matrix
SummarizedExperiment::assay(fcounts_se)
#>                 SAMN02422669
#> ENSG00000171819            2
#> ENSG00000120942           12

Using StringTie

StringTie quantifies transcript-level and gene-level transcript abundances in normalized values (transcripts per million, TPM).

To obtain gene expression levels in TPM with StringTie, use the function stringtie_quantify().

# Quantify expression in TPM with StringTie
if(stringtie_is_installed()) {
    stringtie_quant <- stringtie_quantify(
        new_metadata,
        qc_table = qc_table,
        mappingdir = ds$mappingdir,
        gff_path = gff_path,
        stringtiedir = ds$stringtiedir
    )
    
    stringtie_quant # check run status
}
#>         sample status
#> 1 SAMN02422669     OK

Now, let’s read the output from StringTie as a SummarizedExperiment object with the function stringtie2se. You can choose if you want the expression at the gene level, at the transcript level, or both. Here, let’s get the gene-level expression. For that, you will need to give a 2-column data frame with transcript IDs and their corresponding genes.

# Load transcript-to-gene correspondence
data(tx2gene)
head(tx2gene)
#>            TXNAME          GENEID
#> 1 ENST00000473118 ENSG00000120948
#> 2 ENST00000240185 ENSG00000120948
#> 3 ENST00000476201 ENSG00000120948
#> 4 ENST00000472476 ENSG00000120948
#> 5 ENST00000439080 ENSG00000120948
#> 6 ENST00000315091 ENSG00000120948

# Read StringTie output as a SummarizedExperiment object
stringtiese <- stringtie2se(
    new_metadata,
    stringtiedir = ds$stringtiedir,
    level = "gene",
    tx2gene = tx2gene
)
#> reading in files with read.delim (install 'readr' package for speed up)
#> 1 
#> summarizing abundance
#> summarizing counts
#> summarizing length

# Exploring the SummarizedExperiment object
stringtiese
#> class: SummarizedExperiment 
#> dim: 2 1 
#> metadata(0):
#> assays(1): gene_TPM
#> rownames(2): ENSG00000120942 ENSG00000171819
#> rowData names(0):
#> colnames(1): SAMN02422669
#> colData names(17): BioProject Experiment ... Origin Orientation

# Looking at gene expression matrix
SummarizedExperiment::assay(stringtiese, "gene_TPM")
#>                 SAMN02422669
#> ENSG00000120942     207072.5
#> ENSG00000171819          0.0

Bonus: Transcript assembly and merging

Besides quantifying transcript abundance, StringTie can also be used to assemble transcripts for each BioSample. Assembled transcripts for each BioSample are represented as .gtf files.

However, if you want a single .gtf file with the assembled transcripts for all BioSamples you are studying, you can merge the individual .gtf files from StringTie with the software tool TACO (Niknafs et al. 2017). Below, we will demonstrate how that can be achieved.

# Transcript assembly with StringTie
if(stringtie_is_installed()) {
    assembled_transcripts <- stringtie_assemble(
        new_metadata,
        qc_table = qc_table,
        mappingdir = ds$mappingdir,
        gff_path = gff_path,
        stringtiedir = ds$stringtiedir
    )
    
    assembled_transcripts # check run status
}
#>         sample status
#> 1 SAMN02422669     OK

In this vignette, we have a single BioSample. However, in real-life scenarios, you would have several samples. To merge the .gtf files for each sample in a single .gtf file, use the function taco_merge().

# Merge assembled transcripts with TACO
if(taco_is_installed()) {
    merged_transcripts <- taco_merge(
        new_metadata,
        stringtiedir = ds$stringtiedir
    )
    
    merged_transcripts # check run status
}
#>                 file status
#> 1 final_assembly.gtf     OK

The merged transcript assembly will be stored in a file named final_assembly.gtf in the subdirectory assembly/merged_assembly, inside stringtiedir. To get the path to the .gtf file, use:

# Get path to merged transcript assembly
final_assembly <- file.path(
    ds$stringtiedir, "assembly", "merged_assembly", "final_assembly.gtf"
)

final_assembly
#> [1] "/tmp/RtmpGIVQm5/04_quantification/stringtie/assembly/merged_assembly/final_assembly.gtf"

But why would I want to assemble transcripts and merge them if I already have a .gtf file the transcript annotations?

That’s a great question! This is a way to identify novel transcripts that are not present in your reference .gtf file. Some transcripts can be missing in the reference annotation (.gtf file) mainly because: i. genome assembly does not have a good quality, so these transcripts could not be predicted. ii. false-positives from the transcript annotation software tool that was used.

If you want to have a more comprehensive transcript abundance quantification, you can assemble transcripts for each sample, merge them, and input the output file final_assembly.gtf to the quantification functions. This way, instead of using the reference transcript annotation, you will use your own transcript annotation, which may contain novel transcripts.

Alignment-free quantification

To quantify the expression without mapping reads to the genome, you have two options:

Using kallisto, which quantifies transcript abundance based on pseudoalignments.
Using salmon, which quantifies transcript abundance based on quasi-mapping.

For both kallisto and salmon, you will need to have a reference transcriptome, not a reference genome. This is a FASTA file containing the sequences of all annotated transcripts in your genome. You can easily create this file with the function extractTranscriptSeqs() from the GenomicFeatures package.

Using salmon

First of all, we will need to index the reference transcriptome with the function salmon_index().

# Path to reference transcriptome
transcriptome_path <- system.file(
    "extdata", "Homo_sapiens.GRCh37.75_subset_transcripts.fa.gz",
    package = "bears"
)

# Index the transcriptome
if(salmon_is_installed()) {
    idx_salmon <- salmon_index(
        salmonindex = ds$salmonindex,
        transcriptome_path = transcriptome_path
    )
    
    idx_salmon # check run status
}
#>                                     index_path status
#> 1 /tmp/RtmpGIVQm5/04_quantification/salmon/idx     OK

Now, we can quantify transcript abundance with salmon_quantify().

# Quantify transcript abundance
if(salmon_is_installed()) {
    quant_salmon <- salmon_quantify(
        new_metadata,
        filtdir = ds$filtdir,
        salmonindex = ds$salmonindex,
        salmondir = ds$salmondir
    )
    
    quant_salmon # check run status
}
#>         sample status
#> 1 SAMN02422669     OK

After running salmon_quantify(), salmon output in .sf format will be stored in the directory specified in salmondir.

To read salmon output as a SummarizedExperiment object, use the function salmon2se(). You can choose if you want the expression at the gene level, at the transcript level, or both. Here, let’s get the gene-level expression. For that, you will need to give a 2-column data frame with transcript IDs and their corresponding genes.

# Load transcript-to-gene data frame
data(tx2gene)
head(tx2gene)
#>            TXNAME          GENEID
#> 1 ENST00000473118 ENSG00000120948
#> 2 ENST00000240185 ENSG00000120948
#> 3 ENST00000476201 ENSG00000120948
#> 4 ENST00000472476 ENSG00000120948
#> 5 ENST00000439080 ENSG00000120948
#> 6 ENST00000315091 ENSG00000120948

# Read salmon output as a SummarizedExperiment object
salmon_se <- salmon2se(
    new_metadata, 
    level = "gene", 
    salmondir = ds$salmondir,
    tx2gene
)
#> reading in files with read.delim (install 'readr' package for speed up)
#> 1 
#> summarizing abundance
#> summarizing counts
#> summarizing length

# Exploring the output
salmon_se
#> class: SummarizedExperiment 
#> dim: 2 1 
#> metadata(0):
#> assays(2): gene_TPM gene_counts
#> rownames(2): ENSG00000120942 ENSG00000171819
#> rowData names(0):
#> colnames(1): SAMN02422669
#> colData names(17): BioProject Experiment ... Origin Orientation

# Get gene expression matrix in TPM
SummarizedExperiment::assay(salmon_se, "gene_TPM")
#>                 SAMN02422669
#> ENSG00000120942    950095.05
#> ENSG00000171819     49904.95

# Get gene expression matrix as raw counts
SummarizedExperiment::assay(salmon_se, "gene_counts")
#>                 SAMN02422669
#> ENSG00000120942          113
#> ENSG00000171819            2

Using kallisto

Like we do in salmon, we will start by indexing the transcriptome.

# Index the transcriptome
if(kallisto_is_installed()) {
    idx_kallisto <- kallisto_index(
        kallistoindex = ds$kallistoindex,
        transcriptome_path = transcriptome_path
    )
    
    idx_kallisto # check run status
}
#>                                                       index_path status
#> 1 /tmp/RtmpGIVQm5/04_quantification/kallisto/idx/transcripts.idx     OK

Now, we can quantify the transcript abundance.

# Quantify transcript abundance
if(kallisto_is_installed()) {
    quant_kallisto <- kallisto_quantify(
        new_metadata,
        qc_table, 
        filtdir = ds$filtdir,
        kallistoindex = ds$kallistoindex,
        kallistodir = ds$kallistodir
    )
    
    quant_kallisto # check run status
}
#>         sample status
#> 1 SAMN02422669     OK

To read kallisto output in a SummarizedExperiment object, use the function kallisto2se(). Again, you will have specify if you want the expressiona at the gene level, transcript level, or both. Let’s get the gene expression here.

# Read kallisto output to SummarizedExperiment object
kallisto_se <- kallisto2se(
    new_metadata, 
    level = "gene", 
    kallistodir = ds$kallistodir,
    tx2gene
)
#> Note: importing `abundance.h5` is typically faster than `abundance.tsv`
#> reading in files with read.delim (install 'readr' package for speed up)
#> 1 
#> summarizing abundance
#> summarizing counts
#> summarizing length
#> summarizing inferential replicates

# Exploring the output
kallisto_se
#> class: SummarizedExperiment 
#> dim: 2 1 
#> metadata(0):
#> assays(2): gene_TPM gene_counts
#> rownames(2): ENSG00000120942 ENSG00000171819
#> rowData names(0):
#> colnames(1): SAMN02422669
#> colData names(17): BioProject Experiment ... Origin Orientation

# Get gene expression matrix in TPM
SummarizedExperiment::assay(kallisto_se, "gene_TPM")
#>                 SAMN02422669
#> ENSG00000120942     988556.3
#> ENSG00000171819      11443.4

# Get gene expression matrix as raw counts
SummarizedExperiment::assay(kallisto_se, "gene_counts")
#>                 SAMN02422669
#> ENSG00000120942   127.952080
#> ENSG00000171819     2.047949

Closing remarks

If you are using bears, there are two things you must keep in mind. First, this package was designed to be as complete as possible, which means you don’t need to run the complete pipeline for your own project. For instance, if you just want gene expression values in TPM for a particular BioProject or set of BioProjects, you can simply go through the salmon path of the pipeline, skipping the read mapping section. Likewise, if you are using bears for your own data set and you have already cleaned the reads, you can skip the sequence quality checks and read filtering sections. The second thing to consider is that bears is a work in progress. Bioinformatics is a fast-evolving field, and new (and better) methods to address a particular question are developed continuously. Hence, we aim to keep bears up to date with state-of-the-art methods.

FAQ

How do I manage Conda environments from the R session?

You only need to do the following 2 things:

Install miniconda. After installing miniconda in your machine, create an R object containing the path to your miniconda installation. For example:

miniconda_path <- "~/tools/miniconda"

Create a different environment for each external tool. You can use the Herper package to create an environment for each tool from .yml files stored in the extdata/ subdirectory of this package. To avoid conflicts, it is important to keep each tool in its own environment. Here, each tool will be installed in an environment named <tool-name-lowercase>_env (e.g., star_env, rseqc_env, salmon_env).

library(Herper)

# Path to .yml files to create environments
envs <- list.files(
    system.file("extdata", package = "bears"), pattern = ".yml",
    full.names = TRUE
)

# Install miniconda in `my_miniconda` and create envs
create_envs <- sapply(envs, function(x) {
    import_CondaEnv(x, pathToMiniConda = miniconda_path)
})

Now, you can run functions that call external tools inside a call to Herper::withCondaEnv(). This function allows you to run an R function inside a particular Conda environment, which you need to specify. For example, to run the function salmon_quantify() inside the environment salmon_env, you would do:

withCondaEnv(
    "salmon_env", 
    quant_salmon <- salmon_quantify(
        new_metadata,
        filtdir = ds$filtdir,
        salmonindex = ds$salmonindex,
        salmondir = ds$salmondir
    ),
    pathToMiniconda = miniconda_path
)

Can I use this package with my own RNA-seq data (not from a database)?

You surely can. For that, you will first need to create a metadata data frame for your samples that looks like the data frame created by create_sample_info() (see the example data set sample_info). While you can add as many columns as you want, 5 columns MUST be present:

BioSample: BioSample ID. These can be fictional sample names, e.g., “Sample1”, “Sample2”, “Sample3”, etc.
Run: Run accession. These must be the basename of your files. Make sure you don’t repeat run accessions for paired-end reads, i.e., files example_1.fastq.gz and example_2.fastq.gz should both be represented by the same run accession. Examples of file names and the run accessions they should have:

Layout	File name	Run accession
PAIRED	SampleA_1.fastq.gz and SampleA_2.fastq.gz	SampleA
SINGLE	Sample1_control_replicate1.fastq.gz	Sample1_control_replicate1

BioProject: Project ID. If all files come from a single project, you can give it whatever name you want, such as MyNicePhDProject.
Instrument: Sequencing instrument (e.g., Illumina HiSeq 2500). Note that long-read technologies (e.g., PacBio) are not supported and, hence, will be skipped.
Layout: Sequencing protocol. Either “PAIRED” or “SINGLE”.

Finally, if you want to create a standard directory structure with create_dir_structure(), you will have to either move your files to the directory indicated in ds$fastqdir or change ds$fastqdir manually to include the path to the directory where your FASTQ files are. For example:

ds$fastqdir <- "/home/myusername/my_nice_project/fastq_files"

Session information

This document was created under the following sections:

sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.2.1 (2022-06-23)
#>  os       Ubuntu 20.04.4 LTS
#>  system   x86_64, linux-gnu
#>  ui       X11
#>  language en
#>  collate  en_US.UTF-8
#>  ctype    en_US.UTF-8
#>  tz       UTC
#>  date     2023-04-04
#>  pandoc   2.18 @ /usr/local/bin/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package              * version   date (UTC) lib source
#>  bears                * 0.99.0    2023-04-04 [1] Bioconductor
#>  Biobase                2.56.0    2022-04-26 [1] Bioconductor
#>  BiocGenerics           0.42.0    2022-04-26 [1] Bioconductor
#>  BiocIO                 1.6.0     2022-04-26 [1] Bioconductor
#>  BiocManager            1.30.20   2023-02-24 [1] RSPM (R 4.2.0)
#>  BiocParallel           1.30.4    2022-10-11 [1] Bioconductor
#>  BiocStyle            * 2.24.0    2022-04-26 [1] Bioconductor
#>  Biostrings             2.64.1    2022-08-18 [1] Bioconductor
#>  bitops                 1.0-7     2021-04-24 [1] CRAN (R 4.2.0)
#>  bookdown               0.33      2023-03-06 [1] RSPM (R 4.2.0)
#>  bslib                  0.4.2     2022-12-16 [2] RSPM (R 4.2.0)
#>  cachem                 1.0.7     2023-02-24 [2] RSPM (R 4.2.0)
#>  cli                    3.6.1     2023-03-23 [2] RSPM (R 4.2.0)
#>  codetools              0.2-19    2023-02-01 [3] RSPM (R 4.2.0)
#>  crayon                 1.5.2     2022-09-29 [2] RSPM (R 4.2.0)
#>  curl                   5.0.0     2023-01-12 [2] RSPM (R 4.2.0)
#>  DelayedArray           0.22.0    2022-04-26 [1] Bioconductor
#>  desc                   1.4.2     2022-09-08 [2] RSPM (R 4.2.0)
#>  digest                 0.6.31    2022-12-11 [2] RSPM (R 4.2.0)
#>  downloader             0.4       2015-07-09 [1] RSPM (R 4.2.0)
#>  evaluate               0.20      2023-01-17 [2] RSPM (R 4.2.0)
#>  fansi                  1.0.4     2023-01-22 [2] RSPM (R 4.2.0)
#>  fastmap                1.1.1     2023-02-24 [2] RSPM (R 4.2.0)
#>  fs                     1.6.1     2023-02-06 [2] RSPM (R 4.2.0)
#>  GenomeInfoDb           1.32.4    2022-09-06 [1] Bioconductor
#>  GenomeInfoDbData       1.2.8     2022-05-02 [1] Bioconductor
#>  GenomicAlignments      1.32.1    2022-07-24 [1] Bioconductor
#>  GenomicRanges          1.48.0    2022-04-26 [1] Bioconductor
#>  glue                   1.6.2     2022-02-24 [2] CRAN (R 4.2.0)
#>  htmltools              0.5.5     2023-03-23 [2] RSPM (R 4.2.0)
#>  httr                   1.4.5     2023-02-24 [2] RSPM (R 4.2.0)
#>  IRanges                2.30.1    2022-08-18 [1] Bioconductor
#>  jquerylib              0.1.4     2021-04-26 [2] CRAN (R 4.2.0)
#>  jsonlite               1.8.4     2022-12-06 [1] RSPM (R 4.2.0)
#>  knitr                  1.42      2023-01-25 [2] RSPM (R 4.2.0)
#>  lattice                0.20-45   2021-09-22 [3] CRAN (R 4.2.1)
#>  lifecycle              1.0.3     2022-10-07 [2] RSPM (R 4.2.0)
#>  magrittr               2.0.3     2022-03-30 [2] CRAN (R 4.2.0)
#>  Matrix                 1.5-3     2022-11-11 [3] RSPM (R 4.2.0)
#>  MatrixGenerics         1.8.1     2022-06-26 [1] Bioconductor
#>  matrixStats            0.63.0    2022-11-18 [1] RSPM (R 4.2.0)
#>  memoise                2.0.1     2021-11-26 [2] CRAN (R 4.2.0)
#>  pillar                 1.9.0     2023-03-22 [2] RSPM (R 4.2.0)
#>  pkgconfig              2.0.3     2019-09-22 [2] CRAN (R 4.2.0)
#>  pkgdown                2.0.7     2022-12-14 [2] RSPM (R 4.2.0)
#>  purrr                  1.0.1     2023-01-10 [2] RSPM (R 4.2.0)
#>  R6                     2.5.1     2021-08-19 [2] CRAN (R 4.2.0)
#>  ragg                   1.2.5     2023-01-12 [2] RSPM (R 4.2.0)
#>  RCurl                  1.98-1.12 2023-03-27 [1] RSPM (R 4.2.0)
#>  rentrez                1.2.3     2020-11-10 [1] RSPM (R 4.2.0)
#>  restfulr               0.0.15    2022-06-16 [1] RSPM (R 4.2.0)
#>  rhdf5                  2.40.0    2022-04-26 [1] Bioconductor
#>  rhdf5filters           1.8.0     2022-04-26 [1] Bioconductor
#>  Rhdf5lib               1.18.2    2022-05-15 [1] Bioconductor
#>  rjson                  0.2.21    2022-01-09 [1] RSPM (R 4.2.0)
#>  rlang                  1.1.0     2023-03-14 [2] RSPM (R 4.2.0)
#>  rmarkdown              2.21      2023-03-26 [2] RSPM (R 4.2.0)
#>  rprojroot              2.0.3     2022-04-02 [2] CRAN (R 4.2.0)
#>  Rsamtools              2.12.0    2022-04-26 [1] Bioconductor
#>  Rsubread               2.10.5    2022-08-09 [1] Bioconductor
#>  rtracklayer            1.56.1    2022-06-23 [1] Bioconductor
#>  S4Vectors              0.34.0    2022-04-26 [1] Bioconductor
#>  sass                   0.4.5     2023-01-24 [2] RSPM (R 4.2.0)
#>  sessioninfo            1.2.2     2021-12-06 [2] RSPM (R 4.2.0)
#>  stringi                1.7.12    2023-01-11 [2] RSPM (R 4.2.0)
#>  stringr                1.5.0     2022-12-02 [2] RSPM (R 4.2.0)
#>  SummarizedExperiment   1.26.1    2022-04-29 [1] Bioconductor
#>  systemfonts            1.0.4     2022-02-11 [2] RSPM (R 4.2.0)
#>  textshaping            0.3.6     2021-10-13 [2] RSPM (R 4.2.0)
#>  tibble                 3.2.1     2023-03-20 [2] RSPM (R 4.2.0)
#>  tximport               1.24.0    2022-04-26 [1] Bioconductor
#>  utf8                   1.2.3     2023-01-31 [2] RSPM (R 4.2.0)
#>  vctrs                  0.6.1     2023-03-22 [2] RSPM (R 4.2.0)
#>  xfun                   0.38      2023-03-24 [2] RSPM (R 4.2.0)
#>  XML                    3.99-0.14 2023-03-19 [1] RSPM (R 4.2.0)
#>  XVector                0.36.0    2022-04-26 [1] Bioconductor
#>  yaml                   2.3.7     2023-01-23 [2] RSPM (R 4.2.0)
#>  zlibbioc               1.42.0    2022-04-26 [1] Bioconductor
#> 
#>  [1] /__w/_temp/Library
#>  [2] /usr/local/lib/R/site-library
#>  [3] /usr/local/lib/R/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────

References

Bray, Nicolas L, Harold Pimentel, Páll Melsted, and Lior Pachter. 2016. “Near-Optimal Probabilistic RNA-Seq Quantification.” Nature Biotechnology 34 (5): 525–27.

Chen, Shifu, Yanqing Zhou, Yaru Chen, and Jia Gu. 2018. “Fastp: An Ultra-Fast All-in-One FASTQ Preprocessor.” Bioinformatics 34 (17): i884–90.

Dobin, Alexander, Carrie A Davis, Felix Schlesinger, Jorg Drenkow, Chris Zaleski, Sonali Jha, Philippe Batut, Mark Chaisson, and Thomas R Gingeras. 2013. “STAR: Ultrafast Universal RNA-Seq Aligner.” Bioinformatics 29 (1): 15–21.

Kopylova, Evguenia, Laurent Noé, and Hélène Touzet. 2012. “SortMeRNA: Fast and Accurate Filtering of Ribosomal RNAs in Metatranscriptomic Data.” Bioinformatics 28 (24): 3211–17.

Liao, Yang, Gordon K Smyth, and Wei Shi. 2014. “featureCounts: An Efficient General Purpose Program for Assigning Sequence Reads to Genomic Features.” Bioinformatics 30 (7): 923–30.

Niknafs, Yashar S, Balaji Pandian, Hariharan K Iyer, Arul M Chinnaiyan, and Matthew K Iyer. 2017. “TACO Produces Robust Multisample Transcriptome Assemblies from RNA-Seq.” Nature Methods 14 (1): 68–70.

Patro, Rob, Geet Duggal, Michael I Love, Rafael A Irizarry, and Carl Kingsford. 2017. “Salmon Provides Fast and Bias-Aware Quantification of Transcript Expression.” Nature Methods 14 (4): 417–19.

Pertea, Mihaela, Geo M Pertea, Corina M Antonescu, Tsung-Cheng Chang, Joshua T Mendell, and Steven L Salzberg. 2015. “StringTie Enables Improved Reconstruction of a Transcriptome from RNA-Seq Reads.” Nature Biotechnology 33 (3): 290–95.

Wang, Liguo, Shengqin Wang, and Wei Li. 2012. “RSeQC: Quality Control of RNA-Seq Experiments.” Bioinformatics 28 (16): 2184–85.

Fabricio Almeida-Silva

Thiago Motta Venancio

2023-04-04