bears: building expression atlases in R
Fabricio Almeida-Silva
Universidade Estadual do Norte Fluminense Darcy Ribeiro, RJ, BrazilThiago Motta Venancio
Universidade Estadual do Norte Fluminense Darcy Ribeiro, RJ, Brazil2023-04-04
Source:vignettes/bears_vignette.Rmd
bears_vignette.Rmd
Introduction
In the past decades, there has been an exponential accumulation of RNA-seq data in public repositories. This steep increase paved the way to the creation of gene expression atlases, which consist in comprehensive collections of expression data from public databases, analyzed with a single pipeline for consistency and cross-project comparison. bears is a package that allows you to create your own gene expression atlas for a given species using public data. The package features:
- Data download from NCBI’s Sequence Read Archive (SRA) or the European Nucleotide Archive (ENA).
- Sequence quality check and trimming of low-quality sequences and adapters.
- Removal of rRNA, which are typical problems of libraries that were prepared with the rRNA depletion protocol.
- Read mapping against a reference genome.
- Transcriptome assembly.
- Quantification of gene- and transcript-level transcript abundance with alignment-based and alignment-free methods.
Installation and setup
To install bears, use the following code:
remotes::install_github("almeidasilvaf/bears")
Then, create a standard directory structure with
create_dir_structure()
to store your results. This is
optional, but it will make your life much easier. The
output is a list of paths to common directories that you will need to
specify in several functions of this package.
library(bears)
# Create directory structure using a temporary directory as root
ds <- create_dir_structure(rootdir = tempdir())
# Look at the output
ds
#> $fastqdir
#> [1] "/tmp/RtmpGIVQm5/01_FASTQ_files"
#>
#> $qcdir
#> [1] "/tmp/RtmpGIVQm5/QC_dir"
#>
#> $filtdir
#> [1] "/tmp/RtmpGIVQm5/02_filtered_FASTQ"
#>
#> $mappingdir
#> [1] "/tmp/RtmpGIVQm5/03_read_mapping"
#>
#> $salmonindex
#> [1] "/tmp/RtmpGIVQm5/04_quantification/salmon/idx"
#>
#> $salmondir
#> [1] "/tmp/RtmpGIVQm5/04_quantification/salmon"
#>
#> $kallistoindex
#> [1] "/tmp/RtmpGIVQm5/04_quantification/kallisto/idx"
#>
#> $kallistodir
#> [1] "/tmp/RtmpGIVQm5/04_quantification/kallisto"
#>
#> $fcountsdir
#> [1] "/tmp/RtmpGIVQm5/04_quantification/featureCounts"
#>
#> $stringtiedir
#> [1] "/tmp/RtmpGIVQm5/04_quantification/stringtie"
To run the full pipeline implemented in bears, you will need to have some external software tools installed in your machine. The names of the tools are listed below. In the Functions column, you can see the names of functions in bears that require each tool.
Software | Version | Functions |
---|---|---|
fastp | >=0.22.0 | trim_reads() |
kallisto | >=0.11.9 | kallisto_index(), kallisto_quantify() |
RSeQC | >=4.0.0 | infer_strandedness() |
salmon | >=1.8.0 | salmon_index(), salmon_quantify() |
SortMeRNA | >=4.3.4 | remove_rrna() |
STAR | >=2.7.10a | star_genome_index(), star_align() |
StringTie | >=2.2.1 | stringtie_assemble(), stringtie_quantify() |
Subread | >=2.0.1 | fcount() |
TACO | >=0.7.3 | taco_merge() |
To make your life easier, we have created .yml files with Conda environments containing each of these external tools. You can create an environment for each tool and manage Conda environments from the R session with the Bioconductor package Herper (see the FAQ section for details).
Alternatively, you can see here the code we used to install all external dependencies in the Ubuntu 20.04 virtual machine provided by GitHub Actions, which is how this package is tested and how this document was created.
As a sanity check, let’s see if all external dependencies are installed.
# Test installation of external dependencies
fastp_is_installed()
#> [1] TRUE
star_is_installed()
#> [1] TRUE
sortmerna_is_installed()
#> [1] TRUE
rseqc_is_installed()
#> [1] TRUE
salmon_is_installed()
#> [1] TRUE
kallisto_is_installed()
#> [1] TRUE
subread_is_installed()
#> [1] TRUE
stringtie_is_installed()
#> [1] TRUE
taco_is_installed()
#> [1] TRUE
Retrieving sample metadata
First of all, you need to choose which samples you want to download
and create a metadata data frame for your samples. To
create this data frame, you will pass a search term to the function
create_sample_info()
. The search term has the same syntax
of the SRA search term syntax. For example, you can search by:
- BioSample accession - SAMN08903403[BSPL]
- BioProject accession - PRJNA229998[GPRJ]
- All RNA-seq samples for an organism - Glycine max[ORGN] AND RNA-seq[STRA]
- And so on…
Let’s create a metadata data frame for a human RNA-seq sample that is included in the airway Bioconductor package.
# Create metadata data frame
term <- "SAMN02422669[BSPL]"
metadata <- create_sample_info(term)
metadata
#> BioSample Experiment Run Tissue Pubmed
#> 1 SAMN02422669 SRX384345 SRR1039508 human airway smooth muscle cells 24926665
#> BioProject Instrument Layout Selection_method SRA_sample SRA_study
#> 1 PRJNA229998 Illumina HiSeq 2000 PAIRED cDNA SRS508568 SRP033351
#> Treatment Cultivar
#> 1 Untreated <NA>
#> Study_title
#> 1 Human Airway Smooth Muscle Transcriptome Changes in Response to Asthma Medications
#> Study_abstract
#> 1 Rationale: Asthma is a chronic inflammatory airway disease. The most common medications used for its treatment are ß2-agonists and glucocorticosteroids, and one of the primary tissues that these drugs target in the treatment of asthma is the airway smooth muscle. We used RNA-Seq to characterize the human airway smooth muscle (HASM) transcriptome at baseline and under three asthma treatment conditions. Methods: The Illumina TruSeq assay was used to prepare 75bp paired-end libraries for HASM cells from four white male donors under four treatment conditions: 1) no treatment; 2) treatment with a ß2-agonist (i.e. Albuterol, 1µM for 18h); 3) treatment with a glucocorticosteroid (i.e. Dexamethasone (Dex), 1µM for 18h); 4) simultaneous treatment with a ß2-agonist and glucocorticoid, and the libraries were sequenced with an Illumina Hi-Seq 2000 instrument. The Tuxedo Suite Tools were used to align reads to the hg19 reference genome, assemble transcripts, and perform differential expression analysis using the protocol described in https://github.com/blancahimes/taffeta Overall design: mRNA profiles obtained via RNA-Seq for four primary human airway smooth muscle cell lines that were treated with dexamethasone, albuterol, dexamethasone+albuterol or were left untreated.
#> Date Origin
#> 1 2014-01-02 09:16:11 <NA>
Downloading FASTQ files
To download the .fastq files from ENA, you will use the function
download_from_ena()
. As input, you only need to give the
metadata data frame and the path the output directory where .fastq files
will be stored.
IMPORTANT: Note below that you must change the timeout limit, or your downloads will be interrupted after the default timeout limit of 60 seconds. You will probably need more than 60 seconds to download some samples, especially if your internet connection is not very good.
Here’s an example of how to run download_from_ena()
:
# Change default timeout limit
options(timeout = 6000)
# Download sample to temporary directory
download <- download_from_ena(metadata, fastqdir = ds$fastqdir)
For running time issues, we will simply copy the example FASTQ files
from the extdata subdirectory of this package to
ds$fastqdir
. These example files contain a subset of the
SAMN02422669 BioSample we mentioned before.
# For running time issues, copy example FASTQ files to ds$fastqdir
f <- list.files(
system.file("extdata", package = "bears"),
pattern = ".fastq.gz",
full.names = TRUE
)
copy <- file.copy(f, ds$fastqdir)
After downloading FASTQ files, it’s important to check the integrity
of your files. The function check_md5()
does that for you
by checking if the md5 checksum of the downloaded file matches the md5
checksum of the original file.
check_md5(run_accessions = metadata$Run, fastqdir = ds$fastqdir)
#> Run Status
#> 1 SRR1039508 FALSE
Here, expectedly, the function reported an issue. As the example
files we are using contain only a subset of the
SAMN02422669 BioSample, the md5 checksums are
different. If the md5 checksums were the same, the Status
variable would be TRUE.
Read QC and filtering
After downloading the .fastq files, you need to perform some filtering steps to remove low quality bases, adapters, and rRNA. Read filtering consists in 2 steps:
trim_reads()
- trim adapters and low quality bases (if any). This function runs fastp (Chen et al. 2018) and saves filtered .fastq files in a directory namedfiltdir
.remove_rrna()
- remove rRNA (if any) from .fastq files. rRNA removal relies on the SortMeRNA (Kopylova, Noé, and Touzet 2012) program.
Trimming adapters and low quality bases
First, let’s use trim_reads()
to filter reads. 1
if(fastp_is_installed()) {
# Trim reads
fastp_status <- trim_reads(
metadata,
fastqdir = ds$fastqdir,
filtdir = ds$filtdir,
qcdir = ds$qcdir
)
fastp_status # check run status
}
#> run status
#> 1 SRR1039508 OK
The function trim_reads()
stores .json files containing
fastp summary statistics for each sample in the directory specified in
qcdir
. You can read it and parse it into a data frame with
the function summary_stats_fastp()
. Let’s demonstrate how
it works.
# Path to directory containing .json file from fastp
qcdir <- system.file("extdata", package = "bears")
qc_table <- summary_stats_fastp(qcdir)
qc_table
#> Sample sequencing before_nreads before_nbases
#> 1 SRR1039508 paired end (63 cycles + 63 cycles) 14194 894222
#> before_q20bases before_q30bases before_q20rate before_q30rate
#> 1 882850 866181 0.987283 0.968642
#> before_GCcontent before_meanlength after_nreads after_nbases after_q20bases
#> 1 0.486978 63 14076 886742 878073
#> after_q30bases after_q20rate after_q30rate after_GCcontent after_meanlength
#> 1 862056 0.990224 0.972161 0.486906 62
#> filter_n_passed filter_n_lowquality filter_n_too_many_N filter_n_tooshort
#> 1 14076 84 34 0
#> filter_n_toolong duplication_rate
#> 1 0 0.00986332
Removing rRNA
In this vignette, we will use a small rRNA database as an example. In real-world applications, your rRNA database directory should contain all FASTA files distributed in the SortMeRNA GitHub repo. However, if you think some of these files (e.g., 5s and 5.8s rRNA) are not a concern in your data set, you don’t need to include them in the database.
# Create a directory to store the rRNA db
rrna_db_dir <- file.path(tempdir(), "rrna")
dir.create(rrna_db_dir)
# Copy the example 16S rRNA file to the db directory.
rrna_file <- system.file("extdata", "bac_16s_subset.fa", package = "bears")
copy <- file.copy(from = rrna_file, to = rrna_db_dir)
# Remove rRNA (if any)
if(sortmerna_is_installed()) {
rrna_removal <- remove_rrna(
metadata,
fastqdir = ds$fastqdir,
filtdir = ds$filtdir,
rrna_db_dir = rrna_db_dir
)
rrna_removal # check run status
}
#> run status
#> 1 SRR1039508 OK
Now that we have performed all quality checks, we’re good to go. 2
Quantification of transcript abundance
Quantification of transcript abundance can be done in two ways:
Alignment-based approaches, which involves mapping reads to a reference genome using STAR (Dobin et al. 2013) and quantifying the expression based on uniquely mapped reads with featureCounts (Liao, Smyth, and Shi 2014) (in raw counts) and/or StringTie (Pertea et al. 2015) (in TPM).
Alignment-free approaches, which involves pseudo-aligning or quasi-mapping reads to a reference transcriptome with kallisto (Bray et al. 2016) or salmon (Patro et al. 2017), respectively.
Below, we will describe how to perform both approaches.
Alignment-based quantification
To start with, we will need to map the reads to a reference genome. In bears, reads are mapped to the reference genome with the software tool STAR.
Read mapping
Here, for the purpose of demonstration, we will map reads to a subset of the human genome. The FASTA and GTF files corresponding to the subset of the genome are available in the extdata/ subdirectory of this package.
Before mapping reads, we need to create a genome index. This can be
done with star_genome_index()
.
# Get paths to genome subset
genome_path <- system.file(
"extdata", "Homo_sapiens.GRCh37.75_subset.fa", package = "bears"
)
gff_path <- system.file(
"extdata", "Homo_sapiens.GRCh37.75_subset.gtf", package = "bears"
)
# Create genome index
if(star_is_installed()) {
genome_idx <- star_genome_index(
genome_path = genome_path,
gff_path = gff_path,
mappingdir = ds$mappingdir
)
genome_idx # check run status
}
#> index_path status
#> 1 /tmp/RtmpGIVQm5/03_read_mapping/genomeIndex OK
Now that we have the genome index, we can map reads to it.
# Map reads to the genome
if(star_is_installed()) {
read_mapping <- star_align(
metadata,
filtdir = ds$filtdir,
qc_table = qc_table,
mappingdir = ds$mappingdir,
gff_path = gff_path
)
read_mapping # check run status
}
#> sample status
#> 1 SAMN02422669 OK
Finally, let’s get read mapping statistics with the function
summary_stats_star()
. Here, we will use an example
Log.final.out that STAR returns stored in the extdata/*
subdirectory of this package.
# Obtaining read mapping statistics
star_dir <- system.file("extdata", package = "bears")
star_stats <- summary_stats_star(star_dir = star_dir)
star_stats
#> Sample total_reads avg_input_read_length uniquely_mapped
#> 1 SAMN02422669 7097 126 7088
#> uniquely_mapped_percent avg_mapped_read_length num_splices
#> 1 99.87 125.64 3825
#> num_annotated_splices num_GTAG_splices num_GCAG_splices num_ATAC_splices
#> 1 3815 3810 15 0
#> num_noncanonical_splices mismatch_rate deletion_rate deletion_length
#> 1 0 0.17 0 1.38
#> insertion_rate insertion_length multimapped multimapped_percent
#> 1 0.01 1.35 9 0.13
#> multimapped_toomany multimapped_toomany_percent unmapped_mismatches_percent
#> 1 0 0 0
#> unmapped_tooshort_percent unmapped_other_percent unmapped_mismatches
#> 1 0 0 0
#> unmapped_tooshort unmapped_other
#> 1 0 0
Now, let’s check if samples passed the minimum quality criteria. Here, samples are excluded if:
- >=50% of the reads fail to map or;
- >=40% of the reads fail to uniquely map.
The function mapping_pass()
takes the metadata data
frame and returns the same data frame, but only with the samples that
passed the minimum criteria.
# Check if samples passed the filtering criterion
align_passed <- mapping_pass(star_stats, metadata)
align_passed # inspect data
#> BioSample Experiment Run Tissue Pubmed
#> 1 SAMN02422669 SRX384345 SRR1039508 human airway smooth muscle cells 24926665
#> BioProject Instrument Layout Selection_method SRA_sample SRA_study
#> 1 PRJNA229998 Illumina HiSeq 2000 PAIRED cDNA SRS508568 SRP033351
#> Treatment Cultivar
#> 1 Untreated <NA>
#> Study_title
#> 1 Human Airway Smooth Muscle Transcriptome Changes in Response to Asthma Medications
#> Study_abstract
#> 1 Rationale: Asthma is a chronic inflammatory airway disease. The most common medications used for its treatment are ß2-agonists and glucocorticosteroids, and one of the primary tissues that these drugs target in the treatment of asthma is the airway smooth muscle. We used RNA-Seq to characterize the human airway smooth muscle (HASM) transcriptome at baseline and under three asthma treatment conditions. Methods: The Illumina TruSeq assay was used to prepare 75bp paired-end libraries for HASM cells from four white male donors under four treatment conditions: 1) no treatment; 2) treatment with a ß2-agonist (i.e. Albuterol, 1µM for 18h); 3) treatment with a glucocorticosteroid (i.e. Dexamethasone (Dex), 1µM for 18h); 4) simultaneous treatment with a ß2-agonist and glucocorticoid, and the libraries were sequenced with an Illumina Hi-Seq 2000 instrument. The Tuxedo Suite Tools were used to align reads to the hg19 reference genome, assemble transcripts, and perform differential expression analysis using the protocol described in https://github.com/blancahimes/taffeta Overall design: mRNA profiles obtained via RNA-Seq for four primary human airway smooth muscle cell lines that were treated with dexamethasone, albuterol, dexamethasone+albuterol or were left untreated.
#> Date Origin
#> 1 2014-01-02 09:16:11 <NA>
# Compare to the original data set
nrow(metadata)
#> [1] 1
nrow(align_passed)
#> [1] 1
As you can see, the sample we used for read mapping passed the minimum quality criteria. Good, huh? We can now proceed to the next step.
Inferring library strandedness
Before quantification, we need to infer library strandedness with the
RSeQC (Wang, Wang, and Li 2012)
tool. The function infer_strandedness()
runs RSeQC
and returns the metadata data frame with an additional column named
Orientation containing library strandedness
information.
This function requires the annotation in BED format, not GTF/GFF. To
convert from GTF/GFF to BED, use the function
gff2bed()
.
# Convert GFF to BED
bedpath <- gff2bed(gff_path)
bedpath # check path
#> [1] "/__w/_temp/Library/bears/extdata/Homo_sapiens.GRCh37.75_subset.bed"
# Infer strandedness
if(rseqc_is_installed()) {
new_metadata <- infer_strandedness(
mapping_passed = align_passed,
bedpath = bedpath,
mappingdir = ds$mappingdir
)
new_metadata
}
#> BioProject BioSample Experiment Run
#> 1 PRJNA229998 SAMN02422669 SRX384345 SRR1039508
#> Tissue Pubmed Instrument Layout
#> 1 human airway smooth muscle cells 24926665 Illumina HiSeq 2000 PAIRED
#> Selection_method SRA_sample SRA_study Treatment Cultivar
#> 1 cDNA SRS508568 SRP033351 Untreated <NA>
#> Study_title
#> 1 Human Airway Smooth Muscle Transcriptome Changes in Response to Asthma Medications
#> Study_abstract
#> 1 Rationale: Asthma is a chronic inflammatory airway disease. The most common medications used for its treatment are ß2-agonists and glucocorticosteroids, and one of the primary tissues that these drugs target in the treatment of asthma is the airway smooth muscle. We used RNA-Seq to characterize the human airway smooth muscle (HASM) transcriptome at baseline and under three asthma treatment conditions. Methods: The Illumina TruSeq assay was used to prepare 75bp paired-end libraries for HASM cells from four white male donors under four treatment conditions: 1) no treatment; 2) treatment with a ß2-agonist (i.e. Albuterol, 1µM for 18h); 3) treatment with a glucocorticosteroid (i.e. Dexamethasone (Dex), 1µM for 18h); 4) simultaneous treatment with a ß2-agonist and glucocorticoid, and the libraries were sequenced with an Illumina Hi-Seq 2000 instrument. The Tuxedo Suite Tools were used to align reads to the hg19 reference genome, assemble transcripts, and perform differential expression analysis using the protocol described in https://github.com/blancahimes/taffeta Overall design: mRNA profiles obtained via RNA-Seq for four primary human airway smooth muscle cell lines that were treated with dexamethasone, albuterol, dexamethasone+albuterol or were left untreated.
#> Date Origin Orientation
#> 1 2014-01-02 09:16:11 <NA> first
As you can see, there is a new Orientation
column with
strandedness info for this sample.
Using featureCounts
Now that we have the .bam files from STAR and information on library strandedness for each sample, we can quantify the expression with featureCounts. This tool quantifies gene expression measured in raw read counts per gene.
To quantify gene expression with featureCounts, use the
function feaureCounts()
. This function runs
featureCounts and returns a gene expression matrix with genes
in rows and samples in columns.
# Get gene expression in raw read counts
if(subread_is_installed()) {
fcounts_quant <- featureCounts(
new_metadata,
mappingdir = ds$mappingdir,
gff_path = gff_path,
fcountsdir = ds$fcountsdir
)
# Explore the expression matrix
head(fcounts_quant)
}
#>
#> ========== _____ _ _ ____ _____ ______ _____
#> ===== / ____| | | | _ \| __ \| ____| /\ | __ \
#> ===== | (___ | | | | |_) | |__) | |__ / \ | | | |
#> ==== \___ \| | | | _ <| _ /| __| / /\ \ | | | |
#> ==== ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
#> ========== |_____/ \____/|____/|_| \_\______/_/ \_\_____/
#> Rsubread 2.10.5
#>
#> //========================== featureCounts setting ===========================\\
#> || ||
#> || Input files : 1 BAM file ||
#> || ||
#> || SAMN02422669Aligned.sortedByCoord.out.bam ||
#> || ||
#> || Paired-end : yes ||
#> || Count read pairs : yes ||
#> || Annotation : Homo_sapiens.GRCh37.75_subset.gtf (GTF) ||
#> || Dir for temp files : . ||
#> || Threads : 2 ||
#> || Level : meta-feature level ||
#> || Multimapping reads : counted ||
#> || Multi-overlapping reads : not counted ||
#> || Min overlapping bases : 1 ||
#> || ||
#> \\============================================================================//
#>
#> //================================= Running ==================================\\
#> || ||
#> || Load annotation file Homo_sapiens.GRCh37.75_subset.gtf ... ||
#> || Features : 20 ||
#> || Meta-features : 2 ||
#> || Chromosomes/contigs : 1 ||
#> || ||
#> || Process BAM file SAMN02422669Aligned.sortedByCoord.out.bam... ||
#> || Strand specific : reversely stranded ||
#> || Paired-end reads are included. ||
#> || Total alignments : 7147 ||
#> || Successfully assigned alignments : 14 (0.2%) ||
#> || Running time : 0.00 minutes ||
#> || ||
#> || Write the final count table. ||
#> || Write the read assignment summary. ||
#> || ||
#> \\============================================================================//
#> SAMN02422669
#> ENSG00000171819 2
#> ENSG00000120942 12
Whenever you are working with gene expression data, we recommend
storing your data as SummarizedExperiment
objects, so you
have the expression matrix and sample metadata in a single object. If
you are not familiar with SummarizedExperiment
objects,
take a look at the documentation of the SummarizedExperiment
package.
To get a SummarizedExperiment
object from
featureCounts, use the function
featureCounts2se()
.
# Create a SummarizedExperiment object from featureCounts output
fcounts_se <- featureCounts2se(
new_metadata, fc_output = fcounts_quant
)
# Take a look at the SummarizedExperiment object
fcounts_se
#> class: SummarizedExperiment
#> dim: 2 1
#> metadata(0):
#> assays(1): fcounts
#> rownames(2): ENSG00000171819 ENSG00000120942
#> rowData names(0):
#> colnames(1): SAMN02422669
#> colData names(17): BioProject Experiment ... Origin Orientation
# Exploring sample metadata
SummarizedExperiment::colData(fcounts_se)
#> DataFrame with 1 row and 17 columns
#> BioProject Experiment Run Tissue
#> <character> <character> <character> <character>
#> SAMN02422669 PRJNA229998 SRX384345 SRR1039508 human airway smooth ..
#> Pubmed Instrument Layout Selection_method
#> <character> <character> <character> <character>
#> SAMN02422669 24926665 Illumina HiSeq 2000 PAIRED cDNA
#> SRA_sample SRA_study Treatment Cultivar
#> <character> <character> <character> <character>
#> SAMN02422669 SRS508568 SRP033351 Untreated NA
#> Study_title Study_abstract Date
#> <character> <character> <character>
#> SAMN02422669 Human Airway Smooth .. Rationale: Asthma is.. 2014-01-02 09:16:11
#> Origin Orientation
#> <character> <character>
#> SAMN02422669 NA first
# Exploring gene expression matrix
SummarizedExperiment::assay(fcounts_se)
#> SAMN02422669
#> ENSG00000171819 2
#> ENSG00000120942 12
Using StringTie
StringTie quantifies transcript-level and gene-level transcript abundances in normalized values (transcripts per million, TPM).
To obtain gene expression levels in TPM with StringTie, use
the function stringtie_quantify()
.
# Quantify expression in TPM with StringTie
if(stringtie_is_installed()) {
stringtie_quant <- stringtie_quantify(
new_metadata,
qc_table = qc_table,
mappingdir = ds$mappingdir,
gff_path = gff_path,
stringtiedir = ds$stringtiedir
)
stringtie_quant # check run status
}
#> sample status
#> 1 SAMN02422669 OK
Now, let’s read the output from StringTie as a
SummarizedExperiment
object with the function
stringtie2se
. You can choose if you want the expression at
the gene level, at the transcript level, or both. Here, let’s get the
gene-level expression. For that, you will need to give a 2-column data
frame with transcript IDs and their corresponding genes.
# Load transcript-to-gene correspondence
data(tx2gene)
head(tx2gene)
#> TXNAME GENEID
#> 1 ENST00000473118 ENSG00000120948
#> 2 ENST00000240185 ENSG00000120948
#> 3 ENST00000476201 ENSG00000120948
#> 4 ENST00000472476 ENSG00000120948
#> 5 ENST00000439080 ENSG00000120948
#> 6 ENST00000315091 ENSG00000120948
# Read StringTie output as a SummarizedExperiment object
stringtiese <- stringtie2se(
new_metadata,
stringtiedir = ds$stringtiedir,
level = "gene",
tx2gene = tx2gene
)
#> reading in files with read.delim (install 'readr' package for speed up)
#> 1
#> summarizing abundance
#> summarizing counts
#> summarizing length
# Exploring the SummarizedExperiment object
stringtiese
#> class: SummarizedExperiment
#> dim: 2 1
#> metadata(0):
#> assays(1): gene_TPM
#> rownames(2): ENSG00000120942 ENSG00000171819
#> rowData names(0):
#> colnames(1): SAMN02422669
#> colData names(17): BioProject Experiment ... Origin Orientation
# Looking at gene expression matrix
SummarizedExperiment::assay(stringtiese, "gene_TPM")
#> SAMN02422669
#> ENSG00000120942 207072.5
#> ENSG00000171819 0.0
Bonus: Transcript assembly and merging
Besides quantifying transcript abundance, StringTie can also be used to assemble transcripts for each BioSample. Assembled transcripts for each BioSample are represented as .gtf files.
However, if you want a single .gtf file with the assembled transcripts for all BioSamples you are studying, you can merge the individual .gtf files from StringTie with the software tool TACO (Niknafs et al. 2017). Below, we will demonstrate how that can be achieved.
# Transcript assembly with StringTie
if(stringtie_is_installed()) {
assembled_transcripts <- stringtie_assemble(
new_metadata,
qc_table = qc_table,
mappingdir = ds$mappingdir,
gff_path = gff_path,
stringtiedir = ds$stringtiedir
)
assembled_transcripts # check run status
}
#> sample status
#> 1 SAMN02422669 OK
In this vignette, we have a single BioSample. However, in real-life
scenarios, you would have several samples. To merge the .gtf files for
each sample in a single .gtf file, use the function
taco_merge()
.
# Merge assembled transcripts with TACO
if(taco_is_installed()) {
merged_transcripts <- taco_merge(
new_metadata,
stringtiedir = ds$stringtiedir
)
merged_transcripts # check run status
}
#> file status
#> 1 final_assembly.gtf OK
The merged transcript assembly will be stored in a file named
final_assembly.gtf in the subdirectory
assembly/merged_assembly, inside stringtiedir
. To
get the path to the .gtf file, use:
# Get path to merged transcript assembly
final_assembly <- file.path(
ds$stringtiedir, "assembly", "merged_assembly", "final_assembly.gtf"
)
final_assembly
#> [1] "/tmp/RtmpGIVQm5/04_quantification/stringtie/assembly/merged_assembly/final_assembly.gtf"
But why would I want to assemble transcripts and merge them if I already have a .gtf file the transcript annotations?
That’s a great question! This is a way to identify novel transcripts that are not present in your reference .gtf file. Some transcripts can be missing in the reference annotation (.gtf file) mainly because: i. genome assembly does not have a good quality, so these transcripts could not be predicted. ii. false-positives from the transcript annotation software tool that was used.
If you want to have a more comprehensive transcript abundance quantification, you can assemble transcripts for each sample, merge them, and input the output file final_assembly.gtf to the quantification functions. This way, instead of using the reference transcript annotation, you will use your own transcript annotation, which may contain novel transcripts.
Alignment-free quantification
To quantify the expression without mapping reads to the genome, you have two options:
- Using kallisto, which quantifies transcript abundance based on pseudoalignments.
- Using salmon, which quantifies transcript abundance based on quasi-mapping.
For both kallisto and salmon, you will need to have a
reference transcriptome, not a reference genome. This
is a FASTA file containing the sequences of all annotated transcripts in
your genome. You can easily create this file with the function
extractTranscriptSeqs()
from the GenomicFeatures
package.
Using salmon
First of all, we will need to index the reference transcriptome with
the function salmon_index()
.
# Path to reference transcriptome
transcriptome_path <- system.file(
"extdata", "Homo_sapiens.GRCh37.75_subset_transcripts.fa.gz",
package = "bears"
)
# Index the transcriptome
if(salmon_is_installed()) {
idx_salmon <- salmon_index(
salmonindex = ds$salmonindex,
transcriptome_path = transcriptome_path
)
idx_salmon # check run status
}
#> index_path status
#> 1 /tmp/RtmpGIVQm5/04_quantification/salmon/idx OK
Now, we can quantify transcript abundance with
salmon_quantify()
.
# Quantify transcript abundance
if(salmon_is_installed()) {
quant_salmon <- salmon_quantify(
new_metadata,
filtdir = ds$filtdir,
salmonindex = ds$salmonindex,
salmondir = ds$salmondir
)
quant_salmon # check run status
}
#> sample status
#> 1 SAMN02422669 OK
After running salmon_quantify()
, salmon output in .sf
format will be stored in the directory specified in
salmondir
.
To read salmon output as a SummarizedExperiment
object,
use the function salmon2se()
. You can choose if you want
the expression at the gene level, at the transcript level, or both.
Here, let’s get the gene-level expression. For that, you will need to
give a 2-column data frame with transcript IDs and their corresponding
genes.
# Load transcript-to-gene data frame
data(tx2gene)
head(tx2gene)
#> TXNAME GENEID
#> 1 ENST00000473118 ENSG00000120948
#> 2 ENST00000240185 ENSG00000120948
#> 3 ENST00000476201 ENSG00000120948
#> 4 ENST00000472476 ENSG00000120948
#> 5 ENST00000439080 ENSG00000120948
#> 6 ENST00000315091 ENSG00000120948
# Read salmon output as a SummarizedExperiment object
salmon_se <- salmon2se(
new_metadata,
level = "gene",
salmondir = ds$salmondir,
tx2gene
)
#> reading in files with read.delim (install 'readr' package for speed up)
#> 1
#> summarizing abundance
#> summarizing counts
#> summarizing length
# Exploring the output
salmon_se
#> class: SummarizedExperiment
#> dim: 2 1
#> metadata(0):
#> assays(2): gene_TPM gene_counts
#> rownames(2): ENSG00000120942 ENSG00000171819
#> rowData names(0):
#> colnames(1): SAMN02422669
#> colData names(17): BioProject Experiment ... Origin Orientation
# Get gene expression matrix in TPM
SummarizedExperiment::assay(salmon_se, "gene_TPM")
#> SAMN02422669
#> ENSG00000120942 950095.05
#> ENSG00000171819 49904.95
# Get gene expression matrix as raw counts
SummarizedExperiment::assay(salmon_se, "gene_counts")
#> SAMN02422669
#> ENSG00000120942 113
#> ENSG00000171819 2
Using kallisto
Like we do in salmon, we will start by indexing the transcriptome.
# Index the transcriptome
if(kallisto_is_installed()) {
idx_kallisto <- kallisto_index(
kallistoindex = ds$kallistoindex,
transcriptome_path = transcriptome_path
)
idx_kallisto # check run status
}
#> index_path status
#> 1 /tmp/RtmpGIVQm5/04_quantification/kallisto/idx/transcripts.idx OK
Now, we can quantify the transcript abundance.
# Quantify transcript abundance
if(kallisto_is_installed()) {
quant_kallisto <- kallisto_quantify(
new_metadata,
qc_table,
filtdir = ds$filtdir,
kallistoindex = ds$kallistoindex,
kallistodir = ds$kallistodir
)
quant_kallisto # check run status
}
#> sample status
#> 1 SAMN02422669 OK
To read kallisto output in a SummarizedExperiment
object, use the function kallisto2se()
. Again, you will
have specify if you want the expressiona at the gene level, transcript
level, or both. Let’s get the gene expression here.
# Read kallisto output to SummarizedExperiment object
kallisto_se <- kallisto2se(
new_metadata,
level = "gene",
kallistodir = ds$kallistodir,
tx2gene
)
#> Note: importing `abundance.h5` is typically faster than `abundance.tsv`
#> reading in files with read.delim (install 'readr' package for speed up)
#> 1
#> summarizing abundance
#> summarizing counts
#> summarizing length
#> summarizing inferential replicates
# Exploring the output
kallisto_se
#> class: SummarizedExperiment
#> dim: 2 1
#> metadata(0):
#> assays(2): gene_TPM gene_counts
#> rownames(2): ENSG00000120942 ENSG00000171819
#> rowData names(0):
#> colnames(1): SAMN02422669
#> colData names(17): BioProject Experiment ... Origin Orientation
# Get gene expression matrix in TPM
SummarizedExperiment::assay(kallisto_se, "gene_TPM")
#> SAMN02422669
#> ENSG00000120942 988556.3
#> ENSG00000171819 11443.4
# Get gene expression matrix as raw counts
SummarizedExperiment::assay(kallisto_se, "gene_counts")
#> SAMN02422669
#> ENSG00000120942 127.952080
#> ENSG00000171819 2.047949
Closing remarks
If you are using bears, there are two things you must keep in mind. First, this package was designed to be as complete as possible, which means you don’t need to run the complete pipeline for your own project. For instance, if you just want gene expression values in TPM for a particular BioProject or set of BioProjects, you can simply go through the salmon path of the pipeline, skipping the read mapping section. Likewise, if you are using bears for your own data set and you have already cleaned the reads, you can skip the sequence quality checks and read filtering sections. The second thing to consider is that bears is a work in progress. Bioinformatics is a fast-evolving field, and new (and better) methods to address a particular question are developed continuously. Hence, we aim to keep bears up to date with state-of-the-art methods.
FAQ
How do I manage Conda environments from the R session?
You only need to do the following 2 things:
- Install miniconda. After installing miniconda in your machine, create an R object containing the path to your miniconda installation. For example:
miniconda_path <- "~/tools/miniconda"
-
Create a different environment for each external
tool. You can use the Herper
package to create an environment for each tool from .yml files stored in
the extdata/ subdirectory of this package. To avoid conflicts,
it is important to keep each tool in its own environment. Here, each
tool will be installed in an environment named
<tool-name-lowercase>_env
(e.g.,star_env
,rseqc_env
,salmon_env
).
library(Herper)
# Path to .yml files to create environments
envs <- list.files(
system.file("extdata", package = "bears"), pattern = ".yml",
full.names = TRUE
)
# Install miniconda in `my_miniconda` and create envs
create_envs <- sapply(envs, function(x) {
import_CondaEnv(x, pathToMiniConda = miniconda_path)
})
Now, you can run functions that call external tools inside a call to
Herper::withCondaEnv()
. This function allows you to run an
R function inside a particular Conda environment, which you need to
specify. For example, to run the function salmon_quantify()
inside the environment salmon_env
, you would do:
withCondaEnv(
"salmon_env",
quant_salmon <- salmon_quantify(
new_metadata,
filtdir = ds$filtdir,
salmonindex = ds$salmonindex,
salmondir = ds$salmondir
),
pathToMiniconda = miniconda_path
)
Can I use this package with my own RNA-seq data (not from a database)?
You surely can. For that, you will first need to create a metadata
data frame for your samples that looks like the data frame created by
create_sample_info()
(see the example data set
sample_info
). While you can add as many columns as you
want, 5 columns MUST be present:
BioSample: BioSample ID. These can be fictional sample names, e.g., “Sample1”, “Sample2”, “Sample3”, etc.
Run: Run accession. These must be the basename of your files. Make sure you don’t repeat run accessions for paired-end reads, i.e., files example_1.fastq.gz and example_2.fastq.gz should both be represented by the same run accession. Examples of file names and the run accessions they should have:
Layout | File name | Run accession |
---|---|---|
PAIRED | SampleA_1.fastq.gz and SampleA_2.fastq.gz | SampleA |
SINGLE | Sample1_control_replicate1.fastq.gz | Sample1_control_replicate1 |
BioProject: Project ID. If all files come from a single project, you can give it whatever name you want, such as MyNicePhDProject.
Instrument: Sequencing instrument (e.g., Illumina HiSeq 2500). Note that long-read technologies (e.g., PacBio) are not supported and, hence, will be skipped.
Layout: Sequencing protocol. Either “PAIRED” or “SINGLE”.
Finally, if you want to create a standard directory structure with
create_dir_structure()
, you will have to either move your
files to the directory indicated in ds$fastqdir
or change
ds$fastqdir
manually to include the path to the directory
where your FASTQ files are. For example:
ds$fastqdir <- "/home/myusername/my_nice_project/fastq_files"
Session information
This document was created under the following sections:
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#> setting value
#> version R version 4.2.1 (2022-06-23)
#> os Ubuntu 20.04.4 LTS
#> system x86_64, linux-gnu
#> ui X11
#> language en
#> collate en_US.UTF-8
#> ctype en_US.UTF-8
#> tz UTC
#> date 2023-04-04
#> pandoc 2.18 @ /usr/local/bin/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#> package * version date (UTC) lib source
#> bears * 0.99.0 2023-04-04 [1] Bioconductor
#> Biobase 2.56.0 2022-04-26 [1] Bioconductor
#> BiocGenerics 0.42.0 2022-04-26 [1] Bioconductor
#> BiocIO 1.6.0 2022-04-26 [1] Bioconductor
#> BiocManager 1.30.20 2023-02-24 [1] RSPM (R 4.2.0)
#> BiocParallel 1.30.4 2022-10-11 [1] Bioconductor
#> BiocStyle * 2.24.0 2022-04-26 [1] Bioconductor
#> Biostrings 2.64.1 2022-08-18 [1] Bioconductor
#> bitops 1.0-7 2021-04-24 [1] CRAN (R 4.2.0)
#> bookdown 0.33 2023-03-06 [1] RSPM (R 4.2.0)
#> bslib 0.4.2 2022-12-16 [2] RSPM (R 4.2.0)
#> cachem 1.0.7 2023-02-24 [2] RSPM (R 4.2.0)
#> cli 3.6.1 2023-03-23 [2] RSPM (R 4.2.0)
#> codetools 0.2-19 2023-02-01 [3] RSPM (R 4.2.0)
#> crayon 1.5.2 2022-09-29 [2] RSPM (R 4.2.0)
#> curl 5.0.0 2023-01-12 [2] RSPM (R 4.2.0)
#> DelayedArray 0.22.0 2022-04-26 [1] Bioconductor
#> desc 1.4.2 2022-09-08 [2] RSPM (R 4.2.0)
#> digest 0.6.31 2022-12-11 [2] RSPM (R 4.2.0)
#> downloader 0.4 2015-07-09 [1] RSPM (R 4.2.0)
#> evaluate 0.20 2023-01-17 [2] RSPM (R 4.2.0)
#> fansi 1.0.4 2023-01-22 [2] RSPM (R 4.2.0)
#> fastmap 1.1.1 2023-02-24 [2] RSPM (R 4.2.0)
#> fs 1.6.1 2023-02-06 [2] RSPM (R 4.2.0)
#> GenomeInfoDb 1.32.4 2022-09-06 [1] Bioconductor
#> GenomeInfoDbData 1.2.8 2022-05-02 [1] Bioconductor
#> GenomicAlignments 1.32.1 2022-07-24 [1] Bioconductor
#> GenomicRanges 1.48.0 2022-04-26 [1] Bioconductor
#> glue 1.6.2 2022-02-24 [2] CRAN (R 4.2.0)
#> htmltools 0.5.5 2023-03-23 [2] RSPM (R 4.2.0)
#> httr 1.4.5 2023-02-24 [2] RSPM (R 4.2.0)
#> IRanges 2.30.1 2022-08-18 [1] Bioconductor
#> jquerylib 0.1.4 2021-04-26 [2] CRAN (R 4.2.0)
#> jsonlite 1.8.4 2022-12-06 [1] RSPM (R 4.2.0)
#> knitr 1.42 2023-01-25 [2] RSPM (R 4.2.0)
#> lattice 0.20-45 2021-09-22 [3] CRAN (R 4.2.1)
#> lifecycle 1.0.3 2022-10-07 [2] RSPM (R 4.2.0)
#> magrittr 2.0.3 2022-03-30 [2] CRAN (R 4.2.0)
#> Matrix 1.5-3 2022-11-11 [3] RSPM (R 4.2.0)
#> MatrixGenerics 1.8.1 2022-06-26 [1] Bioconductor
#> matrixStats 0.63.0 2022-11-18 [1] RSPM (R 4.2.0)
#> memoise 2.0.1 2021-11-26 [2] CRAN (R 4.2.0)
#> pillar 1.9.0 2023-03-22 [2] RSPM (R 4.2.0)
#> pkgconfig 2.0.3 2019-09-22 [2] CRAN (R 4.2.0)
#> pkgdown 2.0.7 2022-12-14 [2] RSPM (R 4.2.0)
#> purrr 1.0.1 2023-01-10 [2] RSPM (R 4.2.0)
#> R6 2.5.1 2021-08-19 [2] CRAN (R 4.2.0)
#> ragg 1.2.5 2023-01-12 [2] RSPM (R 4.2.0)
#> RCurl 1.98-1.12 2023-03-27 [1] RSPM (R 4.2.0)
#> rentrez 1.2.3 2020-11-10 [1] RSPM (R 4.2.0)
#> restfulr 0.0.15 2022-06-16 [1] RSPM (R 4.2.0)
#> rhdf5 2.40.0 2022-04-26 [1] Bioconductor
#> rhdf5filters 1.8.0 2022-04-26 [1] Bioconductor
#> Rhdf5lib 1.18.2 2022-05-15 [1] Bioconductor
#> rjson 0.2.21 2022-01-09 [1] RSPM (R 4.2.0)
#> rlang 1.1.0 2023-03-14 [2] RSPM (R 4.2.0)
#> rmarkdown 2.21 2023-03-26 [2] RSPM (R 4.2.0)
#> rprojroot 2.0.3 2022-04-02 [2] CRAN (R 4.2.0)
#> Rsamtools 2.12.0 2022-04-26 [1] Bioconductor
#> Rsubread 2.10.5 2022-08-09 [1] Bioconductor
#> rtracklayer 1.56.1 2022-06-23 [1] Bioconductor
#> S4Vectors 0.34.0 2022-04-26 [1] Bioconductor
#> sass 0.4.5 2023-01-24 [2] RSPM (R 4.2.0)
#> sessioninfo 1.2.2 2021-12-06 [2] RSPM (R 4.2.0)
#> stringi 1.7.12 2023-01-11 [2] RSPM (R 4.2.0)
#> stringr 1.5.0 2022-12-02 [2] RSPM (R 4.2.0)
#> SummarizedExperiment 1.26.1 2022-04-29 [1] Bioconductor
#> systemfonts 1.0.4 2022-02-11 [2] RSPM (R 4.2.0)
#> textshaping 0.3.6 2021-10-13 [2] RSPM (R 4.2.0)
#> tibble 3.2.1 2023-03-20 [2] RSPM (R 4.2.0)
#> tximport 1.24.0 2022-04-26 [1] Bioconductor
#> utf8 1.2.3 2023-01-31 [2] RSPM (R 4.2.0)
#> vctrs 0.6.1 2023-03-22 [2] RSPM (R 4.2.0)
#> xfun 0.38 2023-03-24 [2] RSPM (R 4.2.0)
#> XML 3.99-0.14 2023-03-19 [1] RSPM (R 4.2.0)
#> XVector 0.36.0 2022-04-26 [1] Bioconductor
#> yaml 2.3.7 2023-01-23 [2] RSPM (R 4.2.0)
#> zlibbioc 1.42.0 2022-04-26 [1] Bioconductor
#>
#> [1] /__w/_temp/Library
#> [2] /usr/local/lib/R/site-library
#> [3] /usr/local/lib/R/library
#>
#> ──────────────────────────────────────────────────────────────────────────────