Gene regulatory network inference
Fabricio Almeida-Silva
Universidade Estadual do Norte Fluminense Darcy Ribeiro, RJ, Brazil
Thiago Motta Venancio
Universidade Estadual do Norte Fluminense Darcy Ribeiro, RJ, Brazil
Source: vignettes/vignette_02_GRN_inference.Rmd
Installation
if(!requireNamespace('BiocManager', quietly = TRUE))
    install.packages('BiocManager')
BiocManager::install("BioNERO")
# Load the package after installation
library(BioNERO)
Introduction and algorithm description
In the previous vignette, we explored all aspects of gene coexpression networks (GCNs), which are represented as undirected weighted graphs. The graph is undirected because, for a given link between gene A and gene B, we can only say that these genes are coexpressed; we cannot know whether gene A controls gene B or vice versa. It is weighted because some coexpression relationships between gene pairs are stronger than others. In this vignette, we will demonstrate how to infer gene regulatory networks (GRNs) from expression data with BioNERO. GRNs display interactions between regulators (e.g., transcription factors or miRNAs) and their targets (e.g., genes). Hence, they are represented as directed unweighted graphs.
Numerous algorithms have been developed to infer GRNs from expression data. However, the performance of each algorithm depends heavily on the benchmark data set. To address this uncertainty, Marbach et al. (2012) proposed applying the “wisdom of the crowds” principle to GRN inference. This approach consists of inferring GRNs with different algorithms, ranking the interactions identified by each method, and calculating the average rank of each interaction across all algorithms used. The result is a set of consensus, high-confidence edges that can be used in biological interpretations. To that end, BioNERO implements three popular algorithms: GENIE3 (Huynh-Thu et al. 2010), ARACNE (Margolin et al. 2006) and CLR (Faith et al. 2007).
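The rank-averaging idea can be illustrated with a minimal base R sketch. The three edge lists, their weights, and the helper rank_edges() are made up for illustration and are not BioNERO's internal code. The sketch also assumes that every method scores the same set of edges, which real inference output does not guarantee.

```r
# Toy edge lists from three hypothetical methods (weights are made up).
# Each method uses its own weight scale, so raw weights are not comparable.
edges_a <- data.frame(Regulator = c("TF1", "TF1", "TF2"),
                      Target = c("G1", "G2", "G1"),
                      Weight = c(0.9, 0.2, 0.5))
edges_b <- data.frame(Regulator = c("TF1", "TF1", "TF2"),
                      Target = c("G1", "G2", "G1"),
                      Weight = c(0.7, 0.6, 0.1))
edges_c <- data.frame(Regulator = c("TF1", "TF1", "TF2"),
                      Target = c("G1", "G2", "G1"),
                      Weight = c(3.0, 1.0, 4.0))

# Rank edges within one method: rank 1 is the strongest edge.
rank_edges <- function(edges) {
    data.frame(
        Edge = paste(edges$Regulator, edges$Target, sep = "->"),
        Rank = rank(-edges$Weight, ties.method = "average")
    )
}

# Stack the per-method ranks and average them per edge.
all_ranks <- do.call(rbind, lapply(list(edges_a, edges_b, edges_c), rank_edges))
consensus <- aggregate(Rank ~ Edge, data = all_ranks, FUN = mean)

# Lowest average rank first: these are the consensus edges.
consensus <- consensus[order(consensus$Rank), ]
consensus
```

Working on ranks rather than raw weights is what makes methods with different weight scales comparable.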
Data preprocessing
Before inferring the GRN, we will preprocess the expression data the same way we did in the previous vignette.
# Load example data set
data(zma.se)
# Preprocess the expression data
final_exp <- exp_preprocess(
    zma.se,
    min_exp = 10,
    variance_filter = TRUE,
    n = 2000
)
## Number of removed samples: 1
Gene regulatory network inference
BioNERO requires only two objects for GRN inference: the expression data (SummarizedExperiment, matrix or data frame) and a character vector of regulators (transcription factors or miRNAs). The transcription factors used in this vignette were downloaded from PlantTFDB 4.0 (Jin et al. 2017).
data(zma.tfs)
head(zma.tfs)
##              Gene Family
## 6  Zm00001d022525    Dof
## 25 Zm00001d037605   GATA
## 28 Zm00001d049540    NAC
## 45 Zm00001d042287    MYB
## 46 Zm00001d042288    NAC
## 54 Zm00001d039371    TCP
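Regulators that are absent from the expression data have no expression profile to learn from, so it can be useful to check the overlap before inference. The sketch below uses a made-up expression matrix and a made-up regulator table with the same two columns as zma.tfs. All object names and gene IDs in it are hypothetical.

```r
# Toy expression matrix: genes in rows, samples in columns (IDs are made up).
exp_toy <- matrix(
    c(5, 7, 6,
      1, 2, 1,
      9, 8, 9,
      3, 4, 3),
    nrow = 4, byrow = TRUE,
    dimnames = list(c("gene1", "gene2", "gene3", "gene4"), c("s1", "s2", "s3"))
)

# Toy regulator table shaped like zma.tfs (columns Gene and Family).
tfs_toy <- data.frame(
    Gene = c("gene2", "gene4", "gene9"),
    Family = c("MYB", "NAC", "Dof")
)

# The regulators argument is a character vector of gene IDs.
regulators <- as.character(tfs_toy$Gene)

# Regulators with and without an expression profile in the matrix.
present <- intersect(regulators, rownames(exp_toy))
absent <- setdiff(regulators, rownames(exp_toy))
present
absent
```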
Consensus GRN inference
Inferring GRNs based on the wisdom of the crowds principle can be done with a single function: exp2grn(). This function will infer GRNs with GENIE3, ARACNE and CLR, calculate average ranks for each interaction and filter the resulting network based on the optimal scale-free topology (SFT) fit. In the filtering step, n different networks are created by taking cumulative subsets of the top-ranked edges, one per quantile. For instance, if a network of 10,000 edges is given as input with nsplit = 10, 10 different networks will be created: the first with 1,000 edges, the second with 2,000 edges, and so on, with the last network being the original input network. Then, for each network, the function will calculate the SFT fit and select the network with the best fit.
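The filtering step can be sketched in base R as follows. This is an approximation under stated assumptions, not the package's implementation: the edge list is simulated, its row order is taken to be the rank order (best edge first), and the SFT fit is approximated by the R squared of a linear regression of log10(p(k)) on log10(k) without degree binning. The helper sft_r2() is hypothetical.

```r
set.seed(1)  # toy data only

# Simulated ranked edge list: a few regulators receive most of the edges.
regs <- paste0("TF", 1:20)
targets <- paste0("G", 1:200)
edges <- unique(data.frame(
    Regulator = sample(regs, 1000, replace = TRUE, prob = (1:20)^-1.5),
    Target = sample(targets, 1000, replace = TRUE),
    stringsAsFactors = FALSE
))

# Approximate SFT fit: R squared of log10(p(k)) ~ log10(k),
# where p(k) is the fraction of nodes with degree k.
sft_r2 <- function(el) {
    node_degree <- as.vector(table(c(el$Regulator, el$Target)))
    k_freq <- table(node_degree)
    if (length(k_freq) < 3) return(NA_real_)  # too few points to fit
    k <- as.numeric(names(k_freq))
    p_k <- as.numeric(k_freq) / sum(k_freq)
    summary(lm(log10(p_k) ~ log10(k)))$r.squared
}

# Cumulative subsets: top 10% of edges, top 20%, ..., all edges.
nsplit <- 10
sizes <- ceiling(seq_len(nsplit) / nsplit * nrow(edges))
fits <- vapply(sizes, function(n) sft_r2(edges[seq_len(n), ]), numeric(1))

# Keep the subset whose degree distribution is closest to a power law.
best <- which.max(fits)
filtered <- edges[seq_len(sizes[best]), ]
data.frame(n_edges = sizes, r_squared = round(fits, 3))
```

Because the subsets are cumulative, the last one is always the full input network, which matches the description above.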
# Using 10 trees for demonstration purposes. Use the default: 1000
grn <- exp2grn(
    exp = final_exp,
    regulators = zma.tfs$Gene,
    nTrees = 10
)
## The top number of edges that best fits the scale-free topology is 247
head(grn)
##          Regulator         Target
## 290 Zm00001d041474 Zm00001d018986
## 280 Zm00001d041474 Zm00001d006602
## 281 Zm00001d041474 Zm00001d006942
## 325 Zm00001d044315 Zm00001d043497
## 65  Zm00001d013777 Zm00001d046996
## 252 Zm00001d038832 Zm00001d021147
Algorithm-specific GRN inference
This section is intended for users who, for some reason (e.g., comparison, exploration), want to infer GRNs with particular algorithms. The available algorithms are:
GENIE3: a regression tree-based algorithm that decomposes the prediction of GRNs for n genes into n regression problems. For each regression problem, the expression profile of a target gene is predicted from the expression profiles of all other genes using random forests (default) or extra-trees.
# Using 10 trees for demonstration purposes. Use the default: 1000
genie3 <- grn_infer(
    final_exp,
    method = "genie3",
    regulators = zma.tfs$Gene,
    nTrees = 10
)
head(genie3)
##                Node1          Node2    Weight
## 20352 Zm00001d041474 Zm00001d017881 0.5439514
## 41340 Zm00001d034751 Zm00001d037111 0.5322394
## 13037 Zm00001d034751 Zm00001d012407 0.4348469
## 13207 Zm00001d045323 Zm00001d012513 0.4203583
## 55378 Zm00001d028432 Zm00001d048693 0.4071160
## 50200 Zm00001d013777 Zm00001d044212 0.3957483
dim(genie3)
## [1] 60136 3
ARACNE: an information-theoretic algorithm that aims to remove indirect interactions inferred by coexpression.
aracne <- grn_infer(final_exp, method = "aracne", regulators = zma.tfs$Gene)
head(aracne)
##                Node1          Node2   Weight
## 23861 Zm00001d038832 Zm00001d021147 1.789818
## 1758  Zm00001d038832 Zm00001d000432 1.692232
## 11337 Zm00001d038832 Zm00001d011086 1.692232
## 27014 Zm00001d011139 Zm00001d024274 1.674840
## 51070 Zm00001d011139 Zm00001d045069 1.658043
## 28387 Zm00001d038832 Zm00001d025784 1.641802
dim(aracne)
## [1] 411 3
CLR: an extension of the relevance networks algorithm that uses mutual information to identify regulatory interactions.
clr <- grn_infer(final_exp, method = "clr", regulators = zma.tfs$Gene)
head(clr)
##                Node1          Node2   Weight
## 26302 Zm00001d046937 Zm00001d023376 12.70216
## 11267 Zm00001d046937 Zm00001d011080 12.25336
## 12540 Zm00001d041474 Zm00001d012007 10.74023
## 51019 Zm00001d042263 Zm00001d045042 10.50925
## 17810 Zm00001d041474 Zm00001d015811 10.33216
## 29278 Zm00001d046937 Zm00001d026632 10.20075
dim(clr)
## [1] 26657 3
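ARACNE and CLR both start from a matrix of pairwise mutual information (MI) and differ in how they post-process it. The sketch below illustrates the two ideas on a made-up 4-gene MI matrix. It is a simplified illustration, not the code that grn_infer() runs: dpi_prune() and clr_scores() are hypothetical helpers, the pruning follows the usual description of ARACNE (in each gene triplet, the weakest edge is treated as indirect), and the scoring follows the usual description of CLR (each MI value is converted to a z-score against the background of each of its two genes).

```r
# Toy symmetric MI matrix for four genes (values are made up).
genes <- c("A", "B", "C", "D")
mi <- matrix(0, 4, 4, dimnames = list(genes, genes))
mi["A", "B"] <- 0.9; mi["A", "C"] <- 0.1; mi["A", "D"] <- 0.2
mi["B", "C"] <- 0.2; mi["B", "D"] <- 0.1; mi["C", "D"] <- 0.8
mi <- mi + t(mi)

# ARACNE-style pruning: in every gene triplet, drop the weakest edge when it
# is weaker than the second-weakest edge by more than `eps`.
dpi_prune <- function(mi, eps = 0) {
    pruned <- mi
    trios <- combn(nrow(mi), 3)
    for (s in seq_len(ncol(trios))) {
        pairs <- combn(trios[, s], 2)  # the three edges of this triplet
        w <- mi[t(pairs)]              # weights read from the unpruned matrix
        w_sorted <- sort(w)
        if (w_sorted[2] - w_sorted[1] > eps) {
            weakest <- pairs[, which.min(w)]
            pruned[weakest[1], weakest[2]] <- 0
            pruned[weakest[2], weakest[1]] <- 0
        }
    }
    pruned
}

# CLR-style scoring: z-score each MI value against the MI distribution of its
# row gene, clip negative values to zero, then combine both genes' z-scores.
clr_scores <- function(mi) {
    mu <- rowMeans(mi)
    sigma <- apply(mi, 1, sd)
    z <- pmax((mi - mu) / sigma, 0)
    scores <- sqrt(z^2 + t(z)^2)
    diag(scores) <- 0
    scores
}

pruned_toy <- dpi_prune(mi)
clr_toy <- clr_scores(mi)
pruned_toy
round(clr_toy, 2)
```

In this toy matrix, the A-C and B-D edges are the weakest edge of every triplet they belong to, so pruning removes them and keeps the other four edges.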
Users can also infer GRNs with the 3 algorithms at once using the function grn_combined(). The resulting edge lists are stored in a list of 3 elements.
grn_list <- grn_combined(final_exp, regulators = zma.tfs$Gene, nTrees = 10)
head(grn_list$genie3)
##                Node1          Node2    Weight
## 12013 Zm00001d041474 Zm00001d011541 0.4629469
## 30418 Zm00001d046568 Zm00001d027841 0.4289222
## 33403 Zm00001d041474 Zm00001d030748 0.4140894
## 6910  Zm00001d044315 Zm00001d006725 0.4103733
## 22057 Zm00001d041474 Zm00001d018986 0.4020641
## 45153 Zm00001d034751 Zm00001d039733 0.3935705
head(grn_list$aracne)
##                Node1          Node2   Weight
## 23861 Zm00001d038832 Zm00001d021147 1.789818
## 1758  Zm00001d038832 Zm00001d000432 1.692232
## 11337 Zm00001d038832 Zm00001d011086 1.692232
## 27014 Zm00001d011139 Zm00001d024274 1.674840
## 51070 Zm00001d011139 Zm00001d045069 1.658043
## 28387 Zm00001d038832 Zm00001d025784 1.641802
head(grn_list$clr)
##                Node1          Node2   Weight
## 26302 Zm00001d046937 Zm00001d023376 12.70216
## 11267 Zm00001d046937 Zm00001d011080 12.25336
## 12540 Zm00001d041474 Zm00001d012007 10.74023
## 51019 Zm00001d042263 Zm00001d045042 10.50925
## 17810 Zm00001d041474 Zm00001d015811 10.33216
## 29278 Zm00001d046937 Zm00001d026632 10.20075
Gene regulatory network analysis
After inferring the GRN, BioNERO allows users to perform some common downstream analyses.
Hub gene identification
GRN hubs are defined as the top 10% most highly connected regulators, but this percentile is flexible in BioNERO. They can be identified with get_hubs_grn().
hubs <- get_hubs_grn(grn)
hubs
##              Gene Degree
## 1  Zm00001d038832     16
## 2  Zm00001d041474     13
## 3  Zm00001d046937     13
## 4  Zm00001d011139     12
## 5  Zm00001d052229     11
## 6  Zm00001d013777     10
## 7  Zm00001d039989     10
## 8  Zm00001d038227     10
## 9  Zm00001d030617     10
## 10 Zm00001d044315      9
## 11 Zm00001d003822      9
## 12 Zm00001d020020      9
## 13 Zm00001d046568      9
## 14 Zm00001d010227      9
## 15 Zm00001d025339      8
## 16 Zm00001d028974      8
## 17 Zm00001d042267      7
## 18 Zm00001d014377      7
## 19 Zm00001d054038      6
## 20 Zm00001d042263      6
## 21 Zm00001d035440      6
## 22 Zm00001d036148      6
## 23 Zm00001d031655      6
## 24 Zm00001d034751      6
## 25 Zm00001d018081      6
## 26 Zm00001d027957      5
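The hub definition above can be illustrated on a toy edge list. The sketch below is a simplified reading of that definition, not the body of get_hubs_grn(): it counts degree over regulators only and takes the top fraction of regulators, and both choices are assumptions. The object names and the top_fraction variable are hypothetical.

```r
# Toy GRN edge list: TF1 regulates 10 targets, TF2 regulates 9, ..., TF10 one.
grn_toy <- data.frame(
    Regulator = rep(paste0("TF", 1:10), times = 10:1),
    Target = paste0("G", seq_len(55))
)

# Degree of each regulator = number of targets it regulates.
degree <- sort(table(grn_toy$Regulator), decreasing = TRUE)

# Keep the top fraction of regulators (10% here) as hubs.
top_fraction <- 0.1
n_hubs <- ceiling(length(degree) * top_fraction)
hubs_toy <- data.frame(
    Gene = names(degree)[seq_len(n_hubs)],
    Degree = as.integer(degree)[seq_len(n_hubs)]
)
hubs_toy
```

With 10 regulators and a 10% cutoff, a single hub is returned: the regulator with the most targets.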
Network visualization
plot_grn(grn)
GRNs can also be visualized interactively for exploratory purposes.
Finally, BioNERO can also be used for visualization and hub identification in protein-protein interaction (PPI) networks. The functions get_hubs_ppi() and plot_ppi() work the same way as their equivalents for GRNs (get_hubs_grn() and plot_grn()).
Session information
This vignette was created under the following conditions:
sessionInfo()
## R version 4.3.3 (2024-02-29)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.4 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] BioNERO_1.13.1 BiocStyle_2.30.0
##
## loaded via a namespace (and not attached):
## [1] RColorBrewer_1.1-3 ggdendro_0.2.0
## [3] rstudioapi_0.16.0 jsonlite_1.8.8
## [5] shape_1.4.6.1 NetRep_1.2.7
## [7] magrittr_2.0.3 farver_2.1.1
## [9] rmarkdown_2.26 GlobalOptions_0.1.2
## [11] fs_1.6.3 zlibbioc_1.48.2
## [13] ragg_1.3.0 vctrs_0.6.5
## [15] memoise_2.0.1 RCurl_1.98-1.14
## [17] base64enc_0.1-3 htmltools_0.5.8.1
## [19] S4Arrays_1.2.1 dynamicTreeCut_1.63-1
## [21] SparseArray_1.2.4 Formula_1.2-5
## [23] sass_0.4.9 bslib_0.7.0
## [25] htmlwidgets_1.6.4 desc_1.4.3
## [27] plyr_1.8.9 impute_1.76.0
## [29] cachem_1.0.8 networkD3_0.4
## [31] igraph_2.0.3 lifecycle_1.0.4
## [33] ggnetwork_0.5.13 iterators_1.0.14
## [35] pkgconfig_2.0.3 Matrix_1.6-5
## [37] R6_2.5.1 fastmap_1.1.1
## [39] GenomeInfoDbData_1.2.11 MatrixGenerics_1.14.0
## [41] clue_0.3-65 digest_0.6.35
## [43] colorspace_2.1-0 patchwork_1.2.0
## [45] AnnotationDbi_1.64.1 S4Vectors_0.40.2
## [47] GENIE3_1.24.0 textshaping_0.3.7
## [49] Hmisc_5.1-2 GenomicRanges_1.54.1
## [51] RSQLite_2.3.6 labeling_0.4.3
## [53] fansi_1.0.6 mgcv_1.9-1
## [55] httr_1.4.7 abind_1.4-5
## [57] compiler_4.3.3 withr_3.0.0
## [59] bit64_4.0.5 doParallel_1.0.17
## [61] htmlTable_2.4.2 backports_1.4.1
## [63] BiocParallel_1.36.0 DBI_1.2.2
## [65] intergraph_2.0-4 highr_0.10
## [67] MASS_7.3-60.0.1 DelayedArray_0.28.0
## [69] rjson_0.2.21 tools_4.3.3
## [71] foreign_0.8-86 nnet_7.3-19
## [73] glue_1.7.0 nlme_3.1-164
## [75] grid_4.3.3 checkmate_2.3.1
## [77] reshape2_1.4.4 cluster_2.1.6
## [79] sva_3.50.0 generics_0.1.3
## [81] gtable_0.3.5 preprocessCore_1.64.0
## [83] data.table_1.15.4 WGCNA_1.72-5
## [85] utf8_1.2.4 XVector_0.42.0
## [87] BiocGenerics_0.48.1 ggrepel_0.9.5
## [89] foreach_1.5.2 pillar_1.9.0
## [91] stringr_1.5.1 limma_3.58.1
## [93] genefilter_1.84.0 circlize_0.4.16
## [95] splines_4.3.3 dplyr_1.1.4
## [97] lattice_0.22-6 survival_3.5-8
## [99] bit_4.0.5 annotate_1.80.0
## [101] tidyselect_1.2.1 locfit_1.5-9.9
## [103] GO.db_3.18.0 ComplexHeatmap_2.18.0
## [105] Biostrings_2.70.3 knitr_1.46
## [107] gridExtra_2.3 bookdown_0.39
## [109] IRanges_2.36.0 edgeR_4.0.16
## [111] SummarizedExperiment_1.32.0 RhpcBLASctl_0.23-42
## [113] stats4_4.3.3 xfun_0.43
## [115] Biobase_2.62.0 statmod_1.5.0
## [117] matrixStats_1.3.0 stringi_1.8.3
## [119] statnet.common_4.9.0 yaml_2.3.8
## [121] minet_3.60.0 evaluate_0.23
## [123] codetools_0.2-20 tibble_3.2.1
## [125] BiocManager_1.30.22 cli_3.6.2
## [127] rpart_4.1.23 xtable_1.8-4
## [129] systemfonts_1.0.6 munsell_0.5.1
## [131] jquerylib_0.1.4 network_1.18.2
## [133] Rcpp_1.0.12 GenomeInfoDb_1.38.8
## [135] coda_0.19-4.1 png_0.1-8
## [137] XML_3.99-0.16.1 fastcluster_1.2.6
## [139] parallel_4.3.3 pkgdown_2.0.9
## [141] ggplot2_3.5.0 blob_1.2.4
## [143] bitops_1.0-7 scales_1.3.0
## [145] purrr_1.0.2 crayon_1.5.2
## [147] GetoptLong_1.0.5 rlang_1.1.3
## [149] KEGGREST_1.42.0