This vignette demonstrates how to use Grand Forest to find a gene module associated with survival in non-small cell lung cancer.
First we need to prepare the gene-gene interaction network. We can obtain a protein-protein interaction network from BioGRID using the simpIntLists package from Bioconductor.
library(data.table)
library(tidyverse)
library(simpIntLists)
data("HumanBioGRIDInteractionEntrezId")
# convert edge lists into two-column data frame
edges <- lapply(HumanBioGRIDInteractionEntrezId, function(x) {
data.frame(
source = as.character(x$name),
target = as.character(x$interactors),
stringsAsFactors = FALSE
)
})
edges <- rbindlist(edges)
head(edges)
#> source target
#> 1: 6416 2318
#> 2: 6416 192176
#> 3: 6416 9043
#> 4: 6416 5599
#> 5: 6416 5871
#> 6: 6416 5609
The resulting data frame should contain two columns with gene identifiers as character
strings. In this example we will be using Entrez Gene IDs, but any other identifier can be used, provided the same type is used in the experimental data as well.
Next we download a gene expression data set from non-small cell lung cancer patients. The dataset was extracted from GSE30219. Besides gene expression values, we also obtain survival information for each patient. The os_time
-column contains the follow-up time for each patient in months, and the os_event
indicates whether an event has happened (0 = no event, 1 = death). Each remaining column corresponds to a gene.
# Download lung cancer expression data set with survival times.
D <- read_csv("https://grandforest.compbio.sdu.dk/files/survival_example.csv.gz")
print(D[1:8,1:8])
#> # A tibble: 8 x 8
#> os_time os_event `780` `100616237` `5982` `3310` `7849` `2978`
#> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 221 0 10.2 10.2 7.16 6.73 5.40 4.33
#> 2 11 1 8.66 8.66 8.27 6.27 5.15 5.28
#> 3 57 1 10.3 10.3 8.68 5.77 4.97 7.79
#> 4 43 1 10.9 10.9 8.05 6.32 5.02 4.57
#> 5 95 1 10.8 10.8 8.33 6.22 5.48 4.52
#> 6 34 1 10.6 10.6 7.64 7.21 4.95 4.14
#> 7 12 0 9.85 9.85 7.27 6.22 5.12 4.46
#> 8 33 1 10.2 10.2 8.77 6.56 5.02 4.77
A Grand Forest model is trained on the expression data using the grandforest
function. We supply the expression data and network with the data
and graph_data
arguments. We also indicate which columns are reponse variables by setting the dependent.variable.name
and status.variable.name
arguments.
library(grandforest)
model <- grandforest(
data = D,
graph_data = edges,
dependent.variable.name = "os_time",
status.variable.name = "os_event",
num.trees=500 # 10000 trees recommended for real analysis
)
print(model)
#> Grand Forest result
#>
#> Call:
#> grandforest(data = D, graph_data = edges, dependent.variable.name = "os_time", status.variable.name = "os_event", num.trees = 500)
#>
#> Type: Survival
#> Number of trees: 500
#> Sample size: 268
#> Number of independent variables: 22600
#> Mtry: 151
#> Target node size: 3
#> Variable importance mode: impurity
#> Splitrule: logrank
#> Number of unique death times: 137
#> OOB prediction error: 0.3645463
In this example we only train 500 decision trees but in a real analysis we recommend using at least 10000 trees for optimal results.
Once we have trained a model we can obtain gene importance estimates using the importance
-method. We can use this to obtain a table of the 25 most important genes:
library(org.Hs.eg.db) # for mapping Entrez IDs to gene names
top25 <- importance(model) %>%
sort(decreasing=TRUE) %>%
head(25) %>%
as_data_frame %>%
rownames_to_column(var="gene") %>%
mutate(label=mapIds(org.Hs.eg.db, gene, "SYMBOL", "ENTREZID"))
print(top25)
#> # A tibble: 25 x 3
#> gene value label
#> <chr> <dbl> <chr>
#> 1 7316 2.30 UBC
#> 2 7157 1.15 TP53
#> 3 672 0.533 BRCA1
#> 4 1956 0.517 EGFR
#> 5 6714 0.482 SRC
#> 6 3065 0.463 HDAC1
#> 7 2885 0.442 GRB2
#> 8 1499 0.438 CTNNB1
#> 9 5781 0.413 PTPN11
#> 10 2534 0.371 FYN
#> # ... with 15 more rows
ggplot(top25, aes(reorder(label, -value), value)) +
geom_bar(stat="identity") +
theme(axis.text.x=element_text(angle=45, hjust=1)) +
labs(x="gene", y="importance")
We can also visualize the gene module as a network. Here we extract the subnetwork induced by the 25 genes and visualize the network using geomnet.
library(geomnet)
subnetwork <- filter(edges, source %in% top25$gene & target %in% top25$gene)
net.df <- fortify(as.edgedf(subnetwork), top25)
ggplot(net.df, aes(from_id=from_id, to_id=to_id)) +
geom_net(aes(colour=importance, label=label),
layout.alg = "circle", directed=FALSE,
colour = "lightblue", size = 15,
labelon = TRUE, labelcolour="black", vjust = 0.5, fontsize=3
) +
theme_net()
#> R version 3.4.4 (2018-03-15)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 16.04.4 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/libblas/libblas.so.3.6.0
#> LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] parallel stats4 stats graphics grDevices utils datasets
#> [8] methods base
#>
#> other attached packages:
#> [1] geomnet_0.2.0 bindrcpp_0.2.2 org.Hs.eg.db_3.5.0
#> [4] AnnotationDbi_1.40.0 IRanges_2.12.0 S4Vectors_0.16.0
#> [7] Biobase_2.38.0 BiocGenerics_0.24.0 grandforest_0.1
#> [10] simpIntLists_1.14.0 forcats_0.3.0 stringr_1.3.1
#> [13] dplyr_0.7.5 purrr_0.2.4 readr_1.1.1
#> [16] tidyr_0.8.1 tibble_1.4.2 ggplot2_2.2.1
#> [19] tidyverse_1.2.1 data.table_1.11.4
#>
#> loaded via a namespace (and not attached):
#> [1] httr_1.3.1 bit64_0.9-7 jsonlite_1.5
#> [4] viridisLite_0.3.0 network_1.13.0.1 modelr_0.1.2
#> [7] assertthat_0.2.0 blob_1.1.1 cellranger_1.1.0
#> [10] yaml_2.1.19 pillar_1.2.2 RSQLite_2.1.1
#> [13] backports_1.1.2 lattice_0.20-35 glue_1.2.0
#> [16] digest_0.6.15 rvest_0.3.2 colorspace_1.3-2
#> [19] htmltools_0.3.6 Matrix_1.2-14 plyr_1.8.4
#> [22] psych_1.8.4 pkgconfig_2.0.1 broom_0.4.4
#> [25] haven_1.1.1 scales_0.5.0 sna_2.4
#> [28] lazyeval_0.2.1 cli_1.0.0 mnormt_1.5-5
#> [31] statnet.common_4.0.0 magrittr_1.5 crayon_1.3.4
#> [34] readxl_1.1.0 memoise_1.1.0 evaluate_0.10.1
#> [37] nlme_3.1-137 xml2_1.2.0 foreign_0.8-70
#> [40] tools_3.4.4 hms_0.4.2 plotly_4.7.1
#> [43] munsell_0.4.3 compiler_3.4.4 rlang_0.2.0
#> [46] grid_3.4.4 rstudioapi_0.7 htmlwidgets_1.2
#> [49] labeling_0.3 rmarkdown_1.9 gtable_0.2.0
#> [52] DBI_1.0.0 curl_3.2 reshape2_1.4.3
#> [55] R6_2.2.2 lubridate_1.7.4 knitr_1.20
#> [58] bit_1.1-13 utf8_1.1.3 bindr_0.1.1
#> [61] rprojroot_1.3-2 stringi_1.2.2 Rcpp_0.12.17
#> [64] tidyselect_0.2.4