This vignette demonstrates how to perform an unsupervised analysis with Grand Forest to search for de novo endophenotypes and gene modules in non-small cell lung cancer.
First we need to prepare the gene-gene interaction network. We can obtain a protein-protein interaction network from BioGRID using the simpIntLists package from Bioconductor.
library(data.table)
library(tidyverse)
library(simpIntLists)
data("HumanBioGRIDInteractionEntrezId")
# convert edge lists into two-column data frame
edges <- lapply(HumanBioGRIDInteractionEntrezId, function(x) {
data.frame(
source = as.character(x$name),
target = as.character(x$interactors),
stringsAsFactors = FALSE
)
})
edges <- rbindlist(edges)
head(edges)
#> source target
#> 1: 6416 2318
#> 2: 6416 192176
#> 3: 6416 9043
#> 4: 6416 5599
#> 5: 6416 5871
#> 6: 6416 5609
The resulting data frame should contain two columns with gene identifiers as character
strings. In this example we will be using Entrez Gene IDs, but any other identifier can be used, provided the same type is used in the experimental data as well.
Next we download a gene expression data set from non-small cell lung cancer patients. The dataset was extracted from GSE30219. The first two columns contain survival information for each patient. Since we will be doing unsupervised analysis, we will simply remove these columns. The resulting data frame contains a column for each gene with gene expression values for each patient.
# Download lung cancer expression data set with survival times.
D <- read_csv("https://grandforest.compbio.sdu.dk/files/survival_example.csv.gz")
# remove survival information and keep for latter
survival <- D[,1:2]
D <- D[,-(1:2)]
print(D[1:8,1:8])
#> # A tibble: 8 x 8
#> `780` `100616237` `5982` `3310` `7849` `2978` `7318` `100847079`
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 10.2 10.2 7.16 6.73 5.40 4.33 8.02 8.02
#> 2 8.66 8.66 8.27 6.27 5.15 5.28 7.00 7.00
#> 3 10.3 10.3 8.68 5.77 4.97 7.79 7.18 7.18
#> 4 10.9 10.9 8.05 6.32 5.02 4.57 8.90 8.90
#> 5 10.8 10.8 8.33 6.22 5.48 4.52 6.42 6.42
#> 6 10.6 10.6 7.64 7.21 4.95 4.14 8.12 8.12
#> 7 9.85 9.85 7.27 6.22 5.12 4.46 7.46 7.46
#> 8 10.2 10.2 8.77 6.56 5.02 4.77 6.64 6.64
We can train an unsupervised model using the grandforest_unsupervised
helper function. The function will generate a background distribution from the input data and train a Grand Forest model to recognize the foreground from the background.
library(grandforest)
model <- grandforest_unsupervised(
data = D,
graph_data = edges,
num.trees = 1000
)
In this example we only train 1000 decision trees but in a real analysis we recommend using at least 10000 trees for optimal results.
Once we have trained a model we can obtain gene importance estimates using the importance
-method. We can use this to obtain a table of the 25 most important genes:
library(org.Hs.eg.db) # for mapping Entrez IDs to gene names
top25 <- importance(model) %>%
sort(decreasing=TRUE) %>%
head(25) %>%
as_data_frame %>%
rownames_to_column(var="gene") %>%
mutate(label=mapIds(org.Hs.eg.db, gene, "SYMBOL", "ENTREZID"))
print(top25)
#> # A tibble: 25 x 3
#> gene value label
#> <chr> <dbl> <chr>
#> 1 7316 1.33 UBC
#> 2 7157 1.02 TP53
#> 3 672 0.883 BRCA1
#> 4 1956 0.675 EGFR
#> 5 3065 0.647 HDAC1
#> 6 367 0.642 AR
#> 7 2885 0.556 GRB2
#> 8 6667 0.477 SP1
#> 9 6714 0.473 SRC
#> 10 2099 0.471 ESR1
#> # ... with 15 more rows
ggplot(top25, aes(reorder(label, -value), value)) +
geom_bar(stat="identity") +
theme_classic() + theme(axis.text.x=element_text(angle=45, hjust=1)) +
labs(x="gene", y="importance")
We can also visualize the gene module as a network. Here we extract the subnetwork induced by the 25 genes and visualize the network using geomnet.
library(geomnet)
subnetwork <- filter(edges, source %in% top25$gene & target %in% top25$gene)
net.df <- fortify(as.edgedf(subnetwork), top25)
ggplot(net.df, aes(from_id=from_id, to_id=to_id)) +
geom_net(aes(colour=importance, label=label),
layout.alg = "circle", directed=FALSE,
colour = "lightblue", size = 15,
labelon = TRUE, labelcolour="black", vjust = 0.5, fontsize=3
) +
theme_net()
We can stratify the patients into new endophenotypes by clustering them using the genes in the gene module we obtained with Grand Forest.
We first extract the 25 most important genes and scale/center the expression values. Then we cluster the patients into two groups using k-means clustering.
# Extract module features and scale/center values.
D.scaled <- scale(D[,top25$gene], center=TRUE, scale=TRUE)
colnames(D.scaled) <- top25$label
# Cluster into two groups using k-means
cl <- kmeans(D.scaled, centers=2, nstart=20)
print(cl$cluster)
#> [1] 2 2 1 1 1 2 2 2 1 1 1 2 2 2 2 2 2 1 1 1 1 2 1 1 1 1 2 2 1 1 2 1 1 2 2
#> [36] 2 2 2 2 2 2 2 1 1 2 2 1 1 2 1 1 1 2 2 2 2 2 2 1 1 2 1 1 1 1 1 1 1 1 1
#> [71] 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 1 2 1 1 2 1 1 2 2 1 2 2 2 2 1 1 1 2
#> [106] 1 1 2 2 1 2 1 2 2 2 1 2 1 2 2 1 2 1 2 2 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1
#> [141] 2 1 1 1 2 1 1 2 1 1 1 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2
#> [176] 2 2 1 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#> [211] 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#> [246] 2 2 1 1 1 1 1 1 1 1 1 2 1 2 2 1 1 1 1 1 2 1 2
We can visualize the clustering as a heatmap to check which genes appear to be up- and downregulated in each group.
library(ComplexHeatmap)
Heatmap(D.scaled, split = cl$cluster, name = "expression")
We can also use the survival information we removed earlier to compare the survival curves for the two endophenotypes we identified. Here we observe that the stratification we identified is highly significantly associated to overall survival.
library(survival)
library(survminer)
cl.survival <- data.frame(survival, cluster=cl$cluster)
ggsurvplot(survfit(Surv(os_time, os_event)~cluster, data=cl.survival), pval=TRUE)$plot
#> R version 3.4.4 (2018-03-15)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 16.04.4 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/libblas/libblas.so.3.6.0
#> LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] grid parallel stats4 stats graphics grDevices utils
#> [8] datasets methods base
#>
#> other attached packages:
#> [1] survminer_0.4.2 ggpubr_0.1.6 magrittr_1.5
#> [4] survival_2.42-3 ComplexHeatmap_1.17.1 geomnet_0.2.0
#> [7] bindrcpp_0.2.2 org.Hs.eg.db_3.5.0 AnnotationDbi_1.40.0
#> [10] IRanges_2.12.0 S4Vectors_0.16.0 Biobase_2.38.0
#> [13] BiocGenerics_0.24.0 grandforest_0.1 simpIntLists_1.14.0
#> [16] forcats_0.3.0 stringr_1.3.1 dplyr_0.7.5
#> [19] purrr_0.2.4 readr_1.1.1 tidyr_0.8.1
#> [22] tibble_1.4.2 ggplot2_2.2.1 tidyverse_1.2.1
#> [25] data.table_1.11.4
#>
#> loaded via a namespace (and not attached):
#> [1] nlme_3.1-137 cmprsk_2.2-7 lubridate_1.7.4
#> [4] bit64_0.9-7 RColorBrewer_1.1-2 httr_1.3.1
#> [7] rprojroot_1.3-2 tools_3.4.4 backports_1.1.2
#> [10] utf8_1.1.3 R6_2.2.2 DBI_1.0.0
#> [13] lazyeval_0.2.1 colorspace_1.3-2 GetoptLong_0.1.6
#> [16] tidyselect_0.2.4 gridExtra_2.3 mnormt_1.5-5
#> [19] bit_1.1-13 curl_3.2 compiler_3.4.4
#> [22] cli_1.0.0 rvest_0.3.2 xml2_1.2.0
#> [25] network_1.13.0.1 plotly_4.7.1 labeling_0.3
#> [28] scales_0.5.0 survMisc_0.5.4 psych_1.8.4
#> [31] digest_0.6.15 foreign_0.8-70 rmarkdown_1.9
#> [34] pkgconfig_2.0.1 htmltools_0.3.6 htmlwidgets_1.2
#> [37] rlang_0.2.0 GlobalOptions_0.0.13 readxl_1.1.0
#> [40] rstudioapi_0.7 RSQLite_2.1.1 shape_1.4.4
#> [43] bindr_0.1.1 zoo_1.8-1 jsonlite_1.5
#> [46] statnet.common_4.0.0 Matrix_1.2-14 Rcpp_0.12.17
#> [49] munsell_0.4.3 stringi_1.2.2 yaml_2.1.19
#> [52] plyr_1.8.4 blob_1.1.1 crayon_1.3.4
#> [55] lattice_0.20-35 haven_1.1.1 splines_3.4.4
#> [58] circlize_0.4.3 hms_0.4.2 sna_2.4
#> [61] knitr_1.20 pillar_1.2.2 rjson_0.2.19
#> [64] reshape2_1.4.3 glue_1.2.0 evaluate_0.10.1
#> [67] modelr_0.1.2 cellranger_1.1.0 gtable_0.2.0
#> [70] km.ci_0.5-2 assertthat_0.2.0 xtable_1.8-2
#> [73] broom_0.4.4 viridisLite_0.3.0 KMsurv_0.1-5
#> [76] memoise_1.1.0