OncoSig is a supervised Machine Learning algorithm for constructing molecular-interaction Signaling Maps (SigMaps) for an oncoprotein specific for a given tumor type. OncoSig integrates features from PrePPI and protein-protein interactions (PPIs) inferred from genomics data from, for example, patient samples with lung adenocarcinoma, by reverse engineering algorithms. A SigMap for KRAS recapitulated published KRAS biology and identified novel proteins synthetic lethal with mutant KRAS, 18 of 21 of which were validated in 3D spheroid models for LUAD. The KRAS LUAD SigMap consists of established and novel K-Ras pathway members and is enriched in known targets of FDA approved drugs.
In this example script, we create a SigMap for the oncoprotein KRAS using a reduced-size version of a network file generated by processing lung adenocarcinoma (LUAD) samples, from the Genome Cancer Atlas project (TCGA). The companion Docker container includes the full version of the network, as well as a network generated from the TCGA colon adenocarcinoma (COAD) sample collection. In addition, information is provided for creating SigMaps for nine other oncoproteins: CDKN2A, EGFR, MAPK, NTRK3, PI3K, TP53, STK11, YAP1 and CTNNB1. Details about how the networks are generated can be found in the OncoSig publication.
Load the required libraries.
if(!require("randomForest", quietly = TRUE)) {
install.packages("randomForest")
require(randomForest, quietly = TRUE)
}
## Warning: package 'randomForest' was built under R version 4.3.3
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
if(!require("ROCR", quietly = TRUE)) {
install.packages("ROCR")
require(ROCR, quietly = TRUE)
}
## Warning: package 'ROCR' was built under R version 4.3.3
Set the working directory, where the example code and data files are located.
#setwd("/usr/local/RF/")
Source the code needed to run the OncoSig algorithm.
source("./R/listToMatrix.R")
source("./R/OncoSigRF.R")
Set the network file path and read in the network file.
Network_location <- "./Data_files/luad_network_sample.txt"
Network <- read.delim(Network_location, header = F)
The network is comprised of human protein-protein interactions (PPIs) based on difference evidence types. Inspect several entries:
Network[sample(1:nrow(Network),10),]
## V1 V2 V3
## 156214 Q9UJT9_VIPER Q99996 7.4388000
## 93625 O76080_CINDY Q9BWF2 56.0000000
## 14867 Q9H3D4_ARACNE Q6ZUI0 0.2594047
## 91430 P49795_CINDY Q6N043 52.0000000
## 73285 O75182_CINDY Q07912 180.0000000
## 157999 Q00535_VIPER P31644 3.6184000
## 161535 O75528_VIPER P27986 4.2813000
## 53632 O75340_CINDY Q86VN1 122.0000000
## 185921 Q9Y2Q1_VIPER P48048 7.3987000
## 25779 Q9UH73_ARACNE Q9H4E5 0.5794823
The entries of Network are of the form: UniProt1_Feature
UniProt2 Feature_value
The first two columns contain UniProt identifiers for the interacting proteins, e.g. KRAS would be denoted as P01116. The “Feature” term appended to the first UniProt ID gives the type of evidence, and the third column provides the predicted interaction strength. A given PPI (UniProt1-UniProt2) may have one or more of the following evidence types (features) with the respective values: -PREPPI likelhood ratio -ARACNE mutual information -VIPER negative log p-value -CINDY Number of statistically significant triplets
See the OncoSig publication for more detail.
Random forest is a tree-based model and does not require feature scaling. Additional features may be included by modifying the network file.
The network file gets converted to a numerical matrix.
Network$V1 <- as.character(Network$V1)
Network$V2 <- as.character(Network$V2)
Network$V3 <- as.numeric(Network$V3)
Network <- as.matrix(Network)
Network[, 3] <- as.numeric(Network[, 3])
Network_matrix <- listToMatrix(Network)
Rows and columns of Network_matrix correspond to
proteins, and the entries are interaction scores. We can examine some of
the feature values for KRAS:
KRAS_features<-Network_matrix[row.names(Network_matrix)=="P01116",]
print(round(sample(KRAS_features[KRAS_features>0],20),2))
## O14757_VIPER Q9UPQ3_VIPER Q13185_VIPER Q15006_PREPPI Q9Y450_VIPER
## 6.28 5.96 3.00 694.17 6.45
## O14545_VIPER O15120_VIPER P28482_PREPPI Q92729_VIPER Q96RU8_VIPER
## 6.05 8.37 70367.54 3.10 6.13
## P55196_PREPPI P17676_VIPER Q9NSE2_VIPER Q8IVI9_VIPER Q6ZN18_ARACNE
## 16507.57 3.85 8.96 8.13 0.22
## Q92734_VIPER Q9NRW1_PREPPI P49716_VIPER P17544_VIPER Q99653_VIPER
## 3.44 1027.80 11.60 5.50 8.53
In this tutorial, the gold standard set comprises canonical KRAS signaling proteins compiled from the KEGG, Biocarta, and Reactome databases as delineated in the Human MSigDB curated gene sets (C2) collection. Gold standard sets for the nine other oncoproteins (CDKN2A, EGFR, MAPK, NTRK3, PI3K, TP53, STK11, YAP1 and CTNNB1) can be selected by modifying the file path and changing KRAS to the appropriate gene symbol. The KRAS gold standard set contains 250 proteins. The negative set corresponds to those proteins not in the gold standard set.
Gold_Standard_location <- "./Data_files/10_oncogene_pathways/KRAS/gold_standard.txt"
Gold_Standard <- read.delim(Gold_Standard_location, header = F)
Gold_Standard$V1 <- as.character(Gold_Standard$V1)
length(Gold_Standard$V1)
## [1] 250
head(Gold_Standard$V1)
## [1] "C9J798" "O00329" "O00459" "O14610" "O14775" "O14807"
The feature matrix is compiled by combining
Network_matrix and Gold_Standard. A gold
standard feature is created where proteins that are part of the gold
standard set are assigned a value of 1 and all others proteins are
assigned a value of 0.
Network_matrix_df <- as.data.frame(Network_matrix)
Gold_Standard_in_Network_names <- intersect(rownames(Network_matrix_df), Gold_Standard$V1)
Negative_Set_names <- setdiff(rownames(Network_matrix_df), Gold_Standard_in_Network_names)
remove(Network_matrix)
For the OncoSig publication, with the full feature matrix, the classifier was trained and tested with the oncoprotein’s gold standard and negative sets using Monte Carlo cross-validation, creating 50 forests each with 50 trees. For this sample calculation, 10 trees and 10 iterations are sufficient.
Query_output_results <- OncoSigRF(Network_matrix_df, Gold_Standard_in_Network_names, Fraction_Gold_sample = 0.5, ntrees = 10, max_iterations = 10, balance = 1)
## Running OncoSig
## Fraction of Gold Standard to train on for each Random Forest: 0.5
## Number of Trees per Iteration: 10
## Number of Iterations: 10
## Balance: 1
## Number of positive results to sample: 115
## Number of negative results to sample: 115
## Performing Random Forest
## mtry equals 167
## ntree OOB 1 2
## 1: 50.00% 10.00% 84.78%
## 2: 45.11% 4.76% 81.43%
## 3: 46.51% 10.47% 82.56%
## 4: 45.31% 8.16% 84.04%
## 5: 45.41% 5.77% 85.44%
## 6: 46.98% 6.54% 87.04%
## 7: 47.09% 6.42% 85.96%
## 8: 45.61% 6.19% 84.35%
## 9: 43.48% 5.22% 81.74%
## 10: 43.48% 2.61% 84.35%
## Performing Random Forest
## mtry equals 167
## ntree OOB 1 2
## 1: 36.14% 10.87% 67.57%
## 2: 39.29% 8.11% 74.24%
## 3: 38.46% 4.65% 73.49%
## 4: 42.05% 9.80% 77.42%
## 5: 43.78% 6.73% 83.51%
## 6: 46.26% 8.26% 85.71%
## 7: 45.25% 4.50% 86.36%
## 8: 46.88% 6.19% 88.29%
## 9: 45.37% 4.39% 86.73%
## 10: 46.26% 5.26% 87.61%
## Performing Random Forest
## mtry equals 167
## ntree OOB 1 2
## 1: 37.33% 5.13% 72.22%
## 2: 45.74% 7.69% 84.38%
## 3: 45.35% 11.49% 80.00%
## 4: 44.62% 10.10% 80.21%
## 5: 43.54% 5.66% 82.52%
## 6: 43.58% 5.45% 82.41%
## 7: 44.14% 4.46% 84.55%
## 8: 43.61% 4.35% 83.93%
## 9: 45.41% 4.35% 86.84%
## 10: 44.98% 4.35% 85.96%
## At iteration 3
## Performing Random Forest
## mtry equals 167
## ntree OOB 1 2
## 1: 45.98% 5.13% 79.17%
## 2: 39.23% 6.06% 73.44%
## 3: 36.90% 4.55% 72.50%
## 4: 38.25% 5.38% 72.22%
## 5: 35.15% 4.90% 66.00%
## 6: 36.74% 2.75% 71.70%
## 7: 40.64% 4.50% 77.78%
## 8: 38.12% 2.65% 74.55%
## 9: 41.33% 3.54% 79.46%
## 10: 40.27% 3.51% 77.68%
## At iteration 4
## Performing Random Forest
## mtry equals 167
## ntree OOB 1 2
## 1: 43.90% 5.26% 77.27%
## 2: 40.85% 2.86% 77.78%
## 3: 41.67% 1.22% 80.23%
## 4: 45.99% 6.52% 84.21%
## 5: 44.78% 6.00% 83.17%
## 6: 42.92% 5.56% 81.73%
## 7: 42.47% 3.60% 82.41%
## 8: 43.84% 4.50% 84.26%
## 9: 41.52% 3.54% 80.18%
## 10: 44.00% 4.42% 83.93%
## At iteration 5
## Performing Random Forest
## mtry equals 167
## ntree OOB 1 2
## 1: 46.51% 4.88% 84.44%
## 2: 42.42% 3.08% 80.60%
## 3: 45.18% 4.82% 85.54%
## 4: 42.63% 4.17% 81.91%
## 5: 43.06% 5.71% 80.77%
## 6: 44.29% 6.36% 82.57%
## 7: 41.63% 3.60% 80.00%
## 8: 42.92% 5.26% 81.25%
## 9: 43.61% 4.39% 83.19%
## 10: 44.74% 5.26% 84.21%
## At iteration 6
## Performing Random Forest
## mtry equals 167
## ntree OOB 1 2
## 1: 37.50% 2.27% 72.73%
## 2: 42.67% 2.94% 75.61%
## 3: 43.41% 3.45% 80.00%
## 4: 42.64% 5.15% 79.00%
## 5: 38.94% 3.88% 73.33%
## 6: 39.63% 2.70% 78.30%
## 7: 39.46% 3.51% 77.06%
## 8: 40.89% 3.48% 80.00%
## 9: 40.97% 1.74% 81.25%
## 10: 41.67% 1.74% 82.30%
## At iteration 7
## Performing Random Forest
## mtry equals 167
## ntree OOB 1 2
## 1: 36.05% 4.26% 74.36%
## 2: 43.18% 10.14% 79.37%
## 3: 45.71% 6.90% 84.09%
## 4: 44.74% 8.33% 81.91%
## 5: 45.85% 10.78% 80.58%
## 6: 44.95% 8.33% 80.91%
## 7: 46.64% 9.01% 83.93%
## 8: 45.78% 6.25% 84.96%
## 9: 48.46% 7.08% 89.47%
## 10: 46.72% 8.70% 85.09%
## At iteration 8
## Performing Random Forest
## mtry equals 167
## ntree OOB 1 2
## 1: 54.02% 10.53% 87.76%
## 2: 50.37% 8.06% 86.30%
## 3: 48.31% 11.63% 82.61%
## 4: 52.60% 13.19% 88.12%
## 5: 47.60% 10.20% 80.91%
## 6: 45.21% 5.61% 83.04%
## 7: 46.40% 6.36% 85.71%
## 8: 44.20% 3.60% 84.07%
## 9: 44.89% 4.50% 84.21%
## 10: 45.41% 3.51% 86.96%
## At iteration 9
## Performing Random Forest
## mtry equals 167
## ntree OOB 1 2
## 1: 44.71% 5.13% 78.26%
## 2: 42.06% 3.28% 78.46%
## 3: 41.07% 6.25% 72.73%
## 4: 42.63% 5.43% 77.55%
## 5: 44.93% 7.14% 78.90%
## 6: 43.32% 7.62% 76.79%
## 7: 43.64% 6.48% 79.46%
## 8: 43.11% 6.31% 78.95%
## 9: 43.17% 10.62% 75.44%
## 10: 44.30% 11.40% 77.19%
## At iteration 10
library("data.table")
Query_output_results_scores <- setorder(Query_output_results[[1]],-Score)
head(Query_output_results_scores,10)
## Score
## Q15831 0.9700000
## Q13822 0.9000000
## P08575 0.8900000
## Q6GYQ0 0.8888889
## P05771 0.8800000
## Q8WZ64 0.8800000
## Q16799 0.8700000
## Q86YN6 0.8600000
## P00533 0.8500000
## Q99490 0.8500000
Among the top-scoring KRAS LUAD-specific SigMap members are proteins in the gold standard set (PRKCB and PLCG2) and novel predictions (ARAP2 and STK11).
result_dir <- "../result/"
if (!dir.exists(result_dir)) dir.create(result_dir, recursive = TRUE)
write.csv(Query_output_results_scores, paste0(result_dir, "results.csv"))
Calculate receiver operating characteristic (ROC) curves. Save it to
a PDF file by setting save_to_file = TRUE.
The full ROC curve:
save_to_file = FALSE
if (save_to_file)
pdf(file = paste0(result_dir, "KRAS_ROC.pdf"))
Query_output_results_scores <- Query_output_results[[1]]
Query_output_results_scores$label <- 0
Query_output_results_scores[Gold_Standard_in_Network_names, 2] <- 1
Query_output_results_scores <- na.omit(Query_output_results_scores)
pred <- prediction(Query_output_results_scores$Score, Query_output_results_scores$label)
perf <- performance(pred, measure = "tpr", x.measure = "fpr")
plot(perf, col = "black", xlim = c(0, 1), lwd = 4)
if (save_to_file)
dev.off()
The ROC curve at low FPR < 0.05:
if (save_to_file)
pdf(file = paste0(result_dir, "KRAS_ROC_frp05.pdf"), height = 5, width = 5)
perf <- performance(pred, measure = "tpr", x.measure = "fpr")
plot(perf, col = "black", xlim = c(0, 0.05), lwd = 4)
if (save_to_file)
dev.off()
The term pathway, although widely used, is a loosely defined biological concept. SigMaps are a fundamentally different representation of the signaling and regulatory machinery necessary to modulate and affect the function of a specific protein of interest in a specific tissue context, which is equivalent to a protein’s mechanism of action.
OncoSig generates a single integrated score representing the probability that a protein belongs to a specific SigMap. Use of PrePPI is instrumental for identifying PPIs, whereas ARACNe, VIPER and CINDy provide critical tissue specificity and additional evidence supporting both physical and functional interactions. Taken together, these individual evidence sources can effectively place proteins within the conceptual molecular interaction architecture.