OncoSig Introduction

OncoSig is a supervised Machine Learning algorithm for constructing molecular-interaction Signaling Maps (SigMaps) for an oncoprotein specific for a given tumor type. OncoSig integrates features from PrePPI and protein-protein interactions (PPIs) inferred from genomics data from, for example, patient samples with lung adenocarcinoma, by reverse engineering algorithms. A SigMap for KRAS recapitulated published KRAS biology and identified novel proteins synthetic lethal with mutant KRAS, 18 of 21 of which were validated in 3D spheroid models for LUAD. The KRAS LUAD SigMap consists of established and novel K-Ras pathway members and is enriched in known targets of FDA approved drugs.

In this example script, we create a SigMap for the oncoprotein KRAS using a reduced-size version of a network file generated by processing lung adenocarcinoma (LUAD) samples, from the Genome Cancer Atlas project (TCGA). The companion Docker container includes the full version of the network, as well as a network generated from the TCGA colon adenocarcinoma (COAD) sample collection. In addition, information is provided for creating SigMaps for nine other oncoproteins: CDKN2A, EGFR, MAPK, NTRK3, PI3K, TP53, STK11, YAP1 and CTNNB1. Details about how the networks are generated can be found in the OncoSig publication.


OncoSig Setup

Libraries

Load the required libraries.

if(!require("randomForest", quietly = TRUE)) {
  install.packages("randomForest")
  require(randomForest, quietly = TRUE)
}
## Warning: package 'randomForest' was built under R version 4.3.3
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
if(!require("ROCR", quietly = TRUE)) {
  install.packages("ROCR")
  require(ROCR, quietly = TRUE)
}
## Warning: package 'ROCR' was built under R version 4.3.3

Environment

Set the working directory, where the example code and data files are located.

#setwd("/usr/local/RF/")

Code

Source the code needed to run the OncoSig algorithm.

source("./R/listToMatrix.R")
source("./R/OncoSigRF.R")

OncoSig data

Set the network file path and read in the network file.

Network_location <- "./Data_files/luad_network_sample.txt"
Network <- read.delim(Network_location, header = F)

OncoSig network

The network is comprised of human protein-protein interactions (PPIs) based on difference evidence types. Inspect several entries:

Network[sample(1:nrow(Network),10),]
##                   V1     V2          V3
## 156214  Q9UJT9_VIPER Q99996   7.4388000
## 93625   O76080_CINDY Q9BWF2  56.0000000
## 14867  Q9H3D4_ARACNE Q6ZUI0   0.2594047
## 91430   P49795_CINDY Q6N043  52.0000000
## 73285   O75182_CINDY Q07912 180.0000000
## 157999  Q00535_VIPER P31644   3.6184000
## 161535  O75528_VIPER P27986   4.2813000
## 53632   O75340_CINDY Q86VN1 122.0000000
## 185921  Q9Y2Q1_VIPER P48048   7.3987000
## 25779  Q9UH73_ARACNE Q9H4E5   0.5794823

The entries of Network are of the form: UniProt1_Feature UniProt2 Feature_value

The first two columns contain UniProt identifiers for the interacting proteins, e.g. KRAS would be denoted as P01116. The “Feature” term appended to the first UniProt ID gives the type of evidence, and the third column provides the predicted interaction strength. A given PPI (UniProt1-UniProt2) may have one or more of the following evidence types (features) with the respective values: -PREPPI likelhood ratio -ARACNE mutual information -VIPER negative log p-value -CINDY Number of statistically significant triplets

See the OncoSig publication for more detail.

Random forest is a tree-based model and does not require feature scaling. Additional features may be included by modifying the network file.

The network file gets converted to a numerical matrix.

Network$V1 <- as.character(Network$V1)
Network$V2 <- as.character(Network$V2)
Network$V3 <- as.numeric(Network$V3)
Network <- as.matrix(Network)
Network[, 3] <- as.numeric(Network[, 3])
Network_matrix <- listToMatrix(Network)

Rows and columns of Network_matrix correspond to proteins, and the entries are interaction scores. We can examine some of the feature values for KRAS:

KRAS_features<-Network_matrix[row.names(Network_matrix)=="P01116",]
print(round(sample(KRAS_features[KRAS_features>0],20),2))
##  O14757_VIPER  Q9UPQ3_VIPER  Q13185_VIPER Q15006_PREPPI  Q9Y450_VIPER 
##          6.28          5.96          3.00        694.17          6.45 
##  O14545_VIPER  O15120_VIPER P28482_PREPPI  Q92729_VIPER  Q96RU8_VIPER 
##          6.05          8.37      70367.54          3.10          6.13 
## P55196_PREPPI  P17676_VIPER  Q9NSE2_VIPER  Q8IVI9_VIPER Q6ZN18_ARACNE 
##      16507.57          3.85          8.96          8.13          0.22 
##  Q92734_VIPER Q9NRW1_PREPPI  P49716_VIPER  P17544_VIPER  Q99653_VIPER 
##          3.44       1027.80         11.60          5.50          8.53

Define the gold standard set

In this tutorial, the gold standard set comprises canonical KRAS signaling proteins compiled from the KEGG, Biocarta, and Reactome databases as delineated in the Human MSigDB curated gene sets (C2) collection. Gold standard sets for the nine other oncoproteins (CDKN2A, EGFR, MAPK, NTRK3, PI3K, TP53, STK11, YAP1 and CTNNB1) can be selected by modifying the file path and changing KRAS to the appropriate gene symbol. The KRAS gold standard set contains 250 proteins. The negative set corresponds to those proteins not in the gold standard set.

Gold_Standard_location <- "./Data_files/10_oncogene_pathways/KRAS/gold_standard.txt"
Gold_Standard <- read.delim(Gold_Standard_location, header = F)
Gold_Standard$V1 <- as.character(Gold_Standard$V1)

length(Gold_Standard$V1)
## [1] 250
head(Gold_Standard$V1)
## [1] "C9J798" "O00329" "O00459" "O14610" "O14775" "O14807"

Running OncoSig

Create the feature matrix for classification

The feature matrix is compiled by combining Network_matrix and Gold_Standard. A gold standard feature is created where proteins that are part of the gold standard set are assigned a value of 1 and all others proteins are assigned a value of 0.

Network_matrix_df <- as.data.frame(Network_matrix)
Gold_Standard_in_Network_names <- intersect(rownames(Network_matrix_df), Gold_Standard$V1)
Negative_Set_names <- setdiff(rownames(Network_matrix_df), Gold_Standard_in_Network_names)
remove(Network_matrix)

Run the random forest classifier

For the OncoSig publication, with the full feature matrix, the classifier was trained and tested with the oncoprotein’s gold standard and negative sets using Monte Carlo cross-validation, creating 50 forests each with 50 trees. For this sample calculation, 10 trees and 10 iterations are sufficient.

Query_output_results <- OncoSigRF(Network_matrix_df, Gold_Standard_in_Network_names, Fraction_Gold_sample = 0.5, ntrees = 10, max_iterations = 10, balance = 1)
## Running OncoSig
## Fraction of Gold Standard to train on for each Random Forest: 0.5
## Number of Trees per Iteration: 10
## Number of Iterations: 10
## Balance: 1
## Number of positive results to sample: 115
## Number of negative results to sample: 115
## Performing Random Forest
## mtry equals 167
## ntree      OOB      1      2
##     1:  50.00% 10.00% 84.78%
##     2:  45.11%  4.76% 81.43%
##     3:  46.51% 10.47% 82.56%
##     4:  45.31%  8.16% 84.04%
##     5:  45.41%  5.77% 85.44%
##     6:  46.98%  6.54% 87.04%
##     7:  47.09%  6.42% 85.96%
##     8:  45.61%  6.19% 84.35%
##     9:  43.48%  5.22% 81.74%
##    10:  43.48%  2.61% 84.35%
## Performing Random Forest
## mtry equals 167
## ntree      OOB      1      2
##     1:  36.14% 10.87% 67.57%
##     2:  39.29%  8.11% 74.24%
##     3:  38.46%  4.65% 73.49%
##     4:  42.05%  9.80% 77.42%
##     5:  43.78%  6.73% 83.51%
##     6:  46.26%  8.26% 85.71%
##     7:  45.25%  4.50% 86.36%
##     8:  46.88%  6.19% 88.29%
##     9:  45.37%  4.39% 86.73%
##    10:  46.26%  5.26% 87.61%
## Performing Random Forest
## mtry equals 167
## ntree      OOB      1      2
##     1:  37.33%  5.13% 72.22%
##     2:  45.74%  7.69% 84.38%
##     3:  45.35% 11.49% 80.00%
##     4:  44.62% 10.10% 80.21%
##     5:  43.54%  5.66% 82.52%
##     6:  43.58%  5.45% 82.41%
##     7:  44.14%  4.46% 84.55%
##     8:  43.61%  4.35% 83.93%
##     9:  45.41%  4.35% 86.84%
##    10:  44.98%  4.35% 85.96%
## At iteration 3
## Performing Random Forest
## mtry equals 167
## ntree      OOB      1      2
##     1:  45.98%  5.13% 79.17%
##     2:  39.23%  6.06% 73.44%
##     3:  36.90%  4.55% 72.50%
##     4:  38.25%  5.38% 72.22%
##     5:  35.15%  4.90% 66.00%
##     6:  36.74%  2.75% 71.70%
##     7:  40.64%  4.50% 77.78%
##     8:  38.12%  2.65% 74.55%
##     9:  41.33%  3.54% 79.46%
##    10:  40.27%  3.51% 77.68%
## At iteration 4
## Performing Random Forest
## mtry equals 167
## ntree      OOB      1      2
##     1:  43.90%  5.26% 77.27%
##     2:  40.85%  2.86% 77.78%
##     3:  41.67%  1.22% 80.23%
##     4:  45.99%  6.52% 84.21%
##     5:  44.78%  6.00% 83.17%
##     6:  42.92%  5.56% 81.73%
##     7:  42.47%  3.60% 82.41%
##     8:  43.84%  4.50% 84.26%
##     9:  41.52%  3.54% 80.18%
##    10:  44.00%  4.42% 83.93%
## At iteration 5
## Performing Random Forest
## mtry equals 167
## ntree      OOB      1      2
##     1:  46.51%  4.88% 84.44%
##     2:  42.42%  3.08% 80.60%
##     3:  45.18%  4.82% 85.54%
##     4:  42.63%  4.17% 81.91%
##     5:  43.06%  5.71% 80.77%
##     6:  44.29%  6.36% 82.57%
##     7:  41.63%  3.60% 80.00%
##     8:  42.92%  5.26% 81.25%
##     9:  43.61%  4.39% 83.19%
##    10:  44.74%  5.26% 84.21%
## At iteration 6
## Performing Random Forest
## mtry equals 167
## ntree      OOB      1      2
##     1:  37.50%  2.27% 72.73%
##     2:  42.67%  2.94% 75.61%
##     3:  43.41%  3.45% 80.00%
##     4:  42.64%  5.15% 79.00%
##     5:  38.94%  3.88% 73.33%
##     6:  39.63%  2.70% 78.30%
##     7:  39.46%  3.51% 77.06%
##     8:  40.89%  3.48% 80.00%
##     9:  40.97%  1.74% 81.25%
##    10:  41.67%  1.74% 82.30%
## At iteration 7
## Performing Random Forest
## mtry equals 167
## ntree      OOB      1      2
##     1:  36.05%  4.26% 74.36%
##     2:  43.18% 10.14% 79.37%
##     3:  45.71%  6.90% 84.09%
##     4:  44.74%  8.33% 81.91%
##     5:  45.85% 10.78% 80.58%
##     6:  44.95%  8.33% 80.91%
##     7:  46.64%  9.01% 83.93%
##     8:  45.78%  6.25% 84.96%
##     9:  48.46%  7.08% 89.47%
##    10:  46.72%  8.70% 85.09%
## At iteration 8
## Performing Random Forest
## mtry equals 167
## ntree      OOB      1      2
##     1:  54.02% 10.53% 87.76%
##     2:  50.37%  8.06% 86.30%
##     3:  48.31% 11.63% 82.61%
##     4:  52.60% 13.19% 88.12%
##     5:  47.60% 10.20% 80.91%
##     6:  45.21%  5.61% 83.04%
##     7:  46.40%  6.36% 85.71%
##     8:  44.20%  3.60% 84.07%
##     9:  44.89%  4.50% 84.21%
##    10:  45.41%  3.51% 86.96%
## At iteration 9
## Performing Random Forest
## mtry equals 167
## ntree      OOB      1      2
##     1:  44.71%  5.13% 78.26%
##     2:  42.06%  3.28% 78.46%
##     3:  41.07%  6.25% 72.73%
##     4:  42.63%  5.43% 77.55%
##     5:  44.93%  7.14% 78.90%
##     6:  43.32%  7.62% 76.79%
##     7:  43.64%  6.48% 79.46%
##     8:  43.11%  6.31% 78.95%
##     9:  43.17% 10.62% 75.44%
##    10:  44.30% 11.40% 77.19%
## At iteration 10

Examine the output

library("data.table")
Query_output_results_scores <- setorder(Query_output_results[[1]],-Score)
head(Query_output_results_scores,10)
##            Score
## Q15831 0.9700000
## Q13822 0.9000000
## P08575 0.8900000
## Q6GYQ0 0.8888889
## P05771 0.8800000
## Q8WZ64 0.8800000
## Q16799 0.8700000
## Q86YN6 0.8600000
## P00533 0.8500000
## Q99490 0.8500000

Among the top-scoring KRAS LUAD-specific SigMap members are proteins in the gold standard set (PRKCB and PLCG2) and novel predictions (ARAP2 and STK11).

Save the results

result_dir <- "../result/"
if (!dir.exists(result_dir)) dir.create(result_dir, recursive = TRUE)
write.csv(Query_output_results_scores, paste0(result_dir, "results.csv"))

OncoSig evaluation

ROC curves

Calculate receiver operating characteristic (ROC) curves. Save it to a PDF file by setting save_to_file = TRUE.

The full ROC curve:

save_to_file = FALSE
if (save_to_file)
  pdf(file = paste0(result_dir, "KRAS_ROC.pdf"))
Query_output_results_scores <- Query_output_results[[1]]
Query_output_results_scores$label <- 0
Query_output_results_scores[Gold_Standard_in_Network_names, 2] <- 1
Query_output_results_scores <- na.omit(Query_output_results_scores)
pred <- prediction(Query_output_results_scores$Score, Query_output_results_scores$label)
perf <- performance(pred, measure = "tpr", x.measure = "fpr")
plot(perf, col = "black", xlim = c(0, 1), lwd = 4)

if (save_to_file)
  dev.off()

The ROC curve at low FPR < 0.05:

if (save_to_file)
  pdf(file = paste0(result_dir, "KRAS_ROC_frp05.pdf"), height = 5, width = 5)
perf <- performance(pred, measure = "tpr", x.measure = "fpr")
plot(perf, col = "black", xlim = c(0, 0.05), lwd = 4)

if (save_to_file)
  dev.off()

Summary

The term pathway, although widely used, is a loosely defined biological concept. SigMaps are a fundamentally different representation of the signaling and regulatory machinery necessary to modulate and affect the function of a specific protein of interest in a specific tissue context, which is equivalent to a protein’s mechanism of action.

OncoSig generates a single integrated score representing the probability that a protein belongs to a specific SigMap. Use of PrePPI is instrumental for identifying PPIs, whereas ARACNe, VIPER and CINDy provide critical tissue specificity and additional evidence supporting both physical and functional interactions. Taken together, these individual evidence sources can effectively place proteins within the conceptual molecular interaction architecture.