The NCI-60 cell lines are the most frequently studied human tumor cell lines in cancer research. This panel has generated the most extensive cancer pharmacology database worldwide. In addition, these cell lines have been intensely investigated, providing a unique platform for hypothesis-driven research focused on enhancing our understanding of tumor biology. Here, we report a comprehensive analysis of coding variants in the NCI-60 panel of cell lines identified by whole exome sequencing, providing a list of possible cancer specific variants for the community. Furthermore, we identify pharmacogenomic correlations between compounds and associated drug screen data.
The variants have been subjected to a set of filters to:
TRUE
or FALSE
Each cell line has been treated with MANY different compounds and the growth of the cells monitored. The concentration of each compound is titrated so that one can observe a growth response curve. The concentration of a given compound that results in 50% of the maximal reduction in cell growth. This concentration, the 50% Growth Inhibition concentration, or GI50, is a measure of the effective of a given compound on a given cell line. Note that this is a “relative” measurement.
All screened compound data can be downloaded from https://wiki.nci.nih.gov/display/NCIDTPdata/NCI-60+Growth+Inhibition+Data. The GI50 Data (Sept 2014) release contains a file CANCER60GI50.csv. We used this file to extract a few compounds of interest for prediction models.
We have processed the data down to Gene X Sample after removing common variants. These data are directly available at the Github site, so we can read them into R.
# We can grab the data directly from github.
tmp <- read.csv("https://raw.githubusercontent.com/seandavi/NCI60_SuperLearner/master/Data/VariantTableByGene.csv")
VariantTable = as.matrix(tmp[,-1])
rownames(VariantTable) = tmp[,1]
## Load drug/agent GI50 data
DrugDat <- read.csv("https://raw.githubusercontent.com/seandavi/NCI60_SuperLearner/master/Data/AOD_IOA_GI50.csv")
# find overlap in cell lines
setdiff(rownames(VariantTable), unique(DrugDat$CELL))
## [1] "A549_ATCC" "COLO-205" "DU145" "HCC2998" "HL-60"
## [6] "Hs_578T" "IGR-OV1" "K562" "LOX_IMVI" "MDA-MB-231"
## [11] "NCI-ADR-RES" "RXF-393"
setdiff(unique(DrugDat$CELL), rownames(VariantTable))
## [1] "A549/ATCC" "COLO 205" "NCI/ADR-RES"
## [4] "HS 578T" "IGROV1" "K-562"
## [7] "RXF 393" "DU-145" "HCC-2998"
## [10] "RKOp53RE1" "HT29p53RE22" "H1299p53RE29"
## [13] "RKO Waf1" "MDA-MB-231/ATCC" "T47D FOS1"
## [16] "T47D NFkB15" "T47D ERE4" "MCF7-E6"
## [19] "HL-60(TB)" "LOX IMVI" "LXFL 529"
## [22] "DMS 114" "DMS 273" "DLD-1"
## [25] "KM20L2" "RXF-631" "UOK-57"
## [28] "M19-MEL" "SNB-78" "XF 498"
## [31] "HOP-18" "P388" "P388/ADR"
## [34] "SN12K1" "CALU-1" "MLI-076"
## [37] "CACO-2" "MCF7/ATCC" "SK-BR-3"
## [40] "MAXF 401" "ES-2" "SW-156"
## [43] "TK-164" "UABMEL3" "JCA-1"
## [46] "ND-1" "TSU-PRI" "A204/ATCC"
## [49] "A673" "HT" "RL"
## [52] "DB" "NB4"
# fix names in exome data (to match the names in the drug data)
rownames(VariantTable) <- sub("A549_ATCC", "A549/ATCC", rownames(VariantTable))
rownames(VariantTable) <- sub("COLO-205", "COLO 205", rownames(VariantTable))
rownames(VariantTable) <- sub("DU145", "DU-145", rownames(VariantTable))
rownames(VariantTable) <- sub("HCC2998", "HCC-2998", rownames(VariantTable))
rownames(VariantTable) <- sub("HL-60", "HL-60(TB)", rownames(VariantTable))
rownames(VariantTable) <- sub("Hs_578T", "HS 578T", rownames(VariantTable))
rownames(VariantTable) <- sub("IGR-OV1", "IGROV1", rownames(VariantTable))
rownames(VariantTable) <- sub("K562", "K-562", rownames(VariantTable))
rownames(VariantTable) <- sub("LOX_IMVI", "LOX IMVI", rownames(VariantTable))
rownames(VariantTable) <- sub("MDA-MB-231", "MDA-MB-231/ATCC", rownames(VariantTable))
rownames(VariantTable) <- sub("NCI-ADR-RES", "NCI/ADR-RES", rownames(VariantTable))
rownames(VariantTable) <- sub("RXF-393", "RXF 393", rownames(VariantTable))
# restrict drug data to overlap with exome data
DrugDat <- DrugDat[which(DrugDat$CELL %in% rownames(VariantTable)), ]
# Change our drug data into a more usable format
GI50Wide <- reshape(DrugDat[, c("NSC", "CELL", "NLOGGI50")], direction = "wide", timevar = "NSC", idvar = "CELL")
colnames(GI50Wide) <- sub("NLOGGI50.", "NSC", colnames(GI50Wide))
# use cell line name as rowname
rownames(GI50Wide) <- GI50Wide[, 1]
# remove cell line name, and line up cell lines with Variant Table
GI50Wide <- GI50Wide[rownames(VariantTable), -1]
all.equal(rownames(VariantTable), rownames(GI50Wide))
## [1] TRUE
## filter VariantTable -- why?
CountVar <- colSums(VariantTable)
VariantTable_sub <- as.data.frame(VariantTable[, CountVar > 4])
## clean variable names in variant table
colnames(VariantTable_sub) <- make.names(colnames(VariantTable_sub)) # some genes have "-" in the name, but this creates problems for regression models/formulas, replace with "."
rfPredict = function(i) {
require(randomForest)
idx = !is.na(GI50Wide[,i])
rf = randomForest(x = VariantTable_sub[idx, ], y = GI50Wide[idx, i])
return(rf)
}
We can run the randomForest model in parallel using all the cores on our machines.
library(BiocParallel)
register(MulticoreParam())
res = bplapply(seq_along(GI50Wide),rfPredict)
Let’s use the amount of variation explained as a measure of “goodness” of fit.
varExplained = sapply(res,function(x) {return(abs(x$rsq[length(x$rsq)]))})
hist(varExplained)
# and find
which(varExplained>0.7)
## [1] 3 147 149
We can look at the variables most important to the prediction and see if they have any meaning in a biological context.
require(randomForest)
## Loading required package: randomForest
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
# what is the compound name?
colnames(GI50Wide)[3]
## [1] "NSC1390"
# Which gene is most important in the prediction?
varImpPlot(res[[3]])
And let Google tell us if there is a relationship: