Background

The NCI-60 cell lines are the most frequently studied human tumor cell lines in cancer research. This panel has generated the most extensive cancer pharmacology database worldwide. In addition, these cell lines have been intensely investigated, providing a unique platform for hypothesis-driven research focused on enhancing our understanding of tumor biology. Here, we report a comprehensive analysis of coding variants in the NCI-60 panel of cell lines identified by whole exome sequencing, providing a list of possible cancer specific variants for the community. Furthermore, we identify pharmacogenomic correlations between compounds and associated drug screen data.

NCI-60 Exomes paper

The Data

Genetic variation

The variants have been subjected to a set of filters to:

Remove variants common in human populations to increase the chances of focusing on cancer-associated mutations
Keep variants that cause a protein-coding change in a gene
Collapsed to the level of a gene, resulting in a table of Gene X Sample with each cell in the table being TRUE or FALSE

Drug screen data

Each cell line has been treated with MANY different compounds and the growth of the cells monitored. The concentration of each compound is titrated so that one can observe a growth response curve. The concentration of a given compound that results in 50% of the maximal reduction in cell growth. This concentration, the 50% Growth Inhibition concentration, or GI50, is a measure of the effective of a given compound on a given cell line. Note that this is a “relative” measurement.

All screened compound data can be downloaded from https://wiki.nci.nih.gov/display/NCIDTPdata/NCI-60+Growth+Inhibition+Data. The GI50 Data (Sept 2014) release contains a file CANCER60GI50.csv. We used this file to extract a few compounds of interest for prediction models.

The experiment

Objective: Use the variants, after filtering, to predict drug GI50.

Load the data

We have processed the data down to Gene X Sample after removing common variants. These data are directly available at the Github site, so we can read them into R.

# We can grab the data directly from github.
tmp <- read.csv("https://raw.githubusercontent.com/seandavi/NCI60_SuperLearner/master/Data/VariantTableByGene.csv")
VariantTable = as.matrix(tmp[,-1])
rownames(VariantTable) = tmp[,1]

## Load drug/agent GI50 data
DrugDat <- read.csv("https://raw.githubusercontent.com/seandavi/NCI60_SuperLearner/master/Data/AOD_IOA_GI50.csv")

# find overlap in cell lines
setdiff(rownames(VariantTable), unique(DrugDat$CELL))

##  [1] "A549_ATCC"   "COLO-205"    "DU145"       "HCC2998"     "HL-60"      
##  [6] "Hs_578T"     "IGR-OV1"     "K562"        "LOX_IMVI"    "MDA-MB-231" 
## [11] "NCI-ADR-RES" "RXF-393"

setdiff(unique(DrugDat$CELL), rownames(VariantTable))

##  [1] "A549/ATCC"       "COLO 205"        "NCI/ADR-RES"    
##  [4] "HS 578T"         "IGROV1"          "K-562"          
##  [7] "RXF 393"         "DU-145"          "HCC-2998"       
## [10] "RKOp53RE1"       "HT29p53RE22"     "H1299p53RE29"   
## [13] "RKO Waf1"        "MDA-MB-231/ATCC" "T47D FOS1"      
## [16] "T47D NFkB15"     "T47D ERE4"       "MCF7-E6"        
## [19] "HL-60(TB)"       "LOX IMVI"        "LXFL 529"       
## [22] "DMS 114"         "DMS 273"         "DLD-1"          
## [25] "KM20L2"          "RXF-631"         "UOK-57"         
## [28] "M19-MEL"         "SNB-78"          "XF 498"         
## [31] "HOP-18"          "P388"            "P388/ADR"       
## [34] "SN12K1"          "CALU-1"          "MLI-076"        
## [37] "CACO-2"          "MCF7/ATCC"       "SK-BR-3"        
## [40] "MAXF 401"        "ES-2"            "SW-156"         
## [43] "TK-164"          "UABMEL3"         "JCA-1"          
## [46] "ND-1"            "TSU-PRI"         "A204/ATCC"      
## [49] "A673"            "HT"              "RL"             
## [52] "DB"              "NB4"

# fix names in exome data (to match the names in the drug data)
rownames(VariantTable) <- sub("A549_ATCC", "A549/ATCC", rownames(VariantTable))
rownames(VariantTable) <- sub("COLO-205", "COLO 205", rownames(VariantTable))
rownames(VariantTable) <- sub("DU145", "DU-145", rownames(VariantTable))
rownames(VariantTable) <- sub("HCC2998", "HCC-2998", rownames(VariantTable))
rownames(VariantTable) <- sub("HL-60", "HL-60(TB)", rownames(VariantTable))
rownames(VariantTable) <- sub("Hs_578T", "HS 578T", rownames(VariantTable))
rownames(VariantTable) <- sub("IGR-OV1", "IGROV1", rownames(VariantTable))
rownames(VariantTable) <- sub("K562", "K-562", rownames(VariantTable))
rownames(VariantTable) <- sub("LOX_IMVI", "LOX IMVI", rownames(VariantTable))
rownames(VariantTable) <- sub("MDA-MB-231", "MDA-MB-231/ATCC", rownames(VariantTable))
rownames(VariantTable) <- sub("NCI-ADR-RES", "NCI/ADR-RES", rownames(VariantTable))
rownames(VariantTable) <- sub("RXF-393", "RXF 393", rownames(VariantTable))

# restrict drug data to overlap with exome data
DrugDat <- DrugDat[which(DrugDat$CELL %in% rownames(VariantTable)), ]

# Change our drug data into a more usable format
GI50Wide <- reshape(DrugDat[, c("NSC", "CELL", "NLOGGI50")], direction = "wide", timevar = "NSC", idvar = "CELL")
colnames(GI50Wide) <- sub("NLOGGI50.", "NSC", colnames(GI50Wide))

# use cell line name as rowname
rownames(GI50Wide) <- GI50Wide[, 1]
# remove cell line name, and line up cell lines with Variant Table
GI50Wide <- GI50Wide[rownames(VariantTable), -1]
all.equal(rownames(VariantTable), rownames(GI50Wide))

## [1] TRUE

## filter VariantTable -- why?
CountVar <- colSums(VariantTable)
VariantTable_sub <- as.data.frame(VariantTable[, CountVar > 4])

## clean variable names in variant table
colnames(VariantTable_sub) <- make.names(colnames(VariantTable_sub)) # some genes have "-" in the name, but this creates problems for regression models/formulas, replace with "."

rfPredict = function(i) {
  require(randomForest)
  idx = !is.na(GI50Wide[,i])
  rf  = randomForest(x = VariantTable_sub[idx, ], y = GI50Wide[idx, i])
  return(rf)
}

We can run the randomForest model in parallel using all the cores on our machines.

library(BiocParallel)
register(MulticoreParam())
res = bplapply(seq_along(GI50Wide),rfPredict)

Let’s use the amount of variation explained as a measure of “goodness” of fit.

varExplained = sapply(res,function(x) {return(abs(x$rsq[length(x$rsq)]))})
hist(varExplained)

# and find 
which(varExplained>0.7)

## [1]   3 147 149

We can look at the variables most important to the prediction and see if they have any meaning in a biological context.

require(randomForest)

## Loading required package: randomForest
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.

# what is the compound name?
colnames(GI50Wide)[3]

## [1] "NSC1390"

# Which gene is most important in the prediction?
varImpPlot(res[[3]])

And let Google tell us if there is a relationship:

Search for Allopurinol and IL4R

NCI-60 Drug Sensitivity Prediction