The goal of this project is to use machine learning to predict the results of Development Therapeutics Program (DTP) AIDS Antiviral Screening. Therefore, this report documents both the predictive power of the model and all corresponding data used for the analysis in an effort to identify effective therapeutic agents for treating patients with AIDS, and/or cancer.
The NIH website hosting the data is located at https://wiki.nci.nih.gov/display/NCIDTPdata/AIDS+Antiviral+Screen+Data, and the May 2004 data can be found at https://wiki.nci.nih.gov/pages/viewpageattachments.action?pageId=158204006&metadataLink=true. The target classes for prediction are:
Using a rule-based algorithm (C5.0) and boosting, the statistics of the model chosen has an accuracy of 93.8% (94% CI of 0.9355, 0.9405), and a Kappa value of .907. Lastly, the analysis uncovered that logEC50 concentrations can be used to indicate whether a drug is inactive (CI)/ active (CA), or moderately active (CM)/ active (CA).
Because the results of the screenings (classes) were provided separately from the rest of the predictors. The data was joined by the variable NSC (the NCI’s internal ID number) to create a new data frame for the analysis.
suppressMessages(library(data.table))
#Screening Results
download.file("https://wiki.nci.nih.gov/download/attachments/158204006/aids_conc_may04.txt?version=1&modificationDate=1378736563000&api=v2", "aids_conc_may04.txt")
# concentrations necessary to see a protective effect on the infected cells
download.file("https://wiki.nci.nih.gov/download/attachments/158204006/aids_ec50_may04.txt?version=1&modificationDate=1378736563000&api=v2", "aids_ec50_protective_may04.txt")
#concentrations necessary to inhibit the growth of uninfected cells
download.file("https://wiki.nci.nih.gov/download/attachments/158204006/aids_ic50_may04.txt?api=v2","aids_ic50_inhibit_may04.txt")
aids_classes <- read.csv("aids_conc_may04.txt", stringsAsFactors = F,strip.white = T) # read in data
aids_inhiit <- read.csv("aids_ic50_inhibit_may04.txt", stringsAsFactors = F)# read in data
aids_protective <- read.csv("aids_ec50_protective_may04.txt", stringsAsFactors = F)# read in data
aids_df <- data.table(Reduce(function(...) merge(..., by="NSC"), list(aids_classes, aids_inhiit, aids_protective)))# merge data
From the information on NIH’s website, some of the features will need to be coded as factors:
aids_df$Conclusion<- as.factor(aids_df$Conclusion); # assign factors
aids_df$Flag.x <- as.factor(aids_df$Flag.x);aids_df$Flag.y <- as.factor(aids_df$Flag.y);# assign factors
rm(aids_classes,aids_inhiit,aids_protective); # clean up workspace
aids_df$ConcUnit.x <- as.factor(aids_df$ConcUnit.x);# assign factors
aids_df$ConcUnit.y <- as.factor(aids_df$ConcUnit.y)# assign factors
head(aids_df) # inspect some of the data
## NSC Conclusion Log10HiConc.x ConcUnit.x Flag.x Log10IC50 NumExp.x
## 1: 48 CI -3.7 M = -4.61 3
## 2: 78 CI -3.7 M = -4.02 3
## 3: 128 CI -3.7 M = -4.34 4
## 4: 164 CI -3.7 M > -3.70 2
## 5: 180 CI -2.7 M = -3.16 2
## 6: 186 CI -3.8 M = -4.60 2
## StdDev.x Log10HiConc.y ConcUnit.y Flag.y Log10EC50 NumExp.y StdDev.y
## 1: 0.02 -3.7 M > -3.70 3 0
## 2: 0.09 -3.7 M > -3.70 3 0
## 3: 0.06 -3.7 M > -3.70 4 0
## 4: 0.00 -3.7 M > -3.70 2 0
## 5: 0.07 -2.7 M > -2.74 2 0
## 6: 0.02 -3.8 M > -3.84 2 0
nums<- which(aids_df[,.(lapply(.SD,is.numeric))]==T)
Because of the similarities of the screenings, a check was done to identify if any of the variables were highly correlated, or possibly identical.
aids_df[rep(duplicated(lapply(aids_df,summary)), length=.N)] # check for duplicates
## Empty data.table (0 rows) of 14 cols: NSC,Conclusion,Log10HiConc.x,ConcUnit.x,Flag.x,Log10IC50...
panel.hist <- function(x, ...) # panel function for histograms in lower
{
usr <- par("usr"); on.exit(par(usr))
par(usr = c(usr[1:2], 0, 1.5) )
h <- hist(x, plot = FALSE)
breaks <- h$breaks; nB <- length(breaks)
y <- h$counts; y <- y/max(y)
rect(breaks[-nB], 0, breaks[-1], y, col = "cyan", ...)
}
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...)
{
usr <- par("usr"); on.exit(par(usr)) # panel function for correlations in upper
par(usr = c(0, 1, 0, 1))
r <- abs(cor(x, y))
txt <- format(c(r, 0.123456789), digits = digits)[1]
txt <- paste0(prefix, txt)
if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
text(0.5, 0.5, txt, cex = cex.cor * r)
}
pairs(aids_df[-1], panel = panel.smooth, # pairs plot with both panel functions
cex = 1.5, pch = 24, bg = "light blue",
diag.panel = panel.hist, upper.panel = panel.cor,cex.labels = 1, font.labels = 2)
The plot shows correlations in the upper triangle, scatter-plots in the lower triangle, and variable names and distributions on the main diagonal. After setting the correlation threshold to .80, we observe three features showing correlation which may or may not need to be addressed until the fine tuning stage of the building the model.
suppressMessages(library(dplyr));suppressMessages(library(lattice))
bwplot(Log10EC50~factor(Conclusion),data=aids_df) # bow and whiskers plots
bwplot(Log10IC50~factor(Conclusion),data=aids_df)
xyplot(Log10IC50~Log10EC50|Conclusion,data = aids_df, type=c("p","r"),main="Inhibitory vs. Effective Conc." ) # scatter plot
data.table(Conclusions=c("CA","CI","CM"),adj.r.squared=c(summary(lm(Log10IC50~Log10EC50,data = aids_df[Conclusion=="CA"]))$adj.r.squared,summary(lm(Log10IC50~Log10EC50,data = aids_df[Conclusion=="CI"]))$adj.r.squared,summary(lm(Log10IC50~Log10EC50,data = aids_df[Conclusion=="CM"]))$adj.r.squared)) # data frame with adjusted R squared values used in the scatter plot
## Conclusions adj.r.squared
## 1: CA 0.5239549
## 2: CI 0.4538609
## 3: CM 0.6678528
qqnorm(aids_df$Log10IC50,main = "Log10IC50 t statistics");# test for normailty
qqnorm(aids_df$Log10EC50,main = "Log10EC50 t statistics");
wilcox.test(aids_df$Log10IC50,aids_df$Log10EC50) # test to see if log10IC50 and log10EC50 are different
##
## Wilcoxon rank sum test with continuity correction
##
## data: aids_df$Log10IC50 and aids_df$Log10EC50
## W = 636810000, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
Hmisc::describe(aids_df)$Conclusion # view the proportion of the classes
## Conclusion
## n missing distinct
## 42923 0 3
##
## Value CA CI CM
## Frequency 1809 39103 2011
## Proportion 0.042 0.911 0.047
The two box and whiskers plots were made to visualize the distribution of the means between the three classes. It is apparent that Log10EC50 has better separation between the means for the other classes, and should have influence on the models ability to separate the classes. Unfortunately, the Log10IC50 plot does not provide good separation between the class means, to the naked eye.
A quick scatter plot of Log10IC50 vs. Log10EC50 shows no correlation for CA and CI, but for CM, there is an improvement with the amount of variability that can be explained parsimoniously. In addition, a non-parametric test indicated that Log10IC50 vs. Log10EC50 are in fact different. Not surprisingly, both LogIC50 and LogEC50 are not normally distributed as evidenced by the Q-Q plot tests for normality. Furthermore, it is clear that within this data-set there is an extreme case of class imbalance for the active classifications of the drugs.
suppressMessages(library(lattice))
xyplot(Log10HiConc.y~Log10HiConc.x|Conclusion,data = aids_df, type=c("p","r"),main="Log X vs. Log y." )
wilcox.test(aids_df$Log10HiConc.y,aids_df$Log10HiConc.x)
##
## Wilcoxon rank sum test with continuity correction
##
## data: aids_df$Log10HiConc.y and aids_df$Log10HiConc.x
## W = 920420000, p-value = 0.8174
## alternative hypothesis: true location shift is not equal to 0
The same test was done on Log10HiConc.y vs. Log10HiConc.x and based on the p-value, we can not reject the null hypothesis indicating there are not different.
Now that we have seen a glimpse of the data, the next step was to ascertain the difficulty with separating the classes. Using the K-means clustering algorithm, it appears that the CM class might difficult to classify as is.
means<- kmeans(scale(aids_df[,nums[2:9],with=F]),3, nstart = 100) # build a kmeans based cluster with 3 centers
table(means$cluster,aids_df$Conclusion) # relationship table
##
## CA CI CM
## 1 449 31443 1586
## 2 219 308 53
## 3 1141 7352 372
aids_df[,units.y:=ifelse(ConcUnit.y=="u","c",ifelse(ConcUnit.y=="V","c","s"))]# feature enginerring based on knowledge of chemistry
aids_df[,units.x:=ifelse(ConcUnit.x=="u","c",ifelse(ConcUnit.x=="V","c","s"))]# feature enginerring based on knowledge of chemistry
aids_df[,`:=`(NSC=NULL,NumExp.x=NULL,NumExp.y=NULL)] # drop unused columns
Because the conclusion data was determined by inspection of individual dose response curves by subject matter experts (SME) in this field, the results corresponds to their overall judgment. Furthermore, the EC50 and IC50 data are computer generated averages and don’t necessarily capture everything that was considered when making the judgement. Therefore, a rule-based system was chosen to aid the SMEs for predicting future classes, based on the nature of the Flag variable and what it represents.
The C5.0 rule-set was selected because of its scalability and it’s known performance with generating rules that are easy to interpret. http://perun.pmf.uns.ac.rs/radovanovic/dmsem/cd/install/Weka/doc/classifiers-papers/trees/ADTree/atrees.pdf
suppressMessages(library(caret)); library(C50); suppressMessages(library(dplyr))
set.seed(323) # setting the seed to reproduce the results
aids_df<- upSample(y=aids_df$Conclusion,x=aids_df[,-1],yname = "Conclusion") # addressing the minority class by up sampling
M.intrain<- createDataPartition(aids_df$Conclusion,p=.70,list = F) # split the data to have 70% training and 30% testing
M.training <- data.table(aids_df[M.intrain,])# subset the data for training
M.testing <- data.table(aids_df[-M.intrain,])# subset the data for testing
cost.mat<-matrix(c(0,20,10,1,0,1,10,20,0),3, byrow = F) # creating the cost matrix for C5.0
rownames(cost.mat)<- colnames(cost.mat)<- c("CA","CI","CM") # assign the names
M.train.conclusion <- M.training$Conclusion; M.training[,Conclusion:=NULL]
M.test.conclusion<-M.testing$Conclusion; M.testing[,Conclusion:=NULL]
The data are split 70-30 to create the training and testing sets, respectively.
fit_M.c5<- C5.0.default(y=M.train.conclusion,x=M.training,trials = 2, rules = T,costs =cost.mat, allowParallel = T ) # building the model
Classification analysis from the training data was utilized to identify the important features, and predict with high accuracy the classification of the screening results. The tuning parameters were selected automatically based on repeated cross-validation to evaluate its accuracy.
varImp(fit_M.c5,scale = F) # look at the top predictors used in the model
## Overall
## Log10EC50 96.56
## Log10HiConc.y 88.73
## StdDev.y 84.26
## Log10HiConc.x 73.73
## ConcUnit.y 60.17
## Log10IC50 48.35
## Flag.y 30.55
## StdDev.x 27.66
## Flag.x 17.16
## ConcUnit.x 0.00
## units.y 0.00
## units.x 0.00
predictions.c5<- predict(fit_M.c5,M.training)# assign predictions to training set
print(confusionMatrix(predictions.c5,M.train.conclusion)) # view the score of the model
## Confusion Matrix and Statistics
##
## Reference
## Prediction CA CI CM
## CA 27152 442 32
## CI 92 22861 171
## CM 129 4070 27170
##
## Overall Statistics
##
## Accuracy : 0.9399
## 95% CI : (0.9382, 0.9415)
## No Information Rate : 0.3333
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9098
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: CA Class: CI Class: CM
## Sensitivity 0.9919 0.8352 0.9926
## Specificity 0.9913 0.9952 0.9233
## Pos Pred Value 0.9828 0.9886 0.8661
## Neg Pred Value 0.9959 0.9235 0.9960
## Prevalence 0.3333 0.3333 0.3333
## Detection Rate 0.3306 0.2784 0.3309
## Detection Prevalence 0.3364 0.2816 0.3820
## Balanced Accuracy 0.9916 0.9152 0.9579
Based on the plot, the top 5 variables of importance are Log10EC50, Flag.EC50= >, ConcUnit.EC50= u, Log10HiConc.IC50, and Log10IC50. The C5.0 model has an accuracy of 0.9399 (93.99%) and a kappa value of .9098 on the training data.
The final model built from the training set was selected by an internal majority voting scheme to evaluate its accuracy based on the highest kappa value. Next, that final model was then assessed using a testing set to determine its true accuracy.
predictions.c5<- predict(fit_M.c5,M.testing)
print(confusionMatrix(predictions.c5,M.test.conclusion))
## Confusion Matrix and Statistics
##
## Reference
## Prediction CA CI CM
## CA 11639 199 17
## CI 39 9739 82
## CM 52 1792 11631
##
## Overall Statistics
##
## Accuracy : 0.938
## 95% CI : (0.9355, 0.9405)
## No Information Rate : 0.3333
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.907
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: CA Class: CI Class: CM
## Sensitivity 0.9922 0.8303 0.9916
## Specificity 0.9908 0.9948 0.9214
## Pos Pred Value 0.9818 0.9877 0.8632
## Neg Pred Value 0.9961 0.9214 0.9954
## Prevalence 0.3333 0.3333 0.3333
## Detection Rate 0.3307 0.2768 0.3305
## Detection Prevalence 0.3369 0.2802 0.3829
## Balanced Accuracy 0.9915 0.9126 0.9565
The model was applied to the testing set, and the prediction accuracy deviates only slightly from the training set with an accuracy of 0.938 (93.8%) and a kappa value of .907
The analysis from this report presented an accurate model with the aim of giving the SMEs in this field a better method for identifying the active class other than from their collective agreement. Lastly, we observe that logEC50 concentrations can be used to indicate whether a drug is inactive/ active, or moderately active/ active.