AIDS Antiviral Screen Data

0.1 Summary

The goal of this project is to use machine learning to predict the results of Development Therapeutics Program (DTP) AIDS Antiviral Screening. Therefore, this report documents both the predictive power of the model and all corresponding data used for the analysis in an effort to identify effective therapeutic agents for treating patients with AIDS, and/or cancer.

The NIH website hosting the data is located at https://wiki.nci.nih.gov/display/NCIDTPdata/AIDS+Antiviral+Screen+Data, and the May 2004 data can be found at https://wiki.nci.nih.gov/pages/viewpageattachments.action?pageId=158204006&metadataLink=true. The target classes for prediction are:

CA - Confirmed active
CM - Confirmed moderately active
CI - Confirmed inactive

Using a rule-based algorithm (C5.0) and boosting, the statistics of the model chosen has an accuracy of 93.8% (94% CI of 0.9355, 0.9405), and a Kappa value of .907. Lastly, the analysis uncovered that logEC50 concentrations can be used to indicate whether a drug is inactive (CI)/ active (CA), or moderately active (CM)/ active (CA).

0.2 Gathering and Cleaning the Data

0.2.1 The data:

Because the results of the screenings (classes) were provided separately from the rest of the predictors. The data was joined by the variable NSC (the NCI’s internal ID number) to create a new data frame for the analysis.

suppressMessages(library(data.table))
#Screening Results
download.file("https://wiki.nci.nih.gov/download/attachments/158204006/aids_conc_may04.txt?version=1&modificationDate=1378736563000&api=v2", "aids_conc_may04.txt")

# concentrations necessary to see a protective effect on the infected cells
download.file("https://wiki.nci.nih.gov/download/attachments/158204006/aids_ec50_may04.txt?version=1&modificationDate=1378736563000&api=v2", "aids_ec50_protective_may04.txt")

#concentrations necessary to inhibit the growth of uninfected cells
download.file("https://wiki.nci.nih.gov/download/attachments/158204006/aids_ic50_may04.txt?api=v2","aids_ic50_inhibit_may04.txt")

aids_classes <- read.csv("aids_conc_may04.txt", stringsAsFactors = F,strip.white = T) # read in data
aids_inhiit <- read.csv("aids_ic50_inhibit_may04.txt", stringsAsFactors = F)# read in data
aids_protective <- read.csv("aids_ec50_protective_may04.txt", stringsAsFactors = F)# read in data
aids_df <- data.table(Reduce(function(...) merge(..., by="NSC"), list(aids_classes, aids_inhiit, aids_protective)))# merge data

From the information on NIH’s website, some of the features will need to be coded as factors:

ConcUnit
- M = molar
- u = micro grams per ml
- V = Volumetric
Flag
- > indicates EC50 would be higher than the highest concentration tested in at least one of the experiments
- < indicates EC50 would be less than the lowest concentration tested in at least one of the experiments
- = indicated that all the experiments reached EC50

aids_df$Conclusion<- as.factor(aids_df$Conclusion); # assign factors
aids_df$Flag.x <- as.factor(aids_df$Flag.x);aids_df$Flag.y <- as.factor(aids_df$Flag.y);# assign factors
rm(aids_classes,aids_inhiit,aids_protective); # clean up workspace
aids_df$ConcUnit.x <- as.factor(aids_df$ConcUnit.x);# assign factors
aids_df$ConcUnit.y <- as.factor(aids_df$ConcUnit.y)# assign factors
head(aids_df) # inspect some of the data

##    NSC Conclusion Log10HiConc.x ConcUnit.x Flag.x Log10IC50 NumExp.x
## 1:  48         CI          -3.7          M      =     -4.61        3
## 2:  78         CI          -3.7          M      =     -4.02        3
## 3: 128         CI          -3.7          M      =     -4.34        4
## 4: 164         CI          -3.7          M      >     -3.70        2
## 5: 180         CI          -2.7          M      =     -3.16        2
## 6: 186         CI          -3.8          M      =     -4.60        2
##    StdDev.x Log10HiConc.y ConcUnit.y Flag.y Log10EC50 NumExp.y StdDev.y
## 1:     0.02          -3.7          M      >     -3.70        3        0
## 2:     0.09          -3.7          M      >     -3.70        3        0
## 3:     0.06          -3.7          M      >     -3.70        4        0
## 4:     0.00          -3.7          M      >     -3.70        2        0
## 5:     0.07          -2.7          M      >     -2.74        2        0
## 6:     0.02          -3.8          M      >     -3.84        2        0

nums<- which(aids_df[,.(lapply(.SD,is.numeric))]==T)

Because of the similarities of the screenings, a check was done to identify if any of the variables were highly correlated, or possibly identical.

aids_df[rep(duplicated(lapply(aids_df,summary)), length=.N)] # check for duplicates

## Empty data.table (0 rows) of 14 cols: NSC,Conclusion,Log10HiConc.x,ConcUnit.x,Flag.x,Log10IC50...

panel.hist <- function(x, ...) # panel function for histograms in lower
{
  usr <- par("usr"); on.exit(par(usr))
  par(usr = c(usr[1:2], 0, 1.5) )
  h <- hist(x, plot = FALSE)
  breaks <- h$breaks; nB <- length(breaks)
  y <- h$counts; y <- y/max(y)
  rect(breaks[-nB], 0, breaks[-1], y, col = "cyan", ...)
}

panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...)
{
  usr <- par("usr"); on.exit(par(usr)) # panel function for correlations in upper
  par(usr = c(0, 1, 0, 1))
  r <- abs(cor(x, y))
  txt <- format(c(r, 0.123456789), digits = digits)[1]
  txt <- paste0(prefix, txt)
  if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
  text(0.5, 0.5, txt, cex = cex.cor * r)
}


pairs(aids_df[-1], panel = panel.smooth,  # pairs plot with both panel functions
      cex = 1.5, pch = 24, bg = "light blue",
      diag.panel = panel.hist, upper.panel = panel.cor,cex.labels = 1, font.labels = 2)

The plot shows correlations in the upper triangle, scatter-plots in the lower triangle, and variable names and distributions on the main diagonal. After setting the correlation threshold to .80, we observe three features showing correlation which may or may not need to be addressed until the fine tuning stage of the building the model.

suppressMessages(library(dplyr));suppressMessages(library(lattice))

bwplot(Log10EC50~factor(Conclusion),data=aids_df) # bow and whiskers plots

bwplot(Log10IC50~factor(Conclusion),data=aids_df)

xyplot(Log10IC50~Log10EC50|Conclusion,data = aids_df, type=c("p","r"),main="Inhibitory vs. Effective Conc." ) # scatter plot

data.table(Conclusions=c("CA","CI","CM"),adj.r.squared=c(summary(lm(Log10IC50~Log10EC50,data = aids_df[Conclusion=="CA"]))$adj.r.squared,summary(lm(Log10IC50~Log10EC50,data = aids_df[Conclusion=="CI"]))$adj.r.squared,summary(lm(Log10IC50~Log10EC50,data = aids_df[Conclusion=="CM"]))$adj.r.squared)) # data frame with adjusted R squared values used in the scatter plot

##    Conclusions adj.r.squared
## 1:          CA     0.5239549
## 2:          CI     0.4538609
## 3:          CM     0.6678528

qqnorm(aids_df$Log10IC50,main = "Log10IC50 t statistics");# test for normailty

qqnorm(aids_df$Log10EC50,main = "Log10EC50 t statistics");

wilcox.test(aids_df$Log10IC50,aids_df$Log10EC50) # test to see if log10IC50 and log10EC50 are different

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  aids_df$Log10IC50 and aids_df$Log10EC50
## W = 636810000, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0

Hmisc::describe(aids_df)$Conclusion # view the proportion of the classes

## Conclusion 
##        n  missing distinct 
##    42923        0        3 
##                             
## Value         CA    CI    CM
## Frequency   1809 39103  2011
## Proportion 0.042 0.911 0.047

The two box and whiskers plots were made to visualize the distribution of the means between the three classes. It is apparent that Log10EC50 has better separation between the means for the other classes, and should have influence on the models ability to separate the classes. Unfortunately, the Log10IC50 plot does not provide good separation between the class means, to the naked eye.

A quick scatter plot of Log10IC50 vs. Log10EC50 shows no correlation for CA and CI, but for CM, there is an improvement with the amount of variability that can be explained parsimoniously. In addition, a non-parametric test indicated that Log10IC50 vs. Log10EC50 are in fact different. Not surprisingly, both LogIC50 and LogEC50 are not normally distributed as evidenced by the Q-Q plot tests for normality. Furthermore, it is clear that within this data-set there is an extreme case of class imbalance for the active classifications of the drugs.

suppressMessages(library(lattice))
xyplot(Log10HiConc.y~Log10HiConc.x|Conclusion,data = aids_df, type=c("p","r"),main="Log X vs. Log y." )

wilcox.test(aids_df$Log10HiConc.y,aids_df$Log10HiConc.x)

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  aids_df$Log10HiConc.y and aids_df$Log10HiConc.x
## W = 920420000, p-value = 0.8174
## alternative hypothesis: true location shift is not equal to 0

The same test was done on Log10HiConc.y vs. Log10HiConc.x and based on the p-value, we can not reject the null hypothesis indicating there are not different.

Now that we have seen a glimpse of the data, the next step was to ascertain the difficulty with separating the classes. Using the K-means clustering algorithm, it appears that the CM class might difficult to classify as is.

means<- kmeans(scale(aids_df[,nums[2:9],with=F]),3, nstart = 100) # build a kmeans based cluster with 3 centers
table(means$cluster,aids_df$Conclusion) # relationship table

##    
##        CA    CI    CM
##   1   449 31443  1586
##   2   219   308    53
##   3  1141  7352   372

aids_df[,units.y:=ifelse(ConcUnit.y=="u","c",ifelse(ConcUnit.y=="V","c","s"))]# feature enginerring based on knowledge of chemistry
aids_df[,units.x:=ifelse(ConcUnit.x=="u","c",ifelse(ConcUnit.x=="V","c","s"))]# feature enginerring based on knowledge of chemistry
aids_df[,`:=`(NSC=NULL,NumExp.x=NULL,NumExp.y=NULL)] # drop unused columns

0.3 Model Selection and Tunning

Because the conclusion data was determined by inspection of individual dose response curves by subject matter experts (SME) in this field, the results corresponds to their overall judgment. Furthermore, the EC50 and IC50 data are computer generated averages and don’t necessarily capture everything that was considered when making the judgement. Therefore, a rule-based system was chosen to aid the SMEs for predicting future classes, based on the nature of the Flag variable and what it represents.

The C5.0 rule-set was selected because of its scalability and it’s known performance with generating rules that are easy to interpret. http://perun.pmf.uns.ac.rs/radovanovic/dmsem/cd/install/Weka/doc/classifiers-papers/trees/ADTree/atrees.pdf

suppressMessages(library(caret)); library(C50); suppressMessages(library(dplyr))


set.seed(323) # setting the seed to reproduce the results
aids_df<- upSample(y=aids_df$Conclusion,x=aids_df[,-1],yname = "Conclusion") # addressing the minority class by up sampling

M.intrain<- createDataPartition(aids_df$Conclusion,p=.70,list = F) # split the data to have 70% training and 30% testing
M.training <- data.table(aids_df[M.intrain,])# subset the data for training
M.testing <- data.table(aids_df[-M.intrain,])# subset the data for testing

cost.mat<-matrix(c(0,20,10,1,0,1,10,20,0),3, byrow = F) # creating the cost matrix for C5.0 
rownames(cost.mat)<- colnames(cost.mat)<- c("CA","CI","CM") # assign the names
M.train.conclusion <- M.training$Conclusion; M.training[,Conclusion:=NULL]
M.test.conclusion<-M.testing$Conclusion; M.testing[,Conclusion:=NULL]

The data are split 70-30 to create the training and testing sets, respectively.

fit_M.c5<- C5.0.default(y=M.train.conclusion,x=M.training,trials = 2, rules = T,costs =cost.mat, allowParallel = T ) # building the model

0.3.1 Training Data

Classification analysis from the training data was utilized to identify the important features, and predict with high accuracy the classification of the screening results. The tuning parameters were selected automatically based on repeated cross-validation to evaluate its accuracy.

varImp(fit_M.c5,scale = F) # look at the top predictors used in the model

##               Overall
## Log10EC50       96.56
## Log10HiConc.y   88.73
## StdDev.y        84.26
## Log10HiConc.x   73.73
## ConcUnit.y      60.17
## Log10IC50       48.35
## Flag.y          30.55
## StdDev.x        27.66
## Flag.x          17.16
## ConcUnit.x       0.00
## units.y          0.00
## units.x          0.00

predictions.c5<- predict(fit_M.c5,M.training)# assign predictions to training set
print(confusionMatrix(predictions.c5,M.train.conclusion)) # view the score of the model

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    CA    CI    CM
##         CA 27152   442    32
##         CI    92 22861   171
##         CM   129  4070 27170
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9399          
##                  95% CI : (0.9382, 0.9415)
##     No Information Rate : 0.3333          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9098          
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: CA Class: CI Class: CM
## Sensitivity             0.9919    0.8352    0.9926
## Specificity             0.9913    0.9952    0.9233
## Pos Pred Value          0.9828    0.9886    0.8661
## Neg Pred Value          0.9959    0.9235    0.9960
## Prevalence              0.3333    0.3333    0.3333
## Detection Rate          0.3306    0.2784    0.3309
## Detection Prevalence    0.3364    0.2816    0.3820
## Balanced Accuracy       0.9916    0.9152    0.9579

Based on the plot, the top 5 variables of importance are Log10EC50, Flag.EC50= >, ConcUnit.EC50= u, Log10HiConc.IC50, and Log10IC50. The C5.0 model has an accuracy of 0.9399 (93.99%) and a kappa value of .9098 on the training data.

0.3.2 Testing Data

The final model built from the training set was selected by an internal majority voting scheme to evaluate its accuracy based on the highest kappa value. Next, that final model was then assessed using a testing set to determine its true accuracy.

predictions.c5<- predict(fit_M.c5,M.testing)
print(confusionMatrix(predictions.c5,M.test.conclusion))

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    CA    CI    CM
##         CA 11639   199    17
##         CI    39  9739    82
##         CM    52  1792 11631
## 
## Overall Statistics
##                                           
##                Accuracy : 0.938           
##                  95% CI : (0.9355, 0.9405)
##     No Information Rate : 0.3333          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.907           
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: CA Class: CI Class: CM
## Sensitivity             0.9922    0.8303    0.9916
## Specificity             0.9908    0.9948    0.9214
## Pos Pred Value          0.9818    0.9877    0.8632
## Neg Pred Value          0.9961    0.9214    0.9954
## Prevalence              0.3333    0.3333    0.3333
## Detection Rate          0.3307    0.2768    0.3305
## Detection Prevalence    0.3369    0.2802    0.3829
## Balanced Accuracy       0.9915    0.9126    0.9565

The model was applied to the testing set, and the prediction accuracy deviates only slightly from the training set with an accuracy of 0.938 (93.8%) and a kappa value of .907

0.4 Concluding Remarks

The analysis from this report presented an accurate model with the aim of giving the SMEs in this field a better method for identifying the active class other than from their collective agreement. Lastly, we observe that logEC50 concentrations can be used to indicate whether a drug is inactive/ active, or moderately active/ active.