Background

Selected molecular descriptors from the Dragon chemoinformatics application were used to predict bioconcentration factors for 779 chemicals in order to evaluate QSAR (Quantitative Structure-Activity Relationship) modeling. The dataset was obtained from the UCI Machine Learning Repository.

The dataset consists of 779 observations of 10 attributes. Below is a brief description of each feature and the response variable (logBCF) in our dataset:

  1. nHM - number of heavy atoms (integer)
  2. piPC09 - molecular multiple path count (numeric)
  3. PCD - difference between multiple path count and path count (numeric)
  4. X2Av - average valence connectivity (numeric)
  5. MLOGP - Moriguchi octanol-water partition coefficient (numeric)
  6. ON1V - overall modified Zagreb index by valence vertex degrees (numeric)
  7. N.072 - Frequency of RCO-N< / >N-X=X fragments (integer)
  8. B02[C-N] - Presence/Absence of C-N atom pairs (binary)
  9. F04[C-O] - Frequency of C-O atom pairs (integer)
  10. logBCF - Bioconcentration Factor in log units (numeric)

Note that all predictors with the exception of B02[C-N] are quantitative. For the purpose of this assignment, DO NOT CONVERT B02[C-N] to a factor. Leave the data in its original format, numeric in R.

Please load the dataset “Bio_pred” and then split it into training and test sets in an 80:20 ratio. Use the training set to build the models in Questions 1-6. Use the test set to help evaluate model performance in Question 7. Please make sure that you are using R version 3.6.X or above (version 4.X is also acceptable).

Read Data

# Clear variables in memory
rm(list=ls())

# Import the libraries
library(CombMSC)
library(boot)
library(leaps)
library(MASS)
library(glmnet)
library(dplyr)

# Ensure that the sampling type is correct
RNGkind(sample.kind="Rejection")

# Set a seed for reproducibility
set.seed(100)

# Read data
fullData = read.csv("Bio_pred.csv",header=TRUE)
attach(fullData)
head(fullData,5)
##   nHM piPC09  PCD X2Av MLOGP ON1V N.072 B02.C.N. F04.C.O. logBCF
## 1   0      0 1.49 0.14  1.35 0.72     0        1        5   0.74
## 2   0      0 1.47 0.14  1.70 0.88     0        1        5   0.93
## 3   0      0 1.20 0.25  4.14 2.06     0        0        0   3.24
## 4   0      0 1.69 0.13  1.89 0.79     0        1        8  -0.40
## 5   0      0 0.52 0.25  2.65 1.31     0        0        0   2.24
# Split data for training and testing
testRows = sample(nrow(fullData),0.2*nrow(fullData))
testData = fullData[testRows, ]
trainData = fullData[-testRows, ]

Note: Use the training set to build the models in Questions 1-6. Use the test set to help evaluate model performance in Question 7.

Question 1: Full Model

  1. Fit a multiple linear regression with the variable logBCF as the response and the other variables as predictors. Call it model1. Display the model summary.
model1 = lm(logBCF ~ ., data=trainData)
summary(model1)
## 
## Call:
## lm(formula = logBCF ~ ., data = trainData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2577 -0.5180  0.0448  0.5117  4.0423 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.001422   0.138057   0.010  0.99179    
## nHM          0.137022   0.022462   6.100 1.88e-09 ***
## piPC09       0.031158   0.020874   1.493  0.13603    
## PCD          0.055655   0.063874   0.871  0.38391    
## X2Av        -0.031890   0.253574  -0.126  0.89996    
## MLOGP        0.506088   0.034211  14.793  < 2e-16 ***
## ON1V         0.140595   0.066810   2.104  0.03575 *  
## N.072       -0.073334   0.070993  -1.033  0.30202    
## B02.C.N.    -0.158231   0.080143  -1.974  0.04879 *  
## F04.C.O.    -0.030763   0.009667  -3.182  0.00154 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7957 on 614 degrees of freedom
## Multiple R-squared:  0.6672, Adjusted R-squared:  0.6623 
## F-statistic: 136.8 on 9 and 614 DF,  p-value: < 2.2e-16
  2. Which regression coefficients are significant at the 95% confidence level? At the 99% confidence level?

#Answer:

At the 95% confidence level (p < 0.05), the coefficients of nHM, MLOGP, ON1V, B02.C.N., and F04.C.O. are significant. At the 99% confidence level (p < 0.01), only nHM, MLOGP, and F04.C.O. remain significant.
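As a quick check, the same sets can be pulled programmatically from the coefficient table of model1 (a minimal sketch; pvals is just an illustrative local name):

# Predictor p-values from the fitted full model (drop the intercept row)
pvals <- summary(model1)$coefficients[-1, "Pr(>|t|)"]
names(which(pvals < 0.05))  # significant at the 95% level
names(which(pvals < 0.01))  # significant at the 99% level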

  3. What are the Mallow’s Cp, AIC, and BIC criterion values for this model?
set.seed(100)
## Mallow's Cp (CombMSC::Cp), AIC (k = 2), and BIC (k = log(n)) for the full model
n = nrow(trainData)
c(Cp(model1, S2 = summary(model1)$sigma^2), 
  AIC(model1, k = 2), AIC(model1, k = log(n)))
## [1]   10.000 1497.477 1546.274

#Answer:

For the full model, Mallow’s Cp = 10.000, AIC = 1497.477, and BIC = 1546.274. Cp equals the number of coefficients (10) here because the full model’s own error variance is used as S2, which makes its Cp equal to p by construction.
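As a sanity check, Cp can also be computed directly from its definition, Cp = RSS/S2 - n + 2p (a minimal sketch using the objects defined above):

p <- length(coef(model1))    # 10 coefficients, including the intercept
rss <- sum(resid(model1)^2)  # residual sum of squares of the full model
rss / summary(model1)$sigma^2 - nrow(trainData) + 2 * p  # matches Cp = 10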

  4. Build a new model on the training data with only the variables whose coefficients were found to be statistically significant at the 99% confidence level. Call it model2. Perform a partial F-test to compare this new model with the full model (model1). Which one would you prefer? Is it good practice to select variables based on the statistical significance of individual coefficients? Explain.
set.seed(100)

## Reduced model (99%-significant predictors only) vs. full model
model2 = lm(logBCF ~ nHM + F04.C.O. + MLOGP, data=trainData)
# Mallow's Cp, AIC, and BIC for the reduced model
c(Cp(model2, S2 = summary(model2)$sigma^2), 
  AIC(model2, k = 2), AIC(model2, k = log(n)))
## [1]    4.000 1504.152 1526.333
anova(model2, model1)
## Analysis of Variance Table
## 
## Model 1: logBCF ~ nHM + F04.C.O. + MLOGP
## Model 2: logBCF ~ nHM + piPC09 + PCD + X2Av + MLOGP + ON1V + N.072 + B02.C.N. + 
##     F04.C.O.
##   Res.Df    RSS Df Sum of Sq     F  Pr(>F)   
## 1    620 400.51                              
## 2    614 388.70  6    11.809 3.109 0.00523 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

#Answer

The partial F-test gives F = 3.109 with p-value 0.00523 < 0.05, so we reject the null hypothesis that the six excluded coefficients are all zero and prefer the full model (model1). Selecting variables based on the statistical significance of individual coefficients is not good practice: each t-test p-value is computed with all other predictors held in the model, so the p-values change as variables are added or removed, and correlated predictors can mask one another’s significance.
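For reference, the F statistic reported by anova() can be reproduced from the two residual sums of squares (a minimal sketch; rss_red, rss_full, and q are illustrative local names):

# Partial F-test: F = ((RSS_reduced - RSS_full)/q) / (RSS_full/df_full)
rss_red  <- sum(resid(model2)^2)                # 400.51 on 620 df
rss_full <- sum(resid(model1)^2)                # 388.70 on 614 df
q <- df.residual(model2) - df.residual(model1)  # 6 terms dropped
Fstat <- ((rss_red - rss_full) / q) / (rss_full / df.residual(model1))
c(F = Fstat, p = pf(Fstat, q, df.residual(model1), lower.tail = FALSE))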

Question 3: Stepwise Regression

  1. Perform backward stepwise regression using BIC. Allow the minimum model to be the model with only an intercept, and the full model to be model1. Display the model summary of your final model. Call it model4.
set.seed(100)
# Backward stepwise regression
min_model <- lm(logBCF ~ 1, trainData)  # intercept-only model
model4 <- step(model1, scope = list(lower = min_model, upper = model1),
               direction = "backward", k = log(n), trace = F)  # k = log(n) uses the BIC penalty

cat("Model4 Summary (BIC Criterion):")
## Model4 Summary (BIC Criterion):
summary(model4)
## 
## Call:
## lm(formula = logBCF ~ nHM + piPC09 + MLOGP + F04.C.O., data = trainData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2611 -0.5126  0.0517  0.5353  4.3488 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.008695   0.078196  -0.111  0.91150    
## nHM          0.114029   0.017574   6.489 1.78e-10 ***
## piPC09       0.041119   0.013636   3.015  0.00267 ** 
## MLOGP        0.566473   0.025990  21.796  < 2e-16 ***
## F04.C.O.    -0.022104   0.008000  -2.763  0.00590 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7985 on 619 degrees of freedom
## Multiple R-squared:  0.662,  Adjusted R-squared:  0.6599 
## F-statistic: 303.1 on 4 and 619 DF,  p-value: < 2.2e-16
  2. How many variables are in model4? Which regression coefficients are significant at the 99% confidence level?

#Answer

model4 contains 4 predictors: nHM, piPC09, MLOGP, and F04.C.O. All four regression coefficients are significant at the 99% confidence level (each p-value is below 0.01); only the intercept is not.

  3. Perform forward stepwise selection with AIC. Allow the minimum model to be the model with only an intercept, and the full model to be model1. Display the model summary of your final model. Call it model5. Do the variables included in model5 differ from the variables in model4?
set.seed(100)
# Forward stepwise regression
min_model <- lm(logBCF ~ 1, trainData)  # intercept-only model
model5 <- step(min_model, scope = list(lower = min_model, upper = model1),
               direction = "forward", k = 2, trace = F)  # k = 2 uses the AIC penalty

cat("Model5 Summary (AIC Criterion):")
## Model5 Summary (AIC Criterion):
summary(model5)
## 
## Call:
## lm(formula = logBCF ~ MLOGP + nHM + piPC09 + F04.C.O. + B02.C.N. + 
##     ON1V, data = trainData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2364 -0.5234  0.0421  0.5196  4.1159 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.035785   0.099454   0.360  0.71911    
## MLOGP        0.528522   0.029434  17.956  < 2e-16 ***
## nHM          0.124086   0.019083   6.502 1.63e-10 ***
## piPC09       0.042167   0.014135   2.983  0.00297 ** 
## F04.C.O.    -0.028644   0.009415  -3.042  0.00245 ** 
## B02.C.N.    -0.160204   0.073225  -2.188  0.02906 *  
## ON1V         0.098099   0.055457   1.769  0.07740 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7951 on 617 degrees of freedom
## Multiple R-squared:  0.666,  Adjusted R-squared:  0.6628 
## F-statistic: 205.1 on 6 and 617 DF,  p-value: < 2.2e-16

#Answer

model5 contains 6 predictors: MLOGP, nHM, piPC09, F04.C.O., B02.C.N., and ON1V. Yes, the variables differ: forward selection with AIC retains B02.C.N. and ON1V in addition to the four variables chosen by backward selection with BIC, which is expected since AIC penalizes model size less heavily.
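A quick set comparison confirms which variables model5 adds relative to model4 (a one-line sketch using the fitted objects above):

setdiff(names(coef(model5)), names(coef(model4)))  # "B02.C.N." "ON1V"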

  4. Compare the adjusted \(R^2\), Mallow’s Cp, AICs, and BICs of the full model (model1), the model found in Question 2 (model3), and the model found using backward selection with BIC (model4). Which model is preferred based on these criteria and why?
library(kableExtra)
## 
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
## 
##     group_rows
set.seed(100)


model <- c("Model1","Model3","Model4")
Mallow_Cp <- c(Cp(model1, S2= summary(model1)$sigma^2), Cp(model3, S2= summary(model3)$sigma^2),
               Cp(model4, S2= summary(model4)$sigma^2))
AIC <- c(AIC(model1, k=2), AIC(model3, k=2), AIC(model4, k=2))
BIC <- c(AIC(model1,k=log(n)), AIC(model3,k=log(n)), AIC(model4,k=log(n)))
Adj_R_Square <- c(summary(model1)$adj.r.squared,summary(model3)$adj.r.squared,summary(model4)$adj.r.squared)
#create my dataframe
df <- data.frame(model,Adj_R_Square, Mallow_Cp, AIC,BIC)
#create kable table
df %>%
  kbl(caption="Table I: Performance Measures") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), position="center") %>% 
  kable_classic(full_width = T, html_font = "Cambria") %>% row_spec(0, bold=T, background="gold")
Table I: Performance Measures

| model  | Adj_R_Square | Mallow_Cp | AIC      | BIC      |
|--------|--------------|-----------|----------|----------|
| Model1 | 0.6623027    | 10        | 1497.477 | 1546.274 |
| Model3 | 0.6627864    | 7         | 1493.623 | 1529.113 |
| Model4 | 0.6598504    | 5         | 1497.052 | 1523.669 |

#Answer

The criteria disagree: model3 has the highest adjusted \(R^2\) and the lowest AIC, while model4 has the lowest Mallow’s Cp and BIC. Because the adjusted \(R^2\) values are nearly identical across the three models, the more parsimonious model4 is a reasonable preference under the Cp and BIC criteria; model3 would be preferred if AIC were the deciding criterion.
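The AIC/BIC disagreement is expected: with n = 624 training rows, BIC’s per-parameter penalty log(n) is roughly 6.44, much larger than AIC’s penalty of 2, so BIC pushes toward smaller models (a one-line check):

log(nrow(trainData))  # BIC penalty per parameter (~6.44) vs. 2 for AIC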

Question 4: Ridge Regression

  1. Perform ridge regression on the training set. Use cv.glmnet() to find the lambda value that minimizes the cross-validation error using 10-fold CV.
set.seed(100)
train.X <- trainData[,-10]  # predictor columns
train.Y <- trainData[,10]   # response (logBCF)
ridge.cv <- cv.glmnet(as.matrix(train.X), train.Y, alpha=0, nfolds=10)
ridge.cv$lambda.min
## [1] 0.108775
  2. List the coefficient values at the optimal lambda.
set.seed(100)
ridge_model <- glmnet(as.matrix(train.X), train.Y, alpha=0,nlambda = 100)

coef(ridge_model, s=ridge.cv$lambda.min)
## 10 x 1 sparse Matrix of class "dgCMatrix"
##                      s1
## (Intercept)  0.13841426
## nHM          0.14391877
## piPC09       0.03735762
## PCD          0.08235334
## X2Av        -0.06901352
## MLOGP        0.44403655
## ON1V         0.15770114
## N.072       -0.09683534
## B02.C.N.    -0.20919397
## F04.C.O.    -0.03177144
  3. How many variables were selected? Was this result expected? Explain.

#Answer

All 9 predictor variables were retained at the optimal lambda. This is expected: the ridge (L2) penalty shrinks coefficients toward zero but never sets them exactly to zero, so ridge regression does not perform variable selection.
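A quick count of the nonzero coefficients (excluding the intercept) confirms this, reusing the ridge fit from above (ridge.coefs is an illustrative local name):

ridge.coefs <- coef(ridge_model, s = ridge.cv$lambda.min)
sum(ridge.coefs[-1, 1] != 0)  # 9: every predictor keeps a nonzero coefficient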

Question 5: Lasso Regression

  1. Perform lasso regression on the training set. Use cv.glmnet() to find the lambda value that minimizes the cross-validation error using 10-fold CV.
set.seed(100)
lasso.cv <- cv.glmnet(as.matrix(train.X), train.Y, alpha=1, nfolds=10)

lasso.cv
## 
## Call:  cv.glmnet(x = as.matrix(train.X), y = train.Y, nfolds = 10, alpha = 1) 
## 
## Measure: Mean-Squared Error 
## 
##      Lambda Index Measure      SE Nonzero
## min 0.00785    54  0.6482 0.05770       8
## 1se 0.18572    20  0.7029 0.05249       2
cat('Optimal lasso lambda:',lasso.cv$lambda.min)
## Optimal lasso lambda: 0.007854436
  2. Plot the regression coefficient path.
set.seed(100)
lasso.model = glmnet(as.matrix(train.X),train.Y,alpha = 1, nlambda = 100)

## Plot coefficient paths
plot(lasso.model, xvar="lambda", label=T, lwd=2)
abline(v=log(lasso.cv$lambda.min), col='black', lty=2, lwd=3)  # dashed line marks the CV-optimal log(lambda)
abline(v=0, col='purple', lty=1, lwd=4)  # reference line at log(lambda) = 0, i.e., lambda = 1

  3. How many variables were selected? Which are they?
set.seed(100)
coef(lasso.model, s=lasso.cv$lambda.min)
## 10 x 1 sparse Matrix of class "dgCMatrix"
##                      s1
## (Intercept)  0.02722838
## nHM          0.12543866
## piPC09       0.03387665
## PCD          0.03194878
## X2Av         .         
## MLOGP        0.52174346
## ON1V         0.09633951
## N.072       -0.05487196
## B02.C.N.    -0.13961811
## F04.C.O.    -0.02535576
cat("There were 8 variables selected, see above")
## There were 8 variables selected, see above

#Answer

Lasso selected 8 of the 9 variables at lambda.min: nHM, piPC09, PCD, MLOGP, ON1V, N.072, B02.C.N., and F04.C.O. Only X2Av was dropped (its coefficient is shrunk exactly to zero).
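The selected names can also be extracted programmatically, and it is worth noting how much sparser the fit becomes under the more conservative 1-SE rule (a sketch reusing the lasso objects above; per the cv.glmnet table, lambda.1se keeps only 2 variables):

lasso.coefs <- coef(lasso.model, s = lasso.cv$lambda.min)
rownames(lasso.coefs)[-1][lasso.coefs[-1, 1] != 0]  # the 8 selected variables
coef(lasso.model, s = lasso.cv$lambda.1se)          # sparser fit under the 1-SE rule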

Question 6: Elastic Net

  1. Perform elastic net regression on the training set. Use cv.glmnet() to find the lambda value that minimizes the cross-validation error using 10-fold CV. Give equal weight to both penalties.
set.seed(100)
elastic.cv <- cv.glmnet(as.matrix(train.X), train.Y, alpha=0.5, nfolds=10)
elastic.cv$lambda.min  # optimal lambda minimizing the cross-validation error
## [1] 0.0207662
  2. List the coefficient values at the optimal lambda. How many variables were selected? How do these variables compare to those from Lasso in Question 5?
set.seed(100)
elastic.model <- glmnet(as.matrix(train.X), train.Y, alpha=0.5)
coef(elastic.model, s=elastic.cv$lambda.min)
## 10 x 1 sparse Matrix of class "dgCMatrix"
##                      s1
## (Intercept)  0.04903516
## nHM          0.12397290
## piPC09       0.03470891
## PCD          0.03060034
## X2Av         .         
## MLOGP        0.51776470
## ON1V         0.08901088
## N.072       -0.05236840
## B02.C.N.    -0.14155538
## F04.C.O.    -0.02420217
cat("There were 8 variables selected, like Lasso but the coefficient estimates are differnt")
## There were 8 variables selected, like Lasso but the coefficient estimates are differnt

#Answer

Elastic net also selected 8 variables, the same set as lasso (X2Av is again dropped). The coefficient estimates differ slightly because the L2 component of the elastic net penalty shrinks the retained coefficients somewhat more.
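To verify that the two methods select the same set, a small helper can compare the nonzero coefficient names (a sketch; sel is an illustrative name, not a library function):

# Return the names of predictors with nonzero coefficients at a given lambda
sel <- function(fit, s) {
  cf <- coef(fit, s = s)
  rownames(cf)[-1][cf[-1, 1] != 0]
}
identical(sel(lasso.model, lasso.cv$lambda.min),
          sel(elastic.model, elastic.cv$lambda.min))  # TRUE: identical sets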

Question 7: Model comparison

  1. Predict logBCF for each of the rows in the test data using the full model, and the models found using backward stepwise regression with BIC, ridge regression, lasso regression, and elastic net. Display the first few predictions for each model.
set.seed(100)
#set up test data
test.X <- testData[,-10]
test.Y <- testData[,10]
#full model = model1
pred.full <- predict(model1, test.X)

cat("full prediction:",pred.full[1:5])
## full prediction: 2.446479 4.333759 3.266892 1.66477 1.955362
#model4 = model with backward regression
pred.backward <- predict(model4, test.X)

cat("backward stepwise prediction:",pred.backward[1:5])
## backward stepwise prediction: 2.424916 4.353167 3.274192 1.297175 2.001399
#ridge
pred.ridge <- predict(ridge_model, as.matrix(test.X), s=ridge.cv$lambda.min)

cat("ridge prediction:",pred.ridge[1:5])
## ridge prediction: 2.454878 4.234425 3.223166 1.734094 1.993967
#lasso
pred.lasso <- predict(lasso.model, as.matrix(test.X), s=lasso.cv$lambda.min)

cat("lasso prediction:",pred.lasso[1:5])
## lasso prediction: 2.442895 4.313509 3.260617 1.610006 1.939946
#elastic
pred.elastic <- predict(elastic.model, as.matrix(test.X),s=elastic.cv$lambda.min)

cat("Elastic-net prediction:",pred.elastic[1:5])
## Elastic-net prediction: 2.441506 4.296451 3.252638 1.606774 1.939465
  2. Compare the predictions using mean squared prediction error. Which model performed the best?
set.seed(100)
MSPE <- c(mean((pred.full-test.Y)^2), mean((pred.backward-test.Y)^2), mean((pred.ridge-test.Y)^2),
          mean((pred.lasso-test.Y)^2), mean((pred.elastic-test.Y)^2))
models <- c("Full model","Backward stepwise","Ridge model", "Lasso model", "Elastic net model")
df2 <- data.frame(models, MSPE)
df2 <- df2 %>% arrange(MSPE)
#create kable table
df2 %>%
  kbl(caption="Table II: Prediction Performance") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), position="center") %>% 
  kable_classic(full_width = F, html_font = "Cambria") %>% row_spec(0, bold=T, background="yellow")
Table II: Prediction Performance

| models            | MSPE      |
|-------------------|-----------|
| Backward stepwise | 0.5742198 |
| Elastic net model | 0.5782750 |
| Lasso model       | 0.5790832 |
| Full model        | 0.5839296 |
| Ridge model       | 0.5877835 |
cat("Backward Stepwise had the lowest MSPE:", min(df2$MSPE))
## Backward Stepwise had the lowest MSPE: 0.5742198

#Answer

Backward stepwise regression (model4) performed best, with the lowest MSPE (0.5742) on the test set; all five models are within roughly 2% of one another, with ridge trailing at 0.5878.

  3. Provide a table listing each method described in Question 7a and the variables selected by each method (see Lesson 5.8 for an example). Which variables were selected consistently?
| Variable | Backward Stepwise | Ridge | Lasso | Elastic Net |
|----------|-------------------|-------|-------|-------------|
| nHM      | x                 | x     | x     | x           |
| piPC09   | x                 | x     | x     | x           |
| PCD      |                   | x     | x     | x           |
| X2Av     |                   | x     |       |             |
| MLOGP    | x                 | x     | x     | x           |
| ON1V     |                   | x     | x     | x           |
| N.072    |                   | x     | x     | x           |
| B02.C.N. |                   | x     | x     | x           |
| F04.C.O. | x                 | x     | x     | x           |

#Answer

nHM, piPC09, MLOGP, and F04.C.O. were selected by every method, so these four variables are the consistent choices. X2Av was retained only by ridge, which keeps all variables by construction.
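Rather than tallying the table by hand, the same summary can be assembled programmatically from the fitted objects (a sketch; it assumes model4 and the glmnet fits from Questions 4-6 are still in memory, and relies on the coefficient rows following the column order of trainData[,-10]):

predictors <- names(trainData)[-10]  # the 9 predictor names, in column order
selected <- data.frame(
  Variable = predictors,
  Backward = ifelse(predictors %in% names(coef(model4))[-1], "x", ""),
  Ridge    = ifelse(coef(ridge_model,   s = ridge.cv$lambda.min)[-1, 1] != 0, "x", ""),
  Lasso    = ifelse(coef(lasso.model,   s = lasso.cv$lambda.min)[-1, 1] != 0, "x", ""),
  Elastic  = ifelse(coef(elastic.model, s = elastic.cv$lambda.min)[-1, 1] != 0, "x", "")
)
selected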