Background

Selected molecular descriptors from the Dragon chemoinformatics application were used to predict bioconcentration factors for 779 chemicals as an exercise in QSAR (Quantitative Structure-Activity Relationship) modeling. The dataset was obtained from the UCI Machine Learning Repository.

The dataset consists of 779 observations of 10 attributes. Below is a brief description of each feature and the response variable (logBCF) in our dataset:

  1. nHM - number of heavy atoms (integer)
  2. piPC09 - molecular multiple path count (numeric)
  3. PCD - difference between multiple path count and path count (numeric)
  4. X2Av - average valence connectivity (numeric)
  5. MLOGP - Moriguchi octanol-water partition coefficient (numeric)
  6. ON1V - overall modified Zagreb index by valence vertex degrees (numeric)
  7. N.072 - Frequency of RCO-N< / >N-X=X fragments (integer)
  8. B02[C-N] - Presence/Absence of C-N atom pairs (binary)
  9. F04[C-O] - Frequency of C-O atom pairs (integer)
  10. logBCF - Bioconcentration Factor in log units (numeric)

Note that all predictors with the exception of B02[C-N] are quantitative. For the purpose of this assignment, DO NOT CONVERT B02[C-N] to factor. Leave the data in its original format - numeric in R.

Please load the dataset “Bio_pred” and then split the dataset into a train and test set in an 80:20 ratio. Use the training set to build the models in Questions 1-6. Use the test set to help evaluate model performance in Question 7. Please make sure that you are using R version 3.6.X or above (i.e. version 4.X is also acceptable).

Read Data

# Clear variables in memory
rm(list=ls())

# Import the libraries
library(CombMSC)
library(boot)
library(leaps)
library(MASS)
library(glmnet)

# Ensure that the sampling type is correct
RNGkind(sample.kind="Rejection")

# Set a seed for reproducibility
set.seed(100)

# Read data
fullData = read.csv("Bio_pred.csv",header=TRUE)

# Split data for training and testing
testRows = sample(nrow(fullData),0.2*nrow(fullData))
testData = fullData[testRows, ]
trainData = fullData[-testRows, ]
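
A quick sanity check on the split (0.2 × 779 truncates to 155 test rows, leaving 624 for training, consistent with the residual degrees of freedom reported in the model summaries below):

# Quick sanity check on the split sizes
nrow(testData)    # 155 rows (~20%)
nrow(trainData)   # 624 rows (~80%)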

Note: Use the training set to build the models in Questions 1-6. Use the test set to help evaluate model performance in Question 7.

Question 1: Full Model

  1. Fit a multiple linear regression with the variable logBCF as the response and the other variables as predictors. Call it model1. Display the model summary.
model1 = lm(logBCF ~ ., data=trainData)

summary(model1)
## 
## Call:
## lm(formula = logBCF ~ ., data = trainData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2577 -0.5180  0.0448  0.5117  4.0423 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.001422   0.138057   0.010  0.99179    
## nHM          0.137022   0.022462   6.100 1.88e-09 ***
## piPC09       0.031158   0.020874   1.493  0.13603    
## PCD          0.055655   0.063874   0.871  0.38391    
## X2Av        -0.031890   0.253574  -0.126  0.89996    
## MLOGP        0.506088   0.034211  14.793  < 2e-16 ***
## ON1V         0.140595   0.066810   2.104  0.03575 *  
## N.072       -0.073334   0.070993  -1.033  0.30202    
## B02.C.N.    -0.158231   0.080143  -1.974  0.04879 *  
## F04.C.O.    -0.030763   0.009667  -3.182  0.00154 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7957 on 614 degrees of freedom
## Multiple R-squared:  0.6672, Adjusted R-squared:  0.6623 
## F-statistic: 136.8 on 9 and 614 DF,  p-value: < 2.2e-16
  2. Which regression coefficients are significant at the 95% confidence level? At the 99% confidence level?

At the 95% confidence level, the significant coefficients are nHM, MLOGP, ON1V, B02.C.N., and F04.C.O.

At the 99% confidence level, the significant coefficients are nHM, MLOGP, and F04.C.O.
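
The same answer can be read off programmatically from the coefficient table (a minimal sketch using the fitted model1):

# Pull the p-values (column 4 of the coefficient table), intercept excluded
pvals <- summary(model1)$coefficients[-1, 4]
names(pvals)[pvals < 0.05]   # significant at the 95% confidence level
names(pvals)[pvals < 0.01]   # significant at the 99% confidence level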

  3. What are the Mallow’s Cp, AIC, and BIC criterion values for this model?
set.seed(100)

n = nrow(trainData)

Cp_Value1 <- Cp(model1, S2 = summary(model1)$sigma^2)

paste("Mallows Cp value for model 1 is ", Cp_Value1)
## [1] "Mallows Cp value for model 1 is  10"
AIC_Values1 <- AIC(model1, k = 2)

paste("AIC Value  for model 1 is ",AIC_Values1)
## [1] "AIC Value  for model 1 is  1497.47653274345"
BIC_Value1 <- AIC(model1, k = log(n))

paste("BIC Value  for model 1 is ",BIC_Value1)
## [1] "BIC Value  for model 1 is  1546.27418679551"
  4. Build a new model on the training data with only the variables whose coefficients were found to be statistically significant at the 99% confidence level. Call it model2. Perform a partial F-test to compare this new model with the full model (model1). Which one would you prefer? Is it good practice to select variables based on statistical significance of individual coefficients? Explain.
set.seed(100)

model2 <- lm(logBCF~nHM + MLOGP + F04.C.O., data = trainData)

summary(model2)
## 
## Call:
## lm(formula = logBCF ~ nHM + MLOGP + F04.C.O., data = trainData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2555 -0.5097  0.0374  0.5471  4.2704 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.03076    0.07836  -0.393   0.6948    
## nHM          0.10948    0.01762   6.213 9.56e-10 ***
## MLOGP        0.60993    0.02177  28.018  < 2e-16 ***
## F04.C.O.    -0.01295    0.00745  -1.738   0.0826 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8037 on 620 degrees of freedom
## Multiple R-squared:  0.6571, Adjusted R-squared:  0.6554 
## F-statistic:   396 on 3 and 620 DF,  p-value: < 2.2e-16
anova(model2, model1)
## Analysis of Variance Table
## 
## Model 1: logBCF ~ nHM + MLOGP + F04.C.O.
## Model 2: logBCF ~ nHM + piPC09 + PCD + X2Av + MLOGP + ON1V + N.072 + B02.C.N. + 
##     F04.C.O.
##   Res.Df    RSS Df Sum of Sq     F  Pr(>F)   
## 1    620 400.51                              
## 2    614 388.70  6    11.809 3.109 0.00523 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
n = nrow(trainData)

# Cp for model2, using the full model's sigma^2 as the error-variance estimate
Cp_Value2 <- Cp(model2, S2 = summary(model1)$sigma^2)

paste("Mallows Cp value for model 2 is ", Cp_Value2)
## [1] "Mallows Cp value for model 2 is  -2.28090634678222"
AIC_Values2 <- AIC(model2, k = 2)

paste("AIC Value  for model 2 is ",AIC_Values2)
## [1] "AIC Value  for model 2 is  1504.15208114915"
BIC_Value2 <- AIC(model2, k = log(n))

paste("BIC Value  for model 2 is ",BIC_Value2)
## [1] "BIC Value  for model 2 is  1526.33283299099"

The criteria disagree: model2 has the lower BIC (1526.33 vs 1546.27), while model1 has the lower AIC (1497.48 vs 1504.15). BIC penalizes model size more heavily, so it favors the smaller model2, whereas AIC favors the full model1.

The partial F-test rejects the reduced model (p = 0.0052), so model1 is preferred at the 5% significance level. This also illustrates why it is not good practice to select variables based on the statistical significance of individual coefficients: each p-value is computed conditional on all the other variables in the model, so dropping several variables at once because each looked insignificant on its own can discard jointly useful information, as the F-test shows here.
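
The partial F-statistic reported by anova() can be reproduced by hand from the two residual sums of squares (a sketch; the degrees of freedom are taken from the ANOVA table above):

# F = ((RSS_reduced - RSS_full) / q) / (RSS_full / df_full)
rss_reduced <- sum(residuals(model2)^2)   # 400.51 from the ANOVA table
rss_full    <- sum(residuals(model1)^2)   # 388.70
F_stat <- ((rss_reduced - rss_full) / 6) / (rss_full / 614)
F_stat                                    # ~3.109
pf(F_stat, 6, 614, lower.tail = FALSE)    # ~0.0052, matching anova()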

Question 3: Stepwise Regression

  1. Perform backward stepwise regression using BIC. Allow the minimum model to be the model with only an intercept, and the full model to be model1. Display the model summary of your final model. Call it model4.
set.seed(100)

minimum_model <- lm(logBCF ~ 1, data = trainData)

model4 <- step(model1, scope = list(lower = minimum_model, upper = model1), direction = "backward", k = log(n), trace = F)

summary(model4)
## 
## Call:
## lm(formula = logBCF ~ nHM + piPC09 + MLOGP + F04.C.O., data = trainData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2611 -0.5126  0.0517  0.5353  4.3488 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.008695   0.078196  -0.111  0.91150    
## nHM          0.114029   0.017574   6.489 1.78e-10 ***
## piPC09       0.041119   0.013636   3.015  0.00267 ** 
## MLOGP        0.566473   0.025990  21.796  < 2e-16 ***
## F04.C.O.    -0.022104   0.008000  -2.763  0.00590 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7985 on 619 degrees of freedom
## Multiple R-squared:  0.662,  Adjusted R-squared:  0.6599 
## F-statistic: 303.1 on 4 and 619 DF,  p-value: < 2.2e-16
  2. How many variables are in model4? Which regression coefficients are significant at the 99% confidence level?

model4 contains four predictor variables: nHM, piPC09, MLOGP, and F04.C.O. All four coefficients are statistically significant at the 99% confidence level.
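
A quick programmatic count confirms this:

length(coef(model4)) - 1   # 4 predictors (intercept excluded)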

  3. Perform forward stepwise selection with AIC. Allow the minimum model to be the model with only an intercept, and the full model to be model1. Display the model summary of your final model. Call it model5. Do the variables included in model5 differ from the variables in model4?
set.seed(100)

minimum_Model_2 <- lm(logBCF ~ 1, data = trainData)

model5 <- step(minimum_Model_2, scope = list(lower = minimum_Model_2, upper = model1), direction = "forward", k = 2, trace = F)

summary(model5)
## 
## Call:
## lm(formula = logBCF ~ MLOGP + nHM + piPC09 + F04.C.O. + B02.C.N. + 
##     ON1V, data = trainData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2364 -0.5234  0.0421  0.5196  4.1159 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.035785   0.099454   0.360  0.71911    
## MLOGP        0.528522   0.029434  17.956  < 2e-16 ***
## nHM          0.124086   0.019083   6.502 1.63e-10 ***
## piPC09       0.042167   0.014135   2.983  0.00297 ** 
## F04.C.O.    -0.028644   0.009415  -3.042  0.00245 ** 
## B02.C.N.    -0.160204   0.073225  -2.188  0.02906 *  
## ON1V         0.098099   0.055457   1.769  0.07740 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7951 on 617 degrees of freedom
## Multiple R-squared:  0.666,  Adjusted R-squared:  0.6628 
## F-statistic: 205.1 on 6 and 617 DF,  p-value: < 2.2e-16

model5 includes more predictors than model4: model5 contains MLOGP, nHM, piPC09, F04.C.O., B02.C.N., and ON1V, whereas model4 contains only nHM, piPC09, MLOGP, and F04.C.O.

However, unlike model4, not all of the coefficients in model5 are statistically significant at the 99% confidence level.
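
The non-significant terms can be identified directly from the coefficient table (a minimal sketch using model5):

# Coefficients of model5 that fail the 1% significance threshold
p5 <- summary(model5)$coefficients[-1, 4]
names(p5)[p5 >= 0.01]   # B02.C.N. and ON1V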

  4. Compare the adjusted \(R^2\), Mallow’s Cp, AIC, and BIC of the full model (model1), the model found in Question 2 (model3), and the model found using backward selection with BIC (model4). Which model is preferred based on these criteria and why?
set.seed(100)

paste("Mallows Cp value for model 1 is ", Cp_Value1)
## [1] "Mallows Cp value for model 1 is  10"
paste("Adjusted R-Squared for Model 1 is ", summary(model1)$r.squared)
## [1] "Adjusted R-Squared for Model 1 is  0.667181132470475"
paste("AIC Value  for model 4 is ",AIC_Values1)
## [1] "AIC Value  for model 4 is  1497.47653274345"
paste("BIC Value  for model 4 is ",BIC_Value1)
## [1] "BIC Value  for model 4 is  1546.27418679551"
Cp_Value3 <- Cp(model3, S2 = summary(model3)$sigma^2)

paste("Mallows Cp value for model 3 is ", Cp_Value3)
## [1] "Mallows Cp value for model 3 is  7.00000000000023"
paste("Adjusted R-Squared for Model 3 is ", summary(model3)$r.squared)
## [1] "Adjusted R-Squared for Model 3 is  0.666034059962189"
AIC_Values3 <- AIC(model3, k = 2)

paste("AIC Value  for model 3 is ",AIC_Values3)
## [1] "AIC Value  for model 3 is  1493.62347413487"
BIC_Value3 <- AIC(model3, k = log(n))

paste("BIC Value  for model 3 is ",BIC_Value3)
## [1] "BIC Value  for model 3 is  1529.11267708182"
Cp_Value4 <- Cp(model4, S2 = summary(model4)$sigma^2)

paste("Mallows Cp value for model 4 is ", Cp_Value4)
## [1] "Mallows Cp value for model 4 is  5.00000000000034"
paste("Adjusted R-Squared for Model 4 is ", summary(model4)$r.squared)
## [1] "Adjusted R-Squared for Model 4 is  0.662034343151643"
AIC_Values4 <- AIC(model4, k = 2)

paste("AIC Value  for model 4 is ",AIC_Values4)
## [1] "AIC Value  for model 4 is  1497.05236356401"
BIC_Value4 <- AIC(model4, k = log(n))

paste("BIC Value  for model 4 is ",BIC_Value4)
## [1] "BIC Value  for model 4 is  1523.66926577422"

Model4 has the lowest Mallows Cp and the lowest BIC, model3 has the lowest AIC, and model1 has the highest adjusted R-squared. The differences in fit are small (from the model summaries, adjusted R-squared is 0.6623 for model1 versus 0.6599 for model4), so the parsimony criteria dominate: model4 is preferred because it achieves nearly the same explanatory power with the fewest predictors.
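
The criteria are easier to compare side by side; a minimal sketch assembling the values computed above (assumes model3 from Question 2 is still in scope):

# Gather the criteria computed above into one table
data.frame(model  = c("model1", "model3", "model4"),
           adj_R2 = c(summary(model1)$adj.r.squared,
                      summary(model3)$adj.r.squared,
                      summary(model4)$adj.r.squared),
           AIC    = c(AIC_Values1, AIC_Values3, AIC_Values4),
           BIC    = c(BIC_Value1, BIC_Value3, BIC_Value4))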

Question 4: Ridge Regression

  1. Perform ridge regression on the training set. Use cv.glmnet() to find the lambda value that minimizes the cross-validation error using 10 fold CV.
set.seed(100)

ridge_Regression <- cv.glmnet(x = as.matrix(trainData[,-10]), y = trainData[,10], alpha = 0, nfolds = 10)

paste("The Lambda Value that minimizes the cross-validation error using 10 fold CV is ",ridge_Regression$lambda.min)
## [1] "The Lambda Value that minimizes the cross-validation error using 10 fold CV is  0.108775011463851"
  2. List the values of the coefficients at the optimum lambda value.

References consulted for this part: https://wavedatalab.github.io/machinelearningwithr/post4.html and http://www.sthda.com/english/articles/37-model-selection-essentials-in-r/153-penalized-regression-essentials-ridge-lasso-elastic-net/

set.seed(100)

ridge_models <- glmnet(as.matrix(trainData[,-10]), trainData[,10], alpha=0, nlambda = 100)

coef(ridge_models, s = ridge_Regression$lambda.min)
## 10 x 1 sparse Matrix of class "dgCMatrix"
##                      s1
## (Intercept)  0.13841426
## nHM          0.14391877
## piPC09       0.03735762
## PCD          0.08235334
## X2Av        -0.06901352
## MLOGP        0.44403655
## ON1V         0.15770114
## N.072       -0.09683534
## B02.C.N.    -0.20919397
## F04.C.O.    -0.03177144
  3. How many variables were selected? Was this result expected? Explain.

All nine predictor variables are selected. This is expected: ridge regression does not perform variable selection; its L2 penalty shrinks the coefficient estimates toward zero but never sets them exactly to zero.
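
This can be verified by counting the non-zero coefficients at the optimal lambda (a sketch using the fits above):

# Count the non-zero ridge coefficients at lambda.min (intercept excluded)
ridge_coefs <- coef(ridge_models, s = ridge_Regression$lambda.min)
sum(ridge_coefs[-1, 1] != 0)   # 9: every predictor is retained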

Question 5: Lasso Regression

  1. Perform lasso regression on the training set. Use cv.glmnet() to find the lambda value that minimizes the cross-validation error using 10 fold CV.
set.seed(100)


Lasso_Regression <- cv.glmnet(as.matrix(trainData[,-10]), trainData[,10], alpha = 1, nfolds = 10, standardize = TRUE)

paste("The Lambda Value that minimizes the cross-validation error for Lasso Regression using 10 fold CV is ",Lasso_Regression$lambda.min)
## [1] "The Lambda Value that minimizes the cross-validation error for Lasso Regression using 10 fold CV is  0.00785443583753157"
  2. Plot the regression coefficient path.
set.seed(100)
# The coefficient path is drawn from a glmnet fit (plot() on a cv.glmnet
# object shows the CV curve and does not accept xvar)
lasso_models <- glmnet(as.matrix(trainData[,-10]), trainData[,10], alpha = 1)
plot(lasso_models, xvar = "lambda", lwd = 2)
abline(v=log(Lasso_Regression$lambda.min), col='black', lty=2)

  3. How many variables were selected? Which are they?
set.seed(100)


coef(Lasso_Regression, s = Lasso_Regression$lambda.min)
## 10 x 1 sparse Matrix of class "dgCMatrix"
##                      s1
## (Intercept)  0.02722838
## nHM          0.12543866
## piPC09       0.03387665
## PCD          0.03194878
## X2Av         .         
## MLOGP        0.52174346
## ON1V         0.09633951
## N.072       -0.05487196
## B02.C.N.    -0.13961811
## F04.C.O.    -0.02535576

Eight of the nine predictors are selected; the only variable not selected is X2Av.
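
A programmatic check of the selection (a sketch using the cross-validated fit above):

# Count the non-zero lasso coefficients and name the dropped variable
lasso_coefs <- coef(Lasso_Regression, s = Lasso_Regression$lambda.min)
sum(lasso_coefs[-1, 1] != 0)                         # 8 predictors retained
rownames(lasso_coefs)[-1][lasso_coefs[-1, 1] == 0]   # "X2Av"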

Question 6: Elastic Net

  1. Perform elastic net regression on the training set. Use cv.glmnet() to find the lambda value that minimizes the cross-validation error using 10 fold CV. Give equal weight to both penalties.
set.seed(100)

Elastic_Net <- cv.glmnet(as.matrix(trainData[,-10]),trainData[,10], alpha = 0.5, nfolds = 10)

paste("The Lambda Value that minimizes the cross-validation error for Elastic Net Regression using 10 fold CV is ",Elastic_Net$lambda.min)
## [1] "The Lambda Value that minimizes the cross-validation error for Elastic Net Regression using 10 fold CV is  0.0207662038632384"
  2. List the coefficient values at the optimal lambda. How many variables were selected? How do these variables compare to those from Lasso in Question 5?
set.seed(100)


coef(Elastic_Net, s = Elastic_Net$lambda.min)
## 10 x 1 sparse Matrix of class "dgCMatrix"
##                      s1
## (Intercept)  0.04903516
## nHM          0.12397290
## piPC09       0.03470891
## PCD          0.03060034
## X2Av         .         
## MLOGP        0.51776470
## ON1V         0.08901088
## N.072       -0.05236840
## B02.C.N.    -0.14155538
## F04.C.O.    -0.02420217

Elastic net selects the same eight variables as lasso in Question 5 (X2Av is again dropped), although the individual coefficient values differ slightly.
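
A side-by-side view of the two coefficient vectors makes the comparison explicit (a sketch using the fits above):

# Lasso and elastic net coefficients at their respective lambda.min values
round(cbind(lasso   = coef(Lasso_Regression, s = Lasso_Regression$lambda.min)[, 1],
            elastic = coef(Elastic_Net, s = Elastic_Net$lambda.min)[, 1]), 4)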

Question 7: Model comparison

  1. Predict logBCF for each of the rows in the test data using the full model, and the models found using backward stepwise regression with BIC, ridge regression, lasso regression, and elastic net. Display the first few predictions for each model.
set.seed(100)

prediction_Values_full_model <- predict(model1, newdata = testData, type = "response")

paste("The prediction for Model1 (full model) for value # ",seq(1:5), prediction_Values_full_model[1:5])
## [1] "The prediction for Model1 (full model) for value #  1 2.44647901736994"
## [2] "The prediction for Model1 (full model) for value #  2 4.33375904061707"
## [3] "The prediction for Model1 (full model) for value #  3 3.2668916868588" 
## [4] "The prediction for Model1 (full model) for value #  4 1.66476979256873"
## [5] "The prediction for Model1 (full model) for value #  5 1.95536248348231"
backward_stepwise_regression_with_BIC <- predict(model4, newdata = testData, type = "response")

paste("The prediction for Model4 (Backward Regression with BIC model) for value # ",seq(1:5), backward_stepwise_regression_with_BIC[1:5])
## [1] "The prediction for Model4 (Backward Regression with BIC model) for value #  1 2.42491603104022"
## [2] "The prediction for Model4 (Backward Regression with BIC model) for value #  2 4.35316712936592"
## [3] "The prediction for Model4 (Backward Regression with BIC model) for value #  3 3.27419195052767"
## [4] "The prediction for Model4 (Backward Regression with BIC model) for value #  4 1.29717540994159"
## [5] "The prediction for Model4 (Backward Regression with BIC model) for value #  5 2.00139850585601"
ridge_regression_predictions <- predict(ridge_models, newx = as.matrix(testData[,-10]),  s=ridge_Regression$lambda.min)

paste("The prediction for Ridge Regression for value # ",seq(1:5), ridge_regression_predictions[1:5])
## [1] "The prediction for Ridge Regression for value #  1 2.45487772173491"
## [2] "The prediction for Ridge Regression for value #  2 4.2344245114965" 
## [3] "The prediction for Ridge Regression for value #  3 3.22316616341148"
## [4] "The prediction for Ridge Regression for value #  4 1.73409395953377"
## [5] "The prediction for Ridge Regression for value #  5 1.99396666442829"
lasso_regression_predictions <- predict(Lasso_Regression, newx = as.matrix(testData[,-10]), s = Lasso_Regression$lambda.min)

paste("The prediction for Lasso Regression for value # ",seq(1:5), lasso_regression_predictions[1:5])
## [1] "The prediction for Lasso Regression for value #  1 2.44289514114691"
## [2] "The prediction for Lasso Regression for value #  2 4.31350910513579"
## [3] "The prediction for Lasso Regression for value #  3 3.26061710267137"
## [4] "The prediction for Lasso Regression for value #  4 1.61000550628355"
## [5] "The prediction for Lasso Regression for value #  5 1.93994593639576"
elastic_net_predictions <- predict(Elastic_Net, newx = as.matrix(testData[,-10]), s = Elastic_Net$lambda.min)

paste("The prediction for Elastic Net for value # ",seq(1:5), elastic_net_predictions[1:5])
## [1] "The prediction for Elastic Net for value #  1 2.44150647524178"
## [2] "The prediction for Elastic Net for value #  2 4.29645083046171"
## [3] "The prediction for Elastic Net for value #  3 3.25263824834337"
## [4] "The prediction for Elastic Net for value #  4 1.60677410735428"
## [5] "The prediction for Elastic Net for value #  5 1.93946452648376"
  2. Compare the predictions using mean squared prediction error. Which model performed the best?
set.seed(100)

Full_Model_MSPE <- mean((prediction_Values_full_model-testData[,10])^2)
Backward_Stepwise_Regression_MSPE <- mean((backward_stepwise_regression_with_BIC-testData[,10])^2)
Ridge_Regression_MSPE <- mean((ridge_regression_predictions-testData[,10])^2)
Lasso_Model_MSPE <- mean((lasso_regression_predictions-testData[,10])^2)
Elastic_Net_Predictions_MSPE <- mean((elastic_net_predictions-testData[,10])^2)

paste("MSPE - Full Model - ",Full_Model_MSPE)
## [1] "MSPE - Full Model -  0.583929593178489"
paste("MSPE - Backward Stepwise Regression", Backward_Stepwise_Regression_MSPE)
## [1] "MSPE - Backward Stepwise Regression 0.57421978017202"
paste("MSPE - Ridge Regression", Ridge_Regression_MSPE)
## [1] "MSPE - Ridge Regression 0.587783468691171"
paste("MSPE - Lasso Model MSPE", Lasso_Model_MSPE)
## [1] "MSPE - Lasso Model MSPE 0.579083172989946"
paste("MSPE - Elastic Net MSPE", Elastic_Net_Predictions_MSPE)
## [1] "MSPE - Elastic Net MSPE 0.578275004765455"

Backward stepwise regression with BIC (model4) has the lowest MSPE (0.5742), so it performs best on the test set.
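
Sorting the five MSPE values makes the ranking explicit:

# Rank all five models by test-set MSPE (smallest is best)
sort(c(Full_Model   = Full_Model_MSPE,
       Backward_BIC = Backward_Stepwise_Regression_MSPE,
       Ridge        = Ridge_Regression_MSPE,
       Lasso        = Lasso_Model_MSPE,
       Elastic_Net  = Elastic_Net_Predictions_MSPE))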

  3. Provide a table listing each method described in Question 7a and the variables selected by each method (see Lesson 5.8 for an example). Which variables were selected consistently?

Variable    Backward Stepwise (BIC)   Ridge   Lasso   Elastic Net
nHM         TRUE                      TRUE    TRUE    TRUE
piPC09      TRUE                      TRUE    TRUE    TRUE
PCD         FALSE                     TRUE    TRUE    TRUE
X2Av        FALSE                     TRUE    FALSE   FALSE
MLOGP       TRUE                      TRUE    TRUE    TRUE
ON1V        FALSE                     TRUE    TRUE    TRUE
N.072       FALSE                     TRUE    TRUE    TRUE
B02.C.N.    FALSE                     TRUE    TRUE    TRUE
F04.C.O.    TRUE                      TRUE    TRUE    TRUE

Four variables were selected consistently by all four methods: nHM, piPC09, MLOGP, and F04.C.O.