Selected molecular descriptors from the Dragon chemoinformatics application were used to predict bioconcentration factors for 779 chemicals in order to evaluate QSAR (Quantitative Structure Activity Relationship). This dataset was obtained from the UCI machine learning repository.
The dataset consists of 779 observations of 10 attributes. Below is a brief description of each feature and the response variable (logBCF) in our dataset:
Note that all predictors with the exception of B02[C-N] are quantitative. For the purpose of this assignment, DO NOT CONVERT B02[C-N] to factor. Leave the data in its original format - numeric in R.
Please load the dataset “Bio_pred” and then split the dataset into a train and test set in a 80:20 ratio. Use the training set to build the models in Questions 1-6. Use the test set to help evaluate model performance in Question 7. Please make sure that you are using R version 3.6.X or above (i.e. version 4.X is also acceptable).
# Clear variables in memory
rm(list=ls())
# Import the libraries
library(CombMSC)
library(boot)
library(leaps)
library(MASS)
library(glmnet)
# Ensure that the sampling type is correct
RNGkind(sample.kind="Rejection")
# Set a seed for reproducibility
set.seed(100)
# Read data
fullData = read.csv("Bio_pred.csv",header=TRUE)
# Split data for traIning and testing
testRows = sample(nrow(fullData),0.2*nrow(fullData))
testData = fullData[testRows, ]
trainData = fullData[-testRows, ]
Note: Use the training set to build the models in Questions 1-6. Use the test set to help evaluate model performance in Question 7.
model1 = lm(logBCF ~ ., data=trainData)
summary(model1)
##
## Call:
## lm(formula = logBCF ~ ., data = trainData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2577 -0.5180 0.0448 0.5117 4.0423
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.001422 0.138057 0.010 0.99179
## nHM 0.137022 0.022462 6.100 1.88e-09 ***
## piPC09 0.031158 0.020874 1.493 0.13603
## PCD 0.055655 0.063874 0.871 0.38391
## X2Av -0.031890 0.253574 -0.126 0.89996
## MLOGP 0.506088 0.034211 14.793 < 2e-16 ***
## ON1V 0.140595 0.066810 2.104 0.03575 *
## N.072 -0.073334 0.070993 -1.033 0.30202
## B02.C.N. -0.158231 0.080143 -1.974 0.04879 *
## F04.C.O. -0.030763 0.009667 -3.182 0.00154 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7957 on 614 degrees of freedom
## Multiple R-squared: 0.6672, Adjusted R-squared: 0.6623
## F-statistic: 136.8 on 9 and 614 DF, p-value: < 2.2e-16
At 95% confidence level, we have the following coefficients as significant – nHM, MLOGP, ON1V, B02.C.N., F04.C.0
At 99% confidence level, we have the following coefficients as significant – nHM, MLOGP and F04.C.0
set.seed(100)
n = nrow(trainData)
Cp_Value1 <- Cp(model1, S2 = summary(model1)$sigma^2)
paste("Mallows Cp value for model 1 is ", Cp_Value1)
## [1] "Mallows Cp value for model 1 is 10"
AIC_Values1 <- AIC(model1, k = 2)
paste("AIC Value for model 1 is ",AIC_Values1)
## [1] "AIC Value for model 1 is 1497.47653274345"
BIC_Value1 <- AIC(model1, k = log(n))
paste("BIC Value for model 1 is ",BIC_Value1)
## [1] "BIC Value for model 1 is 1546.27418679551"
set.seed(100)
model2 <- lm(logBCF~nHM + MLOGP + F04.C.O., data = trainData)
summary(model2)
##
## Call:
## lm(formula = logBCF ~ nHM + MLOGP + F04.C.O., data = trainData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2555 -0.5097 0.0374 0.5471 4.2704
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.03076 0.07836 -0.393 0.6948
## nHM 0.10948 0.01762 6.213 9.56e-10 ***
## MLOGP 0.60993 0.02177 28.018 < 2e-16 ***
## F04.C.O. -0.01295 0.00745 -1.738 0.0826 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8037 on 620 degrees of freedom
## Multiple R-squared: 0.6571, Adjusted R-squared: 0.6554
## F-statistic: 396 on 3 and 620 DF, p-value: < 2.2e-16
anova(model2, model1)
## Analysis of Variance Table
##
## Model 1: logBCF ~ nHM + MLOGP + F04.C.O.
## Model 2: logBCF ~ nHM + piPC09 + PCD + X2Av + MLOGP + ON1V + N.072 + B02.C.N. +
## F04.C.O.
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 620 400.51
## 2 614 388.70 6 11.809 3.109 0.00523 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
n = nrow(trainData)
Cp_Value2 <- Cp(model1, S2 = summary(model2)$sigma^2)
paste("Mallows Cp value for model 2 is ", Cp_Value2)
## [1] "Mallows Cp value for model 2 is -2.28090634678222"
AIC_Values2 <- AIC(model2, k = 2)
paste("AIC Value for model 2 is ",AIC_Values2)
## [1] "AIC Value for model 2 is 1504.15208114915"
BIC_Value2 <- AIC(model2, k = log(n))
paste("BIC Value for model 2 is ",BIC_Value2)
## [1] "BIC Value for model 2 is 1526.33283299099"
Here we can see that Model2 has lower Mallows CP (-2.281 vs 10) and BIC Values (1526.332 vs 1546.274) and higher AIC Values (1504.152 vs 1497.476).This means that Model2 is better than Model1.
But, when we perform ANOVA we get that Model1 is better. This means that we cannot use only ANOVA or Mallows etc. to identify which of these are better.
Hint: You can use nbest parameter.
set.seed(100)
library(leaps)
out = leaps(trainData[,-c(10)], trainData[,c(10)], method = "Cp", nbes=1, names=names(trainData[,-c(10)]))
comparison_value <- cbind(as.matrix(out$which, out$Cp))
comparison_value
## nHM piPC09 PCD X2Av MLOGP ON1V N.072 B02.C.N. F04.C.O.
## 1 FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
## 2 TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
## 3 TRUE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
## 4 TRUE TRUE FALSE FALSE TRUE FALSE FALSE FALSE TRUE
## 5 TRUE TRUE FALSE FALSE TRUE FALSE FALSE TRUE TRUE
## 6 TRUE TRUE FALSE FALSE TRUE TRUE FALSE TRUE TRUE
## 7 TRUE TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
## 8 TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE
## 9 TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
set.seed(100)
best.model <- which(out$Cp == min(out$Cp))
best.model
## [1] 6
cbind(as.matrix(out$which), out$Cp)[best.model,]
## nHM piPC09 PCD X2Av MLOGP ON1V N.072 B02.C.N.
## 1.000000 1.000000 0.000000 0.000000 1.000000 1.000000 0.000000 1.000000
## F04.C.O.
## 1.000000 6.116174
model3 <- lm(logBCF ~ nHM + piPC09 + MLOGP + ON1V + B02.C.N.+ F04.C.O., data = trainData)
summary(model3)
##
## Call:
## lm(formula = logBCF ~ nHM + piPC09 + MLOGP + ON1V + B02.C.N. +
## F04.C.O., data = trainData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2364 -0.5234 0.0421 0.5196 4.1159
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.035785 0.099454 0.360 0.71911
## nHM 0.124086 0.019083 6.502 1.63e-10 ***
## piPC09 0.042167 0.014135 2.983 0.00297 **
## MLOGP 0.528522 0.029434 17.956 < 2e-16 ***
## ON1V 0.098099 0.055457 1.769 0.07740 .
## B02.C.N. -0.160204 0.073225 -2.188 0.02906 *
## F04.C.O. -0.028644 0.009415 -3.042 0.00245 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7951 on 617 degrees of freedom
## Multiple R-squared: 0.666, Adjusted R-squared: 0.6628
## F-statistic: 205.1 on 6 and 617 DF, p-value: < 2.2e-16
The Variables that are significant are –> nHM, piPC09, MLOGP, ON1V, B02.C.N. and F04.C.O.
set.seed(100)
minimum_model <- lm(logBCF ~ 1, data = trainData)
model4 <- step(model1, scope = list(lower = minimum_model, upper = model1), direction = "backward", k = log(n), trace = F)
summary(model4)
##
## Call:
## lm(formula = logBCF ~ nHM + piPC09 + MLOGP + F04.C.O., data = trainData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2611 -0.5126 0.0517 0.5353 4.3488
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.008695 0.078196 -0.111 0.91150
## nHM 0.114029 0.017574 6.489 1.78e-10 ***
## piPC09 0.041119 0.013636 3.015 0.00267 **
## MLOGP 0.566473 0.025990 21.796 < 2e-16 ***
## F04.C.O. -0.022104 0.008000 -2.763 0.00590 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7985 on 619 degrees of freedom
## Multiple R-squared: 0.662, Adjusted R-squared: 0.6599
## F-statistic: 303.1 on 4 and 619 DF, p-value: < 2.2e-16
In model4, all the coefficients are statistically significant at 99% confidence level - 1. nHM 2. piPC09 3. MLOGP 4. F04.C.O.
set.seed(100)
minimum_Model_2 <- lm(logBCF ~ 1, data = trainData)
model5 <- step(minimum_Model_2, scope = list(lower = minimum_Model_2, upper = model1), direction = "forward", k = 2, trace = F)
summary(model5)
##
## Call:
## lm(formula = logBCF ~ MLOGP + nHM + piPC09 + F04.C.O. + B02.C.N. +
## ON1V, data = trainData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2364 -0.5234 0.0421 0.5196 4.1159
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.035785 0.099454 0.360 0.71911
## MLOGP 0.528522 0.029434 17.956 < 2e-16 ***
## nHM 0.124086 0.019083 6.502 1.63e-10 ***
## piPC09 0.042167 0.014135 2.983 0.00297 **
## F04.C.O. -0.028644 0.009415 -3.042 0.00245 **
## B02.C.N. -0.160204 0.073225 -2.188 0.02906 *
## ON1V 0.098099 0.055457 1.769 0.07740 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7951 on 617 degrees of freedom
## Multiple R-squared: 0.666, Adjusted R-squared: 0.6628
## F-statistic: 205.1 on 6 and 617 DF, p-value: < 2.2e-16
model5 has more coefficients than model4 –> Model5 has (MLOGP, nHM, piPC09, F04.C.O., B02.C.N., ON1V) coefficients and Model4 has (nHM, piPC09, ML0GP, F04.C.O.)
However, Model5 does not have all the variables as statistically significant at 99% confidence level.
set.seed(100)
paste("Mallows Cp value for model 1 is ", Cp_Value1)
## [1] "Mallows Cp value for model 1 is 10"
paste("Adjusted R-Squared for Model 1 is ", summary(model1)$r.squared)
## [1] "Adjusted R-Squared for Model 1 is 0.667181132470475"
paste("AIC Value for model 4 is ",AIC_Values1)
## [1] "AIC Value for model 4 is 1497.47653274345"
paste("BIC Value for model 4 is ",BIC_Value1)
## [1] "BIC Value for model 4 is 1546.27418679551"
Cp_Value3 <- Cp(model3, S = summary(model3)$sigma^2)
paste("Mallows Cp value for model 3 is ", Cp_Value3)
## [1] "Mallows Cp value for model 3 is 7.00000000000023"
paste("Adjusted R-Squared for Model 3 is ", summary(model3)$r.squared)
## [1] "Adjusted R-Squared for Model 3 is 0.666034059962189"
AIC_Values3 <- AIC(model3, k = 2)
paste("AIC Value for model 3 is ",AIC_Values3)
## [1] "AIC Value for model 3 is 1493.62347413487"
BIC_Value3 <- AIC(model3, k = log(n))
paste("BIC Value for model 3 is ",BIC_Value3)
## [1] "BIC Value for model 3 is 1529.11267708182"
Cp_Value4 <- Cp(model4, S2 = summary(model4)$sigma^2)
paste("Mallows Cp value for model 4 is ", Cp_Value4)
## [1] "Mallows Cp value for model 4 is 5.00000000000034"
paste("Adjusted R-Squared for Model 4 is ", summary(model4)$r.squared)
## [1] "Adjusted R-Squared for Model 4 is 0.662034343151643"
AIC_Values4 <- AIC(model4, k = 2)
paste("AIC Value for model 4 is ",AIC_Values4)
## [1] "AIC Value for model 4 is 1497.05236356401"
BIC_Value4 <- AIC(model4, k = log(n))
paste("BIC Value for model 4 is ",BIC_Value4)
## [1] "BIC Value for model 4 is 1523.66926577422"
Here you can see that Model4 has the lowest Mallows CP and the lowest BIC value. Also, Model4 has the lowest Adjusted R-Squared values which means that prediction using Model4 is going to be not so accurate.
set.seed(100)
ridge_Regression <- cv.glmnet(x = as.matrix(trainData[,-10]), trainData[,10],alpha = 0 , nfolds = 10)
paste("The Lambda Value that minimizes the cross-validation error using 10 fold CV is ",ridge_Regression$lambda.min)
## [1] "The Lambda Value that minimizes the cross-validation error using 10 fold CV is 0.108775011463851"
Used some data from here –> https://wavedatalab.github.io/machinelearningwithr/post4.html and http://www.sthda.com/english/articles/37-model-selection-essentials-in-r/153-penalized-regression-essentials-ridge-lasso-elastic-net/
set.seed(100)
ridge_models <- glmnet(as.matrix(trainData[,-10]), trainData[,10], alpha=0, nlambda = 100)
coef(ridge_models, s = ridge_Regression$lambda.min)
## 10 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 0.13841426
## nHM 0.14391877
## piPC09 0.03735762
## PCD 0.08235334
## X2Av -0.06901352
## MLOGP 0.44403655
## ON1V 0.15770114
## N.072 -0.09683534
## B02.C.N. -0.20919397
## F04.C.O. -0.03177144
Here all the variables are selected. However, this is expected. Ridge Regression is not used for variable selectionn but to reduce the coefficient estimate.
set.seed(100)
Lasso_Regression <- cv.glmnet(as.matrix(trainData[,-10]), trainData[,10], alpha = 1, nfolds = 10, nlamda = 10, standardize = TRUE)
paste("The Lambda Value that minimizes the cross-validation error for Lasso Regression using 10 fold CV is ",Lasso_Regression$lambda.min)
## [1] "The Lambda Value that minimizes the cross-validation error for Lasso Regression using 10 fold CV is 0.00785443583753157"
set.seed(100)
plot(Lasso_Regression,xvar="lambda", lwd=2)
## Warning in plot.window(...): "xvar" is not a graphical parameter
## Warning in plot.xy(xy, type, ...): "xvar" is not a graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "xvar" is not a
## graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "xvar" is not a
## graphical parameter
## Warning in box(...): "xvar" is not a graphical parameter
## Warning in title(...): "xvar" is not a graphical parameter
abline(v=log(Lasso_Regression$lambda.min), col='black', lty=2)
set.seed(100)
coef(Lasso_Regression, s = Lasso_Regression$lambda.min)
## 10 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 0.02722838
## nHM 0.12543866
## piPC09 0.03387665
## PCD 0.03194878
## X2Av .
## MLOGP 0.52174346
## ON1V 0.09633951
## N.072 -0.05487196
## B02.C.N. -0.13961811
## F04.C.O. -0.02535576
Here only 8 are selected. The only variable that is not selected X2Av
set.seed(100)
Elastic_Net <- cv.glmnet(as.matrix(trainData[,-10]),trainData[,10], alpha = 0.5, nfolds = 10)
paste("The Lambda Value that minimizes the cross-validation error for Elastic Net Regression using 10 fold CV is ",Elastic_Net$lambda.min)
## [1] "The Lambda Value that minimizes the cross-validation error for Elastic Net Regression using 10 fold CV is 0.0207662038632384"
set.seed(100)
coef(Elastic_Net, s = Elastic_Net$lambda.min)
## 10 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 0.04903516
## nHM 0.12397290
## piPC09 0.03470891
## PCD 0.03060034
## X2Av .
## MLOGP 0.51776470
## ON1V 0.08901088
## N.072 -0.05236840
## B02.C.N. -0.14155538
## F04.C.O. -0.02420217
The variables selected in Question 5 and Question 6 are the same. In this case, values for each coefficients are different.
set.seed(100)
prediction_Values_full_model <- predict(model1, newdata = testData, type = "response")
paste("The prediction for Model1 (full model) for value # ",seq(1:5), prediction_Values_full_model[1:5])
## [1] "The prediction for Model1 (full model) for value # 1 2.44647901736994"
## [2] "The prediction for Model1 (full model) for value # 2 4.33375904061707"
## [3] "The prediction for Model1 (full model) for value # 3 3.2668916868588"
## [4] "The prediction for Model1 (full model) for value # 4 1.66476979256873"
## [5] "The prediction for Model1 (full model) for value # 5 1.95536248348231"
backward_stepwise_regression_with_BIC <- predict(model4, newdata = testData, type = "response")
paste("The prediction for Model4 (Backward Regression with BIC model) for value # ",seq(1:5), backward_stepwise_regression_with_BIC[1:5])
## [1] "The prediction for Model4 (Backward Regression with BIC model) for value # 1 2.42491603104022"
## [2] "The prediction for Model4 (Backward Regression with BIC model) for value # 2 4.35316712936592"
## [3] "The prediction for Model4 (Backward Regression with BIC model) for value # 3 3.27419195052767"
## [4] "The prediction for Model4 (Backward Regression with BIC model) for value # 4 1.29717540994159"
## [5] "The prediction for Model4 (Backward Regression with BIC model) for value # 5 2.00139850585601"
ridge_regression_predictions <- predict(ridge_models, newx = as.matrix(testData[,-10]), s=ridge_Regression$lambda.min)
paste("The prediction for Ridge Regression for value # ",seq(1:5), ridge_regression_predictions[1:5])
## [1] "The prediction for Ridge Regression for value # 1 2.45487772173491"
## [2] "The prediction for Ridge Regression for value # 2 4.2344245114965"
## [3] "The prediction for Ridge Regression for value # 3 3.22316616341148"
## [4] "The prediction for Ridge Regression for value # 4 1.73409395953377"
## [5] "The prediction for Ridge Regression for value # 5 1.99396666442829"
lasso_regression_predictions <- predict(Lasso_Regression, newx = as.matrix(testData[,-10]), s = Lasso_Regression$lambda.min)
paste("The prediction for Lasso Regression for value # ",seq(1:5), lasso_regression_predictions[1:5])
## [1] "The prediction for Lasso Regression for value # 1 2.44289514114691"
## [2] "The prediction for Lasso Regression for value # 2 4.31350910513579"
## [3] "The prediction for Lasso Regression for value # 3 3.26061710267137"
## [4] "The prediction for Lasso Regression for value # 4 1.61000550628355"
## [5] "The prediction for Lasso Regression for value # 5 1.93994593639576"
elastic_net_predictions <- predict(Elastic_Net, newx = as.matrix(testData[-10]), s = Elastic_Net$lambda.min)
paste("The prediction for Elastic Net for value # ",seq(1:5), elastic_net_predictions[1:5])
## [1] "The prediction for Elastic Net for value # 1 2.44150647524178"
## [2] "The prediction for Elastic Net for value # 2 4.29645083046171"
## [3] "The prediction for Elastic Net for value # 3 3.25263824834337"
## [4] "The prediction for Elastic Net for value # 4 1.60677410735428"
## [5] "The prediction for Elastic Net for value # 5 1.93946452648376"
set.seed(100)
Full_Model_MSPE <- mean((prediction_Values_full_model-testData[,10])^2)
Backward_Stepwise_Regression_MSPE <- mean((backward_stepwise_regression_with_BIC-testData[,10])^2)
Ridge_Regression_MSPE <- mean((ridge_regression_predictions-testData[,10])^2)
Lasso_Model_MSPE <- mean((lasso_regression_predictions-testData[,10])^2)
Elastic_Net_Predictions_MSPE <- mean((elastic_net_predictions-testData[,10])^2)
paste("MSPE - Full Model - ",Full_Model_MSPE)
## [1] "MSPE - Full Model - 0.583929593178489"
paste("MSPE - Backward Stepwise Regression", Backward_Stepwise_Regression_MSPE)
## [1] "MSPE - Backward Stepwise Regression 0.57421978017202"
paste("MSPE - Ridge Regression", Ridge_Regression_MSPE)
## [1] "MSPE - Ridge Regression 0.587783468691171"
paste("MSPE - Lasso Model MSPE", Lasso_Model_MSPE)
## [1] "MSPE - Lasso Model MSPE 0.579083172989946"
paste("MSPE - Elastic Net MSPE", Elastic_Net_Predictions_MSPE)
## [1] "MSPE - Elastic Net MSPE 0.578275004765455"
The lowest value of MSPE is MSPE - Backward Stepwise Regression (0.57421978017202)
| Backward Stepwise | Ridge | Lasso | Elastic Net | |
|---|---|---|---|---|
| nHM | TRUE | T | T | T |
| piPC09 | T | T | T | T |
| PCD | T | T | T | |
| X2AV | T | |||
| MLOGP | T | T | T | T |
| ON1V | T | T | T | |
| N.072 | T | T | T | |
| B02.C.N. | T | T | T | |
| F04.C.O. | T | T | T | T |