Selected molecular descriptors from the Dragon chemoinformatics application were used to predict bioconcentration factors for 779 chemicals in order to evaluate QSAR (Quantitative Structure-Activity Relationship) modeling. The dataset was obtained from the UCI Machine Learning Repository.
The dataset consists of 779 observations of 10 attributes: nine molecular descriptors (nHM, piPC09, PCD, X2Av, MLOGP, ON1V, N.072, B02[C-N], and F04[C-O]) and the response variable (logBCF).
Note that all predictors, with the exception of B02[C-N], are quantitative. For the purpose of this assignment, DO NOT CONVERT B02[C-N] to a factor; leave the data in its original numeric format in R.
Please load the dataset "Bio_pred" and then split it into train and test sets in an 80:20 ratio. Use the training set to build the models in Questions 1-6 and the test set to help evaluate model performance in Question 7. Please make sure that you are using R version 3.6.X or above (version 4.X is also acceptable).
# Clear variables in memory
rm(list=ls())
# Import the libraries
library(CombMSC)
library(boot)
library(leaps)
library(MASS)
library(glmnet)
library(dplyr)
# Ensure that the sampling type is correct
RNGkind(sample.kind="Rejection")
# Set a seed for reproducibility
set.seed(100)
# Read data
fullData = read.csv("Bio_pred.csv",header=TRUE)
attach(fullData)
head(fullData,5)
## nHM piPC09 PCD X2Av MLOGP ON1V N.072 B02.C.N. F04.C.O. logBCF
## 1 0 0 1.49 0.14 1.35 0.72 0 1 5 0.74
## 2 0 0 1.47 0.14 1.70 0.88 0 1 5 0.93
## 3 0 0 1.20 0.25 4.14 2.06 0 0 0 3.24
## 4 0 0 1.69 0.13 1.89 0.79 0 1 8 -0.40
## 5 0 0 0.52 0.25 2.65 1.31 0 0 0 2.24
# Split data for training and testing
testRows = sample(nrow(fullData),0.2*nrow(fullData))
testData = fullData[testRows, ]
trainData = fullData[-testRows, ]
Note: Use the training set to build the models in Questions 1-6. Use the test set to help evaluate model performance in Question 7.
model1 = lm(logBCF ~ ., data=trainData)
summary(model1)
##
## Call:
## lm(formula = logBCF ~ ., data = trainData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2577 -0.5180 0.0448 0.5117 4.0423
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.001422 0.138057 0.010 0.99179
## nHM 0.137022 0.022462 6.100 1.88e-09 ***
## piPC09 0.031158 0.020874 1.493 0.13603
## PCD 0.055655 0.063874 0.871 0.38391
## X2Av -0.031890 0.253574 -0.126 0.89996
## MLOGP 0.506088 0.034211 14.793 < 2e-16 ***
## ON1V 0.140595 0.066810 2.104 0.03575 *
## N.072 -0.073334 0.070993 -1.033 0.30202
## B02.C.N. -0.158231 0.080143 -1.974 0.04879 *
## F04.C.O. -0.030763 0.009667 -3.182 0.00154 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7957 on 614 degrees of freedom
## Multiple R-squared: 0.6672, Adjusted R-squared: 0.6623
## F-statistic: 136.8 on 9 and 614 DF, p-value: < 2.2e-16
#Answer:
At the 95% confidence level: nHM, MLOGP, ON1V, B02.C.N., and F04.C.O. are significant.
At the 99% confidence level: nHM, MLOGP, and F04.C.O. are significant.
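These can also be pulled directly from the coefficient table rather than read off by eye (a minimal sketch; pvals is a name introduced here):
# Extract p-values from the fitted model and list the significant predictors
coefs <- summary(model1)$coefficients
pvals <- coefs[-1, "Pr(>|t|)"]   # p-values with the intercept dropped
names(pvals)[pvals < 0.05]       # significant at the 95% level
names(pvals)[pvals < 0.01]       # significant at the 99% level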
set.seed(100)
library(leaps)
## full model
n = nrow(trainData)
#S2 <- summary(model1)$sigma
c(Cp(model1, S2= summary(model1)$sigma^2),
AIC(model1, k=2), AIC(model1,k=log(n)))
## [1] 10.000 1497.477 1546.274
#Answer:
Mallows' Cp = 10
AIC = 1497.477
BIC = 1546.274
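As a quick check, AIC() evaluated at k = log(n) is exactly the BIC, so it should agree with stats::BIC() (a minimal sketch, assuming no rows were dropped for missing values):
# stats::BIC() should match AIC() with k = log(n)
all.equal(BIC(model1), AIC(model1, k = log(n)))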
set.seed(100)
## reduced model vs full model
model2= lm(logBCF ~ nHM+F04.C.O.+MLOGP, data=trainData)
# Mallows' Cp, AIC, and BIC for the reduced model
c(Cp(model2, S2= summary(model2)$sigma^2),
AIC(model2, k=2), AIC(model2,k=log(n)))
## [1] 4.000 1504.152 1526.333
anova(model2, model1)
## Analysis of Variance Table
##
## Model 1: logBCF ~ nHM + F04.C.O. + MLOGP
## Model 2: logBCF ~ nHM + piPC09 + PCD + X2Av + MLOGP + ON1V + N.072 + B02.C.N. +
## F04.C.O.
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 620 400.51
## 2 614 388.70 6 11.809 3.109 0.00523 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#Answer
On one hand, the reduced model (model2) has a lower Mallows' Cp and a lower BIC than the full model (though a higher AIC); by these criteria we would prefer the reduced model.
On the other hand, the ANOVA partial F-test is significant (p = 0.00523), meaning at least one of the six additional predictors is useful, which would favor the full model.
These criteria point in opposite directions, which is precisely why the Professor has repeatedly cautioned in her lecture series NOT to use the statistical significance of individual regression coefficients as the guiding principle for variable selection.
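For completeness, the partial F-statistic reported by anova() can be reproduced by hand from the two residual sums of squares (a minimal sketch using the fitted objects above):
rss.red  <- sum(resid(model2)^2)                      # RSS of the reduced model
rss.full <- sum(resid(model1)^2)                      # RSS of the full model
df.extra <- model2$df.residual - model1$df.residual   # the 6 extra predictors
F.stat   <- ((rss.red - rss.full) / df.extra) / (rss.full / model1$df.residual)
F.stat                                                # should match the 3.109 reported above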
Hint: You can use the nbest parameter.
set.seed(100)
### Search over all (2^9=512 models total)
library(leaps)
# train.x = names(trainData[,-10])  # you can pre-name the column headers here if you like
out = leaps(trainData[,-10], trainData[,10], method = "Cp", nbest=1, names=names(trainData[,-10]))
cp.table <- cbind(as.matrix(out$which), out$Cp)  # selection indicators plus the Cp value for each model size
cp.table
## nHM piPC09 PCD X2Av MLOGP ON1V N.072 B02.C.N. F04.C.O.
## 1 0 0 0 0 1 0 0 0 0 58.596851
## 2 1 0 0 0 1 0 0 0 0 17.737801
## 3 1 1 0 0 1 0 0 0 0 15.184626
## 4 1 1 0 0 1 0 0 0 1 9.495041
## 5 1 1 0 0 1 0 0 1 1 7.240754
## 6 1 1 0 0 1 1 0 1 1 6.116174
## 7 1 1 0 0 1 1 1 1 1 6.831852
## 8 1 1 1 0 1 1 1 1 1 8.015816
## 9 1 1 1 1 1 1 1 1 1 10.000000
#Answer
set.seed(100)
best.model = which(out$Cp==min(out$Cp))
cbind(as.matrix(out$which),out$Cp)[best.model,]
## nHM piPC09 PCD X2Av MLOGP ON1V N.072 B02.C.N.
## 1.000000 1.000000 0.000000 0.000000 1.000000 1.000000 0.000000 1.000000
## F04.C.O.
## 1.000000 6.116174
# Fit a linear model on the variables selected by the lowest Mallows' Cp
model3 <- lm(logBCF ~ nHM + piPC09 + MLOGP + ON1V + B02.C.N. + F04.C.O.,trainData)
summary(model3)
##
## Call:
## lm(formula = logBCF ~ nHM + piPC09 + MLOGP + ON1V + B02.C.N. +
## F04.C.O., data = trainData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2364 -0.5234 0.0421 0.5196 4.1159
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.035785 0.099454 0.360 0.71911
## nHM 0.124086 0.019083 6.502 1.63e-10 ***
## piPC09 0.042167 0.014135 2.983 0.00297 **
## MLOGP 0.528522 0.029434 17.956 < 2e-16 ***
## ON1V 0.098099 0.055457 1.769 0.07740 .
## B02.C.N. -0.160204 0.073225 -2.188 0.02906 *
## F04.C.O. -0.028644 0.009415 -3.042 0.00245 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7951 on 617 degrees of freedom
## Multiple R-squared: 0.666, Adjusted R-squared: 0.6628
## F-statistic: 205.1 on 6 and 617 DF, p-value: < 2.2e-16
#Answer
The six variables in this best model are: nHM, piPC09, MLOGP, ON1V, B02.C.N., and F04.C.O.
set.seed(100)
# Backward stepwise regression
min_model <- lm(logBCF ~ 1, trainData)  # intercept-only model
model4 <- step(model1, scope = list(lower=min_model, upper=model1), direction = "backward", k=log(n), trace=F)  # k=log(n) uses BIC
cat("Model4 Summary (BIC Criterion):")
## Model4 Summary (BIC Criterion):
summary(model4)
##
## Call:
## lm(formula = logBCF ~ nHM + piPC09 + MLOGP + F04.C.O., data = trainData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2611 -0.5126 0.0517 0.5353 4.3488
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.008695 0.078196 -0.111 0.91150
## nHM 0.114029 0.017574 6.489 1.78e-10 ***
## piPC09 0.041119 0.013636 3.015 0.00267 **
## MLOGP 0.566473 0.025990 21.796 < 2e-16 ***
## F04.C.O. -0.022104 0.008000 -2.763 0.00590 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7985 on 619 degrees of freedom
## Multiple R-squared: 0.662, Adjusted R-squared: 0.6599
## F-statistic: 303.1 on 4 and 619 DF, p-value: < 2.2e-16
#Answer
set.seed(100)
# Forward stepwise regression
min_model <- lm(logBCF ~ 1, trainData)  # intercept-only model
model5 <- step(min_model, scope = list(lower=min_model, upper=model1), direction = "forward", k=2, trace=F)  # k=2 uses AIC
cat("Model5 Summary (AIC Criterion):")
## Model5 Summary (AIC Criterion):
summary(model5)
##
## Call:
## lm(formula = logBCF ~ MLOGP + nHM + piPC09 + F04.C.O. + B02.C.N. +
## ON1V, data = trainData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2364 -0.5234 0.0421 0.5196 4.1159
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.035785 0.099454 0.360 0.71911
## MLOGP 0.528522 0.029434 17.956 < 2e-16 ***
## nHM 0.124086 0.019083 6.502 1.63e-10 ***
## piPC09 0.042167 0.014135 2.983 0.00297 **
## F04.C.O. -0.028644 0.009415 -3.042 0.00245 **
## B02.C.N. -0.160204 0.073225 -2.188 0.02906 *
## ON1V 0.098099 0.055457 1.769 0.07740 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7951 on 617 degrees of freedom
## Multiple R-squared: 0.666, Adjusted R-squared: 0.6628
## F-statistic: 205.1 on 6 and 617 DF, p-value: < 2.2e-16
#Answer
library(kableExtra)
##
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
##
## group_rows
set.seed(100)
# c(Cp(model1, S2= summary(model1)$sigma^2),
# AIC(model1, k=2), AIC(model1,k=log(n)),summary(model1)$r.squared)
model <- c("Model1","Model3","Model4")
Mallow_Cp <- c(Cp(model1, S2= summary(model1)$sigma^2), Cp(model3, S2= summary(model3)$sigma^2),
Cp(model4, S2= summary(model4)$sigma^2))
AIC <- c(AIC(model1, k=2), AIC(model3, k=2), AIC(model4, k=2))
BIC <- c(AIC(model1,k=log(n)), AIC(model3,k=log(n)), AIC(model4,k=log(n)))
Adj_R_Square <- c(summary(model1)$adj.r.squared,summary(model3)$adj.r.squared,summary(model4)$adj.r.squared)
#create my dataframe
df <- data.frame(model,Adj_R_Square, Mallow_Cp, AIC,BIC)
#create kable table
df %>%
kbl(caption="Table I: Performance Measures") %>%
kable_styling(bootstrap_options = c("striped", "hover"), position="center") %>%
kable_classic(full_width = T, html_font = "Cambria") %>% row_spec(0, bold=T, background="gold")
model | Adj_R_Square | Mallow_Cp | AIC | BIC |
---|---|---|---|---|
Model1 | 0.6623027 | 10 | 1497.477 | 1546.274 |
Model3 | 0.6627864 | 7 | 1493.623 | 1529.113 |
Model4 | 0.6598504 | 5 | 1497.052 | 1523.669 |
#Answer
No single model dominates: Model3 has the highest adjusted R-squared and the lowest AIC, while Model4 has the lowest Mallows' Cp and the lowest BIC.
set.seed(100)
train.X <- trainData[,-10]
train.Y <- trainData[,10]
ridge.cv <- cv.glmnet(as.matrix(trainData[,-10]), trainData[,10], alpha=0, nfolds=10)
ridge.cv$lambda.min
## [1] 0.108775
set.seed(100)
ridge_model <- glmnet(as.matrix(train.X), train.Y, alpha=0,nlambda = 100)
coef(ridge_model, s=ridge.cv$lambda.min)
## 10 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 0.13841426
## nHM 0.14391877
## piPC09 0.03735762
## PCD 0.08235334
## X2Av -0.06901352
## MLOGP 0.44403655
## ON1V 0.15770114
## N.072 -0.09683534
## B02.C.N. -0.20919397
## F04.C.O. -0.03177144
#Answer
Ridge regression retains all 9 predictors: the coefficients are shrunk toward zero, but none are exactly zero.
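This can be confirmed by counting the nonzero coefficients at lambda.min (a minimal sketch; ridge.coefs is a name introduced here):
# Count the predictors with nonzero coefficients at the CV-optimal lambda
ridge.coefs <- as.matrix(coef(ridge_model, s = ridge.cv$lambda.min))
sum(ridge.coefs[-1, 1] != 0)   # expect 9: ridge shrinks but does not zero out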
set.seed(100)
lasso.cv <- cv.glmnet(as.matrix(train.X), train.Y, alpha=1, nfolds=10)
lasso.cv
##
## Call: cv.glmnet(x = as.matrix(train.X), y = train.Y, nfolds = 10, alpha = 1)
##
## Measure: Mean-Squared Error
##
## Lambda Index Measure SE Nonzero
## min 0.00785 54 0.6482 0.05770 8
## 1se 0.18572 20 0.7029 0.05249 2
cat('Optimal lasso lambda:',lasso.cv$lambda.min)
## Optimal lasso lambda: 0.007854436
set.seed(100)
lasso.model = glmnet(as.matrix(train.X),train.Y,alpha = 1, nlambda = 100)
## Extract coefficients at optimal lambda
coef(lasso.model, s=lasso.cv$lambda.min)
## 10 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 0.02722838
## nHM 0.12543866
## piPC09 0.03387665
## PCD 0.03194878
## X2Av .
## MLOGP 0.52174346
## ON1V 0.09633951
## N.072 -0.05487196
## B02.C.N. -0.13961811
## F04.C.O. -0.02535576
## Plot coefficient paths
plot(lasso.model,xvar="lambda", label=T, lwd=2)
abline(v=log(lasso.cv$lambda.min), col='black', lty=2, lwd=3)  # dashed line at log(lambda.min), the optimal lambda from CV
abline(v=0, col='purple', lty=1, lwd=4)  # reference line at log(lambda) = 0, i.e. lambda = 1
set.seed(100)
coef(lasso.model, s=lasso.cv$lambda.min)
## 10 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 0.02722838
## nHM 0.12543866
## piPC09 0.03387665
## PCD 0.03194878
## X2Av .
## MLOGP 0.52174346
## ON1V 0.09633951
## N.072 -0.05487196
## B02.C.N. -0.13961811
## F04.C.O. -0.02535576
cat("There were 8 variables selected, see above")
## There were 8 variables selected, see above
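The selection can also be read off programmatically; below is a small hypothetical helper (selected.vars is a name introduced here, not part of the assignment):
# List the variables a glmnet fit keeps (nonzero coefficients) at a given lambda
selected.vars <- function(fit, s) {
  cf <- as.matrix(coef(fit, s = s))
  rownames(cf)[-1][cf[-1, 1] != 0]   # drop the intercept, keep nonzero rows
}
selected.vars(lasso.model, lasso.cv$lambda.min)   # expect all predictors except X2Av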
#Answer
set.seed(100)
elastic.cv <- cv.glmnet(as.matrix(train.X), train.Y, alpha=0.5, nfolds=10)
elastic.cv$lambda.min  # optimal lambda that minimizes the cross-validation error
## [1] 0.0207662
set.seed(100)
elastic.model <- glmnet(as.matrix(train.X), train.Y, alpha=0.5)
coef(elastic.model, s=elastic.cv$lambda.min)
## 10 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 0.04903516
## nHM 0.12397290
## piPC09 0.03470891
## PCD 0.03060034
## X2Av .
## MLOGP 0.51776470
## ON1V 0.08901088
## N.072 -0.05236840
## B02.C.N. -0.14155538
## F04.C.O. -0.02420217
cat("There were 8 variables selected, like Lasso but the coefficient estimates are differnt")
## There were 8 variables selected, like Lasso but the coefficient estimates are differnt
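To make the comparison concrete, the two coefficient vectors can be placed side by side (a minimal sketch reusing the fitted objects above; comp is a name introduced here):
# Lasso and elastic-net estimates at their respective CV-optimal lambdas
comp <- cbind(as.matrix(coef(lasso.model, s = lasso.cv$lambda.min)),
              as.matrix(coef(elastic.model, s = elastic.cv$lambda.min)))
colnames(comp) <- c("lasso", "elastic.net")
comp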
#Answer
set.seed(100)
#set up test data
test.X <- testData[,-10]
test.Y <- testData[,10]
#full model = model1
pred.full <- predict(model1, test.X)
cat("full prediction:",pred.full[1:5])
## full prediction: 2.446479 4.333759 3.266892 1.66477 1.955362
#model4 = model with backward regression
pred.backward <- predict(model4, test.X)
cat("backward stepwise prediction:",pred.backward[1:5])
## backward stepwise prediction: 2.424916 4.353167 3.274192 1.297175 2.001399
#ridge
pred.ridge <- predict(ridge_model, as.matrix(test.X), s=ridge.cv$lambda.min)
cat("ridge prediction:",pred.ridge[1:5])
## ridge prediction: 2.454878 4.234425 3.223166 1.734094 1.993967
#lasso
pred.lasso <- predict(lasso.model, as.matrix(test.X), s=lasso.cv$lambda.min)
cat("lasso prediction:",pred.lasso[1:5])
## lasso prediction: 2.442895 4.313509 3.260617 1.610006 1.939946
#elastic
pred.elastic <- predict(elastic.model, as.matrix(test.X),s=elastic.cv$lambda.min)
cat("Elastic-net prediction:",pred.elastic[1:5])
## Elastic-net prediction: 2.441506 4.296451 3.252638 1.606774 1.939465
set.seed(100)
MSPE <- c(mean((pred.full-test.Y)^2), mean((pred.backward-test.Y)^2),mean((pred.ridge-test.Y)^2),
mean((pred.lasso-test.Y)^2),mean((pred.elastic-test.Y)^2))
models <- c("Full model","Backward stepwise","Ridge model", "Lasso model", "Elastic net model")
df2 <- data.frame(models,MSPE)
df2 <- df2 %>% arrange(MSPE)
#create kable table
df2 %>%
kbl(caption="Table II: Prediction Performance") %>%
kable_styling(bootstrap_options = c("striped", "hover"), position="center") %>%
kable_classic(full_width = F, html_font = "Cambria") %>% row_spec(0, bold=T, background="yellow")
models | MSPE |
---|---|
Backward stepwise | 0.5742198 |
Elastic net model | 0.5782750 |
Lasso model | 0.5790832 |
Full model | 0.5839296 |
Ridge model | 0.5877835 |
cat("Backward Stepwise had the lowest MSPE:", min(df2$MSPE))
## Backward Stepwise had the lowest MSPE: 0.5742198
#Answer
Backward stepwise (model4) performed best, with the lowest MSPE (0.5742).
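The repeated MSPE formula above could also be wrapped in a small helper (a sketch; mspe is a name introduced here):
# Mean squared prediction error on the held-out test set
mspe <- function(pred, actual) mean((pred - actual)^2)
mspe(pred.backward, test.Y)   # should reproduce the 0.5742 in Table II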
Variable | Backward Stepwise | Ridge | Lasso | Elastic Net
---|---|---|---|---
nHM | x | x | x | x
piPC09 | x | x | x | x
PCD |  | x | x | x
X2Av |  | x |  |
MLOGP | x | x | x | x
ON1V |  | x | x | x
N.072 |  | x | x | x
B02.C.N. |  | x | x | x
F04.C.O. | x | x | x | x
#Answer
Ridge retains all nine predictors; lasso and elastic net both drop only X2Av; backward stepwise with BIC is the most parsimonious, keeping nHM, piPC09, MLOGP, and F04.C.O.