Consider the Boston dataset, which is available in R (in the MASS package). The list of variables can be seen in the str() output below. Our variable of interest to predict is medv. We will be working with the entire data set in this example, so there is no need for a test/train split.
Examine the data with the str() and head() commands in R. This step helps you identify whether each variable is numeric or categorical.
NOTE - Convert the variable chas from int to factor (0 = tract does not bound the river, 1 = it does).
NOTE - With only 506 rows, the LOOCV method should not be computationally costly.
library(MASS)   # the Boston data set lives in the MASS package
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
Boston$chas <- as.factor(Boston$chas)
head(Boston)
## crim zn indus chas nox rm age dis rad tax ptratio black lstat
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94
## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33
## 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21
## medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7
nrow(Boston)
## [1] 506
ncol(Boston)
## [1] 14
Consider 2 multiple linear regression models where medv is the dependent variable:
Model 1: medv ~ . (all predictors)
Model 2: medv ~ rm + ptratio
Using 3 cross-validation techniques (leave-one-out, k-fold with K=5, and k-fold with K=10), determine which model provides the best predictive performance. (Hint: glm and cv.glm)
NOTE - Model 1, which includes all the predictors (not just rm + ptratio), provides the best predictive performance, with lower scores for LOOCV, 5-fold CV, 10-fold CV, and MSE. See the table below for further details.
library(boot)   # for cv.glm
modelcv1 <- glm(medv~crim+zn+indus+factor(chas)+nox+rm+age+dis+rad+tax+ptratio, data=Boston)
modelcv2 <- glm(medv~rm+ptratio, data=Boston)
Model 1:
loocv1 <- cv.glm(Boston, modelcv1)
loocv1$delta[1]
## [1] 29.48657
Model 2:
loocv2 <- cv.glm(Boston, modelcv2)
loocv2$delta[1]
## [1] 37.70066
Model 1, 5-fold:
kcv1 <- cv.glm(Boston, modelcv1, K=5)
kcv1$delta[1]
## [1] 30.75023
Model 1, 10-fold:
kcv2 <- cv.glm(Boston, modelcv1, K=10)
kcv2$delta[1]
## [1] 29.83194
Model 2, 5-fold:
kcv3 <- cv.glm(Boston, modelcv2, K=5)
kcv3$delta[1]
## [1] 38.0965
Model 2, 10-fold:
kcv4 <- cv.glm(Boston, modelcv2, K=10)
kcv4$delta[1]
## [1] 37.8784
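As a sanity check, the LOOCV estimate for Model 2 can also be computed by hand. The sketch below is a minimal illustration, assuming the Boston data is loaded; its result should be very close to loocv2$delta[1] above.
# Hand-rolled LOOCV for Model 2 (medv ~ rm + ptratio): fit n models, each one
# leaving out a single observation, then average the n squared prediction errors.
n <- nrow(Boston)
errs <- numeric(n)
for (i in 1:n) {
  fit  <- lm(medv ~ rm + ptratio, data = Boston[-i, ])
  pred <- predict(fit, newdata = Boston[i, , drop = FALSE])
  errs[i] <- (Boston$medv[i] - pred)^2
}
mean(errs)   # LOOCV estimate of test MSE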
## Prediction Accuracy of the Models - LOOCV, K-Fold CV, and MSE Comparisons (Using kable)
m1=lm(medv~crim+zn+indus+factor(chas)+nox+rm+age+dis+rad+tax+ptratio, data=Boston)
m2=lm(medv~rm+ptratio, data=Boston)
library(kableExtra)
Models=c("Model1","Model2")
LOOCV = c(loocv1$delta[1], loocv2$delta[1])
k5.Fold.cv = c(kcv1$delta[1], kcv3$delta[1])
k10.Fold.cv = c(kcv2$delta[1], kcv4$delta[1])
MSE=c(mean(m1$residuals^2), mean(m2$residuals^2))
text_tbl <- data.frame(Models, LOOCV, k5.Fold.cv, k10.Fold.cv, MSE)
kable(text_tbl) %>%
kable_styling(bootstrap_options = "striped", full_width = F)
| Models | LOOCV | k5.Fold.cv | k10.Fold.cv | MSE |
|---|---|---|---|---|
| Model1 | 29.48657 | 30.75023 | 29.83194 | 27.83194 |
| Model2 | 37.70066 | 38.09650 | 37.87840 | 37.03879 |
Using the 3 model selection algorithms (subset selection, forward selection, and backward selection), identify the best combination of independent variables (you need to use all independent variables as potential candidates).
Choose the model(s) that minimize BIC and Cp.
Estimate the best model(s) and calculate the MSE using 5-fold cross-validation. (Hint: ...$delta[1])
library(leaps)   # for regsubsets
model <- regsubsets(medv~crim+zn+indus+factor(chas)+nox+rm+age+dis+rad+tax+ptratio,
                    Boston, method="exhaustive", nvmax=15)
which.max(summary(model)$adjr2)
## [1] 10
which.min(summary(model)$bic)
## [1] 10
which.min(summary(model)$cp)
## [1] 10
plot(model)
summary(model)
## Subset selection object
## Call: regsubsets.formula(medv ~ crim + zn + indus + factor(chas) +
## nox + rm + age + dis + rad + tax + ptratio, Boston, method = "exhaustive",
## nvmax = 15)
## 11 Variables (and intercept)
## Forced in Forced out
## crim FALSE FALSE
## zn FALSE FALSE
## indus FALSE FALSE
## factor(chas)1 FALSE FALSE
## nox FALSE FALSE
## rm FALSE FALSE
## age FALSE FALSE
## dis FALSE FALSE
## rad FALSE FALSE
## tax FALSE FALSE
## ptratio FALSE FALSE
## 1 subsets of each size up to 11
## Selection Algorithm: exhaustive
## crim zn indus factor(chas)1 nox rm age dis rad tax ptratio
## 1 ( 1 ) " " " " " " " " " " "*" " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " "*" " " " " " " " " "*"
## 3 ( 1 ) " " " " " " " " "*" "*" " " " " " " " " "*"
## 4 ( 1 ) " " " " " " " " "*" "*" " " "*" " " " " "*"
## 5 ( 1 ) "*" " " " " " " "*" "*" " " "*" " " " " "*"
## 6 ( 1 ) "*" " " " " "*" "*" "*" " " "*" " " " " "*"
## 7 ( 1 ) "*" " " " " "*" "*" "*" "*" "*" " " " " "*"
## 8 ( 1 ) "*" "*" " " "*" "*" "*" "*" "*" " " " " "*"
## 9 ( 1 ) "*" " " " " "*" "*" "*" "*" "*" "*" "*" "*"
## 10 ( 1 ) "*" "*" " " "*" "*" "*" "*" "*" "*" "*" "*"
## 11 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
par(mfrow=c(2,2))
plot(summary(model)$rsq, type="o", ylab="R-Squared", xlab="")
plot(summary(model)$adjr2, type="o", ylab="Adj-R-Squared", xlab="")
plot(summary(model)$bic, type="o", ylab="BIC", xlab="")
plot(summary(model)$cp, type="o", ylab="Cp", xlab="")
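To read off the size-10 model chosen by the exhaustive search directly (rather than from the plots), coef() and summary()$which can be applied to the regsubsets object. A short sketch, assuming the model object fitted above:
# Coefficient estimates of the best 10-variable model from the exhaustive search
coef(model, 10)
# Logical indicators of which variables the 10-variable model includes
summary(model)$which[10, ]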
model1 <- regsubsets(medv~crim+zn+indus+factor(chas)+nox+rm+age+dis+rad+tax+ptratio,
Boston, nvmax=19, method="forward")
which.max(summary(model1)$adjr2)
## [1] 11
which.min(summary(model1)$bic)
## [1] 7
which.min(summary(model1)$cp)
## [1] 11
plot(model1)
summary(model1)
## Subset selection object
## Call: regsubsets.formula(medv ~ crim + zn + indus + factor(chas) +
## nox + rm + age + dis + rad + tax + ptratio, Boston, nvmax = 19,
## method = "forward")
## 11 Variables (and intercept)
## Forced in Forced out
## crim FALSE FALSE
## zn FALSE FALSE
## indus FALSE FALSE
## factor(chas)1 FALSE FALSE
## nox FALSE FALSE
## rm FALSE FALSE
## age FALSE FALSE
## dis FALSE FALSE
## rad FALSE FALSE
## tax FALSE FALSE
## ptratio FALSE FALSE
## 1 subsets of each size up to 11
## Selection Algorithm: forward
## crim zn indus factor(chas)1 nox rm age dis rad tax ptratio
## 1 ( 1 ) " " " " " " " " " " "*" " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " "*" " " " " " " " " "*"
## 3 ( 1 ) " " " " " " " " "*" "*" " " " " " " " " "*"
## 4 ( 1 ) " " " " " " " " "*" "*" " " "*" " " " " "*"
## 5 ( 1 ) "*" " " " " " " "*" "*" " " "*" " " " " "*"
## 6 ( 1 ) "*" " " " " "*" "*" "*" " " "*" " " " " "*"
## 7 ( 1 ) "*" " " " " "*" "*" "*" "*" "*" " " " " "*"
## 8 ( 1 ) "*" "*" " " "*" "*" "*" "*" "*" " " " " "*"
## 9 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" " " " " "*"
## 10 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" " " "*"
## 11 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
par(mfrow=c(2,2))
plot(summary(model1)$rsq, type="o", ylab="R-Squared", xlab="")
plot(summary(model1)$adjr2, type="o", ylab="Adj-R-Squared", xlab="")
plot(summary(model1)$bic, type="o", ylab="BIC", xlab="")
plot(summary(model1)$cp, type="o", ylab="Cp", xlab="")
model2<-regsubsets(medv~crim+zn+indus+factor(chas)+nox+rm+age+dis+rad+tax+ptratio,
Boston, nvmax=19, method="backward")
which.max(summary(model2)$adjr2)
## [1] 10
which.min(summary(model2)$bic)
## [1] 10
which.min(summary(model2)$cp)
## [1] 10
plot(model2)
summary(model2)
## Subset selection object
## Call: regsubsets.formula(medv ~ crim + zn + indus + factor(chas) +
## nox + rm + age + dis + rad + tax + ptratio, Boston, nvmax = 19,
## method = "backward")
## 11 Variables (and intercept)
## Forced in Forced out
## crim FALSE FALSE
## zn FALSE FALSE
## indus FALSE FALSE
## factor(chas)1 FALSE FALSE
## nox FALSE FALSE
## rm FALSE FALSE
## age FALSE FALSE
## dis FALSE FALSE
## rad FALSE FALSE
## tax FALSE FALSE
## ptratio FALSE FALSE
## 1 subsets of each size up to 11
## Selection Algorithm: backward
## crim zn indus factor(chas)1 nox rm age dis rad tax ptratio
## 1 ( 1 ) " " " " " " " " " " "*" " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " "*" " " " " " " " " "*"
## 3 ( 1 ) " " " " " " " " "*" "*" " " " " " " " " "*"
## 4 ( 1 ) " " " " " " " " "*" "*" " " "*" " " " " "*"
## 5 ( 1 ) "*" " " " " " " "*" "*" " " "*" " " " " "*"
## 6 ( 1 ) "*" " " " " "*" "*" "*" " " "*" " " " " "*"
## 7 ( 1 ) "*" " " " " "*" "*" "*" "*" "*" " " " " "*"
## 8 ( 1 ) "*" " " " " "*" "*" "*" "*" "*" "*" " " "*"
## 9 ( 1 ) "*" " " " " "*" "*" "*" "*" "*" "*" "*" "*"
## 10 ( 1 ) "*" "*" " " "*" "*" "*" "*" "*" "*" "*" "*"
## 11 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
par(mfrow=c(2,2))
plot(summary(model2)$rsq, type="o", ylab="R-Squared", xlab="")
plot(summary(model2)$adjr2, type="o", ylab="Adj-R-Squared", xlab="")
plot(summary(model2)$bic, type="o", ylab="BIC", xlab="")
plot(summary(model2)$cp, type="o", ylab="Cp", xlab="")
## Estimate the Best Models (Sizes 10, 11, and 7, as Determined Above) and Calculate MSEs Using 5-Fold CV
*NOTE - Best Subset Selection (max adj. R^2 = 10, min BIC = 10, min Cp = 10). Model_10_Subset = crim, zn, factor(chas), nox, rm, age, dis, rad, tax, ptratio
*NOTE - Best Forward Stepwise Selection (max adj. R^2 = 11, min BIC = 7, min Cp = 11). Model_11_Forward = crim, zn, indus, factor(chas), nox, rm, age, dis, rad, ptratio; Model_7_Forward = crim, zn, indus, factor(chas), nox, rm, age, dis, ptratio
*NOTE - Best Backward Stepwise Selection (max adj. R^2 = 10, min BIC = 10, min Cp = 10). Model_10_Backward = crim, zn, indus, factor(chas), nox, rm, age, dis, rad, tax, ptratio
#Estimate Model_10_Subset and Calculate MSE using 5-fold CV
Model_10_Subset_lm <- lm(medv~crim+zn+factor(chas)+nox+rm+age+dis+rad+tax+ptratio, data=Boston)
Model_10_Subset_glm <- glm(medv~crim+zn+factor(chas)+nox+rm+age+dis+rad+tax+ptratio, data=Boston)
Model_10_Subset_lm
##
## Call:
## lm(formula = medv ~ crim + zn + factor(chas) + nox + rm + age +
## dis + rad + tax + ptratio, data = Boston)
##
## Coefficients:
## (Intercept) crim zn factor(chas)1 nox
## 27.31009 -0.18360 0.04008 3.43034 -22.90594
## rm age dis rad tax
## 6.11392 -0.04543 -1.55523 0.26724 -0.01336
## ptratio
## -1.00820
kcv_Model_10_Subset_glm <- cv.glm(Boston, Model_10_Subset_glm, K=5)
kcv_Model_10_Subset_glm$delta[1]
## [1] 30.53076
#Estimate Model_11_Forward and Calculate MSE using 5-fold CV
Model_11_Forward_lm <- lm(medv~crim+zn+indus+factor(chas)+nox+rm+age+dis+rad+ptratio, data=Boston)
Model_11_Forward_glm <- glm(medv~crim+zn+indus+factor(chas)+nox+rm+age+dis+rad+ptratio, data=Boston)
Model_11_Forward_lm
##
## Call:
## lm(formula = medv ~ crim + zn + indus + factor(chas) + nox +
## rm + age + dis + rad + ptratio, data = Boston)
##
## Coefficients:
## (Intercept) crim zn indus factor(chas)1
## 25.51530 -0.18289 0.02918 -0.12985 3.82859
## nox rm age dis rad
## -23.12157 6.16478 -0.04624 -1.59654 0.08450
## ptratio
## -1.02772
kcv_Model_11_Forward_glm <- cv.glm(Boston, Model_11_Forward_glm, K=5)
kcv_Model_11_Forward_glm$delta[1]
## [1] 30.1764
#Estimate Model_7_Forward and Calculate MSE using 5-fold CV
Model_7_Forward_lm <- lm(medv~crim+zn+indus+factor(chas)+nox+rm+age+dis+ptratio, data=Boston)
Model_7_Forward_glm <- glm(medv~crim+zn+indus+factor(chas)+nox+rm+age+dis+ptratio, data=Boston)
Model_7_Forward_lm
##
## Call:
## lm(formula = medv ~ crim + zn + indus + factor(chas) + nox +
## rm + age + dis + ptratio, data = Boston)
##
## Coefficients:
## (Intercept) crim zn indus factor(chas)1
## 21.71783 -0.15166 0.03257 -0.11417 3.84178
## nox rm age dis ptratio
## -20.33621 6.29314 -0.04830 -1.59433 -0.91617
kcv_Model_7_Forward_glm <- cv.glm(Boston, Model_7_Forward_glm, K=5)
kcv_Model_7_Forward_glm$delta[1]
## [1] 30.65317
#Estimate Model_10_Backward and Calculate MSE using 5-fold CV
Model_10_Backward_lm <- lm(medv~crim+zn+indus+factor(chas)+nox+rm+age+dis+rad+tax+ptratio, data=Boston)
Model_10_Backward_glm <- glm(medv~crim+zn+indus+factor(chas)+nox+rm+age+dis+rad+tax+ptratio, data=Boston)
Model_10_Backward_lm
##
## Call:
## lm(formula = medv ~ crim + zn + indus + factor(chas) + nox +
## rm + age + dis + rad + tax + ptratio, data = Boston)
##
## Coefficients:
## (Intercept) crim zn indus factor(chas)1
## 27.15237 -0.18403 0.03910 -0.04232 3.48753
## nox rm age dis rad
## -22.18211 6.07574 -0.04519 -1.58385 0.25472
## tax ptratio
## -0.01221 -0.99621
kcv_Model_10_Backward_glm <- cv.glm(Boston, Model_10_Backward_glm, K=5)
kcv_Model_10_Backward_glm$delta[1]
## [1] 29.32097
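To make the four candidates easier to compare, their 5-fold CV estimates can be collected into one table, mirroring the kable comparison used earlier. A minimal sketch, assuming the kcv_* objects computed above are still in the workspace:
library(kableExtra)
# Gather the 5-fold CV MSE estimates computed above into a single comparison table
cv_tbl <- data.frame(
  Models = c("Model_10_Subset", "Model_11_Forward", "Model_7_Forward", "Model_10_Backward"),
  k5.Fold.cv = c(kcv_Model_10_Subset_glm$delta[1],
                 kcv_Model_11_Forward_glm$delta[1],
                 kcv_Model_7_Forward_glm$delta[1],
                 kcv_Model_10_Backward_glm$delta[1]))
kable(cv_tbl) %>%
  kable_styling(bootstrap_options = "striped", full_width = F)
In this run, Model_10_Backward has the lowest 5-fold CV estimate (about 29.3), though the exact numbers shift slightly with the random fold assignment.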
Using all the variables (except medv), the complete data set, and the same grid used in lecture, find the best lambda based on 10-fold cross-validation.
Estimate a lasso regression using the best lambda, predict using the whole data set (Hint: newx=x), and get the MSE.
library(glmnet)   # for glmnet and cv.glmnet
# Omit missing values:
Boston <- na.omit(Boston)
# Create the model matrix and the grid of candidate lambda values
x <- model.matrix(medv~crim+zn+indus+factor(chas)+nox+rm+age+dis+rad+tax+ptratio, Boston)[,-1]
grid <- 10^seq(10, -2, length=100)
# Train (80%) - Test (20%) Split
set.seed(123)
train <- sample(1:nrow(x), size=0.8*nrow(x))
test <- setdiff(1:nrow(x), train)
# Cross-Validation for Lasso
cv.out <- cv.glmnet(x[train,], Boston$medv[train], alpha=1, lambda=grid, nfolds=10)
plot(cv.out)
# Optimal Lambda Determination
bestlam=cv.out$lambda.min
print(bestlam)
## [1] 0.01
print(log(bestlam))
## [1] -4.60517
print(bestlam %in% grid)
## [1] TRUE
out <- glmnet(x[train, ], Boston$medv[train], alpha=1, lambda=bestlam)
lasso_pred <- predict(out, s=bestlam, newx=x[test, ])
lasso_coefs <- coef(out, s=bestlam)
lasso_coefs_matrix <- as.matrix(lasso_coefs)
num_coefs <- nrow(lasso_coefs_matrix)
print(lasso_coefs_matrix)
## s1
## (Intercept) 27.48993315
## crim -0.17930031
## zn 0.03824751
## indus -0.02473282
## factor(chas)1 3.55577963
## nox -22.56939957
## rm 6.03929936
## age -0.03783326
## dis -1.49138918
## rad 0.26012025
## tax -0.01140096
## ptratio -1.06801318
#Running a regression model for comparison
out2<-glmnet(x,Boston$medv, lambda=0)
regpred<-predict(out2, s=0, newx=x[test,], exact=T)
Methods=c("Lasso", "Regression")
Testing.MSE=c(mean((lasso_pred-Boston$medv[test])^2), mean((regpred-Boston$medv[test])^2))
tbl <- data.frame(Methods, Testing.MSE)
tbll<-kable(tbl, format = "html")
kable_styling(tbll, bootstrap_options = c("striped", "hover"))
| Methods | Testing.MSE |
|---|---|
| Lasso | 24.12722 |
| Regression | 23.45699 |
Using the whole data set, estimate a principal components regression. (Hint: no need to use the subset=train argument.)
Predict medv using 5 principal components and calculate the MSE.
library(pls)   # for pcr and validationplot
pcrfit <- pcr(medv~crim+zn+indus+factor(chas)+nox+rm+age+dis+rad+tax+ptratio, data=Boston, scale=TRUE, validation="CV")
summary(pcrfit)
## Data: X dimension: 506 11
## Y dimension: 506 1
## Fit method: svdpc
## Number of components considered: 11
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 9.206 7.660 7.131 6.301 6.225 5.753 5.758
## adjCV 9.206 7.657 7.117 6.295 6.221 5.746 5.751
## 7 comps 8 comps 9 comps 10 comps 11 comps
## CV 5.688 5.698 5.644 5.479 5.438
## adjCV 5.681 5.691 5.637 5.471 5.429
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 48.80 61.40 71.61 79.29 85.69 90.53 93.67 96.13
## medv 31.49 41.22 54.01 55.97 62.18 62.30 63.33 63.39
## 9 comps 10 comps 11 comps
## X 97.87 99.42 100.00
## medv 64.40 66.41 67.03
validationplot(pcrfit, val.type="MSEP")
pcrpredict_5_PC <- predict(pcrfit, x[test,], ncomp=5)
mse_PC_5=mean((pcrpredict_5_PC-Boston$medv[test])^2)
mse_PC_5
## [1] 27.40716
Shrinkage reduces, or fully eliminates, the influence on the response variable medv of predictor variables that have less of a direct impact on it. While the ridge shrinkage method does not eliminate any variables, the lasso shrinkage method can both shrink and eliminate predictor coefficients. This is advantageous for building a model with lower variance and easier interpretability. Note, however, that the trade-off for lower model variance is increased bias, since the model no longer accounts for as much of the fluctuation in the response variable given changes in the predictor variables.
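To make the contrast concrete, the sketch below fits a ridge and a lasso model at the same, deliberately large penalty and compares their coefficients side by side. It assumes the model matrix x built earlier, and lambda = 1 is chosen purely for illustration.
library(glmnet)
# Ridge (alpha = 0) shrinks every coefficient toward zero but keeps them all nonzero;
# lasso (alpha = 1) can drive some coefficients exactly to zero at the same penalty.
ridge_fit <- glmnet(x, Boston$medv, alpha = 0, lambda = 1)
lasso_fit <- glmnet(x, Boston$medv, alpha = 1, lambda = 1)
cmp <- cbind(as.matrix(coef(ridge_fit)), as.matrix(coef(lasso_fit)))
colnames(cmp) <- c("ridge", "lasso")
round(cmp, 4)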
The lasso shrinkage method can be applied in R with a straightforward workflow. First, ensure there are no missing values in the Boston dataset, so that the MSE of an ordinary lm model and the lasso model can be compared later. Next, create a matrix x containing all independent variables (hence the [,-1], which drops the intercept column) and a grid of candidate lambda values from 0.01 to 10^10. Split the Boston data into training and testing sets to allow for model development and independent prediction-accuracy testing, and set a seed so the results are reproducible.

Using the cv.glmnet function, perform 10-fold cross-validation to determine the optimal value of lambda (the regularization parameter for the lasso) over the candidate grid. The 10-fold procedure runs 10 fits, each using 9/10 of the available observations as training data and the remaining 1/10 as testing data. Averaging the 10 fold-level errors produces the output stored in cv.out, which can be plotted to see the average MSE at each value of lambda. The best lambda is the one that yields the lowest cross-validated MSE, and it can be found by inspecting the cv.out plot or by reading cv.out$lambda.min. Here, lambda = 0.01 produces the lowest MSE and is used to fit the penalized generalized linear model (lambda = bestlam = 0.01).

For the Boston data above, the fitted lasso model (out) uses a very small lambda, meaning the penalty on the predictor coefficients is light. As a result, even though the lasso can shrink coefficients exactly to zero at larger values of lambda, no predictors are dropped from the adjusted model: bias stays low, but little variance is removed. A lightly penalized model can be overfitted, in the sense that it tries to explain noise, fluctuations in the response that reflect happenstance rather than real changes in the predictors, and its MSE can then be higher on unseen test data. While the coefficients of the predictors in the Boston model above were shrunk slightly by the lasso, none were eliminated. The test MSE of the lasso model is 24.13, while the unpenalized regression model has a slightly lower test MSE of 23.46. This comparison is plausible given that the Boston data has 506 observations and only 14 variables; shrinkage tends to matter most when p, the number of predictors, approaches or exceeds n, the number of observations.
In this module 4 exercise, I learned how to better fit a regression or classification model so that it better represents the predictors' relationship with the response variable, using a variety of dimension reduction and shrinkage methods. I also learned how to test a regression model using leave-one-out cross-validation (LOOCV) and k-fold cross-validation (KFCV). LOOCV fits the model n times, each time training on n-1 observations and testing on the single held-out observation, then averages the n resulting errors. LOOCV can be a computationally expensive way to train and test, but it gives a nearly unbiased estimate of how well the underlying model will predict new observations. K-fold CV is much less computationally costly and involves 5 or 10 fits (depending on whether K=5 or K=10 is selected), each trained on roughly 80% or 90% of the data with the remaining 20% or 10% serving as the test fold. The result is an increased level of assurance that the underlying model will be able to predict new, unseen data with little error (MSE).

Furthermore, the concepts of bias and variance were explored in greater detail, particularly in the context of shrinkage (lasso and ridge) and dimension reduction (principal components regression [PCR] and partial least squares [PLS]) techniques. Efforts to reduce variance with the objective of minimizing overfitting will often result in increased bias, as the model no longer explains the same amount of change in the response variable. Statistical methods such as lasso, ridge, PLS, and PCR are powerful tools for adjusting how a regression model is fitted to the underlying data, and R greatly simplifies the math and processes involved in these lengthy modeling techniques. Understanding when and how to use R packages such as pls and glmnet, as well as the statistical and mathematical processes behind them, can add value to an organization looking to maintain a competitive advantage in a global marketplace dominated by competitors that can leverage the potential of big data.
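As a concrete illustration of the k-fold procedure described above, the sketch below computes 5-fold CV for Model 2 (medv ~ rm + ptratio) by hand. It assumes the Boston data is loaded; the estimate will vary slightly with the random fold assignment.
# Hand-rolled 5-fold CV: randomly assign each row to one of 5 folds, fit on the
# other 4 folds, predict the held-out fold, and average the 5 fold-level MSEs.
set.seed(1)
K <- 5
folds <- sample(rep(1:K, length.out = nrow(Boston)))
fold_mse <- numeric(K)
for (k in 1:K) {
  fit <- lm(medv ~ rm + ptratio, data = Boston[folds != k, ])
  pred <- predict(fit, newdata = Boston[folds == k, ])
  fold_mse[k] <- mean((Boston$medv[folds == k] - pred)^2)
}
mean(fold_mse)   # 5-fold estimate of test MSE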