Please refer to the Red wine quality data set in this webpage ( http://www3.dsi.uminho.pt/pcortez/wine/ ) to answer the following questions. Use R to answer these questions, provide your code and answers.
Loading & Reading the “Red Wine Quality” data
setwd("C:\\") # start with setting the working directory in C:
if(!file.exists("./redwine")){dir.create("./redwine")}
url <- "http://www3.dsi.uminho.pt/pcortez/wine/winequality.zip"
download.file(url, destfile = "./redwine/redwine.zip")
unzip("./redwine/redwine.zip",exdir = "./redwine")
setwd("C:\\redwine\\winequality")
redwine = read.csv("winequality-red.csv", sep = ";")
redwine = na.omit(redwine)
redwine = tbl_df(redwine)
redwine
## Source: local data frame [1,599 x 12]
##
## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## (dbl) (dbl) (dbl) (dbl) (dbl)
## 1 7.4 0.70 0.00 1.9 0.076
## 2 7.8 0.88 0.00 2.6 0.098
## 3 7.8 0.76 0.04 2.3 0.092
## 4 11.2 0.28 0.56 1.9 0.075
## 5 7.4 0.70 0.00 1.9 0.076
## 6 7.4 0.66 0.00 1.8 0.075
## 7 7.9 0.60 0.06 1.6 0.069
## 8 7.3 0.65 0.00 1.2 0.065
## 9 7.8 0.58 0.02 2.0 0.073
## 10 7.5 0.50 0.36 6.1 0.071
## .. ... ... ... ... ...
## Variables not shown: free.sulfur.dioxide (dbl), total.sulfur.dioxide
## (dbl), density (dbl), pH (dbl), sulphates (dbl), alcohol (dbl), quality
## (int)
creating Training and Test sets : 80/20 split
set.seed(1234)
dim(redwine)
## [1] 1599 12
n = nrow(redwine)
train_index = sample(n, ceiling(n*0.8), replace = F)
train_set = redwine[train_index,]
dim(train_set)
## [1] 1280 12
test_set = redwine[-train_index,]
dim(test_set)
## [1] 319 12
1. Use the Stepwise regression on the red wine quality data. (30 Points)
A. Create regression models using Forward, Backward, and Bi-direction methods .
we will use linear regression model to predict the “Quality” of the red wine.
Regression Model using “Forward Selection” method:
model_full = lm(quality ~ ., data = train_set)
model_int = lm(quality ~ -., data = train_set)
scopeformula = formula(model_full)
scopeformula
## quality ~ fixed.acidity + volatile.acidity + citric.acid + residual.sugar +
## chlorides + free.sulfur.dioxide + total.sulfur.dioxide +
## density + pH + sulphates + alcohol
fwd_sel = step(object=model_int, scope=scopeformula, direction="forward")
## Start: AIC=-580.15
## quality ~ -(fixed.acidity + volatile.acidity + citric.acid +
## residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide +
## density + pH + sulphates + alcohol)
##
## Df Sum of Sq RSS AIC
## + alcohol 1 183.625 628.62 -906.19
## + volatile.acidity 1 121.366 690.88 -785.30
## + sulphates 1 47.317 764.93 -654.98
## + citric.acid 1 38.475 773.77 -640.27
## + total.sulfur.dioxide 1 30.955 781.29 -627.89
## + density 1 26.463 785.79 -620.55
## + chlorides 1 19.065 793.18 -608.56
## + fixed.acidity 1 10.848 801.40 -595.36
## + free.sulfur.dioxide 1 3.076 809.17 -583.01
## + pH 1 1.570 810.68 -580.63
## <none> 812.25 -580.15
## + residual.sugar 1 0.174 812.07 -578.43
##
## Step: AIC=-906.19
## quality ~ alcohol
##
## Df Sum of Sq RSS AIC
## + volatile.acidity 1 75.174 553.45 -1067.21
## + sulphates 1 32.072 596.55 -971.21
## + citric.acid 1 24.239 604.38 -954.52
## + pH 1 18.871 609.75 -943.20
## + fixed.acidity 1 17.650 610.97 -940.64
## + total.sulfur.dioxide 1 7.605 621.02 -919.76
## + density 1 3.473 625.15 -911.28
## + chlorides 1 1.739 626.88 -907.73
## <none> 628.62 -906.19
## + free.sulfur.dioxide 1 0.716 627.91 -905.64
## + residual.sugar 1 0.011 628.61 -904.21
##
## Step: AIC=-1067.21
## quality ~ alcohol + volatile.acidity
##
## Df Sum of Sq RSS AIC
## + sulphates 1 12.8066 540.64 -1095.2
## + total.sulfur.dioxide 1 6.3374 547.11 -1080.0
## + fixed.acidity 1 3.8207 549.63 -1074.1
## + pH 1 3.4882 549.96 -1073.3
## + density 1 1.4993 551.95 -1068.7
## + free.sulfur.dioxide 1 1.3670 552.08 -1068.4
## + chlorides 1 1.1030 552.35 -1067.8
## <none> 553.45 -1067.2
## + citric.acid 1 0.0932 553.36 -1065.4
## + residual.sugar 1 0.0449 553.40 -1065.3
##
## Step: AIC=-1095.18
## quality ~ alcohol + volatile.acidity + sulphates
##
## Df Sum of Sq RSS AIC
## + total.sulfur.dioxide 1 7.9365 532.71 -1112.1
## + chlorides 1 7.8925 532.75 -1112.0
## + fixed.acidity 1 2.2021 538.44 -1098.4
## + free.sulfur.dioxide 1 1.9554 538.69 -1097.8
## + pH 1 1.8119 538.83 -1097.5
## <none> 540.64 -1095.2
## + citric.acid 1 0.2425 540.40 -1093.8
## + density 1 0.1769 540.47 -1093.6
## + residual.sugar 1 0.0182 540.62 -1093.2
##
## Step: AIC=-1112.1
## quality ~ alcohol + volatile.acidity + sulphates + total.sulfur.dioxide
##
## Df Sum of Sq RSS AIC
## + chlorides 1 8.3588 524.35 -1130.3
## + pH 1 1.9074 530.80 -1114.7
## + fixed.acidity 1 1.2362 531.47 -1113.1
## <none> 532.71 -1112.1
## + residual.sugar 1 0.4867 532.22 -1111.3
## + free.sulfur.dioxide 1 0.4724 532.23 -1111.2
## + density 1 0.0846 532.62 -1110.3
## + citric.acid 1 0.0755 532.63 -1110.3
##
## Step: AIC=-1130.35
## quality ~ alcohol + volatile.acidity + sulphates + total.sulfur.dioxide +
## chlorides
##
## Df Sum of Sq RSS AIC
## + pH 1 3.9910 520.36 -1138.1
## + fixed.acidity 1 1.5035 522.84 -1132.0
## + residual.sugar 1 0.9156 523.43 -1130.6
## <none> 524.35 -1130.3
## + free.sulfur.dioxide 1 0.4293 523.92 -1129.4
## + density 1 0.1311 524.22 -1128.7
## + citric.acid 1 0.1261 524.22 -1128.7
##
## Step: AIC=-1138.13
## quality ~ alcohol + volatile.acidity + sulphates + total.sulfur.dioxide +
## chlorides + pH
##
## Df Sum of Sq RSS AIC
## + free.sulfur.dioxide 1 0.96159 519.40 -1138.5
## <none> 520.36 -1138.1
## + residual.sugar 1 0.60068 519.76 -1137.6
## + citric.acid 1 0.54937 519.81 -1137.5
## + fixed.acidity 1 0.02049 520.34 -1136.2
## + density 1 0.01611 520.34 -1136.2
##
## Step: AIC=-1138.5
## quality ~ alcohol + volatile.acidity + sulphates + total.sulfur.dioxide +
## chlorides + pH + free.sulfur.dioxide
##
## Df Sum of Sq RSS AIC
## <none> 519.40 -1138.5
## + residual.sugar 1 0.46327 518.93 -1137.6
## + citric.acid 1 0.34112 519.05 -1137.3
## + fixed.acidity 1 0.00933 519.39 -1136.5
## + density 1 0.00605 519.39 -1136.5
summary(fwd_sel)
##
## Call:
## lm(formula = quality ~ alcohol + volatile.acidity + sulphates +
## total.sulfur.dioxide + chlorides + pH + free.sulfur.dioxide,
## data = train_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.64226 -0.37030 -0.05449 0.46926 1.94207
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.3587278 0.4406047 9.893 < 2e-16 ***
## alcohol 0.2846803 0.0186553 15.260 < 2e-16 ***
## volatile.acidity -1.0239207 0.1123801 -9.111 < 2e-16 ***
## sulphates 0.8629974 0.1219875 7.074 2.47e-12 ***
## total.sulfur.dioxide -0.0033743 0.0007656 -4.407 1.14e-05 ***
## chlorides -2.2754533 0.4472560 -5.088 4.17e-07 ***
## pH -0.4307869 0.1294315 -3.328 0.000899 ***
## free.sulfur.dioxide 0.0035739 0.0023289 1.535 0.125135
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.639 on 1272 degrees of freedom
## Multiple R-squared: 0.3605, Adjusted R-squared: 0.357
## F-statistic: 102.5 on 7 and 1272 DF, p-value: < 2.2e-16
FwdSelection_AIC = AIC(fwd_sel)
FwdSelection_AIC # AIC of the model using forward selection method
## [1] 2495.986
Regression Model using “Backward Selection” method:
model_full = lm(quality ~ ., data = train_set)
scopeformula = formula(model_full)
back_sel = step(object=model_full, scope=scopeformula, direction="backward")
## Start: AIC=-1133.61
## quality ~ fixed.acidity + volatile.acidity + citric.acid + residual.sugar +
## chlorides + free.sulfur.dioxide + total.sulfur.dioxide +
## density + pH + sulphates + alcohol
##
## Df Sum of Sq RSS AIC
## - density 1 0.285 518.42 -1134.9
## - fixed.acidity 1 0.301 518.44 -1134.9
## - citric.acid 1 0.462 518.60 -1134.5
## - free.sulfur.dioxide 1 0.487 518.62 -1134.4
## <none> 518.14 -1133.6
## - residual.sugar 1 0.844 518.98 -1133.5
## - pH 1 1.197 519.33 -1132.7
## - total.sulfur.dioxide 1 6.268 524.40 -1120.2
## - chlorides 1 8.852 526.99 -1113.9
## - sulphates 1 20.680 538.81 -1085.5
## - volatile.acidity 1 26.987 545.12 -1070.6
## - alcohol 1 33.827 551.96 -1054.7
##
## Step: AIC=-1134.9
## quality ~ fixed.acidity + volatile.acidity + citric.acid + residual.sugar +
## chlorides + free.sulfur.dioxide + total.sulfur.dioxide +
## pH + sulphates + alcohol
##
## Df Sum of Sq RSS AIC
## - fixed.acidity 1 0.043 518.46 -1136.79
## - citric.acid 1 0.473 518.89 -1135.73
## - residual.sugar 1 0.560 518.98 -1135.52
## - free.sulfur.dioxide 1 0.574 518.99 -1135.48
## <none> 518.42 -1134.90
## - pH 1 2.923 521.34 -1129.70
## - total.sulfur.dioxide 1 6.584 525.00 -1120.75
## - chlorides 1 9.185 527.61 -1114.42
## - sulphates 1 20.840 539.26 -1086.45
## - volatile.acidity 1 28.324 546.74 -1068.81
## - alcohol 1 91.256 609.68 -929.36
##
## Step: AIC=-1136.79
## quality ~ volatile.acidity + citric.acid + residual.sugar + chlorides +
## free.sulfur.dioxide + total.sulfur.dioxide + pH + sulphates +
## alcohol
##
## Df Sum of Sq RSS AIC
## - citric.acid 1 0.468 518.93 -1137.64
## - free.sulfur.dioxide 1 0.589 519.05 -1137.34
## - residual.sugar 1 0.590 519.05 -1137.34
## <none> 518.46 -1136.79
## - pH 1 4.502 522.97 -1127.73
## - total.sulfur.dioxide 1 7.308 525.77 -1120.88
## - chlorides 1 10.063 528.53 -1114.19
## - sulphates 1 21.080 539.54 -1087.78
## - volatile.acidity 1 29.575 548.04 -1067.78
## - alcohol 1 91.918 610.38 -929.88
##
## Step: AIC=-1137.64
## quality ~ volatile.acidity + residual.sugar + chlorides + free.sulfur.dioxide +
## total.sulfur.dioxide + pH + sulphates + alcohol
##
## Df Sum of Sq RSS AIC
## - residual.sugar 1 0.463 519.40 -1138.50
## <none> 518.93 -1137.64
## - free.sulfur.dioxide 1 0.824 519.76 -1137.61
## - pH 1 4.167 523.10 -1129.40
## - total.sulfur.dioxide 1 8.188 527.12 -1119.60
## - chlorides 1 10.819 529.75 -1113.23
## - sulphates 1 20.648 539.58 -1089.69
## - volatile.acidity 1 34.121 553.05 -1058.13
## - alcohol 1 92.797 611.73 -929.06
##
## Step: AIC=-1138.5
## quality ~ volatile.acidity + chlorides + free.sulfur.dioxide +
## total.sulfur.dioxide + pH + sulphates + alcohol
##
## Df Sum of Sq RSS AIC
## <none> 519.40 -1138.50
## - free.sulfur.dioxide 1 0.962 520.36 -1138.13
## - pH 1 4.523 523.92 -1129.40
## - total.sulfur.dioxide 1 7.931 527.33 -1121.10
## - chlorides 1 10.569 529.96 -1114.71
## - sulphates 1 20.436 539.83 -1091.10
## - volatile.acidity 1 33.897 553.29 -1059.57
## - alcohol 1 95.087 614.48 -925.31
summary(back_sel)
##
## Call:
## lm(formula = quality ~ volatile.acidity + chlorides + free.sulfur.dioxide +
## total.sulfur.dioxide + pH + sulphates + alcohol, data = train_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.64226 -0.37030 -0.05449 0.46926 1.94207
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.3587278 0.4406047 9.893 < 2e-16 ***
## volatile.acidity -1.0239207 0.1123801 -9.111 < 2e-16 ***
## chlorides -2.2754533 0.4472560 -5.088 4.17e-07 ***
## free.sulfur.dioxide 0.0035739 0.0023289 1.535 0.125135
## total.sulfur.dioxide -0.0033743 0.0007656 -4.407 1.14e-05 ***
## pH -0.4307869 0.1294315 -3.328 0.000899 ***
## sulphates 0.8629974 0.1219875 7.074 2.47e-12 ***
## alcohol 0.2846803 0.0186553 15.260 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.639 on 1272 degrees of freedom
## Multiple R-squared: 0.3605, Adjusted R-squared: 0.357
## F-statistic: 102.5 on 7 and 1272 DF, p-value: < 2.2e-16
BackSelection_AIC = AIC(back_sel)
BackSelection_AIC # AIC of the model using backward selection method
## [1] 2495.986
Regression Model using “Bidirectional(Both) Selection” method:
model_full = lm(quality ~ ., data = train_set)
scopeformula = formula(model_full)
both_sel = step(object=model_full, scope=scopeformula, direction="both")
## Start: AIC=-1133.61
## quality ~ fixed.acidity + volatile.acidity + citric.acid + residual.sugar +
## chlorides + free.sulfur.dioxide + total.sulfur.dioxide +
## density + pH + sulphates + alcohol
##
## Df Sum of Sq RSS AIC
## - density 1 0.285 518.42 -1134.9
## - fixed.acidity 1 0.301 518.44 -1134.9
## - citric.acid 1 0.462 518.60 -1134.5
## - free.sulfur.dioxide 1 0.487 518.62 -1134.4
## <none> 518.14 -1133.6
## - residual.sugar 1 0.844 518.98 -1133.5
## - pH 1 1.197 519.33 -1132.7
## - total.sulfur.dioxide 1 6.268 524.40 -1120.2
## - chlorides 1 8.852 526.99 -1113.9
## - sulphates 1 20.680 538.81 -1085.5
## - volatile.acidity 1 26.987 545.12 -1070.6
## - alcohol 1 33.827 551.96 -1054.7
##
## Step: AIC=-1134.9
## quality ~ fixed.acidity + volatile.acidity + citric.acid + residual.sugar +
## chlorides + free.sulfur.dioxide + total.sulfur.dioxide +
## pH + sulphates + alcohol
##
## Df Sum of Sq RSS AIC
## - fixed.acidity 1 0.043 518.46 -1136.79
## - citric.acid 1 0.473 518.89 -1135.73
## - residual.sugar 1 0.560 518.98 -1135.52
## - free.sulfur.dioxide 1 0.574 518.99 -1135.48
## <none> 518.42 -1134.90
## + density 1 0.285 518.14 -1133.61
## - pH 1 2.923 521.34 -1129.70
## - total.sulfur.dioxide 1 6.584 525.00 -1120.75
## - chlorides 1 9.185 527.61 -1114.42
## - sulphates 1 20.840 539.26 -1086.45
## - volatile.acidity 1 28.324 546.74 -1068.81
## - alcohol 1 91.256 609.68 -929.36
##
## Step: AIC=-1136.79
## quality ~ volatile.acidity + citric.acid + residual.sugar + chlorides +
## free.sulfur.dioxide + total.sulfur.dioxide + pH + sulphates +
## alcohol
##
## Df Sum of Sq RSS AIC
## - citric.acid 1 0.468 518.93 -1137.64
## - free.sulfur.dioxide 1 0.589 519.05 -1137.34
## - residual.sugar 1 0.590 519.05 -1137.34
## <none> 518.46 -1136.79
## + fixed.acidity 1 0.043 518.42 -1134.90
## + density 1 0.028 518.44 -1134.86
## - pH 1 4.502 522.97 -1127.73
## - total.sulfur.dioxide 1 7.308 525.77 -1120.88
## - chlorides 1 10.063 528.53 -1114.19
## - sulphates 1 21.080 539.54 -1087.78
## - volatile.acidity 1 29.575 548.04 -1067.78
## - alcohol 1 91.918 610.38 -929.88
##
## Step: AIC=-1137.64
## quality ~ volatile.acidity + residual.sugar + chlorides + free.sulfur.dioxide +
## total.sulfur.dioxide + pH + sulphates + alcohol
##
## Df Sum of Sq RSS AIC
## - residual.sugar 1 0.463 519.40 -1138.50
## <none> 518.93 -1137.64
## - free.sulfur.dioxide 1 0.824 519.76 -1137.61
## + citric.acid 1 0.468 518.46 -1136.79
## + density 1 0.215 518.72 -1136.17
## + fixed.acidity 1 0.039 518.89 -1135.73
## - pH 1 4.167 523.10 -1129.40
## - total.sulfur.dioxide 1 8.188 527.12 -1119.60
## - chlorides 1 10.819 529.75 -1113.23
## - sulphates 1 20.648 539.58 -1089.69
## - volatile.acidity 1 34.121 553.05 -1058.13
## - alcohol 1 92.797 611.73 -929.06
##
## Step: AIC=-1138.5
## quality ~ volatile.acidity + chlorides + free.sulfur.dioxide +
## total.sulfur.dioxide + pH + sulphates + alcohol
##
## Df Sum of Sq RSS AIC
## <none> 519.40 -1138.50
## - free.sulfur.dioxide 1 0.962 520.36 -1138.13
## + residual.sugar 1 0.463 518.93 -1137.64
## + citric.acid 1 0.341 519.05 -1137.34
## + fixed.acidity 1 0.009 519.39 -1136.52
## + density 1 0.006 519.39 -1136.51
## - pH 1 4.523 523.92 -1129.40
## - total.sulfur.dioxide 1 7.931 527.33 -1121.10
## - chlorides 1 10.569 529.96 -1114.71
## - sulphates 1 20.436 539.83 -1091.10
## - volatile.acidity 1 33.897 553.29 -1059.57
## - alcohol 1 95.087 614.48 -925.31
summary(both_sel)
##
## Call:
## lm(formula = quality ~ volatile.acidity + chlorides + free.sulfur.dioxide +
## total.sulfur.dioxide + pH + sulphates + alcohol, data = train_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.64226 -0.37030 -0.05449 0.46926 1.94207
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.3587278 0.4406047 9.893 < 2e-16 ***
## volatile.acidity -1.0239207 0.1123801 -9.111 < 2e-16 ***
## chlorides -2.2754533 0.4472560 -5.088 4.17e-07 ***
## free.sulfur.dioxide 0.0035739 0.0023289 1.535 0.125135
## total.sulfur.dioxide -0.0033743 0.0007656 -4.407 1.14e-05 ***
## pH -0.4307869 0.1294315 -3.328 0.000899 ***
## sulphates 0.8629974 0.1219875 7.074 2.47e-12 ***
## alcohol 0.2846803 0.0186553 15.260 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.639 on 1272 degrees of freedom
## Multiple R-squared: 0.3605, Adjusted R-squared: 0.357
## F-statistic: 102.5 on 7 and 1272 DF, p-value: < 2.2e-16
BidirSelection_AIC = AIC(both_sel)
BidirSelection_AIC # AIC of the model using bidirectional(both) selection method
## [1] 2495.986
Comapring all 3 models for AIC score:
AIC_df = data.frame(FwdSelection=FwdSelection_AIC, BackSelection=BackSelection_AIC, BidirSelection=BidirSelection_AIC)
rownames(AIC_df) = c("AIC")
AIC_df
## FwdSelection BackSelection BidirSelection
## AIC 2495.986 2495.986 2495.986
As AIC scores of all 3 models are same at 2495.986, all 3 models are equal for this perticuler dataset.
- for large dataset with large no. of predictor variables where the goal is to get small no. of significant predictors, “forward selection” or “bidirectional” algorithms are better choices.
Significance of the Predictors:
- final variables chosen by all 3 models are same with same p-value for each.
- these variables out of 11 in original dataset are,
- alcohol
- volatile.acidity
- sulphates
- total.sulfur.dioxide
- chlorides
- pH
- free.sulfur.dioxide
- p-values of all variable except “free.sulfur.dioxide” are less than popular significance level of 0.05. so all those independent variables are significant predictors of the model.
- p-value of “free.sulfur.dioxide” is greater than 0.05, so it is not a significant predictor of the model.
2. Use the Partial Least Squares Regression on the red wine quality data. (30 Points)
A. Create regression models using PCR and PLSR.
PCR modeling:
pcr_model = pcr(quality ~ ., data=train_set)
summary(pcr_model)
## Data: X dimension: 1280 11
## Y dimension: 1280 1
## Fit method: svdpc
## Number of components considered: 11
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
## X 94.578 99.485 99.750 99.905 99.99 100.00 100.00
## quality 3.614 4.717 5.718 5.972 26.10 33.12 33.13
## 8 comps 9 comps 10 comps 11 comps
## X 100.0 100.0 100.00 100.00
## quality 33.5 35.1 36.17 36.21
PLSR modeling:
plsr_model = plsr(quality ~ ., data=train_set)
summary(plsr_model)
## Data: X dimension: 1280 11
## Y dimension: 1280 1
## Fit method: kernelpls
## Number of components considered: 11
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
## X 94.504 99.440 99.63 99.84 99.97 100.00 100.00
## quality 3.771 5.774 23.95 26.86 28.03 34.13 35.45
## 8 comps 9 comps 10 comps 11 comps
## X 100.00 100.00 100.00 100.00
## quality 35.99 36.15 36.17 36.21
PCR and PLSR are the model that help in situations when you have very large no. of attributes or the predictors are highly correlated (multicollinearity).
PCR uses Principle components generated by PCA method insteado f predictors in the model. PCs are linearly uncorrelated and generated basedo n variability of the predictor variables.
because PCR does not account for predictors correlation with response variable, sometimes PCs are also uncorrelated with response variable and that leads to low predictive performance of the model.
this issue is resolved in PLSR model which accounts for PC’s high correlation with response variable as well as variability of predictor variables.
3. Use the Penalized Regression on the red wine quality data. (30 Points)
A. Create models using LASSO, Ridge regression, and Elastic Net. (Use glmnet package for all)
Penalty-based models: Ridge, Lasso, Elastic
ridge_model <- glmnet(x = as.matrix(train_set[,-12]), y = as.matrix(train_set$quality), alpha = 0) #alpha =0 for ridge
summary(ridge_model)
## Length Class Mode
## a0 100 -none- numeric
## beta 1100 dgCMatrix S4
## df 100 -none- numeric
## dim 2 -none- numeric
## lambda 100 -none- numeric
## dev.ratio 100 -none- numeric
## nulldev 1 -none- numeric
## npasses 1 -none- numeric
## jerr 1 -none- numeric
## offset 1 -none- logical
## call 4 -none- call
## nobs 1 -none- numeric
elastic_model <- glmnet(x = as.matrix(train_set[,-12]), y = as.matrix(train_set$quality), alpha = 0.5) #alpha =0.5 for elastic
summary(elastic_model)
## Length Class Mode
## a0 71 -none- numeric
## beta 781 dgCMatrix S4
## df 71 -none- numeric
## dim 2 -none- numeric
## lambda 71 -none- numeric
## dev.ratio 71 -none- numeric
## nulldev 1 -none- numeric
## npasses 1 -none- numeric
## jerr 1 -none- numeric
## offset 1 -none- logical
## call 4 -none- call
## nobs 1 -none- numeric
lasso_model <- glmnet(x = as.matrix(train_set[,-12]), y = as.matrix(train_set$quality), alpha = 1) #alpha =1 for lasso
summary(lasso_model)
## Length Class Mode
## a0 69 -none- numeric
## beta 759 dgCMatrix S4
## df 69 -none- numeric
## dim 2 -none- numeric
## lambda 69 -none- numeric
## dev.ratio 69 -none- numeric
## nulldev 1 -none- numeric
## npasses 1 -none- numeric
## jerr 1 -none- numeric
## offset 1 -none- logical
## call 4 -none- call
## nobs 1 -none- numeric
regression coefficients generated by the least squares regression are unbiased estimators.
unbiased estomators are not necessary to provide the smallest variance or MSE of the model.
somestimes biased estimators may lead to better variance and MSE of the model. especially when collinearity exists which leads to inflation of the variance of the estimator.
the solution is to add a penalty term to the regression coefficient when we apply the least squares method to build up the regression model.
based on the way we add this penalty term, we have 3 different type of penalty-based models.
LASSO - adds a penalty on the summation of the absolute regression parameters. makes sparsity in parameters.
RIDGE - adds a penalty on the summation of the squared regression parameters. can handle the multicollinearity and provides smaller variance and MSE.
ELASTIC - combines the lasso and ridge regression concepts.
4. Compare all the models above. Which one gives the best prediction quality? (Use MSE) (10 Points)
Prediction of Quality of red wine:
step.forward.predicted <- predict(fwd_sel, test_set[,-12])
step.backward.predicted <- predict(back_sel, test_set[,-12])
step.both.predicted <- predict(both_sel, test_set[,-12])
pcr.predicted <- predict(pcr_model, test_set[,-12], ncomp = 6)
plsr.predicted <- predict(plsr_model, test_set[,-12], ncomp = 6)
lasso.predicted <- predict(lasso_model, as.matrix(test_set[,-12]), s = 0.1)
ridge.predicted <- predict(ridge_model, as.matrix(test_set[,-12]), s = 0.1)
elastic.predicted <- predict(elastic_model, as.matrix(test_set[,-12]), s = 0.1)
final_results <- data.frame(step.forward.predicted, step.backward.predicted, step.both.predicted, pcr.predicted, plsr.predicted, lasso.predicted, ridge.predicted, elastic.predicted)
colnames(final_results) <- c("Step_forward", "Step_backward", "Step_both", "PCR", "PLSR", "Lasso", "Ridge", "Elastic")
head(final_results)
## Step_forward Step_backward Step_both PCR PLSR Lasso Ridge
## 1 5.320067 5.320067 5.320067 5.266609 5.267298 5.388452 5.364428
## 2 5.941546 5.941546 5.941546 5.691850 5.743269 5.397953 5.922857
## 3 5.037195 5.037195 5.037195 5.070398 5.093969 5.297293 5.062500
## 4 5.147282 5.147282 5.147282 5.139064 5.140284 5.401897 5.185984
## 5 5.165358 5.165358 5.165358 5.016662 5.051697 5.369611 5.208209
## 6 5.074575 5.074575 5.074575 5.117242 5.124568 5.296019 5.219919
## Elastic
## 1 5.358380
## 2 5.650633
## 3 5.174022
## 4 5.304687
## 5 5.273243
## 6 5.200493
Mean Squared Error (MSE) calculation:
MSE = c()
for(i in 1:8){
MSE <- rbind(MSE, c(names(final_results)[i], mean((test_set$quality - final_results[,i])^2)))
}
MSE <- data.frame(Model = MSE[,1], MSE = as.numeric(MSE[,2]))
MSE[order(MSE$MSE),]
## Model MSE
## 1 Step_forward 0.4669804
## 2 Step_backward 0.4669804
## 3 Step_both 0.4669804
## 7 Ridge 0.4716701
## 5 PLSR 0.4723489
## 4 PCR 0.4786588
## 8 Elastic 0.4857038
## 6 Lasso 0.5071073
Evaluating Prediction Performance:
based on MSE of all models, looks like Step_forward, step_backward and step_both have the lowest Mean Squared Error (MSE). so these 3 models have the best prediction performance among all 8 models.