Please refer to the Red wine quality data set in this webpage ( http://www3.dsi.uminho.pt/pcortez/wine/ ) to answer the following questions. Use R to answer these questions, provide your code and answers.

Loading & Reading the “Red Wine Quality” data

setwd("C:\\") # start with setting the working directory in C:
if(!file.exists("./redwine")){dir.create("./redwine")}
url <- "http://www3.dsi.uminho.pt/pcortez/wine/winequality.zip"
download.file(url, destfile = "./redwine/redwine.zip")
unzip("./redwine/redwine.zip",exdir = "./redwine")
setwd("C:\\redwine\\winequality")
redwine = read.csv("winequality-red.csv", sep = ";")
redwine = na.omit(redwine)
redwine = tbl_df(redwine)
redwine
## Source: local data frame [1,599 x 12]
## 
##    fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
##            (dbl)            (dbl)       (dbl)          (dbl)     (dbl)
## 1            7.4             0.70        0.00            1.9     0.076
## 2            7.8             0.88        0.00            2.6     0.098
## 3            7.8             0.76        0.04            2.3     0.092
## 4           11.2             0.28        0.56            1.9     0.075
## 5            7.4             0.70        0.00            1.9     0.076
## 6            7.4             0.66        0.00            1.8     0.075
## 7            7.9             0.60        0.06            1.6     0.069
## 8            7.3             0.65        0.00            1.2     0.065
## 9            7.8             0.58        0.02            2.0     0.073
## 10           7.5             0.50        0.36            6.1     0.071
## ..           ...              ...         ...            ...       ...
## Variables not shown: free.sulfur.dioxide (dbl), total.sulfur.dioxide
##   (dbl), density (dbl), pH (dbl), sulphates (dbl), alcohol (dbl), quality
##   (int)

creating Training and Test sets : 80/20 split

set.seed(1234)
dim(redwine)
## [1] 1599   12
n = nrow(redwine)
train_index = sample(n, ceiling(n*0.8), replace = F)
train_set = redwine[train_index,]
dim(train_set)
## [1] 1280   12
test_set = redwine[-train_index,]
dim(test_set)
## [1] 319  12

1. Use the Stepwise regression on the red wine quality data. (30 Points)

A. Create regression models using Forward, Backward, and Bi-direction methods .

we will use linear regression model to predict the “Quality” of the red wine.

Regression Model using “Forward Selection” method:

model_full = lm(quality ~ ., data = train_set)
model_int = lm(quality ~ -., data = train_set)

scopeformula = formula(model_full)
scopeformula
## quality ~ fixed.acidity + volatile.acidity + citric.acid + residual.sugar + 
##     chlorides + free.sulfur.dioxide + total.sulfur.dioxide + 
##     density + pH + sulphates + alcohol
fwd_sel = step(object=model_int, scope=scopeformula, direction="forward")
## Start:  AIC=-580.15
## quality ~ -(fixed.acidity + volatile.acidity + citric.acid + 
##     residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + 
##     density + pH + sulphates + alcohol)
## 
##                        Df Sum of Sq    RSS     AIC
## + alcohol               1   183.625 628.62 -906.19
## + volatile.acidity      1   121.366 690.88 -785.30
## + sulphates             1    47.317 764.93 -654.98
## + citric.acid           1    38.475 773.77 -640.27
## + total.sulfur.dioxide  1    30.955 781.29 -627.89
## + density               1    26.463 785.79 -620.55
## + chlorides             1    19.065 793.18 -608.56
## + fixed.acidity         1    10.848 801.40 -595.36
## + free.sulfur.dioxide   1     3.076 809.17 -583.01
## + pH                    1     1.570 810.68 -580.63
## <none>                              812.25 -580.15
## + residual.sugar        1     0.174 812.07 -578.43
## 
## Step:  AIC=-906.19
## quality ~ alcohol
## 
##                        Df Sum of Sq    RSS      AIC
## + volatile.acidity      1    75.174 553.45 -1067.21
## + sulphates             1    32.072 596.55  -971.21
## + citric.acid           1    24.239 604.38  -954.52
## + pH                    1    18.871 609.75  -943.20
## + fixed.acidity         1    17.650 610.97  -940.64
## + total.sulfur.dioxide  1     7.605 621.02  -919.76
## + density               1     3.473 625.15  -911.28
## + chlorides             1     1.739 626.88  -907.73
## <none>                              628.62  -906.19
## + free.sulfur.dioxide   1     0.716 627.91  -905.64
## + residual.sugar        1     0.011 628.61  -904.21
## 
## Step:  AIC=-1067.21
## quality ~ alcohol + volatile.acidity
## 
##                        Df Sum of Sq    RSS     AIC
## + sulphates             1   12.8066 540.64 -1095.2
## + total.sulfur.dioxide  1    6.3374 547.11 -1080.0
## + fixed.acidity         1    3.8207 549.63 -1074.1
## + pH                    1    3.4882 549.96 -1073.3
## + density               1    1.4993 551.95 -1068.7
## + free.sulfur.dioxide   1    1.3670 552.08 -1068.4
## + chlorides             1    1.1030 552.35 -1067.8
## <none>                              553.45 -1067.2
## + citric.acid           1    0.0932 553.36 -1065.4
## + residual.sugar        1    0.0449 553.40 -1065.3
## 
## Step:  AIC=-1095.18
## quality ~ alcohol + volatile.acidity + sulphates
## 
##                        Df Sum of Sq    RSS     AIC
## + total.sulfur.dioxide  1    7.9365 532.71 -1112.1
## + chlorides             1    7.8925 532.75 -1112.0
## + fixed.acidity         1    2.2021 538.44 -1098.4
## + free.sulfur.dioxide   1    1.9554 538.69 -1097.8
## + pH                    1    1.8119 538.83 -1097.5
## <none>                              540.64 -1095.2
## + citric.acid           1    0.2425 540.40 -1093.8
## + density               1    0.1769 540.47 -1093.6
## + residual.sugar        1    0.0182 540.62 -1093.2
## 
## Step:  AIC=-1112.1
## quality ~ alcohol + volatile.acidity + sulphates + total.sulfur.dioxide
## 
##                       Df Sum of Sq    RSS     AIC
## + chlorides            1    8.3588 524.35 -1130.3
## + pH                   1    1.9074 530.80 -1114.7
## + fixed.acidity        1    1.2362 531.47 -1113.1
## <none>                             532.71 -1112.1
## + residual.sugar       1    0.4867 532.22 -1111.3
## + free.sulfur.dioxide  1    0.4724 532.23 -1111.2
## + density              1    0.0846 532.62 -1110.3
## + citric.acid          1    0.0755 532.63 -1110.3
## 
## Step:  AIC=-1130.35
## quality ~ alcohol + volatile.acidity + sulphates + total.sulfur.dioxide + 
##     chlorides
## 
##                       Df Sum of Sq    RSS     AIC
## + pH                   1    3.9910 520.36 -1138.1
## + fixed.acidity        1    1.5035 522.84 -1132.0
## + residual.sugar       1    0.9156 523.43 -1130.6
## <none>                             524.35 -1130.3
## + free.sulfur.dioxide  1    0.4293 523.92 -1129.4
## + density              1    0.1311 524.22 -1128.7
## + citric.acid          1    0.1261 524.22 -1128.7
## 
## Step:  AIC=-1138.13
## quality ~ alcohol + volatile.acidity + sulphates + total.sulfur.dioxide + 
##     chlorides + pH
## 
##                       Df Sum of Sq    RSS     AIC
## + free.sulfur.dioxide  1   0.96159 519.40 -1138.5
## <none>                             520.36 -1138.1
## + residual.sugar       1   0.60068 519.76 -1137.6
## + citric.acid          1   0.54937 519.81 -1137.5
## + fixed.acidity        1   0.02049 520.34 -1136.2
## + density              1   0.01611 520.34 -1136.2
## 
## Step:  AIC=-1138.5
## quality ~ alcohol + volatile.acidity + sulphates + total.sulfur.dioxide + 
##     chlorides + pH + free.sulfur.dioxide
## 
##                  Df Sum of Sq    RSS     AIC
## <none>                        519.40 -1138.5
## + residual.sugar  1   0.46327 518.93 -1137.6
## + citric.acid     1   0.34112 519.05 -1137.3
## + fixed.acidity   1   0.00933 519.39 -1136.5
## + density         1   0.00605 519.39 -1136.5
summary(fwd_sel)
## 
## Call:
## lm(formula = quality ~ alcohol + volatile.acidity + sulphates + 
##     total.sulfur.dioxide + chlorides + pH + free.sulfur.dioxide, 
##     data = train_set)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.64226 -0.37030 -0.05449  0.46926  1.94207 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           4.3587278  0.4406047   9.893  < 2e-16 ***
## alcohol               0.2846803  0.0186553  15.260  < 2e-16 ***
## volatile.acidity     -1.0239207  0.1123801  -9.111  < 2e-16 ***
## sulphates             0.8629974  0.1219875   7.074 2.47e-12 ***
## total.sulfur.dioxide -0.0033743  0.0007656  -4.407 1.14e-05 ***
## chlorides            -2.2754533  0.4472560  -5.088 4.17e-07 ***
## pH                   -0.4307869  0.1294315  -3.328 0.000899 ***
## free.sulfur.dioxide   0.0035739  0.0023289   1.535 0.125135    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.639 on 1272 degrees of freedom
## Multiple R-squared:  0.3605, Adjusted R-squared:  0.357 
## F-statistic: 102.5 on 7 and 1272 DF,  p-value: < 2.2e-16
FwdSelection_AIC = AIC(fwd_sel)
FwdSelection_AIC # AIC of the model using forward selection method
## [1] 2495.986

Regression Model using “Backward Selection” method:

model_full = lm(quality ~ ., data = train_set)
scopeformula = formula(model_full)

back_sel = step(object=model_full, scope=scopeformula, direction="backward")
## Start:  AIC=-1133.61
## quality ~ fixed.acidity + volatile.acidity + citric.acid + residual.sugar + 
##     chlorides + free.sulfur.dioxide + total.sulfur.dioxide + 
##     density + pH + sulphates + alcohol
## 
##                        Df Sum of Sq    RSS     AIC
## - density               1     0.285 518.42 -1134.9
## - fixed.acidity         1     0.301 518.44 -1134.9
## - citric.acid           1     0.462 518.60 -1134.5
## - free.sulfur.dioxide   1     0.487 518.62 -1134.4
## <none>                              518.14 -1133.6
## - residual.sugar        1     0.844 518.98 -1133.5
## - pH                    1     1.197 519.33 -1132.7
## - total.sulfur.dioxide  1     6.268 524.40 -1120.2
## - chlorides             1     8.852 526.99 -1113.9
## - sulphates             1    20.680 538.81 -1085.5
## - volatile.acidity      1    26.987 545.12 -1070.6
## - alcohol               1    33.827 551.96 -1054.7
## 
## Step:  AIC=-1134.9
## quality ~ fixed.acidity + volatile.acidity + citric.acid + residual.sugar + 
##     chlorides + free.sulfur.dioxide + total.sulfur.dioxide + 
##     pH + sulphates + alcohol
## 
##                        Df Sum of Sq    RSS      AIC
## - fixed.acidity         1     0.043 518.46 -1136.79
## - citric.acid           1     0.473 518.89 -1135.73
## - residual.sugar        1     0.560 518.98 -1135.52
## - free.sulfur.dioxide   1     0.574 518.99 -1135.48
## <none>                              518.42 -1134.90
## - pH                    1     2.923 521.34 -1129.70
## - total.sulfur.dioxide  1     6.584 525.00 -1120.75
## - chlorides             1     9.185 527.61 -1114.42
## - sulphates             1    20.840 539.26 -1086.45
## - volatile.acidity      1    28.324 546.74 -1068.81
## - alcohol               1    91.256 609.68  -929.36
## 
## Step:  AIC=-1136.79
## quality ~ volatile.acidity + citric.acid + residual.sugar + chlorides + 
##     free.sulfur.dioxide + total.sulfur.dioxide + pH + sulphates + 
##     alcohol
## 
##                        Df Sum of Sq    RSS      AIC
## - citric.acid           1     0.468 518.93 -1137.64
## - free.sulfur.dioxide   1     0.589 519.05 -1137.34
## - residual.sugar        1     0.590 519.05 -1137.34
## <none>                              518.46 -1136.79
## - pH                    1     4.502 522.97 -1127.73
## - total.sulfur.dioxide  1     7.308 525.77 -1120.88
## - chlorides             1    10.063 528.53 -1114.19
## - sulphates             1    21.080 539.54 -1087.78
## - volatile.acidity      1    29.575 548.04 -1067.78
## - alcohol               1    91.918 610.38  -929.88
## 
## Step:  AIC=-1137.64
## quality ~ volatile.acidity + residual.sugar + chlorides + free.sulfur.dioxide + 
##     total.sulfur.dioxide + pH + sulphates + alcohol
## 
##                        Df Sum of Sq    RSS      AIC
## - residual.sugar        1     0.463 519.40 -1138.50
## <none>                              518.93 -1137.64
## - free.sulfur.dioxide   1     0.824 519.76 -1137.61
## - pH                    1     4.167 523.10 -1129.40
## - total.sulfur.dioxide  1     8.188 527.12 -1119.60
## - chlorides             1    10.819 529.75 -1113.23
## - sulphates             1    20.648 539.58 -1089.69
## - volatile.acidity      1    34.121 553.05 -1058.13
## - alcohol               1    92.797 611.73  -929.06
## 
## Step:  AIC=-1138.5
## quality ~ volatile.acidity + chlorides + free.sulfur.dioxide + 
##     total.sulfur.dioxide + pH + sulphates + alcohol
## 
##                        Df Sum of Sq    RSS      AIC
## <none>                              519.40 -1138.50
## - free.sulfur.dioxide   1     0.962 520.36 -1138.13
## - pH                    1     4.523 523.92 -1129.40
## - total.sulfur.dioxide  1     7.931 527.33 -1121.10
## - chlorides             1    10.569 529.96 -1114.71
## - sulphates             1    20.436 539.83 -1091.10
## - volatile.acidity      1    33.897 553.29 -1059.57
## - alcohol               1    95.087 614.48  -925.31
summary(back_sel)
## 
## Call:
## lm(formula = quality ~ volatile.acidity + chlorides + free.sulfur.dioxide + 
##     total.sulfur.dioxide + pH + sulphates + alcohol, data = train_set)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.64226 -0.37030 -0.05449  0.46926  1.94207 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           4.3587278  0.4406047   9.893  < 2e-16 ***
## volatile.acidity     -1.0239207  0.1123801  -9.111  < 2e-16 ***
## chlorides            -2.2754533  0.4472560  -5.088 4.17e-07 ***
## free.sulfur.dioxide   0.0035739  0.0023289   1.535 0.125135    
## total.sulfur.dioxide -0.0033743  0.0007656  -4.407 1.14e-05 ***
## pH                   -0.4307869  0.1294315  -3.328 0.000899 ***
## sulphates             0.8629974  0.1219875   7.074 2.47e-12 ***
## alcohol               0.2846803  0.0186553  15.260  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.639 on 1272 degrees of freedom
## Multiple R-squared:  0.3605, Adjusted R-squared:  0.357 
## F-statistic: 102.5 on 7 and 1272 DF,  p-value: < 2.2e-16
BackSelection_AIC = AIC(back_sel)
BackSelection_AIC # AIC of the model using backward selection method
## [1] 2495.986

Regression Model using “Bidirectional(Both) Selection” method:

model_full = lm(quality ~ ., data = train_set)
scopeformula = formula(model_full)

both_sel = step(object=model_full, scope=scopeformula, direction="both")
## Start:  AIC=-1133.61
## quality ~ fixed.acidity + volatile.acidity + citric.acid + residual.sugar + 
##     chlorides + free.sulfur.dioxide + total.sulfur.dioxide + 
##     density + pH + sulphates + alcohol
## 
##                        Df Sum of Sq    RSS     AIC
## - density               1     0.285 518.42 -1134.9
## - fixed.acidity         1     0.301 518.44 -1134.9
## - citric.acid           1     0.462 518.60 -1134.5
## - free.sulfur.dioxide   1     0.487 518.62 -1134.4
## <none>                              518.14 -1133.6
## - residual.sugar        1     0.844 518.98 -1133.5
## - pH                    1     1.197 519.33 -1132.7
## - total.sulfur.dioxide  1     6.268 524.40 -1120.2
## - chlorides             1     8.852 526.99 -1113.9
## - sulphates             1    20.680 538.81 -1085.5
## - volatile.acidity      1    26.987 545.12 -1070.6
## - alcohol               1    33.827 551.96 -1054.7
## 
## Step:  AIC=-1134.9
## quality ~ fixed.acidity + volatile.acidity + citric.acid + residual.sugar + 
##     chlorides + free.sulfur.dioxide + total.sulfur.dioxide + 
##     pH + sulphates + alcohol
## 
##                        Df Sum of Sq    RSS      AIC
## - fixed.acidity         1     0.043 518.46 -1136.79
## - citric.acid           1     0.473 518.89 -1135.73
## - residual.sugar        1     0.560 518.98 -1135.52
## - free.sulfur.dioxide   1     0.574 518.99 -1135.48
## <none>                              518.42 -1134.90
## + density               1     0.285 518.14 -1133.61
## - pH                    1     2.923 521.34 -1129.70
## - total.sulfur.dioxide  1     6.584 525.00 -1120.75
## - chlorides             1     9.185 527.61 -1114.42
## - sulphates             1    20.840 539.26 -1086.45
## - volatile.acidity      1    28.324 546.74 -1068.81
## - alcohol               1    91.256 609.68  -929.36
## 
## Step:  AIC=-1136.79
## quality ~ volatile.acidity + citric.acid + residual.sugar + chlorides + 
##     free.sulfur.dioxide + total.sulfur.dioxide + pH + sulphates + 
##     alcohol
## 
##                        Df Sum of Sq    RSS      AIC
## - citric.acid           1     0.468 518.93 -1137.64
## - free.sulfur.dioxide   1     0.589 519.05 -1137.34
## - residual.sugar        1     0.590 519.05 -1137.34
## <none>                              518.46 -1136.79
## + fixed.acidity         1     0.043 518.42 -1134.90
## + density               1     0.028 518.44 -1134.86
## - pH                    1     4.502 522.97 -1127.73
## - total.sulfur.dioxide  1     7.308 525.77 -1120.88
## - chlorides             1    10.063 528.53 -1114.19
## - sulphates             1    21.080 539.54 -1087.78
## - volatile.acidity      1    29.575 548.04 -1067.78
## - alcohol               1    91.918 610.38  -929.88
## 
## Step:  AIC=-1137.64
## quality ~ volatile.acidity + residual.sugar + chlorides + free.sulfur.dioxide + 
##     total.sulfur.dioxide + pH + sulphates + alcohol
## 
##                        Df Sum of Sq    RSS      AIC
## - residual.sugar        1     0.463 519.40 -1138.50
## <none>                              518.93 -1137.64
## - free.sulfur.dioxide   1     0.824 519.76 -1137.61
## + citric.acid           1     0.468 518.46 -1136.79
## + density               1     0.215 518.72 -1136.17
## + fixed.acidity         1     0.039 518.89 -1135.73
## - pH                    1     4.167 523.10 -1129.40
## - total.sulfur.dioxide  1     8.188 527.12 -1119.60
## - chlorides             1    10.819 529.75 -1113.23
## - sulphates             1    20.648 539.58 -1089.69
## - volatile.acidity      1    34.121 553.05 -1058.13
## - alcohol               1    92.797 611.73  -929.06
## 
## Step:  AIC=-1138.5
## quality ~ volatile.acidity + chlorides + free.sulfur.dioxide + 
##     total.sulfur.dioxide + pH + sulphates + alcohol
## 
##                        Df Sum of Sq    RSS      AIC
## <none>                              519.40 -1138.50
## - free.sulfur.dioxide   1     0.962 520.36 -1138.13
## + residual.sugar        1     0.463 518.93 -1137.64
## + citric.acid           1     0.341 519.05 -1137.34
## + fixed.acidity         1     0.009 519.39 -1136.52
## + density               1     0.006 519.39 -1136.51
## - pH                    1     4.523 523.92 -1129.40
## - total.sulfur.dioxide  1     7.931 527.33 -1121.10
## - chlorides             1    10.569 529.96 -1114.71
## - sulphates             1    20.436 539.83 -1091.10
## - volatile.acidity      1    33.897 553.29 -1059.57
## - alcohol               1    95.087 614.48  -925.31
summary(both_sel)
## 
## Call:
## lm(formula = quality ~ volatile.acidity + chlorides + free.sulfur.dioxide + 
##     total.sulfur.dioxide + pH + sulphates + alcohol, data = train_set)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.64226 -0.37030 -0.05449  0.46926  1.94207 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           4.3587278  0.4406047   9.893  < 2e-16 ***
## volatile.acidity     -1.0239207  0.1123801  -9.111  < 2e-16 ***
## chlorides            -2.2754533  0.4472560  -5.088 4.17e-07 ***
## free.sulfur.dioxide   0.0035739  0.0023289   1.535 0.125135    
## total.sulfur.dioxide -0.0033743  0.0007656  -4.407 1.14e-05 ***
## pH                   -0.4307869  0.1294315  -3.328 0.000899 ***
## sulphates             0.8629974  0.1219875   7.074 2.47e-12 ***
## alcohol               0.2846803  0.0186553  15.260  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.639 on 1272 degrees of freedom
## Multiple R-squared:  0.3605, Adjusted R-squared:  0.357 
## F-statistic: 102.5 on 7 and 1272 DF,  p-value: < 2.2e-16
BidirSelection_AIC = AIC(both_sel)
BidirSelection_AIC # AIC of the model using bidirectional(both) selection method
## [1] 2495.986

B. Compare these three methods to see which one is the better method(s). Comment on the significance of the predictors?

Comapring all 3 models for AIC score:

AIC_df = data.frame(FwdSelection=FwdSelection_AIC, BackSelection=BackSelection_AIC, BidirSelection=BidirSelection_AIC)
rownames(AIC_df) = c("AIC")
AIC_df
##     FwdSelection BackSelection BidirSelection
## AIC     2495.986      2495.986       2495.986

As AIC scores of all 3 models are same at 2495.986, all 3 models are equal for this perticuler dataset.

  • for large dataset with large no. of predictor variables where the goal is to get small no. of significant predictors, “forward selection” or “bidirectional” algorithms are better choices.

Significance of the Predictors:

  • final variables chosen by all 3 models are same with same p-value for each.
  • these variables out of 11 in original dataset are,
    1. alcohol
    2. volatile.acidity
    3. sulphates
    4. total.sulfur.dioxide
    5. chlorides
    6. pH
    7. free.sulfur.dioxide
  • p-values of all variable except “free.sulfur.dioxide” are less than popular significance level of 0.05. so all those independent variables are significant predictors of the model.
  • p-value of “free.sulfur.dioxide” is greater than 0.05, so it is not a significant predictor of the model.

2. Use the Partial Least Squares Regression on the red wine quality data. (30 Points)

A. Create regression models using PCR and PLSR.

PCR modeling:

pcr_model = pcr(quality ~ ., data=train_set)
summary(pcr_model)
## Data:    X dimension: 1280 11 
##  Y dimension: 1280 1
## Fit method: svdpc
## Number of components considered: 11
## TRAINING: % variance explained
##          1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps
## X         94.578   99.485   99.750   99.905    99.99   100.00   100.00
## quality    3.614    4.717    5.718    5.972    26.10    33.12    33.13
##          8 comps  9 comps  10 comps  11 comps
## X          100.0    100.0    100.00    100.00
## quality     33.5     35.1     36.17     36.21

PLSR modeling:

plsr_model = plsr(quality ~ ., data=train_set)
summary(plsr_model)
## Data:    X dimension: 1280 11 
##  Y dimension: 1280 1
## Fit method: kernelpls
## Number of components considered: 11
## TRAINING: % variance explained
##          1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps
## X         94.504   99.440    99.63    99.84    99.97   100.00   100.00
## quality    3.771    5.774    23.95    26.86    28.03    34.13    35.45
##          8 comps  9 comps  10 comps  11 comps
## X         100.00   100.00    100.00    100.00
## quality    35.99    36.15     36.17     36.21

PCR and PLSR are the model that help in situations when you have very large no. of attributes or the predictors are highly correlated (multicollinearity).

PCR uses Principle components generated by PCA method insteado f predictors in the model. PCs are linearly uncorrelated and generated basedo n variability of the predictor variables.

because PCR does not account for predictors correlation with response variable, sometimes PCs are also uncorrelated with response variable and that leads to low predictive performance of the model.

this issue is resolved in PLSR model which accounts for PC’s high correlation with response variable as well as variability of predictor variables.

3. Use the Penalized Regression on the red wine quality data. (30 Points)

A. Create models using LASSO, Ridge regression, and Elastic Net. (Use glmnet package for all)

Penalty-based models: Ridge, Lasso, Elastic

ridge_model <- glmnet(x = as.matrix(train_set[,-12]), y = as.matrix(train_set$quality), alpha = 0) #alpha =0 for ridge
summary(ridge_model)
##           Length Class     Mode   
## a0         100   -none-    numeric
## beta      1100   dgCMatrix S4     
## df         100   -none-    numeric
## dim          2   -none-    numeric
## lambda     100   -none-    numeric
## dev.ratio  100   -none-    numeric
## nulldev      1   -none-    numeric
## npasses      1   -none-    numeric
## jerr         1   -none-    numeric
## offset       1   -none-    logical
## call         4   -none-    call   
## nobs         1   -none-    numeric
elastic_model <- glmnet(x = as.matrix(train_set[,-12]), y = as.matrix(train_set$quality), alpha = 0.5) #alpha =0.5 for elastic 
summary(elastic_model)
##           Length Class     Mode   
## a0         71    -none-    numeric
## beta      781    dgCMatrix S4     
## df         71    -none-    numeric
## dim         2    -none-    numeric
## lambda     71    -none-    numeric
## dev.ratio  71    -none-    numeric
## nulldev     1    -none-    numeric
## npasses     1    -none-    numeric
## jerr        1    -none-    numeric
## offset      1    -none-    logical
## call        4    -none-    call   
## nobs        1    -none-    numeric
lasso_model <- glmnet(x = as.matrix(train_set[,-12]), y = as.matrix(train_set$quality), alpha = 1) #alpha =1 for lasso
summary(lasso_model)
##           Length Class     Mode   
## a0         69    -none-    numeric
## beta      759    dgCMatrix S4     
## df         69    -none-    numeric
## dim         2    -none-    numeric
## lambda     69    -none-    numeric
## dev.ratio  69    -none-    numeric
## nulldev     1    -none-    numeric
## npasses     1    -none-    numeric
## jerr        1    -none-    numeric
## offset      1    -none-    logical
## call        4    -none-    call   
## nobs        1    -none-    numeric

regression coefficients generated by the least squares regression are unbiased estimators.

unbiased estomators are not necessary to provide the smallest variance or MSE of the model.

somestimes biased estimators may lead to better variance and MSE of the model. especially when collinearity exists which leads to inflation of the variance of the estimator.

the solution is to add a penalty term to the regression coefficient when we apply the least squares method to build up the regression model.

based on the way we add this penalty term, we have 3 different type of penalty-based models.

LASSO - adds a penalty on the summation of the absolute regression parameters. makes sparsity in parameters.

RIDGE - adds a penalty on the summation of the squared regression parameters. can handle the multicollinearity and provides smaller variance and MSE.

ELASTIC - combines the lasso and ridge regression concepts.

4. Compare all the models above. Which one gives the best prediction quality? (Use MSE) (10 Points)

Prediction of Quality of red wine:

step.forward.predicted <- predict(fwd_sel, test_set[,-12])
step.backward.predicted <- predict(back_sel, test_set[,-12])
step.both.predicted <- predict(both_sel, test_set[,-12])
pcr.predicted <- predict(pcr_model, test_set[,-12], ncomp = 6)
plsr.predicted <- predict(plsr_model, test_set[,-12], ncomp = 6)
lasso.predicted <- predict(lasso_model, as.matrix(test_set[,-12]), s = 0.1)
ridge.predicted <- predict(ridge_model, as.matrix(test_set[,-12]), s = 0.1)
elastic.predicted <- predict(elastic_model, as.matrix(test_set[,-12]), s = 0.1)
final_results <- data.frame(step.forward.predicted, step.backward.predicted, step.both.predicted, pcr.predicted, plsr.predicted, lasso.predicted, ridge.predicted, elastic.predicted)
colnames(final_results) <- c("Step_forward", "Step_backward", "Step_both", "PCR", "PLSR", "Lasso", "Ridge", "Elastic")
head(final_results)
##   Step_forward Step_backward Step_both      PCR     PLSR    Lasso    Ridge
## 1     5.320067      5.320067  5.320067 5.266609 5.267298 5.388452 5.364428
## 2     5.941546      5.941546  5.941546 5.691850 5.743269 5.397953 5.922857
## 3     5.037195      5.037195  5.037195 5.070398 5.093969 5.297293 5.062500
## 4     5.147282      5.147282  5.147282 5.139064 5.140284 5.401897 5.185984
## 5     5.165358      5.165358  5.165358 5.016662 5.051697 5.369611 5.208209
## 6     5.074575      5.074575  5.074575 5.117242 5.124568 5.296019 5.219919
##    Elastic
## 1 5.358380
## 2 5.650633
## 3 5.174022
## 4 5.304687
## 5 5.273243
## 6 5.200493

Mean Squared Error (MSE) calculation:

MSE = c()
for(i in 1:8){
MSE <- rbind(MSE, c(names(final_results)[i], mean((test_set$quality - final_results[,i])^2)))
}
MSE <- data.frame(Model = MSE[,1], MSE = as.numeric(MSE[,2]))
MSE[order(MSE$MSE),]
##           Model       MSE
## 1  Step_forward 0.4669804
## 2 Step_backward 0.4669804
## 3     Step_both 0.4669804
## 7         Ridge 0.4716701
## 5          PLSR 0.4723489
## 4           PCR 0.4786588
## 8       Elastic 0.4857038
## 6         Lasso 0.5071073

Evaluating Prediction Performance:

based on MSE of all models, looks like Step_forward, step_backward and step_both have the lowest Mean Squared Error (MSE). so these 3 models have the best prediction performance among all 8 models.