library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.2
## ✔ ggplot2 3.5.2 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
unzip("/Users/gonzalezp/Downloads/wine+quality.zip", files = "winequality-red.csv", exdir = "C:/Users/gonzalezp/Downloads")
redwine <- read.csv("/Users/gonzalezp/Downloads/winequality-red.csv", header = TRUE, sep = ";")
sample_n(redwine, 5)
## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 6.9 0.480 0.20 1.9 0.082
## 2 8.5 0.655 0.49 6.1 0.122
## 3 8.9 0.430 0.45 1.9 0.052
## 4 10.6 0.360 0.57 2.3 0.087
## 5 7.9 0.200 0.35 1.7 0.054
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 9 23 0.99585 3.39 0.43 9.05
## 2 34 151 1.00100 3.31 1.14 9.30
## 3 6 16 0.99480 3.35 0.70 12.50
## 4 6 20 0.99676 3.14 0.72 11.10
## 5 7 15 0.99458 3.32 0.80 11.90
## quality
## 1 4
## 2 5
## 3 6
## 4 7
## 5 7
str(redwine)
## 'data.frame': 1599 obs. of 12 variables:
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
A glimpse into the “redwine” dataset shows that all of the independent variables are continuous, while the dependent variable is an integer and therefore discrete. This makes sense, given that the “quality” variable represents a ranking of wine quality on a scale of 1-10. Because the dependent variable is discrete, I will build a logistic regression model to predict the “class” of red wine (high quality or low quality).
sum(is.na(redwine) == TRUE)
## [1] 0
table(complete.cases(redwine))
##
## TRUE
## 1599
sort(sapply(redwine, function(x) sum(is.na(x))))
## fixed.acidity volatile.acidity citric.acid
## 0 0 0
## residual.sugar chlorides free.sulfur.dioxide
## 0 0 0
## total.sulfur.dioxide density pH
## 0 0 0
## sulphates alcohol quality
## 0 0 0
library(visdat)
vis_dat(redwine)
summary(redwine)
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
## Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
## Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
## Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
## chlorides free.sulfur.dioxide total.sulfur.dioxide density
## Min. :0.01200 Min. : 1.00 Min. : 6.00 Min. :0.9901
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00 1st Qu.:0.9956
## Median :0.07900 Median :14.00 Median : 38.00 Median :0.9968
## Mean :0.08747 Mean :15.87 Mean : 46.47 Mean :0.9967
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00 3rd Qu.:0.9978
## Max. :0.61100 Max. :72.00 Max. :289.00 Max. :1.0037
## pH sulphates alcohol quality
## Min. :2.740 Min. :0.3300 Min. : 8.40 Min. :3.000
## 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50 1st Qu.:5.000
## Median :3.310 Median :0.6200 Median :10.20 Median :6.000
## Mean :3.311 Mean :0.6581 Mean :10.42 Mean :5.636
## 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :4.010 Max. :2.0000 Max. :14.90 Max. :8.000
A review of the summary data reveals that the independent variables are measured in different units, and the means of some variables are orders of magnitude larger than others. For example, the mean of “total sulfur dioxide” is 46.47 compared to 0.087 for “chlorides.” The dataset will need to be standardized so that variables on larger scales are not overrepresented in the model.
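As a quick sketch of what standardization will do (using base R's scale() here only for illustration; caret's preProcess() is used for the actual modeling below), centering and scaling should leave every independent variable with mean 0 and standard deviation 1:
#Illustration only: standardize the independent variables with base scale()
scaled_check <- scale(redwine[, setdiff(names(redwine), "quality")])
round(colMeans(scaled_check), 3)
apply(scaled_check, 2, sd)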
qqnorm(redwine$quality)
boxplot(redwine$quality)
par(mfrow = c(2, 6))
#Draw a boxplot of each variable in a 2 x 6 grid
for (i in seq_along(redwine)) {
  boxplot(redwine[, i], main = names(redwine)[i])
}
Boxplot analysis of all of the variables in the red wine quality dataset
shows that most independent variables have a number of outliers, with
most variables also having a right-skewed distribution as a result.
Regarding the nature of these outliers, I don’t believe they are due to
a measurement or other error. All independent variables have at least
one outlier. Additionally, this dataset includes data points for a large
number of unique wines. Given that wine grapes come from various regions
across the globe, grow in unique soils, and have many genetic varieties,
I believe that it is understandable that outliers would exist in the
dataset. Though we cannot determine this from the dataset, it could be
that the outliers come from the same geographic location.
That said, I will test whether the presence of outliers impacts the logistic regression models. I will build one model that includes the outliers and another model that excludes them. In addition to testing which model has greater accuracy, I will also analyze the balanced accuracy of the two models to evaluate whether the presence or absence of outliers introduces bias into the model.
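As a rough check on the claim that every independent variable has at least one outlier, the 1.5 * IQR rule can be applied column by column. This is a sketch only; n_outliers is a hypothetical helper, and a similar function is defined later when outliers are actually removed:
#Sketch: count 1.5 * IQR outliers in each independent variable
n_outliers <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[2] - q[1]
  sum(x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr, na.rm = TRUE)
}
sapply(redwine[, setdiff(names(redwine), "quality")], n_outliers)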
#Encoding the "quality" dependent variable
redwine <- redwine %>%
mutate(
quality = if_else(
quality > 5,
1,
0
)
)
redwine.quality <- redwine$quality
The “quality” dependent variable is encoded to create two distinct classes: “high quality” and “low quality” red wines. Six is the median quality ranking, so it serves as the cutoff between high quality and low quality: wines with a ranking greater than five (six or above) are classified as high quality, and wines with a ranking of five or less are classified as low quality. This cutoff ensures that a nearly equal number of high quality and low quality wines are present in the dataset, so classification bias is not introduced into the models by an overrepresentation of high or low quality wines.
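A quick check of the class balance after encoding (output not shown here) should confirm that the two classes are of roughly equal size:
#Check the balance of the encoded classes (0 = low quality, 1 = high quality)
table(redwine$quality)
prop.table(table(redwine$quality))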
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
preproc1 <- preProcess(redwine, method = c("center", "scale"))
norm1 <- predict(preproc1, redwine)
head(norm1)
## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 -0.5281944 0.9615758 -1.391037 -0.45307667 -0.24363047
## 2 -0.2984541 1.9668271 -1.391037 0.04340257 0.22380518
## 3 -0.2984541 1.2966596 -1.185699 -0.16937425 0.09632273
## 4 1.6543385 -1.3840105 1.483689 -0.45307667 -0.26487754
## 5 -0.5281944 0.9615758 -1.391037 -0.45307667 -0.24363047
## 6 -0.5281944 0.7381867 -1.391037 -0.52400227 -0.26487754
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates
## 1 -0.46604672 -0.3790141 0.55809987 1.2882399 -0.57902538
## 2 0.87236532 0.6241680 0.02825193 -0.7197081 0.12891007
## 3 -0.08364328 0.2289750 0.13422152 -0.3310730 -0.04807379
## 4 0.10755844 0.4113718 0.66406945 -0.9787982 -0.46103614
## 5 -0.46604672 -0.3790141 0.55809987 1.2882399 -0.57902538
## 6 -0.27484500 -0.1966174 0.55809987 1.2882399 -0.57902538
## alcohol quality
## 1 -0.9599458 -1.0716692
## 2 -0.5845942 -1.0716692
## 3 -0.5845942 -1.0716692
## 4 -0.5845942 0.9325402
## 5 -0.9599458 -1.0716692
## 6 -0.9599458 -1.0716692
library(corrplot)
## corrplot 0.95 loaded
M <- cor(norm1)
p.mat1 <- cor.mtest(norm1)
p.mat1
## $p
## fixed.acidity volatile.acidity citric.acid
## fixed.acidity 0.000000e+00 2.276720e-25 2.535321e-210
## volatile.acidity 2.276720e-25 0.000000e+00 1.805663e-128
## citric.acid 2.535321e-210 1.805663e-128 0.000000e+00
## residual.sugar 4.199465e-06 9.389168e-01 8.083723e-09
## chlorides 1.751746e-04 1.422491e-02 1.863705e-16
## free.sulfur.dioxide 6.335579e-10 6.747011e-01 1.473916e-02
## total.sulfur.dioxide 5.709033e-06 2.213857e-03 1.555454e-01
## density 3.074747e-207 3.787554e-01 1.478795e-51
## pH 4.063034e-220 1.718994e-21 1.007201e-122
## sulphates 1.648652e-13 2.606926e-26 1.265262e-37
## alcohol 1.364868e-02 3.155190e-16 1.059462e-05
## quality 1.399766e-04 9.336621e-40 1.563717e-10
## residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity 4.199465e-06 1.751746e-04 6.335579e-10
## volatile.acidity 9.389168e-01 1.422491e-02 6.747011e-01
## citric.acid 8.083723e-09 1.863705e-16 1.473916e-02
## residual.sugar 0.000000e+00 2.617079e-02 4.684735e-14
## chlorides 2.617079e-02 0.000000e+00 8.241238e-01
## free.sulfur.dioxide 4.684735e-14 8.241238e-01 0.000000e+00
## total.sulfur.dioxide 2.449285e-16 5.809120e-02 6.404723e-207
## density 9.013042e-49 5.541458e-16 3.804985e-01
## pH 6.065915e-04 4.149217e-27 4.869975e-03
## sulphates 8.252134e-01 1.986310e-53 3.888321e-02
## alcohol 9.258425e-02 3.654950e-19 5.492314e-03
## quality 9.312092e-01 1.143103e-05 1.351417e-02
## total.sulfur.dioxide density pH
## fixed.acidity 5.709033e-06 3.074747e-207 4.063034e-220
## volatile.acidity 2.213857e-03 3.787554e-01 1.718994e-21
## citric.acid 1.555454e-01 1.478795e-51 1.007201e-122
## residual.sugar 2.449285e-16 9.013042e-49 6.065915e-04
## chlorides 5.809120e-02 5.541458e-16 4.149217e-27
## free.sulfur.dioxide 6.404723e-207 3.804985e-01 4.869975e-03
## total.sulfur.dioxide 0.000000e+00 4.354284e-03 7.818341e-03
## density 4.354284e-03 0.000000e+00 5.117102e-45
## pH 7.818341e-03 5.117102e-45 0.000000e+00
## sulphates 8.601835e-02 2.418474e-09 2.106734e-15
## alcohol 9.890520e-17 3.938835e-100 9.964498e-17
## quality 5.623796e-21 1.571846e-10 8.962367e-01
## sulphates alcohol quality
## fixed.acidity 1.648652e-13 1.364868e-02 1.399766e-04
## volatile.acidity 2.606926e-26 3.155190e-16 9.336621e-40
## citric.acid 1.265262e-37 1.059462e-05 1.563717e-10
## residual.sugar 8.252134e-01 9.258425e-02 9.312092e-01
## chlorides 1.986310e-53 3.654950e-19 1.143103e-05
## free.sulfur.dioxide 3.888321e-02 5.492314e-03 1.351417e-02
## total.sulfur.dioxide 8.601835e-02 9.890520e-17 5.623796e-21
## density 2.418474e-09 3.938835e-100 1.571846e-10
## pH 2.106734e-15 9.964498e-17 8.962367e-01
## sulphates 0.000000e+00 1.783053e-04 1.147929e-18
## alcohol 1.783053e-04 0.000000e+00 1.023980e-74
## quality 1.147929e-18 1.023980e-74 0.000000e+00
##
## $lowCI
## fixed.acidity volatile.acidity citric.acid residual.sugar
## fixed.acidity 1.00000000 -0.30136811 0.64388389 0.066127651
## volatile.acidity -0.30136811 1.00000000 -0.58565496 -0.047107687
## citric.acid 0.64388389 -0.58565496 1.00000000 0.095226254
## residual.sugar 0.06612765 -0.04710769 0.09522625 1.000000000
## chlorides 0.04489025 0.01231363 0.15636411 0.006606405
## free.sulfur.dioxide -0.20129772 -0.05949433 -0.10967144 0.139305194
## total.sulfur.dioxide -0.16130760 0.02755215 -0.01351165 0.155554927
## density 0.63998465 -0.02702409 0.32168095 0.311690761
## pH -0.70828567 0.18808228 -0.57563371 -0.134110462
## sulphates 0.13519740 -0.30609173 0.26785576 -0.043505806
## alcohol -0.11035580 -0.24884156 0.06121189 -0.006960058
## quality 0.04628813 -0.36471503 0.11097394 -0.051176170
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## fixed.acidity 0.044890252 -0.201297716 -0.161307603
## volatile.acidity 0.012313634 -0.059494333 0.027552148
## citric.acid 0.156364107 -0.109671437 -0.013511651
## residual.sugar 0.006606405 0.139305194 0.155554927
## chlorides 1.000000000 -0.043470846 -0.001624446
## free.sulfur.dioxide -0.043470846 1.000000000 0.639578570
## total.sulfur.dioxide -0.001624446 0.639578570 1.000000000
## density 0.153117128 -0.070890706 0.022326338
## pH -0.310019533 0.021430293 -0.115140383
## sulphates 0.328212685 0.002643125 -0.006087119
## alcohol -0.267264388 -0.118027906 -0.252133230
## quality -0.157668845 -0.110443529 -0.277824943
## density pH sulphates alcohol
## fixed.acidity 0.63998465 -0.70828567 0.135197400 -0.110355799
## volatile.acidity -0.02702409 0.18808228 -0.306091729 -0.248841557
## citric.acid 0.32168095 -0.57563371 0.267855761 0.061211891
## residual.sugar 0.31169076 -0.13411046 -0.043505806 -0.006960058
## chlorides 0.15311713 -0.31001953 0.328212685 -0.267264388
## free.sulfur.dioxide -0.07089071 0.02143029 0.002643125 -0.118027906
## total.sulfur.dioxide 0.02232634 -0.11514038 -0.006087119 -0.252133230
## density 1.00000000 -0.38428354 0.100214831 -0.532254720
## pH -0.38428354 1.00000000 -0.243323134 0.158206140
## sulphates 0.10021483 -0.24332313 1.000000000 0.044779062
## alcohol -0.53225472 0.15820614 0.044779062 1.000000000
## quality -0.20652030 -0.05227676 0.170877221 0.394129758
## quality
## fixed.acidity 0.04628813
## volatile.acidity -0.36471503
## citric.acid 0.11097394
## residual.sugar -0.05117617
## chlorides -0.15766885
## free.sulfur.dioxide -0.11044353
## total.sulfur.dioxide -0.27782494
## density -0.20652030
## pH -0.05227676
## sulphates 0.17087722
## alcohol 0.39412976
## quality 1.00000000
##
## $uppCI
## fixed.acidity volatile.acidity citric.acid residual.sugar
## fixed.acidity 1.00000000 -0.20974326 0.69774932 0.16288141
## volatile.acidity -0.20974326 1.00000000 -0.51749023 0.05093423
## citric.acid 0.69774932 -0.51749023 1.00000000 0.19125221
## residual.sugar 0.16288141 0.05093423 0.19125221 1.00000000
## chlorides 0.14207371 0.10998841 0.25034272 0.10434622
## free.sulfur.dioxide -0.10556896 0.03853716 -0.01199284 0.23392519
## total.sulfur.dioxide -0.06451827 0.12502248 0.08440714 0.24956518
## density 0.69433020 0.07097074 0.40669253 0.39738352
## pH -0.65591741 0.28072535 -0.50633364 -0.03678574
## sulphates 0.22996377 -0.21471255 0.35632784 0.05453349
## alcohol -0.01268548 -0.15480196 0.15807276 0.09090907
## quality 0.14344594 -0.27678106 0.20653939 0.04686565
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## fixed.acidity 0.14207371 -0.10556896 -0.06451827
## volatile.acidity 0.10998841 0.03853716 0.12502248
## citric.acid 0.25034272 -0.01199284 0.08440714
## residual.sugar 0.10434622 0.23392519 0.24956518
## chlorides 1.00000000 0.05456841 0.09619808
## free.sulfur.dioxide 0.05456841 1.00000000 0.69397398
## total.sulfur.dioxide 0.09619808 0.69397398 1.00000000
## density 0.24722198 0.02710447 0.11987182
## pH -0.21884824 0.11898813 -0.01753056
## sulphates 0.41276938 0.10042441 0.09177476
## alcohol -0.17400572 -0.02045682 -0.15822796
## quality -0.06079920 -0.01277428 -0.18504601
## density pH sulphates alcohol
## fixed.acidity 0.69433020 -0.65591741 0.22996377 -0.01268548
## volatile.acidity 0.07097074 0.28072535 -0.21471255 -0.15480196
## citric.acid 0.40669253 -0.50633364 0.35632784 0.15807276
## residual.sugar 0.39738352 -0.03678574 0.05453349 0.09090907
## chlorides 0.24722198 -0.21884824 0.41276938 -0.17400572
## free.sulfur.dioxide 0.02710447 0.11898813 0.10042441 -0.02045682
## total.sulfur.dioxide 0.11987182 -0.01753056 0.09177476 -0.15822796
## density 1.00000000 -0.29766422 0.19609995 -0.45830615
## pH -0.29766422 1.00000000 -0.14906342 0.25211227
## sulphates 0.19609995 -0.14906342 1.00000000 0.14196454
## alcohol -0.45830615 0.25211227 0.14196454 1.00000000
## quality -0.11095424 0.04576448 0.26426775 0.47367733
## quality
## fixed.acidity 0.14344594
## volatile.acidity -0.27678106
## citric.acid 0.20653939
## residual.sugar 0.04686565
## chlorides -0.06079920
## free.sulfur.dioxide -0.01277428
## total.sulfur.dioxide -0.18504601
## density -0.11095424
## pH 0.04576448
## sulphates 0.26426775
## alcohol 0.47367733
## quality 1.00000000
corrplot(M, type = "upper", order = "hclust", p.mat = p.mat1$p, sig.level = 0.05)
The correlation plot shows a high level of correlation between several variable pairs: fixed acidity and pH (negative), citric acid and volatile acidity (negative), citric acid and pH (negative), free sulfur dioxide and total sulfur dioxide (positive), density and fixed acidity (positive), and fixed acidity and citric acid (positive). In addition, two variables (fixed acidity and density) are each highly correlated with multiple other variables, so there does appear to be some multicollinearity in the dataset. As a result, I will compute variance inflation factors (VIFs) for the independent variables and remove any variable whose VIF is high enough to warrant exclusion from the model.
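For reference, the variance inflation factor for a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all of the other predictors. A minimal sketch for fixed acidity, computed on the full standardized dataset, so the value will differ slightly from the training-set VIF reported below:
#Sketch: VIF for fixed acidity by hand, excluding the dependent variable
r2_fixed <- summary(lm(fixed.acidity ~ . - quality, data = norm1))$r.squared
1 / (1 - r2_fixed)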
#Add the encoded, unscaled quality variable back into the dataframe
norm1$quality <- redwine.quality
str(norm1)
## 'data.frame': 1599 obs. of 12 variables:
## $ fixed.acidity : num -0.528 -0.298 -0.298 1.654 -0.528 ...
## $ volatile.acidity : num 0.962 1.967 1.297 -1.384 0.962 ...
## $ citric.acid : num -1.39 -1.39 -1.19 1.48 -1.39 ...
## $ residual.sugar : num -0.4531 0.0434 -0.1694 -0.4531 -0.4531 ...
## $ chlorides : num -0.2436 0.2238 0.0963 -0.2649 -0.2436 ...
## $ free.sulfur.dioxide : num -0.466 0.8724 -0.0836 0.1076 -0.466 ...
## $ total.sulfur.dioxide: num -0.379 0.624 0.229 0.411 -0.379 ...
## $ density : num 0.5581 0.0283 0.1342 0.6641 0.5581 ...
## $ pH : num 1.288 -0.72 -0.331 -0.979 1.288 ...
## $ sulphates : num -0.579 0.1289 -0.0481 -0.461 -0.579 ...
## $ alcohol : num -0.96 -0.585 -0.585 -0.585 -0.96 ...
## $ quality : num 0 0 0 1 0 0 0 1 1 0 ...
#Partition the dataset into training and testing sets.
set.seed(123)
training.samples <- createDataPartition(y = norm1$quality, p = 0.8, list = FALSE)
train.data <- norm1[training.samples,]
test.data <- norm1[-training.samples,]
#Run linear regression model to evaluate VIF of independent variables.
lm.model <- lm(quality ~ ., data = train.data)
lm.predictions <- predict(lm.model, test.data)
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:purrr':
##
## some
vif(lm.model)
## fixed.acidity volatile.acidity citric.acid
## 7.904660 1.831394 3.238379
## residual.sugar chlorides free.sulfur.dioxide
## 1.661341 1.500448 1.936827
## total.sulfur.dioxide density pH
## 2.167901 6.279581 3.425344
## sulphates alcohol
## 1.448063 3.134299
After running the linear regression model and performing a VIF analysis, there are two independent variables with variance inflation factors that I would consider to be problematic. We identified earlier that fixed acidity and density had multiple, highly correlated relationships with other independent variables. The VIFs of 7.9 and 6.3, respectively, reach into the 5-10 range and, I believe, warrant removal of the variables from the model.
lm.model.adapted <- lm(quality ~ . -fixed.acidity -density, data = train.data)
vif(lm.model.adapted)
## volatile.acidity citric.acid residual.sugar
## 1.695032 2.240593 1.093797
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## 1.390366 1.916759 2.002540
## pH sulphates alcohol
## 1.631214 1.352515 1.286963
After removing these variables from the model and running the VIF analysis again, we see that the VIFs of all independent variables have decreased. This result supports the removal of the fixed acidity and density variables.
Model 1: Logistic Regression with outliers included, multicollinear independent variables removed
glm.model <- glm(quality ~ . -fixed.acidity -density, family = "binomial", data = train.data)
summary(glm.model)
##
## Call:
## glm(formula = quality ~ . - fixed.acidity - density, family = "binomial",
## data = train.data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.24152 0.06892 3.504 0.000458 ***
## volatile.acidity -0.52317 0.09240 -5.662 1.50e-08 ***
## citric.acid -0.12475 0.09919 -1.258 0.208494
## residual.sugar -0.01285 0.07193 -0.179 0.858237
## chlorides -0.22084 0.07713 -2.863 0.004195 **
## free.sulfur.dioxide 0.21168 0.09674 2.188 0.028667 *
## total.sulfur.dioxide -0.57273 0.10143 -5.647 1.64e-08 ***
## pH -0.22528 0.08327 -2.705 0.006823 **
## sulphates 0.40884 0.08201 4.986 6.18e-07 ***
## alcohol 0.98465 0.08740 11.265 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1765.0 on 1279 degrees of freedom
## Residual deviance: 1328.1 on 1270 degrees of freedom
## AIC: 1348.1
##
## Number of Fisher Scoring iterations: 4
This logistic regression model found that the alcohol, sulphates, total sulfur dioxide, and volatile acidity variables have strong relationships with the response variable, wine quality. The independent variables with the greatest positive influence on a high quality classification are alcohol and sulphates; the variables with the greatest negative influence are total sulfur dioxide and volatile acidity. The model also achieves a substantial reduction in deviance (from a null deviance of 1765.0 to a residual deviance of 1328.1).
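Because the predictors were standardized, each coefficient is on the log-odds scale per one-standard-deviation increase in that predictor. Exponentiating gives odds ratios, which can make the comparison more intuitive (a sketch; output not shown):
#Convert log-odds coefficients to odds ratios per 1-SD increase in each predictor
round(exp(coef(glm.model)), 2)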
test.pred <- test.data %>%
mutate(
prediction = predict(glm.model, test.data, type = "response"),
pred_class = round(prediction)
)
library(yardstick)
## Warning: package 'yardstick' was built under R version 4.5.2
##
## Attaching package: 'yardstick'
## The following objects are masked from 'package:caret':
##
## precision, recall, sensitivity, specificity
## The following object is masked from 'package:readr':
##
## spec
outcomes <- table(test.pred$pred_class, test.data$quality)
conf <- conf_mat(outcomes)
autoplot(conf)
confusion <- confusionMatrix(outcomes)
print(confusion)
## Confusion Matrix and Statistics
##
##
## 0 1
## 0 122 45
## 1 37 115
##
## Accuracy : 0.7429
## 95% CI : (0.6913, 0.79)
## No Information Rate : 0.5016
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.486
##
## Mcnemar's Test P-Value : 0.4395
##
## Sensitivity : 0.7673
## Specificity : 0.7188
## Pos Pred Value : 0.7305
## Neg Pred Value : 0.7566
## Prevalence : 0.4984
## Detection Rate : 0.3824
## Detection Prevalence : 0.5235
## Balanced Accuracy : 0.7430
##
## 'Positive' Class : 0
##
The overall accuracy of this logistic regression model is 74.29%, significantly higher than the no information rate. The p-value illustrates this high statistical significance. Sensitivity of 76.73% compared to specificity at 71.88% indicates that the model is more successful when identifying low quality wines than high quality wines. That said, the balanced accuracy of 74.3% demonstrates an overall balanced rate of identifying both high and low quality wines.
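As an additional, threshold-independent check, the area under the ROC curve could be computed from the predicted probabilities. This is a sketch that assumes the pROC package is installed; it is not used elsewhere in this analysis:
#Sketch: AUC for model 1 using the pROC package (assumed installed)
library(pROC)
roc_obj <- roc(response = test.data$quality, predictor = test.pred$prediction)
auc(roc_obj)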
The 74.29% accuracy rate indicates a decent model. I am curious, though, to see whether removing outliers from the dataset improves this accuracy rate, or whether removing the outliers instead introduces bias such that there is a larger discrepancy between the sensitivity and specificity measures.
Now let’s run another logistic regression, this time with the outliers removed.
#Reload the dataset
redwine <- read.csv("/Users/gonzalezp/Downloads/winequality-red.csv", header = TRUE, sep = ";")
red_qual <- redwine %>%
mutate(
quality = if_else(
quality > 5,
1,
0
)
) %>%
select(quality)
rw.clean <- redwine %>% select(-quality)
#Remove outliers
detect_outlier <- function(x) {
Q1 <- quantile(x, probs = .25, na.rm = TRUE)
Q3 <- quantile(x, probs = .75, na.rm = TRUE)
IQR <- Q3 - Q1
x < (Q1 - 1.5 * IQR) | x > (Q3 + 1.5 * IQR)
}
numeric_cols <- sapply(rw.clean, is.numeric)
outlier_matrix <- sapply(rw.clean[, numeric_cols], detect_outlier)
rows_with_outliers <- apply(outlier_matrix, 1, any)
redwine_clean <- rw.clean[!rows_with_outliers, ]
red_qual <- red_qual[!rows_with_outliers,]
library(caret)
#Standardize the cleaned up dataset.
preproc2 <- preProcess(redwine_clean, method = c("center", "scale"))
norm2 <- predict(preproc2, redwine_clean)
norm2$quality <- red_qual
#Partition the dataset into training and testing sets.
set.seed(123)
training.samples_clean <- createDataPartition(y = norm2$quality, p = 0.8, list = FALSE)
train.data_clean <- norm2[training.samples_clean,]
test.data_clean <- norm2[-training.samples_clean,]
#Perform a linear regression and run the VIF model analysis again to determine if multicollinearity is present
lm.model_clean <- lm(quality ~ ., data = train.data_clean)
library(car)
vif(lm.model_clean)
## fixed.acidity volatile.acidity citric.acid
## 6.860257 2.049391 3.140186
## residual.sugar chlorides free.sulfur.dioxide
## 1.759491 1.302774 1.789781
## total.sulfur.dioxide density pH
## 2.039532 6.847952 3.058711
## sulphates alcohol
## 1.322633 3.626859
Fixed acidity and density again have problematic VIF scores, so they will again be removed from the model.
lm.model.adapted_clean <- lm(quality ~ . -fixed.acidity -density, data = train.data_clean)
vif(lm.model.adapted_clean)
## volatile.acidity citric.acid residual.sugar
## 1.695032 2.240593 1.093797
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## 1.390366 1.916759 2.002540
## pH sulphates alcohol
## 1.631214 1.352515 1.286963
Model 2: Logistic Regression with outliers excluded, multicollinear independent variables removed
glm.model_clean <- glm(quality ~ . -fixed.acidity -density, family = "binomial", data = train.data_clean)
summary(glm.model_clean)
##
## Call:
## glm(formula = quality ~ . - fixed.acidity - density, family = "binomial",
## data = train.data_clean)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.28053 0.08098 3.464 0.000532 ***
## volatile.acidity -0.48457 0.10932 -4.432 9.32e-06 ***
## citric.acid -0.28652 0.11722 -2.444 0.014513 *
## residual.sugar 0.01016 0.08378 0.121 0.903435
## chlorides -0.10156 0.08744 -1.161 0.245448
## free.sulfur.dioxide 0.23012 0.10657 2.159 0.030822 *
## total.sulfur.dioxide -0.44115 0.11618 -3.797 0.000146 ***
## pH -0.25925 0.09269 -2.797 0.005158 **
## sulphates 0.63865 0.09352 6.829 8.53e-12 ***
## alcohol 0.89641 0.10213 8.777 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1318.93 on 955 degrees of freedom
## Residual deviance: 982.87 on 946 degrees of freedom
## AIC: 1002.9
##
## Number of Fisher Scoring iterations: 4
In this model, 7 of the 9 independent variables included are statistically significant predictors at the 0.05 level. Alcohol and sulphates have large positive coefficients, indicating that they are strong predictors of high quality wines, with the alcohol coefficient particularly large. Volatile acidity and total sulfur dioxide have strong negative coefficients.
test.pred_clean <- test.data_clean %>%
mutate(
prediction = predict(glm.model_clean, test.data_clean, type = "response"),
pred_class = round(prediction)
)
outcomes_clean <- table(test.pred_clean$pred_class, test.data_clean$quality)
confusion_clean <- confusionMatrix(outcomes_clean)
print(confusion_clean)
## Confusion Matrix and Statistics
##
##
## 0 1
## 0 82 31
## 1 28 97
##
## Accuracy : 0.7521
## 95% CI : (0.6922, 0.8056)
## No Information Rate : 0.5378
## P-Value [Acc > NIR] : 7.474e-12
##
## Kappa : 0.5023
##
## Mcnemar's Test P-Value : 0.7946
##
## Sensitivity : 0.7455
## Specificity : 0.7578
## Pos Pred Value : 0.7257
## Neg Pred Value : 0.7760
## Prevalence : 0.4622
## Detection Rate : 0.3445
## Detection Prevalence : 0.4748
## Balanced Accuracy : 0.7516
##
## 'Positive' Class : 0
##
Removing outliers from the dataset has increased the Kappa, overall accuracy rate, and balanced accuracy rate of the logistic regression model, and the gap between the sensitivity and specificity measures has actually narrowed. Compared to the first model, this model is better at identifying high quality wines (higher specificity), while sensitivity has decreased slightly. If anything, removing outliers from the dataset has reduced bias in the model.
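To make the comparison concrete, the headline metrics can be pulled side by side from the two caret confusionMatrix objects created above (a sketch; output not shown):
#Compare model 1 (outliers kept) and model 2 (outliers removed) side by side
metrics <- c("Sensitivity", "Specificity", "Balanced Accuracy")
rbind(
  with_outliers = c(confusion$overall["Accuracy"], confusion$byClass[metrics]),
  without_outliers = c(confusion_clean$overall["Accuracy"], confusion_clean$byClass[metrics])
)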
Part 2: Non-Parametrics
library(rpart)
library(rpart.plot)
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
library(readr)
library(kableExtra)
## Warning: package 'kableExtra' was built under R version 4.5.2
##
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
##
## group_rows
library(parsnip)
library(party)
## Loading required package: grid
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
##
## Attaching package: 'modeltools'
## The following object is masked from 'package:parsnip':
##
## fit
## The following object is masked from 'package:car':
##
## Predict
## Loading required package: strucchange
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Loading required package: sandwich
##
## Attaching package: 'strucchange'
## The following object is masked from 'package:stringr':
##
## boundary
##
## Attaching package: 'party'
## The following object is masked from 'package:dplyr':
##
## where
library(tidyverse)
library(dplyr)
unzip("/Users/gonzalezp/Downloads/wine+quality.zip", files = "winequality-red.csv", exdir = "C:/Users/gonzalezp/Downloads")
redwine.tree <- read.csv("/Users/gonzalezp/Downloads/winequality-red.csv", header = TRUE, sep = ";")
redwine.tree <- redwine.tree %>%
mutate(
quality = as.factor(if_else(
quality > 5,
1,
0
))
)
Partition the training and testing data sets.
set.seed(123)
training.samples_tree <- createDataPartition(y = redwine.tree$quality, p = 0.8, list = FALSE)
train.data_tree <- redwine.tree[training.samples_tree,]
test.data_tree <- redwine.tree[-training.samples_tree,]
Model 3: Decision Tree Model
library(rpart)
model.tree <- rpart(
quality ~ .,
data = train.data_tree,
method = "class"
)
predictions <- predict(model.tree, newdata = test.data_tree, type = "class")
confusionMatrix(predictions, test.data_tree$quality)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 106 54
## 1 42 117
##
## Accuracy : 0.6991
## 95% CI : (0.6455, 0.7489)
## No Information Rate : 0.5361
## P-Value [Acc > NIR] : 2.048e-09
##
## Kappa : 0.3983
##
## Mcnemar's Test P-Value : 0.2616
##
## Sensitivity : 0.7162
## Specificity : 0.6842
## Pos Pred Value : 0.6625
## Neg Pred Value : 0.7358
## Prevalence : 0.4639
## Detection Rate : 0.3323
## Detection Prevalence : 0.5016
## Balanced Accuracy : 0.7002
##
## 'Positive' Class : 0
##
The summary of the decision tree model shows that its accuracy is 69.91%. Sensitivity is slightly higher at 71.62% and specificity slightly lower at 68.42%. The accuracy rate for this model is lower than that of the previous logistic regression models. This could be because a more linear model suits this dataset better, or because the optimal number of splits was not determined before growing the tree.
library(rpart.plot)
rpart.plot(model.tree)
To determine whether a decision tree model with an optimal number of
splits is more accurate, I will find the optimal number of splits and
prune the decision tree at this number. Then, I will run the test data
through the new model to determine accuracy rate.
Grow a new decision tree with a deliberately small complexity parameter so that it has many splits. Determine the optimal amount of pruning by identifying the CP value at which the cross-validated error (xerror) is lowest, then prune the tree back using that CP value.
optimal.tree.model <- rpart(
quality ~ .,
data = train.data_tree,
method = "class",
control = rpart.control(cp = 0.001)
)
printcp(optimal.tree.model)
##
## Classification tree:
## rpart(formula = quality ~ ., data = train.data_tree, method = "class",
## control = rpart.control(cp = 0.001))
##
## Variables actually used in tree construction:
## [1] alcohol chlorides citric.acid
## [4] density fixed.acidity free.sulfur.dioxide
## [7] pH residual.sugar sulphates
## [10] total.sulfur.dioxide volatile.acidity
##
## Root node error: 596/1280 = 0.46563
##
## n= 1280
##
## CP nsplit rel error xerror xstd
## 1 0.3741611 0 1.00000 1.00000 0.029943
## 2 0.0369128 1 0.62584 0.65101 0.027590
## 3 0.0352349 2 0.58893 0.64597 0.027529
## 4 0.0145414 3 0.55369 0.58725 0.026756
## 5 0.0083893 6 0.51007 0.57383 0.026562
## 6 0.0075503 9 0.48490 0.56544 0.026437
## 7 0.0067114 19 0.40436 0.56711 0.026463
## 8 0.0050336 20 0.39765 0.55034 0.026206
## 9 0.0041946 23 0.38255 0.56376 0.026412
## 10 0.0039150 27 0.36577 0.55369 0.026258
## 11 0.0037752 30 0.35403 0.55369 0.026258
## 12 0.0033557 34 0.33893 0.56208 0.026387
## 13 0.0027964 39 0.32215 0.56544 0.026437
## 14 0.0020973 43 0.31040 0.57886 0.026636
## 15 0.0016779 47 0.30201 0.58221 0.026684
## 16 0.0013423 48 0.30034 0.61074 0.027080
## 17 0.0010000 53 0.29362 0.61745 0.027169
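Rather than reading the table by eye, the CP with the lowest cross-validated error can be extracted programmatically; with this table it lands at roughly 0.005, the value used in the pruning step below (a sketch):
#Sketch: pick the CP value with the lowest cross-validated error (xerror)
cp_table <- optimal.tree.model$cptable
best_cp <- cp_table[which.min(cp_table[, "xerror"]), "CP"]
best_cp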
optimal.tree <- rpart::prune(optimal.tree.model, cp = 0.005)
rpart.plot(optimal.tree)
Model 4: Optimized Decision Tree Model
pred.opt <- predict(optimal.tree, newdata = test.data_tree, type = "class")
confusionMatrix(pred.opt, test.data_tree$quality)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 109 42
## 1 39 129
##
## Accuracy : 0.7461
## 95% CI : (0.6946, 0.7929)
## No Information Rate : 0.5361
## P-Value [Acc > NIR] : 9.004e-15
##
## Kappa : 0.4902
##
## Mcnemar's Test P-Value : 0.8241
##
## Sensitivity : 0.7365
## Specificity : 0.7544
## Pos Pred Value : 0.7219
## Neg Pred Value : 0.7679
## Prevalence : 0.4639
## Detection Rate : 0.3417
## Detection Prevalence : 0.4734
## Balanced Accuracy : 0.7454
##
## 'Positive' Class : 0
##
After optimizing the decision tree, the model results are much improved. Accuracy rate is 74.61%. Sensitivity and specificity are well-balanced. The Kappa has increased dramatically.
library(vip)
##
## Attaching package: 'vip'
## The following object is masked from 'package:utils':
##
## vi
var_importance <- vip::vip(optimal.tree.model, num_features = 11)
print(var_importance)
Of all of the independent variables in the dataset, alcohol is by far
the most important in determining the classification of red wines.
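The same ranking can be cross-checked against the importance scores that rpart stores on the fitted tree itself (a sketch; output not shown):
#Sketch: variable importance from the rpart object, for comparison with vip
round(sort(optimal.tree.model$variable.importance, decreasing = TRUE), 1)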
Part 2: Non-Parametric (Principal Component Analysis)
A principal component analysis is completed in order to reduce the dimensionality of the dataset and remove multicollinearity. Based on the decision in the first two models to remove two independent variables from the analysis, we can see that addressing multicollinearity would be an important benefit of PCA. The retained principal components from the analysis will be used to build another logistic regression model.
library(tidyverse)
library(dplyr)
unzip("/Users/gonzalezp/Downloads/wine+quality.zip", files = "winequality-red.csv", exdir = "C:/Users/gonzalezp/Downloads")
redwine <- read.csv("/Users/gonzalezp/Downloads/winequality-red.csv", header = TRUE, sep = ";")
redwine.pca <- redwine %>% select(-quality)
#Analyze variable mean and variance to show need for scaling
redwine.mean <- apply(redwine.pca, 2, mean)
redwine.var <- apply(redwine.pca, 2, var)
redwine.stats <- data.frame(redwine.mean, redwine.var )
print(redwine.stats)
## redwine.mean redwine.var
## fixed.acidity 8.31963727 3.031416e+00
## volatile.acidity 0.52782051 3.206238e-02
## citric.acid 0.27097561 3.794748e-02
## residual.sugar 2.53880550 1.987897e+00
## chlorides 0.08746654 2.215143e-03
## free.sulfur.dioxide 15.87492183 1.094149e+02
## total.sulfur.dioxide 46.46779237 1.082102e+03
## density 0.99674668 3.562029e-06
## pH 3.31111320 2.383518e-02
## sulphates 0.65814884 2.873262e-02
## alcohol 10.42298311 1.135647e+00
As shown when preprocessing the data for the logistic regression, the independent variables are measured in very different units. To ensure that no variable has undue influence on the principal component analysis, the variables should be scaled.
scaled_redwine <- apply(redwine.pca, 2, scale)
#calculate eigenvalues and eigenvectors
redwine.cov <- cov(scaled_redwine)
redwine.eigen <- eigen(redwine.cov)
str(redwine.eigen)
## List of 2
## $ values : num [1:11] 3.099 1.926 1.551 1.213 0.959 ...
## $ vectors: num [1:11, 1:11] 0.489 -0.239 0.464 0.146 0.212 ...
## - attr(*, "class")= chr "eigen"
phi <- redwine.eigen$vectors[,1:11]
row.names(phi) <- colnames(redwine.pca)
colnames(phi) <- c("PC1", "PC2", "PC3", "PC4", "PC5", "PC6", "PC7", "PC8", "PC9", "PC10", "PC11")
print(phi)
## PC1 PC2 PC3 PC4
## fixed.acidity 0.48931422 -0.110502738 -0.12330157 -0.229617370
## volatile.acidity -0.23858436 0.274930480 -0.44996253 0.078959783
## citric.acid 0.46363166 -0.151791356 0.23824707 -0.079418256
## residual.sugar 0.14610715 0.272080238 0.10128338 -0.372792562
## chlorides 0.21224658 0.148051555 -0.09261383 0.666194756
## free.sulfur.dioxide -0.03615752 0.513566812 0.42879287 -0.043537818
## total.sulfur.dioxide 0.02357485 0.569486959 0.32241450 -0.034577115
## density 0.39535301 0.233575490 -0.33887135 -0.174499758
## pH -0.43851962 0.006710793 0.05769735 -0.003787746
## sulphates 0.24292133 -0.037553916 0.27978615 0.550872362
## alcohol -0.11323206 -0.386180959 0.47167322 -0.122181088
## PC5 PC6 PC7 PC8
## fixed.acidity 0.08261366 -0.10147858 0.35022736 0.17759545
## volatile.acidity -0.21873452 -0.41144893 0.53373510 0.07877531
## citric.acid 0.05857268 -0.06959338 -0.10549701 0.37751558
## residual.sugar -0.73214429 -0.04915555 -0.29066341 -0.29984469
## chlorides -0.24650090 -0.30433857 -0.37041337 0.35700936
## free.sulfur.dioxide 0.15915198 0.01400021 0.11659611 0.20478050
## total.sulfur.dioxide 0.22246456 -0.13630755 0.09366237 -0.01903597
## density -0.15707671 0.39115230 0.17048116 0.23922267
## pH -0.26752977 0.52211645 0.02513762 0.56139075
## sulphates -0.22596222 0.38126343 0.44746911 -0.37460432
## alcohol -0.35068141 -0.36164504 0.32765090 0.21762556
## PC9 PC10 PC11
## fixed.acidity -0.194020908 -0.24952314 0.639691452
## volatile.acidity 0.129110301 0.36592473 0.002388597
## citric.acid 0.381449669 0.62167708 -0.070910304
## residual.sugar -0.007522949 0.09287208 0.184029964
## chlorides -0.111338666 -0.21767112 0.053065322
## free.sulfur.dioxide -0.635405218 0.24848326 -0.051420865
## total.sulfur.dioxide 0.592115893 -0.37075027 0.068701598
## density -0.020718675 -0.23999012 -0.567331898
## pH 0.167745886 -0.01096960 0.340710903
## sulphates 0.058367062 0.11232046 0.069555381
## alcohol -0.037603106 -0.30301450 -0.314525906
Calculate the proportion of variance explained for each principal component
#Calculate proportion of variance explained for each principal component
PVE <- redwine.eigen$values / sum(redwine.eigen$values)
round(PVE, 2)
## [1] 0.28 0.18 0.14 0.11 0.09 0.06 0.05 0.04 0.03 0.02 0.01
Graph the PVE and cumulative PVE of the principal components in scree plots
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:randomForest':
##
## combine
## The following object is masked from 'package:dplyr':
##
## combine
PVEplot <- qplot(c(1:11), PVE) +
geom_line() +
xlab("Principal Component") +
ylab("PVE") +
ggtitle("Scree Plot") +
ylim(0, 1)
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
cumPVE <- qplot(c(1:11), cumsum(PVE)) +
geom_line() +
xlab("Principal Component") +
ylab(NULL) +
ggtitle("Cumulative Scree Plot") +
ylim(0, 1) +
geom_hline(yintercept = 0.8, color = "green", linetype = "dashed")
grid.arrange(PVEplot, cumPVE, ncol = 2)
To determine the number of principal components to retain, a horizontal line has been added to the cumulative PVE plot where the cumulative PVE equals 80%. This line intersects the cumulative PVE curve at approximately x = 5 principal components.
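This can be confirmed numerically: the first five components capture roughly 79.5% of the variance, just under the 80% threshold, which I treat as close enough (a sketch):
#Cumulative PVE captured by the first five principal components
round(cumsum(PVE)[5], 3)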
Run the PCA with the prcomp() function
pca_redwine <- prcomp(redwine.pca, scale = TRUE)
summary(pca_redwine)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.7604 1.3878 1.2452 1.1015 0.97943 0.81216 0.76406
## Proportion of Variance 0.2817 0.1751 0.1410 0.1103 0.08721 0.05996 0.05307
## Cumulative Proportion 0.2817 0.4568 0.5978 0.7081 0.79528 0.85525 0.90832
## PC8 PC9 PC10 PC11
## Standard deviation 0.65035 0.58706 0.42583 0.24405
## Proportion of Variance 0.03845 0.03133 0.01648 0.00541
## Cumulative Proportion 0.94677 0.97810 0.99459 1.00000
This summary table supports the finding in the cumulative PVE scree plot, that 5 principal components capture approximately 80% of the data’s variance.
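The two approaches are equivalent: the squared standard deviations from prcomp() are the eigenvalues of the correlation matrix computed above, so the proportions of variance explained match (a quick sanity check; output not shown):
#Sanity check: prcomp() variances equal the eigenvalues computed earlier
all.equal(pca_redwine$sdev^2, redwine.eigen$values)
round(pca_redwine$sdev^2 / sum(pca_redwine$sdev^2), 2)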
head(pca_redwine$rotation[,1:5], 11)
## PC1 PC2 PC3 PC4
## fixed.acidity 0.48931422 0.110502738 -0.12330157 0.229617370
## volatile.acidity -0.23858436 -0.274930480 -0.44996253 -0.078959783
## citric.acid 0.46363166 0.151791356 0.23824707 0.079418256
## residual.sugar 0.14610715 -0.272080238 0.10128338 0.372792562
## chlorides 0.21224658 -0.148051555 -0.09261383 -0.666194756
## free.sulfur.dioxide -0.03615752 -0.513566812 0.42879287 0.043537818
## total.sulfur.dioxide 0.02357485 -0.569486959 0.32241450 0.034577115
## density 0.39535301 -0.233575490 -0.33887135 0.174499758
## pH -0.43851962 -0.006710793 0.05769735 0.003787746
## sulphates 0.24292133 0.037553916 0.27978615 -0.550872362
## alcohol -0.11323206 0.386180959 0.47167322 0.122181088
## PC5
## fixed.acidity -0.08261366
## volatile.acidity 0.21873452
## citric.acid -0.05857268
## residual.sugar 0.73214429
## chlorides 0.24650090
## free.sulfur.dioxide -0.15915198
## total.sulfur.dioxide -0.22246456
## density 0.15707671
## pH 0.26752977
## sulphates 0.22596222
## alcohol 0.35068141
The rotation matrix returned by prcomp() contains the loadings of each variable on each principal component. PC1 has high positive loadings from fixed acidity and citric acid and a high negative loading from pH. PC2 is dominated by high negative loadings from free sulfur dioxide and total sulfur dioxide. PC3 has high negative loadings from volatile acidity and density and high positive loadings from free sulfur dioxide and alcohol. Given that these features were highlighted in the decision tree's importance analysis, it seems likely that PC3 will be a strong predictor. PC4 has high negative loadings from chlorides and sulphates, and its largest positive loading comes from residual sugar. Given that chlorides and residual sugar were not identified as important features by the decision tree, it seems possible that PC4 will not be a strong predictor. PC5 has a high positive loading from residual sugar.
Now that I have concluded from the principal component analysis that 5 principal components will be retained, I would like to run these 5 principal components through a logistic regression with the “quality” dependent variable to evaluate whether reducing dimensions in the dataset improves the accuracy of a logistic regression model.
PC_scores <- as.data.frame(pca_redwine$x[,1:5])
rw.quality <- redwine %>%
mutate(
quality = if_else(
quality > 5,
1,
0
)
) %>%
select(quality)
rw.data <- cbind(PC_scores, rw.quality)
set.seed(123)
training.samples_log <- createDataPartition(y = rw.data$quality, p = 0.8, list = FALSE)
train.data_log <- rw.data[training.samples_log,]
test.data_log <- rw.data[-training.samples_log,]
Model 5: Logistic Regression with principal components and outliers included
glm.model_log <- glm(quality ~ ., family = "binomial", data = train.data_log)
summary(glm.model_log)
##
## Call:
## glm(formula = quality ~ ., family = "binomial", data = train.data_log)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.24389 0.06686 3.648 0.000265 ***
## PC1 0.08424 0.03669 2.296 0.021684 *
## PC2 0.76204 0.05784 13.176 < 2e-16 ***
## PC3 0.65325 0.05867 11.135 < 2e-16 ***
## PC4 0.06270 0.05499 1.140 0.254165
## PC5 0.27099 0.07030 3.855 0.000116 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1765.0 on 1279 degrees of freedom
## Residual deviance: 1379.7 on 1274 degrees of freedom
## AIC: 1391.7
##
## Number of Fisher Scoring iterations: 4
A summary of the logistic regression model shows that three of the five principal components (PC2, PC3, and PC5) are strong, statistically significant predictors of red wine quality. Given that the strongest loadings in PC2 and PC3 came from free sulfur dioxide, total sulfur dioxide, alcohol, and volatile acidity, it is understandable that these components would be strong predictors, and their large positive coefficients indicate a strong association with the high quality class. Overall, four of the five principal components (all but PC4) are statistically significant at the 0.05 level. The benefit of the principal component analysis is that the number of dimensions has decreased from 11 to 5 while strong predictors have been retained.
test.pred_log <- test.data_log %>%
mutate(
prediction = predict(glm.model_log, test.data_log, type = "response"),
pred_class = round(prediction)
)
outcomes_log <- table(test.pred_log$pred_class, test.data_log$quality)
confusion_log <- confusionMatrix(outcomes_log)
print(confusion_log)
## Confusion Matrix and Statistics
##
##
## 0 1
## 0 123 55
## 1 36 105
##
## Accuracy : 0.7147
## 95% CI : (0.6618, 0.7637)
## No Information Rate : 0.5016
## P-Value [Acc > NIR] : 7.896e-15
##
## Kappa : 0.4297
##
## Mcnemar's Test P-Value : 0.05917
##
## Sensitivity : 0.7736
## Specificity : 0.6562
## Pos Pred Value : 0.6910
## Neg Pred Value : 0.7447
## Prevalence : 0.4984
## Detection Rate : 0.3856
## Detection Prevalence : 0.5580
## Balanced Accuracy : 0.7149
##
## 'Positive' Class : 0
##
The accuracy rate for model #5 is 71.47%, lower than previous models. Additionally, with the inclusion of the principal components in the model the balanced accuracy rate has decreased. There is a large difference between sensitivity and specificity, suggesting an unbalanced model and potential bias. One wonders whether this has occurred because the principal components were developed without consideration for the target variable. I also wonder if excluding outliers from the dataset would improve the accuracy rate of the logistic regression model, as it did for model #2.
I am going to build one final model that will exclude outliers from the dataset.
pca_clean <- prcomp(redwine_clean, scale = TRUE)
summary(pca_clean)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.7168 1.4804 1.2736 1.03838 0.93225 0.83057 0.79573
## Proportion of Variance 0.2679 0.1992 0.1475 0.09802 0.07901 0.06271 0.05756
## Cumulative Proportion 0.2679 0.4672 0.6146 0.71265 0.79166 0.85437 0.91193
## PC8 PC9 PC10 PC11
## Standard deviation 0.61766 0.58925 0.42104 0.2505
## Proportion of Variance 0.03468 0.03156 0.01612 0.0057
## Cumulative Proportion 0.94662 0.97818 0.99430 1.0000
PC_clean <- as.data.frame(pca_clean$x[,1:5])
quality <- norm2 %>% select(quality)
clean.log <- cbind(PC_clean, quality)
set.seed(123)
training.clean.log <- createDataPartition(y = clean.log$quality, p = 0.8, list = FALSE)
train.clean.log <- clean.log[training.clean.log,]
test.clean.log <- clean.log[-training.clean.log,]
model.clean.log <- glm(quality ~ ., family = "binomial", data = train.clean.log)
summary(model.clean.log)
##
## Call:
## glm(formula = quality ~ ., family = "binomial", data = train.clean.log)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.27638 0.07819 3.535 0.000408 ***
## PC1 0.07160 0.04382 1.634 0.102317
## PC2 0.84477 0.06465 13.066 < 2e-16 ***
## PC3 0.36333 0.06224 5.838 5.29e-09 ***
## PC4 -0.37089 0.07588 -4.888 1.02e-06 ***
## PC5 -0.10615 0.08524 -1.245 0.213016
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1318.9 on 955 degrees of freedom
## Residual deviance: 1029.2 on 950 degrees of freedom
## AIC: 1041.2
##
## Number of Fisher Scoring iterations: 4
Model 6: Logistic Regression with principal components and outliers excluded from the dataset
testpred.clean.log <- test.clean.log %>%
mutate(
prediction = predict(model.clean.log, test.clean.log, type = "response"),
pred_class = round(prediction)
)
outcomes.clean.log <- table(testpred.clean.log$pred_class, test.clean.log$quality)
confusion.clean.log <- confusionMatrix(outcomes.clean.log)
print(confusion.clean.log)
## Confusion Matrix and Statistics
##
##
## 0 1
## 0 85 39
## 1 25 89
##
## Accuracy : 0.7311
## 95% CI : (0.67, 0.7863)
## No Information Rate : 0.5378
## P-Value [Acc > NIR] : 7.142e-10
##
## Kappa : 0.4639
##
## Mcnemar's Test P-Value : 0.1042
##
## Sensitivity : 0.7727
## Specificity : 0.6953
## Pos Pred Value : 0.6855
## Neg Pred Value : 0.7807
## Prevalence : 0.4622
## Detection Rate : 0.3571
## Detection Prevalence : 0.5210
## Balanced Accuracy : 0.7340
##
## 'Positive' Class : 0
##
The accuracy rate did increase once outliers were removed, and the balanced accuracy rate increased from 71.5% to 73.4%. However, this model still does not perform as well as model #2, the logistic regression with outliers and the multicollinear independent variables removed, which has an accuracy rate of 75.21%. Model #4, the optimized decision tree, is second with an accuracy rate of 74.61%. For both that logistic regression and the pruned decision tree, the balanced accuracy was high and the gap between sensitivity and specificity was small, so both models could accurately predict high quality and low quality red wines. The logistic regression with principal components did not maintain the same level of balanced accuracy. It is possible that the retained principal components, while capturing the most variance, are dominated by features that predict high quality red wine better than low quality red wine.
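For reference, the headline metrics reported above can be collected into a single table (values transcribed from the confusion-matrix outputs earlier in this report, not re-computed here):
#Summary of model performance, transcribed from the outputs above
model_summary <- tibble::tribble(
  ~model, ~accuracy, ~balanced_accuracy, ~kappa,
  "1: Logistic regression, outliers kept", 0.7429, 0.7430, 0.486,
  "2: Logistic regression, outliers removed", 0.7521, 0.7516, 0.5023,
  "3: Decision tree, default", 0.6991, 0.7002, 0.3983,
  "4: Decision tree, pruned", 0.7461, 0.7454, 0.4902,
  "5: Logistic regression on 5 PCs, outliers kept", 0.7147, 0.7149, 0.4297,
  "6: Logistic regression on 5 PCs, outliers removed", 0.7311, 0.7340, 0.4639
)
model_summary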
Through testing and various methods of data processing, including standardizing the variables, removing outliers, addressing multicollinearity, pruning the decision tree, and reducing dimensionality with PCA, I was able to improve the performance of the models that I built.