library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.2
## ✔ ggplot2 3.5.2 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
unzip("/Users/gonzalezp/Downloads/wine+quality.zip", files = "winequality-red.csv", exdir = "C:/Users/gonzalezp/Downloads")
redwine <- read.csv("/Users/gonzalezp/Downloads/winequality-red.csv", header = TRUE, sep = ";")
sample_n(redwine, 5)
## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 6.9 0.480 0.20 1.9 0.082
## 2 8.5 0.655 0.49 6.1 0.122
## 3 8.9 0.430 0.45 1.9 0.052
## 4 10.6 0.360 0.57 2.3 0.087
## 5 7.9 0.200 0.35 1.7 0.054
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 9 23 0.99585 3.39 0.43 9.05
## 2 34 151 1.00100 3.31 1.14 9.30
## 3 6 16 0.99480 3.35 0.70 12.50
## 4 6 20 0.99676 3.14 0.72 11.10
## 5 7 15 0.99458 3.32 0.80 11.90
## quality
## 1 4
## 2 5
## 3 6
## 4 7
## 5 7
str(redwine)
## 'data.frame': 1599 obs. of 12 variables:
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
A glimpse into the “redwine” dataset shows that all of the independent variables are continuous, while the dependent variable is an integer and therefore discrete. This makes sense, given that the “quality” variable represents a ranking of wine quality on a scale of 1-10. Because the dependent variable is discrete, I will build a logistic regression model to predict the “class” of red wine (high quality or low quality).
sum(is.na(redwine) == TRUE)
## [1] 0
table(complete.cases(redwine))
##
## TRUE
## 1599
sort(sapply(redwine, function(x) sum(is.na(x))))
## fixed.acidity volatile.acidity citric.acid
## 0 0 0
## residual.sugar chlorides free.sulfur.dioxide
## 0 0 0
## total.sulfur.dioxide density pH
## 0 0 0
## sulphates alcohol quality
## 0 0 0
library(visdat)
vis_dat(redwine)
summary(redwine)
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
## Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
## Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
## Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
## chlorides free.sulfur.dioxide total.sulfur.dioxide density
## Min. :0.01200 Min. : 1.00 Min. : 6.00 Min. :0.9901
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00 1st Qu.:0.9956
## Median :0.07900 Median :14.00 Median : 38.00 Median :0.9968
## Mean :0.08747 Mean :15.87 Mean : 46.47 Mean :0.9967
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00 3rd Qu.:0.9978
## Max. :0.61100 Max. :72.00 Max. :289.00 Max. :1.0037
## pH sulphates alcohol quality
## Min. :2.740 Min. :0.3300 Min. : 8.40 Min. :3.000
## 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50 1st Qu.:5.000
## Median :3.310 Median :0.6200 Median :10.20 Median :6.000
## Mean :3.311 Mean :0.6581 Mean :10.42 Mean :5.636
## 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :4.010 Max. :2.0000 Max. :14.90 Max. :8.000
A review of the summary data reveals that the independent variables are measured in different units, and the means of some variables are orders of magnitude larger than others. For example, the mean of “total sulfur dioxide” is 46.47 compared to 0.087 for “chlorides.” The dataset will need to be standardized so that variables on larger scales are not overrepresented in the model.
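As a quick sketch of what standardization will do (using base R's scale() here only for illustration; caret's preProcess() is used for the actual modeling below), centering and scaling should leave every independent variable with mean 0 and standard deviation 1:
#Illustration only: standardize the independent variables with base scale()
scaled_check <- scale(redwine[, setdiff(names(redwine), "quality")])
round(colMeans(scaled_check), 3)
apply(scaled_check, 2, sd)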
qqnorm(redwine$quality)
boxplot(redwine$quality)
par(mfrow = c(2, 6))
#Draw a boxplot of each variable in a 2 x 6 grid
for (i in seq_along(redwine)) {
  boxplot(redwine[, i], main = names(redwine)[i])
}
Boxplot analysis of all of the variables in the red wine quality dataset
shows that most independent variables have a number of outliers, with
most variables also having a right-skewed distribution as a result.
Regarding the nature of these outliers, I don’t believe they are due to
a measurement or other error. All independent variables have at least
one outlier. Additionally, this dataset includes data points for a large
number of unique wines. Given that wine grapes come from various regions
across the globe, grow in unique soils, and have many genetic varieties,
I believe that it is understandable that outliers would exist in the
dataset. Though we cannot determine this from the dataset, it could be
that the outliers come from the same geographic location.
That said, I will test whether the presence of outliers impacts the logistic regression models. I will build one model that includes the outliers and another model that excludes them. In addition to testing which model has greater accuracy, I will also analyze the balanced accuracy of the two models to evaluate whether the presence or absence of outliers introduces bias into the model.
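As a rough check on the claim that every independent variable has at least one outlier, the 1.5 * IQR rule can be applied column by column. This is a sketch only; n_outliers is a hypothetical helper, and a similar function is defined later when outliers are actually removed:
#Sketch: count 1.5 * IQR outliers in each independent variable
n_outliers <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[2] - q[1]
  sum(x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr, na.rm = TRUE)
}
sapply(redwine[, setdiff(names(redwine), "quality")], n_outliers)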
#Encoding the "quality" dependent variable
redwine <- redwine %>%
mutate(
quality = if_else(
quality > 5,
1,
0
)
)
redwine.quality <- redwine$quality
The “quality” dependent variable is encoded to create two distinct classes: “high quality” and “low quality” red wines. Six is the median quality ranking, so it serves as the cutoff between high quality and low quality: wines with a ranking greater than five (six or above) are classified as high quality, and wines with a ranking of five or less are classified as low quality. This cutoff ensures that a nearly equal number of high quality and low quality wines are present in the dataset, so classification bias is not introduced into the models by an overrepresentation of high or low quality wines.
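A quick check of the class balance after encoding (output not shown here) should confirm that the two classes are of roughly equal size:
#Check the balance of the encoded classes (0 = low quality, 1 = high quality)
table(redwine$quality)
prop.table(table(redwine$quality))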
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
preproc1 <- preProcess(redwine, method = c("center", "scale"))
norm1 <- predict(preproc1, redwine)
head(norm1)
## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 -0.5281944 0.9615758 -1.391037 -0.45307667 -0.24363047
## 2 -0.2984541 1.9668271 -1.391037 0.04340257 0.22380518
## 3 -0.2984541 1.2966596 -1.185699 -0.16937425 0.09632273
## 4 1.6543385 -1.3840105 1.483689 -0.45307667 -0.26487754
## 5 -0.5281944 0.9615758 -1.391037 -0.45307667 -0.24363047
## 6 -0.5281944 0.7381867 -1.391037 -0.52400227 -0.26487754
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates
## 1 -0.46604672 -0.3790141 0.55809987 1.2882399 -0.57902538
## 2 0.87236532 0.6241680 0.02825193 -0.7197081 0.12891007
## 3 -0.08364328 0.2289750 0.13422152 -0.3310730 -0.04807379
## 4 0.10755844 0.4113718 0.66406945 -0.9787982 -0.46103614
## 5 -0.46604672 -0.3790141 0.55809987 1.2882399 -0.57902538
## 6 -0.27484500 -0.1966174 0.55809987 1.2882399 -0.57902538
## alcohol quality
## 1 -0.9599458 -1.0716692
## 2 -0.5845942 -1.0716692
## 3 -0.5845942 -1.0716692
## 4 -0.5845942 0.9325402
## 5 -0.9599458 -1.0716692
## 6 -0.9599458 -1.0716692
library(corrplot)
## corrplot 0.95 loaded
M <- cor(norm1)
p.mat1 <- cor.mtest(norm1)
p.mat1
## $p
## fixed.acidity volatile.acidity citric.acid
## fixed.acidity 0.000000e+00 2.276720e-25 2.535321e-210
## volatile.acidity 2.276720e-25 0.000000e+00 1.805663e-128
## citric.acid 2.535321e-210 1.805663e-128 0.000000e+00
## residual.sugar 4.199465e-06 9.389168e-01 8.083723e-09
## chlorides 1.751746e-04 1.422491e-02 1.863705e-16
## free.sulfur.dioxide 6.335579e-10 6.747011e-01 1.473916e-02
## total.sulfur.dioxide 5.709033e-06 2.213857e-03 1.555454e-01
## density 3.074747e-207 3.787554e-01 1.478795e-51
## pH 4.063034e-220 1.718994e-21 1.007201e-122
## sulphates 1.648652e-13 2.606926e-26 1.265262e-37
## alcohol 1.364868e-02 3.155190e-16 1.059462e-05
## quality 1.399766e-04 9.336621e-40 1.563717e-10
## residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity 4.199465e-06 1.751746e-04 6.335579e-10
## volatile.acidity 9.389168e-01 1.422491e-02 6.747011e-01
## citric.acid 8.083723e-09 1.863705e-16 1.473916e-02
## residual.sugar 0.000000e+00 2.617079e-02 4.684735e-14
## chlorides 2.617079e-02 0.000000e+00 8.241238e-01
## free.sulfur.dioxide 4.684735e-14 8.241238e-01 0.000000e+00
## total.sulfur.dioxide 2.449285e-16 5.809120e-02 6.404723e-207
## density 9.013042e-49 5.541458e-16 3.804985e-01
## pH 6.065915e-04 4.149217e-27 4.869975e-03
## sulphates 8.252134e-01 1.986310e-53 3.888321e-02
## alcohol 9.258425e-02 3.654950e-19 5.492314e-03
## quality 9.312092e-01 1.143103e-05 1.351417e-02
## total.sulfur.dioxide density pH
## fixed.acidity 5.709033e-06 3.074747e-207 4.063034e-220
## volatile.acidity 2.213857e-03 3.787554e-01 1.718994e-21
## citric.acid 1.555454e-01 1.478795e-51 1.007201e-122
## residual.sugar 2.449285e-16 9.013042e-49 6.065915e-04
## chlorides 5.809120e-02 5.541458e-16 4.149217e-27
## free.sulfur.dioxide 6.404723e-207 3.804985e-01 4.869975e-03
## total.sulfur.dioxide 0.000000e+00 4.354284e-03 7.818341e-03
## density 4.354284e-03 0.000000e+00 5.117102e-45
## pH 7.818341e-03 5.117102e-45 0.000000e+00
## sulphates 8.601835e-02 2.418474e-09 2.106734e-15
## alcohol 9.890520e-17 3.938835e-100 9.964498e-17
## quality 5.623796e-21 1.571846e-10 8.962367e-01
## sulphates alcohol quality
## fixed.acidity 1.648652e-13 1.364868e-02 1.399766e-04
## volatile.acidity 2.606926e-26 3.155190e-16 9.336621e-40
## citric.acid 1.265262e-37 1.059462e-05 1.563717e-10
## residual.sugar 8.252134e-01 9.258425e-02 9.312092e-01
## chlorides 1.986310e-53 3.654950e-19 1.143103e-05
## free.sulfur.dioxide 3.888321e-02 5.492314e-03 1.351417e-02
## total.sulfur.dioxide 8.601835e-02 9.890520e-17 5.623796e-21
## density 2.418474e-09 3.938835e-100 1.571846e-10
## pH 2.106734e-15 9.964498e-17 8.962367e-01
## sulphates 0.000000e+00 1.783053e-04 1.147929e-18
## alcohol 1.783053e-04 0.000000e+00 1.023980e-74
## quality 1.147929e-18 1.023980e-74 0.000000e+00
##
## $lowCI
## fixed.acidity volatile.acidity citric.acid residual.sugar
## fixed.acidity 1.00000000 -0.30136811 0.64388389 0.066127651
## volatile.acidity -0.30136811 1.00000000 -0.58565496 -0.047107687
## citric.acid 0.64388389 -0.58565496 1.00000000 0.095226254
## residual.sugar 0.06612765 -0.04710769 0.09522625 1.000000000
## chlorides 0.04489025 0.01231363 0.15636411 0.006606405
## free.sulfur.dioxide -0.20129772 -0.05949433 -0.10967144 0.139305194
## total.sulfur.dioxide -0.16130760 0.02755215 -0.01351165 0.155554927
## density 0.63998465 -0.02702409 0.32168095 0.311690761
## pH -0.70828567 0.18808228 -0.57563371 -0.134110462
## sulphates 0.13519740 -0.30609173 0.26785576 -0.043505806
## alcohol -0.11035580 -0.24884156 0.06121189 -0.006960058
## quality 0.04628813 -0.36471503 0.11097394 -0.051176170
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## fixed.acidity 0.044890252 -0.201297716 -0.161307603
## volatile.acidity 0.012313634 -0.059494333 0.027552148
## citric.acid 0.156364107 -0.109671437 -0.013511651
## residual.sugar 0.006606405 0.139305194 0.155554927
## chlorides 1.000000000 -0.043470846 -0.001624446
## free.sulfur.dioxide -0.043470846 1.000000000 0.639578570
## total.sulfur.dioxide -0.001624446 0.639578570 1.000000000
## density 0.153117128 -0.070890706 0.022326338
## pH -0.310019533 0.021430293 -0.115140383
## sulphates 0.328212685 0.002643125 -0.006087119
## alcohol -0.267264388 -0.118027906 -0.252133230
## quality -0.157668845 -0.110443529 -0.277824943
## density pH sulphates alcohol
## fixed.acidity 0.63998465 -0.70828567 0.135197400 -0.110355799
## volatile.acidity -0.02702409 0.18808228 -0.306091729 -0.248841557
## citric.acid 0.32168095 -0.57563371 0.267855761 0.061211891
## residual.sugar 0.31169076 -0.13411046 -0.043505806 -0.006960058
## chlorides 0.15311713 -0.31001953 0.328212685 -0.267264388
## free.sulfur.dioxide -0.07089071 0.02143029 0.002643125 -0.118027906
## total.sulfur.dioxide 0.02232634 -0.11514038 -0.006087119 -0.252133230
## density 1.00000000 -0.38428354 0.100214831 -0.532254720
## pH -0.38428354 1.00000000 -0.243323134 0.158206140
## sulphates 0.10021483 -0.24332313 1.000000000 0.044779062
## alcohol -0.53225472 0.15820614 0.044779062 1.000000000
## quality -0.20652030 -0.05227676 0.170877221 0.394129758
## quality
## fixed.acidity 0.04628813
## volatile.acidity -0.36471503
## citric.acid 0.11097394
## residual.sugar -0.05117617
## chlorides -0.15766885
## free.sulfur.dioxide -0.11044353
## total.sulfur.dioxide -0.27782494
## density -0.20652030
## pH -0.05227676
## sulphates 0.17087722
## alcohol 0.39412976
## quality 1.00000000
##
## $uppCI
## fixed.acidity volatile.acidity citric.acid residual.sugar
## fixed.acidity 1.00000000 -0.20974326 0.69774932 0.16288141
## volatile.acidity -0.20974326 1.00000000 -0.51749023 0.05093423
## citric.acid 0.69774932 -0.51749023 1.00000000 0.19125221
## residual.sugar 0.16288141 0.05093423 0.19125221 1.00000000
## chlorides 0.14207371 0.10998841 0.25034272 0.10434622
## free.sulfur.dioxide -0.10556896 0.03853716 -0.01199284 0.23392519
## total.sulfur.dioxide -0.06451827 0.12502248 0.08440714 0.24956518
## density 0.69433020 0.07097074 0.40669253 0.39738352
## pH -0.65591741 0.28072535 -0.50633364 -0.03678574
## sulphates 0.22996377 -0.21471255 0.35632784 0.05453349
## alcohol -0.01268548 -0.15480196 0.15807276 0.09090907
## quality 0.14344594 -0.27678106 0.20653939 0.04686565
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## fixed.acidity 0.14207371 -0.10556896 -0.06451827
## volatile.acidity 0.10998841 0.03853716 0.12502248
## citric.acid 0.25034272 -0.01199284 0.08440714
## residual.sugar 0.10434622 0.23392519 0.24956518
## chlorides 1.00000000 0.05456841 0.09619808
## free.sulfur.dioxide 0.05456841 1.00000000 0.69397398
## total.sulfur.dioxide 0.09619808 0.69397398 1.00000000
## density 0.24722198 0.02710447 0.11987182
## pH -0.21884824 0.11898813 -0.01753056
## sulphates 0.41276938 0.10042441 0.09177476
## alcohol -0.17400572 -0.02045682 -0.15822796
## quality -0.06079920 -0.01277428 -0.18504601
## density pH sulphates alcohol
## fixed.acidity 0.69433020 -0.65591741 0.22996377 -0.01268548
## volatile.acidity 0.07097074 0.28072535 -0.21471255 -0.15480196
## citric.acid 0.40669253 -0.50633364 0.35632784 0.15807276
## residual.sugar 0.39738352 -0.03678574 0.05453349 0.09090907
## chlorides 0.24722198 -0.21884824 0.41276938 -0.17400572
## free.sulfur.dioxide 0.02710447 0.11898813 0.10042441 -0.02045682
## total.sulfur.dioxide 0.11987182 -0.01753056 0.09177476 -0.15822796
## density 1.00000000 -0.29766422 0.19609995 -0.45830615
## pH -0.29766422 1.00000000 -0.14906342 0.25211227
## sulphates 0.19609995 -0.14906342 1.00000000 0.14196454
## alcohol -0.45830615 0.25211227 0.14196454 1.00000000
## quality -0.11095424 0.04576448 0.26426775 0.47367733
## quality
## fixed.acidity 0.14344594
## volatile.acidity -0.27678106
## citric.acid 0.20653939
## residual.sugar 0.04686565
## chlorides -0.06079920
## free.sulfur.dioxide -0.01277428
## total.sulfur.dioxide -0.18504601
## density -0.11095424
## pH 0.04576448
## sulphates 0.26426775
## alcohol 0.47367733
## quality 1.00000000
corrplot(M, type = "upper", order = "hclust", p.mat = p.mat1$p, sig.level = 0.05)
The correlation plot shows a high level of correlation between several variable pairs: fixed acidity and pH (negative), citric acid and volatile acidity (negative), citric acid and pH (negative), free sulfur dioxide and total sulfur dioxide (positive), density and fixed acidity (positive), and fixed acidity and citric acid (positive). In addition, two variables (fixed acidity and density) are each highly correlated with multiple other variables, so there does appear to be some multicollinearity in the dataset. As a result, I will compute variance inflation factors (VIFs) for the independent variables and remove any variable whose VIF is high enough to warrant exclusion from the model.
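For reference, the variance inflation factor for a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all of the other predictors. A minimal sketch for fixed acidity, computed on the full standardized dataset, so the value will differ slightly from the training-set VIF reported below:
#Sketch: VIF for fixed acidity by hand, excluding the dependent variable
r2_fixed <- summary(lm(fixed.acidity ~ . - quality, data = norm1))$r.squared
1 / (1 - r2_fixed)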
#Add the encoded, unscaled quality variable back into the dataframe
norm1$quality <- redwine.quality
str(norm1)
## 'data.frame': 1599 obs. of 12 variables:
## $ fixed.acidity : num -0.528 -0.298 -0.298 1.654 -0.528 ...
## $ volatile.acidity : num 0.962 1.967 1.297 -1.384 0.962 ...
## $ citric.acid : num -1.39 -1.39 -1.19 1.48 -1.39 ...
## $ residual.sugar : num -0.4531 0.0434 -0.1694 -0.4531 -0.4531 ...
## $ chlorides : num -0.2436 0.2238 0.0963 -0.2649 -0.2436 ...
## $ free.sulfur.dioxide : num -0.466 0.8724 -0.0836 0.1076 -0.466 ...
## $ total.sulfur.dioxide: num -0.379 0.624 0.229 0.411 -0.379 ...
## $ density : num 0.5581 0.0283 0.1342 0.6641 0.5581 ...
## $ pH : num 1.288 -0.72 -0.331 -0.979 1.288 ...
## $ sulphates : num -0.579 0.1289 -0.0481 -0.461 -0.579 ...
## $ alcohol : num -0.96 -0.585 -0.585 -0.585 -0.96 ...
## $ quality : num 0 0 0 1 0 0 0 1 1 0 ...
#Partition the dataset into training and testing sets.
set.seed(123)
training.samples <- createDataPartition(y = norm1$quality, p = 0.8, list = FALSE)
train.data <- norm1[training.samples,]
test.data <- norm1[-training.samples,]
#Run linear regression model to evaluate VIF of independent variables.
lm.model <- lm(quality ~ ., data = train.data)
lm.predictions <- predict(lm.model, test.data)
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:purrr':
##
## some
vif(lm.model)
## fixed.acidity volatile.acidity citric.acid
## 7.904660 1.831394 3.238379
## residual.sugar chlorides free.sulfur.dioxide
## 1.661341 1.500448 1.936827
## total.sulfur.dioxide density pH
## 2.167901 6.279581 3.425344
## sulphates alcohol
## 1.448063 3.134299
After running the linear regression model and performing a VIF analysis, there are two independent variables with variance inflation factors that I would consider to be problematic. We identified earlier that fixed acidity and density had multiple, highly correlated relationships with other independent variables. The VIFs of 7.9 and 6.3, respectively, reach into the 5-10 range and, I believe, warrant removal of the variables from the model.
lm.model.adapted <- lm(quality ~ . -fixed.acidity -density, data = train.data)
vif(lm.model.adapted)
## volatile.acidity citric.acid residual.sugar
## 1.695032 2.240593 1.093797
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## 1.390366 1.916759 2.002540
## pH sulphates alcohol
## 1.631214 1.352515 1.286963
After removing these variables from the model and running the VIF analysis again, we see that the VIFs of all independent variables have decreased. This result supports the removal of the fixed acidity and density variables.
Model 1: Logistic Regression with outliers included, multicollinear independent variables removed
glm.model <- glm(quality ~ . -fixed.acidity -density, family = "binomial", data = train.data)
summary(glm.model)
##
## Call:
## glm(formula = quality ~ . - fixed.acidity - density, family = "binomial",
## data = train.data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.24152 0.06892 3.504 0.000458 ***
## volatile.acidity -0.52317 0.09240 -5.662 1.50e-08 ***
## citric.acid -0.12475 0.09919 -1.258 0.208494
## residual.sugar -0.01285 0.07193 -0.179 0.858237
## chlorides -0.22084 0.07713 -2.863 0.004195 **
## free.sulfur.dioxide 0.21168 0.09674 2.188 0.028667 *
## total.sulfur.dioxide -0.57273 0.10143 -5.647 1.64e-08 ***
## pH -0.22528 0.08327 -2.705 0.006823 **
## sulphates 0.40884 0.08201 4.986 6.18e-07 ***
## alcohol 0.98465 0.08740 11.265 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1765.0 on 1279 degrees of freedom
## Residual deviance: 1328.1 on 1270 degrees of freedom
## AIC: 1348.1
##
## Number of Fisher Scoring iterations: 4
This logistic regression model found that the alcohol, sulphates, total sulfur dioxide, and volatile acidity variables have strong relationships with the response variable, wine quality. The independent variables with the greatest positive influence on a high quality classification are alcohol and sulphates; the variables with the greatest negative influence are total sulfur dioxide and volatile acidity. The model also achieves a substantial reduction in deviance (from a null deviance of 1765.0 to a residual deviance of 1328.1).
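Because the predictors were standardized, each coefficient is on the log-odds scale per one-standard-deviation increase in that predictor. Exponentiating gives odds ratios, which can make the comparison more intuitive (a sketch; output not shown):
#Convert log-odds coefficients to odds ratios per 1-SD increase in each predictor
round(exp(coef(glm.model)), 2)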
test.pred <- test.data %>%
mutate(
prediction = predict(glm.model, test.data, type = "response"),
pred_class = round(prediction)
)
library(yardstick)
## Warning: package 'yardstick' was built under R version 4.5.2
##
## Attaching package: 'yardstick'
## The following objects are masked from 'package:caret':
##
## precision, recall, sensitivity, specificity
## The following object is masked from 'package:readr':
##
## spec
outcomes <- table(test.pred$pred_class, test.data$quality)
conf <- conf_mat(outcomes)
autoplot(conf)
confusion <- confusionMatrix(outcomes)
print(confusion)
## Confusion Matrix and Statistics
##
##
## 0 1
## 0 122 45
## 1 37 115
##
## Accuracy : 0.7429
## 95% CI : (0.6913, 0.79)
## No Information Rate : 0.5016
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.486
##
## Mcnemar's Test P-Value : 0.4395
##
## Sensitivity : 0.7673
## Specificity : 0.7188
## Pos Pred Value : 0.7305
## Neg Pred Value : 0.7566
## Prevalence : 0.4984
## Detection Rate : 0.3824
## Detection Prevalence : 0.5235
## Balanced Accuracy : 0.7430
##
## 'Positive' Class : 0
##
The overall accuracy of this logistic regression model is 74.29%, significantly higher than the no information rate. The p-value illustrates this high statistical significance. Sensitivity of 76.73% compared to specificity at 71.88% indicates that the model is more successful when identifying low quality wines than high quality wines. That said, the balanced accuracy of 74.3% demonstrates an overall balanced rate of identifying both high and low quality wines.
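As an additional, threshold-independent check, the area under the ROC curve could be computed from the predicted probabilities. This is a sketch that assumes the pROC package is installed; it is not used elsewhere in this analysis:
#Sketch: AUC for model 1 using the pROC package (assumed installed)
library(pROC)
roc_obj <- roc(response = test.data$quality, predictor = test.pred$prediction)
auc(roc_obj)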
The 74.29% accuracy rate indicates a decent model. I am curious, though, to see whether removing outliers from the dataset improves this accuracy rate, or whether removing the outliers instead introduces bias such that there is a larger discrepancy between the sensitivity and specificity measures.
Now let’s run another logistic regression, this time with the outliers removed.
#Reload the dataset
redwine <- read.csv("/Users/gonzalezp/Downloads/winequality-red.csv", header = TRUE, sep = ";")
red_qual <- redwine %>%
mutate(
quality = if_else(
quality > 5,
1,
0
)
) %>%
select(quality)
rw.clean <- redwine %>% select(-quality)
#Remove outliers
detect_outlier <- function(x) {
Q1 <- quantile(x, probs = .25, na.rm = TRUE)
Q3 <- quantile(x, probs = .75, na.rm = TRUE)
IQR <- Q3 - Q1
x < (Q1 - 1.5 * IQR) | x > (Q3 + 1.5 * IQR)
}
numeric_cols <- sapply(rw.clean, is.numeric)
outlier_matrix <- sapply(rw.clean[, numeric_cols], detect_outlier)
rows_with_outliers <- apply(outlier_matrix, 1, any)
redwine_clean <- rw.clean[!rows_with_outliers, ]
red_qual <- red_qual[!rows_with_outliers,]
library(caret)
#Standardize the cleaned up dataset.
preproc2 <- preProcess(redwine_clean, method = c("center", "scale"))
norm2 <- predict(preproc2, redwine_clean)
norm2$quality <- red_qual
#Partition the dataset into training and testing sets.
set.seed(123)
training.samples_clean <- createDataPartition(y = norm2$quality, p = 0.8, list = FALSE)
train.data_clean <- norm2[training.samples_clean,]
test.data_clean <- norm2[-training.samples_clean,]
#Perform a linear regression and run the VIF model analysis again to determine if multicollinearity is present
lm.model_clean <- lm(quality ~ ., data = train.data_clean)
library(car)
vif(lm.model_clean)
## fixed.acidity volatile.acidity citric.acid
## 6.860257 2.049391 3.140186
## residual.sugar chlorides free.sulfur.dioxide
## 1.759491 1.302774 1.789781
## total.sulfur.dioxide density pH
## 2.039532 6.847952 3.058711
## sulphates alcohol
## 1.322633 3.626859
Fixed acidity and density again have problematic VIF scores, so they will again be removed from the model.
lm.model.adapted_clean <- lm(quality ~ . -fixed.acidity -density, data = train.data_clean)
vif(lm.model.adapted_clean)
## volatile.acidity citric.acid residual.sugar
## 1.695032 2.240593 1.093797
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## 1.390366 1.916759 2.002540
## pH sulphates alcohol
## 1.631214 1.352515 1.286963
Model 2: Logistic Regression with outliers excluded, multicollinear independent variables removed
glm.model_clean <- glm(quality ~ . -fixed.acidity -density, family = "binomial", data = train.data_clean)
summary(glm.model_clean)
##
## Call:
## glm(formula = quality ~ . - fixed.acidity - density, family = "binomial",
## data = train.data_clean)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.28053 0.08098 3.464 0.000532 ***
## volatile.acidity -0.48457 0.10932 -4.432 9.32e-06 ***
## citric.acid -0.28652 0.11722 -2.444 0.014513 *
## residual.sugar 0.01016 0.08378 0.121 0.903435
## chlorides -0.10156 0.08744 -1.161 0.245448
## free.sulfur.dioxide 0.23012 0.10657 2.159 0.030822 *
## total.sulfur.dioxide -0.44115 0.11618 -3.797 0.000146 ***
## pH -0.25925 0.09269 -2.797 0.005158 **
## sulphates 0.63865 0.09352 6.829 8.53e-12 ***
## alcohol 0.89641 0.10213 8.777 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1318.93 on 955 degrees of freedom
## Residual deviance: 982.87 on 946 degrees of freedom
## AIC: 1002.9
##
## Number of Fisher Scoring iterations: 4
In this model, 7 of the 9 independent variables included are statistically significant predictors at the 0.05 level. Alcohol and sulphates have large positive coefficients, indicating that they are strong predictors of high quality wines, with the alcohol coefficient particularly large. Volatile acidity and total sulfur dioxide have strong negative coefficients.
test.pred_clean <- test.data_clean %>%
mutate(
prediction = predict(glm.model_clean, test.data_clean, type = "response"),
pred_class = round(prediction)
)
outcomes_clean <- table(test.pred_clean$pred_class, test.data_clean$quality)
confusion_clean <- confusionMatrix(outcomes_clean)
print(confusion_clean)
## Confusion Matrix and Statistics
##
##
## 0 1
## 0 82 31
## 1 28 97
##
## Accuracy : 0.7521
## 95% CI : (0.6922, 0.8056)
## No Information Rate : 0.5378
## P-Value [Acc > NIR] : 7.474e-12
##
## Kappa : 0.5023
##
## Mcnemar's Test P-Value : 0.7946
##
## Sensitivity : 0.7455
## Specificity : 0.7578
## Pos Pred Value : 0.7257
## Neg Pred Value : 0.7760
## Prevalence : 0.4622
## Detection Rate : 0.3445
## Detection Prevalence : 0.4748
## Balanced Accuracy : 0.7516
##
## 'Positive' Class : 0
##
Removing outliers from the dataset has increased the Kappa, overall accuracy rate, and balanced accuracy rate of the logistic regression model, and the gap between the sensitivity and specificity measures has actually narrowed. Compared to the first model, this model is better at identifying high quality wines (higher specificity), while sensitivity has decreased slightly. If anything, removing outliers from the dataset has reduced bias in the model.
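To make the comparison concrete, the headline metrics can be pulled side by side from the two caret confusionMatrix objects created above (a sketch; output not shown):
#Compare model 1 (outliers kept) and model 2 (outliers removed) side by side
metrics <- c("Sensitivity", "Specificity", "Balanced Accuracy")
rbind(
  with_outliers = c(confusion$overall["Accuracy"], confusion$byClass[metrics]),
  without_outliers = c(confusion_clean$overall["Accuracy"], confusion_clean$byClass[metrics])
)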
Part 2: Non-Parametrics
library(rpart)
library(rpart.plot)
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
library(readr)
library(kableExtra)
## Warning: package 'kableExtra' was built under R version 4.5.2
##
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
##
## group_rows
library(parsnip)
library(party)
## Loading required package: grid
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
##
## Attaching package: 'modeltools'
## The following object is masked from 'package:parsnip':
##
## fit
## The following object is masked from 'package:car':
##
## Predict
## Loading required package: strucchange
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Loading required package: sandwich
##
## Attaching package: 'strucchange'
## The following object is masked from 'package:stringr':
##
## boundary
##
## Attaching package: 'party'
## The following object is masked from 'package:dplyr':
##
## where
library(tidyverse)
library(dplyr)
unzip("/Users/gonzalezp/Downloads/wine+quality.zip", files = "winequality-red.csv", exdir = "C:/Users/gonzalezp/Downloads")
redwine.tree <- read.csv("/Users/gonzalezp/Downloads/winequality-red.csv", header = TRUE, sep = ";")
redwine.tree <- redwine.tree %>%
mutate(
quality = as.factor(if_else(
quality > 5,
1,
0
))
)
Partition the training and testing data sets.
set.seed(123)
training.samples_tree <- createDataPartition(y = redwine.tree$quality, p = 0.8, list = FALSE)
train.data_tree <- redwine.tree[training.samples_tree,]
test.data_tree <- redwine.tree[-training.samples_tree,]
Model 3: Decision Tree Model
library(rpart)
model.tree <- rpart(
quality ~ .,
data = train.data_tree,
method = "class"
)
predictions <- predict(model.tree, newdata = test.data_tree, type = "class")
confusionMatrix(predictions, test.data_tree$quality)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 106 54
## 1 42 117
##
## Accuracy : 0.6991
## 95% CI : (0.6455, 0.7489)
## No Information Rate : 0.5361
## P-Value [Acc > NIR] : 2.048e-09
##
## Kappa : 0.3983
##
## Mcnemar's Test P-Value : 0.2616
##
## Sensitivity : 0.7162
## Specificity : 0.6842
## Pos Pred Value : 0.6625
## Neg Pred Value : 0.7358
## Prevalence : 0.4639
## Detection Rate : 0.3323
## Detection Prevalence : 0.5016
## Balanced Accuracy : 0.7002
##
## 'Positive' Class : 0
##
The summary of the decision tree model shows that its accuracy is 69.91%. Sensitivity is slightly higher at 71.62% and specificity slightly lower at 68.42%. The accuracy rate for this model is lower than that of the previous logistic regression models. This could be because a more linear model suits this dataset better, or because the optimal number of splits was not determined before growing the tree.
library(rpart.plot)
rpart.plot(model.tree)
To determine whether a decision tree model with an optimal number of
splits is more accurate, I will find the optimal number of splits and
prune the decision tree at this number. Then, I will run the test data
through the new model to determine accuracy rate.
Grow a new decision tree with a deliberately small complexity parameter so that it has many splits. Determine the optimal amount of pruning by identifying the CP value at which the cross-validated error (xerror) is lowest, then prune the tree back using that CP value.
optimal.tree.model <- rpart(
quality ~ .,
data = train.data_tree,
method = "class",
control = rpart.control(cp = 0.001)
)
printcp(optimal.tree.model)
##
## Classification tree:
## rpart(formula = quality ~ ., data = train.data_tree, method = "class",
## control = rpart.control(cp = 0.001))
##
## Variables actually used in tree construction:
## [1] alcohol chlorides citric.acid
## [4] density fixed.acidity free.sulfur.dioxide
## [7] pH residual.sugar sulphates
## [10] total.sulfur.dioxide volatile.acidity
##
## Root node error: 596/1280 = 0.46563
##
## n= 1280
##
## CP nsplit rel error xerror xstd
## 1 0.3741611 0 1.00000 1.00000 0.029943
## 2 0.0369128 1 0.62584 0.65101 0.027590
## 3 0.0352349 2 0.58893 0.64597 0.027529
## 4 0.0145414 3 0.55369 0.58725 0.026756
## 5 0.0083893 6 0.51007 0.57383 0.026562
## 6 0.0075503 9 0.48490 0.56544 0.026437
## 7 0.0067114 19 0.40436 0.56711 0.026463
## 8 0.0050336 20 0.39765 0.55034 0.026206
## 9 0.0041946 23 0.38255 0.56376 0.026412
## 10 0.0039150 27 0.36577 0.55369 0.026258
## 11 0.0037752 30 0.35403 0.55369 0.026258
## 12 0.0033557 34 0.33893 0.56208 0.026387
## 13 0.0027964 39 0.32215 0.56544 0.026437
## 14 0.0020973 43 0.31040 0.57886 0.026636
## 15 0.0016779 47 0.30201 0.58221 0.026684
## 16 0.0013423 48 0.30034 0.61074 0.027080
## 17 0.0010000 53 0.29362 0.61745 0.027169
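Rather than reading the table by eye, the CP with the lowest cross-validated error can be extracted programmatically; with this table it lands at roughly 0.005, the value used in the pruning step below (a sketch):
#Sketch: pick the CP value with the lowest cross-validated error (xerror)
cp_table <- optimal.tree.model$cptable
best_cp <- cp_table[which.min(cp_table[, "xerror"]), "CP"]
best_cp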
optimal.tree <- rpart::prune(optimal.tree.model, cp = 0.005)
rpart.plot(optimal.tree)
Model 4: Optimized Decision Tree Model
pred.opt <- predict(optimal.tree, newdata = test.data_tree, type = "class")
confusionMatrix(pred.opt, test.data_tree$quality)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 109 42
## 1 39 129
##
## Accuracy : 0.7461
## 95% CI : (0.6946, 0.7929)
## No Information Rate : 0.5361
## P-Value [Acc > NIR] : 9.004e-15
##
## Kappa : 0.4902
##
## Mcnemar's Test P-Value : 0.8241
##
## Sensitivity : 0.7365
## Specificity : 0.7544
## Pos Pred Value : 0.7219
## Neg Pred Value : 0.7679
## Prevalence : 0.4639
## Detection Rate : 0.3417
## Detection Prevalence : 0.4734
## Balanced Accuracy : 0.7454
##
## 'Positive' Class : 0
##
After optimizing the decision tree, the model results are much improved. Accuracy rate is 74.61%. Sensitivity and specificity are well-balanced. The Kappa has increased dramatically.
library(vip)
##
## Attaching package: 'vip'
## The following object is masked from 'package:utils':
##
## vi
var_importance <- vip::vip(optimal.tree.model, num_features = 11)
print(var_importance)
Of all of the independent variables in the dataset, alcohol is by far
the most important in determining the classification of red wines.
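The same ranking can be cross-checked against the importance scores that rpart stores on the fitted tree itself (a sketch; output not shown):
#Sketch: variable importance from the rpart object, for comparison with vip
round(sort(optimal.tree.model$variable.importance, decreasing = TRUE), 1)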
Part 2: Non-Parametric (Principal Component Analysis)
A principal component analysis is completed in order to reduce the dimensionality of the dataset and remove multicollinearity. Based on the decision in the first two models to remove two independent variables from the analysis, we can see that addressing multicollinearity would be an important benefit of PCA. The retained principal components from the analysis will be used to build another logistic regression model.
library(tidyverse)
library(dplyr)
unzip("/Users/gonzalezp/Downloads/wine+quality.zip", files = "winequality-red.csv", exdir = "C:/Users/gonzalezp/Downloads")
redwine <- read.csv("/Users/gonzalezp/Downloads/winequality-red.csv", header = TRUE, sep = ";")
redwine.pca <- redwine %>% select(-quality)
#Analyze variable mean and variance to show need for scaling
redwine.mean <- apply(redwine.pca, 2, mean)
redwine.var <- apply(redwine.pca, 2, var)
redwine.stats <- data.frame(redwine.mean, redwine.var )
print(redwine.stats)
## redwine.mean redwine.var
## fixed.acidity 8.31963727 3.031416e+00
## volatile.acidity 0.52782051 3.206238e-02
## citric.acid 0.27097561 3.794748e-02
## residual.sugar 2.53880550 1.987897e+00
## chlorides 0.08746654 2.215143e-03
## free.sulfur.dioxide 15.87492183 1.094149e+02
## total.sulfur.dioxide 46.46779237 1.082102e+03
## density 0.99674668 3.562029e-06
## pH 3.31111320 2.383518e-02
## sulphates 0.65814884 2.873262e-02
## alcohol 10.42298311 1.135647e+00
As shown when preprocessing the data for the logistic regression, the independent variables are measured in very different units. To ensure that no variable has undue influence on the principal component analysis, the variables should be scaled.
scaled_redwine <- apply(redwine.pca, 2, scale)
#calculate eigenvalues and eigenvectors
redwine.cov <- cov(scaled_redwine)
redwine.eigen <- eigen(redwine.cov)
str(redwine.eigen)
## List of 2
## $ values : num [1:11] 3.099 1.926 1.551 1.213 0.959 ...
## $ vectors: num [1:11, 1:11] 0.489 -0.239 0.464 0.146 0.212 ...
## - attr(*, "class")= chr "eigen"
phi <- redwine.eigen$vectors[,1:11]
row.names(phi) <- colnames(redwine.pca)
colnames(phi) <- c("PC1", "PC2", "PC3", "PC4", "PC5", "PC6", "PC7", "PC8", "PC9", "PC10", "PC11")
print(phi)
## PC1 PC2 PC3 PC4
## fixed.acidity 0.48931422 -0.110502738 -0.12330157 -0.229617370
## volatile.acidity -0.23858436 0.274930480 -0.44996253 0.078959783
## citric.acid 0.46363166 -0.151791356 0.23824707 -0.079418256
## residual.sugar 0.14610715 0.272080238 0.10128338 -0.372792562
## chlorides 0.21224658 0.148051555 -0.09261383 0.666194756
## free.sulfur.dioxide -0.03615752 0.513566812 0.42879287 -0.043537818
## total.sulfur.dioxide 0.02357485 0.569486959 0.32241450 -0.034577115
## density 0.39535301 0.233575490 -0.33887135 -0.174499758
## pH -0.43851962 0.006710793 0.05769735 -0.003787746
## sulphates 0.24292133 -0.037553916 0.27978615 0.550872362
## alcohol -0.11323206 -0.386180959 0.47167322 -0.122181088
## PC5 PC6 PC7 PC8
## fixed.acidity 0.08261366 -0.10147858 0.35022736 0.17759545
## volatile.acidity -0.21873452 -0.41144893 0.53373510 0.07877531
## citric.acid 0.05857268 -0.06959338 -0.10549701 0.37751558
## residual.sugar -0.73214429 -0.04915555 -0.29066341 -0.29984469
## chlorides -0.24650090 -0.30433857 -0.37041337 0.35700936
## free.sulfur.dioxide 0.15915198 0.01400021 0.11659611 0.20478050
## total.sulfur.dioxide 0.22246456 -0.13630755 0.09366237 -0.01903597
## density -0.15707671 0.39115230 0.17048116 0.23922267
## pH -0.26752977 0.52211645 0.02513762 0.56139075
## sulphates -0.22596222 0.38126343 0.44746911 -0.37460432
## alcohol -0.35068141 -0.36164504 0.32765090 0.21762556
## PC9 PC10 PC11
## fixed.acidity -0.194020908 -0.24952314 0.639691452
## volatile.acidity 0.129110301 0.36592473 0.002388597
## citric.acid 0.381449669 0.62167708 -0.070910304
## residual.sugar -0.007522949 0.09287208 0.184029964
## chlorides -0.111338666 -0.21767112 0.053065322
## free.sulfur.dioxide -0.635405218 0.24848326 -0.051420865
## total.sulfur.dioxide 0.592115893 -0.37075027 0.068701598
## density -0.020718675 -0.23999012 -0.567331898
## pH 0.167745886 -0.01096960 0.340710903
## sulphates 0.058367062 0.11232046 0.069555381
## alcohol -0.037603106 -0.30301450 -0.314525906
Calculate the proportion of variance explained for each principal component
#Calculate proportion of variance explained for each principal component
PVE <- redwine.eigen$values / sum(redwine.eigen$values)
round(PVE, 2)
## [1] 0.28 0.18 0.14 0.11 0.09 0.06 0.05 0.04 0.03 0.02 0.01
Graph the PVE and cumulative PVE of the principal components in scree plots
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:randomForest':
##
## combine
## The following object is masked from 'package:dplyr':
##
## combine
PVEplot <- qplot(c(1:11), PVE) +
geom_line() +
xlab("Principal Component") +
ylab("PVE") +
ggtitle("Scree Plot") +
ylim(0, 1)
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
cumPVE <- qplot(c(1:11), cumsum(PVE)) +
geom_line() +
xlab("Principal Component") +
ylab(NULL) +
ggtitle("Cumulative Scree Plot") +
ylim(0, 1) +
geom_hline(yintercept = 0.8, color = "green", linetype = "dashed")
grid.arrange(PVEplot, cumPVE, ncol = 2)
To determine the number of principal components to retain, a horizontal line has been added to the cumulative PVE plot where the cumulative PVE equals 80%. This line intersects the cumulative PVE curve at approximately x = 5 principal components.
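This can be confirmed numerically: the first five components capture roughly 79.5% of the variance, just under the 80% threshold, which I treat as close enough (a sketch):
#Cumulative PVE captured by the first five principal components
round(cumsum(PVE)[5], 3)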
Run the PCA with the prcomp() function
pca_redwine <- prcomp(redwine.pca, scale = TRUE)
summary(pca_redwine)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.7604 1.3878 1.2452 1.1015 0.97943 0.81216 0.76406
## Proportion of Variance 0.2817 0.1751 0.1410 0.1103 0.08721 0.05996 0.05307
## Cumulative Proportion 0.2817 0.4568 0.5978 0.7081 0.79528 0.85525 0.90832
## PC8 PC9 PC10 PC11
## Standard deviation 0.65035 0.58706 0.42583 0.24405
## Proportion of Variance 0.03845 0.03133 0.01648 0.00541
## Cumulative Proportion 0.94677 0.97810 0.99459 1.00000
This summary table supports the finding in the cumulative PVE scree plot, that 5 principal components capture approximately 80% of the data’s variance.
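The two approaches are equivalent: the squared standard deviations from prcomp() are the eigenvalues of the correlation matrix computed above, so the proportions of variance explained match (a quick sanity check; output not shown):
#Sanity check: prcomp() variances equal the eigenvalues computed earlier
all.equal(pca_redwine$sdev^2, redwine.eigen$values)
round(pca_redwine$sdev^2 / sum(pca_redwine$sdev^2), 2)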
head(pca_redwine$rotation[,1:5], 11)
## PC1 PC2 PC3 PC4
## fixed.acidity 0.48931422 0.110502738 -0.12330157 0.229617370
## volatile.acidity -0.23858436 -0.274930480 -0.44996253 -0.078959783
## citric.acid 0.46363166 0.151791356 0.23824707 0.079418256
## residual.sugar 0.14610715 -0.272080238 0.10128338 0.372792562
## chlorides 0.21224658 -0.148051555 -0.09261383 -0.666194756
## free.sulfur.dioxide -0.03615752 -0.513566812 0.42879287 0.043537818
## total.sulfur.dioxide 0.02357485 -0.569486959 0.32241450 0.034577115
## density 0.39535301 -0.233575490 -0.33887135 0.174499758
## pH -0.43851962 -0.006710793 0.05769735 0.003787746
## sulphates 0.24292133 0.037553916 0.27978615 -0.550872362
## alcohol -0.11323206 0.386180959 0.47167322 0.122181088
## PC5
## fixed.acidity -0.08261366
## volatile.acidity 0.21873452
## citric.acid -0.05857268
## residual.sugar 0.73214429
## chlorides 0.24650090
## free.sulfur.dioxide -0.15915198
## total.sulfur.dioxide -0.22246456
## density 0.15707671
## pH 0.26752977
## sulphates 0.22596222
## alcohol 0.35068141
The rotation matrix returned by prcomp() contains the loadings of each variable on each principal component. PC1 has high positive loadings from fixed acidity and citric acid and a high negative loading from pH. PC2 is dominated by high negative loadings from free sulfur dioxide and total sulfur dioxide. PC3 has high negative loadings from volatile acidity and density and high positive loadings from free sulfur dioxide and alcohol. Given that these features were highlighted in the decision tree's importance analysis, it seems likely that PC3 will be a strong predictor. PC4 has high negative loadings from chlorides and sulphates, and its largest positive loading comes from residual sugar. Given that chlorides and residual sugar were not identified as important features by the decision tree, it seems possible that PC4 will not be a strong predictor. PC5 has a high positive loading from residual sugar.
Now that I have concluded from the principal component analysis that 5 principal components will be retained, I would like to run these 5 principal components through a logistic regression with the “quality” dependent variable to evaluate whether reducing dimensions in the dataset improves the accuracy of a logistic regression model.
PC_scores <- as.data.frame(pca_redwine$x[,1:5])
rw.quality <- redwine %>%
mutate(
quality = if_else(
quality > 5,
1,
0
)
) %>%
select(quality)
rw.data <- cbind(PC_scores, rw.quality)
set.seed(123)
training.samples_log <- createDataPartition(y = rw.data$quality, p = 0.8, list = FALSE)
train.data_log <- rw.data[training.samples_log,]
test.data_log <- rw.data[-training.samples_log,]
Model 5: Logistic Regression with principal components and outliers included
glm.model_log <- glm(quality ~ ., family = "binomial", data = train.data_log)
summary(glm.model_log)
##
## Call:
## glm(formula = quality ~ ., family = "binomial", data = train.data_log)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.24389 0.06686 3.648 0.000265 ***
## PC1 0.08424 0.03669 2.296 0.021684 *
## PC2 0.76204 0.05784 13.176 < 2e-16 ***
## PC3 0.65325 0.05867 11.135 < 2e-16 ***
## PC4 0.06270 0.05499 1.140 0.254165
## PC5 0.27099 0.07030 3.855 0.000116 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1765.0 on 1279 degrees of freedom
## Residual deviance: 1379.7 on 1274 degrees of freedom
## AIC: 1391.7
##
## Number of Fisher Scoring iterations: 4
A summary of the logistic regression model shows that three of the five principal components (PC2, PC3, and PC5) are strong, statistically significant predictors of red wine quality. Given that the strongest loadings in PC2 and PC3 came from free sulfur dioxide, total sulfur dioxide, alcohol, and volatile acidity, it is understandable that these components would be strong predictors, and their large positive coefficients indicate a strong association with the high quality class. Overall, four of the five principal components (all but PC4) are statistically significant at the 0.05 level. The benefit of the principal component analysis is that the number of dimensions has decreased from 11 to 5 while strong predictors have been retained.
test.pred_log <- test.data_log %>%
mutate(
prediction = predict(glm.model_log, test.data_log, type = "response"),
pred_class = round(prediction)
)
outcomes_log <- table(test.pred_log$pred_class, test.data_log$quality)
confusion_log <- confusionMatrix(outcomes_log)
print(confusion_log)
## Confusion Matrix and Statistics
##
##
## 0 1
## 0 123 55
## 1 36 105
##
## Accuracy : 0.7147
## 95% CI : (0.6618, 0.7637)
## No Information Rate : 0.5016
## P-Value [Acc > NIR] : 7.896e-15
##
## Kappa : 0.4297
##
## Mcnemar's Test P-Value : 0.05917
##
## Sensitivity : 0.7736
## Specificity : 0.6562
## Pos Pred Value : 0.6910
## Neg Pred Value : 0.7447
## Prevalence : 0.4984
## Detection Rate : 0.3856
## Detection Prevalence : 0.5580
## Balanced Accuracy : 0.7149
##
## 'Positive' Class : 0
##
The accuracy rate for model #5 is 71.47%, lower than previous models. Additionally, with the inclusion of the principal components in the model the balanced accuracy rate has decreased. There is a large difference between sensitivity and specificity, suggesting an unbalanced model and potential bias. One wonders whether this has occurred because the principal components were developed without consideration for the target variable. I also wonder if excluding outliers from the dataset would improve the accuracy rate of the logistic regression model, as it did for model #2.
I am going to build one final model that will exclude outliers from the dataset.
pca_clean <- prcomp(redwine_clean, scale = TRUE)
summary(pca_clean)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.7168 1.4804 1.2736 1.03838 0.93225 0.83057 0.79573
## Proportion of Variance 0.2679 0.1992 0.1475 0.09802 0.07901 0.06271 0.05756
## Cumulative Proportion 0.2679 0.4672 0.6146 0.71265 0.79166 0.85437 0.91193
## PC8 PC9 PC10 PC11
## Standard deviation 0.61766 0.58925 0.42104 0.2505
## Proportion of Variance 0.03468 0.03156 0.01612 0.0057
## Cumulative Proportion 0.94662 0.97818 0.99430 1.0000
PC_clean <- as.data.frame(pca_clean$x[,1:5])
quality <- norm2 %>% select(quality)
clean.log <- cbind(PC_clean, quality)
set.seed(123)
training.clean.log <- createDataPartition(y = clean.log$quality, p = 0.8, list = FALSE)
train.clean.log <- clean.log[training.clean.log,]
test.clean.log <- clean.log[-training.clean.log,]
model.clean.log <- glm(quality ~ ., family = "binomial", data = train.clean.log)
summary(model.clean.log)
##
## Call:
## glm(formula = quality ~ ., family = "binomial", data = train.clean.log)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.27638 0.07819 3.535 0.000408 ***
## PC1 0.07160 0.04382 1.634 0.102317
## PC2 0.84477 0.06465 13.066 < 2e-16 ***
## PC3 0.36333 0.06224 5.838 5.29e-09 ***
## PC4 -0.37089 0.07588 -4.888 1.02e-06 ***
## PC5 -0.10615 0.08524 -1.245 0.213016
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1318.9 on 955 degrees of freedom
## Residual deviance: 1029.2 on 950 degrees of freedom
## AIC: 1041.2
##
## Number of Fisher Scoring iterations: 4
Model 6: Logistic Regression with principal components and outliers excluded from the dataset
testpred.clean.log <- test.clean.log %>%
mutate(
prediction = predict(model.clean.log, test.clean.log, type = "response"),
pred_class = round(prediction)
)
outcomes.clean.log <- table(testpred.clean.log$pred_class, test.clean.log$quality)
confusion.clean.log <- confusionMatrix(outcomes.clean.log)
print(confusion.clean.log)
## Confusion Matrix and Statistics
##
##
## 0 1
## 0 85 39
## 1 25 89
##
## Accuracy : 0.7311
## 95% CI : (0.67, 0.7863)
## No Information Rate : 0.5378
## P-Value [Acc > NIR] : 7.142e-10
##
## Kappa : 0.4639
##
## Mcnemar's Test P-Value : 0.1042
##
## Sensitivity : 0.7727
## Specificity : 0.6953
## Pos Pred Value : 0.6855
## Neg Pred Value : 0.7807
## Prevalence : 0.4622
## Detection Rate : 0.3571
## Detection Prevalence : 0.5210
## Balanced Accuracy : 0.7340
##
## 'Positive' Class : 0
##
The accuracy rate did increase once outliers were removed, and the balanced accuracy rate increased from 71.5% to 73.4%. However, this model still does not perform as well as model #2, the logistic regression with outliers and the multicollinear independent variables removed, which has an accuracy rate of 75.21%. Model #4, the optimized decision tree, is second with an accuracy rate of 74.61%. For both that logistic regression and the pruned decision tree, the balanced accuracy was high and the gap between sensitivity and specificity was small, so both models could accurately predict high quality and low quality red wines. The logistic regression with principal components did not maintain the same level of balanced accuracy. It is possible that the retained principal components, while capturing the most variance, are dominated by features that predict high quality red wine better than low quality red wine.
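For reference, the headline metrics reported above can be collected into a single table (values transcribed from the confusion-matrix outputs earlier in this report, not re-computed here):
#Summary of model performance, transcribed from the outputs above
model_summary <- tibble::tribble(
  ~model, ~accuracy, ~balanced_accuracy, ~kappa,
  "1: Logistic regression, outliers kept", 0.7429, 0.7430, 0.486,
  "2: Logistic regression, outliers removed", 0.7521, 0.7516, 0.5023,
  "3: Decision tree, default", 0.6991, 0.7002, 0.3983,
  "4: Decision tree, pruned", 0.7461, 0.7454, 0.4902,
  "5: Logistic regression on 5 PCs, outliers kept", 0.7147, 0.7149, 0.4297,
  "6: Logistic regression on 5 PCs, outliers removed", 0.7311, 0.7340, 0.4639
)
model_summary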
Through testing and various methods of data processing, including standardizing the variables, removing outliers, addressing multicollinearity, pruning the decision tree, and reducing dimensionality with PCA, I was able to improve the performance of the models that I built.