We will be investigating what are the factors that affect wine quality. We will use PCA.
looking at the data
winequality <- read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv", sep = ";")
head(winequality)
## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 7.0 0.27 0.36 20.7 0.045
## 2 6.3 0.30 0.34 1.6 0.049
## 3 8.1 0.28 0.40 6.9 0.050
## 4 7.2 0.23 0.32 8.5 0.058
## 5 7.2 0.23 0.32 8.5 0.058
## 6 8.1 0.28 0.40 6.9 0.050
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 45 170 1.0010 3.00 0.45 8.8
## 2 14 132 0.9940 3.30 0.49 9.5
## 3 30 97 0.9951 3.26 0.44 10.1
## 4 47 186 0.9956 3.19 0.40 9.9
## 5 47 186 0.9956 3.19 0.40 9.9
## 6 30 97 0.9951 3.26 0.44 10.1
## quality
## 1 6
## 2 6
## 3 6
## 4 6
## 5 6
## 6 6
str(winequality)
## 'data.frame': 4898 obs. of 12 variables:
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
summary(winequality)
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 3.800 Min. :0.0800 Min. :0.0000 Min. : 0.600
## 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700 1st Qu.: 1.700
## Median : 6.800 Median :0.2600 Median :0.3200 Median : 5.200
## Mean : 6.855 Mean :0.2782 Mean :0.3342 Mean : 6.391
## 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900 3rd Qu.: 9.900
## Max. :14.200 Max. :1.1000 Max. :1.6600 Max. :65.800
## chlorides free.sulfur.dioxide total.sulfur.dioxide density
## Min. :0.00900 Min. : 2.00 Min. : 9.0 Min. :0.9871
## 1st Qu.:0.03600 1st Qu.: 23.00 1st Qu.:108.0 1st Qu.:0.9917
## Median :0.04300 Median : 34.00 Median :134.0 Median :0.9937
## Mean :0.04577 Mean : 35.31 Mean :138.4 Mean :0.9940
## 3rd Qu.:0.05000 3rd Qu.: 46.00 3rd Qu.:167.0 3rd Qu.:0.9961
## Max. :0.34600 Max. :289.00 Max. :440.0 Max. :1.0390
## pH sulphates alcohol quality
## Min. :2.720 Min. :0.2200 Min. : 8.00 Min. :3.000
## 1st Qu.:3.090 1st Qu.:0.4100 1st Qu.: 9.50 1st Qu.:5.000
## Median :3.180 Median :0.4700 Median :10.40 Median :6.000
## Mean :3.188 Mean :0.4898 Mean :10.51 Mean :5.878
## 3rd Qu.:3.280 3rd Qu.:0.5500 3rd Qu.:11.40 3rd Qu.:6.000
## Max. :3.820 Max. :1.0800 Max. :14.20 Max. :9.000
# Viewing the wine quality frequency
ggplot(winequality,aes(quality)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# what affects quality?
# Select the numeric columns for correlation calculation
numeric_cols <- c("fixed.acidity", "volatile.acidity", "citric.acid", "residual.sugar",
"chlorides", "free.sulfur.dioxide", "total.sulfur.dioxide",
"density", "pH", "sulphates", "alcohol", "quality")
# Create the correlation matrix
cor_matrix <- cor(winequality[numeric_cols])
# Plot the correlation matrix
library(ggcorrplot)
ggcorrplot(cor_matrix,
hc.order = TRUE,
type = "lower",
outline.color = "white",
ggtheme = ggplot2::theme_grey,
colors = c("#6D9EC1", "white", "#E46726")
)
from the correlation plot, we can observe that beverage quality is dependent on the: - alcohol content - PH - Sulphates
Negative determiners of quality: - density (also negative predictor of alcohol content) - Chlorides - Total Sulfur dioxide
Now we proceed to apply Principal Component Analysis on our dataset
# excluding labels
numeric_data <- winequality[numeric_cols]
# perform PCA
pca_result <- prcomp(numeric_data, center = TRUE, scale = TRUE)
summary(pca_result)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.8294 1.2594 1.1710 1.04157 0.98756 0.96890 0.8771
## Proportion of Variance 0.2789 0.1322 0.1143 0.09041 0.08127 0.07823 0.0641
## Cumulative Proportion 0.2789 0.4111 0.5253 0.61573 0.69701 0.77524 0.8393
## PC8 PC9 PC10 PC11 PC12
## Standard deviation 0.85082 0.74599 0.58561 0.53302 0.14307
## Proportion of Variance 0.06032 0.04638 0.02858 0.02368 0.00171
## Cumulative Proportion 0.89967 0.94604 0.97462 0.99829 1.00000
The first 3 principal components, PC1, PC2 and PC3 accounts for more than \(50\)% of the winequality data.
9 PCs are needed to account for more than \(90\)% of the variance
screeplot(pca_result,
type = 'l',
npcs = 15,
main = "Screeplot of the first 10 PCs")
abline(h = 1,
col = "red",
lty = 5)
legend("topright",
legend = c("Eigenvalue = 1"),
col = c("red"),
lty = 5,
cex = 0.6)
cumpro <- cumsum(pca_result$sdev^2/sum(pca_result$sdev^2))
plot(cumpro[0:12],
xlab = "PC #",
ylab = "Amount of explained variance",
main = "Cumulative variance plot")
abline(v = 6, col="blue", lty=5)
abline(h = 0.77524, col="blue", lty=5) # found from previous code chunk
legend("topleft", legend=c("Cut-off @ PC6"),
col=c("blue"), lty=5, cex=0.6)
\(3\) PCs can be used to explain more than \(50\)% of the variance. For visual purposes, let us use 2
# exhaustive list of PCA breakdowns
head(pca_result$x)
## PC1 PC2 PC3 PC4 PC5 PC6
## [1,] -3.5429563 0.3550511 -0.3257996 -1.7352335 0.4008275 0.8916788
## [2,] 0.6127372 -0.2893815 0.8163511 0.8486472 0.5273716 0.4664336
## [3,] -0.1423793 1.1679020 -0.1530525 0.1909000 0.3061774 0.4934076
## [4,] -1.3793842 -0.1995669 -0.3244634 -0.4087818 0.5199860 -0.7688398
## [5,] -1.3793842 -0.1995669 -0.3244634 -0.4087818 0.5199860 -0.7688398
## [6,] -0.1423793 1.1679020 -0.1530525 0.1909000 0.3061774 0.4934076
## PC7 PC8 PC9 PC10 PC11 PC12
## [1,] -0.9368043 0.06998381 0.5938390 -0.000134629 0.04482358 -0.04961423
## [2,] -0.2156768 -0.45274357 -0.3100269 -1.233863444 -0.04319005 -0.14142681
## [3,] -0.1841590 -0.56598395 -1.1131973 0.389541143 -0.82737924 0.17831140
## [4,] 0.0848360 0.16698331 -0.7344227 -0.070773454 0.36389391 0.04092464
## [5,] 0.0848360 0.16698331 -0.7344227 -0.070773454 0.36389391 0.04092464
## [6,] -0.1841590 -0.56598395 -1.1131973 0.389541143 -0.82737924 0.17831140
# plotting
plot(pca_result$x[,1],pca_result$x[,2],
xlab = "PC1 (28%)",
ylab = "PC2 (13%)",
main = "PC1 / PC2 - plot")
full plot
# setting quality as factor
winequality$quality <- as.factor(winequality$quality)
# tidying up
pca_result%>%
augment(winequality) %>%
rename_at(vars(starts_with(".fitted")),
list(~str_replace(.,".fitted","")))
## # A tibble: 4,898 × 25
## .rown…¹ fixed…² volat…³ citri…⁴ resid…⁵ chlor…⁶ free.…⁷ total…⁸ density pH
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 7 0.27 0.36 20.7 0.045 45 170 1.00 3
## 2 2 6.3 0.3 0.34 1.6 0.049 14 132 0.994 3.3
## 3 3 8.1 0.28 0.4 6.9 0.05 30 97 0.995 3.26
## 4 4 7.2 0.23 0.32 8.5 0.058 47 186 0.996 3.19
## 5 5 7.2 0.23 0.32 8.5 0.058 47 186 0.996 3.19
## 6 6 8.1 0.28 0.4 6.9 0.05 30 97 0.995 3.26
## 7 7 6.2 0.32 0.16 7 0.045 30 136 0.995 3.18
## 8 8 7 0.27 0.36 20.7 0.045 45 170 1.00 3
## 9 9 6.3 0.3 0.34 1.6 0.049 14 132 0.994 3.3
## 10 10 8.1 0.22 0.43 1.5 0.044 28 129 0.994 3.22
## # … with 4,888 more rows, 15 more variables: sulphates <dbl>, alcohol <dbl>,
## # quality <fct>, PC1 <dbl>, PC2 <dbl>, PC3 <dbl>, PC4 <dbl>, PC5 <dbl>,
## # PC6 <dbl>, PC7 <dbl>, PC8 <dbl>, PC9 <dbl>, PC10 <dbl>, PC11 <dbl>,
## # PC12 <dbl>, and abbreviated variable names ¹.rownames, ²fixed.acidity,
## # ³volatile.acidity, ⁴citric.acid, ⁵residual.sugar, ⁶chlorides,
## # ⁷free.sulfur.dioxide, ⁸total.sulfur.dioxide
# full plot
pca_result %>%
augment(winequality) %>% # inputs the PC cols into the original dataframe
rename_at(vars(starts_with(".fitted")), # renames the cols starting with ".fitted"
list(~str_replace(.,".fitted",""))) %>% # renames it to ""
ggplot(aes(x=PC1, # plot accordingly
y=PC2,
shape=quality
))+
geom_point()
## Warning: The shape palette can deal with a maximum of 6 discrete values because
## more than 6 becomes difficult to discriminate; you have 7. Consider
## specifying shapes manually if you must have them.
## Warning: Removed 5 rows containing missing values (`geom_point()`).
several observations
let us investigate further into why that is?
# the coefficients provide the formula here
pca_formula <- pca_result$rotation
pca_formula
## PC1 PC2 PC3 PC4
## fixed.acidity -0.15690447 0.56066866 -0.20738436 0.03373494
## volatile.acidity -0.02428722 0.01606694 0.52491466 -0.13119747
## citric.acid -0.13294430 0.28938115 -0.44635554 0.32953335
## residual.sugar -0.40605288 -0.03882402 -0.03384313 -0.41615630
## chlorides -0.21754400 0.03691144 0.21471269 0.50961203
## free.sulfur.dioxide -0.27471931 -0.34554881 -0.31297088 -0.14892788
## total.sulfur.dioxide -0.39044148 -0.27232605 -0.12479447 -0.02161841
## density -0.50129557 -0.01773344 0.03196758 -0.10386393
## pH 0.13003701 -0.56714503 0.06848384 0.20410995
## sulphates -0.03364168 -0.24826266 -0.22699505 0.51924489
## alcohol 0.44279498 0.01698188 -0.15887556 -0.13438871
## quality 0.22713722 -0.14603134 -0.48884718 -0.27820033
## PC5 PC6 PC7 PC8
## fixed.acidity -0.24413933 0.105856235 0.22355921 0.13041311
## volatile.acidity -0.70298193 -0.123704688 -0.22363601 -0.22960669
## citric.acid -0.06510579 -0.131958661 -0.12037133 -0.69141866
## residual.sugar 0.01610213 0.289918546 -0.33860858 -0.11329401
## chlorides 0.17829248 -0.409317266 -0.55225504 0.21139734
## free.sulfur.dioxide -0.11117214 -0.488085145 0.22407108 0.12883115
## total.sulfur.dioxide -0.27144774 -0.272493820 0.20375343 0.01290262
## density 0.07834373 0.326008106 -0.12313568 -0.08667076
## pH 0.11270171 0.192688838 0.07704001 -0.47796137
## sulphates -0.45623099 0.479811894 -0.04462167 0.33642752
## alcohol -0.30855451 -0.135443327 -0.09801169 -0.08899029
## quality -0.04112191 -0.005524396 -0.58434519 0.14444197
## PC9 PC10 PC11 PC12
## fixed.acidity -0.63145048 0.20087123 -0.10411772 0.170792295
## volatile.acidity -0.03159628 -0.14175876 -0.27002270 0.013376718
## citric.acid 0.24949503 -0.10632912 -0.05395597 0.009648802
## residual.sugar 0.17730336 0.37427490 0.17987291 0.493565139
## chlorides -0.17916182 0.23552782 0.09108849 0.025168952
## free.sulfur.dioxide 0.10184710 0.32733415 -0.49921348 -0.029475198
## total.sulfur.dioxide -0.17800832 -0.34735757 0.64355326 0.035060193
## density -0.12538636 0.04349161 -0.06686042 -0.761184485
## pH -0.52031593 0.18375599 -0.07911267 0.141842640
## sulphates 0.23662489 0.05519364 -0.04102077 0.042787387
## alcohol 0.01278298 0.57530003 0.41895440 -0.350156811
## quality -0.29970621 -0.36771605 -0.14620225 -0.016069252
This means that PC1=−0.15690447⋅fixed.acidity+0.02428722⋅volatile.acidity+0.13294430⋅citric.acid+0.40605288⋅residual.sugar+…
and so on.
We notice that PC1 is heavily correlated with alcohol content (\(0.44\)!)
PC2 is negatively correlated to the PH and correlated with the fixed.acidity
This corroborates with our findings in the correlation plot
Higher acidity and alcohol are correlated with better quality alcohol