Wine data analysis

red wine

We will be investigating what are the factors that affect wine quality. We will use PCA.

looking at the data

winequality <- read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv", sep = ";")

head(winequality)

##   fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1           7.0             0.27        0.36           20.7     0.045
## 2           6.3             0.30        0.34            1.6     0.049
## 3           8.1             0.28        0.40            6.9     0.050
## 4           7.2             0.23        0.32            8.5     0.058
## 5           7.2             0.23        0.32            8.5     0.058
## 6           8.1             0.28        0.40            6.9     0.050
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  45                  170  1.0010 3.00      0.45     8.8
## 2                  14                  132  0.9940 3.30      0.49     9.5
## 3                  30                   97  0.9951 3.26      0.44    10.1
## 4                  47                  186  0.9956 3.19      0.40     9.9
## 5                  47                  186  0.9956 3.19      0.40     9.9
## 6                  30                   97  0.9951 3.26      0.44    10.1
##   quality
## 1       6
## 2       6
## 3       6
## 4       6
## 5       6
## 6       6

str(winequality)

## 'data.frame':    4898 obs. of  12 variables:
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...

summary(winequality)

##  fixed.acidity    volatile.acidity  citric.acid     residual.sugar  
##  Min.   : 3.800   Min.   :0.0800   Min.   :0.0000   Min.   : 0.600  
##  1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700   1st Qu.: 1.700  
##  Median : 6.800   Median :0.2600   Median :0.3200   Median : 5.200  
##  Mean   : 6.855   Mean   :0.2782   Mean   :0.3342   Mean   : 6.391  
##  3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900   3rd Qu.: 9.900  
##  Max.   :14.200   Max.   :1.1000   Max.   :1.6600   Max.   :65.800  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide    density      
##  Min.   :0.00900   Min.   :  2.00      Min.   :  9.0        Min.   :0.9871  
##  1st Qu.:0.03600   1st Qu.: 23.00      1st Qu.:108.0        1st Qu.:0.9917  
##  Median :0.04300   Median : 34.00      Median :134.0        Median :0.9937  
##  Mean   :0.04577   Mean   : 35.31      Mean   :138.4        Mean   :0.9940  
##  3rd Qu.:0.05000   3rd Qu.: 46.00      3rd Qu.:167.0        3rd Qu.:0.9961  
##  Max.   :0.34600   Max.   :289.00      Max.   :440.0        Max.   :1.0390  
##        pH          sulphates         alcohol         quality     
##  Min.   :2.720   Min.   :0.2200   Min.   : 8.00   Min.   :3.000  
##  1st Qu.:3.090   1st Qu.:0.4100   1st Qu.: 9.50   1st Qu.:5.000  
##  Median :3.180   Median :0.4700   Median :10.40   Median :6.000  
##  Mean   :3.188   Mean   :0.4898   Mean   :10.51   Mean   :5.878  
##  3rd Qu.:3.280   3rd Qu.:0.5500   3rd Qu.:11.40   3rd Qu.:6.000  
##  Max.   :3.820   Max.   :1.0800   Max.   :14.20   Max.   :9.000

# Viewing the wine quality frequency
ggplot(winequality,aes(quality)) +
  geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# what affects quality?
# Select the numeric columns for correlation calculation
numeric_cols <- c("fixed.acidity", "volatile.acidity", "citric.acid", "residual.sugar",
                  "chlorides", "free.sulfur.dioxide", "total.sulfur.dioxide",
                  "density", "pH", "sulphates", "alcohol", "quality")

# Create the correlation matrix
cor_matrix <- cor(winequality[numeric_cols])

# Plot the correlation matrix
library(ggcorrplot)
ggcorrplot(cor_matrix,
           hc.order = TRUE,
           type = "lower",
           outline.color = "white",
           ggtheme = ggplot2::theme_grey,
           colors = c("#6D9EC1", "white", "#E46726")
)

from the correlation plot, we can observe that beverage quality is dependent on the: - alcohol content - PH - Sulphates

Negative determiners of quality: - density (also negative predictor of alcohol content) - Chlorides - Total Sulfur dioxide

Now we proceed to apply Principal Component Analysis on our dataset

# excluding labels
numeric_data <- winequality[numeric_cols]

# perform PCA
pca_result <- prcomp(numeric_data, center = TRUE, scale = TRUE)
summary(pca_result)

## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5     PC6    PC7
## Standard deviation     1.8294 1.2594 1.1710 1.04157 0.98756 0.96890 0.8771
## Proportion of Variance 0.2789 0.1322 0.1143 0.09041 0.08127 0.07823 0.0641
## Cumulative Proportion  0.2789 0.4111 0.5253 0.61573 0.69701 0.77524 0.8393
##                            PC8     PC9    PC10    PC11    PC12
## Standard deviation     0.85082 0.74599 0.58561 0.53302 0.14307
## Proportion of Variance 0.06032 0.04638 0.02858 0.02368 0.00171
## Cumulative Proportion  0.89967 0.94604 0.97462 0.99829 1.00000

The first 3 principal components, PC1, PC2 and PC3 accounts for more than \(50\)% of the winequality data.

9 PCs are needed to account for more than \(90\)% of the variance

screeplot(pca_result,
          type = 'l',
          npcs = 15,
          main = "Screeplot of the first 10 PCs")

abline(h = 1,
       col = "red",
       lty = 5)
legend("topright",
       legend = c("Eigenvalue = 1"),
       col = c("red"),
       lty = 5,
       cex = 0.6)

cumpro <- cumsum(pca_result$sdev^2/sum(pca_result$sdev^2))
plot(cumpro[0:12],
     xlab = "PC #",
     ylab = "Amount of explained variance",
     main = "Cumulative variance plot")
abline(v = 6, col="blue", lty=5)
abline(h = 0.77524, col="blue", lty=5) # found from previous code chunk
legend("topleft", legend=c("Cut-off @ PC6"),
       col=c("blue"), lty=5, cex=0.6)

\(3\) PCs can be used to explain more than \(50\)% of the variance. For visual purposes, let us use 2

# exhaustive list of PCA breakdowns
head(pca_result$x)

##             PC1        PC2        PC3        PC4       PC5        PC6
## [1,] -3.5429563  0.3550511 -0.3257996 -1.7352335 0.4008275  0.8916788
## [2,]  0.6127372 -0.2893815  0.8163511  0.8486472 0.5273716  0.4664336
## [3,] -0.1423793  1.1679020 -0.1530525  0.1909000 0.3061774  0.4934076
## [4,] -1.3793842 -0.1995669 -0.3244634 -0.4087818 0.5199860 -0.7688398
## [5,] -1.3793842 -0.1995669 -0.3244634 -0.4087818 0.5199860 -0.7688398
## [6,] -0.1423793  1.1679020 -0.1530525  0.1909000 0.3061774  0.4934076
##             PC7         PC8        PC9         PC10        PC11        PC12
## [1,] -0.9368043  0.06998381  0.5938390 -0.000134629  0.04482358 -0.04961423
## [2,] -0.2156768 -0.45274357 -0.3100269 -1.233863444 -0.04319005 -0.14142681
## [3,] -0.1841590 -0.56598395 -1.1131973  0.389541143 -0.82737924  0.17831140
## [4,]  0.0848360  0.16698331 -0.7344227 -0.070773454  0.36389391  0.04092464
## [5,]  0.0848360  0.16698331 -0.7344227 -0.070773454  0.36389391  0.04092464
## [6,] -0.1841590 -0.56598395 -1.1131973  0.389541143 -0.82737924  0.17831140

# plotting
plot(pca_result$x[,1],pca_result$x[,2],
     xlab = "PC1 (28%)",
     ylab = "PC2 (13%)",
     main = "PC1 / PC2 - plot")

full plot

# setting quality as factor
winequality$quality <- as.factor(winequality$quality)

# tidying up
pca_result%>%
  augment(winequality) %>%
  rename_at(vars(starts_with(".fitted")),
            list(~str_replace(.,".fitted","")))

## # A tibble: 4,898 × 25
##    .rown…¹ fixed…² volat…³ citri…⁴ resid…⁵ chlor…⁶ free.…⁷ total…⁸ density    pH
##    <chr>     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl> <dbl>
##  1 1           7      0.27    0.36    20.7   0.045      45     170   1.00   3   
##  2 2           6.3    0.3     0.34     1.6   0.049      14     132   0.994  3.3 
##  3 3           8.1    0.28    0.4      6.9   0.05       30      97   0.995  3.26
##  4 4           7.2    0.23    0.32     8.5   0.058      47     186   0.996  3.19
##  5 5           7.2    0.23    0.32     8.5   0.058      47     186   0.996  3.19
##  6 6           8.1    0.28    0.4      6.9   0.05       30      97   0.995  3.26
##  7 7           6.2    0.32    0.16     7     0.045      30     136   0.995  3.18
##  8 8           7      0.27    0.36    20.7   0.045      45     170   1.00   3   
##  9 9           6.3    0.3     0.34     1.6   0.049      14     132   0.994  3.3 
## 10 10          8.1    0.22    0.43     1.5   0.044      28     129   0.994  3.22
## # … with 4,888 more rows, 15 more variables: sulphates <dbl>, alcohol <dbl>,
## #   quality <fct>, PC1 <dbl>, PC2 <dbl>, PC3 <dbl>, PC4 <dbl>, PC5 <dbl>,
## #   PC6 <dbl>, PC7 <dbl>, PC8 <dbl>, PC9 <dbl>, PC10 <dbl>, PC11 <dbl>,
## #   PC12 <dbl>, and abbreviated variable names ¹.rownames, ²fixed.acidity,
## #   ³volatile.acidity, ⁴citric.acid, ⁵residual.sugar, ⁶chlorides,
## #   ⁷free.sulfur.dioxide, ⁸total.sulfur.dioxide

# full plot
pca_result %>%
  augment(winequality) %>% # inputs the PC cols into the original dataframe
  rename_at(vars(starts_with(".fitted")), # renames the cols starting with ".fitted"
            list(~str_replace(.,".fitted",""))) %>% # renames it to ""
  ggplot(aes(x=PC1, # plot accordingly
             y=PC2,
             shape=quality
             ))+
  geom_point()

## Warning: The shape palette can deal with a maximum of 6 discrete values because
## more than 6 becomes difficult to discriminate; you have 7. Consider
## specifying shapes manually if you must have them.

## Warning: Removed 5 rows containing missing values (`geom_point()`).

several observations

for the higher quality beverage (i.e. \(8\) or \(9\)), they are high on PC1 but lower on PC2 (although some beverage can be seen there which is quality \(9\))
for the lower quality ones (i.e. \(3\) or \(4\)), they are more dispersed but consistently have low PC1

let us investigate further into why that is?

# the coefficients provide the formula here
pca_formula <- pca_result$rotation
pca_formula

##                              PC1         PC2         PC3         PC4
## fixed.acidity        -0.15690447  0.56066866 -0.20738436  0.03373494
## volatile.acidity     -0.02428722  0.01606694  0.52491466 -0.13119747
## citric.acid          -0.13294430  0.28938115 -0.44635554  0.32953335
## residual.sugar       -0.40605288 -0.03882402 -0.03384313 -0.41615630
## chlorides            -0.21754400  0.03691144  0.21471269  0.50961203
## free.sulfur.dioxide  -0.27471931 -0.34554881 -0.31297088 -0.14892788
## total.sulfur.dioxide -0.39044148 -0.27232605 -0.12479447 -0.02161841
## density              -0.50129557 -0.01773344  0.03196758 -0.10386393
## pH                    0.13003701 -0.56714503  0.06848384  0.20410995
## sulphates            -0.03364168 -0.24826266 -0.22699505  0.51924489
## alcohol               0.44279498  0.01698188 -0.15887556 -0.13438871
## quality               0.22713722 -0.14603134 -0.48884718 -0.27820033
##                              PC5          PC6         PC7         PC8
## fixed.acidity        -0.24413933  0.105856235  0.22355921  0.13041311
## volatile.acidity     -0.70298193 -0.123704688 -0.22363601 -0.22960669
## citric.acid          -0.06510579 -0.131958661 -0.12037133 -0.69141866
## residual.sugar        0.01610213  0.289918546 -0.33860858 -0.11329401
## chlorides             0.17829248 -0.409317266 -0.55225504  0.21139734
## free.sulfur.dioxide  -0.11117214 -0.488085145  0.22407108  0.12883115
## total.sulfur.dioxide -0.27144774 -0.272493820  0.20375343  0.01290262
## density               0.07834373  0.326008106 -0.12313568 -0.08667076
## pH                    0.11270171  0.192688838  0.07704001 -0.47796137
## sulphates            -0.45623099  0.479811894 -0.04462167  0.33642752
## alcohol              -0.30855451 -0.135443327 -0.09801169 -0.08899029
## quality              -0.04112191 -0.005524396 -0.58434519  0.14444197
##                              PC9        PC10        PC11         PC12
## fixed.acidity        -0.63145048  0.20087123 -0.10411772  0.170792295
## volatile.acidity     -0.03159628 -0.14175876 -0.27002270  0.013376718
## citric.acid           0.24949503 -0.10632912 -0.05395597  0.009648802
## residual.sugar        0.17730336  0.37427490  0.17987291  0.493565139
## chlorides            -0.17916182  0.23552782  0.09108849  0.025168952
## free.sulfur.dioxide   0.10184710  0.32733415 -0.49921348 -0.029475198
## total.sulfur.dioxide -0.17800832 -0.34735757  0.64355326  0.035060193
## density              -0.12538636  0.04349161 -0.06686042 -0.761184485
## pH                   -0.52031593  0.18375599 -0.07911267  0.141842640
## sulphates             0.23662489  0.05519364 -0.04102077  0.042787387
## alcohol               0.01278298  0.57530003  0.41895440 -0.350156811
## quality              -0.29970621 -0.36771605 -0.14620225 -0.016069252

This means that PC1=−0.15690447⋅fixed.acidity+0.02428722⋅volatile.acidity+0.13294430⋅citric.acid+0.40605288⋅residual.sugar+…

and so on.

We notice that PC1 is heavily correlated with alcohol content (\(0.44\)!)

PC2 is negatively correlated to the PH and correlated with the fixed.acidity

This corroborates with our findings in the correlation plot

Higher acidity and alcohol are correlated with better quality alcohol

Wine data analysis

Manish

2023-06-23