Introduction

This is a project related to Dimension Reduction. Dataset used in this project is the Wine Quality Dataset available on Kaggle

Link to dataset -> https://www.kaggle.com/datasets/yasserh/wine-quality-dataset

Literature Review

Several researches were conducted on wine quality using different datasets. One of those researches is “APPLICATION OF MACHINE LEARNING TO PREDICT QUALITY OF PORTUGUESE WINE BASED ON SENSORY PREFERENCES” by Alsénvitor Campos Aires do Nascimento. This study proposed collecting, processing, and modelling a machine learning solution based on Vivino’s red and white Portuguese wines dataset. As a result of their research, they achieved the accuracy of prediction which is a relevant percentage of 85% for red and 84% for white wines by the Random Forest model.

The wine’s quality is determined by its origin, grape variety, and planting conditions, all of which have an impact on how the product is viewed from a sensory standpoint (Palade and Popa, 2014). However, Lee, Park, and Kang (2015) confirm that physicochemical and sensory approaches are typically used to evaluate wine quality.

Dataset

The dataset consists of the following columns:

1 - fixed acidity

2 - volatile acidity

3 - citric acid

4 - residual sugar

5 - chlorides

6 - free sulfur dioxide

7 - total sulfur dioxide

8 - density

9 - pH

10 - sulphates

11 - alcohol

12 - quality (score between 0 and 10)

library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.4.2
## Warning: package 'ggplot2' was built under R version 4.4.2
## Warning: package 'tibble' was built under R version 4.4.2
## Warning: package 'tidyr' was built under R version 4.4.2
## Warning: package 'readr' was built under R version 4.4.2
## Warning: package 'purrr' was built under R version 4.4.2
## Warning: package 'dplyr' was built under R version 4.4.2
## Warning: package 'stringr' was built under R version 4.4.2
## Warning: package 'forcats' was built under R version 4.4.2
## Warning: package 'lubridate' was built under R version 4.4.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(FactoMineR)
## Warning: package 'FactoMineR' was built under R version 4.4.2
library(factoextra)
## Warning: package 'factoextra' was built under R version 4.4.2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(Hmisc)
## Warning: package 'Hmisc' was built under R version 4.4.2
## 
## Attaching package: 'Hmisc'
## 
## The following objects are masked from 'package:dplyr':
## 
##     src, summarize
## 
## The following objects are masked from 'package:base':
## 
##     format.pval, units
library(caret)
## Warning: package 'caret' was built under R version 4.4.2
## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift
library(ggplot2)
library(ggpubr)
## Warning: package 'ggpubr' was built under R version 4.4.2
library(GGally)
## Warning: package 'GGally' was built under R version 4.4.2
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.4.2
## corrplot 0.95 loaded
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 4.4.2
## 
## Attaching package: 'gridExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine
data <- read.csv("C://Users/Admin/Desktop/WineQT.csv")

summary(data)
##  fixed.acidity    volatile.acidity  citric.acid     residual.sugar  
##  Min.   : 4.600   Min.   :0.1200   Min.   :0.0000   Min.   : 0.900  
##  1st Qu.: 7.100   1st Qu.:0.3925   1st Qu.:0.0900   1st Qu.: 1.900  
##  Median : 7.900   Median :0.5200   Median :0.2500   Median : 2.200  
##  Mean   : 8.311   Mean   :0.5313   Mean   :0.2684   Mean   : 2.532  
##  3rd Qu.: 9.100   3rd Qu.:0.6400   3rd Qu.:0.4200   3rd Qu.: 2.600  
##  Max.   :15.900   Max.   :1.5800   Max.   :1.0000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide    density      
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00       Min.   :0.9901  
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 21.00       1st Qu.:0.9956  
##  Median :0.07900   Median :13.00       Median : 37.00       Median :0.9967  
##  Mean   :0.08693   Mean   :15.62       Mean   : 45.91       Mean   :0.9967  
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 61.00       3rd Qu.:0.9978  
##  Max.   :0.61100   Max.   :68.00       Max.   :289.00       Max.   :1.0037  
##        pH          sulphates         alcohol         quality     
##  Min.   :2.740   Min.   :0.3300   Min.   : 8.40   Min.   :3.000  
##  1st Qu.:3.205   1st Qu.:0.5500   1st Qu.: 9.50   1st Qu.:5.000  
##  Median :3.310   Median :0.6200   Median :10.20   Median :6.000  
##  Mean   :3.311   Mean   :0.6577   Mean   :10.44   Mean   :5.657  
##  3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :4.010   Max.   :2.0000   Max.   :14.90   Max.   :8.000  
##        Id      
##  Min.   :   0  
##  1st Qu.: 411  
##  Median : 794  
##  Mean   : 805  
##  3rd Qu.:1210  
##  Max.   :1597

The summary function shows the description of the whole dataset. It is always recommended to check the dataset using this function to have a better understandiing.

Histogram

Visualizing the histograms

par(mar = c(4, 4, 2, 1))
hist.data.frame(data)

The histogram shows the frequency graph for each column that exists in the dataset.

Removing id and getting the dimensions of the dataset

The ID column is unnecessary for this task and that is why it needs to be removed.

data <- data %>% select(-Id)
dim(data)
## [1] 1143   12

So, our dataset has 12 dimensions, it is a lot, so it can be reduced after several calculations.

Then the data is preprocessed: This procedure is necessary in order to get a better structured dataset and avoid unnecessary calculations which will be required later.

preprocessed <- preProcess(data,, method=c("center", "scale"))
normalized <- predict(preprocessed, data)
ggpairs(normalized, progress=FALSE)

This graph shows the frequency table of the each pair of the column

Correlation Matrix

Correlation Matrix is a technique in unsupervised learning. It shows which columns are more linked to each other. This can be observed from the graph below:

cor<- cor(normalized, method="pearson") 
corrplot(cor)

It can be observed from the correlation matrix that fixed.acidity and citric.acid, and density and fixed.acidity have more correlation.

PCA

Principle Component Analysis is a technique used for dimension reduction. Here, parameters called eigenvalues are important to consider.

pca_result <- PCA(normalized, graph = FALSE)

# View PCA summary
summary(pca_result)
## 
## Call:
## PCA(X = normalized, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               3.178   2.235   1.686   1.216   0.967   0.666   0.587
## % of var.             26.480  18.622  14.053  10.131   8.062   5.551   4.888
## Cumulative % of var.  26.480  45.102  59.155  69.286  77.348  82.899  87.787
##                        Dim.8   Dim.9  Dim.10  Dim.11  Dim.12
## Variance               0.492   0.407   0.332   0.176   0.058
## % of var.              4.097   3.395   2.770   1.465   0.487
## Cumulative % of var.  91.884  95.279  98.049  99.513 100.000
## 
## Individuals (the 10 first)
##                          Dist    Dim.1    ctr   cos2    Dim.2    ctr   cos2  
## 1                    |  2.733 | -1.720  0.081  0.396 |  1.254  0.062  0.211 |
## 2                    |  2.927 | -0.962  0.025  0.108 |  2.120  0.176  0.524 |
## 3                    |  2.075 | -0.874  0.021  0.177 |  1.469  0.085  0.501 |
## 4                    |  3.080 |  2.351  0.152  0.583 | -0.207  0.002  0.005 |
## 5                    |  2.733 | -1.720  0.081  0.396 |  1.254  0.062  0.211 |
## 6                    |  2.636 | -1.692  0.079  0.412 |  1.265  0.063  0.230 |
## 7                    |  2.254 | -1.238  0.042  0.302 |  1.244  0.061  0.305 |
## 8                    |  3.169 | -2.040  0.115  0.414 | -0.922  0.033  0.085 |
## 9                    |  2.654 | -0.890  0.022  0.112 | -0.587  0.013  0.049 |
## 10                   |  2.282 | -1.256  0.043  0.303 |  1.332  0.069  0.340 |
##                       Dim.3    ctr   cos2  
## 1                    -1.335  0.092  0.239 |
## 2                     0.173  0.002  0.003 |
## 3                    -0.594  0.018  0.082 |
## 4                     0.099  0.001  0.001 |
## 5                    -1.335  0.092  0.239 |
## 6                    -1.080  0.060  0.168 |
## 7                    -0.692  0.025  0.094 |
## 8                    -0.692  0.025  0.048 |
## 9                    -1.097  0.062  0.171 |
## 10                   -0.309  0.005  0.018 |
## 
## Variables (the 10 first)
##                         Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## fixed.acidity        |  0.862 23.392  0.743 |  0.010  0.004  0.000 | -0.206
## volatile.acidity     | -0.451  6.411  0.204 |  0.538 12.947  0.289 | -0.264
## citric.acid          |  0.837 22.070  0.701 | -0.216  2.088  0.047 |  0.140
## residual.sugar       |  0.301  2.849  0.091 |  0.214  2.045  0.046 |  0.299
## chlorides            |  0.377  4.463  0.142 |  0.278  3.453  0.077 |  0.018
## free.sulfur.dioxide  | -0.100  0.317  0.010 |  0.340  5.177  0.116 |  0.820
## total.sulfur.dioxide | -0.004  0.000  0.000 |  0.498 11.121  0.248 |  0.738
## density              |  0.671 14.186  0.451 |  0.482 10.406  0.233 | -0.208
## pH                   | -0.761 18.226  0.579 | -0.113  0.575  0.013 |  0.077
## sulphates            |  0.445  6.234  0.198 | -0.209  1.948  0.044 |  0.270
##                         ctr   cos2  
## fixed.acidity         2.524  0.043 |
## volatile.acidity      4.121  0.069 |
## citric.acid           1.161  0.020 |
## residual.sugar        5.310  0.090 |
## chlorides             0.020  0.000 |
## free.sulfur.dioxide  39.852  0.672 |
## total.sulfur.dioxide 32.278  0.544 |
## density               2.575  0.043 |
## pH                    0.353  0.006 |
## sulphates             4.323  0.073 |

The summary shows the dimensions and their eigenvalues. 4 of them are more than 1. More explanation in the following paragparh:

Eigenvalues

normalized.cov<-cov(normalized)
normalized.eigen<-eigen(normalized.cov)
normalized.eigen$values
##  [1] 3.17765845 2.23460254 1.68638738 1.21566982 0.96739032 0.66612476
##  [7] 0.58659975 0.49163995 0.40740646 0.33235549 0.17575505 0.05841003

The output implies that 4 components should be chosen as they have eigen values higher than 1.

fviz_eig(pca_result, choice='eigenvalue')

fviz_eig(pca_result, ylim = c(0, 50))

The 2 graphs above indicate that 4 dimensions is enough for us to cover more than half of the dataset. Hence, 4 dimensions (eigenvalues higher than 1), should be chosen.

Component Analysis

Component analysis is done in order to better visualize the relationship between the dimensions.

fviz_pca_ind(pca_result, col.ind=normalized$quality, geom = "point", gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07" ))

Correlation Circle

Another technique to better visualize which components are closely linked to each other.

fviz_pca_var(pca_result, col.var="contrib")+
scale_color_gradient2(low="#99FF33", mid="#CC0066", 
                      high="black", midpoint=11)

It can be observed that fixed.acidity, citric.acid and density have the most influence, in comparison with fixed.sulfur.dioxide and residual.sugar.

Analysis of 4 most important dimensions of PCA

PC1 and PC2

fviz_pca_var(pca_result, col.var="contrib", repel = TRUE, axes = c(1, 2)) +
  labs(title="Variables loadings for PC1 and PC2", x="PC1", y="PC2")+
scale_color_gradient2(low="#99FF33", mid="#CC0066", 
                      high="black", midpoint=20)

Contribution of fixed.acidity and citric.acid is the most powerful for PC1 and PC2.

PC2 and PC3

fviz_pca_var(pca_result, col.var="contrib", repel = TRUE, axes = c(2, 3)) +
  labs(title="Variables loadings for PC2 and PC3", x="PC2", y="PC3")+
scale_color_gradient2(low="#99FF33", mid="#CC0066", 
                      high="black", midpoint=20)

For PC2 and PC3, free.sulfur.dioxide and total.sulfur.dioxide have biggest contribution. The following procedure is also done for PC3 and PC4 to better understand the relationship between components.

PC3 and PC4

fviz_pca_var(pca_result, col.var="contrib", repel = TRUE, axes = c(3, 4)) +
  labs(title="Variables loadings for PC3 and PC4", x="PC3", y="PC4")+
scale_color_gradient2(low="#99FF33", mid="#CC0066", 
                      high="black", midpoint=20)

PC1 <- fviz_contrib(pca_result, choice = "var", axes = 1,fill = "#990066",color = "#990066") + ggtitle("DIM1") + theme(plot.title = element_text(hjust = 0.5))
PC2 <- fviz_contrib(pca_result, choice = "var", axes = 2,fill = "#990066",color = "#990066") + ggtitle("DIM2") + theme(plot.title = element_text(hjust = 0.5))
PC3 <- fviz_contrib(pca_result, choice = "var", axes = 3,fill = "#990066",color = "#990066") + ggtitle("DIM3") + theme(plot.title = element_text(hjust = 0.5))
PC4 <- fviz_contrib(pca_result, choice = "var", axes = 4,fill = "#990066",color = "#990066") + ggtitle("DIM4") + theme(plot.title = element_text(hjust = 0.5))
grid.arrange(PC1, PC2, PC3,PC4,   ncol=4)

It can be observed from the contribution graphs that first 2 variables are the most important ones for all dimensions but the 3rd variable also important in 3 dimensions.

Conclusion

The purpose of this project is to analyse the huge dataset called Wine Quality and reduce its size using the technique called Principal Component Analysis (PCA). After the analysis, 4 eigenvalues that have indicators more than 1 suggest that the dimension can be reduced to 4.