This is a project related to Dimension Reduction. Dataset used in this project is the Wine Quality Dataset available on Kaggle
Link to dataset -> https://www.kaggle.com/datasets/yasserh/wine-quality-dataset
Several researches were conducted on wine quality using different datasets. One of those researches is “APPLICATION OF MACHINE LEARNING TO PREDICT QUALITY OF PORTUGUESE WINE BASED ON SENSORY PREFERENCES” by Alsénvitor Campos Aires do Nascimento. This study proposed collecting, processing, and modelling a machine learning solution based on Vivino’s red and white Portuguese wines dataset. As a result of their research, they achieved the accuracy of prediction which is a relevant percentage of 85% for red and 84% for white wines by the Random Forest model.
The wine’s quality is determined by its origin, grape variety, and planting conditions, all of which have an impact on how the product is viewed from a sensory standpoint (Palade and Popa, 2014). However, Lee, Park, and Kang (2015) confirm that physicochemical and sensory approaches are typically used to evaluate wine quality.
The dataset consists of the following columns:
1 - fixed acidity
2 - volatile acidity
3 - citric acid
4 - residual sugar
5 - chlorides
6 - free sulfur dioxide
7 - total sulfur dioxide
8 - density
9 - pH
10 - sulphates
11 - alcohol
12 - quality (score between 0 and 10)
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.4.2
## Warning: package 'ggplot2' was built under R version 4.4.2
## Warning: package 'tibble' was built under R version 4.4.2
## Warning: package 'tidyr' was built under R version 4.4.2
## Warning: package 'readr' was built under R version 4.4.2
## Warning: package 'purrr' was built under R version 4.4.2
## Warning: package 'dplyr' was built under R version 4.4.2
## Warning: package 'stringr' was built under R version 4.4.2
## Warning: package 'forcats' was built under R version 4.4.2
## Warning: package 'lubridate' was built under R version 4.4.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(FactoMineR)
## Warning: package 'FactoMineR' was built under R version 4.4.2
library(factoextra)
## Warning: package 'factoextra' was built under R version 4.4.2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(Hmisc)
## Warning: package 'Hmisc' was built under R version 4.4.2
##
## Attaching package: 'Hmisc'
##
## The following objects are masked from 'package:dplyr':
##
## src, summarize
##
## The following objects are masked from 'package:base':
##
## format.pval, units
library(caret)
## Warning: package 'caret' was built under R version 4.4.2
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
library(ggplot2)
library(ggpubr)
## Warning: package 'ggpubr' was built under R version 4.4.2
library(GGally)
## Warning: package 'GGally' was built under R version 4.4.2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.4.2
## corrplot 0.95 loaded
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 4.4.2
##
## Attaching package: 'gridExtra'
##
## The following object is masked from 'package:dplyr':
##
## combine
data <- read.csv("C://Users/Admin/Desktop/WineQT.csv")
summary(data)
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.600 Min. :0.1200 Min. :0.0000 Min. : 0.900
## 1st Qu.: 7.100 1st Qu.:0.3925 1st Qu.:0.0900 1st Qu.: 1.900
## Median : 7.900 Median :0.5200 Median :0.2500 Median : 2.200
## Mean : 8.311 Mean :0.5313 Mean :0.2684 Mean : 2.532
## 3rd Qu.: 9.100 3rd Qu.:0.6400 3rd Qu.:0.4200 3rd Qu.: 2.600
## Max. :15.900 Max. :1.5800 Max. :1.0000 Max. :15.500
## chlorides free.sulfur.dioxide total.sulfur.dioxide density
## Min. :0.01200 Min. : 1.00 Min. : 6.00 Min. :0.9901
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 21.00 1st Qu.:0.9956
## Median :0.07900 Median :13.00 Median : 37.00 Median :0.9967
## Mean :0.08693 Mean :15.62 Mean : 45.91 Mean :0.9967
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 61.00 3rd Qu.:0.9978
## Max. :0.61100 Max. :68.00 Max. :289.00 Max. :1.0037
## pH sulphates alcohol quality
## Min. :2.740 Min. :0.3300 Min. : 8.40 Min. :3.000
## 1st Qu.:3.205 1st Qu.:0.5500 1st Qu.: 9.50 1st Qu.:5.000
## Median :3.310 Median :0.6200 Median :10.20 Median :6.000
## Mean :3.311 Mean :0.6577 Mean :10.44 Mean :5.657
## 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :4.010 Max. :2.0000 Max. :14.90 Max. :8.000
## Id
## Min. : 0
## 1st Qu.: 411
## Median : 794
## Mean : 805
## 3rd Qu.:1210
## Max. :1597
The summary function shows the description of the whole dataset. It is always recommended to check the dataset using this function to have a better understandiing.
Visualizing the histograms
par(mar = c(4, 4, 2, 1))
hist.data.frame(data)
The histogram shows the frequency graph for each column that exists in the dataset.
The ID column is unnecessary for this task and that is why it needs to be removed.
data <- data %>% select(-Id)
dim(data)
## [1] 1143 12
So, our dataset has 12 dimensions, it is a lot, so it can be reduced after several calculations.
Then the data is preprocessed: This procedure is necessary in order to get a better structured dataset and avoid unnecessary calculations which will be required later.
preprocessed <- preProcess(data,, method=c("center", "scale"))
normalized <- predict(preprocessed, data)
ggpairs(normalized, progress=FALSE)
This graph shows the frequency table of the each pair of the column
Correlation Matrix is a technique in unsupervised learning. It shows which columns are more linked to each other. This can be observed from the graph below:
cor<- cor(normalized, method="pearson")
corrplot(cor)
It can be observed from the correlation matrix that fixed.acidity and citric.acid, and density and fixed.acidity have more correlation.
Principle Component Analysis is a technique used for dimension reduction. Here, parameters called eigenvalues are important to consider.
pca_result <- PCA(normalized, graph = FALSE)
# View PCA summary
summary(pca_result)
##
## Call:
## PCA(X = normalized, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
## Variance 3.178 2.235 1.686 1.216 0.967 0.666 0.587
## % of var. 26.480 18.622 14.053 10.131 8.062 5.551 4.888
## Cumulative % of var. 26.480 45.102 59.155 69.286 77.348 82.899 87.787
## Dim.8 Dim.9 Dim.10 Dim.11 Dim.12
## Variance 0.492 0.407 0.332 0.176 0.058
## % of var. 4.097 3.395 2.770 1.465 0.487
## Cumulative % of var. 91.884 95.279 98.049 99.513 100.000
##
## Individuals (the 10 first)
## Dist Dim.1 ctr cos2 Dim.2 ctr cos2
## 1 | 2.733 | -1.720 0.081 0.396 | 1.254 0.062 0.211 |
## 2 | 2.927 | -0.962 0.025 0.108 | 2.120 0.176 0.524 |
## 3 | 2.075 | -0.874 0.021 0.177 | 1.469 0.085 0.501 |
## 4 | 3.080 | 2.351 0.152 0.583 | -0.207 0.002 0.005 |
## 5 | 2.733 | -1.720 0.081 0.396 | 1.254 0.062 0.211 |
## 6 | 2.636 | -1.692 0.079 0.412 | 1.265 0.063 0.230 |
## 7 | 2.254 | -1.238 0.042 0.302 | 1.244 0.061 0.305 |
## 8 | 3.169 | -2.040 0.115 0.414 | -0.922 0.033 0.085 |
## 9 | 2.654 | -0.890 0.022 0.112 | -0.587 0.013 0.049 |
## 10 | 2.282 | -1.256 0.043 0.303 | 1.332 0.069 0.340 |
## Dim.3 ctr cos2
## 1 -1.335 0.092 0.239 |
## 2 0.173 0.002 0.003 |
## 3 -0.594 0.018 0.082 |
## 4 0.099 0.001 0.001 |
## 5 -1.335 0.092 0.239 |
## 6 -1.080 0.060 0.168 |
## 7 -0.692 0.025 0.094 |
## 8 -0.692 0.025 0.048 |
## 9 -1.097 0.062 0.171 |
## 10 -0.309 0.005 0.018 |
##
## Variables (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## fixed.acidity | 0.862 23.392 0.743 | 0.010 0.004 0.000 | -0.206
## volatile.acidity | -0.451 6.411 0.204 | 0.538 12.947 0.289 | -0.264
## citric.acid | 0.837 22.070 0.701 | -0.216 2.088 0.047 | 0.140
## residual.sugar | 0.301 2.849 0.091 | 0.214 2.045 0.046 | 0.299
## chlorides | 0.377 4.463 0.142 | 0.278 3.453 0.077 | 0.018
## free.sulfur.dioxide | -0.100 0.317 0.010 | 0.340 5.177 0.116 | 0.820
## total.sulfur.dioxide | -0.004 0.000 0.000 | 0.498 11.121 0.248 | 0.738
## density | 0.671 14.186 0.451 | 0.482 10.406 0.233 | -0.208
## pH | -0.761 18.226 0.579 | -0.113 0.575 0.013 | 0.077
## sulphates | 0.445 6.234 0.198 | -0.209 1.948 0.044 | 0.270
## ctr cos2
## fixed.acidity 2.524 0.043 |
## volatile.acidity 4.121 0.069 |
## citric.acid 1.161 0.020 |
## residual.sugar 5.310 0.090 |
## chlorides 0.020 0.000 |
## free.sulfur.dioxide 39.852 0.672 |
## total.sulfur.dioxide 32.278 0.544 |
## density 2.575 0.043 |
## pH 0.353 0.006 |
## sulphates 4.323 0.073 |
The summary shows the dimensions and their eigenvalues. 4 of them are more than 1. More explanation in the following paragparh:
normalized.cov<-cov(normalized)
normalized.eigen<-eigen(normalized.cov)
normalized.eigen$values
## [1] 3.17765845 2.23460254 1.68638738 1.21566982 0.96739032 0.66612476
## [7] 0.58659975 0.49163995 0.40740646 0.33235549 0.17575505 0.05841003
The output implies that 4 components should be chosen as they have eigen values higher than 1.
fviz_eig(pca_result, choice='eigenvalue')
fviz_eig(pca_result, ylim = c(0, 50))
The 2 graphs above indicate that 4 dimensions is enough for us to cover more than half of the dataset. Hence, 4 dimensions (eigenvalues higher than 1), should be chosen.
Component analysis is done in order to better visualize the relationship between the dimensions.
fviz_pca_ind(pca_result, col.ind=normalized$quality, geom = "point", gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07" ))
Another technique to better visualize which components are closely linked to each other.
fviz_pca_var(pca_result, col.var="contrib")+
scale_color_gradient2(low="#99FF33", mid="#CC0066",
high="black", midpoint=11)
It can be observed that fixed.acidity, citric.acid and density have the most influence, in comparison with fixed.sulfur.dioxide and residual.sugar.
fviz_pca_var(pca_result, col.var="contrib", repel = TRUE, axes = c(1, 2)) +
labs(title="Variables loadings for PC1 and PC2", x="PC1", y="PC2")+
scale_color_gradient2(low="#99FF33", mid="#CC0066",
high="black", midpoint=20)
Contribution of fixed.acidity and citric.acid is the most powerful for PC1 and PC2.
fviz_pca_var(pca_result, col.var="contrib", repel = TRUE, axes = c(2, 3)) +
labs(title="Variables loadings for PC2 and PC3", x="PC2", y="PC3")+
scale_color_gradient2(low="#99FF33", mid="#CC0066",
high="black", midpoint=20)
For PC2 and PC3, free.sulfur.dioxide and total.sulfur.dioxide have biggest contribution. The following procedure is also done for PC3 and PC4 to better understand the relationship between components.
fviz_pca_var(pca_result, col.var="contrib", repel = TRUE, axes = c(3, 4)) +
labs(title="Variables loadings for PC3 and PC4", x="PC3", y="PC4")+
scale_color_gradient2(low="#99FF33", mid="#CC0066",
high="black", midpoint=20)
PC1 <- fviz_contrib(pca_result, choice = "var", axes = 1,fill = "#990066",color = "#990066") + ggtitle("DIM1") + theme(plot.title = element_text(hjust = 0.5))
PC2 <- fviz_contrib(pca_result, choice = "var", axes = 2,fill = "#990066",color = "#990066") + ggtitle("DIM2") + theme(plot.title = element_text(hjust = 0.5))
PC3 <- fviz_contrib(pca_result, choice = "var", axes = 3,fill = "#990066",color = "#990066") + ggtitle("DIM3") + theme(plot.title = element_text(hjust = 0.5))
PC4 <- fviz_contrib(pca_result, choice = "var", axes = 4,fill = "#990066",color = "#990066") + ggtitle("DIM4") + theme(plot.title = element_text(hjust = 0.5))
grid.arrange(PC1, PC2, PC3,PC4, ncol=4)
It can be observed from the contribution graphs that first 2 variables are the most important ones for all dimensions but the 3rd variable also important in 3 dimensions.
The purpose of this project is to analyse the huge dataset called Wine Quality and reduce its size using the technique called Principal Component Analysis (PCA). After the analysis, 4 eigenvalues that have indicators more than 1 suggest that the dimension can be reduced to 4.