In the realm of big data, datasets more often than not are enormous and include many variables, making pattern analysis and visualization a challenging task. High-dimensional datasets could lead to redundancy, noisy traits, and inefficient computations. Dimension reduction techniques help reduce the volume of information while preserving the maximum amount of relevant information.
This project focuses on the application of dimension reduction to the wine dataset, which for varying physicochemical properties and quality metrics of wines. The aim is to identify the important dimensions to capture dataset variance and visualize the underlying structure, thus producing more interpretable models without losing key insights.
library(corrplot)
library(clusterSim)
library(depthTools)
library(ggplot2)
library(factoextra)
library(gridExtra)
library(psych)
wine_cor <- cor(wine, method = 'pearson')
print(wine_cor, digits = 2)
## fixed_acidity volatile_acidity citric_acid residual_sugar
## fixed_acidity 1.000 0.56 0.596 0.65
## volatile_acidity 0.562 1.00 0.635 0.73
## citric_acid 0.596 0.63 1.000 0.77
## residual_sugar 0.654 0.73 0.765 1.00
## chlorides 0.602 0.62 0.660 0.75
## free_sulfur_dioxide 0.654 0.75 0.701 0.78
## total_sulfur_dioxide 0.507 0.62 0.637 0.75
## density 0.625 0.67 0.740 0.79
## pH -0.254 -0.21 -0.185 -0.25
## sulphates 0.579 0.66 0.646 0.68
## alcohol 0.168 0.38 0.358 0.35
## quality 0.038 0.02 0.052 0.05
## chlorides free_sulfur_dioxide total_sulfur_dioxide density
## fixed_acidity 0.602 0.654 0.507 0.625
## volatile_acidity 0.623 0.746 0.622 0.673
## citric_acid 0.660 0.701 0.637 0.740
## residual_sugar 0.748 0.782 0.749 0.788
## chlorides 1.000 0.647 0.607 0.715
## free_sulfur_dioxide 0.647 1.000 0.702 0.714
## total_sulfur_dioxide 0.607 0.702 1.000 0.640
## density 0.715 0.714 0.640 1.000
## pH -0.113 -0.221 -0.253 -0.252
## sulphates 0.603 0.688 0.616 0.642
## alcohol 0.278 0.325 0.369 0.319
## quality 0.053 0.053 0.036 0.038
## pH sulphates alcohol quality
## fixed_acidity -0.2544 0.579 0.168 0.0375
## volatile_acidity -0.2139 0.660 0.376 0.0203
## citric_acid -0.1854 0.646 0.358 0.0523
## residual_sugar -0.2545 0.680 0.350 0.0497
## chlorides -0.1132 0.603 0.278 0.0529
## free_sulfur_dioxide -0.2206 0.688 0.325 0.0530
## total_sulfur_dioxide -0.2531 0.616 0.369 0.0360
## density -0.2520 0.642 0.319 0.0385
## pH 1.0000 -0.233 -0.107 -0.0024
## sulphates -0.2334 1.000 0.402 0.0477
## alcohol -0.1069 0.402 1.000 0.0819
## quality -0.0024 0.048 0.082 1.0000
We can see how some variables have a strong correlation, meaning they are dependend on each other. For example density and residual_sugar (0.79), free_sulfur_dioxide and total_sulfur_dioxide (0.79). Weak correlations are also visible on this plot. For example quality is independent variable. It does not have any correlation with any other variable. Other pairs with weak correlation are pH and fixed_acidity (-0.23), pH and density (-0.25).
corrplot(wine_cor, order = 'alphabet', tl.cex = 0.6)
On this graph we can see how many of the variables are dependend on each
other and have strong correlation
wine_nor <- data.Normalization(wine, type = 'n1', normalization = 'column')
summary(wine_nor)
## fixed_acidity volatile_acidity citric_acid residual_sugar
## Min. :-2.48437 Min. :-1.9035 Min. :-2.0629 Min. :-1.6139
## 1st Qu.:-0.91017 1st Qu.:-0.9446 1st Qu.:-0.9975 1st Qu.:-1.1301
## Median : 0.08406 Median : 0.1512 Median : 0.1978 Median : 0.3319
## Mean : 0.00000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.82974 3rd Qu.: 0.8361 3rd Qu.: 0.8215 3rd Qu.: 0.8157
## Max. : 2.52822 Max. : 2.2059 Max. : 2.2507 Max. : 1.8149
## chlorides free_sulfur_dioxide total_sulfur_dioxide density
## Min. :-1.53078 Min. :-1.6645 Min. :-2.2260 Min. :-1.9001
## 1st Qu.:-1.02651 1st Qu.:-1.0943 1st Qu.:-0.7886 1st Qu.:-1.0435
## Median : 0.03806 Median : 0.2120 Median : 0.1147 Median : 0.1851
## Mean : 0.00000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.78246 3rd Qu.: 0.8408 3rd Qu.: 0.8247 3rd Qu.: 0.8201
## Max. : 3.28780 Max. : 2.0677 Max. : 2.1061 Max. : 2.4109
## pH sulphates alcohol quality
## Min. :-2.56001 Min. :-1.9609 Min. :-2.784402 Min. :-1.5
## 1st Qu.:-0.75107 1st Qu.:-0.9812 1st Qu.:-0.754286 1st Qu.:-1.0
## Median :-0.05084 Median : 0.1454 Median : 0.007007 Median : 0.0
## Mean : 0.00000 Mean : 0.0000 Mean : 0.000000 Mean : 0.0
## 3rd Qu.: 0.64940 3rd Qu.: 0.8311 3rd Qu.: 0.768301 3rd Qu.: 1.0
## Max. : 4.96751 Max. : 2.3986 Max. : 3.052182 Max. : 1.5
Principal Compoment Analysis is a technique used for data with a linear structure. Its main goal is to preserve global variance and is easier to interpret
pca_result <- prcomp(wine, center = FALSE, scale. = FALSE)
summary(pca_result)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 290.4722 47.90766 10.52077 5.82846 1.90172 1.57015
## Proportion of Variance 0.9718 0.02644 0.00127 0.00039 0.00004 0.00003
## Cumulative Proportion 0.9718 0.99826 0.99954 0.99993 0.99997 1.00000
## PC7 PC8 PC9 PC10 PC11 PC12
## Standard deviation 0.3455 0.2754 0.2274 0.2131 0.07874 0.04135
## Proportion of Variance 0.0000 0.0000 0.0000 0.0000 0.00000 0.00000
## Cumulative Proportion 1.0000 1.0000 1.0000 1.0000 1.00000 1.00000
Since PC1 and PC2 explain almost all variance, the dataset can be reduced to two dimensions with minimal information loss
pca_scaled <- prcomp(wine, center = TRUE, scale. = TRUE)
summary(pca_scaled)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.5681 1.01266 0.97193 0.93816 0.68131 0.65988 0.6000
## Proportion of Variance 0.5496 0.08546 0.07872 0.07335 0.03868 0.03629 0.0300
## Cumulative Proportion 0.5496 0.63506 0.71378 0.78713 0.82581 0.86209 0.8921
## PC8 PC9 PC10 PC11 PC12
## Standard deviation 0.5889 0.57926 0.48509 0.45969 0.40731
## Proportion of Variance 0.0289 0.02796 0.01961 0.01761 0.01383
## Cumulative Proportion 0.9210 0.94896 0.96856 0.98617 1.00000
fviz_eig(pca_result, choice = 'eigenvalue')
We now have a clear understanding how PC1 and PC2 are distinct from the
other principal components.
pca_data <- as.data.frame(pca_result$x)
ggplot(pca_data, aes(PC1, PC2)) +
geom_point(alpha = 0.7) +
labs(title = "PCA: PC1 vs PC2", x = "PC1", y = "PC2")
As we knew before PC1 captures the most variance, which can be
intepreted by large amount of points along PC1 axis.
var <- get_pca_var(pca_result)
a <- fviz_contrib(pca_result, 'var', axes = 1, xtickslab.rt = 90)
b <- fviz_contrib(pca_result, 'var', axes = 2, xtickslab.rt = 90)
grid.arrange(a, b)
Both dimensions have a very similar contribution with
total_sulruf_dioxide and free_sulruf_dioxide being most dominant
pca_p_result <- principal(wine, nfactors = 3, rotate = 'varimax')
plot(pca_p_result$uniquenesses)
text(pca_p_result$uniquenesses, labels = names(pca_p_result$uniquenesses), cex = 0.8)
We can see how ‘alcohol’ variable stands out from the rest, as its
values rarely repeat in the dataset.