Project overview

Introduction

In the realm of big data, datasets more often than not are enormous and include many variables, making pattern analysis and visualization a challenging task. High-dimensional datasets could lead to redundancy, noisy traits, and inefficient computations. Dimension reduction techniques help reduce the volume of information while preserving the maximum amount of relevant information.

This project focuses on the application of dimension reduction to the wine dataset, which for varying physicochemical properties and quality metrics of wines. The aim is to identify the important dimensions to capture dataset variance and visualize the underlying structure, thus producing more interpretable models without losing key insights.

Packages used in the project:

library(corrplot)
library(clusterSim)
library(depthTools)
library(ggplot2)
library(factoextra)
library(gridExtra)
library(psych)

Preparing and analyzing data

Calculating Pearson correlation coefficients for variables to analyze relationship between two variables

wine_cor <- cor(wine, method = 'pearson')
print(wine_cor, digits = 2)
##                      fixed_acidity volatile_acidity citric_acid residual_sugar
## fixed_acidity                1.000             0.56       0.596           0.65
## volatile_acidity             0.562             1.00       0.635           0.73
## citric_acid                  0.596             0.63       1.000           0.77
## residual_sugar               0.654             0.73       0.765           1.00
## chlorides                    0.602             0.62       0.660           0.75
## free_sulfur_dioxide          0.654             0.75       0.701           0.78
## total_sulfur_dioxide         0.507             0.62       0.637           0.75
## density                      0.625             0.67       0.740           0.79
## pH                          -0.254            -0.21      -0.185          -0.25
## sulphates                    0.579             0.66       0.646           0.68
## alcohol                      0.168             0.38       0.358           0.35
## quality                      0.038             0.02       0.052           0.05
##                      chlorides free_sulfur_dioxide total_sulfur_dioxide density
## fixed_acidity            0.602               0.654                0.507   0.625
## volatile_acidity         0.623               0.746                0.622   0.673
## citric_acid              0.660               0.701                0.637   0.740
## residual_sugar           0.748               0.782                0.749   0.788
## chlorides                1.000               0.647                0.607   0.715
## free_sulfur_dioxide      0.647               1.000                0.702   0.714
## total_sulfur_dioxide     0.607               0.702                1.000   0.640
## density                  0.715               0.714                0.640   1.000
## pH                      -0.113              -0.221               -0.253  -0.252
## sulphates                0.603               0.688                0.616   0.642
## alcohol                  0.278               0.325                0.369   0.319
## quality                  0.053               0.053                0.036   0.038
##                           pH sulphates alcohol quality
## fixed_acidity        -0.2544     0.579   0.168  0.0375
## volatile_acidity     -0.2139     0.660   0.376  0.0203
## citric_acid          -0.1854     0.646   0.358  0.0523
## residual_sugar       -0.2545     0.680   0.350  0.0497
## chlorides            -0.1132     0.603   0.278  0.0529
## free_sulfur_dioxide  -0.2206     0.688   0.325  0.0530
## total_sulfur_dioxide -0.2531     0.616   0.369  0.0360
## density              -0.2520     0.642   0.319  0.0385
## pH                    1.0000    -0.233  -0.107 -0.0024
## sulphates            -0.2334     1.000   0.402  0.0477
## alcohol              -0.1069     0.402   1.000  0.0819
## quality              -0.0024     0.048   0.082  1.0000

We can see how some variables have a strong correlation, meaning they are dependend on each other. For example density and residual_sugar (0.79), free_sulfur_dioxide and total_sulfur_dioxide (0.79). Weak correlations are also visible on this plot. For example quality is independent variable. It does not have any correlation with any other variable. Other pairs with weak correlation are pH and fixed_acidity (-0.23), pH and density (-0.25).

Generating correlation heatmap

corrplot(wine_cor, order = 'alphabet', tl.cex = 0.6)

On this graph we can see how many of the variables are dependend on each other and have strong correlation

Displaying key statistics for each variable

wine_nor <- data.Normalization(wine, type = 'n1', normalization = 'column')
summary(wine_nor)
##  fixed_acidity      volatile_acidity   citric_acid      residual_sugar   
##  Min.   :-2.48437   Min.   :-1.9035   Min.   :-2.0629   Min.   :-1.6139  
##  1st Qu.:-0.91017   1st Qu.:-0.9446   1st Qu.:-0.9975   1st Qu.:-1.1301  
##  Median : 0.08406   Median : 0.1512   Median : 0.1978   Median : 0.3319  
##  Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.82974   3rd Qu.: 0.8361   3rd Qu.: 0.8215   3rd Qu.: 0.8157  
##  Max.   : 2.52822   Max.   : 2.2059   Max.   : 2.2507   Max.   : 1.8149  
##    chlorides        free_sulfur_dioxide total_sulfur_dioxide    density       
##  Min.   :-1.53078   Min.   :-1.6645     Min.   :-2.2260      Min.   :-1.9001  
##  1st Qu.:-1.02651   1st Qu.:-1.0943     1st Qu.:-0.7886      1st Qu.:-1.0435  
##  Median : 0.03806   Median : 0.2120     Median : 0.1147      Median : 0.1851  
##  Mean   : 0.00000   Mean   : 0.0000     Mean   : 0.0000      Mean   : 0.0000  
##  3rd Qu.: 0.78246   3rd Qu.: 0.8408     3rd Qu.: 0.8247      3rd Qu.: 0.8201  
##  Max.   : 3.28780   Max.   : 2.0677     Max.   : 2.1061      Max.   : 2.4109  
##        pH             sulphates          alcohol             quality    
##  Min.   :-2.56001   Min.   :-1.9609   Min.   :-2.784402   Min.   :-1.5  
##  1st Qu.:-0.75107   1st Qu.:-0.9812   1st Qu.:-0.754286   1st Qu.:-1.0  
##  Median :-0.05084   Median : 0.1454   Median : 0.007007   Median : 0.0  
##  Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.000000   Mean   : 0.0  
##  3rd Qu.: 0.64940   3rd Qu.: 0.8311   3rd Qu.: 0.768301   3rd Qu.: 1.0  
##  Max.   : 4.96751   Max.   : 2.3986   Max.   : 3.052182   Max.   : 1.5

PCA

Principal Compoment Analysis is a technique used for data with a linear structure. Its main goal is to preserve global variance and is easier to interpret

Performing principal component analysis on ‘wine’ dataset

pca_result <- prcomp(wine, center = FALSE, scale. = FALSE)
summary(pca_result)
## Importance of components:
##                             PC1      PC2      PC3     PC4     PC5     PC6
## Standard deviation     290.4722 47.90766 10.52077 5.82846 1.90172 1.57015
## Proportion of Variance   0.9718  0.02644  0.00127 0.00039 0.00004 0.00003
## Cumulative Proportion    0.9718  0.99826  0.99954 0.99993 0.99997 1.00000
##                           PC7    PC8    PC9   PC10    PC11    PC12
## Standard deviation     0.3455 0.2754 0.2274 0.2131 0.07874 0.04135
## Proportion of Variance 0.0000 0.0000 0.0000 0.0000 0.00000 0.00000
## Cumulative Proportion  1.0000 1.0000 1.0000 1.0000 1.00000 1.00000

Since PC1 and PC2 explain almost all variance, the dataset can be reduced to two dimensions with minimal information loss

pca_scaled <- prcomp(wine, center = TRUE, scale. = TRUE)
summary(pca_scaled)
## Importance of components:
##                           PC1     PC2     PC3     PC4     PC5     PC6    PC7
## Standard deviation     2.5681 1.01266 0.97193 0.93816 0.68131 0.65988 0.6000
## Proportion of Variance 0.5496 0.08546 0.07872 0.07335 0.03868 0.03629 0.0300
## Cumulative Proportion  0.5496 0.63506 0.71378 0.78713 0.82581 0.86209 0.8921
##                           PC8     PC9    PC10    PC11    PC12
## Standard deviation     0.5889 0.57926 0.48509 0.45969 0.40731
## Proportion of Variance 0.0289 0.02796 0.01961 0.01761 0.01383
## Cumulative Proportion  0.9210 0.94896 0.96856 0.98617 1.00000

Vizualizing eigenvalues for each principal component (PC)

fviz_eig(pca_result, choice = 'eigenvalue')

We now have a clear understanding how PC1 and PC2 are distinct from the other principal components.

Now let’s vizualize how PC1 and PC2 are distributed

pca_data <- as.data.frame(pca_result$x)
ggplot(pca_data, aes(PC1, PC2)) +
  geom_point(alpha = 0.7) +
  labs(title = "PCA: PC1 vs PC2", x = "PC1", y = "PC2")

As we knew before PC1 captures the most variance, which can be intepreted by large amount of points along PC1 axis.

Checking contribution of variables to their specific principal component (or dimension)

var <- get_pca_var(pca_result)
a <- fviz_contrib(pca_result, 'var', axes = 1, xtickslab.rt = 90)
b <- fviz_contrib(pca_result, 'var', axes = 2, xtickslab.rt = 90)
grid.arrange(a, b)

Both dimensions have a very similar contribution with total_sulruf_dioxide and free_sulruf_dioxide being most dominant

Uniqueness of variables in PCA

pca_p_result <- principal(wine, nfactors = 3, rotate = 'varimax')
plot(pca_p_result$uniquenesses)
text(pca_p_result$uniquenesses, labels = names(pca_p_result$uniquenesses), cex = 0.8)

We can see how ‘alcohol’ variable stands out from the rest, as its values rarely repeat in the dataset.