We step into the realm of oenology, investigating the subtle dance of chemical constituents that characterizes the vast range of Italian wines. The dataset contains the results of a chemical analysis of wines made from three different cultivars grown in the same region of Italy. Our goal is to uncover the presence of clusters in the wine dataset; in other words, to see whether we can distinguish the three cultivars from their chemical profiles alone.
First, we load the required libraries and read in the dataset.
# Load the necessary libraries
library(tidyverse)
library(corrplot)
library(corrgram)
library(gridExtra)
library(GGally)
library(knitr)
library(cluster)
library(fpc)
library(factoextra)
# Load the wine dataset
wines <- read.csv("Wine.csv")
Display the first and last few rows of the dataset.
# First rows
head(wines)
## Alcohol Malic_Acid Ash Ash_Alcanity Magnesium Total_Phenols Flavanoids
## 1 14.23 1.71 2.43 15.6 127 2.80 3.06
## 2 13.20 1.78 2.14 11.2 100 2.65 2.76
## 3 13.16 2.36 2.67 18.6 101 2.80 3.24
## 4 14.37 1.95 2.50 16.8 113 3.85 3.49
## 5 13.24 2.59 2.87 21.0 118 2.80 2.69
## 6 14.20 1.76 2.45 15.2 112 3.27 3.39
## Nonflavanoid_Phenols Proanthocyanins Color_Intensity Hue OD280 Proline
## 1 0.28 2.29 5.64 1.04 3.92 1065
## 2 0.26 1.28 4.38 1.05 3.40 1050
## 3 0.30 2.81 5.68 1.03 3.17 1185
## 4 0.24 2.18 7.80 0.86 3.45 1480
## 5 0.39 1.82 4.32 1.04 2.93 735
## 6 0.34 1.97 6.75 1.05 2.85 1450
# Last rows
tail(wines)
## Alcohol Malic_Acid Ash Ash_Alcanity Magnesium Total_Phenols Flavanoids
## 173 14.16 2.51 2.48 20.0 91 1.68 0.70
## 174 13.71 5.65 2.45 20.5 95 1.68 0.61
## 175 13.40 3.91 2.48 23.0 102 1.80 0.75
## 176 13.27 4.28 2.26 20.0 120 1.59 0.69
## 177 13.17 2.59 2.37 20.0 120 1.65 0.68
## 178 14.13 4.10 2.74 24.5 96 2.05 0.76
## Nonflavanoid_Phenols Proanthocyanins Color_Intensity Hue OD280 Proline
## 173 0.44 1.24 9.7 0.62 1.71 660
## 174 0.52 1.06 7.7 0.64 1.74 740
## 175 0.43 1.41 7.3 0.70 1.56 750
## 176 0.43 1.35 10.2 0.59 1.56 835
## 177 0.53 1.46 9.3 0.60 1.62 840
## 178 0.56 1.35 9.2 0.61 1.60 560
# Summary
summary(wines)
## Alcohol Malic_Acid Ash Ash_Alcanity
## Min. :11.03 Min. :0.740 Min. :1.360 Min. :10.60
## 1st Qu.:12.36 1st Qu.:1.603 1st Qu.:2.210 1st Qu.:17.20
## Median :13.05 Median :1.865 Median :2.360 Median :19.50
## Mean :13.00 Mean :2.336 Mean :2.367 Mean :19.49
## 3rd Qu.:13.68 3rd Qu.:3.083 3rd Qu.:2.558 3rd Qu.:21.50
## Max. :14.83 Max. :5.800 Max. :3.230 Max. :30.00
## Magnesium Total_Phenols Flavanoids Nonflavanoid_Phenols
## Min. : 70.00 Min. :0.980 Min. :0.340 Min. :0.1300
## 1st Qu.: 88.00 1st Qu.:1.742 1st Qu.:1.205 1st Qu.:0.2700
## Median : 98.00 Median :2.355 Median :2.135 Median :0.3400
## Mean : 99.74 Mean :2.295 Mean :2.029 Mean :0.3619
## 3rd Qu.:107.00 3rd Qu.:2.800 3rd Qu.:2.875 3rd Qu.:0.4375
## Max. :162.00 Max. :3.880 Max. :5.080 Max. :0.6600
## Proanthocyanins Color_Intensity Hue OD280
## Min. :0.410 Min. : 1.280 Min. :0.4800 Min. :1.270
## 1st Qu.:1.250 1st Qu.: 3.220 1st Qu.:0.7825 1st Qu.:1.938
## Median :1.555 Median : 4.690 Median :0.9650 Median :2.780
## Mean :1.591 Mean : 5.058 Mean :0.9574 Mean :2.612
## 3rd Qu.:1.950 3rd Qu.: 6.200 3rd Qu.:1.1200 3rd Qu.:3.170
## Max. :3.580 Max. :13.000 Max. :1.7100 Max. :4.000
## Proline
## Min. : 278.0
## 1st Qu.: 500.5
## Median : 673.5
## Mean : 746.9
## 3rd Qu.: 985.0
## Max. :1680.0
# Structure
glimpse(wines)
## Rows: 178
## Columns: 13
## $ Alcohol <dbl> 14.23, 13.20, 13.16, 14.37, 13.24, 14.20, 14.39, …
## $ Malic_Acid <dbl> 1.71, 1.78, 2.36, 1.95, 2.59, 1.76, 1.87, 2.15, 1…
## $ Ash <dbl> 2.43, 2.14, 2.67, 2.50, 2.87, 2.45, 2.45, 2.61, 2…
## $ Ash_Alcanity <dbl> 15.6, 11.2, 18.6, 16.8, 21.0, 15.2, 14.6, 17.6, 1…
## $ Magnesium <int> 127, 100, 101, 113, 118, 112, 96, 121, 97, 98, 10…
## $ Total_Phenols <dbl> 2.80, 2.65, 2.80, 3.85, 2.80, 3.27, 2.50, 2.60, 2…
## $ Flavanoids <dbl> 3.06, 2.76, 3.24, 3.49, 2.69, 3.39, 2.52, 2.51, 2…
## $ Nonflavanoid_Phenols <dbl> 0.28, 0.26, 0.30, 0.24, 0.39, 0.34, 0.30, 0.31, 0…
## $ Proanthocyanins <dbl> 2.29, 1.28, 2.81, 2.18, 1.82, 1.97, 1.98, 1.25, 1…
## $ Color_Intensity <dbl> 5.64, 4.38, 5.68, 7.80, 4.32, 6.75, 5.25, 5.05, 5…
## $ Hue <dbl> 1.04, 1.05, 1.03, 0.86, 1.04, 1.05, 1.02, 1.06, 1…
## $ OD280 <dbl> 3.92, 3.40, 3.17, 3.45, 2.93, 2.85, 3.58, 3.58, 2…
## $ Proline <int> 1065, 1050, 1185, 1480, 735, 1450, 1290, 1295, 10…
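Before visualizing, a quick sketch to confirm the dataset has no missing values, which the rest of the analysis assumes:
# Count missing values per column; all zeros means the data is complete
colSums(is.na(wines))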
This graphical study of the data helps us detect patterns, central tendencies, and probable outliers. Histograms are a useful tool for examining the underlying distribution of each variable, whether alcohol concentration, acidity, or another chemical parameter.
wines %>%
  keep(is.numeric) %>%
  gather() %>%
  ggplot(aes(value)) +
  # Plot the histogram on the density scale so the density curve overlays correctly
  geom_histogram(aes(y = after_stat(density)), colour = "black") +
  geom_density(colour = "red", alpha = 0.5, show.legend = FALSE) +
  facet_wrap(~ key, scales = "free") +
  labs(x = "Values", y = "Density",
       title = "Diverse Attributes of Italian Wines - Histograms") +
  theme_bw()
The density curves above show that several variables are far from normally distributed and appear to contain outliers. We can identify the outliers with boxplots.
boxplot(wines,
las = 2,
col = c("blue")
)
Capping (winsorizing values at the 1.5 × IQR Tukey fences) will be used to treat the outliers, since the other methods we tried introduced new outliers.
# Columns with outliers: Malic_Acid, Ash, Ash_Alcanity, Magnesium, Proanthocyanins, Color_Intensity, Hue
# Cap values beyond the Tukey fences (quartiles +/- threshold * IQR)
cap_outliers <- function(x, threshold = 1.5) {
  q <- quantile(x, c(0.25, 0.75))
  iqr <- q[2] - q[1]
  lower_bound <- q[1] - threshold * iqr
  upper_bound <- q[2] + threshold * iqr
  x[x < lower_bound] <- lower_bound
  x[x > upper_bound] <- upper_bound
  return(x)
}
columns_to_cap <- c("Malic_Acid", "Ash", "Ash_Alcanity", "Magnesium", "Proanthocyanins", "Color_Intensity", "Hue")
wines[columns_to_cap] <- lapply(wines[columns_to_cap], cap_outliers)
We can now verify with another boxplot that the capping worked.
boxplot(wines,
las = 2,
col = c("blue")
)
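As a programmatic check (a small sketch reusing the same Tukey-fence logic), we can count how many values still fall outside the fences; every count should now be zero.
# Count values still outside the 1.5 * IQR fences after capping
outliers_left <- function(x, threshold = 1.5) {
  q <- quantile(x, c(0.25, 0.75))
  iqr <- q[2] - q[1]
  sum(x < q[1] - threshold * iqr | x > q[2] + threshold * iqr)
}
sapply(wines[columns_to_cap], outliers_left)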
To construct a graphical representation of the correlation matrix, we can use the corrgram() function; a corrplot() sketch follows it below.
corrgram(wines, order=TRUE, lower.panel=panel.shade,
upper.panel=panel.pie, text.panel=panel.txt)
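Since the corrplot package is already loaded, here is a sketch of an equivalent view using corrplot() on the correlation matrix:
# Equivalent view with corrplot: circles sized and coloured by correlation strength
corrplot(cor(wines), method = "circle", type = "upper", tl.cex = 0.7)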
Because the variables are measured on very different scales (compare Proline with Hue), we must standardize them so that each contributes equally to the distance calculations.
# Standardization (z-scores): centre each variable and divide by its standard deviation
wines_df <- as.data.frame(scale(wines))
In this section we execute the k-means algorithm and analyze the main components the function returns. Here the data is grouped into two clusters (centers = 2). The kmeans() function also has an nstart option that tries multiple random initial configurations and reports the best one; for example, nstart = 25 generates 25 initial configurations. Using nstart is generally recommended, since a single random start can land in a poor local optimum.
# Execution of k-means with k=2
set.seed(1234)
wines_df1 <- kmeans(wines_df, centers = 2, nstart = 25)
par(mfrow = c(1, 2))
# plot1: first two (standardized) variables coloured by cluster,
# with the cluster centres overlaid in the same coordinates
plot(as.matrix(wines_df), col = wines_df1$cluster,
     main = "K-means with 2 clusters", xlab = "", ylab = "")
points(wines_df1$centers[, 1:2], col = 1:2, pch = 16, cex = 2)
# plot2: discriminant projection of the clusters (fpc::plotcluster)
plotcluster(wines_df, wines_df1$cluster)
par(mfrow = c(1, 1))
You can also view the results using fviz_cluster(), which shows the clusters clearly. If there are more than two dimensions (variables), fviz_cluster() performs principal component analysis (PCA) and plots the data points according to the first two principal components, which explain most of the variance.
fviz_cluster(wines_df1, data = wines_df)
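To see what fviz_cluster() is doing under the hood, here is a minimal sketch that performs the same projection by hand: run PCA on the standardized data, keep the first two components, and colour the scores by cluster assignment.
# Hand-rolled version of the fviz_cluster() projection
pca_scores <- as.data.frame(prcomp(wines_df)$x[, 1:2])
pca_scores$cluster <- factor(wines_df1$cluster)
ggplot(pca_scores, aes(PC1, PC2, colour = cluster)) +
  geom_point() +
  labs(title = "Clusters on the first two principal components") +
  theme_bw()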
We can use the elbow, silhouette, and gap statistic methods to find the optimal number of clusters.
set.seed(1234)
plot1 <- fviz_nbclust(wines_df, kmeans, method = "wss") +
  geom_vline(xintercept = 3, linetype = 5) +
  labs(subtitle = "Elbow method plot")
plot2 <- fviz_nbclust(wines_df, kmeans, method = "silhouette") +
  labs(subtitle = "Silhouette method plot")
plot3 <- fviz_nbclust(wines_df, kmeans, nstart = 25, method = "gap_stat", nboot = 50) +
  labs(subtitle = "Gap statistic method plot")
# Display all three diagnostics together
grid.arrange(plot1, plot2, plot3, ncol = 2)
All three methods estimated K = 3.
set.seed(1234)
wines_k3 <- kmeans(wines_df, centers = 3, nstart = 25)
plot(as.matrix(wines_df), col = wines_k3$cluster,
     main = "K-means with 3 clusters", xlab = "", ylab = "")
clusplot(wines_df, wines_k3$cluster, color = TRUE, shade = TRUE, labels = 2, lines = 0)
fviz_cluster(wines_k3, data = wines_df)
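As a further check on the k = 3 solution (a brief sketch using cluster::silhouette()), we can compute the average silhouette width; values closer to 1 indicate well-separated clusters.
# Average silhouette width for the k = 3 solution
sil <- silhouette(wines_k3$cluster, dist(wines_df))
mean(sil[, "sil_width"])
fviz_silhouette(sil)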
Next, we use PCA to reduce the dimensionality of the data and identify the principal components that explain the bulk of the variation; as the summary below shows, the first six components capture roughly 85% of it. This lets us display the dataset in a lower-dimensional space and expose the underlying factors that shape wine characteristics. Applying clustering and PCA together reveals concealed trends and yields important insights into wine categories and the key drivers of variability, which makes these approaches powerful tools for interpreting complex, unlabeled data and making data-driven decisions.
# Rebuild the data frame from the capped (but unscaled) data; prcomp() will centre and scale it
wines_df <- as.data.frame(apply(wines, 2, as.numeric))
wines.pca <- prcomp(wines_df, center = TRUE, scale. = TRUE)
# Summary
summary(wines.pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.1776 1.5966 1.1910 0.95818 0.90477 0.79118 0.72394
## Proportion of Variance 0.3648 0.1961 0.1091 0.07062 0.06297 0.04815 0.04032
## Cumulative Proportion 0.3648 0.5609 0.6700 0.74059 0.80356 0.85171 0.89203
## PC8 PC9 PC10 PC11 PC12 PC13
## Standard deviation 0.59732 0.53873 0.50104 0.47771 0.41529 0.32392
## Proportion of Variance 0.02745 0.02233 0.01931 0.01755 0.01327 0.00807
## Cumulative Proportion 0.91947 0.94180 0.96111 0.97866 0.99193 1.00000
Eigenvalues show the amount of variance explained by each principal component. In PCA, eigenvalues are used to determine the significance of each component in capturing the variation in the data. Eigenvectors indicate the direction of the principal components in the original feature space. These vectors represent the weights assigned to each original feature in the linear combination that forms the principal components.
cov_matrix <- cov(wines_df)
eigens <- eigen(cov_matrix)
# eigenvalues
values <- eigens$values
# eigenvectors
vectors <- eigens$vectors
# Sort the eigenvectors by their corresponding eigenvalues in descending order
sorted_indices <- order(values, decreasing = TRUE)
sorted_values <- values[sorted_indices]
sorted_vectors <- vectors[, sorted_indices]
print(sorted_values)
## [1] 9.920072e+04 1.496058e+02 9.062181e+00 4.745043e+00 1.172489e+00
## [6] 8.165326e-01 2.768415e-01 1.466982e-01 1.113144e-01 7.192301e-02
## [11] 3.569606e-02 2.057526e-02 8.339814e-03
print(sorted_vectors)
## [,1] [,2] [,3] [,4] [,5]
## [1,] -0.0016593080 -0.0027761662 0.016391489 -0.142874926 0.036172691
## [2,] 0.0006677014 -0.0041934002 0.125965256 -0.173165435 -0.575191774
## [3,] -0.0001924218 -0.0054879766 0.049630777 0.010219881 0.017289952
## [4,] 0.0046835201 -0.0273975429 0.940194011 0.325951829 0.067651990
## [5,] -0.0175545210 -0.9991777062 -0.033020551 0.012286692 -0.007665888
## [6,] -0.0009898384 -0.0011231922 -0.044181380 0.081738936 0.324478814
## [7,] -0.0015673086 -0.0003016742 -0.091062158 0.178002706 0.542972954
## [8,] 0.0001230822 0.0013910539 0.014031046 -0.011389299 -0.030189061
## [9,] -0.0006036117 -0.0030055900 -0.025325702 0.055591726 0.258662209
## [10,] -0.0023486151 -0.0223590143 0.281266974 -0.873709189 0.350631946
## [11,] -0.0001729932 0.0012750789 -0.026753256 0.061437939 0.047658053
## [12,] -0.0007049313 0.0043345027 -0.073585741 0.185372761 0.264136814
## [13,] -0.9998283896 0.0174704364 0.004630613 0.002946757 -0.002346365
## [,6] [,7] [,8] [,9] [,10]
## [1,] -0.196704317 0.9395376757 -2.228308e-01 7.673033e-02 -1.348560e-02
## [2,] -0.768418188 -0.1576080436 4.749577e-02 2.252255e-02 -2.152715e-02
## [3,] -0.035772489 0.0404659068 1.273863e-01 8.049553e-02 -3.466877e-02
## [4,] 0.024782607 0.0321668955 -8.992161e-03 1.083829e-03 -2.294613e-04
## [5,] 0.002953274 0.0008509383 1.616223e-03 -2.357219e-03 5.371658e-05
## [6,] -0.264553961 -0.0267135405 1.701381e-01 2.941137e-01 8.354143e-01
## [7,] -0.411020461 -0.0668437914 2.230388e-01 3.666183e-01 -5.405048e-01
## [8,] 0.020382943 -0.0021211328 -6.955413e-03 3.989250e-02 4.017669e-02
## [9,] -0.221260705 -0.2480629996 -8.917215e-01 -8.768485e-02 3.162450e-02
## [10,] 0.012386957 -0.1214536445 7.861507e-02 -1.006349e-01 -1.781824e-02
## [11,] 0.025924990 0.0297311478 7.229968e-03 3.109937e-02 -5.261243e-02
## [12,] -0.282512325 0.0860132434 2.272408e-01 -8.636159e-01 4.889195e-02
## [13,] 0.001092317 -0.0010363146 -2.020321e-05 -4.945129e-05 3.521567e-05
## [,11] [,12] [,13]
## [1,] 0.0248170506 -0.0130591310 8.131430e-03
## [2,] 0.0175240682 0.0656060960 -1.175961e-02
## [3,] -0.9589882511 -0.1447970565 -1.721996e-01
## [4,] 0.0516359983 0.0065153954 1.254503e-03
## [5,] 0.0033932152 0.0007572426 2.441334e-03
## [6,] 0.0317662010 0.0195926559 -2.481482e-02
## [7,] 0.0952058366 -0.0482260769 6.406239e-02
## [8,] -0.1950901050 0.1421737938 9.678908e-01
## [9,] -0.1213782382 -0.0085770592 -1.410269e-02
## [10,] 0.0021520616 0.0500962659 -5.389182e-03
## [11,] -0.1120031169 0.9741206613 -1.625958e-01
## [12,] -0.0364047564 0.0099397349 4.401880e-02
## [13,] 0.0002459432 -0.0001037714 3.637047e-05
PCA involves computing the eigenvectors and eigenvalues of the covariance matrix of the dataset and then projecting the data onto the reduced set of dimensions defined by the selected principal components. Note that the eigen decomposition above was computed on the unscaled covariance matrix, so its first eigenvalue is dominated by Proline, whose variance dwarfs that of every other variable; get_eigenvalue() below reports the eigenvalues of wines.pca, which was fitted on standardized data, which is why the values differ.
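As a minimal sketch of that projection (assuming wines.pca from above), we can standardize the data, multiply by the rotation (loading) matrix, and confirm the result matches the scores prcomp() stored in wines.pca$x:
# Project the standardized data onto the principal axes by hand
scores_manual <- scale(wines_df) %*% wines.pca$rotation
# Maximum absolute difference from prcomp()'s scores; should be ~0
max(abs(scores_manual - wines.pca$x))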
eig.val <- get_eigenvalue(wines.pca)
eig.val
## eigenvalue variance.percent cumulative.variance.percent
## Dim.1 4.7418544 36.4758028 36.47580
## Dim.2 2.5492707 19.6097746 56.08558
## Dim.3 1.4184536 10.9111819 66.99676
## Dim.4 0.9181091 7.0623779 74.05914
## Dim.5 0.8186077 6.2969826 80.35612
## Dim.6 0.6259587 4.8150671 85.17119
## Dim.7 0.5240955 4.0315036 89.20269
## Dim.8 0.3567893 2.7445334 91.94722
## Dim.9 0.2902270 2.2325157 94.17974
## Dim.10 0.2510438 1.9311063 96.11085
## Dim.11 0.2282063 1.7554330 97.86628
## Dim.12 0.1724624 1.3266336 99.19291
## Dim.13 0.1049214 0.8070876 100.00000
fviz_pca_ind(wines.pca, geom = "point", ggtheme = theme_classic())
A scree plot is a graphical representation of the eigenvalues corresponding to each principal component. It helps in visualizing the amount of variance explained by each component.
fviz_eig(wines.pca, addlabels = TRUE, ylim = c(0, 50))
PCA serves as a powerful tool for dimensionality reduction and data exploration, offering valuable insights into the underlying patterns and structures present in the dataset.
fviz_pca_ind(wines.pca)
A biplot helps visualize the relationships between variables and observations simultaneously.
variable_colors <- c("#E41A1C", "#377EB8", "#4DAF4A", "#FF7F00", "#FDBF6F",
"#A65628", "#984EA3", "#999999", "#DEB887", "#C0C0C0",
"#000080", "#4682B4", "#FFA07A")
# Create a biplot with variable colors
fviz_pca_biplot(wines.pca, geom = "point", col.var = variable_colors)
Finally, we can view three of the attributes in a 3-D scatter plot, coloured by the k = 3 cluster assignment (this requires the plotly package).
library(plotly)
plot_ly(wines_df, x = ~Ash, y = ~Hue, z = ~Malic_Acid,
        type = "scatter3d", mode = "markers", color = factor(wines_k3$cluster))
In conclusion, both k-means clustering and principal component analysis (PCA) offer valuable insights into the structure of a dataset. K-means identifies distinct groups based on similarities between data points, while PCA aids dimensionality reduction and reveals underlying patterns. By combining the elbow, silhouette, and gap statistic methods with PCA-based visualizations, we can determine the most suitable number of clusters for our data, enhancing our understanding of complex datasets and supporting more informed, data-driven decisions.