library(readr)
wine_clustering <- read_csv("~/Desktop/MVA/HW2/wine-clustering.csv")
## Rows: 178 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (13): Alcohol, Malic_Acid, Ash, Ash_Alcanity, Magnesium, Total_Phenols, Flavanoids, Nonflavanoid_Phenols, Proant...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
View(wine_clustering)

wine_clustering$ID <- seq(1, nrow(wine_clustering))
# Adding ID numbers so I can filter later. 

head(wine_clustering)
## # A tibble: 6 × 14
##   Alcohol Malic_Acid   Ash Ash_Alcanity Magnesium Total_Phenols Flavanoids Nonflavanoid_Phenols Proanthocyanins
##     <dbl>      <dbl> <dbl>        <dbl>     <dbl>         <dbl>      <dbl>                <dbl>           <dbl>
## 1    14.2       1.71  2.43         15.6       127          2.8        3.06                 0.28            2.29
## 2    13.2       1.78  2.14         11.2       100          2.65       2.76                 0.26            1.28
## 3    13.2       2.36  2.67         18.6       101          2.8        3.24                 0.3             2.81
## 4    14.4       1.95  2.5          16.8       113          3.85       3.49                 0.24            2.18
## 5    13.2       2.59  2.87         21         118          2.8        2.69                 0.39            1.82
## 6    14.2       1.76  2.45         15.2       112          3.27       3.39                 0.34            1.97
## # ℹ 5 more variables: Color_Intensity <dbl>, Hue <dbl>, OD280 <dbl>, Proline <dbl>, ID <int>

Dataset overview

These data are the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wines.

Unit of observation: a single wine sample grown in Italy. Sample size: the dataset includes 178 wines measured on 13 variables.

The dataset was retrieved from https://www.kaggle.com/datasets/harrywang/wine-dataset-for-clustering.

Description of Variables

Data manipulation

# Checking for NAs
colSums(is.na(wine_clustering))
##              Alcohol           Malic_Acid                  Ash         Ash_Alcanity            Magnesium 
##                    0                    0                    0                    0                    0 
##        Total_Phenols           Flavanoids Nonflavanoid_Phenols      Proanthocyanins      Color_Intensity 
##                    0                    0                    0                    0                    0 
##                  Hue                OD280              Proline                   ID 
##                    0                    0                    0                    0
#There are no NAs, perfect!

# Checking for and removing exact duplicate rows (I used ChatGPT for help):
wine_clustering <- wine_clustering[!duplicated(wine_clustering), ]
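
As a quick sanity check (a sketch of a step that could be run just before the line above; it is not part of the original output), counting exact duplicates makes it explicit how many rows were dropped:

# Count exact duplicate rows before deduplication; given that 177 rows remain
# after the single outlier is removed below, this should be zero here.
sum(duplicated(wine_clustering))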

Research question:

Can we cluster wines into those that age well vs. those best consumed young?

Let me first clarify how I distinguish high-quality (age-worthy) wines from low-quality (drink-young) wines in this analysis: wines that are relatively rich in phenolic compounds (Proanthocyanins, Flavanoids, OD280), in Proline and in Alcohol are treated as the high-quality, age-worthy profile, while wines low in these constituents are treated as the low-quality, drink-young profile.

To find this out, I will cluster on these variables: Proanthocyanins, Flavanoids, Proline, Alcohol and OD280.

Variable standardization, dissimilarity & removing outliers

wine_clustering_std <- as.data.frame(scale(wine_clustering[c("Proanthocyanins",
                                                             "Flavanoids",
                                                             "Proline",
                                                             "OD280",
                                                             "Alcohol")]))
# Finding outliers with a simple dissimilarity score: the Euclidean norm of each
# wine's raw (unscaled) values on the five clustering variables
wine_clustering$Dissimilarity <- sqrt(wine_clustering$Proanthocyanins^2 + 
                                        wine_clustering$Flavanoids^2 + 
                                        wine_clustering$Proline^2 + 
                                        wine_clustering$OD280^2 + 
                                        wine_clustering$Alcohol^2)
head(wine_clustering[order(-wine_clustering$Dissimilarity), c("Alcohol", "Dissimilarity")])
## # A tibble: 6 × 2
##   Alcohol Dissimilarity
##     <dbl>         <dbl>
## 1    14.2         1680.
## 2    14.4         1547.
## 3    13.6         1515.
## 4    14.1         1510.
## 5    14.4         1480.
## 6    14.2         1450.

Dissimilarity here is simply the Euclidean norm of each wine’s raw values on the five selected variables, so higher values indicate a more extreme overall profile (the score is dominated by Proline, which has by far the largest raw scale). As we can see, the wine with an alcohol level of 14.19% stands out: its dissimilarity (about 1680) is well above that of the next wine, even though other wines have higher alcohol but smaller dissimilarity. This wine may be an outlier, so I am going to filter it out.
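
Because Proline dominates the raw-value score, a sketch of an alternative check is the Euclidean distance of each wine from the centroid of the standardized variables created above (illustrative only; it is not used for the filtering below):

# Distance of each standardized wine from the overall centroid (the origin after
# scaling); large values flag wines with extreme overall profiles on equal footing.
std_dist <- sqrt(rowSums(wine_clustering_std^2))
head(sort(std_dist, decreasing = TRUE))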

# Filtering out the potential outlier (row 19)
library(dplyr)  # needed for the pipe and filter()
wine_clustering <- wine_clustering %>%
  filter(!(row_number() %in% c(19)))

# Recomputing the dissimilarity score without the potential outlier
wine_clustering$Dissimilarity <- sqrt(wine_clustering$Proanthocyanins^2 + 
                                        wine_clustering$Flavanoids^2 + 
                                        wine_clustering$Proline^2 + 
                                        wine_clustering$OD280^2 + 
                                        wine_clustering$Alcohol^2)

# The largest remaining dissimilarity values
head(wine_clustering[order(-wine_clustering$Dissimilarity), c("Alcohol", "Dissimilarity")])
## # A tibble: 6 × 2
##   Alcohol Dissimilarity
##     <dbl>         <dbl>
## 1    14.4         1547.
## 2    13.6         1515.
## 3    14.1         1510.
## 4    14.4         1480.
## 5    14.2         1450.
## 6    13.8         1375.

Euclidean Distances

library(factoextra) 

# Finding Euclidean distances, then saving them into the object Distances
Distances <- get_dist(wine_clustering_std, 
                      method = "euclidean")

fviz_dist(Distances, #Showing matrix of distances
          gradient = list(low = "darkred",
                          mid = "grey95",
                          high = "white"))

library(factoextra) 
get_clust_tendency(wine_clustering_std, # Hopkins statistic
                   n = nrow(wine_clustering) - 1,
                   graph = FALSE) 
## $hopkins_stat
## [1] 0.7266345
## 
## $plot
## NULL

Explanation:

The Hopkins statistic, which measures clustering tendency, is 0.73. Values near 1 indicate a clusterable structure, while values around 0.5 would suggest uniformly random data, so 0.73 indicates that the data are a good candidate for cluster analysis. The visual inspection of the Euclidean distance matrix points the same way: roughly three distinct blocks ("squares") appear along the diagonal, further supporting the presence of well-defined clusters in the data.
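
For context, a small sketch (not part of the original analysis) that illustrates the "no structure" benchmark: the same statistic computed on uniformly random data of the same shape should land near 0.5.

# Hopkins statistic for uniform random data spanning the same ranges as the
# standardized variables; a value near 0.5 means no clustering tendency.
set.seed(1)
random_data <- as.data.frame(apply(wine_clustering_std, 2,
                                   function(x) runif(length(x), min(x), max(x))))
get_clust_tendency(random_data, n = nrow(random_data) - 1, graph = FALSE)$hopkins_stat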

Identifying clusters

library(factoextra)
library(NbClust)

fviz_nbclust(wine_clustering_std, kmeans, method = "wss") +
  labs(subtitle = "Elbow method")

fviz_nbclust(wine_clustering_std, kmeans, method = "silhouette")+
  labs(subtitle = "Silhouette analysis")

I used both the elbow method and silhouette analysis. The elbow method plots the total within-cluster sum of squares against the number of clusters and looks for the bend where adding further clusters stops paying off; silhouette analysis looks for the number of clusters that maximises the average silhouette width. Both plots point to three clusters, suggesting that k = 3 captures the structure of the data set effectively while keeping the solution simple.
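
As an optional cross-check (a sketch; it was not run in the original analysis), factoextra also supports the gap statistic as a third criterion:

# Gap statistic with a modest number of bootstrap samples to keep it quick
fviz_nbclust(wine_clustering_std, kmeans, method = "gap_stat", nboot = 50) +
  labs(subtitle = "Gap statistic")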

Let’s verify this choice of the optimal number of clusters with NbClust, which computes a large set of validity indices and applies a majority rule:

library(NbClust)
nc <- NbClust(wine_clustering_std, distance = "euclidean",
              min.nc = 2, max.nc = 10,
              method = "kmeans", index = "all")

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
## 

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 4 proposed 2 as the best number of clusters 
## * 16 proposed 3 as the best number of clusters 
## * 1 proposed 6 as the best number of clusters 
## * 2 proposed 10 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  3 
##  
##  
## *******************************************************************

According to NbClust’s majority rule (16 of the 23 indices that proposed a number), the best number of clusters is 3, as I assumed.
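
The vote tally behind the majority rule can also be inspected directly from the NbClust result (a sketch, assuming the usual structure of the returned object, where the first row of `Best.nc` holds each index’s proposed number of clusters):

# Tabulate how many indices voted for each number of clusters
# (graphical indices such as Hubert and Dindex may report 0 here)
table(nc$Best.nc[1, ])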

So now let’s perform clustering with 3 clusters and visualise them:

# Tip: calling set.seed() before kmeans() would make the result reproducible
Clustering <- kmeans(wine_clustering_std,
                     centers = 3,
                     nstart = 25)

Clustering
## K-means clustering with 3 clusters of sizes 61, 60, 57
## 
## Cluster means:
##   Proanthocyanins Flavanoids    Proline      OD280    Alcohol
## 1       0.6067834  0.9267060  1.1353067  0.7325703  0.9166196
## 2       0.2232302  0.1615814 -0.7708399  0.4232188 -1.0272531
## 3      -0.8843438 -1.1618237 -0.4035669 -1.2294722  0.1003752
## 
## Clustering vector:
##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [58] 1 1 3 3 3 3 2 2 2 2 2 3 2 3 1 3 1 2 2 2 2 2 2 2 2 2 3 2 2 2 2 2 2 2 2 3 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [115] 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3
## [172] 3 3 3 3 3 3 3
## 
## Within cluster sum of squares by cluster:
## [1] 121.71535 120.22832  73.79202
##  (between_SS / total_SS =  64.3 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss" "betweenss"    "size"        
## [8] "iter"         "ifault"
library(factoextra)
fviz_cluster(Clustering,
             palette = "Set1",
             repel = FALSE,
             ggtheme = theme_bw(),
             data = wine_clustering_std)

These three clusters look convincing: they are reasonably compact within clusters and well separated between clusters (between_SS / total_SS ≈ 64.3%), and no single point appears to distort the solution, which suggests meaningful and consistent groupings in the data.
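
To back this visual impression with a number, here is a sketch (the cluster package is an extra dependency not loaded in the original analysis) of the average silhouette width per cluster, computed from the k-means labels and the distance matrix created above:

library(cluster)
# Silhouette widths based on the standardized data and its Euclidean distances
sil <- silhouette(Clustering$cluster, Distances)
summary(sil)$clus.avg.widths  # one average width per cluster; higher is better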


Cluster creation and Cluster Analysis

Averages <- Clustering$centers

Averages
##   Proanthocyanins Flavanoids    Proline      OD280    Alcohol
## 1       0.6067834  0.9267060  1.1353067  0.7325703  0.9166196
## 2       0.2232302  0.1615814 -0.7708399  0.4232188 -1.0272531
## 3      -0.8843438 -1.1618237 -0.4035669 -1.2294722  0.1003752

Explanation of results:

  • Cluster 1 displays positive averages on all five variables, i.e. above-average Proanthocyanins, Flavanoids, Proline, OD280 and Alcohol. This matches the high-quality profile defined above.

  • Cluster 2 displays averages close to zero for Proanthocyanins, Flavanoids and OD280 (slightly above average) and clearly negative averages for Proline and Alcohol. This indicates mid-quality wines: neither clearly high-quality nor clearly low-quality.

  • Cluster 3 displays clearly negative averages for Proanthocyanins, Flavanoids, Proline and OD280, with Alcohol roughly at the overall average. This matches the low-quality profile (a sketch for converting these standardized centres back to original units follows below).
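
Since these averages are standardized, here is a small sketch for reading the centres in the original units. It is approximate, because wine_clustering has already had the outlier row removed while the clustering used all 178 wines:

# Undo the z-scoring: multiply by the standard deviations and add back the means
scaled_mat <- scale(wine_clustering[c("Proanthocyanins", "Flavanoids",
                                      "Proline", "OD280", "Alcohol")])
centers_original <- sweep(Clustering$centers, 2, attr(scaled_mat, "scaled:scale"), `*`)
centers_original <- sweep(centers_original, 2, attr(scaled_mat, "scaled:center"), `+`)
round(centers_original, 2)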

Figure <- as.data.frame(Averages)
Figure$ID <- 1:nrow(Figure)

library(tidyr)
Figure <- pivot_longer(Figure, cols = c("Proanthocyanins",
                                        "Flavanoids",
                                        "Proline",
                                        "OD280",
                                        "Alcohol"))

Figure$Group <- factor(Figure$ID, 
                       levels = c(1, 2, 3), 
                       labels = c("1", "2", "3"))

Figure$NameF <- factor(Figure$name, 
                       levels = c("Proanthocyanins",
                                  "Flavanoids",
                                  "Proline",
                                  "OD280",
                                  "Alcohol"))

library(ggplot2)
ggplot(Figure, aes(x = NameF, y = value)) +
  geom_hline(yintercept = 0) +
  theme_bw() +
  geom_point(aes(shape = Group, col = Group), size = 3) +
  geom_line(aes(group = ID), linewidth = 1) +
  ylab("Averages") +
  xlab("Cluster variables")+
  ylim(-2.2, 2.2) +
  theme(axis.text.x = element_text(angle = 45, vjust = 0.50, size = 10))

Explanation of results:

This graph shows the distinctions described above. Wines in Group 1 have distinctly above-average levels on all five variables, Group 2 hovers around the average, and Group 3 sits distinctly below it. This reinforces the interpretation that Group 1 consists of high-quality wines, Group 2 of mid-quality wines, and Group 3 of low-quality wines.

# I was facing an issue here: the standardized data used for k-means were built
# before the outlier was removed, so wine_clustering (177 rows) and the clustering
# solution (178 observations) no longer match. The checks below confirm the
# mismatch, and k-means is then re-run on the 177 remaining wines.

nrow(wine_clustering) 
## [1] 177
nrow(Clustering)  # Clustering is a list, not a data frame, so nrow() returns NULL
## NULL
wine_clustering$Group <- Clustering$cluster[1:nrow(wine_clustering)]  # temporary; overwritten below

# Re-running k-means on the 177 wines. Note that this call uses the raw
# (unstandardized) variables and Total_Phenols in place of Proanthocyanins.
Clustering <- kmeans(wine_clustering[, c("Total_Phenols", "Flavanoids", "Proline", "Alcohol", "OD280")], centers = 3)
wine_clustering$Group <- Clustering$cluster
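
For reference, a cleaner fix would be to rebuild the standardized data after the outlier is removed and cluster that, so the grouping and the data frame describe the same 177 wines and the same five variables. This is only a sketch (the object names and the seed are illustrative); the results below come from the re-run above, not from this code:

# Re-standardize the 177 remaining wines on the original five clustering variables
wine_std_177 <- as.data.frame(scale(wine_clustering[c("Proanthocyanins", "Flavanoids",
                                                      "Proline", "OD280", "Alcohol")]))
set.seed(123)  # arbitrary seed, for reproducibility
Clustering_std <- kmeans(wine_std_177, centers = 3, nstart = 25)
# wine_clustering$Group <- Clustering_std$cluster  # would replace the grouping used below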

Checking whether the clustering variables successfully differentiate between the groups

wine_clustering$Group <- Clustering$cluster

fit <- aov(cbind(Proanthocyanins, Flavanoids, Proline, OD280, Alcohol) ~ as.factor(Group), 
           data = wine_clustering)

summary(fit)
##  Response Proanthocyanins :
##                   Df Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(Group)   2  6.537  3.2683  11.069 2.984e-05 ***
## Residuals        174 51.375  0.2953                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response Flavanoids :
##                   Df  Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(Group)   2  60.176 30.0879  46.417 < 2.2e-16 ***
## Residuals        174 112.787  0.6482                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response Proline :
##                   Df   Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(Group)   2 14579239 7289619  604.67 < 2.2e-16 ***
## Residuals        174  2097663   12056                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response OD280 :
##                   Df Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(Group)   2 16.673  8.3367  20.006 1.512e-08 ***
## Residuals        174 72.506  0.4167                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response Alcohol :
##                   Df Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(Group)   2 45.570 22.7852  56.913 < 2.2e-16 ***
## Residuals        174 69.661  0.4004                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Explanation of results:

The low p-values (< 0.001) for all five variables show that the groups differ significantly on Proanthocyanins, Flavanoids, Proline, OD280 and Alcohol, i.e. each cluster has a distinct profile on the key quality indicators. Some of this separation is expected by construction, because the clusters were built on closely related variables, but the results confirm that the grouping distinguishes the wines consistently on these indicators.
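
To quantify how strongly the grouping separates each variable, here is a sketch (not part of the original analysis) of eta-squared per variable, i.e. the group sum of squares divided by the total sum of squares from tables like those above:

# Eta-squared per clustering variable: share of variance explained by the grouping
vars <- c("Proanthocyanins", "Flavanoids", "Proline", "OD280", "Alcohol")
eta_sq <- sapply(vars, function(v) {
  a <- anova(lm(wine_clustering[[v]] ~ as.factor(wine_clustering$Group)))
  a$`Sum Sq`[1] / sum(a$`Sum Sq`)
})
round(eta_sq, 2)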

Criterion Validity

To evaluate the criterion validity of my clustering solution, I selected alcohol level as the numerical benchmark variable. Alcohol level summarises several upstream factors of wine production (it is driven largely by grape ripeness and sugar at harvest), so comparing the clusters on this variable helps link them to high-quality versus low-quality flavour profiles. One caveat: Alcohol was also one of the clustering variables, so it is only a partially external criterion.

Aggregating the mean alcohol level by cluster group

# Aggregate the mean Alcohol level by cluster group
aggregate(wine_clustering$Alcohol,
          by = list(Cluster = wine_clustering$Group),
          FUN = mean)
##   Cluster        x
## 1       1 13.79609
## 2       2 12.51667
## 3       3 12.92984
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
# Levene's test for homogeneity of variance across the cluster groups

leveneTest(wine_clustering$Alcohol, as.factor(wine_clustering$Group))
## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value  Pr(>F)  
## group   2   4.457 0.01295 *
##       174                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Explanation of results:

The Levene’s Test checks whether the variances across groups are homogeneous.

Hypotheses:
H0: The variances of the groups are equal (homogeneity of variance).
H1: The variances of the groups are not equal (heterogeneity of variance).

Since the p-value (0.013) is less than the significance level (0.05), we reject the null hypothesis. There is significant evidence that the variances differ across the groups (variance is not homogeneous).

Because the homogeneity-of-variance assumption of ANOVA is not met, I will use a non-parametric alternative, the Kruskal-Wallis test:

kruskal.test(Alcohol ~ as.factor(Group), 
             data = wine_clustering)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  Alcohol by as.factor(Group)
## Kruskal-Wallis chi-squared = 70.556, df = 2, p-value = 4.776e-16

Explanation:

H0: The distribution of Alcohol is the same across all groups.
H1: At least one group’s Alcohol distribution is different.

Since the p-value is much smaller than 0.05, we reject the null hypothesis: there is a statistically significant difference in alcohol content among the three groups.

Now I want to check which groups differ from each other, using pairwise Wilcoxon tests with a Bonferroni correction:

pairwise.wilcox.test(wine_clustering$Alcohol, 
                     as.factor(wine_clustering$Group), 
                     p.adjust.method = "bonferroni")
## 
##  Pairwise comparisons using Wilcoxon rank sum test with continuity correction 
## 
## data:  wine_clustering$Alcohol and as.factor(wine_clustering$Group) 
## 
##   1       2     
## 2 1.6e-14 -     
## 3 2.4e-09 0.0029
## 
## P value adjustment method: bonferroni

We see statistically significant differences between all three groups, which is to be expected, since alcohol level is one of the characteristics that separates high-quality from low-quality wines (and was itself one of the clustering variables).
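
As a quick visual companion to these tests (a sketch, not part of the original output), boxplots of alcohol content by cluster group show the same pattern:

# Alcohol distribution per cluster group; ggplot2 was already loaded above
ggplot(wine_clustering, aes(x = as.factor(Group), y = Alcohol)) +
  geom_boxplot() +
  labs(x = "Cluster group", y = "Alcohol (%)") +
  theme_bw()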

Conclusion:

So let us name the clusters according to this analysis:

Cluster 1: High-quality wines, which would include Italian reds such as: - Barolo DOCG (Piedmont) - Brunello di Montalcino DOCG (Tuscany) - Amarone della Valpolicella DOCG (Veneto) - Barbaresco DOCG (Piedmont) - Super Tuscans (Tuscany)

Cluster 2: Mid-quality wines, which would include Italian sparkling and rosé wines such as: - Prosecco DOC (Veneto) - Lambrusco di Sorbara DOC (Emilia-Romagna) - Rosato Salento IGT (Puglia)

Cluster 3: Low-quality wines, which would include Italian whites such as: - Vino da Tavola Bianco - Trebbiano IGT (mass-produced) - Soave (low-end, supermarket bottles)