library(readr)
wine_clustering <- read_csv("~/Desktop/MVA/HW2/wine-clustering.csv")
## Rows: 178 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (13): Alcohol, Malic_Acid, Ash, Ash_Alcanity, Magnesium, Total_Phenols, Flavanoids, Nonflavanoid_Phenols, Proant...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
View(wine_clustering)
wine_clustering$ID <- seq(1, nrow(wine_clustering))
# Adding ID numbers so I can filter later.
head(wine_clustering)
## # A tibble: 6 × 14
## Alcohol Malic_Acid Ash Ash_Alcanity Magnesium Total_Phenols Flavanoids Nonflavanoid_Phenols Proanthocyanins
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 14.2 1.71 2.43 15.6 127 2.8 3.06 0.28 2.29
## 2 13.2 1.78 2.14 11.2 100 2.65 2.76 0.26 1.28
## 3 13.2 2.36 2.67 18.6 101 2.8 3.24 0.3 2.81
## 4 14.4 1.95 2.5 16.8 113 3.85 3.49 0.24 2.18
## 5 13.2 2.59 2.87 21 118 2.8 2.69 0.39 1.82
## 6 14.2 1.76 2.45 15.2 112 3.27 3.39 0.34 1.97
## # ℹ 5 more variables: Color_Intensity <dbl>, Hue <dbl>, OD280 <dbl>, Proline <dbl>, ID <int>
These data are the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wines.
Unit of observation: one sample of wine grown in Italy. Sample size: the dataset includes 178 units with 13 variables.
The dataset was retrieved from https://www.kaggle.com/datasets/harrywang/wine-dataset-for-clustering.
Alcohol – The alcohol content of the wine (%).
Malic_Acid – The concentration of malic acid, which influences wine acidity.
Ash – The amount of ash residue left after burning, indicating mineral content.
Ash_Alcanity – The alkalinity of the ash, related to pH levels in wine.
Magnesium – The magnesium content in wine (measured in mg/L).
Total_Phenols – The total amount of phenolic compounds, affecting taste and antioxidant properties.
Flavanoids – A subclass of phenols, contributing to flavor, color, and bitterness.
Nonflavanoid_Phenols – Phenols that do not belong to the flavonoid group, influencing bitterness and astringency.
Proanthocyanins – A class of polyphenols that contribute to wine’s color and aging properties.
Color_Intensity – A measure of the depth and intensity of the wine’s color.
Hue – The tint or shade of the wine (usually on a scale where higher values indicate redder wines and lower values indicate browner tones).
OD280 – Optical Density (absorbance at 280 nm), related to the wine’s protein and phenol content.
Proline – An amino acid found in wine, often associated with the age and quality of wine.
# Checking for NAs
colSums(is.na(wine_clustering))
## Alcohol Malic_Acid Ash Ash_Alcanity Magnesium
## 0 0 0 0 0
## Total_Phenols Flavanoids Nonflavanoid_Phenols Proanthocyanins Color_Intensity
## 0 0 0 0 0
## Hue OD280 Proline ID
## 0 0 0 0
#There are no NAs, perfect!
#I'm going to check for duplicates (I used ChatGPT for help):
wine_clustering <- wine_clustering[!duplicated(wine_clustering), ]
Can we cluster wines into those that age well vs. those best consumed young?
Let me first clarify the difference between high-quality and low-quality wine:
High-quality wines: High Total Phenols, Flavanoids, Proline, and OD280 values and moderate Alcohol content (not too high or low).
Low-quality wines: Low Total Phenols, Flavanoids, Proline, and OD280 values and either too high or too low Alcohol content.
To find this out, I will cluster these variables: Proanthocyanins, Flavanoids, Proline, Alcohol and OD280.
wine_clustering_std <- as.data.frame(scale(wine_clustering[c("Proanthocyanins",
"Flavanoids",
"Proline",
"OD280",
"Alcohol")]))
#Finding outliers
wine_clustering$Dissimilarity <- sqrt(wine_clustering$Proanthocyanins^2 +
wine_clustering$Flavanoids^2 +
wine_clustering$Proline^2 +
wine_clustering$OD280^2 +
wine_clustering$Alcohol^2)
head(wine_clustering[order(-wine_clustering$Dissimilarity), c("Alcohol", "Dissimilarity")])
## # A tibble: 6 × 2
## Alcohol Dissimilarity
## <dbl> <dbl>
## 1 14.2 1680.
## 2 14.4 1547.
## 3 13.6 1515.
## 4 14.1 1510.
## 5 14.4 1480.
## 6 14.2 1450.
Dissimilarity values represent how different each wine is compared to the rest of the wines in the dataset, with higher values indicating greater divergence. AS we can see, there is a dissimilarity between one specific wine, the one with Alcohol level 14.19%. This might be an outlier, since Alcohol is higher in another wine and the Dissimilarity is smaller. I’m going to filter it out.
# Filtering out the potential outlier
wine_clustering <- wine_clustering %>%
filter(!(row_number() %in% c(19)))
#Finding outliers again
wine_clustering$Dissimilarity <- sqrt(wine_clustering$Proanthocyanins^2 +
wine_clustering$Flavanoids^2 +
wine_clustering$Proline^2 +
wine_clustering$OD280^2 +
wine_clustering$Alcohol^2)
# Redoing the Dissimilarity without the potential outlier
head(wine_clustering[order(-wine_clustering$Dissimilarity), c("Alcohol", "Dissimilarity")])
## # A tibble: 6 × 2
## Alcohol Dissimilarity
## <dbl> <dbl>
## 1 14.4 1547.
## 2 13.6 1515.
## 3 14.1 1510.
## 4 14.4 1480.
## 5 14.2 1450.
## 6 13.8 1375.
library(factoextra)
#Finding Euclidian distances then saving them into object Distances
Distances <- get_dist(wine_clustering_std,
method = "euclidian")
fviz_dist(Distances, #Showing matrix of distances
gradient = list(low = "darkred",
mid = "grey95",
high = "white"))
library(factoextra)
get_clust_tendency(wine_clustering_std, #Hopkins statistics
n = nrow(wine_clustering) - 1,
graph = FALSE)
## $hopkins_stat
## [1] 0.7266345
##
## $plot
## NULL
Explanation:
The Hopkins statistic that we use to measure clustering tendency, is 0.73, indicating a strong clustering structure. Because it is close to 1, it confirms that the data is a great fit for cluster analysis. And if we look at the visual analysis made with Euclidian distances, we see that the distance matrix reveals the formation of 3 distinct squares, further supporting the fact that there are well-defined clusters within the data.
library(factoextra)
library(NbClust)
fviz_nbclust(wine_clustering_std, kmeans, method = "wss") +
labs(subtitle = "Elbow method")
fviz_nbclust(wine_clustering_std, kmeans, method = "silhouette")+
labs(subtitle = "Silhouette analysis")
I used both the elbow method and Silhouette analysis. With the elbow method, we determine the optimal number of clusters by plotting the total within-cluster sum of squares against the number of clusters. In both graphs we see that there is a significant break at Cluster 3, suggesting they capture the structure of the data set effectively and simplistically enough while maintaining differentiations.
Let’s check the assumption of the optimal number of clusters with the K-method:
library(NbClust)
nc <- NbClust(wine_clustering_std, distance = "euclidean",
min.nc = 2, max.nc = 10,
method = "kmeans", index = "all")
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 4 proposed 2 as the best number of clusters
## * 16 proposed 3 as the best number of clusters
## * 1 proposed 6 as the best number of clusters
## * 2 proposed 10 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 3
##
##
## *******************************************************************
According to the K-method and the majority rule, the best number of clusters is 3, as I assumed.
So now let’s perform clustering with 3 clusters and visualise them:
Clustering <- kmeans(wine_clustering_std,
centers = 3,
nstart = 25)
Clustering
## K-means clustering with 3 clusters of sizes 61, 60, 57
##
## Cluster means:
## Proanthocyanins Flavanoids Proline OD280 Alcohol
## 1 0.6067834 0.9267060 1.1353067 0.7325703 0.9166196
## 2 0.2232302 0.1615814 -0.7708399 0.4232188 -1.0272531
## 3 -0.8843438 -1.1618237 -0.4035669 -1.2294722 0.1003752
##
## Clustering vector:
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1
## [58] 1 1 3 3 3 3 2 2 2 2 2 3 2 3 1 3 1 2 2 2 2 2 2 2 2 2 3 2 2 2 2 2 2 2 2 3 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [115] 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3
## [172] 3 3 3 3 3 3 3
##
## Within cluster sum of squares by cluster:
## [1] 121.71535 120.22832 73.79202
## (between_SS / total_SS = 64.3 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss" "betweenss" "size"
## [8] "iter" "ifault"
library(factoextra)
fviz_cluster(Clustering,
palette = "Set1",
repel = FALSE,
ggtheme = theme_bw(),
data = wine_clustering_std)
These 3 clusters look great! The clusters satisfy both within-cluster and between-cluster requirements.They show a clear separation of groups without severe outlier influence, which means that there are meaningful and consistent groupings in the data.
Explanation of results:
Cluster sizes: The largest group is Cluster 1, with 61 wines and the smallest is Cluster 3, with 57 unites.
Cluster means: Cluster 1 has the highest average values across all compound categories, indicating that it includes wine samples with the highest levels of Proanthocyanins, Flavanoids, Proline, Alcohol and OD280, while Cluster 3 has negative averages, indicating below-average levels of Proanthocyanins, Flavanoids, Proline, Alcohol and OD280.
Cluster cohesion: Cluster 1 has the highest within-cluster sum of squares (121.72), while Cluster 3 has the lowest within-cluster sum of squares and is the most compact (73.79). Cluster 1 reflects greater variability among its units, its wines, while Cluster 3 indicates strong internal similarity.
Overall clustering efficiency: The model achieves a 64.3% between-cluster variation, which is a good proportion of variance explained.
Averages <- Clustering$centers
Averages
## Proanthocyanins Flavanoids Proline OD280 Alcohol
## 1 0.6067834 0.9267060 1.1353067 0.7325703 0.9166196
## 2 0.2232302 0.1615814 -0.7708399 0.4232188 -1.0272531
## 3 -0.8843438 -1.1618237 -0.4035669 -1.2294722 0.1003752
Explanation of results:
Cluster 1 displays positive averages, so they are above average for all values. This indicates high-quality wines, since they require high amounts of Proanthocyanins, Flavanoid, Proline, OD280 and moderate amounts of Alcohol.
Cluster 2 displays positive averages for Proanthocyanins, Flavanoid and OD280, which means they are of above average in value, while displaying negative averages for Proline and Alcohol, so they are below average. Proanthocyanins and Flavanoids are close to average. This indicates a mid-quality wine, neither high-quality, neither low-quality wines.
Cluster 3 displays negative averages, so they are below average for all values. This indicates low-quality wines, since they require low amounts of Proanthocyanins, Flavanoid, Proline, OD280 and either too high or too low amounts of Alcohol and since it is close to 0, we can conclude that this is too low for better quality wine.
Averages <- Clustering$centers
Averages #Average values of cluster variables to describe groups
## Proanthocyanins Flavanoids Proline OD280 Alcohol
## 1 0.6067834 0.9267060 1.1353067 0.7325703 0.9166196
## 2 0.2232302 0.1615814 -0.7708399 0.4232188 -1.0272531
## 3 -0.8843438 -1.1618237 -0.4035669 -1.2294722 0.1003752
Figure <- as.data.frame(Averages)
Figure$ID <- 1:nrow(Figure)
library(tidyr)
Figure <- pivot_longer(Figure, cols = c("Proanthocyanins",
"Flavanoids",
"Proline",
"OD280",
"Alcohol"))
Figure$Group <- factor(Figure$ID,
levels = c(1, 2, 3),
labels = c("1", "2", "3"))
Figure$NameF <- factor(Figure$name,
levels = c("Proanthocyanins",
"Flavanoids",
"Proline",
"OD280",
"Alcohol"),
labels = c("Proanthocyanins",
"Flavanoids",
"Proline",
"OD280",
"Alcohol"))
library(ggplot2)
ggplot(Figure, aes(x = NameF, y = value)) +
geom_hline(yintercept = 0) +
theme_bw() +
geom_point(aes(shape = Group, col = Group), size = 3) +
geom_line(aes(group = ID), linewidth = 1) +
ylab("Averages") +
xlab("Cluster variables")+
ylim(-2.2, 2.2) +
theme(axis.text.x = element_text(angle = 45, vjust = 0.50, size = 10))
Explanation of results:
This graph shows the distinctions above. Wines in Group 1 are have distinctly above average levels of all values. Group 2 is middling, varying around average while Group 3 is distinctly below average. So this emphasises that Group 1 consists of High-quality wine, Group 2 of some sort of middle-quality and Group 3 of Low-quality wines.
#I was facing an issue, so this is additional code to fix it, mostly I just copied code from above.
nrow(wine_clustering)
## [1] 177
nrow(Clustering)
## NULL
wine_clustering$Group <- Clustering$cluster[1:nrow(wine_clustering)]
Clustering <- kmeans(wine_clustering[, c("Total_Phenols", "Flavanoids", "Proline", "Alcohol", "OD280")], centers = 3)
wine_clustering$Group <- Clustering$cluster
wine_clustering$Group <- Clustering$cluster
fit <- aov(cbind(Proanthocyanins, Flavanoids, Proline, OD280, Alcohol) ~ as.factor(Group),
data = wine_clustering)
summary(fit)
## Response Proanthocyanins :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 2 6.537 3.2683 11.069 2.984e-05 ***
## Residuals 174 51.375 0.2953
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response Flavanoids :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 2 60.176 30.0879 46.417 < 2.2e-16 ***
## Residuals 174 112.787 0.6482
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response Proline :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 2 14579239 7289619 604.67 < 2.2e-16 ***
## Residuals 174 2097663 12056
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response OD280 :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 2 16.673 8.3367 20.006 1.512e-08 ***
## Residuals 174 72.506 0.4167
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response Alcohol :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 2 45.570 22.7852 56.913 < 2.2e-16 ***
## Residuals 174 69.661 0.4004
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Explanation of results:
The low p-values (< 0.001) for all variables confirm that the clustering successfully differentiates wines based on their taste profile, whether they are High-quality or Low-quality based on key indicators (values of Proanthocyanins, Flavanoids, Proline, OD280 and Alcohol). Each cluster and their meaningful differences, represent distinct profiles of all analyzed variables. The clustering approach is therefore valid and correct.
To evaluate the criterion validity of my clustering solution, I selected Alcohol level as a numerical variable to benchmark the test. Alcohol level is the cumulative variable of other factors in wine production, so, by comparing clusters based on this variable, I can link them to their flavour profiles as High-quality or Low-quality wine.
# Aggregate the means of Value added per inhabitant by cluster group
aggregate(wine_clustering$Alcohol,
by = list(Cluster = wine_clustering$Group),
FUN = mean)
## Cluster x
## 1 1 13.79609
## 2 2 12.51667
## 3 3 12.92984
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
# Levene's test
leveneTest(wine_clustering$Alcohol, as.factor(wine_clustering$Group))
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 2 4.457 0.01295 *
## 174
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Explanation of results:
The Levene’s Test checks whether the variances across groups are homogeneous.
Hypotheses: H0: The variances of the groups are equal (homogeneity of variance) H1: The variances of the groups are not equal (heterogeneity of variance)
Since the p-value (0.01754) is less than the significance level (0.05), we reject the null hypothesis. This suggests that there is significant evidence that the variances across the groups are different (variance is not homogeneous).
Because I cannot perform ANOVA, since the assumptions are not met, I will perform a non-parametric test, let’s do the Kruskal test:
kruskal.test(Alcohol ~ as.factor(Group),
data = wine_clustering)
##
## Kruskal-Wallis rank sum test
##
## data: Alcohol by as.factor(Group)
## Kruskal-Wallis chi-squared = 70.556, df = 2, p-value = 4.776e-16
Explanation:
H0: The distribution of Alcohol is the same across all groups. H1: At least one group’s Alcohol distribution is different.
Since p-value is much smaller than 0.05, we reject the null hypothesis, this means that there is a statistically significant difference in Alcohol content among the three groups.
Now I want to check which group differs:
pairwise.wilcox.test(wine_clustering$Alcohol,
as.factor(wine_clustering$Group),
p.adjust.method = "bonferroni")
##
## Pairwise comparisons using Wilcoxon rank sum test with continuity correction
##
## data: wine_clustering$Alcohol and as.factor(wine_clustering$Group)
##
## 1 2
## 2 1.6e-14 -
## 3 2.4e-09 0.0029
##
## P value adjustment method: bonferroni
We see that there are statistically significant differences are between all 3 groups which is to be expected, since Alcohol level is supposed differ as one of the factors of High-quality wine and Low-quality wine.
So let us name clusters according to this data analysis:
Cluster 1: High quality wine, that includes wines like these italian reds: - Barolo DOCG (Piedmont) - Brunello di Montalcino DOCG (Tuscany) - Amarone della Valpolicella DOCG (Veneto) - Barbaresco DOCG (Piedmont) - Super Tuscans (Tuscany)
Cluster 2: Middling quality wine, that includes wines like these italian sparkling wines: - Prosecco DOC (Veneto) - Lambrusco di Sorbara DOC (Emilia-Romagna) - Rosato Salento IGT (Puglia)
Cluster 3: Low quality wine, that includes wines like these italian whites: - Vino da Tavola Bianco - Trebbiano IGT (Mass-Produced) - Soave (Low-End, Supermarket Bottles)