mydata <- read.csv("C:/Users/Andrej/Documents/wine-clustering.csv")
mydata$ID <- seq(1, nrow(mydata))
head(mydata)
## Alcohol Malic_Acid Ash Ash_Alcanity Magnesium Total_Phenols Flavanoids
## 1 14.23 1.71 2.43 15.6 127 2.80 3.06
## 2 13.20 1.78 2.14 11.2 100 2.65 2.76
## 3 13.16 2.36 2.67 18.6 101 2.80 3.24
## 4 14.37 1.95 2.50 16.8 113 3.85 3.49
## 5 13.24 2.59 2.87 21.0 118 2.80 2.69
## 6 14.20 1.76 2.45 15.2 112 3.27 3.39
## Nonflavanoid_Phenols Proanthocyanins Color_Intensity Hue OD280 Proline ID
## 1 0.28 2.29 5.64 1.04 3.92 1065 1
## 2 0.26 1.28 4.38 1.05 3.40 1050 2
## 3 0.30 2.81 5.68 1.03 3.17 1185 3
## 4 0.24 2.18 7.80 0.86 3.45 1480 4
## 5 0.39 1.82 4.32 1.04 2.93 735 5
## 6 0.34 1.97 6.75 1.05 2.85 1450 6
Explaining data:
The unit of observation is an individual wine sample.
The dataset contains a total of 178 observations (wine samples).
The dataset includes 13 variables, all of which describe the chemical composition and properties of the wine. These variables are:
Alcohol: The alcohol content of the wine (measured in %).
Malic acid: The amount of malic acid in the wine (measured in g/L).
Ash: The ash content of the wine (measured in g/L).
Alcalinity of Ash: The alkalinity of the ash (measured in meq/L).
Magnesium: The magnesium content (measured in mg/L).
Total phenols: Total phenolic compounds (measured in g/L).
Flavanoids: A subset of phenols (measured in g/L).
Nonflavanoid phenols: Non-flavonoid phenolic compounds (measured in g/L).
Proanthocyanins: Proanthocyanidin content (measured in g/L).
Color intensity: The intensity of the wine’s color (measured in absorbance units).
Hue: The hue of the wine (ratio of color intensity at 520 nm to 420 nm).
OD280/OD315 of diluted wines: The ratio of optical density (absorbance) at 280 nm to that at 315 nm, measured on diluted wine.
Proline: The amount of proline, an amino acid (measured in mg/L).
The data is sourced from Kaggle.com, where the owner is Harry Wang. The dataset originates from the UCI Machine Learning Repository.
summary(mydata[c(1:13)])
## Alcohol Malic_Acid Ash Ash_Alcanity
## Min. :11.03 Min. :0.740 Min. :1.360 Min. :10.60
## 1st Qu.:12.36 1st Qu.:1.603 1st Qu.:2.210 1st Qu.:17.20
## Median :13.05 Median :1.865 Median :2.360 Median :19.50
## Mean :13.00 Mean :2.336 Mean :2.367 Mean :19.49
## 3rd Qu.:13.68 3rd Qu.:3.083 3rd Qu.:2.558 3rd Qu.:21.50
## Max. :14.83 Max. :5.800 Max. :3.230 Max. :30.00
## Magnesium Total_Phenols Flavanoids Nonflavanoid_Phenols
## Min. : 70.00 Min. :0.980 Min. :0.340 Min. :0.1300
## 1st Qu.: 88.00 1st Qu.:1.742 1st Qu.:1.205 1st Qu.:0.2700
## Median : 98.00 Median :2.355 Median :2.135 Median :0.3400
## Mean : 99.74 Mean :2.295 Mean :2.029 Mean :0.3619
## 3rd Qu.:107.00 3rd Qu.:2.800 3rd Qu.:2.875 3rd Qu.:0.4375
## Max. :162.00 Max. :3.880 Max. :5.080 Max. :0.6600
## Proanthocyanins Color_Intensity Hue OD280
## Min. :0.410 Min. : 1.280 Min. :0.4800 Min. :1.270
## 1st Qu.:1.250 1st Qu.: 3.220 1st Qu.:0.7825 1st Qu.:1.938
## Median :1.555 Median : 4.690 Median :0.9650 Median :2.780
## Mean :1.591 Mean : 5.058 Mean :0.9574 Mean :2.612
## 3rd Qu.:1.950 3rd Qu.: 6.200 3rd Qu.:1.1200 3rd Qu.:3.170
## Max. :3.580 Max. :13.000 Max. :1.7100 Max. :4.000
## Proline
## Min. : 278.0
## 1st Qu.: 500.5
## Median : 673.5
## Mean : 746.9
## 3rd Qu.: 985.0
## Max. :1680.0
Descriptive statistics:
Alcohol (%): Min: 11.03, Max: 14.83, Mean: 13.00. The first quartile (Q1) is 12.36 and the third quartile (Q3) is 13.68, so the middle 50% of wines fall between these values.
Malic Acid (g/L): Min: 0.74, Max: 5.80, Mean: 2.34. The median (1.865) lies well below the mean, which points to a right-skewed distribution with a few high-acid wines.
Ash (g/L): Min: 1.36, Max: 3.23, Mean: 2.37. The middle 50% of wines have ash content between 2.21 and 2.56 (interquartile range).
Alcalinity of Ash (meq/L): Min: 10.6, Max: 30.0, Mean: 19.49. The alkalinity levels vary widely, with the highest value almost three times the lowest.
Magnesium (mg/L): Min: 70, Max: 162, Mean: 99.74. The median is 98, so half of the wines contain 98 mg/L of magnesium or less.
Total Phenols (g/L): Min: 0.98, Max: 3.88, Mean: 2.295. Phenolic content is an important quality factor in wines.
Flavanoids (g/L): Min: 0.34, Max: 5.08, Mean: 2.03. The higher the flavonoid content, the more likely the wine is to have antioxidant properties.
Nonflavanoid Phenols (g/L): Min: 0.13, Max: 0.66, Mean: 0.36. These are generally lower in concentration compared to flavonoids.
Proanthocyanins (g/L): Min: 0.41, Max: 3.58, Mean: 1.59. Important for wine color stability.
Color Intensity (absorbance units): Min: 1.28, Max: 13.0, Mean: 5.06. There is high variation in color intensity among wines.
Hue (Ratio of absorbance at 520 nm to 420 nm): Min: 0.48, Max: 1.71, Mean: 0.96. Represents the tone of the wine (red, brown, etc.).
OD280/OD315 of Diluted Wines: Min: 1.27, Max: 4.0, Mean: 2.61. Measures the wine’s protein and polyphenol content.
Proline (mg/L): Min: 278, Max: 1680, Mean: 746.9. Proline is an amino acid that contributes to wine structure.
mydata_clu_std <- as.data.frame(scale(mydata[c(1,2,3,4)]))
mydata$Dissimilarity = sqrt(mydata_clu_std$Alcohol^2 + mydata_clu_std$Malic_Acid^2 + mydata_clu_std$Ash^2 + mydata_clu_std$Ash_Alcanity^2)
head(mydata[order(-mydata$Dissimilarity), c("ID","Dissimilarity")], 15)
## ID Dissimilarity
## 60 60 4.766567
## 122 122 4.515761
## 26 26 3.530586
## 138 138 3.481034
## 128 128 3.435175
## 74 74 3.312897
## 14 14 3.289520
## 124 124 3.274170
## 170 170 3.209481
## 123 123 3.190471
## 174 174 3.121685
## 111 111 3.024639
## 67 67 3.023544
## 9 9 2.947353
## 178 178 2.920125
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
mydata <- mydata %>%
filter(!ID %in% c(60,122))
mydata$ID <- seq(1, nrow(mydata))
mydata_clu_std <- as.data.frame(scale(mydata[c(1,2,3,4)]))
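I deleted the two units with the largest dissimilarity (IDs 60 and 122) as outliers, then renumbered the IDs and standardized the clustering variables again.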
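The dissimilarity score used for this screening is each unit's Euclidean distance from the origin of the standardized variables, that is, from the sample mean. A minimal sketch on simulated data (the data frame `toy` and its columns are made up for illustration and are not the wine data):

```r
# Simulated stand-in for four clustering variables (hypothetical data)
set.seed(1)
toy <- data.frame(a = rnorm(50), b = rnorm(50), c = rnorm(50), d = rnorm(50))

# Standardize so that every variable has mean 0 and standard deviation 1
toy_std <- as.data.frame(scale(toy))

# Distance of each unit from the centroid (the origin after standardization)
dissimilarity <- sqrt(rowSums(toy_std^2))

# Units with the largest scores are candidates for a closer look
print(head(sort(dissimilarity, decreasing = TRUE), 5))
```

Where to draw the line between "unusual" and "outlier" remains a judgment call; the score only ranks the units.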
library(factoextra)
## Warning: package 'factoextra' was built under R version 4.4.2
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
Distances <- get_dist(mydata_clu_std,
method = "euclidean")
fviz_dist(Distances,
gradient = list(low = "darkred",
mid = "grey95",
high = "white"))
The heatmap shows how similar or different the wine samples are based on their chemical properties (represents the pairwise Euclidean distances between wine samples). Darker red areas mean the wines are very similar to each other, while lighter or white areas indicate they are quite different. The diagonal line is the darkest because it represents each wine compared to itself, which always has a distance of zero. Some parts of the heatmap have clusters of dark red, suggesting that certain wines share similar characteristics and could naturally form groups. On the other hand, the lighter areas show wines that are more unique or different from most others. This visualization helps us see patterns in the data and suggests that some wines are much more alike than others.
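To make the distance matrix behind the heatmap concrete, here is a small sketch with three invented samples (`s1`, `s2`, `s3` are hypothetical, not rows of the wine data). It uses base R's `dist()` and shows the zero diagonal and the symmetry described above:

```r
# Three hypothetical samples described by two standardized variables
x <- rbind(s1 = c(0, 0), s2 = c(3, 4), s3 = c(0.1, 0))

# Pairwise Euclidean distances as a full symmetric matrix
d <- as.matrix(dist(x, method = "euclidean"))
print(round(d, 2))
```

Here `s1` and `s3` are almost identical (distance 0.1), while `s2` is far from both, which is the kind of contrast the dark and light cells of the heatmap encode.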
library(factoextra)
get_clust_tendency(mydata_clu_std,
n = nrow(mydata_clu_std) - 1,
graph = FALSE)
## $hopkins_stat
## [1] 0.6945244
##
## $plot
## NULL
library(factoextra)
library(NbClust)
fviz_nbclust(mydata_clu_std, kmeans, method = "wss") +
labs(subtitle = "Elbow method")
fviz_nbclust(mydata_clu_std, kmeans, method = "silhouette")+
labs(subtitle = "Silhouette analysis")
library(dplyr)
library(factoextra)
WARD <- mydata_clu_std %>%
get_dist(method = "euclidean") %>%
hclust(method = "ward.D2")
WARD
##
## Call:
## hclust(d = ., method = "ward.D2")
##
## Cluster method : ward.D2
## Distance : euclidean
## Number of objects: 176
library(factoextra)
fviz_dend(WARD)
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the factoextra package.
## Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
library(NbClust)
NbClust(mydata_clu_std,
distance = "euclidean",
min.nc = 2, max.nc = 10,
method = "kmeans",
index = "all")
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 4 proposed 2 as the best number of clusters
## * 10 proposed 3 as the best number of clusters
## * 4 proposed 4 as the best number of clusters
## * 3 proposed 6 as the best number of clusters
## * 2 proposed 10 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 3
##
##
## *******************************************************************
## $All.index
## KL CH Hartigan CCC Scott Marriot TrCovW TraceW
## 2 0.4786 57.6999 68.7291 -2.7380 195.7042 695777636 22827.713 525.6800
## 3 2.8581 74.1813 36.9756 -1.1059 414.8250 450769372 12193.656 376.8329
## 4 4.1169 71.9311 17.9432 -1.4397 511.9200 461574739 6629.902 310.4746
## 5 0.1714 63.6896 30.7565 -2.1781 555.3059 563642640 4558.635 281.1454
## 6 1.7404 65.8797 5.8352 -0.4581 663.4610 439020370 3829.754 238.2866
## 7 3.0483 57.4170 14.2171 -2.1218 692.3694 507042472 3531.074 230.3789
## 8 2.0913 55.0580 11.4010 -1.9890 748.9253 480254888 3060.681 212.5022
## 9 0.2815 52.5555 13.3857 -2.0742 795.9699 465254457 2662.459 198.9976
## 10 4.7456 51.6367 7.1999 -1.7001 854.6658 411498742 2307.682 184.2308
## Friedman Rubin Cindex DB Silhouette Duda Pseudot2 Beale Ratkowsky
## 2 1.9564 1.3316 0.3876 1.7087 0.2328 0.6515 60.4441 1.2788 0.3086
## 3 4.3051 1.8576 0.3306 1.2865 0.2961 0.6567 31.3625 1.2409 0.3901
## 4 5.6250 2.2546 0.3584 1.2087 0.3064 0.9698 1.8032 0.0736 0.3724
## 5 6.2510 2.4898 0.3500 1.2700 0.2566 1.6420 -23.8491 -0.9124 0.3456
## 6 7.8877 2.9376 0.3884 1.2594 0.2605 0.7335 13.4461 0.8530 0.3315
## 7 8.4903 3.0385 0.3743 1.3309 0.2470 2.8293 -32.9745 -1.4788 0.3095
## 8 9.5070 3.2941 0.3614 1.3766 0.2358 2.3739 -21.9924 -1.3337 0.2950
## 9 10.4214 3.5176 0.3632 1.3389 0.2232 2.2169 -15.9187 -1.2424 0.2820
## 10 11.7300 3.7996 0.3365 1.3031 0.2313 1.5787 -8.4314 -0.8384 0.2714
## Ball Ptbiserial Frey McClain Dunn Hubert SDindex Dindex SDbw
## 2 262.8400 0.3666 0.0873 0.7128 0.0618 0.0024 2.2061 1.6277 1.3981
## 3 125.6110 0.4925 0.1170 1.2983 0.0633 0.0029 1.4914 1.3407 0.9514
## 4 77.6187 0.5350 1.0127 1.5825 0.0761 0.0033 1.4295 1.2306 0.7832
## 5 56.2291 0.4930 0.2994 2.1287 0.0801 0.0035 1.7610 1.1656 0.6172
## 6 39.7144 0.4846 0.8848 2.6539 0.1280 0.0039 1.7467 1.0804 0.3561
## 7 32.9113 0.4695 0.2551 2.9101 0.0747 0.0043 1.9500 1.0596 0.4384
## 8 26.5628 0.4617 0.5144 3.2913 0.0845 0.0045 2.0081 1.0156 0.3300
## 9 22.1108 0.4437 0.1305 3.7196 0.0635 0.0046 1.9808 0.9850 0.3104
## 10 18.4231 0.4431 -0.6502 4.0480 0.0635 0.0050 1.9889 0.9439 0.3031
##
## $All.CriticalValues
## CritValue_Duda CritValue_PseudoT2 Fvalue_Beale
## 2 0.6427 62.8160 0.2776
## 3 0.5821 43.0671 0.2943
## 4 0.5607 45.4513 0.9901
## 5 0.4780 66.6283 1.0000
## 6 0.5087 35.7380 0.4940
## 7 0.3890 80.0903 1.0000
## 8 0.4195 52.5756 1.0000
## 9 0.3508 53.6685 1.0000
## 10 0.3890 36.1192 1.0000
##
## $Best.nc
## KL CH Hartigan CCC Scott Marriot TrCovW
## Number_clusters 10.0000 3.0000 3.0000 6.0000 3.0000 3 3.00
## Value_Index 4.7456 74.1813 31.7535 -0.4581 219.1208 255813632 10634.06
## TraceW Friedman Rubin Cindex DB Silhouette Duda
## Number_clusters 3.0000 3.0000 6.000 3.0000 4.0000 4.0000 2.0000
## Value_Index 82.4888 2.3487 -0.347 0.3306 1.2087 0.3064 0.6515
## PseudoT2 Beale Ratkowsky Ball PtBiserial Frey McClain Dunn
## Number_clusters 2.0000 2.0000 3.0000 3.000 4.000 1 2.0000 6.000
## Value_Index 60.4441 1.2788 0.3901 137.229 0.535 NA 0.7128 0.128
## Hubert SDindex Dindex SDbw
## Number_clusters 0 4.0000 0 10.0000
## Value_Index 0 1.4295 0 0.3031
##
## $Best.partition
## [1] 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 2 1 1 1 1 2 1 1 1 1 1 1
## [38] 1 3 1 1 2 1 2 3 2 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3 3 3 3 3 3 1 3 3 2 2 2 3
## [75] 3 3 3 3 2 3 3 3 2 3 3 3 2 3 3 3 3 3 3 3 3 2 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3
## [112] 2 3 3 3 3 3 3 3 3 2 2 2 3 3 2 3 2 3 2 2 2 3 3 2 2 2 2 2 2 2 2 3 2 2 2 2 2
## [149] 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 1 2 2 2 2 2
Based on the elbow method, silhouette analysis, the cluster dendrogram, and the NbClust majority rule, I choose to make 3 clusters.
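The elbow method can also be reproduced without plotting helpers. The sketch below is only an illustration on simulated data with three well-separated groups (the matrix `toy` is an assumption, not the wine data); it computes the total within-cluster sum of squares for several values of k with `kmeans()`:

```r
# Simulated data with three well-separated groups (hypothetical)
set.seed(42)
toy <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
             matrix(rnorm(100, mean = 5), ncol = 2),
             matrix(rnorm(100, mean = 10), ncol = 2))

# Total within-cluster sum of squares for k = 1, ..., 8
wss <- sapply(1:8, function(k) {
  kmeans(toy, centers = k, nstart = 25)$tot.withinss
})

# The "elbow" is the k after which the values stop dropping sharply
print(round(wss, 1))
```

Plotting `wss` against k gives the same kind of curve as the elbow plot; on real data the bend is usually less sharp than in this toy example.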
Clustering <- kmeans(mydata_clu_std,
centers = 3,
nstart = 25)
Clustering
## K-means clustering with 3 clusters of sizes 63, 61, 52
##
## Cluster means:
## Alcohol Malic_Acid Ash Ash_Alcanity
## 1 -0.90777691 -0.5401134 -0.7547032 -0.0805728
## 2 0.08777437 0.9239729 0.5161277 0.8282048
## 3 0.99684056 -0.4295231 0.3088945 -0.8739309
##
## Clustering vector:
## [1] 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3 2 3 3 3 3 2 3 3 3 3 3 3
## [38] 3 1 3 3 2 3 2 1 2 3 3 3 3 3 3 3 3 3 3 3 3 3 1 1 1 1 1 1 1 1 3 1 1 2 2 2 1
## [75] 1 1 1 1 2 1 1 1 2 1 1 1 2 1 1 1 1 1 1 1 1 2 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1
## [112] 2 1 1 1 1 1 1 1 1 2 2 2 1 1 2 1 2 1 2 2 2 1 1 2 2 2 2 2 2 2 2 1 2 2 2 2 2
## [149] 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 3 2 2 2 2 2
##
## Within cluster sum of squares by cluster:
## [1] 125.3203 175.6697 75.8429
## (between_SS / total_SS = 46.2 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
The clustering algorithm divided the dataset into clusters of sizes 63, 61, and 52, meaning that 63 wine samples belong to Cluster 1, 61 to Cluster 2, and 52 to Cluster 3. Each cluster has a set of mean values for selected variables (Alcohol, Malic Acid, Ash, and Ash Alcanity), which indicate the typical characteristics of wines in that group.
From the cluster means, we can see that Cluster 1 has the lowest Alcohol content, the lowest Ash levels, and the lowest Malic Acid (only slightly below Cluster 3), suggesting these wines may be less intense in alcohol and body. Cluster 2 has the highest Malic Acid, Ash, and Ash Alcanity, possibly indicating wines with more acidity and mineral content. Cluster 3 has the highest Alcohol levels and the lowest Ash Alcanity, which could mean these wines are stronger in alcohol but lower in certain mineral qualities. Because the variables are standardized, each mean is expressed in standard deviations from the overall average.
The within-cluster sum of squares (WCSS) values (125.32, 175.67, and 75.84) measure how compact each cluster is, with lower values indicating that points in the cluster are closer to their centroid. The between-cluster sum of squares divided by total sum of squares (46.2%) shows how well-separated the clusters are; typically, higher values indicate clearer distinctions between groups. This result suggests a moderate level of separation but not a perfect clustering.
Overall, the k-means algorithm grouped the wines into three clusters with clearly different profiles on these four variables, although the separation between them is only moderate.
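The separation ratio reported by `kmeans()` follows from the decomposition total SS = between SS + total within SS. A short sketch on simulated data (the matrix `toy` is hypothetical) shows where the numbers live in the fitted object:

```r
# Simulated standardized data with four variables (hypothetical)
set.seed(7)
toy <- scale(matrix(rnorm(200), ncol = 4))

km <- kmeans(toy, centers = 3, nstart = 25)

# The per-cluster within SS values add up to the total within SS
print(km$withinss)
print(km$tot.withinss)

# Share of the total variation that lies between the clusters
ratio <- km$betweenss / km$totss
print(round(100 * ratio, 1))
```

The same components (`withinss`, `betweenss`, `totss`) are available on the `Clustering` object fitted above.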
library(factoextra)
fviz_cluster(Clustering,
palette = "Set1",
repel = FALSE,
ggtheme = theme_bw(),
data = mydata_clu_std)
library(dplyr)
mydata <- mydata %>%
filter(!ID %in% c(176,26,168,126,87,121,136))
mydata$ID <- seq(1, nrow(mydata))
mydata_clu_std <- as.data.frame(scale(mydata[c(1,2,3,4)]))
I deleted outliers that I identified in the cluster plot.
Clustering <- kmeans(mydata_clu_std,
centers = 3,
nstart = 25)
library(factoextra)
fviz_cluster(Clustering,
palette = "Set1",
repel = FALSE,
ggtheme = theme_bw(),
data = mydata_clu_std)
library(dplyr)
mydata <- mydata %>%
filter(!ID %in% c(25,3,55,35, 132))
mydata$ID <- seq(1, nrow(mydata))
mydata_clu_std <- as.data.frame(scale(mydata[c(1,2,3,4)]))
I deleted the units that overlapped between clusters in the cluster plot.
Clustering <- kmeans(mydata_clu_std,
centers = 3,
nstart = 25)
library(factoextra)
fviz_cluster(Clustering,
palette = "Set1",
repel = FALSE,
ggtheme = theme_bw(),
data = mydata_clu_std)
The final cluster plot.
Averages <- Clustering$centers
Averages
## Alcohol Malic_Acid Ash Ash_Alcanity
## 1 0.1185140 0.9833759 0.4728621 0.80717569
## 2 -0.9008307 -0.5441204 -0.7072839 0.01016065
## 3 1.0302448 -0.4034757 0.3816052 -0.92119682
Figure <- as.data.frame(Averages)
Figure$id <- 1:nrow(Figure)
library(tidyr)
Figure <- pivot_longer(Figure, cols = c(1,2,3,4))
Figure$Group <- factor(Figure$id,
levels = c(1, 2, 3),
labels = c("1", "2", "3"))
library(ggplot2)
ggplot(Figure, aes(x = name, y = value)) +
geom_hline(yintercept = 0) +
theme_bw() +
geom_point(aes(shape = Group, col = Group), size = 3) +
geom_line(aes(group = id), linewidth = 1) +
ylab("Averages") +
xlab("Cluster variables")+
ylim(-2.2, 2.2) +
theme(axis.text.x = element_text(angle = 45, vjust = 0.50, size = 10))
mydata$Group <- Clustering$cluster
fit <- aov(cbind(Alcohol, Ash, Ash_Alcanity, Malic_Acid) ~ as.factor(Group), data = mydata)
summary(fit)
## Response Alcohol :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 2 67.480 33.740 134.67 < 2.2e-16 ***
## Residuals 161 40.336 0.251
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response Ash :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 2 2.9945 1.49723 35.702 1.469e-13 ***
## Residuals 161 6.7519 0.04194
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response Ash_Alcanity :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 2 703.10 351.55 70.187 < 2.2e-16 ***
## Residuals 161 806.42 5.01
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response Malic_Acid :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 2 92.764 46.382 74.582 < 2.2e-16 ***
## Residuals 161 100.125 0.622
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Each variable is analyzed separately using ANOVA to determine the statistical significance of differences between groups. The degrees of freedom (Df) for the grouping factor is 2, reflecting the presence of three clusters, while the residuals have 161 degrees of freedom (164 observations minus 3 groups), accounting for the remaining variability in the dataset. The Sum of Squares (Sum Sq) and Mean Square (Mean Sq) provide a measure of variance, with the mean square calculated as the sum of squares divided by the degrees of freedom.
The F-values are notably high for each variable, indicating strong evidence that the groups have different mean values. The p-values (Pr(>F)) are extremely small: below 2.2e-16 for Alcohol, Ash Alcanity, and Malic Acid, and 1.469e-13 for Ash. The three asterisks (***) next to each p-value mark significance at the 0.001 level, meaning that Alcohol, Ash, Ash Alcanity, and Malic Acid vary significantly among the three groups. This result supports the idea that the clusters formed in the k-means analysis are not arbitrary but represent wines with distinct chemical compositions.
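As a worked check of how the table entries relate, the F-value for Alcohol can be recomputed from the sums of squares and degrees of freedom printed in the output above (the variable names below are only illustrative):

```r
# Values copied from the ANOVA output for Alcohol
ss_group <- 67.480; df_group <- 2
ss_resid <- 40.336; df_resid <- 161

# Mean square = sum of squares / degrees of freedom
ms_group <- ss_group / df_group
ms_resid <- ss_resid / df_resid

# F = mean square of the grouping factor / residual mean square
f_value <- ms_group / ms_resid
p_value <- pf(f_value, df_group, df_resid, lower.tail = FALSE)

print(round(c(ms_group = ms_group, ms_resid = ms_resid, f_value = f_value), 3))
print(p_value)
```

Up to rounding of the printed inputs, this reproduces the reported mean squares and F-value.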
aggregate(mydata$Proline,
by = list(mydata$Group),
FUN = mean)
## Group.1 x
## 1 1 650.6481
## 2 2 538.7419
## 3 3 1142.6458
aggregate(mydata$Color_Intensity,
by = list(mydata$Group),
FUN = mean)
## Group.1 x
## 1 1 6.307593
## 2 2 3.448387
## 3 3 5.786875
For Proline, the mean values for the three clusters are 650.65, 538.74, and 1142.65. This indicates that wines in Cluster 3 have the highest Proline content, while wines in Cluster 2 have the lowest. Since Proline is an amino acid associated with wine structure and aging potential, this suggests that Cluster 3 wines may have richer characteristics in terms of amino acid composition.
For Color Intensity, the average values are 6.31, 3.45, and 5.79. Wines in Cluster 1 have the highest color intensity, meaning they likely appear darker or more concentrated in color, while Cluster 2 wines have the lowest intensity, suggesting a lighter appearance. Cluster 3 falls in between, indicating moderate intensity. Color intensity is a key characteristic in wine classification, often correlating with the type of grape used and the winemaking process.
These differences in Proline and Color Intensity suggest that the clusters are not only chemically distinct but may also differ in their visual and structural properties, supporting the validity of the clustering results.
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
leveneTest(mydata$Proline, as.factor(mydata$Group))
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 2 2.2196 0.112
## 161
Levene’s test for homogeneity of variance was conducted on the Proline variable across the three clusters. The test produced an F-value of 2.2196 and a p-value of 0.112, which is greater than the common significance threshold of 0.05. This means we fail to reject the null hypothesis, suggesting that the variances across the groups are not significantly different.
Since the assumption of homogeneity of variances is met (p > 0.05), there is no need to use Welch’s ANOVA, which is specifically designed for cases where variances are unequal.
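For intuition about what the test measures: with `center = median`, Levene's test is, to my understanding, a one-way ANOVA on the absolute deviations of each observation from its group median. The sketch below rebuilds that by hand in base R on simulated data (the data frame `toy`, its groups, and the chosen standard deviations are assumptions for illustration):

```r
# Simulated groups: A and B have equal spread, C is three times as variable
set.seed(3)
toy <- data.frame(
  y = c(rnorm(30, sd = 1), rnorm(30, sd = 1), rnorm(30, sd = 3)),
  g = factor(rep(c("A", "B", "C"), each = 30))
)

# Absolute deviation of each observation from its own group median
toy$abs_dev <- abs(toy$y - ave(toy$y, toy$g, FUN = median))

# One-way ANOVA on the absolute deviations
levene_by_hand <- anova(lm(abs_dev ~ g, data = toy))
print(levene_by_hand)
```

A small p-value here means that the typical deviation from the group median differs between groups, that is, the spreads are unequal.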
library(car)
leveneTest(mydata$Color_Intensity, as.factor(mydata$Group))
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 2 16.089 4.264e-07 ***
## 161
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Levene’s test for homogeneity of variance on Color Intensity produced an F-value of 16.089 and a p-value of 4.264e-07, which is extremely small (p < 0.001). This means we reject the null hypothesis, indicating that the variances across the three clusters are significantly different. Since the assumption of homogeneity of variances is violated, a standard one-way ANOVA is not appropriate, as it assumes equal variances.
Instead, we should use Welch’s ANOVA, which adjusts for unequal variances and provides more reliable results when variance heterogeneity is present.
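In base R, Welch's ANOVA is available through `oneway.test()` with `var.equal = FALSE`. The following is a sketch on simulated data with unequal group variances (the data frame `toy` is hypothetical; for this analysis the formula would presumably be `Color_Intensity ~ as.factor(Group)` with `data = mydata`):

```r
# Simulated groups with different means and clearly different variances
set.seed(11)
toy <- data.frame(
  y = c(rnorm(40, mean = 5, sd = 1),
        rnorm(40, mean = 6, sd = 3),
        rnorm(40, mean = 8, sd = 2)),
  g = factor(rep(c("A", "B", "C"), each = 40))
)

# var.equal = FALSE requests the Welch correction for unequal variances
welch <- oneway.test(y ~ g, data = toy, var.equal = FALSE)
print(welch)
```

The denominator degrees of freedom are adjusted downward and are usually not an integer, which is the visible sign that the Welch correction was applied.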
library(dplyr)
library(rstatix)
## Warning: package 'rstatix' was built under R version 4.4.2
##
## Attaching package: 'rstatix'
## The following object is masked from 'package:stats':
##
## filter
mydata %>%
group_by(as.factor(mydata$Group)) %>%
shapiro_test(Proline)
## # A tibble: 3 × 4
## `as.factor(mydata$Group)` variable statistic p
## <fct> <chr> <dbl> <dbl>
## 1 1 Proline 0.933 0.00483
## 2 2 Proline 0.946 0.00901
## 3 3 Proline 0.984 0.736
The Shapiro-Wilk test for normality has been conducted for the Proline variable within each cluster. The test results show that for Groups 1 and 2, the p-values are 0.0048 and 0.0090, respectively, which are below the standard significance threshold of 0.05. This indicates that the Proline variable in these groups does not follow a normal distribution. However, for Group 3, the p-value is 0.736, suggesting that normality is not violated in this group.
Since two of the three groups do not meet the normality assumption, a standard one-way ANOVA is not appropriate, as it assumes normally distributed residuals. Instead, the Kruskal-Wallis test should be used as a non-parametric alternative: it does not assume normality and compares the distribution (mean ranks) of a variable across multiple groups.
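The group-wise normality check and the non-parametric follow-up can also be written with base R only. This is a sketch on simulated data in which one group is deliberately skewed (the data frame `toy` is an assumption, not the wine data):

```r
# Simulated groups: A and C are normal, B is strongly right-skewed
set.seed(5)
toy <- data.frame(
  y = c(rnorm(40), rexp(40), rnorm(40, mean = 2)),
  g = factor(rep(c("A", "B", "C"), each = 40))
)

# Shapiro-Wilk p-value within each group
p_by_group <- tapply(toy$y, toy$g, function(v) shapiro.test(v)$p.value)
print(round(p_by_group, 4))

# Rank-based comparison of the three groups
kw <- kruskal.test(y ~ g, data = toy)
print(kw)
```

A single non-normal group is enough to question the ANOVA assumption, which is why the rank-based test is run on all groups together.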
library(dplyr)
library(rstatix)
mydata %>%
group_by(as.factor(mydata$Group)) %>%
shapiro_test(Color_Intensity)
## # A tibble: 3 × 4
## `as.factor(mydata$Group)` variable statistic p
## <fct> <chr> <dbl> <dbl>
## 1 1 Color_Intensity 0.945 0.0153
## 2 2 Color_Intensity 0.834 0.000000743
## 3 3 Color_Intensity 0.971 0.284
The Shapiro-Wilk test for normality has been conducted for the Color Intensity variable within each cluster. The results indicate that for Group 2, the p-value is extremely small (7.43e-07), which is well below 0.05, meaning that the Color Intensity distribution in this group significantly deviates from normality. For Group 1, the p-value is 0.015, which is also below 0.05, suggesting a moderate violation of normality. However, Group 3 has a p-value of 0.284, indicating that normality is not violated in that cluster.
Since two of the three groups do not follow a normal distribution, a standard one-way ANOVA is not appropriate because it assumes normally distributed residuals. Additionally, Levene’s test previously indicated that the assumption of homogeneity of variances was violated for Color Intensity. Given these conditions, the best approach is to use the Kruskal-Wallis test, which is a non-parametric alternative to ANOVA that does not require normality.
kruskal.test(Proline ~ as.factor(Group),
data = mydata)
##
## Kruskal-Wallis rank sum test
##
## data: Proline by as.factor(Group)
## Kruskal-Wallis chi-squared = 95.232, df = 2, p-value < 2.2e-16
The Kruskal-Wallis rank sum test was performed to compare the distribution of Proline across the three clusters. The test produced a chi-squared value of 95.232 with 2 degrees of freedom, and the p-value is less than 2.2e-16, which is far below the standard significance threshold of 0.05.
This result means that there is a statistically significant difference in the distribution of Proline among the three clusters.
kruskal_effsize(Proline ~ as.factor(Group),
data = mydata)
## # A tibble: 1 × 5
## .y. n effsize method magnitude
## * <chr> <int> <dbl> <chr> <ord>
## 1 Proline 164 0.579 eta2[H] large
The effect size calculation for the Kruskal-Wallis test on Proline is shown in the output. The effect size metric used is eta-squared based on the H statistic (η²[H]), which quantifies how much of the variance in Proline can be explained by the clustering variable (Group). The computed effect size is 0.579, and the interpretation is labeled as large, meaning that the difference in Proline levels across the clusters is substantial.
Since an effect size above 0.14 is typically considered large in non-parametric tests, this result confirms that the grouping has a strong impact on Proline distribution. This reinforces the earlier Kruskal-Wallis test results, suggesting that Proline is an important distinguishing factor among the clusters, with meaningful differences in its levels across the groups.
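The reported effect size can be checked by hand. As far as I know, rstatix computes eta-squared for the Kruskal-Wallis test as (H - k + 1) / (n - k), where H is the test statistic, k the number of groups, and n the number of observations. Plugging in the values from the outputs above:

```r
# Values taken from the Kruskal-Wallis and effect size outputs
h <- 95.232   # Kruskal-Wallis chi-squared (H statistic)
k <- 3        # number of clusters
n <- 164      # number of wine samples in the final data

# Eta-squared based on the H statistic
eta2_h <- (h - k + 1) / (n - k)
print(round(eta2_h, 3))
```

The result agrees with the `effsize` column to three decimals.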
fit <- aov(cbind(Color_Intensity) ~ as.factor(Group),
data = mydata)
summary(fit)
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 2 270.4 135.20 35.38 1.84e-13 ***
## Residuals 161 615.4 3.82
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The one-way ANOVA results for Color Intensity across the three clusters show that the grouping variable (Group) has a significant effect. The F-value is 35.38, and the p-value is 1.84e-13, which is far below the 0.05 threshold. This indicates that there are statistically significant differences in Color Intensity between the clusters. Because Levene’s test indicated unequal variances for Color Intensity, this classical F-test should be read with caution.
The Sum of Squares (SS) is divided into two parts:
Between-group SS = 270.4, which represents the variation explained by the clusters. Residual SS = 615.4, which represents the unexplained variation within the clusters. The Mean Square (MS) for the group factor is 135.20, while the residual MS is 3.82, leading to a large F-value, which further supports that the groups differ significantly.
library(effectsize)
##
## Attaching package: 'effectsize'
## The following objects are masked from 'package:rstatix':
##
## cohens_d, eta_squared
eta_squared(fit)
## For one-way between subjects designs, partial eta squared is equivalent
## to eta squared. Returning eta squared.
## # Effect Size for ANOVA
##
## Parameter | Eta2 | 95% CI
## --------------------------------------
## as.factor(Group) | 0.31 | [0.21, 1.00]
##
## - One-sided CIs: upper bound fixed at [1.00].
interpret_eta_squared(0.31, rules = "cohen1992")
## [1] "large"
## (Rules: cohen1992)
The effect size calculation for ANOVA using eta-squared (η²) shows a value of 0.31 for the grouping factor (as.factor(Group)), with a one-sided 95% confidence interval ranging from 0.21 to 1.00 (the upper bound is fixed at 1.00). This effect size quantifies how much of the variance in Color Intensity is explained by the cluster grouping.
Under the Cohen (1992) rules applied by interpret_eta_squared(), a value of 0.31 is classified as a large effect. This means that the differences in Color Intensity across clusters are not only statistically significant but also substantial in practical terms. The clustering structure captures distinct variations in Color Intensity, reinforcing the conclusion that the clusters represent wines with different characteristics.
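For a one-way design, eta-squared is simply the between-group sum of squares divided by the total sum of squares, so it can be recomputed from the ANOVA table for Color Intensity printed above (variable names are illustrative):

```r
# Sums of squares copied from the ANOVA output for Color Intensity
ss_between <- 270.4
ss_resid   <- 615.4

# Eta-squared = explained variation / total variation
eta2 <- ss_between / (ss_between + ss_resid)
print(round(eta2, 2))
```

This reproduces the value of 0.31 reported by `eta_squared(fit)`: roughly 31% of the variation in Color Intensity is associated with cluster membership.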