library(readr)
mydata <- read_csv("WineQT.csv")
## Rows: 1143 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (13): fixed acidity, volatile acidity, citric acid, residual sugar, chlo...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(mydata)
## # A tibble: 6 × 13
## `fixed acidity` `volatile acidity` `citric acid` `residual sugar` chlorides
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7.4 0.7 0 1.9 0.076
## 2 7.8 0.88 0 2.6 0.098
## 3 7.8 0.76 0.04 2.3 0.092
## 4 11.2 0.28 0.56 1.9 0.075
## 5 7.4 0.7 0 1.9 0.076
## 6 7.4 0.66 0 1.8 0.075
## # ℹ 8 more variables: `free sulfur dioxide` <dbl>,
## # `total sulfur dioxide` <dbl>, density <dbl>, pH <dbl>, sulphates <dbl>,
## # alcohol <dbl>, quality <dbl>, Id <dbl>
Source: Kaggle
Unit of observation: one wine
Sample size: n = 1143
The data contain the 13 variables shown in the column specification above; descriptive statistics for all of them except Id follow:
library(pastecs)
round(stat.desc(mydata[ , c(-13)]), 2)
## fixed acidity volatile acidity citric acid residual sugar
## nbr.val 1143.00 1143.00 1143.00 1143.00
## nbr.null 0.00 0.00 99.00 0.00
## nbr.na 0.00 0.00 0.00 0.00
## min 4.60 0.12 0.00 0.90
## max 15.90 1.58 1.00 15.50
## range 11.30 1.46 1.00 14.60
## sum 9499.60 607.32 306.74 2894.25
## median 7.90 0.52 0.25 2.20
## mean 8.31 0.53 0.27 2.53
## SE.mean 0.05 0.01 0.01 0.04
## CI.mean.0.95 0.10 0.01 0.01 0.08
## var 3.05 0.03 0.04 1.84
## std.dev 1.75 0.18 0.20 1.36
## coef.var 0.21 0.34 0.73 0.54
## chlorides free sulfur dioxide total sulfur dioxide density pH
## nbr.val 1143.00 1143.00 1143.00 1143.00 1143.00
## nbr.null 0.00 0.00 0.00 0.00 0.00
## nbr.na 0.00 0.00 0.00 0.00 0.00
## min 0.01 1.00 6.00 0.99 2.74
## max 0.61 68.00 289.00 1.00 4.01
## range 0.60 67.00 283.00 0.01 1.27
## sum 99.36 17848.50 52480.50 1139.26 3784.49
## median 0.08 13.00 37.00 1.00 3.31
## mean 0.09 15.62 45.91 1.00 3.31
## SE.mean 0.00 0.30 0.97 0.00 0.00
## CI.mean.0.95 0.00 0.59 1.90 0.00 0.01
## var 0.00 105.07 1074.67 0.00 0.02
## std.dev 0.05 10.25 32.78 0.00 0.16
## coef.var 0.54 0.66 0.71 0.00 0.05
## sulphates alcohol quality
## nbr.val 1143.00 1143.00 1143.00
## nbr.null 0.00 0.00 0.00
## nbr.na 0.00 0.00 0.00
## min 0.33 8.40 3.00
## max 2.00 14.90 8.00
## range 1.67 6.50 5.00
## sum 751.76 11935.33 6466.00
## median 0.62 10.20 6.00
## mean 0.66 10.44 5.66
## SE.mean 0.01 0.03 0.02
## CI.mean.0.95 0.01 0.06 0.05
## var 0.03 1.17 0.65
## std.dev 0.17 1.08 0.81
## coef.var 0.26 0.10 0.14
Citric acid has 99 zero values (nbr.null). I will not remove them, because in this case the wines genuinely contain zero grams of citric acid per liter.
The average residual sugar is 2.53 g/L.
The alcohol content ranges from a minimum of 8.40% to a maximum of 14.90%.
Half of the wines have a pH of 3.31 or below, while the other half have a pH above 3.31.
I will use the following five variables for clustering: total sulfur dioxide, density, pH, residual sugar, and alcohol.
I will create a new data set, mydata_CluStd (Wine Cluster Variables Standardized), containing the five standardized cluster variables and the Id number:
mydata_CluStd <- as.data.frame(scale(mydata[c("total sulfur dioxide", "density", "pH", "residual sugar", "alcohol", "Id")]))
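As a quick sanity check (a minimal sketch, not part of the original output), the columns of mydata_CluStd should now have mean approximately 0 and standard deviation 1, since that is what scale() guarantees:
# Verify the standardization: means ~ 0, standard deviations = 1
round(colMeans(mydata_CluStd), 10)
apply(mydata_CluStd, 2, sd)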
I will calculate each unit's distance from the origin of the standardized space (i.e., from the variable means) and order the distances in descending order to check for potential outliers:
mydata$Dissimilarity <- sqrt(mydata_CluStd$`total sulfur dioxide`^2 + mydata_CluStd$density^2 + mydata_CluStd$pH^2 + mydata_CluStd$`residual sugar`^2 + mydata_CluStd$alcohol^2)
head(mydata[order(-mydata$Dissimilarity), c("Id", "Dissimilarity")])
## # A tibble: 6 × 2
## Id Dissimilarity
## <dbl> <dbl>
## 1 1434 10.4
## 2 480 10.2
## 3 1081 9.12
## 4 1474 9.07
## 5 1476 9.07
## 6 1079 8.85
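The same distances can be computed more compactly with rowSums(); a sketch that reproduces the calculation above, selecting only the five cluster variables (Id is excluded, as in the formula above):
# Equivalent, more compact computation of the Euclidean distance from the
# origin of the standardized space
clu_vars <- c("total sulfur dioxide", "density", "pH", "residual sugar", "alcohol")
mydata$Dissimilarity <- sqrt(rowSums(mydata_CluStd[, clu_vars]^2))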
I will remove the two units with the highest dissimilarities from the rest of the units (Id 1434 and 480). I will then standardize the data again (using the scale function) after removing them:
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:pastecs':
##
## first, last
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
mydata <- mydata %>%
filter(!Id %in% c(1434, 480))
mydata$Id <- seq(1, nrow(mydata))
mydata_CluStd <- as.data.frame(scale(mydata[c("total sulfur dioxide", "density", "pH", "residual sugar", "alcohol", "Id")]))
Next, I have to check whether the data are clusterable. I will use the Hopkins statistic and look at the dissimilarity matrix.
library(factoextra)
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
Distances <- get_dist(mydata_CluStd,
method = "euclidean")
fviz_dist(Distances,
gradient = list(low = "blue",
mid = "grey",
high = "white"))
The dissimilarity matrix has dimension n x n, i.e., 1141 x 1141 after removing the two outliers. From the matrix, I can observe some groups forming, but I will further check whether the data are clusterable using a formal test, the Hopkins statistic.
library(factoextra)
get_clust_tendency(mydata_CluStd,
n = nrow(mydata_CluStd) - 1,
graph = FALSE)
## $hopkins_stat
## [1] 0.8775994
##
## $plot
## NULL
The Hopkins statistic is 0.878, which is above the cut-off of 0.5; in fact, it is much closer to 1 than to 0.5, so I will proceed with the next step in clustering: determining how many clusters to create. I will use four approaches to find out.
First, I will use the hierarchical approach to see how many groups it proposes. However, the number of groups may differ from the K-means-based methods (Elbow method and Silhouette analysis) because the hierarchical approach does not reclassify units once they are merged.
library(dplyr)
library(factoextra)
WARD <- mydata_CluStd %>%
get_dist(method = "euclidean") %>%
hclust(method = "ward.D2")
library(factoextra)
fviz_dend(WARD)
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the factoextra package.
## Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
According to the cluster dendrogram, I should use 2 clusters. However, I will check further using the Elbow method and Silhouette analysis.
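If one wanted to inspect the two-cluster hierarchical solution directly, the dendrogram can be cut with cutree() (a sketch, not part of the original output):
# Cut the Ward tree into 2 groups and inspect the group sizes
ward_groups <- cutree(WARD, k = 2)
table(ward_groups)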
library(factoextra)
library(NbClust)
fviz_nbclust(mydata_CluStd, kmeans, method = "wss") +
labs(subtitle = "Elbow Method")
From the graph, we can observe an "elbow" forming at either 3 or 5 clusters. I will check this further using Silhouette analysis.
fviz_nbclust(mydata_CluStd, kmeans, method = "silhouette") +
labs(subtitle = "Silhouette analysis")
According to the Silhouette method, the optimal number of clusters is 3. Lastly, we can check what all the other indexes propose as the optimal number of clusters:
library(NbClust)
NbClust(mydata_CluStd,
distance = "euclidean",
min.nc = 2, max.nc = 10,
method = "kmeans",
index = "all")
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 7 proposed 2 as the best number of clusters
## * 7 proposed 3 as the best number of clusters
## * 1 proposed 5 as the best number of clusters
## * 1 proposed 7 as the best number of clusters
## * 6 proposed 8 as the best number of clusters
## * 1 proposed 9 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 2
##
##
## *******************************************************************
## $All.index
## KL CH Hartigan CCC Scott Marriot TrCovW TraceW
## 2 1.4015 345.1606 197.4638 -3.9732 1417.381 9.816376e+17 854134.1 5249.270
## 3 1.0559 300.9672 144.9435 -2.9442 2510.638 8.472497e+17 692896.0 4473.686
## 4 0.6782 274.2736 133.2776 -2.9159 3462.636 6.539332e+17 570367.8 3968.261
## 5 268.6059 262.9056 71.0133 -1.2713 3969.136 6.554902e+17 431900.7 3551.911
## 6 0.0027 237.4657 1.6457 -2.9458 4248.373 7.389993e+17 360184.3 3342.938
## 7 0.4834 198.2744 184.4730 -8.8669 4440.808 8.497508e+17 309533.2 3338.098
## 8 33.1869 223.7518 41.4020 2.0845 5201.886 5.696254e+17 240699.1 2871.051
## 9 0.3576 207.9287 51.6006 1.2272 5417.635 5.967256e+17 211842.6 2769.836
## 10 0.3071 198.8081 73.7797 1.7261 5766.334 5.427072e+17 198475.3 2649.081
## Friedman Rubin Cindex DB Silhouette Duda Pseudot2 Beale Ratkowsky
## 2 2.6170 1.3030 0.2391 1.7507 0.2320 1.2733 -108.6055 -0.8240 0.3146
## 3 4.7290 1.5289 0.2432 1.5211 0.2505 1.1660 -84.5470 -0.5463 0.3233
## 4 6.5883 1.7237 0.2266 1.6460 0.1880 1.8968 -277.9990 -1.8136 0.3039
## 5 7.1665 1.9257 0.2102 1.5565 0.1977 1.4556 -70.1149 -1.1970 0.3031
## 6 7.5860 2.0461 0.2049 1.5387 0.1914 1.6812 -88.7354 -1.5491 0.2892
## 7 8.0386 2.0491 0.1940 1.6710 0.1767 3.5648 -120.1536 -2.7420 0.2664
## 8 9.4703 2.3824 0.1927 1.5109 0.1905 3.1426 -107.7224 -2.5973 0.2689
## 9 9.8378 2.4695 0.2145 1.4493 0.1937 2.1641 -136.6303 -2.0490 0.2569
## 10 10.7713 2.5820 0.2092 1.4605 0.1855 1.8450 -114.4991 -1.7448 0.2473
## Ball Ptbiserial Frey McClain Dunn Hubert SDindex Dindex SDbw
## 2 2624.6352 0.3059 -0.5065 0.7201 0.0355 2e-04 1.8439 1.9304 1.1111
## 3 1491.2286 0.4583 1.0475 0.7968 0.0325 4e-04 2.1094 1.8314 1.1583
## 4 992.0652 0.4242 0.2199 1.4321 0.0335 5e-04 2.1163 1.7175 1.5339
## 5 710.3821 0.4419 0.5613 1.7846 0.0252 5e-04 1.9197 1.6199 1.3037
## 6 557.1563 0.4333 1.9836 2.0257 0.0148 5e-04 2.0400 1.5727 0.7271
## 7 476.8711 0.3728 0.0426 2.9949 0.0153 5e-04 1.9429 1.5285 0.7869
## 8 358.8814 0.3919 -0.1498 3.2566 0.0197 6e-04 1.9110 1.4433 0.6242
## 9 307.7595 0.3979 0.7589 3.2256 0.0185 6e-04 2.2172 1.4344 0.9059
## 10 264.9081 0.3711 0.4499 3.9259 0.0222 6e-04 2.2880 1.3907 0.8195
##
## $All.CriticalValues
## CritValue_Duda CritValue_PseudoT2 Fvalue_Beale
## 2 0.8140 115.5871 1
## 3 0.8115 137.9640 1
## 4 0.8010 146.0942 1
## 5 0.7605 70.5281 1
## 6 0.7585 69.7340 1
## 7 0.7278 62.4611 1
## 8 0.7246 60.0521 1
## 9 0.7238 96.9404 1
## 10 0.7246 95.0191 1
##
## $Best.nc
## KL CH Hartigan CCC Scott Marriot
## Number_clusters 5.0000 2.0000 7.0000 8.0000 3.000 8.000000e+00
## Value_Index 268.6059 345.1606 182.8273 2.0845 1093.257 3.072257e+17
## TrCovW TraceW Friedman Rubin Cindex DB Silhouette
## Number_clusters 3.0 8.0000 3.000 8.0000 8.0000 9.0000 3.0000
## Value_Index 161238.1 365.8319 2.112 -0.2463 0.1927 1.4493 0.2505
## Duda PseudoT2 Beale Ratkowsky Ball PtBiserial Frey
## Number_clusters 2.0000 2.0000 2.000 3.0000 3.000 3.0000 1
## Value_Index 1.2733 -108.6055 -0.824 0.3233 1133.407 0.4583 NA
## McClain Dunn Hubert SDindex Dindex SDbw
## Number_clusters 2.0000 2.0000 0 2.0000 0 8.0000
## Value_Index 0.7201 0.0355 0 1.8439 0 0.6242
##
## $Best.partition
## [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2
## [38] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2
## [75] 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 2 2 1 2 2 2 2 1 2 1 2 2 2 2 2 2 2 2 2 2 2 2
## [112] 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2
## [149] 1 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [186] 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [223] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 1 2 2 2 2 2 2 2 2 2 2 1 1 2 2 2 2 2 2 2 2
## [260] 2 2 2 1 2 2 2 1 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2
## [297] 2 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 1 2 2 2
## [334] 2 2 2 2 2 2 2 2 2 2 2 1 1 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2
## [371] 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2
## [408] 1 2 2 2 2 2 2 2 2 2 2 1 2 1 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2
## [445] 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [482] 2 2 1 2 1 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 1 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2
## [519] 1 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [556] 1 1 2 2 2 1 2 2 2 2 2 2 2 2 1 1 2 2 1 2 2 1 2 2 1 2 2 1 2 1 2 2 2 1 1 2 1
## [593] 1 1 1 2 1 1 1 1 2 2 1 2 1 2 2 1 1 1 1 2 2 2 2 2 1 1 1 1 1 1 2 2 2 1 2 2 2
## [630] 2 2 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 2 1 1 2 1 1 1 2 2 1 1 2 2 1 1 1 1 2 1
## [667] 1 1 1 1 1 1 1 1 1 1 1 2 2 1 2 1 1 1 2 2 1 2 2 1 2 1 2 2 1 2 2 2 2 2 1 1 1
## [704] 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 2 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 2 1 1
## [741] 2 1 2 1 1 2 1 1 1 1 1 1 1 1 1 2 1 1 2 2 2 2 1 2 1 1 2 2 1 1 1 1 1 1 2 1 1
## [778] 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1 2 2 2 1 1 1 1 1 1 1 1
## [815] 1 1 2 1 2 1 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1 2 1 1 1 1 1 1 2 1 1 1 2 1 1 1 2
## [852] 1 1 2 1 1 1 1 1 2 1 1 1 2 2 2 2 1 2 1 1 2 1 2 1 2 1 2 2 1 1 1 2 1 1 1 2 1
## [889] 1 2 1 2 1 1 1 1 1 1 1 1 2 1 2 1 1 2 1 1 1 1 2 1 2 2 2 1 1 1 2 1 1 1 1 1 1
## [926] 2 2 2 2 1 1 2 1 2 1 1 1 1 1 1 1 2 2 2 1 2 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1
## [963] 1 1 2 2 1 2 1 1 2 2 2 1 2 1 1 1 1 2 2 2 2 1 2 1 1 1 1 1 1 2 1 2 1 1 1 1 1
## [1000] 1 1 1 1 1 2 2 1 2 1 1 2 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 2 1 1 1 1 1 2 1 2 1
## [1037] 1 1 1 1 1 1 1 1 1 2 1 1 1 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 2 1 1 1
## [1074] 1 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 2
## [1111] 2 2 2 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Seven indexes proposed 2 as the best number of clusters, but another seven proposed 3. Weighing all of the analyses above (dendrogram, Elbow, Silhouette, indexes), I decide to use 3 clusters.
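Since kmeans starts from random centers, fixing the RNG seed would make the result below reproducible (a sketch; no seed appears in the original analysis, and 123 is an arbitrary choice):
set.seed(123)  # hypothetical seed; nstart = 25 already mitigates bad starts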
Clustering <- kmeans(mydata_CluStd,
centers = 3,
nstart = 25)
Clustering
## K-means clustering with 3 clusters of sizes 73, 435, 633
##
## Cluster means:
## total sulfur dioxide density pH residual sugar alcohol
## 1 0.6315611 1.1080968 -0.3571929 3.0796806 0.3945449
## 2 -0.3383862 -0.8273622 0.4563296 -0.2720966 0.7592351
## 3 0.1597062 0.4407764 -0.2723986 -0.1681748 -0.5672497
## Id
## 1 -0.1733101
## 2 0.7532038
## 3 -0.4976177
##
## Clustering vector:
## [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 3 1 3 3 3 2 3 3 3 3
## [38] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3
## [75] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3 2 3 2 3 3 3 3 3 3 3 1 1 1 3 3
## [112] 3 3 3 1 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3
## [149] 2 3 3 3 1 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [186] 3 3 3 3 3 2 3 3 3 3 3 3 1 1 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [223] 3 3 3 3 3 3 3 3 1 1 3 1 3 3 2 3 2 3 3 3 3 3 3 3 3 3 1 2 2 3 3 3 3 3 3 3 3
## [260] 1 1 3 2 3 3 3 2 1 3 3 3 2 3 3 1 1 3 3 1 3 3 3 3 3 3 3 3 1 3 1 3 3 3 3 2 3
## [297] 3 2 2 3 3 3 3 3 3 3 3 3 3 3 3 1 2 3 3 3 3 3 3 3 2 1 3 3 3 3 3 3 3 2 3 3 3
## [334] 3 3 3 3 3 3 3 3 3 3 3 2 2 3 1 3 3 3 3 3 3 1 1 3 2 3 3 3 3 3 3 3 1 1 3 3 3
## [371] 3 3 3 3 3 3 2 3 3 3 3 1 1 3 1 3 3 3 3 3 3 3 3 3 3 2 1 1 1 1 3 3 3 1 3 3 3
## [408] 2 3 3 3 3 3 3 3 3 3 3 2 3 2 3 3 1 3 3 3 3 3 1 2 3 3 3 3 3 3 3 3 3 1 3 3 3
## [445] 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 1 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [482] 3 3 2 3 2 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 1 2 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3
## [519] 2 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [556] 2 2 3 3 3 3 3 3 3 3 3 3 3 3 2 2 3 3 2 3 3 2 3 3 2 3 3 2 3 2 3 3 3 2 3 3 2
## [593] 2 2 2 3 2 2 2 2 3 3 2 3 2 3 3 2 2 2 2 3 3 3 3 3 2 2 2 2 2 2 3 3 3 2 1 3 3
## [630] 3 3 2 2 2 2 2 2 2 1 2 2 2 2 1 2 2 2 1 3 2 3 2 2 2 3 3 2 2 3 3 2 2 2 2 3 2
## [667] 2 2 2 2 2 2 2 2 2 2 2 3 3 2 3 2 2 1 3 3 2 3 3 2 3 2 3 3 2 3 3 3 3 3 2 2 3
## [704] 2 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 3 2 2 2 3 3 3 2 1 2 2 2 2 2 2 2 2 3 2 2
## [741] 3 2 3 2 2 3 2 2 2 2 2 2 2 2 2 1 2 2 1 1 1 3 2 3 2 2 3 3 2 2 2 2 1 1 3 2 2
## [778] 2 2 2 2 3 3 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 2 1 2 2 2 3 3 3 2 2 2 2 2 2 1 2
## [815] 2 2 3 2 3 2 2 2 3 2 2 2 2 2 3 3 2 1 2 2 2 3 2 2 2 2 2 2 3 2 2 2 3 2 2 2 1
## [852] 2 2 3 2 2 2 2 2 3 2 2 2 3 3 3 3 2 3 2 2 3 2 3 2 3 2 3 3 2 2 2 3 2 2 2 3 1
## [889] 2 3 2 3 2 2 2 2 2 2 2 2 3 2 3 1 2 3 2 2 2 2 3 2 3 1 1 2 2 2 3 2 2 2 2 2 2
## [926] 3 3 3 3 2 2 3 2 3 2 2 2 2 2 2 2 3 3 3 3 3 2 2 2 2 2 2 2 2 2 3 2 2 2 2 2 2
## [963] 2 2 1 3 2 3 2 2 3 3 3 2 3 2 2 2 2 3 3 3 3 2 1 2 2 2 2 3 3 3 2 3 2 2 2 1 2
## [1000] 2 2 2 2 1 3 3 2 3 2 2 3 2 3 2 2 2 2 2 2 2 2 3 2 2 2 2 1 2 2 2 2 2 3 2 3 2
## [1037] 2 2 2 2 2 2 2 2 3 3 2 2 2 1 2 1 2 1 2 2 2 2 2 2 2 2 2 2 3 2 2 2 2 3 2 2 2
## [1074] 2 2 2 2 2 2 3 3 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2 2 2 2 2 3 2 2 2 2 2 2 2 1
## [1111] 3 3 3 2 2 2 2 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##
## Within cluster sum of squares by cluster:
## [1] 704.8552 1489.0927 2279.7379
## (between_SS / total_SS = 34.6 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
library(factoextra)
fviz_cluster(Clustering,
palette = "Set1",
repel = FALSE,
ggtheme = theme_bw(),
data = mydata_CluStd)
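The outliers visible on the plot can also be surfaced programmatically by computing each unit's distance to the center of its assigned cluster (a hedged sketch; the object names follow the code above, and this would list the most extreme units):
# Distance from each unit to its assigned cluster center
centers_by_row <- Clustering$centers[Clustering$cluster, ]
dist_to_center <- sqrt(rowSums((mydata_CluStd - centers_by_row)^2))
head(mydata[order(-dist_to_center), "Id"])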
I will remove the units with Id numbers 1050, 232, 760, and 761 in order to create more homogeneous groups, and I will standardize the data again. Note that from this point on, Id is no longer included among the standardized clustering variables:
mydata <- mydata %>%
filter(!Id %in% c(232, 1050, 760, 761))
mydata$Id <- seq(1, nrow(mydata))
mydata_CluStd <- as.data.frame(scale(mydata[c("total sulfur dioxide", "density", "pH", "residual sugar", "alcohol")]))
Clustering <- kmeans(mydata_CluStd,
centers = 3,
nstart = 25)
Clustering
## K-means clustering with 3 clusters of sizes 73, 380, 684
##
## Cluster means:
## total sulfur dioxide density pH residual sugar alcohol
## 1 0.4452809 1.1364382 -0.2702118 3.0788004 0.4190285
## 2 -0.3625911 -0.9036230 0.5303141 -0.2307469 0.9781134
## 3 0.1539168 0.3807263 -0.2657805 -0.2003927 -0.5881172
##
## Clustering vector:
## [1] 3 3 3 3 3 3 3 2 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 2 3 1 3 3 3 2 3 3 3 3
## [38] 3 3 3 3 3 3 3 3 2 2 2 2 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 3 3 3 3 3 3
## [75] 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 2 3 2 3 3 3 3 2 3 2 3 2 2 2 3 3 3 1 1 1 3 3
## [112] 3 3 3 1 1 3 3 3 2 3 3 2 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3
## [149] 2 3 3 3 1 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [186] 3 3 3 3 3 2 3 3 3 3 3 3 1 1 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3
## [223] 3 3 3 3 2 3 3 3 1 3 1 3 3 2 3 2 3 3 3 3 3 3 3 3 3 1 2 2 3 2 3 3 3 3 3 3 1
## [260] 1 3 2 3 3 3 2 1 3 3 3 2 3 3 1 1 3 3 1 3 3 3 3 3 1 3 3 1 3 1 2 3 3 3 2 2 3
## [297] 2 2 3 3 3 3 3 3 3 3 3 3 3 3 1 2 3 2 3 3 3 3 3 2 1 3 3 3 2 3 3 3 2 3 3 3 3
## [334] 3 3 3 3 3 2 3 3 3 3 2 2 3 1 3 3 3 3 3 3 1 1 3 2 3 3 3 3 3 2 2 1 1 3 3 3 3
## [371] 2 3 3 3 3 2 3 3 3 2 1 1 3 1 3 3 3 3 3 3 3 3 3 3 2 1 1 1 1 3 3 3 1 3 3 3 2
## [408] 3 3 3 3 3 3 3 3 3 3 2 3 2 3 3 1 3 3 3 3 3 1 2 3 3 3 3 3 3 3 3 3 1 3 3 3 3
## [445] 3 3 3 1 3 3 3 3 3 3 3 3 2 2 1 3 1 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [482] 3 2 3 2 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 1 2 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 2
## [519] 3 2 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2
## [556] 2 3 3 3 3 3 3 3 3 3 3 3 3 2 2 3 3 2 3 3 2 3 3 2 3 3 2 3 2 3 3 3 2 3 3 2 2
## [593] 2 2 3 2 2 2 2 3 3 3 3 3 3 3 2 2 2 2 3 3 3 3 3 2 3 2 2 2 2 3 3 3 2 1 3 3 3
## [630] 3 2 2 2 2 2 2 2 1 2 2 2 2 1 2 2 2 1 3 2 3 2 2 2 3 3 2 2 3 3 2 2 2 2 3 2 2
## [667] 2 2 2 2 2 2 3 2 2 2 3 3 2 3 2 2 1 3 3 2 3 3 2 3 2 3 3 2 3 3 3 3 3 2 2 3 2
## [704] 2 2 2 2 3 3 3 2 2 2 2 2 2 2 2 2 3 2 2 2 3 3 3 2 1 2 3 2 2 3 2 3 3 3 2 2 3
## [741] 2 3 3 2 3 2 2 2 2 3 2 3 3 2 1 3 2 1 3 2 3 1 2 3 3 2 2 2 2 1 1 3 2 2 2 2 2
## [778] 2 3 3 3 2 2 2 2 2 2 2 2 2 2 2 2 3 3 2 1 2 2 3 3 3 3 3 2 2 3 2 3 1 2 2 2 3
## [815] 2 3 2 2 2 3 2 2 2 3 2 3 3 2 1 2 2 2 3 3 3 3 2 2 2 3 3 3 3 3 3 2 2 1 2 2 3
## [852] 2 2 2 2 3 3 2 2 2 3 3 3 3 2 3 2 3 3 2 3 2 3 2 3 3 2 3 2 3 3 3 2 3 1 2 3 2
## [889] 3 2 2 2 2 3 2 2 2 3 2 3 1 3 3 2 2 2 2 3 2 3 1 1 2 3 2 3 2 2 2 2 2 2 3 3 3
## [926] 3 2 2 3 2 3 2 2 2 2 2 2 3 3 3 3 3 3 3 2 3 3 3 2 2 2 2 3 3 2 3 2 3 2 2 2 1
## [963] 3 3 3 2 3 3 3 3 2 3 3 3 2 2 3 3 3 3 3 1 2 2 3 3 3 3 3 2 3 2 3 3 1 2 2 2 2
## [1000] 2 1 3 3 3 3 2 3 3 3 3 2 2 2 2 2 2 3 2 3 2 2 2 2 1 3 2 2 2 2 3 2 3 2 2 3 2
## [1037] 2 3 3 2 3 3 3 2 2 2 2 1 2 1 3 2 3 2 3 2 2 2 2 2 3 2 2 2 2 3 2 2 2 2 2 3 2
## [1074] 3 3 3 3 2 3 3 2 3 3 3 3 3 2 2 3 3 2 3 2 2 2 2 2 3 2 3 2 3 2 2 2 1 3 3 3 3
## [1111] 3 2 2 3 3 2 2 2 3 2 2 2 1 2 2 2 3 2 2 2 2 2 2 3 2 2 2
##
## Within cluster sum of squares by cluster:
## [1] 517.4663 1094.2405 1970.8128
## (between_SS / total_SS = 36.9 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
library(factoextra)
fviz_cluster(Clustering,
palette = "Set1",
repel = FALSE,
ggtheme = theme_bw(),
data = mydata_CluStd)
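Since the remove / renumber / restandardize / recluster cycle repeats several times, it could be wrapped in a small helper (a hedged sketch; the function and its name are my own, not part of the original analysis):
# Hypothetical helper: drop given Ids, renumber, restandardize, recluster
recluster <- function(data, drop_ids, vars, k = 3) {
  data <- dplyr::filter(data, !Id %in% drop_ids)
  data$Id <- seq_len(nrow(data))
  std <- as.data.frame(scale(data[vars]))
  list(data = data, std = std,
       fit = kmeans(std, centers = k, nstart = 25))
}
# Usage (hypothetical): res <- recluster(mydata, drop_ids = c(1048, 231),
#   vars = c("total sulfur dioxide", "density", "pH", "residual sugar", "alcohol"))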
Now I will remove the units with Id numbers 1048 and 231 and standardize the data again with the scale function.
mydata <- mydata %>%
filter(!Id %in% c(1048, 231))
mydata$Id <- seq(1, nrow(mydata))
mydata_CluStd <- as.data.frame(scale(mydata[c("total sulfur dioxide", "density", "pH", "residual sugar", "alcohol")]))
Clustering <- kmeans(mydata_CluStd,
centers = 3,
nstart = 25)
library(factoextra)
fviz_cluster(Clustering,
palette = "Set1",
repel = FALSE,
ggtheme = theme_bw(),
data = mydata_CluStd)
Clustering
## K-means clustering with 3 clusters of sizes 669, 382, 84
##
## Cluster means:
## total sulfur dioxide density pH residual sugar alcohol
## 1 0.1679786 0.3984463 -0.2740983 -0.2046639 -0.6026384
## 2 -0.3549890 -0.9142728 0.5270688 -0.2823040 0.9422405
## 3 0.2765254 0.9844247 -0.2139106 2.9138128 0.5146332
##
## Clustering vector:
## [1] 1 1 1 1 1 1 1 2 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 2 1 3 1 1 1 2 1 1 1 1
## [38] 1 1 1 1 1 1 1 1 3 3 2 2 1 1 1 2 1 1 1 1 2 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1
## [75] 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 1 2 1 1 1 1 2 1 2 1 2 2 2 1 1 1 3 3 3 1 1
## [112] 1 1 1 3 3 1 1 1 2 1 1 2 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1
## [149] 2 1 1 1 3 1 1 1 1 1 2 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [186] 1 1 1 1 1 2 1 3 3 3 1 3 3 3 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1
## [223] 1 1 1 1 2 1 1 1 3 3 1 1 2 3 2 1 1 1 1 1 1 1 1 1 3 2 2 1 2 1 1 1 1 1 1 3 3
## [260] 1 2 1 1 1 2 3 1 1 1 2 1 1 3 3 1 1 3 1 1 1 1 1 3 1 1 3 1 3 2 1 1 1 2 2 1 2
## [297] 2 1 1 1 1 1 1 1 1 1 1 1 1 3 2 1 2 1 1 1 1 1 2 3 1 1 1 2 1 1 1 2 1 1 1 1 1
## [334] 1 1 1 1 2 1 1 1 1 2 2 1 3 1 1 1 1 1 1 3 3 1 2 1 1 1 1 1 2 2 3 3 1 1 1 1 2
## [371] 1 1 1 1 2 1 1 1 2 3 3 1 3 1 1 1 1 1 1 1 1 1 1 2 3 3 3 3 1 1 1 3 1 1 1 2 1
## [408] 1 1 1 1 1 1 1 1 1 2 1 2 1 1 3 1 1 1 1 1 3 2 1 1 1 1 1 1 1 1 1 3 1 1 1 1 1
## [445] 1 1 3 1 1 1 1 1 1 1 1 2 2 3 1 3 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [482] 2 1 2 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 3 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1
## [519] 3 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2
## [556] 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 2 1 1 2 1 1 2 1 1 2 1 2 1 1 1 2 1 1 2 2 2
## [593] 2 1 2 2 2 2 1 1 1 1 1 1 1 2 2 2 2 1 1 1 1 1 2 1 2 2 2 2 1 1 1 2 3 1 1 1 1
## [630] 2 2 2 2 2 2 2 3 2 2 2 2 3 2 2 2 3 1 2 1 2 2 2 1 1 2 2 1 1 2 2 2 3 1 2 3 2
## [667] 2 2 2 2 2 1 2 2 2 1 1 2 1 2 2 3 1 1 2 1 1 2 1 2 1 1 2 1 1 1 1 1 2 2 1 2 2
## [704] 2 2 2 1 1 1 2 2 2 2 2 2 2 2 2 1 2 2 2 1 1 1 2 3 2 1 2 2 2 2 2 1 1 2 2 1 2
## [741] 1 1 2 1 2 2 2 2 1 2 1 1 2 3 1 2 3 1 2 1 3 2 1 1 2 2 2 2 3 3 1 2 2 2 2 2 2
## [778] 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 1 1 2 3 2 2 1 1 1 1 1 2 2 1 2 1 3 2 2 2 1 2
## [815] 1 2 2 2 1 2 2 2 1 2 1 1 2 3 2 2 2 1 1 1 1 3 2 2 1 1 1 1 1 1 2 2 3 2 2 1 2
## [852] 2 2 2 1 1 2 2 2 1 1 1 1 2 1 2 1 1 2 1 2 1 2 1 1 2 1 2 1 1 1 2 1 3 2 1 2 1
## [889] 2 2 2 2 1 2 2 2 1 2 1 3 1 1 2 2 2 2 1 2 1 3 3 2 1 2 1 2 2 2 2 2 2 1 1 1 1
## [926] 2 2 1 2 1 2 2 2 2 2 2 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 1 1 2 1 2 1 2 2 2 3 1
## [963] 1 1 2 1 1 1 1 2 1 1 1 2 2 1 1 1 1 1 3 2 2 1 1 1 1 1 2 1 2 1 1 3 2 2 2 2 2
## [1000] 3 1 1 1 1 2 1 1 1 1 2 2 2 2 2 2 1 2 1 3 2 2 2 3 1 2 2 2 2 1 2 1 2 2 1 2 2
## [1037] 1 1 2 1 1 1 2 2 2 2 2 3 1 2 1 2 1 2 2 2 2 2 1 2 2 2 2 1 2 2 2 2 2 1 2 1 1
## [1074] 1 1 2 1 1 2 2 1 1 1 1 2 2 1 1 2 1 2 2 2 2 2 1 2 1 2 1 2 2 2 3 1 1 1 1 1 2
## [1111] 2 1 1 2 2 2 1 2 2 2 3 2 2 2 1 2 2 2 2 2 2 1 2 2 2
##
## Within cluster sum of squares by cluster:
## [1] 1935.007 1090.579 527.816
## (between_SS / total_SS = 37.3 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
Finally, I arrived at an acceptable graphical representation of the clusters.
The first cluster has 669 units, the second 382 units, and the third 84 units.
The "Cluster means" field shows the final leaders/centers of the three groups.
The "Clustering vector" shows to which group (1st, 2nd, or 3rd) each of the 1135 units belongs.
The within-cluster sum of squares is highest in group 1 (1935.007).
The (between_SS / total_SS) ratio is 37.3%. This ratio increased by 2.7 percentage points (from the initial 34.6%) because I removed the units with very high distances from the group leaders, thereby decreasing the within-cluster distances (this can also be seen by comparing the initial and final within-cluster sums of squares).
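The reported ratio can be recomputed directly from the kmeans object (a sketch):
# between_SS / total_SS, as reported in the printed kmeans summary
Clustering$betweenss / Clustering$totss  # ~ 0.373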
Averages <- Clustering$centers
Averages
## total sulfur dioxide density pH residual sugar alcohol
## 1 0.1679786 0.3984463 -0.2740983 -0.2046639 -0.6026384
## 2 -0.3549890 -0.9142728 0.5270688 -0.2823040 0.9422405
## 3 0.2765254 0.9844247 -0.2139106 2.9138128 0.5146332
Figure <- as.data.frame(Averages)
Figure$id <- 1:nrow(Figure)
library(tidyr)
##
## Attaching package: 'tidyr'
## The following object is masked from 'package:pastecs':
##
## extract
Figure <- pivot_longer(Figure, cols = c("total sulfur dioxide", "density", "pH", "residual sugar", "alcohol"))
Figure$Group <- factor(Figure$id,
levels = c(1, 2, 3),
labels = c("1", "2", "3"))
Figure$ImeF <- factor(Figure$name,
levels = c("total sulfur dioxide", "density", "pH", "residual sugar", "alcohol"),
labels = c("total sulfur dioxide", "density", "pH", "residual sugar", "alcohol"))
library(ggplot2)
ggplot(Figure, aes(x = ImeF, y = value)) +
geom_hline(yintercept = 0) +
theme_bw() +
geom_point(aes(shape = Group, col = Group), size = 3) +
geom_line(aes(group = id), linewidth = 1) +
ylab("Averages") +
xlab("Cluster variables") +
scale_color_brewer(palette="Set1") +
ylim(-3, 3) +
theme(axis.text.x = element_text(angle = 45, vjust = 0.50, size = 10))
Explanation of each cluster variable and what a high or low value means (explanations adapted from ChatGPT):
Total Sulfur Dioxide: High levels preserve the wine longer but may create a chemical taste, while low levels offer a more natural wine but reduce shelf life and stability.
Density: High density indicates a fuller-bodied, rich wine, while low density suggests a lighter, often drier wine.
pH: A low pH (high acidity) produces crisp, fresh wines with aging potential, while a high pH (low acidity) can result in softer, less structured wines prone to quicker spoilage.
Residual Sugar: High residual sugar contributes to sweeter wines, whereas low sugar indicates a dry, non-sweet wine.
Alcohol: High alcohol adds warmth and body but can overpower flavors, while low alcohol leads to lighter, more refreshing wines that may feel thin without proper balance.
mydata$Group <- Clustering$cluster
fit <- aov(cbind(`total sulfur dioxide`, density, pH, `residual sugar`, alcohol) ~ as.factor(Group),
data = mydata)
summary(fit)
## Response total sulfur dioxide :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 2 71784 35892 39.193 < 2.2e-16 ***
## Residuals 1132 1036663 916
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response density :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 2 0.0017939 0.00089693 457.55 < 2.2e-16 ***
## Residuals 1132 0.0022190 0.00000196
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response pH :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 2 3.9139 1.95695 93.13 < 2.2e-16 ***
## Residuals 1132 23.7868 0.02101
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response residual sugar :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 2 882.40 441.20 1205.3 < 2.2e-16 ***
## Residuals 1132 414.35 0.37
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response alcohol :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 2 702.64 351.32 645.84 < 2.2e-16 ***
## Residuals 1132 615.78 0.54
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
For each clustering variable (total sulfur dioxide, density, pH, ...), the hypotheses are:
H0: μ(G1) = μ(G2) = μ(G3)
H1: at least one arithmetic mean is different
For all variables, the p-value is less than 0.001, so we can reject H0 in every case. This means the clustering variables successfully differentiate the groups.
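As a rough gauge of how strongly each variable separates the groups, an eta-squared effect size can be computed from the printed sums of squares; e.g., for residual sugar (a sketch using the values above):
# eta-squared for residual sugar, from the ANOVA table above:
# SS_between / (SS_between + SS_residual)
882.40 / (882.40 + 414.35)  # ~ 0.68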
To evaluate whether I have chosen reasonable clusters, I will use fixed acidity as a criterion-validity check, i.e., verify whether the groups I created also differ on a variable that was not used for clustering.
aggregate(mydata$`fixed acidity`,
by = list(mydata$Group),
FUN = mean)
## Group.1 x
## 1 1 8.685052
## 2 2 7.420419
## 3 3 9.265476
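A compact alternative using dplyr, which also adds the group standard deviations and sizes (a sketch, assuming the column names above):
mydata %>%
  group_by(Group) %>%
  summarise(mean_fixed_acidity = mean(`fixed acidity`),
            sd_fixed_acidity   = sd(`fixed acidity`),
            n = n())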
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
leveneTest(mydata$`fixed acidity`, as.factor(mydata$Group))
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 2 25.098 2.162e-11 ***
## 1132
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Hypotheses:
H0: σ²(G1) = σ²(G2) = σ²(G3) (the group variances are equal)
H1: at least one variance is different (the group variances are not all equal)
We can reject H0 (p < 0.001), meaning there is heterogeneity among the group variances.
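Because the group variances are unequal, Welch's one-way test, which does not assume homogeneity of variance, would be a possible alternative to the classical ANOVA used below (a sketch, not part of the original analysis):
# Welch's ANOVA: robust to unequal group variances
oneway.test(`fixed acidity` ~ as.factor(Group), data = mydata)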
colnames(mydata)
## [1] "fixed acidity" "volatile acidity" "citric acid"
## [4] "residual sugar" "chlorides" "free sulfur dioxide"
## [7] "total sulfur dioxide" "density" "pH"
## [10] "sulphates" "alcohol" "quality"
## [13] "Id" "Dissimilarity" "Group"
I will next perform the Shapiro–Wilk test. Because the shapiro_test function did not work on my variable due to the space in its name, I will rename the variable fixed_acidity.
mydata <- mydata %>%
rename(fixed_acidity = "fixed acidity")
library(dplyr)
library(rstatix)
##
## Attaching package: 'rstatix'
## The following object is masked from 'package:stats':
##
## filter
mydata %>%
group_by(Group) %>%
shapiro_test(`fixed_acidity`)
## # A tibble: 3 × 4
## Group variable statistic p
## <int> <chr> <dbl> <dbl>
## 1 1 fixed_acidity 0.916 7.59e-19
## 2 2 fixed_acidity 0.981 5.82e- 5
## 3 3 fixed_acidity 0.900 7.54e- 6
Hypotheses for the first row (Group 1):
H0: the variable fixed acidity is normally distributed in G1
H1: the variable fixed acidity is not normally distributed in G1
The same hypotheses apply to rows 2 and 3 (G2 and G3). We reject all three null hypotheses (p < 0.001): the variable is not normally distributed in any of the groups, so I will use the Kruskal–Wallis test. For comparison, I also show the classical ANOVA first:
fit <- aov(fixed_acidity ~ as.factor(Group),
data = mydata)
summary(fit)
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 2 473 236.51 89.4 <2e-16 ***
## Residuals 1132 2995 2.65
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
kruskal.test(fixed_acidity ~ as.factor(Group),
data = mydata)
##
## Kruskal-Wallis rank sum test
##
## data: fixed_acidity by as.factor(Group)
## Kruskal-Wallis chi-squared = 154.06, df = 2, p-value < 2.2e-16
H0: the location of the distribution of fixed acidity is the same in G1, G2, and G3
H1: at least one location is different
We can reject H0 (p < 0.001): there is a statistically significant difference in the distribution of fixed acidity between G1, G2, and G3.
kruskal_effsize(fixed_acidity ~ as.factor(Group),
data = mydata)
## # A tibble: 1 × 5
## .y. n effsize method magnitude
## * <chr> <int> <dbl> <chr> <ord>
## 1 fixed_acidity 1135 0.134 eta2[H] moderate
The effect size for this test is moderate.
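As a possible follow-up (not part of the original analysis), pairwise group comparisons could be made with Dunn's test from rstatix (a sketch):
# Dunn's post-hoc test with Holm-adjusted p-values
mydata %>%
  mutate(Group = as.factor(Group)) %>%
  dunn_test(fixed_acidity ~ Group, p.adjust.method = "holm")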
For the conclusion, I used ChatGPT to help describe the groups based on the clustering variables, since I am not a wine expert:
Group 1: Barely Boozy Bliss - not too strong, not too light… just right!
Group characteristics: total sulfur dioxide is very close to the average, density is a little above average, and pH and residual sugar are below average. The alcohol level is below average and the lowest among the three clusters. This cluster accounts for the largest share of wines (58.94%).
Description: These wines are likely dry white wines or lighter red wines with lower alcohol content. The below-average residual sugar and pH indicate wines that are more acidic and not sweet. With the highest representation, these wines may include popular table wines meant for easy drinking, such as Pinot Grigio, Sauvignon Blanc, or light reds like Pinot Noir. These wines pair well with light dishes such as seafood, chicken, salads, or creamy pasta. The acidity helps cut through rich or creamy flavors, while their light profile complements delicate foods.
Group 2: Robust Balance - delivering a strong, yet smooth experience.
Group characteristics: low total sulfur dioxide and density (the lowest among the three groups), above-average pH, slightly below-average residual sugar, and the highest alcohol level among the groups. This cluster accounts for the second largest share of wines (33.66%).
Description: These wines may be stronger red or white wines with a robust alcohol profile. The slightly below-average residual sugar and above-average pH suggest wines with more balance between acidity and sweetness. Examples could include rich reds like Merlot or Syrah, or fuller-bodied whites like Chardonnay. These wines may have a more intense flavor and mouthfeel. They pair well with hearty dishes such as grilled meats, rich stews, roasted vegetables, or creamy cheeses. The fuller body of these wines enhances savory, rich flavors.
Group 3: Dessert Friend - easygoing wines with rich sweetness.
Group characteristics: above-average total sulfur dioxide and density, below-average pH, extremely high residual sugar, and an above-average alcohol level. This cluster accounts for the smallest share of wines (7.40%).
Description: This cluster likely includes sweet dessert or fortified wines. The extremely high residual sugar and above-average alcohol content, paired with lower pH, point to rich, sweet wines like Port, Late Harvest Riesling, or Sauternes. These wines are full-bodied and often enjoyed as dessert wines due to their sweetness and alcohol strength. They are perfect with desserts such as fruit tarts, crème brûlée, chocolate-based desserts, or blue cheeses. The sweetness balances the richness of desserts or contrasts beautifully with salty cheeses.