Homework no. 2 - Clustering

library(readr)
mydata <- read_csv("WineQT.csv")
## Rows: 1143 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (13): fixed acidity, volatile acidity, citric acid, residual sugar, chlo...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(mydata)
## # A tibble: 6 × 13
##   `fixed acidity` `volatile acidity` `citric acid` `residual sugar` chlorides
##             <dbl>              <dbl>         <dbl>            <dbl>     <dbl>
## 1             7.4               0.7           0                 1.9     0.076
## 2             7.8               0.88          0                 2.6     0.098
## 3             7.8               0.76          0.04              2.3     0.092
## 4            11.2               0.28          0.56              1.9     0.075
## 5             7.4               0.7           0                 1.9     0.076
## 6             7.4               0.66          0                 1.8     0.075
## # ℹ 8 more variables: `free sulfur dioxide` <dbl>,
## #   `total sulfur dioxide` <dbl>, density <dbl>, pH <dbl>, sulphates <dbl>,
## #   alcohol <dbl>, quality <dbl>, Id <dbl>

Source: Kaggle

Unit of observation: 1 wine

Sample size = n = 1143

The data contain the following 13 variables: fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol, quality, and Id.

Descriptive Statistics

library(pastecs)
round(stat.desc(mydata[ , c(-13)]), 2)
##              fixed acidity volatile acidity citric acid residual sugar
## nbr.val            1143.00          1143.00     1143.00        1143.00
## nbr.null              0.00             0.00       99.00           0.00
## nbr.na                0.00             0.00        0.00           0.00
## min                   4.60             0.12        0.00           0.90
## max                  15.90             1.58        1.00          15.50
## range                11.30             1.46        1.00          14.60
## sum                9499.60           607.32      306.74        2894.25
## median                7.90             0.52        0.25           2.20
## mean                  8.31             0.53        0.27           2.53
## SE.mean               0.05             0.01        0.01           0.04
## CI.mean.0.95          0.10             0.01        0.01           0.08
## var                   3.05             0.03        0.04           1.84
## std.dev               1.75             0.18        0.20           1.36
## coef.var              0.21             0.34        0.73           0.54
##              chlorides free sulfur dioxide total sulfur dioxide density      pH
## nbr.val        1143.00             1143.00              1143.00 1143.00 1143.00
## nbr.null          0.00                0.00                 0.00    0.00    0.00
## nbr.na            0.00                0.00                 0.00    0.00    0.00
## min               0.01                1.00                 6.00    0.99    2.74
## max               0.61               68.00               289.00    1.00    4.01
## range             0.60               67.00               283.00    0.01    1.27
## sum              99.36            17848.50             52480.50 1139.26 3784.49
## median            0.08               13.00                37.00    1.00    3.31
## mean              0.09               15.62                45.91    1.00    3.31
## SE.mean           0.00                0.30                 0.97    0.00    0.00
## CI.mean.0.95      0.00                0.59                 1.90    0.00    0.01
## var               0.00              105.07              1074.67    0.00    0.02
## std.dev           0.05               10.25                32.78    0.00    0.16
## coef.var          0.54                0.66                 0.71    0.00    0.05
##              sulphates  alcohol quality
## nbr.val        1143.00  1143.00 1143.00
## nbr.null          0.00     0.00    0.00
## nbr.na            0.00     0.00    0.00
## min               0.33     8.40    3.00
## max               2.00    14.90    8.00
## range             1.67     6.50    5.00
## sum             751.76 11935.33 6466.00
## median            0.62    10.20    6.00
## mean              0.66    10.44    5.66
## SE.mean           0.01     0.03    0.02
## CI.mean.0.95      0.01     0.06    0.05
## var               0.03     1.17    0.65
## std.dev           0.17     1.08    0.81
## coef.var          0.26     0.10    0.14

Citric acid has 99 zero values (nbr.null). I will not remove these units because, in this case, a zero simply means the wine possesses zero grams of citric acid per liter.

The average residual sugar is 2.53 g/L.

The maximum amount of alcohol contained in a wine is 14.90%, while the minimum is 8.40%.

50% of the wines have a pH value of 3.31 or below, while the other 50% have a pH value higher than 3.31.

I will use the following five variables for clustering: total sulfur dioxide, density, pH, residual sugar, and alcohol.

I will create a new data set mydata_CluStd (Wine Cluster Variables Standardized) containing only the five standardized cluster variables and the Id number:

mydata_CluStd <- as.data.frame(scale(mydata[c("total sulfur dioxide", "density", "pH", "residual sugar", "alcohol", "Id")]))
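
As a quick sanity check (a sketch, not part of the original analysis), each standardized column should now have a mean of 0 and a standard deviation of 1; alcohol is used here as an example:

# Sanity check: scale() centers each column to mean ~0 and rescales it to sd 1
round(mean(mydata_CluStd$alcohol), 10)
sd(mydata_CluStd$alcohol)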

I will calculate each unit's Euclidean distance from the origin of the standardized space and order the distances in descending order to check for potential outliers:

mydata$Dissimilarity <- sqrt(mydata_CluStd$`total sulfur dioxide`^2 + mydata_CluStd$density^2 + mydata_CluStd$pH^2 + mydata_CluStd$`residual sugar`^2 + mydata_CluStd$alcohol^2)
head(mydata[order(-mydata$Dissimilarity), c("Id", "Dissimilarity")])
## # A tibble: 6 × 2
##      Id Dissimilarity
##   <dbl>         <dbl>
## 1  1434         10.4 
## 2   480         10.2 
## 3  1081          9.12
## 4  1474          9.07
## 5  1476          9.07
## 6  1079          8.85
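
As a side note, the same distances can be computed more compactly with rowSums(); a minimal sketch, assuming the five cluster variables keep the names used above:

# Compact equivalent of the distance computation above: Euclidean distance
# of each unit from the origin of the standardized space
clu_vars <- c("total sulfur dioxide", "density", "pH",
              "residual sugar", "alcohol")
mydata$Dissimilarity <- sqrt(rowSums(mydata_CluStd[clu_vars]^2))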

I will remove the first 2 units (Id 1434 and 480), since they have the highest dissimilarities from the rest of the units. I will also standardize the data again (using the scale function) after removing the two units:

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:pastecs':
## 
##     first, last
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
mydata <- mydata %>%
  filter(!Id %in% c(1434, 480))

mydata$Id <- seq(1, nrow(mydata))

mydata_CluStd <- as.data.frame(scale(mydata[c("total sulfur dioxide", "density", "pH", "residual sugar", "alcohol", "Id")]))

Checking if data is clusterable

Next, I have to check whether the data are clusterable. I will use the Hopkins statistic and inspect the dissimilarity matrix.

Dissimilarity matrix

library(factoextra)
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
Distances <- get_dist(mydata_CluStd,
                      method = "euclidean")

fviz_dist(Distances, 
          gradient = list(low = "blue",
                          mid = "grey",
                          high = "white"))

The dimension of the dissimilarity matrix is “n x n”, here “1141 x 1141” after removing the two outlying units. From the matrix, I can observe some groups forming, but I will further check whether the data are clusterable using a formal measure, the Hopkins statistic.

Hopkins statistic

library(factoextra)
get_clust_tendency(mydata_CluStd,
                   n = nrow(mydata_CluStd) - 1,
                   graph = FALSE)
## $hopkins_stat
## [1] 0.8775994
## 
## $plot
## NULL

The Hopkins statistic is 0.878, which is above the cut-off point of 0.5. In fact, it is much closer to 1 than to 0.5, so I will proceed with the next step in clustering, which is determining how many clusters to create. I will use four approaches to find out.

Determining how many clusters to create

First, I will use the hierarchical approach to see how many groups it proposes. However, the proposed number of groups may differ from the K-means-based methods (Elbow method and Silhouette analysis) because the hierarchical approach never reassigns a unit once it has been merged into a group.

Hierarchical approach - Dendrogram

library(dplyr)
library(factoextra)
WARD <- mydata_CluStd %>%
  get_dist(method = "euclidean") %>%
  hclust(method = "ward.D2")

library(factoextra)
fviz_dend(WARD)
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the factoextra package.
##   Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

According to the cluster dendrogram, I should use 2 clusters. However, I will further check this using the Elbow method and Silhouette analysis.
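
If I wanted the actual group memberships implied by the dendrogram, cutree() can cut the Ward tree at a chosen number of clusters; a minimal sketch (illustrative only, since I proceed with K-means below):

# Cut the Ward dendrogram into the 2 suggested groups and tabulate their sizes
ward_groups <- cutree(WARD, k = 2)
table(ward_groups)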

K-means approach - Elbow Method

library(factoextra)
library(NbClust)
fviz_nbclust(mydata_CluStd, kmeans, method = "wss") +
  labs(subtitle = "Elbow Method")

From the graph, we can observe an “elbow” forming at either 3 or 5 clusters. I will further check this using Silhouette analysis.
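
For reference, the curve that fviz_nbclust plots can be reproduced by hand; a minimal sketch, assuming k runs from 1 to 10:

# Manual elbow computation: total within-cluster sum of squares for each k
wss <- sapply(1:10, function(k)
  kmeans(mydata_CluStd, centers = k, nstart = 25)$tot.withinss)
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")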

K-means approach - Silhouette Analysis

fviz_nbclust(mydata_CluStd, kmeans, method = "silhouette") +
  labs(subtitle = "Silhouette analysis")

According to the Silhouette method, the optimal number of clusters is 3. Lastly, we can check which number of clusters the other indices propose:
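
The average silhouette width behind this plot can also be verified manually for a single k; a sketch using silhouette() from the cluster package:

# Average silhouette width for k = 3 (the quantity fviz_nbclust maximizes over k)
library(cluster)
km3 <- kmeans(mydata_CluStd, centers = 3, nstart = 25)
sil <- silhouette(km3$cluster, dist(mydata_CluStd))
mean(sil[, "sil_width"])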

Other indices

library(NbClust)
NbClust(mydata_CluStd,
        distance = "euclidean",
        min.nc = 2, max.nc = 10,
        method = "kmeans",
        index = "all")

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
## 

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 7 proposed 2 as the best number of clusters 
## * 7 proposed 3 as the best number of clusters 
## * 1 proposed 5 as the best number of clusters 
## * 1 proposed 7 as the best number of clusters 
## * 6 proposed 8 as the best number of clusters 
## * 1 proposed 9 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  2 
##  
##  
## *******************************************************************
## $All.index
##          KL       CH Hartigan     CCC    Scott      Marriot   TrCovW   TraceW
## 2    1.4015 345.1606 197.4638 -3.9732 1417.381 9.816376e+17 854134.1 5249.270
## 3    1.0559 300.9672 144.9435 -2.9442 2510.638 8.472497e+17 692896.0 4473.686
## 4    0.6782 274.2736 133.2776 -2.9159 3462.636 6.539332e+17 570367.8 3968.261
## 5  268.6059 262.9056  71.0133 -1.2713 3969.136 6.554902e+17 431900.7 3551.911
## 6    0.0027 237.4657   1.6457 -2.9458 4248.373 7.389993e+17 360184.3 3342.938
## 7    0.4834 198.2744 184.4730 -8.8669 4440.808 8.497508e+17 309533.2 3338.098
## 8   33.1869 223.7518  41.4020  2.0845 5201.886 5.696254e+17 240699.1 2871.051
## 9    0.3576 207.9287  51.6006  1.2272 5417.635 5.967256e+17 211842.6 2769.836
## 10   0.3071 198.8081  73.7797  1.7261 5766.334 5.427072e+17 198475.3 2649.081
##    Friedman  Rubin Cindex     DB Silhouette   Duda  Pseudot2   Beale Ratkowsky
## 2    2.6170 1.3030 0.2391 1.7507     0.2320 1.2733 -108.6055 -0.8240    0.3146
## 3    4.7290 1.5289 0.2432 1.5211     0.2505 1.1660  -84.5470 -0.5463    0.3233
## 4    6.5883 1.7237 0.2266 1.6460     0.1880 1.8968 -277.9990 -1.8136    0.3039
## 5    7.1665 1.9257 0.2102 1.5565     0.1977 1.4556  -70.1149 -1.1970    0.3031
## 6    7.5860 2.0461 0.2049 1.5387     0.1914 1.6812  -88.7354 -1.5491    0.2892
## 7    8.0386 2.0491 0.1940 1.6710     0.1767 3.5648 -120.1536 -2.7420    0.2664
## 8    9.4703 2.3824 0.1927 1.5109     0.1905 3.1426 -107.7224 -2.5973    0.2689
## 9    9.8378 2.4695 0.2145 1.4493     0.1937 2.1641 -136.6303 -2.0490    0.2569
## 10  10.7713 2.5820 0.2092 1.4605     0.1855 1.8450 -114.4991 -1.7448    0.2473
##         Ball Ptbiserial    Frey McClain   Dunn Hubert SDindex Dindex   SDbw
## 2  2624.6352     0.3059 -0.5065  0.7201 0.0355  2e-04  1.8439 1.9304 1.1111
## 3  1491.2286     0.4583  1.0475  0.7968 0.0325  4e-04  2.1094 1.8314 1.1583
## 4   992.0652     0.4242  0.2199  1.4321 0.0335  5e-04  2.1163 1.7175 1.5339
## 5   710.3821     0.4419  0.5613  1.7846 0.0252  5e-04  1.9197 1.6199 1.3037
## 6   557.1563     0.4333  1.9836  2.0257 0.0148  5e-04  2.0400 1.5727 0.7271
## 7   476.8711     0.3728  0.0426  2.9949 0.0153  5e-04  1.9429 1.5285 0.7869
## 8   358.8814     0.3919 -0.1498  3.2566 0.0197  6e-04  1.9110 1.4433 0.6242
## 9   307.7595     0.3979  0.7589  3.2256 0.0185  6e-04  2.2172 1.4344 0.9059
## 10  264.9081     0.3711  0.4499  3.9259 0.0222  6e-04  2.2880 1.3907 0.8195
## 
## $All.CriticalValues
##    CritValue_Duda CritValue_PseudoT2 Fvalue_Beale
## 2          0.8140           115.5871            1
## 3          0.8115           137.9640            1
## 4          0.8010           146.0942            1
## 5          0.7605            70.5281            1
## 6          0.7585            69.7340            1
## 7          0.7278            62.4611            1
## 8          0.7246            60.0521            1
## 9          0.7238            96.9404            1
## 10         0.7246            95.0191            1
## 
## $Best.nc
##                       KL       CH Hartigan    CCC    Scott      Marriot
## Number_clusters   5.0000   2.0000   7.0000 8.0000    3.000 8.000000e+00
## Value_Index     268.6059 345.1606 182.8273 2.0845 1093.257 3.072257e+17
##                   TrCovW   TraceW Friedman   Rubin Cindex     DB Silhouette
## Number_clusters      3.0   8.0000    3.000  8.0000 8.0000 9.0000     3.0000
## Value_Index     161238.1 365.8319    2.112 -0.2463 0.1927 1.4493     0.2505
##                   Duda  PseudoT2  Beale Ratkowsky     Ball PtBiserial Frey
## Number_clusters 2.0000    2.0000  2.000    3.0000    3.000     3.0000    1
## Value_Index     1.2733 -108.6055 -0.824    0.3233 1133.407     0.4583   NA
##                 McClain   Dunn Hubert SDindex Dindex   SDbw
## Number_clusters  2.0000 2.0000      0  2.0000      0 8.0000
## Value_Index      0.7201 0.0355      0  1.8439      0 0.6242
## 
## $Best.partition
##    [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2
##   [38] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2
##   [75] 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 2 2 1 2 2 2 2 1 2 1 2 2 2 2 2 2 2 2 2 2 2 2
##  [112] 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2
##  [149] 1 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [186] 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [223] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 1 2 2 2 2 2 2 2 2 2 2 1 1 2 2 2 2 2 2 2 2
##  [260] 2 2 2 1 2 2 2 1 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2
##  [297] 2 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 1 2 2 2
##  [334] 2 2 2 2 2 2 2 2 2 2 2 1 1 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2
##  [371] 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2
##  [408] 1 2 2 2 2 2 2 2 2 2 2 1 2 1 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [445] 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [482] 2 2 1 2 1 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 1 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [519] 1 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [556] 1 1 2 2 2 1 2 2 2 2 2 2 2 2 1 1 2 2 1 2 2 1 2 2 1 2 2 1 2 1 2 2 2 1 1 2 1
##  [593] 1 1 1 2 1 1 1 1 2 2 1 2 1 2 2 1 1 1 1 2 2 2 2 2 1 1 1 1 1 1 2 2 2 1 2 2 2
##  [630] 2 2 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 2 1 1 2 1 1 1 2 2 1 1 2 2 1 1 1 1 2 1
##  [667] 1 1 1 1 1 1 1 1 1 1 1 2 2 1 2 1 1 1 2 2 1 2 2 1 2 1 2 2 1 2 2 2 2 2 1 1 1
##  [704] 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 2 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 2 1 1
##  [741] 2 1 2 1 1 2 1 1 1 1 1 1 1 1 1 2 1 1 2 2 2 2 1 2 1 1 2 2 1 1 1 1 1 1 2 1 1
##  [778] 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1 2 2 2 1 1 1 1 1 1 1 1
##  [815] 1 1 2 1 2 1 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1 2 1 1 1 1 1 1 2 1 1 1 2 1 1 1 2
##  [852] 1 1 2 1 1 1 1 1 2 1 1 1 2 2 2 2 1 2 1 1 2 1 2 1 2 1 2 2 1 1 1 2 1 1 1 2 1
##  [889] 1 2 1 2 1 1 1 1 1 1 1 1 2 1 2 1 1 2 1 1 1 1 2 1 2 2 2 1 1 1 2 1 1 1 1 1 1
##  [926] 2 2 2 2 1 1 2 1 2 1 1 1 1 1 1 1 2 2 2 1 2 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1
##  [963] 1 1 2 2 1 2 1 1 2 2 2 1 2 1 1 1 1 2 2 2 2 1 2 1 1 1 1 1 1 2 1 2 1 1 1 1 1
## [1000] 1 1 1 1 1 2 2 1 2 1 1 2 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 2 1 1 1 1 1 2 1 2 1
## [1037] 1 1 1 1 1 1 1 1 1 2 1 1 1 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 2 1 1 1
## [1074] 1 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 2
## [1111] 2 2 2 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

Seven indices proposed 2 as the best number of clusters, but another seven proposed 3. So, combining the three analyses above (Elbow, Silhouette, and the NbClust indices), I decide to use 3 clusters.

Creating clusters

Clustering <- kmeans(mydata_CluStd,
                     centers = 3,
                     nstart = 25)
Clustering
## K-means clustering with 3 clusters of sizes 73, 435, 633
## 
## Cluster means:
##   total sulfur dioxide    density         pH residual sugar    alcohol
## 1            0.6315611  1.1080968 -0.3571929      3.0796806  0.3945449
## 2           -0.3383862 -0.8273622  0.4563296     -0.2720966  0.7592351
## 3            0.1597062  0.4407764 -0.2723986     -0.1681748 -0.5672497
##           Id
## 1 -0.1733101
## 2  0.7532038
## 3 -0.4976177
## 
## Clustering vector:
##    [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 3 1 3 3 3 2 3 3 3 3
##   [38] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3
##   [75] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3 2 3 2 3 3 3 3 3 3 3 1 1 1 3 3
##  [112] 3 3 3 1 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3
##  [149] 2 3 3 3 1 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
##  [186] 3 3 3 3 3 2 3 3 3 3 3 3 1 1 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
##  [223] 3 3 3 3 3 3 3 3 1 1 3 1 3 3 2 3 2 3 3 3 3 3 3 3 3 3 1 2 2 3 3 3 3 3 3 3 3
##  [260] 1 1 3 2 3 3 3 2 1 3 3 3 2 3 3 1 1 3 3 1 3 3 3 3 3 3 3 3 1 3 1 3 3 3 3 2 3
##  [297] 3 2 2 3 3 3 3 3 3 3 3 3 3 3 3 1 2 3 3 3 3 3 3 3 2 1 3 3 3 3 3 3 3 2 3 3 3
##  [334] 3 3 3 3 3 3 3 3 3 3 3 2 2 3 1 3 3 3 3 3 3 1 1 3 2 3 3 3 3 3 3 3 1 1 3 3 3
##  [371] 3 3 3 3 3 3 2 3 3 3 3 1 1 3 1 3 3 3 3 3 3 3 3 3 3 2 1 1 1 1 3 3 3 1 3 3 3
##  [408] 2 3 3 3 3 3 3 3 3 3 3 2 3 2 3 3 1 3 3 3 3 3 1 2 3 3 3 3 3 3 3 3 3 1 3 3 3
##  [445] 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 1 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
##  [482] 3 3 2 3 2 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 1 2 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3
##  [519] 2 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
##  [556] 2 2 3 3 3 3 3 3 3 3 3 3 3 3 2 2 3 3 2 3 3 2 3 3 2 3 3 2 3 2 3 3 3 2 3 3 2
##  [593] 2 2 2 3 2 2 2 2 3 3 2 3 2 3 3 2 2 2 2 3 3 3 3 3 2 2 2 2 2 2 3 3 3 2 1 3 3
##  [630] 3 3 2 2 2 2 2 2 2 1 2 2 2 2 1 2 2 2 1 3 2 3 2 2 2 3 3 2 2 3 3 2 2 2 2 3 2
##  [667] 2 2 2 2 2 2 2 2 2 2 2 3 3 2 3 2 2 1 3 3 2 3 3 2 3 2 3 3 2 3 3 3 3 3 2 2 3
##  [704] 2 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 3 2 2 2 3 3 3 2 1 2 2 2 2 2 2 2 2 3 2 2
##  [741] 3 2 3 2 2 3 2 2 2 2 2 2 2 2 2 1 2 2 1 1 1 3 2 3 2 2 3 3 2 2 2 2 1 1 3 2 2
##  [778] 2 2 2 2 3 3 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 2 1 2 2 2 3 3 3 2 2 2 2 2 2 1 2
##  [815] 2 2 3 2 3 2 2 2 3 2 2 2 2 2 3 3 2 1 2 2 2 3 2 2 2 2 2 2 3 2 2 2 3 2 2 2 1
##  [852] 2 2 3 2 2 2 2 2 3 2 2 2 3 3 3 3 2 3 2 2 3 2 3 2 3 2 3 3 2 2 2 3 2 2 2 3 1
##  [889] 2 3 2 3 2 2 2 2 2 2 2 2 3 2 3 1 2 3 2 2 2 2 3 2 3 1 1 2 2 2 3 2 2 2 2 2 2
##  [926] 3 3 3 3 2 2 3 2 3 2 2 2 2 2 2 2 3 3 3 3 3 2 2 2 2 2 2 2 2 2 3 2 2 2 2 2 2
##  [963] 2 2 1 3 2 3 2 2 3 3 3 2 3 2 2 2 2 3 3 3 3 2 1 2 2 2 2 3 3 3 2 3 2 2 2 1 2
## [1000] 2 2 2 2 1 3 3 2 3 2 2 3 2 3 2 2 2 2 2 2 2 2 3 2 2 2 2 1 2 2 2 2 2 3 2 3 2
## [1037] 2 2 2 2 2 2 2 2 3 3 2 2 2 1 2 1 2 1 2 2 2 2 2 2 2 2 2 2 3 2 2 2 2 3 2 2 2
## [1074] 2 2 2 2 2 2 3 3 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2 2 2 2 2 3 2 2 2 2 2 2 2 1
## [1111] 3 3 3 2 2 2 2 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## 
## Within cluster sum of squares by cluster:
## [1]  704.8552 1489.0927 2279.7379
##  (between_SS / total_SS =  34.6 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"
library(factoextra)
fviz_cluster(Clustering, 
             palette = "Set1",
             repel = FALSE,
             ggtheme = theme_bw(),
             data = mydata_CluStd)

I will remove the units with Id numbers 1050, 232, 760, and 761 in order to create more homogeneous groups. I will also standardize the data again.

mydata <- mydata %>%
  filter(!Id %in% c(232, 1050, 760, 761))

mydata$Id <- seq(1, nrow(mydata))

mydata_CluStd <- as.data.frame(scale(mydata[c("total sulfur dioxide", "density", "pH", "residual sugar", "alcohol")]))
Clustering <- kmeans(mydata_CluStd,
                     centers = 3,
                     nstart = 25)
Clustering
## K-means clustering with 3 clusters of sizes 73, 380, 684
## 
## Cluster means:
##   total sulfur dioxide    density         pH residual sugar    alcohol
## 1            0.4452809  1.1364382 -0.2702118      3.0788004  0.4190285
## 2           -0.3625911 -0.9036230  0.5303141     -0.2307469  0.9781134
## 3            0.1539168  0.3807263 -0.2657805     -0.2003927 -0.5881172
## 
## Clustering vector:
##    [1] 3 3 3 3 3 3 3 2 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 2 3 1 3 3 3 2 3 3 3 3
##   [38] 3 3 3 3 3 3 3 3 2 2 2 2 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 3 3 3 3 3 3
##   [75] 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 2 3 2 3 3 3 3 2 3 2 3 2 2 2 3 3 3 1 1 1 3 3
##  [112] 3 3 3 1 1 3 3 3 2 3 3 2 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3
##  [149] 2 3 3 3 1 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
##  [186] 3 3 3 3 3 2 3 3 3 3 3 3 1 1 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3
##  [223] 3 3 3 3 2 3 3 3 1 3 1 3 3 2 3 2 3 3 3 3 3 3 3 3 3 1 2 2 3 2 3 3 3 3 3 3 1
##  [260] 1 3 2 3 3 3 2 1 3 3 3 2 3 3 1 1 3 3 1 3 3 3 3 3 1 3 3 1 3 1 2 3 3 3 2 2 3
##  [297] 2 2 3 3 3 3 3 3 3 3 3 3 3 3 1 2 3 2 3 3 3 3 3 2 1 3 3 3 2 3 3 3 2 3 3 3 3
##  [334] 3 3 3 3 3 2 3 3 3 3 2 2 3 1 3 3 3 3 3 3 1 1 3 2 3 3 3 3 3 2 2 1 1 3 3 3 3
##  [371] 2 3 3 3 3 2 3 3 3 2 1 1 3 1 3 3 3 3 3 3 3 3 3 3 2 1 1 1 1 3 3 3 1 3 3 3 2
##  [408] 3 3 3 3 3 3 3 3 3 3 2 3 2 3 3 1 3 3 3 3 3 1 2 3 3 3 3 3 3 3 3 3 1 3 3 3 3
##  [445] 3 3 3 1 3 3 3 3 3 3 3 3 2 2 1 3 1 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
##  [482] 3 2 3 2 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 1 2 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 2
##  [519] 3 2 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2
##  [556] 2 3 3 3 3 3 3 3 3 3 3 3 3 2 2 3 3 2 3 3 2 3 3 2 3 3 2 3 2 3 3 3 2 3 3 2 2
##  [593] 2 2 3 2 2 2 2 3 3 3 3 3 3 3 2 2 2 2 3 3 3 3 3 2 3 2 2 2 2 3 3 3 2 1 3 3 3
##  [630] 3 2 2 2 2 2 2 2 1 2 2 2 2 1 2 2 2 1 3 2 3 2 2 2 3 3 2 2 3 3 2 2 2 2 3 2 2
##  [667] 2 2 2 2 2 2 3 2 2 2 3 3 2 3 2 2 1 3 3 2 3 3 2 3 2 3 3 2 3 3 3 3 3 2 2 3 2
##  [704] 2 2 2 2 3 3 3 2 2 2 2 2 2 2 2 2 3 2 2 2 3 3 3 2 1 2 3 2 2 3 2 3 3 3 2 2 3
##  [741] 2 3 3 2 3 2 2 2 2 3 2 3 3 2 1 3 2 1 3 2 3 1 2 3 3 2 2 2 2 1 1 3 2 2 2 2 2
##  [778] 2 3 3 3 2 2 2 2 2 2 2 2 2 2 2 2 3 3 2 1 2 2 3 3 3 3 3 2 2 3 2 3 1 2 2 2 3
##  [815] 2 3 2 2 2 3 2 2 2 3 2 3 3 2 1 2 2 2 3 3 3 3 2 2 2 3 3 3 3 3 3 2 2 1 2 2 3
##  [852] 2 2 2 2 3 3 2 2 2 3 3 3 3 2 3 2 3 3 2 3 2 3 2 3 3 2 3 2 3 3 3 2 3 1 2 3 2
##  [889] 3 2 2 2 2 3 2 2 2 3 2 3 1 3 3 2 2 2 2 3 2 3 1 1 2 3 2 3 2 2 2 2 2 2 3 3 3
##  [926] 3 2 2 3 2 3 2 2 2 2 2 2 3 3 3 3 3 3 3 2 3 3 3 2 2 2 2 3 3 2 3 2 3 2 2 2 1
##  [963] 3 3 3 2 3 3 3 3 2 3 3 3 2 2 3 3 3 3 3 1 2 2 3 3 3 3 3 2 3 2 3 3 1 2 2 2 2
## [1000] 2 1 3 3 3 3 2 3 3 3 3 2 2 2 2 2 2 3 2 3 2 2 2 2 1 3 2 2 2 2 3 2 3 2 2 3 2
## [1037] 2 3 3 2 3 3 3 2 2 2 2 1 2 1 3 2 3 2 3 2 2 2 2 2 3 2 2 2 2 3 2 2 2 2 2 3 2
## [1074] 3 3 3 3 2 3 3 2 3 3 3 3 3 2 2 3 3 2 3 2 2 2 2 2 3 2 3 2 3 2 2 2 1 3 3 3 3
## [1111] 3 2 2 3 3 2 2 2 3 2 2 2 1 2 2 2 3 2 2 2 2 2 2 3 2 2 2
## 
## Within cluster sum of squares by cluster:
## [1]  517.4663 1094.2405 1970.8128
##  (between_SS / total_SS =  36.9 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"
library(factoextra)
fviz_cluster(Clustering, 
             palette = "Set1",
             repel = FALSE,
             ggtheme = theme_bw(),
             data = mydata_CluStd)

Now, I will remove the units with Id numbers 1048 and 231, and standardize the data again with the “scale” function.

mydata <- mydata %>%
  filter(!Id %in% c(1048, 231))

mydata$Id <- seq(1, nrow(mydata))

mydata_CluStd <- as.data.frame(scale(mydata[c("total sulfur dioxide", "density", "pH", "residual sugar", "alcohol")]))
Clustering <- kmeans(mydata_CluStd,
                     centers = 3,
                     nstart = 25)
library(factoextra)
fviz_cluster(Clustering, 
             palette = "Set1",
             repel = FALSE,
             ggtheme = theme_bw(),
             data = mydata_CluStd)

Clustering
## K-means clustering with 3 clusters of sizes 669, 382, 84
## 
## Cluster means:
##   total sulfur dioxide    density         pH residual sugar    alcohol
## 1            0.1679786  0.3984463 -0.2740983     -0.2046639 -0.6026384
## 2           -0.3549890 -0.9142728  0.5270688     -0.2823040  0.9422405
## 3            0.2765254  0.9844247 -0.2139106      2.9138128  0.5146332
## 
## Clustering vector:
##    [1] 1 1 1 1 1 1 1 2 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 2 1 3 1 1 1 2 1 1 1 1
##   [38] 1 1 1 1 1 1 1 1 3 3 2 2 1 1 1 2 1 1 1 1 2 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1
##   [75] 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 1 2 1 1 1 1 2 1 2 1 2 2 2 1 1 1 3 3 3 1 1
##  [112] 1 1 1 3 3 1 1 1 2 1 1 2 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1
##  [149] 2 1 1 1 3 1 1 1 1 1 2 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [186] 1 1 1 1 1 2 1 3 3 3 1 3 3 3 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1
##  [223] 1 1 1 1 2 1 1 1 3 3 1 1 2 3 2 1 1 1 1 1 1 1 1 1 3 2 2 1 2 1 1 1 1 1 1 3 3
##  [260] 1 2 1 1 1 2 3 1 1 1 2 1 1 3 3 1 1 3 1 1 1 1 1 3 1 1 3 1 3 2 1 1 1 2 2 1 2
##  [297] 2 1 1 1 1 1 1 1 1 1 1 1 1 3 2 1 2 1 1 1 1 1 2 3 1 1 1 2 1 1 1 2 1 1 1 1 1
##  [334] 1 1 1 1 2 1 1 1 1 2 2 1 3 1 1 1 1 1 1 3 3 1 2 1 1 1 1 1 2 2 3 3 1 1 1 1 2
##  [371] 1 1 1 1 2 1 1 1 2 3 3 1 3 1 1 1 1 1 1 1 1 1 1 2 3 3 3 3 1 1 1 3 1 1 1 2 1
##  [408] 1 1 1 1 1 1 1 1 1 2 1 2 1 1 3 1 1 1 1 1 3 2 1 1 1 1 1 1 1 1 1 3 1 1 1 1 1
##  [445] 1 1 3 1 1 1 1 1 1 1 1 2 2 3 1 3 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [482] 2 1 2 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 3 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1
##  [519] 3 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2
##  [556] 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 2 1 1 2 1 1 2 1 1 2 1 2 1 1 1 2 1 1 2 2 2
##  [593] 2 1 2 2 2 2 1 1 1 1 1 1 1 2 2 2 2 1 1 1 1 1 2 1 2 2 2 2 1 1 1 2 3 1 1 1 1
##  [630] 2 2 2 2 2 2 2 3 2 2 2 2 3 2 2 2 3 1 2 1 2 2 2 1 1 2 2 1 1 2 2 2 3 1 2 3 2
##  [667] 2 2 2 2 2 1 2 2 2 1 1 2 1 2 2 3 1 1 2 1 1 2 1 2 1 1 2 1 1 1 1 1 2 2 1 2 2
##  [704] 2 2 2 1 1 1 2 2 2 2 2 2 2 2 2 1 2 2 2 1 1 1 2 3 2 1 2 2 2 2 2 1 1 2 2 1 2
##  [741] 1 1 2 1 2 2 2 2 1 2 1 1 2 3 1 2 3 1 2 1 3 2 1 1 2 2 2 2 3 3 1 2 2 2 2 2 2
##  [778] 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 1 1 2 3 2 2 1 1 1 1 1 2 2 1 2 1 3 2 2 2 1 2
##  [815] 1 2 2 2 1 2 2 2 1 2 1 1 2 3 2 2 2 1 1 1 1 3 2 2 1 1 1 1 1 1 2 2 3 2 2 1 2
##  [852] 2 2 2 1 1 2 2 2 1 1 1 1 2 1 2 1 1 2 1 2 1 2 1 1 2 1 2 1 1 1 2 1 3 2 1 2 1
##  [889] 2 2 2 2 1 2 2 2 1 2 1 3 1 1 2 2 2 2 1 2 1 3 3 2 1 2 1 2 2 2 2 2 2 1 1 1 1
##  [926] 2 2 1 2 1 2 2 2 2 2 2 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 1 1 2 1 2 1 2 2 2 3 1
##  [963] 1 1 2 1 1 1 1 2 1 1 1 2 2 1 1 1 1 1 3 2 2 1 1 1 1 1 2 1 2 1 1 3 2 2 2 2 2
## [1000] 3 1 1 1 1 2 1 1 1 1 2 2 2 2 2 2 1 2 1 3 2 2 2 3 1 2 2 2 2 1 2 1 2 2 1 2 2
## [1037] 1 1 2 1 1 1 2 2 2 2 2 3 1 2 1 2 1 2 2 2 2 2 1 2 2 2 2 1 2 2 2 2 2 1 2 1 1
## [1074] 1 1 2 1 1 2 2 1 1 1 1 2 2 1 1 2 1 2 2 2 2 2 1 2 1 2 1 2 2 2 3 1 1 1 1 1 2
## [1111] 2 1 1 2 2 2 1 2 2 2 3 2 2 2 1 2 2 2 2 2 2 1 2 2 2
## 
## Within cluster sum of squares by cluster:
## [1] 1935.007 1090.579  527.816
##  (between_SS / total_SS =  37.3 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

Finally, I arrived at an acceptable graphical representation of the clusters.

The first cluster group has 669 units. The second cluster group has 382 units. The third cluster group has 84 units.

The “Cluster means” field shows the final leaders/centers of the three groups.

The “Clustering vector” shows to which group (1st, 2nd, or 3rd) each of the 1135 units belongs.
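
The same group sizes can be read directly off the clustering vector with table():

# Tabulate how many units fall into each of the three clusters
table(Clustering$cluster)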

The within-cluster sum of squares is the highest in group number 1 (= 1935.007).

The (between_SS / total_SS) ratio is 37.3%. This ratio increased by 2.7 percentage points (from the initial 34.6%) because I removed the units with very high distances from the group leaders, thus decreasing the within-cluster distances (we can also observe this by comparing the initial and final within-cluster sums of squares).
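
The printed ratio can be recomputed directly from the components returned by kmeans:

# between_SS / total_SS, recomputed from the kmeans object (in %)
round(Clustering$betweenss / Clustering$totss * 100, 1)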

Averages <- Clustering$centers
Averages
##   total sulfur dioxide    density         pH residual sugar    alcohol
## 1            0.1679786  0.3984463 -0.2740983     -0.2046639 -0.6026384
## 2           -0.3549890 -0.9142728  0.5270688     -0.2823040  0.9422405
## 3            0.2765254  0.9844247 -0.2139106      2.9138128  0.5146332
Figure <- as.data.frame(Averages)
Figure$id <- 1:nrow(Figure)

library(tidyr)
## 
## Attaching package: 'tidyr'
## The following object is masked from 'package:pastecs':
## 
##     extract
Figure <- pivot_longer(Figure, cols = c("total sulfur dioxide", "density", "pH", "residual sugar", "alcohol"))

Figure$Group <- factor(Figure$id, 
                       levels = c(1, 2, 3), 
                       labels = c("1", "2", "3"))

Figure$ImeF <- factor(Figure$name, 
              levels = c("total sulfur dioxide", "density", "pH", "residual sugar", "alcohol"), 
              labels = c("total sulfur dioxide", "density", "pH", "residual sugar", "alcohol"))


library(ggplot2)
ggplot(Figure, aes(x = ImeF, y = value)) +
  geom_hline(yintercept = 0) +
  theme_bw() +
  geom_point(aes(shape = Group, col = Group), size = 3) +
  geom_line(aes(group = id), linewidth = 1) +
  ylab("Averages") +
  xlab("Cluster variables") +
  scale_color_brewer(palette="Set1") +
  ylim(-3, 3) +
  theme(axis.text.x = element_text(angle = 45, vjust = 0.50, size = 10))

Explaining cluster variables

Explanation of each cluster variable, and what a high or low value means (explanations taken from ChatGPT):

  1. Total Sulfur Dioxide: High levels preserve the wine longer but may create a chemical taste, while low levels offer a more natural wine but reduce shelf life and stability.

  2. Density: High density indicates a fuller-bodied, rich wine, while low density suggests a lighter, often drier wine.

  3. pH: A low pH (high acidity) produces crisp, fresh wines with aging potential, while a high pH (low acidity) can result in softer, less structured wines prone to quicker spoilage.

  4. Residual Sugar: High residual sugar contributes to sweeter wines, whereas low sugar indicates a dry, non-sweet wine.

  5. Alcohol: High alcohol adds warmth and body but can overpower flavors, while low alcohol leads to lighter, more refreshing wines that may feel thin without proper balance.

mydata$Group <- Clustering$cluster
fit <- aov(cbind(`total sulfur dioxide`, density, pH, `residual sugar`, alcohol) ~ as.factor(Group), 
             data = mydata)

summary(fit)
##  Response total sulfur dioxide :
##                    Df  Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(Group)    2   71784   35892  39.193 < 2.2e-16 ***
## Residuals        1132 1036663     916                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response density :
##                    Df    Sum Sq    Mean Sq F value    Pr(>F)    
## as.factor(Group)    2 0.0017939 0.00089693  457.55 < 2.2e-16 ***
## Residuals        1132 0.0022190 0.00000196                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response pH :
##                    Df  Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(Group)    2  3.9139 1.95695   93.13 < 2.2e-16 ***
## Residuals        1132 23.7868 0.02101                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response residual sugar :
##                    Df Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(Group)    2 882.40  441.20  1205.3 < 2.2e-16 ***
## Residuals        1132 414.35    0.37                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response alcohol :
##                    Df Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(Group)    2 702.64  351.32  645.84 < 2.2e-16 ***
## Residuals        1132 615.78    0.54                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

For each cluster variable (total sulfur dioxide, density, pH…), the hypotheses are the following:

H0: μ(G1) = μ(G2) = μ(G3)

H1: at least one arithmetic mean is different

For all variables, the p-value is less than 0.001, meaning that we can reject H0 in every case. This means that the clustering variables successfully differentiate between the groups.
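
As a possible follow-up (not part of the original analysis), Tukey’s HSD on a single-response fit would show which specific pairs of groups differ; a sketch for alcohol, using a hypothetical fit_alc object:

# Post-hoc pairwise comparisons between the three groups for one variable
fit_alc <- aov(alcohol ~ as.factor(Group), data = mydata)
TukeyHSD(fit_alc)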

Criterion Validity

In order to evaluate whether I have chosen the right clusters, I will use fixed acidity as a criterion variable, i.e., I will check whether the groups I created also differ on a variable that was not used for clustering.

aggregate(mydata$`fixed acidity`,
          by = list(mydata$Group),
          FUN = mean)
##   Group.1        x
## 1       1 8.685052
## 2       2 7.420419
## 3       3 9.265476
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
leveneTest(mydata$`fixed acidity`, as.factor(mydata$Group))
## Levene's Test for Homogeneity of Variance (center = median)
##         Df F value    Pr(>F)    
## group    2  25.098 2.162e-11 ***
##       1132                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Hypothesis:

H0: σ²(G1) = σ²(G2) = σ²(G3) (the variances of the groups are equal)

H1: at least one is different (the variances of the groups are not equal)

We can reject H0 (p < 0.001), meaning that the variances are heterogeneous across the groups.
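
Given the unequal variances, a Welch-type one-way test, which does not assume homogeneity of variance, is a reasonable robustness check alongside the Kruskal-Wallis test used below; a minimal sketch:

# Welch's ANOVA: compares group means without assuming equal variances
oneway.test(`fixed acidity` ~ as.factor(Group),
            data = mydata, var.equal = FALSE)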

colnames(mydata)
##  [1] "fixed acidity"        "volatile acidity"     "citric acid"         
##  [4] "residual sugar"       "chlorides"            "free sulfur dioxide" 
##  [7] "total sulfur dioxide" "density"              "pH"                  
## [10] "sulphates"            "alcohol"              "quality"             
## [13] "Id"                   "Dissimilarity"        "Group"

I will further perform the Shapiro-Wilk test. Because the “shapiro_test” function did not work on my variable due to the space in its name, I will rename the variable to fixed_acidity.

mydata <- mydata %>%
  rename(fixed_acidity = "fixed acidity")
library(dplyr)
library(rstatix)
## 
## Attaching package: 'rstatix'
## The following object is masked from 'package:stats':
## 
##     filter
mydata %>%
  group_by(Group) %>%
  shapiro_test(`fixed_acidity`)
## # A tibble: 3 × 4
##   Group variable      statistic        p
##   <int> <chr>             <dbl>    <dbl>
## 1     1 fixed_acidity     0.916 7.59e-19
## 2     2 fixed_acidity     0.981 5.82e- 5
## 3     3 fixed_acidity     0.900 7.54e- 6

Hypothesis about the first row (Group 1):

H0: the variable fixed acidity is normally distributed in G1

H1: the variable fixed acidity is not normally distributed in G1

The same hypotheses apply to rows 2 and 3 (G2 and G3). We reject all three null hypotheses (p < 0.001); the variable is not normally distributed in any of the groups, so I will use the Kruskal-Wallis test.

fit <- aov(fixed_acidity ~ as.factor(Group), 
           data = mydata)

summary(fit)
##                    Df Sum Sq Mean Sq F value Pr(>F)    
## as.factor(Group)    2    473  236.51    89.4 <2e-16 ***
## Residuals        1132   2995    2.65                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
kruskal.test(fixed_acidity ~ as.factor(Group),
             data = mydata)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  fixed_acidity by as.factor(Group)
## Kruskal-Wallis chi-squared = 154.06, df = 2, p-value < 2.2e-16

H0: location distribution of fixed acidity is the same in G1, G2 and G3

H1: at least one location distribution is different

We can reject H0 (p < 0.001); there is a statistically significant difference in the distribution of fixed acidity between G1, G2, and G3.

kruskal_effsize(fixed_acidity ~ as.factor(Group), 
                data = mydata)
## # A tibble: 1 × 5
##   .y.               n effsize method  magnitude
## * <chr>         <int>   <dbl> <chr>   <ord>    
## 1 fixed_acidity  1135   0.134 eta2[H] moderate

The effect size for this test is moderate.
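
A natural follow-up (a sketch, not part of the original analysis) is a pairwise comparison to see which groups drive the difference, e.g. Wilcoxon rank-sum tests with a Holm correction:

# Pairwise Wilcoxon rank-sum tests between the three groups,
# with Holm-adjusted p-values
pairwise.wilcox.test(mydata$fixed_acidity,
                     as.factor(mydata$Group),
                     p.adjust.method = "holm")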

Conclusion - naming and describing the clusters

For the conclusion, I used ChatGPT to help describe the groups based on the clustering variables, since I am not a wine expert:

Group 1: Barely Boozy Bliss - not too strong, not too light… just right!

Group characteristics: Total sulfur dioxide is very close to the average, the density is a little above average, and the pH and residual sugar are below average. The alcohol level is below average and the lowest among the three clusters. This cluster accounts for the largest portion of wines (58.94%).

Description: These wines are likely dry white wines or lighter red wines with lower alcohol content. The below-average residual sugar and pH indicate wines that are more acidic and not sweet. With the highest representation, these wines may include popular table wines meant for easy drinking, such as Pinot Grigio, Sauvignon Blanc, or light reds like Pinot Noir. These wines pair well with light dishes such as seafood, chicken, salads, or creamy pasta. The acidity helps cut through rich or creamy flavors, while their light profile complements delicate foods.

Group 2: Robust Balance - delivering a strong, yet smooth experience.

Group characteristics: Low total sulfur dioxide and density (the lowest among the three groups), above-average pH, slightly below-average residual sugar, and the highest level of alcohol among the groups. This cluster accounts for the second largest portion of wines (33.66%).

Description: These wines may be stronger red or white wines with a robust alcohol profile. The slightly below-average residual sugar and above-average pH suggest wines with more balance between acidity and sweetness. Examples could include rich reds like Merlot or Syrah, or fuller-bodied whites like Chardonnay. These wines may have a more intense flavor and mouthfeel. They pair well with hearty dishes such as grilled meats, rich stews, roasted vegetables, or creamy cheeses. The fuller body of these wines enhances savory, rich flavors.

Group 3: Dessert Friend - easygoing wines with rich sweetness.

Group characteristics: Above-average total sulfur dioxide and density. Below-average pH. Extremely high residual sugar and an above-average level of alcohol. This cluster accounts for the smallest portion of wines (7.4%).

Description: This cluster likely includes sweet dessert or fortified wines. The extremely high residual sugar and above-average alcohol content, paired with lower pH, point to rich, sweet wines like Port, Late Harvest Riesling, or Sauternes. These wines are full-bodied and often enjoyed as dessert wines due to their sweetness and alcohol strength. They are perfect with desserts such as fruit tarts, crème brûlée, chocolate-based desserts, or blue cheeses. The sweetness balances the richness of desserts or contrasts beautifully with salty cheeses.