Homework 2

mydata <- read.csv("C:/Users/Andrej/Documents/wine-clustering.csv")
mydata$ID <- seq(1, nrow(mydata))
head(mydata)

##   Alcohol Malic_Acid  Ash Ash_Alcanity Magnesium Total_Phenols Flavanoids
## 1   14.23       1.71 2.43         15.6       127          2.80       3.06
## 2   13.20       1.78 2.14         11.2       100          2.65       2.76
## 3   13.16       2.36 2.67         18.6       101          2.80       3.24
## 4   14.37       1.95 2.50         16.8       113          3.85       3.49
## 5   13.24       2.59 2.87         21.0       118          2.80       2.69
## 6   14.20       1.76 2.45         15.2       112          3.27       3.39
##   Nonflavanoid_Phenols Proanthocyanins Color_Intensity  Hue OD280 Proline ID
## 1                 0.28            2.29            5.64 1.04  3.92    1065  1
## 2                 0.26            1.28            4.38 1.05  3.40    1050  2
## 3                 0.30            2.81            5.68 1.03  3.17    1185  3
## 4                 0.24            2.18            7.80 0.86  3.45    1480  4
## 5                 0.39            1.82            4.32 1.04  2.93     735  5
## 6                 0.34            1.97            6.75 1.05  2.85    1450  6

Explaining data:

Unit of observation is an individual wine sample.

The dataset contains a total of 178 observations (wine samples).

The dataset includes 13 variables, all of which describe the chemical composition and properties of the wine. These variables are:

Alcohol: The alcohol content of the wine (measured in %).

Malic acid: The amount of malic acid in the wine (measured in g/L).

Ash: The ash content of the wine (measured in g/L).

Alcalinity of Ash: The alkalinity of the ash (measured in meq/L).

Magnesium: The magnesium content (measured in mg/L).

Total phenols: Total phenolic compounds (measured in g/L).

Flavanoids: A subset of phenols (measured in g/L).

Nonflavanoid phenols: Non-flavonoid phenolic compounds (measured in g/L).

Proanthocyanins: Proanthocyanidin content (measured in g/L).

Color intensity: The intensity of the wine’s color (measured in absorbance units).

Hue: The hue of the wine (ratio of color intensity at 520 nm to 420 nm).

OD280/OD315 of diluted wines: A measure of dilution and optical density.

Proline: The amount of proline, an amino acid (measured in mg/L).

The data is sourced from Kaggle.com and the owner is Harry Wang. The dataset originates from the UCI Machine Learning Repository.

summary(mydata[c(1:13)])

##     Alcohol        Malic_Acid         Ash         Ash_Alcanity  
##  Min.   :11.03   Min.   :0.740   Min.   :1.360   Min.   :10.60  
##  1st Qu.:12.36   1st Qu.:1.603   1st Qu.:2.210   1st Qu.:17.20  
##  Median :13.05   Median :1.865   Median :2.360   Median :19.50  
##  Mean   :13.00   Mean   :2.336   Mean   :2.367   Mean   :19.49  
##  3rd Qu.:13.68   3rd Qu.:3.083   3rd Qu.:2.558   3rd Qu.:21.50  
##  Max.   :14.83   Max.   :5.800   Max.   :3.230   Max.   :30.00  
##    Magnesium      Total_Phenols     Flavanoids    Nonflavanoid_Phenols
##  Min.   : 70.00   Min.   :0.980   Min.   :0.340   Min.   :0.1300      
##  1st Qu.: 88.00   1st Qu.:1.742   1st Qu.:1.205   1st Qu.:0.2700      
##  Median : 98.00   Median :2.355   Median :2.135   Median :0.3400      
##  Mean   : 99.74   Mean   :2.295   Mean   :2.029   Mean   :0.3619      
##  3rd Qu.:107.00   3rd Qu.:2.800   3rd Qu.:2.875   3rd Qu.:0.4375      
##  Max.   :162.00   Max.   :3.880   Max.   :5.080   Max.   :0.6600      
##  Proanthocyanins Color_Intensity       Hue             OD280      
##  Min.   :0.410   Min.   : 1.280   Min.   :0.4800   Min.   :1.270  
##  1st Qu.:1.250   1st Qu.: 3.220   1st Qu.:0.7825   1st Qu.:1.938  
##  Median :1.555   Median : 4.690   Median :0.9650   Median :2.780  
##  Mean   :1.591   Mean   : 5.058   Mean   :0.9574   Mean   :2.612  
##  3rd Qu.:1.950   3rd Qu.: 6.200   3rd Qu.:1.1200   3rd Qu.:3.170  
##  Max.   :3.580   Max.   :13.000   Max.   :1.7100   Max.   :4.000  
##     Proline      
##  Min.   : 278.0  
##  1st Qu.: 500.5  
##  Median : 673.5  
##  Mean   : 746.9  
##  3rd Qu.: 985.0  
##  Max.   :1680.0

Descriptive statistics:

Alcohol (%): Min: 11.03, Max: 14.83, Mean: 13.00. The first quartile (Q1) is 12.36 and the third quartile (Q3) is 13.68, meaning the middle half of the wines falls between these values.

Malic Acid (g/L): Min: 0.74, Max: 5.80, Mean: 2.34. Malic acid content varies considerably; the median (1.865) lies below the mean, which points to a right-skewed distribution.

Ash (g/L): Min: 1.36, Max: 3.23, Mean: 2.37 Most wines have ash content between 2.21 and 2.56 (interquartile range).

Alcalinity of Ash (meq/L): Min: 10.6, Max: 30.0, Mean: 19.49 The alkalinity levels vary significantly.

Magnesium (mg/L): Min: 70, Max: 162, Mean: 99.74. The median is 98, so half of the wines contain 98 mg/L of magnesium or less.

Total Phenols (g/L): Min: 0.98, Max: 3.88, Mean: 2.29 Phenolic content is an important quality factor in wines.

Flavanoids (g/L): Min: 0.34, Max: 5.08, Mean: 2.03 The higher the flavonoid content, the more likely the wine is to have antioxidant properties.

Nonflavanoid Phenols (g/L): Min: 0.13, Max: 0.66, Mean: 0.36 These are generally lower in concentration compared to flavonoids.

Proanthocyanins (g/L): Min: 0.41, Max: 3.58, Mean: 1.59 Important for wine color stability.

Color Intensity (absorbance units): Min: 1.28, Max: 13.0, Mean: 5.06 High variation in color intensity among wines.

Hue (Ratio of absorbance at 520 nm to 420 nm): Min: 0.48, Max: 1.71, Mean: 0.96 Represents the tone of the wine (red, brown, etc.).

OD280/OD315 of Diluted Wines: Min: 1.27, Max: 4.0, Mean: 2.61 Measures the wine’s protein and polyphenol content.

Proline (mg/L): Min: 278, Max: 1680, Mean: 746.9 Proline is an amino acid that contributes to wine structure.

mydata_clu_std <- as.data.frame(scale(mydata[c(1,2,3,4)]))

mydata$Dissimilarity = sqrt(mydata_clu_std$Alcohol^2 + mydata_clu_std$Malic_Acid^2 + mydata_clu_std$Ash^2 + mydata_clu_std$Ash_Alcanity^2)

head(mydata[order(-mydata$Dissimilarity), c("ID","Dissimilarity")], 15)

##      ID Dissimilarity
## 60   60      4.766567
## 122 122      4.515761
## 26   26      3.530586
## 138 138      3.481034
## 128 128      3.435175
## 74   74      3.312897
## 14   14      3.289520
## 124 124      3.274170
## 170 170      3.209481
## 123 123      3.190471
## 174 174      3.121685
## 111 111      3.024639
## 67   67      3.023544
## 9     9      2.947353
## 178 178      2.920125
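The dissimilarity measure used above is each sample's Euclidean distance from the centre (the origin) of the standardized variable space, so large values flag samples that are atypical on several variables at once. A minimal sketch on made-up numbers (the data frame `toy` and its values are hypothetical, not taken from the wine data) shows that the term-by-term formula and a `rowSums()` shortcut agree:

```r
# Hypothetical toy data, for illustration only
toy <- data.frame(a = c(1, 2, 3, 10), b = c(5, 6, 7, 8))

# Standardize: each column gets mean 0 and standard deviation 1
toy_std <- as.data.frame(scale(toy))

# Distance from the origin, written out term by term
d_long <- sqrt(toy_std$a^2 + toy_std$b^2)

# Same quantity, written so that it works for any number of columns
d_short <- sqrt(rowSums(toy_std^2))

# Row with the largest distance, i.e. the most atypical unit
most_atypical <- which.max(d_short)
```

The `rowSums()` form avoids typing every variable name when more clustering variables are used.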

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

mydata <- mydata %>%
  filter(!ID %in% c(60,122)) 

mydata$ID <- seq(1, nrow(mydata))

mydata_clu_std <- as.data.frame(scale(mydata[c(1,2,3,4)]))

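I removed the two units with the largest dissimilarity values (IDs 60 and 122) as outliers, renumbered the IDs, and standardized the clustering variables again.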
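Standardized values depend on the mean and standard deviation of whichever rows are present, which is why `scale()` is run again after units are removed. A small sketch with hypothetical numbers (the vector `x` is made up) illustrates how a single extreme value compresses the standardized scores of the remaining units:

```r
x <- c(10, 12, 11, 13, 40)      # hypothetical values; 40 plays the outlier

z_all  <- as.numeric(scale(x))        # standardized with the outlier present
z_kept <- as.numeric(scale(x[-5]))    # re-standardized after removing it

round(z_all[1:4], 2)   # the four regular units look almost identical
round(z_kept, 2)       # after removal their differences become visible
```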

library(factoextra)

## Warning: package 'factoextra' was built under R version 4.4.2

## Loading required package: ggplot2

## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

Distances <- get_dist(mydata_clu_std, 
                      method = "euclidean")

fviz_dist(Distances, 
          gradient = list(low = "darkred",
                          mid = "grey95",
                          high = "white"))

The heatmap shows how similar or different the wine samples are based on their chemical properties (represents the pairwise Euclidean distances between wine samples). Darker red areas mean the wines are very similar to each other, while lighter or white areas indicate they are quite different. The diagonal line is the darkest because it represents each wine compared to itself, which always has a distance of zero. Some parts of the heatmap have clusters of dark red, suggesting that certain wines share similar characteristics and could naturally form groups. On the other hand, the lighter areas show wines that are more unique or different from most others. This visualization helps us see patterns in the data and suggests that some wines are much more alike than others.
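The matrix behind such a heatmap can be inspected directly. The sketch below uses base R's `dist()` on three hypothetical standardized samples (the matrix `m` is made up) to show the properties described above: a zero diagonal, symmetry, and small values for similar samples.

```r
# Three hypothetical samples described by two standardized variables
m <- rbind(s1 = c(0.0, 0.0),
           s2 = c(0.1, -0.1),
           s3 = c(2.0, 1.5))

# Full pairwise Euclidean distance matrix
D <- as.matrix(dist(m, method = "euclidean"))
round(D, 3)

all(diag(D) == 0)               # each sample is at distance zero from itself
isSymmetric(D)                  # d(i, j) equals d(j, i)
D["s1", "s2"] < D["s1", "s3"]   # s1 is closer to s2 than to s3
```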

library(factoextra) 
get_clust_tendency(mydata_clu_std, 
                   n = nrow(mydata_clu_std) - 1,
                   graph = FALSE)

## $hopkins_stat
## [1] 0.6945244
## 
## $plot
## NULL

library(factoextra)
library(NbClust)

fviz_nbclust(mydata_clu_std, kmeans, method = "wss") +
  labs(subtitle = "Elbow method")

fviz_nbclust(mydata_clu_std, kmeans, method = "silhouette")+
  labs(subtitle = "Silhouette analysis")

library(dplyr)
library(factoextra)
WARD <- mydata_clu_std %>%
  get_dist(method = "euclidean") %>%  
  hclust(method = "ward.D2")

WARD

## 
## Call:
## hclust(d = ., method = "ward.D2")
## 
## Cluster method   : ward.D2 
## Distance         : euclidean 
## Number of objects: 176

library(factoextra)
fviz_dend(WARD)

## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the factoextra package.
##   Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

library(NbClust)
NbClust(mydata_clu_std, 
        distance = "euclidean", 
        min.nc = 2, max.nc = 10,
        method = "kmeans", 
        index = "all")

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
##

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 4 proposed 2 as the best number of clusters 
## * 10 proposed 3 as the best number of clusters 
## * 4 proposed 4 as the best number of clusters 
## * 3 proposed 6 as the best number of clusters 
## * 2 proposed 10 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  3 
##  
##  
## *******************************************************************

## $All.index
##        KL      CH Hartigan     CCC    Scott   Marriot    TrCovW   TraceW
## 2  0.4786 57.6999  68.7291 -2.7380 195.7042 695777636 22827.713 525.6800
## 3  2.8581 74.1813  36.9756 -1.1059 414.8250 450769372 12193.656 376.8329
## 4  4.1169 71.9311  17.9432 -1.4397 511.9200 461574739  6629.902 310.4746
## 5  0.1714 63.6896  30.7565 -2.1781 555.3059 563642640  4558.635 281.1454
## 6  1.7404 65.8797   5.8352 -0.4581 663.4610 439020370  3829.754 238.2866
## 7  3.0483 57.4170  14.2171 -2.1218 692.3694 507042472  3531.074 230.3789
## 8  2.0913 55.0580  11.4010 -1.9890 748.9253 480254888  3060.681 212.5022
## 9  0.2815 52.5555  13.3857 -2.0742 795.9699 465254457  2662.459 198.9976
## 10 4.7456 51.6367   7.1999 -1.7001 854.6658 411498742  2307.682 184.2308
##    Friedman  Rubin Cindex     DB Silhouette   Duda Pseudot2   Beale Ratkowsky
## 2    1.9564 1.3316 0.3876 1.7087     0.2328 0.6515  60.4441  1.2788    0.3086
## 3    4.3051 1.8576 0.3306 1.2865     0.2961 0.6567  31.3625  1.2409    0.3901
## 4    5.6250 2.2546 0.3584 1.2087     0.3064 0.9698   1.8032  0.0736    0.3724
## 5    6.2510 2.4898 0.3500 1.2700     0.2566 1.6420 -23.8491 -0.9124    0.3456
## 6    7.8877 2.9376 0.3884 1.2594     0.2605 0.7335  13.4461  0.8530    0.3315
## 7    8.4903 3.0385 0.3743 1.3309     0.2470 2.8293 -32.9745 -1.4788    0.3095
## 8    9.5070 3.2941 0.3614 1.3766     0.2358 2.3739 -21.9924 -1.3337    0.2950
## 9   10.4214 3.5176 0.3632 1.3389     0.2232 2.2169 -15.9187 -1.2424    0.2820
## 10  11.7300 3.7996 0.3365 1.3031     0.2313 1.5787  -8.4314 -0.8384    0.2714
##        Ball Ptbiserial    Frey McClain   Dunn Hubert SDindex Dindex   SDbw
## 2  262.8400     0.3666  0.0873  0.7128 0.0618 0.0024  2.2061 1.6277 1.3981
## 3  125.6110     0.4925  0.1170  1.2983 0.0633 0.0029  1.4914 1.3407 0.9514
## 4   77.6187     0.5350  1.0127  1.5825 0.0761 0.0033  1.4295 1.2306 0.7832
## 5   56.2291     0.4930  0.2994  2.1287 0.0801 0.0035  1.7610 1.1656 0.6172
## 6   39.7144     0.4846  0.8848  2.6539 0.1280 0.0039  1.7467 1.0804 0.3561
## 7   32.9113     0.4695  0.2551  2.9101 0.0747 0.0043  1.9500 1.0596 0.4384
## 8   26.5628     0.4617  0.5144  3.2913 0.0845 0.0045  2.0081 1.0156 0.3300
## 9   22.1108     0.4437  0.1305  3.7196 0.0635 0.0046  1.9808 0.9850 0.3104
## 10  18.4231     0.4431 -0.6502  4.0480 0.0635 0.0050  1.9889 0.9439 0.3031
## 
## $All.CriticalValues
##    CritValue_Duda CritValue_PseudoT2 Fvalue_Beale
## 2          0.6427            62.8160       0.2776
## 3          0.5821            43.0671       0.2943
## 4          0.5607            45.4513       0.9901
## 5          0.4780            66.6283       1.0000
## 6          0.5087            35.7380       0.4940
## 7          0.3890            80.0903       1.0000
## 8          0.4195            52.5756       1.0000
## 9          0.3508            53.6685       1.0000
## 10         0.3890            36.1192       1.0000
## 
## $Best.nc
##                      KL      CH Hartigan     CCC    Scott   Marriot   TrCovW
## Number_clusters 10.0000  3.0000   3.0000  6.0000   3.0000         3     3.00
## Value_Index      4.7456 74.1813  31.7535 -0.4581 219.1208 255813632 10634.06
##                  TraceW Friedman  Rubin Cindex     DB Silhouette   Duda
## Number_clusters  3.0000   3.0000  6.000 3.0000 4.0000     4.0000 2.0000
## Value_Index     82.4888   2.3487 -0.347 0.3306 1.2087     0.3064 0.6515
##                 PseudoT2  Beale Ratkowsky    Ball PtBiserial Frey McClain  Dunn
## Number_clusters   2.0000 2.0000    3.0000   3.000      4.000    1  2.0000 6.000
## Value_Index      60.4441 1.2788    0.3901 137.229      0.535   NA  0.7128 0.128
##                 Hubert SDindex Dindex    SDbw
## Number_clusters      0  4.0000      0 10.0000
## Value_Index          0  1.4295      0  0.3031
## 
## $Best.partition
##   [1] 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 2 1 1 1 1 2 1 1 1 1 1 1
##  [38] 1 3 1 1 2 1 2 3 2 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3 3 3 3 3 3 1 3 3 2 2 2 3
##  [75] 3 3 3 3 2 3 3 3 2 3 3 3 2 3 3 3 3 3 3 3 3 2 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3
## [112] 2 3 3 3 3 3 3 3 3 2 2 2 3 3 2 3 2 3 2 2 2 3 3 2 2 2 2 2 2 2 2 3 2 2 2 2 2
## [149] 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 1 2 2 2 2 2

Based on the elbow method, silhouette analysis, the cluster dendrogram, and the NbClust majority rule (10 indices propose 3 clusters), I choose to make 3 clusters.
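The elbow plot is built from the total within-cluster sum of squares for each candidate number of clusters. As a rough sketch of that computation in base R, the code below uses simulated data with three well-separated groups (the data, the seed, and the range of k are assumptions for illustration, not the wine data):

```r
set.seed(1)
# Simulated data: three groups of 30 points in two dimensions
sim <- rbind(matrix(rnorm(60, mean = 0),  ncol = 2),
             matrix(rnorm(60, mean = 6),  ncol = 2),
             matrix(rnorm(60, mean = 12), ncol = 2))
sim <- scale(sim)

# Total within-cluster sum of squares for k = 1, ..., 6
wss <- sapply(1:6, function(k) {
  kmeans(sim, centers = k, nstart = 25)$tot.withinss
})

# The decrease flattens after k = 3, which is the "elbow"
round(wss, 1)
```

On real data the bend is usually less sharp, which is why it is combined with other criteria here.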

Clustering <- kmeans(mydata_clu_std, 
                     centers = 3, 
                     nstart = 25) 

Clustering

## K-means clustering with 3 clusters of sizes 63, 61, 52
## 
## Cluster means:
##       Alcohol Malic_Acid        Ash Ash_Alcanity
## 1 -0.90777691 -0.5401134 -0.7547032   -0.0805728
## 2  0.08777437  0.9239729  0.5161277    0.8282048
## 3  0.99684056 -0.4295231  0.3088945   -0.8739309
## 
## Clustering vector:
##   [1] 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3 2 3 3 3 3 2 3 3 3 3 3 3
##  [38] 3 1 3 3 2 3 2 1 2 3 3 3 3 3 3 3 3 3 3 3 3 3 1 1 1 1 1 1 1 1 3 1 1 2 2 2 1
##  [75] 1 1 1 1 2 1 1 1 2 1 1 1 2 1 1 1 1 1 1 1 1 2 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1
## [112] 2 1 1 1 1 1 1 1 1 2 2 2 1 1 2 1 2 1 2 2 2 1 1 2 2 2 2 2 2 2 2 1 2 2 2 2 2
## [149] 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 3 2 2 2 2 2
## 
## Within cluster sum of squares by cluster:
## [1] 125.3203 175.6697  75.8429
##  (between_SS / total_SS =  46.2 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

The clustering algorithm divided the dataset into clusters of sizes 63, 61, and 52, meaning that 63 wine samples belong to Cluster 1, 61 to Cluster 2, and 52 to Cluster 3. Each cluster has a set of mean values for selected variables (Alcohol, Malic Acid, Ash, and Ash Alcanity), which indicate the typical characteristics of wines in that group.

From the cluster means, we can see that Cluster 1 has the lowest Alcohol content, the lowest Ash levels, and the lowest Malic Acid (all below average), suggesting these wines may be less intense in alcohol and body. Cluster 2 has the highest Malic Acid content and the highest Ash Alcanity, possibly indicating wines with more acidity and mineral content. Cluster 3 has the highest Alcohol levels and the lowest Ash Alcanity, which could mean these wines are stronger in alcohol but lower in certain mineral qualities.

The within-cluster sum of squares (WCSS) values (125.32, 175.67, and 75.84) measure how compact each cluster is, with lower values indicating that points in the cluster are closer to each other. The between-cluster sum of squares divided by total sum of squares (46.2%) shows how well-separated the clusters are; typically, higher values indicate clearer distinctions between groups. This result suggests a moderate level of separation but not a perfect clustering.
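The 46.2% figure can be reproduced from the numbers printed in the k-means output. Because each standardized variable has variance 1, its sum of squares is n − 1, so the total sum of squares for 176 units and 4 variables is 175 × 4 = 700. A short check, using only values reported in the output:

```r
n <- 176                      # units after removing the two outliers
p <- 4                        # clustering variables
withinss <- c(125.3203, 175.6697, 75.8429)   # from the kmeans output

totss     <- (n - 1) * p                # 700 for standardized data
betweenss <- totss - sum(withinss)      # variation between cluster centers
ratio     <- 100 * betweenss / totss    # percentage reported by kmeans

round(ratio, 1)
```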

Overall, the k-means algorithm grouped the wines into three clusters based on four of their chemical properties, with moderate separation between them.

library(factoextra)
fviz_cluster(Clustering, 
             palette = "Set1", 
             repel = FALSE,
             ggtheme = theme_bw(),
             data = mydata_clu_std)

library(dplyr)
mydata <- mydata %>%
  filter(!ID %in% c(176,26,168,126,87,121,136)) 

mydata$ID <- seq(1, nrow(mydata))

mydata_clu_std <- as.data.frame(scale(mydata[c(1,2,3,4)]))

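I removed seven units (IDs 176, 26, 168, 126, 87, 121 and 136) that appeared as outliers in the cluster plot, then renumbered the IDs and standardized the clustering variables again.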

Clustering <- kmeans(mydata_clu_std, 
                     centers = 3,
                     nstart = 25)

library(factoextra)
fviz_cluster(Clustering, 
             palette = "Set1", 
             repel = FALSE,
             ggtheme = theme_bw(),
             data = mydata_clu_std)

library(dplyr)
mydata <- mydata %>%
  filter(!ID %in% c(25,3,55,35, 132)) 

mydata$ID <- seq(1, nrow(mydata))

mydata_clu_std <- as.data.frame(scale(mydata[c(1,2,3,4)]))

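I removed five units (IDs 25, 3, 55, 35 and 132) that lay in the overlap between clusters in the plot, then renumbered the IDs and standardized the clustering variables again.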

Clustering <- kmeans(mydata_clu_std, 
                     centers = 3,
                     nstart = 25)

library(factoextra)
fviz_cluster(Clustering, 
             palette = "Set1", 
             repel = FALSE,
             ggtheme = theme_bw(),
             data = mydata_clu_std)

The final cluster plot.

Averages <- Clustering$centers
Averages

##      Alcohol Malic_Acid        Ash Ash_Alcanity
## 1  0.1185140  0.9833759  0.4728621   0.80717569
## 2 -0.9008307 -0.5441204 -0.7072839   0.01016065
## 3  1.0302448 -0.4034757  0.3816052  -0.92119682

Figure <- as.data.frame(Averages)
Figure$id <- 1:nrow(Figure)

library(tidyr)
Figure <- pivot_longer(Figure, cols = c(1,2,3,4))

Figure$Group <- factor(Figure$id, 
                       levels = c(1, 2, 3), 
                       labels = c("1", "2", "3"))



library(ggplot2)
ggplot(Figure, aes(x = name, y = value)) +
  geom_hline(yintercept = 0) +
  theme_bw() +
  geom_point(aes(shape = Group, col = Group), size = 3) +
  geom_line(aes(group = id), linewidth = 1) +
  ylab("Averages") +
  xlab("Cluster variables")+
  ylim(-2.2, 2.2) +
  theme(axis.text.x = element_text(angle = 45, vjust = 0.50, size = 10))

mydata$Group <- Clustering$cluster

fit <- aov(cbind(Alcohol, Ash, Ash_Alcanity, Malic_Acid) ~ as.factor(Group), data = mydata)
summary(fit)

##  Response Alcohol :
##                   Df Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(Group)   2 67.480  33.740  134.67 < 2.2e-16 ***
## Residuals        161 40.336   0.251                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response Ash :
##                   Df Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(Group)   2 2.9945 1.49723  35.702 1.469e-13 ***
## Residuals        161 6.7519 0.04194                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response Ash_Alcanity :
##                   Df Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(Group)   2 703.10  351.55  70.187 < 2.2e-16 ***
## Residuals        161 806.42    5.01                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response Malic_Acid :
##                   Df  Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(Group)   2  92.764  46.382  74.582 < 2.2e-16 ***
## Residuals        161 100.125   0.622                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Each variable is analyzed separately using ANOVA to determine the statistical significance of differences between groups. The degrees of freedom (Df) for the grouping factor is 2, reflecting the presence of three clusters, while the residuals have 161 degrees of freedom (164 units minus 3 clusters), accounting for the remaining variability in the dataset. The Sum of Squares (Sum Sq) and Mean Square (Mean Sq) provide a measure of variance, with the mean square calculated as the sum of squares divided by the degrees of freedom.
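As a worked check of these relationships, the Alcohol row of the ANOVA table can be recomputed by hand. The only inputs are the sums of squares from the output, the number of clusters (3), and the number of units (164):

```r
k <- 3      # clusters
n <- 164    # units in the final data set

df_group <- k - 1        # 2
df_resid <- n - k        # 161

ms_group <- 67.480 / df_group    # mean square between clusters
ms_resid <- 40.336 / df_resid    # mean square within clusters
f_value  <- ms_group / ms_resid  # F statistic

round(c(ms_group, ms_resid, f_value), 3)
```

The result agrees with the F value of 134.67 printed for Alcohol, up to rounding of the inputs.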

The F-values are notably high for each variable, indicating strong evidence that the groups have different mean values. The p-values (Pr(>F)) are extremely small for all four variables (p < 2.2e-16 for Alcohol, Ash Alcanity, and Malic Acid; p = 1.469e-13 for Ash), confirming that the differences between clusters are highly significant. The presence of three asterisks (***) next to each p-value reinforces this, meaning that Alcohol, Ash, Ash Alcanity, and Malic Acid vary significantly among the three groups. Since k-means forms the clusters from these same variables, such differences are expected; the result mainly confirms that all four variables contribute to separating the groups.

aggregate(mydata$Proline, 
          by = list(mydata$Group), 
          FUN = mean)

##   Group.1         x
## 1       1  650.6481
## 2       2  538.7419
## 3       3 1142.6458

aggregate(mydata$Color_Intensity, 
          by = list(mydata$Group), 
          FUN = mean)

##   Group.1        x
## 1       1 6.307593
## 2       2 3.448387
## 3       3 5.786875

For Proline, the mean values for the three clusters are 650.65, 538.74, and 1142.65. This indicates that wines in Cluster 3 have the highest Proline content, while wines in Cluster 2 have the lowest. Since Proline is an amino acid associated with wine structure and aging potential, this suggests that Cluster 3 wines may have richer characteristics in terms of amino acid composition.

For Color Intensity, the average values are 6.31, 3.45, and 5.79. Wines in Cluster 1 have the highest color intensity, meaning they likely appear darker or more concentrated in color, while Cluster 2 wines have the lowest intensity, suggesting a lighter appearance. Cluster 3 falls in between, indicating moderate intensity. Color intensity is a key characteristic in wine classification, often correlating with the type of grape used and the winemaking process.

These differences in Proline and Color Intensity suggest that the clusters are not only chemically distinct but may also differ in their visual and structural properties, supporting the validity of the clustering results.

library(car)

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

leveneTest(mydata$Proline, as.factor(mydata$Group))

## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value Pr(>F)
## group   2  2.2196  0.112
##       161

Levene’s test for homogeneity of variance was conducted on the Proline variable across the three clusters. The test produced an F-value of 2.2196 and a p-value of 0.112, which is greater than the common significance threshold of 0.05. This means we fail to reject the null hypothesis, suggesting that the variances across the groups are not significantly different.

Since the assumption of homogeneity of variances is met (p > 0.05), there is no need to use Welch’s ANOVA, which is specifically designed for cases where variances are unequal.
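For intuition, Levene's test with `center = median` amounts to a one-way ANOVA on the absolute deviations of each observation from its group median. A minimal base-R sketch on simulated data follows (the data frame `sim` and its values are hypothetical; `car::leveneTest()` remains the function used in this analysis):

```r
set.seed(42)
# Simulated groups: A and B have the same spread, C is much more variable
sim <- data.frame(
  y = c(rnorm(40, sd = 1), rnorm(40, sd = 1), rnorm(40, sd = 4)),
  g = factor(rep(c("A", "B", "C"), each = 40))
)

# Absolute deviation of each value from its own group median
sim$dev <- abs(sim$y - ave(sim$y, sim$g, FUN = median))

# One-way ANOVA on those deviations
levene_tab <- anova(lm(dev ~ g, data = sim))
levene_p   <- levene_tab[["Pr(>F)"]][1]

levene_p < 0.05   # the unequal spreads should be detected here
```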

library(car)
leveneTest(mydata$Color_Intensity, as.factor(mydata$Group))

## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value    Pr(>F)    
## group   2  16.089 4.264e-07 ***
##       161                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Levene’s test for homogeneity of variance on Color Intensity produced an F-value of 16.089 and a p-value of 4.264e-07, which is extremely small (p < 0.001). This means we reject the null hypothesis, indicating that the variances across the three clusters are significantly different. Since the assumption of homogeneity of variances is violated, a standard one-way ANOVA is not appropriate, as it assumes equal variances.

Instead, we should use Welch’s ANOVA, which adjusts for unequal variances and provides more reliable results when variance heterogeneity is present.
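In base R, Welch's ANOVA is available through `oneway.test()` with `var.equal = FALSE`. The sketch below runs it on simulated groups with unequal variances (the data are hypothetical); applied to this analysis, the formula would presumably be `Color_Intensity ~ as.factor(Group)` with `data = mydata`, which is not run here.

```r
set.seed(7)
# Simulated groups with different means and different spreads
sim <- data.frame(
  y = c(rnorm(50, mean = 0, sd = 1),
        rnorm(50, mean = 1, sd = 2),
        rnorm(50, mean = 3, sd = 3)),
  g = factor(rep(c("1", "2", "3"), each = 50))
)

# Welch's ANOVA: does not assume equal variances across groups
welch <- oneway.test(y ~ g, data = sim, var.equal = FALSE)
welch$p.value
```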

library(dplyr)
library(rstatix)

## Warning: package 'rstatix' was built under R version 4.4.2

## 
## Attaching package: 'rstatix'

## The following object is masked from 'package:stats':
## 
##     filter

mydata %>%
  group_by(as.factor(mydata$Group)) %>%
  shapiro_test(Proline)

## # A tibble: 3 × 4
##   `as.factor(mydata$Group)` variable statistic       p
##   <fct>                     <chr>        <dbl>   <dbl>
## 1 1                         Proline      0.933 0.00483
## 2 2                         Proline      0.946 0.00901
## 3 3                         Proline      0.984 0.736

The Shapiro-Wilk test for normality has been conducted for the Proline variable within each cluster. The test results show that for Groups 1 and 2, the p-values are 0.0048 and 0.0090, respectively, which are below the standard significance threshold of 0.05. This indicates that the Proline variable in these groups does not follow a normal distribution. However, for Group 3, the p-value is 0.736, suggesting that normality is not violated in this group.
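The same per-group check can be sketched without `rstatix`, using base R's `shapiro.test()` inside `tapply()`. The data below are simulated (group 2 is deliberately skewed), so the values are illustrative only:

```r
set.seed(3)
# Simulated data: groups 1 and 3 are normal, group 2 is exponential
sim <- data.frame(
  y = c(rnorm(50), rexp(50), rnorm(50, mean = 2)),
  g = factor(rep(c("1", "2", "3"), each = 50))
)

# Shapiro-Wilk p-value within each group
p_by_group <- tapply(sim$y, sim$g, function(v) shapiro.test(v)$p.value)
round(p_by_group, 4)
```

A p-value below 0.05 for any group is what triggers the switch to a non-parametric test.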

Since at least some of the groups do not meet the normality assumption, a standard one-way ANOVA is not appropriate, as it assumes normally distributed residuals. Instead, the Kruskal-Wallis test should be used as a non-parametric alternative, since it does not assume normality and compares the distributions (often interpreted as a comparison of medians) across multiple groups.

library(dplyr)
library(rstatix)
mydata %>%
  group_by(as.factor(mydata$Group)) %>%
  shapiro_test(Color_Intensity)

## # A tibble: 3 × 4
##   `as.factor(mydata$Group)` variable        statistic           p
##   <fct>                     <chr>               <dbl>       <dbl>
## 1 1                         Color_Intensity     0.945 0.0153     
## 2 2                         Color_Intensity     0.834 0.000000743
## 3 3                         Color_Intensity     0.971 0.284

The Shapiro-Wilk test for normality has been conducted for the Color Intensity variable within each cluster. The results indicate that for Group 2, the p-value is extremely small (7.43e-07), which is well below 0.05, meaning that the Color Intensity distribution in this group significantly deviates from normality. For Group 1, the p-value is 0.015, which is also below 0.05, suggesting a moderate violation of normality. However, Group 3 has a p-value of 0.284, indicating that normality is not violated in that cluster.

Since at least some of the groups do not follow a normal distribution, a standard one-way ANOVA is not appropriate because it assumes normally distributed residuals. Additionally, Levene’s test previously indicated that the assumption of homogeneity of variances was violated for Color Intensity. Given these conditions, the best approach is to use the Kruskal-Wallis test, which is a non-parametric alternative to ANOVA that does not require normality.

kruskal.test(Proline ~ as.factor(Group), 
             data = mydata)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  Proline by as.factor(Group)
## Kruskal-Wallis chi-squared = 95.232, df = 2, p-value < 2.2e-16

The Kruskal-Wallis rank sum test was performed to compare the distribution of Proline across the three clusters. The test produced a chi-squared value of 95.232 with 2 degrees of freedom, and the p-value is less than 2.2e-16, which is far below the standard significance threshold of 0.05.

This result means that there is a statistically significant difference in the distribution of Proline among the three clusters.

kruskal_effsize(Proline ~ as.factor(Group), 
             data = mydata)

## # A tibble: 1 × 5
##   .y.         n effsize method  magnitude
## * <chr>   <int>   <dbl> <chr>   <ord>    
## 1 Proline   164   0.579 eta2[H] large

The effect size calculation for the Kruskal-Wallis test on Proline is shown in the output. The effect size metric used is eta-squared (η²), which quantifies how much of the variance in Proline can be explained by the clustering variable (Group). The computed effect size is 0.579, and the interpretation is labeled as large, meaning that the difference in Proline levels across the clusters is substantial.

Since an effect size above 0.14 is typically considered large in non-parametric tests, this result confirms that the grouping has a strong impact on Proline distribution. This reinforces the earlier Kruskal-Wallis test results, suggesting that Proline is an important distinguishing factor among the clusters, with meaningful differences in its levels across the groups.
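The reported effect size can be reproduced from the Kruskal-Wallis statistic. The `rstatix` documentation gives eta2[H] as (H − k + 1) / (n − k), where H is the test statistic, k the number of groups, and n the number of observations. Plugging in the values from the outputs above:

```r
H <- 95.232   # Kruskal-Wallis chi-squared from the test output
k <- 3        # clusters
n <- 164      # observations

# Eta-squared based on the H statistic
eta2_H <- (H - k + 1) / (n - k)
round(eta2_H, 3)   # 0.579, matching the kruskal_effsize() output
```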

fit <- aov(cbind(Color_Intensity) ~ as.factor(Group), 
           data = mydata)

summary(fit)

##                   Df Sum Sq Mean Sq F value   Pr(>F)    
## as.factor(Group)   2  270.4  135.20   35.38 1.84e-13 ***
## Residuals        161  615.4    3.82                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The one-way ANOVA results for Color Intensity across the three clusters show that the grouping variable (Group) has a significant effect. The F-value is 35.38, and the p-value is 1.84e-13, which is far below the 0.05 threshold. This indicates statistically significant differences in Color Intensity between the clusters, although the result should be read with caution because the normality and equal-variance assumptions checked above were not met for this variable.

The Sum of Squares (SS) is divided into two parts:

Between-group SS = 270.4, which represents the variation explained by the clusters.

Residual SS = 615.4, which represents the unexplained variation within the clusters.

The Mean Square (MS) for the group factor is 135.20, while the residual MS is 3.82, leading to a large F-value, which further supports that the groups differ significantly.

library(effectsize)

## 
## Attaching package: 'effectsize'

## The following objects are masked from 'package:rstatix':
## 
##     cohens_d, eta_squared

eta_squared(fit)

## For one-way between subjects designs, partial eta squared is equivalent
##   to eta squared. Returning eta squared.

## # Effect Size for ANOVA
## 
## Parameter        | Eta2 |       95% CI
## --------------------------------------
## as.factor(Group) | 0.31 | [0.21, 1.00]
## 
## - One-sided CIs: upper bound fixed at [1.00].

interpret_eta_squared(0.31, rules = "cohen1992")

## [1] "large"
## (Rules: cohen1992)

The effect size calculation for ANOVA using eta-squared (η²) shows a value of 0.31 for the grouping factor (as.factor(Group)), with a 95% confidence interval ranging from 0.21 to 1.00. This effect size quantifies how much of the variance in Color Intensity is explained by the cluster grouping.
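Eta-squared is the between-group sum of squares divided by the total sum of squares, so it can be checked against the ANOVA table for Color Intensity. Using the rounded sums of squares printed there:

```r
ss_group <- 270.4   # between-cluster sum of squares
ss_resid <- 615.4   # residual sum of squares

# Share of the total variation explained by the grouping
eta2 <- ss_group / (ss_group + ss_resid)
round(eta2, 2)
```

Roughly 31% of the variation in Color Intensity is therefore associated with cluster membership.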

According to Cohen’s 1992 interpretation, an eta-squared value of 0.14 or higher is considered a large effect size, so the observed value of 0.31 is well within that range. This means that the differences in Color Intensity across clusters are not only statistically significant but also substantially meaningful in terms of practical significance. The clustering structure effectively captures distinct variations in Color Intensity, reinforcing the conclusion that the clusters represent wines with significantly different characteristics.

Jernej Košorok

2025-01-25