mydata <- read.table("./Wine_HW2.csv",
header = TRUE,
sep = ",",
dec = ".")
mydata <- cbind(ID = 1:nrow(mydata), mydata)
head(mydata)
## ID Alcohol Malic_Acid Ash Ash_Alcanity Magnesium Total_Phenols Flavanoids
## 1 1 14.23 1.71 2.43 15.6 127 2.80 3.06
## 2 2 13.20 1.78 2.14 11.2 100 2.65 2.76
## 3 3 13.16 2.36 2.67 18.6 101 2.80 3.24
## 4 4 14.37 1.95 2.50 16.8 113 3.85 3.49
## 5 5 13.24 2.59 2.87 21.0 118 2.80 2.69
## 6 6 14.20 1.76 2.45 15.2 112 3.27 3.39
## Nonflavanoid_Phenols Proanthocyanins Color_Intensity Hue OD280 Proline
## 1 0.28 2.29 5.64 1.04 3.92 1065
## 2 0.26 1.28 4.38 1.05 3.40 1050
## 3 0.30 2.81 5.68 1.03 3.17 1185
## 4 0.24 2.18 7.80 0.86 3.45 1480
## 5 0.39 1.82 4.32 1.04 2.93 735
## 6 0.34 1.97 6.75 1.05 2.85 1450
I have added the ID column so that individual wines can be referenced later.
The unit of observation is a wine from Italy. The data set contains 13 variables, i.e., 13 constituents found in each of the wines. Unfortunately, the data set provides no description of these constituents, and my knowledge on the matter is very limited, so here is a short explanation of the lesser-known constituents from other sources: flavanoids, nonflavanoid phenols, and proanthocyanins are classes of phenolic compounds (proanthocyanins act as tannin precursors), OD280 is an optical-density measurement used as a proxy for protein content, and proline is an amino acid.
Source of data: https://www.kaggle.com/datasets/harrywang/wine-dataset-for-clustering
library(pastecs)
round(stat.desc(mydata[ , c(-1)]), 2)
## Alcohol Malic_Acid Ash Ash_Alcanity Magnesium Total_Phenols
## nbr.val 178.00 178.00 178.00 178.00 178.00 178.00
## nbr.null 0.00 0.00 0.00 0.00 0.00 0.00
## nbr.na 0.00 0.00 0.00 0.00 0.00 0.00
## min 11.03 0.74 1.36 10.60 70.00 0.98
## max 14.83 5.80 3.23 30.00 162.00 3.88
## range 3.80 5.06 1.87 19.40 92.00 2.90
## sum 2314.11 415.87 421.24 3470.10 17754.00 408.53
## median 13.05 1.87 2.36 19.50 98.00 2.36
## mean 13.00 2.34 2.37 19.49 99.74 2.30
## SE.mean 0.06 0.08 0.02 0.25 1.07 0.05
## CI.mean.0.95 0.12 0.17 0.04 0.49 2.11 0.09
## var 0.66 1.25 0.08 11.15 203.99 0.39
## std.dev 0.81 1.12 0.27 3.34 14.28 0.63
## coef.var 0.06 0.48 0.12 0.17 0.14 0.27
## Flavanoids Nonflavanoid_Phenols Proanthocyanins Color_Intensity
## nbr.val 178.00 178.00 178.00 178.00
## nbr.null 0.00 0.00 0.00 0.00
## nbr.na 0.00 0.00 0.00 0.00
## min 0.34 0.13 0.41 1.28
## max 5.08 0.66 3.58 13.00
## range 4.74 0.53 3.17 11.72
## sum 361.21 64.41 283.18 900.34
## median 2.13 0.34 1.56 4.69
## mean 2.03 0.36 1.59 5.06
## SE.mean 0.07 0.01 0.04 0.17
## CI.mean.0.95 0.15 0.02 0.08 0.34
## var 1.00 0.02 0.33 5.37
## std.dev 1.00 0.12 0.57 2.32
## coef.var 0.49 0.34 0.36 0.46
## Hue OD280 Proline
## nbr.val 178.00 178.00 178.00
## nbr.null 0.00 0.00 0.00
## nbr.na 0.00 0.00 0.00
## min 0.48 1.27 278.00
## max 1.71 4.00 1680.00
## range 1.23 2.73 1402.00
## sum 170.43 464.88 132947.00
## median 0.96 2.78 673.50
## mean 0.96 2.61 746.89
## SE.mean 0.02 0.05 23.60
## CI.mean.0.95 0.03 0.11 46.58
## var 0.05 0.50 99166.72
## std.dev 0.23 0.71 314.91
## coef.var 0.24 0.27 0.42
The median of the variable Alcohol is 13.05. This means that half of the wines in the sample contain more than 13.05% alcohol and half contain less.
The average amount of malic acid found in the wines from the sample is 2.34g/L.
The lowest amount of magnesium in the wines from the sample is 70 mg/L, and the highest is 162 mg/L.
mydata_clu_std <- as.data.frame(scale(mydata[5:10]))
I have standardized the six variables that I will use for clustering (columns 5 to 10, Ash_Alcanity through Proanthocyanins).
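As a quick sanity check (a minimal sketch, not part of the required output), the standardized variables should have a mean of 0 and a standard deviation of 1:
round(colMeans(mydata_clu_std), 2)      # all means should be 0
round(apply(mydata_clu_std, 2, sd), 2)  # all standard deviations should be 1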
# Euclidean distance of each standardized unit from the overall centroid
mydata$Dissimilarity <- sqrt(mydata_clu_std$Ash_Alcanity^2 +
                             mydata_clu_std$Magnesium^2 +
                             mydata_clu_std$Total_Phenols^2 +
                             mydata_clu_std$Flavanoids^2 +
                             mydata_clu_std$Nonflavanoid_Phenols^2 +
                             mydata_clu_std$Proanthocyanins^2)
head(mydata[order(-mydata$Dissimilarity), c("ID", "Dissimilarity")], 10)
## ID Dissimilarity
## 96 96 5.292541
## 74 74 4.761905
## 122 122 4.627028
## 70 70 4.505610
## 15 15 4.036761
## 111 111 3.949371
## 14 14 3.937640
## 60 60 3.843902
## 51 51 3.807776
## 79 79 3.610168
There seems to be a relatively large gap in dissimilarity between units with ID 96, 74, 122, and 70 and the rest. I will therefore remove these four units as outliers.
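As a side note, since the data is standardized, the expression above is simply the Euclidean distance of each unit from the origin, which can be computed more compactly; a minimal sketch verifying the equivalence:
dissimilarity_check <- sqrt(rowSums(mydata_clu_std^2))
all.equal(mydata$Dissimilarity, unname(dissimilarity_check))  # should return TRUE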
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:pastecs':
##
## first, last
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
mydata <- mydata %>%
filter(!ID %in% c(96, 74, 122, 70))
mydata$ID <- seq(1, nrow(mydata))
mydata_clu_std <- as.data.frame(scale(mydata[5:10]))
library(factoextra)
## Warning: package 'factoextra' was built under R version 4.4.2
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
Distances <- get_dist(mydata_clu_std,
method = "euclidean")
fviz_dist(Distances,
gradient = list(low = "darkred",
mid = "grey95",
high = "white"))
Let us calculate the Hopkins statistic to see whether the data is clusterable. Values above 0.5 indicate a clustering tendency; a value close to 0.5 corresponds to uniformly distributed (unclusterable) data.
library(factoextra)
get_clust_tendency(mydata_clu_std,
n = nrow(mydata_clu_std) - 1,
graph = FALSE)
## $hopkins_stat
## [1] 0.6718374
##
## $plot
## NULL
The Hopkins statistic is about 0.67, which is above 0.5, so the data is appropriate for clustering.
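To put this value in context (a sketch with simulated data, not part of the original analysis), the Hopkins statistic of uniformly random data of the same dimensions should land near 0.5:
set.seed(123)  # for reproducibility of the simulated reference data
random_data <- as.data.frame(apply(mydata_clu_std, 2,
                                   function(x) runif(length(x), min(x), max(x))))
get_clust_tendency(random_data,
                   n = nrow(random_data) - 1,
                   graph = FALSE)$hopkins_stat
# expected to be close to 0.5, i.e., no clustering tendency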
The procedures above have helped us determine whether the data is appropriate for clustering. Now that we have learned that it is, we are interested in how many clusters we should form. To answer this question, I will choose the K-Means clustering approach and determine the number of clusters with the elbow method and the silhouette analysis.
library(factoextra)
library(NbClust)
fviz_nbclust(mydata_clu_std, kmeans, method = "wss") + labs(subtitle = "Elbow method")
fviz_nbclust(mydata_clu_std, kmeans, method = "silhouette") + labs(subtitle = "Silhouette analysis")
Both suggest that we should form two clusters. However, let us look at what the other indices suggest as well.
library(NbClust)
NbClust(mydata_clu_std,
distance = "euclidean",
min.nc = 2, max.nc = 10,
method = "kmeans",
index = "all")
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 11 proposed 2 as the best number of clusters
## * 3 proposed 3 as the best number of clusters
## * 4 proposed 4 as the best number of clusters
## * 4 proposed 6 as the best number of clusters
## * 1 proposed 10 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 2
##
##
## *******************************************************************
## $All.index
## KL CH Hartigan CCC Scott Marriot TrCovW TraceW
## 2 34.3560 119.5826 22.9304 0.4269 268.1222 1.359770e+12 11933.035 612.2999
## 3 0.1991 78.7670 27.1163 -1.4892 395.9105 1.467897e+12 8489.215 540.2727
## 4 1.4640 69.4685 19.9563 -1.3533 524.6060 1.245535e+12 6312.740 466.3251
## 5 1.1004 62.8348 16.8889 -1.1738 609.1731 1.197020e+12 4923.840 417.3344
## 6 2.5915 58.3220 11.2848 -0.9710 704.8139 9.948314e+11 4383.481 379.4175
## 7 0.7563 53.4271 10.8272 -1.1190 757.7271 9.990212e+11 3425.170 355.5357
## 8 11.5771 50.0092 6.9229 -1.0764 823.7212 8.929781e+11 3095.027 333.8885
## 9 1.4616 46.1685 5.7111 -1.5456 848.4073 9.806878e+11 2792.972 320.5213
## 10 0.0631 42.8325 8.2764 -2.0602 904.3641 8.777696e+11 2633.831 309.7984
## Friedman Rubin Cindex DB Silhouette Duda Pseudot2 Beale Ratkowsky
## 2 9.4490 1.6952 0.3360 1.1976 0.3527 0.8068 20.5894 0.9105 0.4315
## 3 10.6943 1.9213 0.3143 1.4288 0.2912 1.1162 -11.0356 -0.3958 0.3946
## 4 12.8986 2.2259 0.3279 1.4041 0.2376 1.1210 -9.6089 -0.4089 0.3671
## 5 14.1361 2.4872 0.3330 1.5799 0.2093 1.2559 -12.4277 -0.7675 0.3444
## 6 16.1703 2.7358 0.3602 1.5389 0.2093 1.3499 -13.4776 -0.9728 0.3236
## 7 17.1546 2.9195 0.3860 1.5501 0.2006 1.6371 -19.8474 -1.4473 0.3057
## 8 18.2640 3.1088 0.3714 1.5654 0.1857 2.3964 -26.2222 -2.1618 0.2904
## 9 18.3825 3.2385 0.3657 1.5417 0.1806 1.6302 -13.9162 -1.4277 0.2766
## 10 20.2824 3.3506 0.3621 1.5621 0.1699 1.3855 -8.6261 -0.9941 0.2644
## Ball Ptbiserial Frey McClain Dunn Hubert SDindex Dindex SDbw
## 2 306.1500 0.5939 0.9522 0.6415 0.1048 0.0019 1.6434 1.7651 0.9734
## 3 180.0909 0.5854 1.2315 0.9157 0.1048 0.0020 1.6587 1.6557 0.6535
## 4 116.5813 0.5212 0.4987 1.5841 0.0708 0.0022 1.6253 1.5275 0.6505
## 5 83.4669 0.5028 0.4854 2.1211 0.1285 0.0025 1.6555 1.4490 0.4338
## 6 63.2363 0.4829 0.9004 2.6098 0.1461 0.0027 1.6068 1.3891 0.4073
## 7 50.7908 0.4482 0.2858 3.2406 0.1319 0.0026 1.7164 1.3393 0.3764
## 8 41.7361 0.4381 0.2498 3.6671 0.1375 0.0028 1.7240 1.3041 0.3650
## 9 35.6135 0.4355 1.5777 3.7907 0.1294 0.0028 1.8261 1.2830 0.3476
## 10 30.9798 0.4112 0.2143 4.3343 0.1333 0.0029 2.0855 1.2574 0.3294
##
## $All.CriticalValues
## CritValue_Duda CritValue_PseudoT2 Fvalue_Beale
## 2 0.7107 35.0013 0.4869
## 3 0.7086 43.5868 1.0000
## 4 0.6808 41.7323 1.0000
## 5 0.6484 33.0720 1.0000
## 6 0.6288 30.7008 1.0000
## 7 0.5853 36.1350 1.0000
## 8 0.5748 33.2833 1.0000
## 9 0.5569 28.6401 1.0000
## 10 0.4503 37.8492 1.0000
##
## $Best.nc
## KL CH Hartigan CCC Scott Marriot TrCovW
## Number_clusters 2.000 2.0000 4.0000 2.0000 4.0000 6 3.000
## Value_Index 34.356 119.5826 7.1601 0.4269 128.6955 206377922035 3443.821
## TraceW Friedman Rubin Cindex DB Silhouette Duda
## Number_clusters 4.0000 4.0000 6.0000 3.0000 2.0000 2.0000 2.0000
## Value_Index 24.9568 2.2043 -0.0648 0.3143 1.1976 0.3527 0.8068
## PseudoT2 Beale Ratkowsky Ball PtBiserial Frey McClain
## Number_clusters 2.0000 2.0000 2.0000 3.0000 2.0000 1 2.0000
## Value_Index 20.5894 0.9105 0.4315 126.0591 0.5939 NA 0.6415
## Dunn Hubert SDindex Dindex SDbw
## Number_clusters 6.0000 0 6.0000 0 10.0000
## Value_Index 0.1461 0 1.6068 0 0.3294
##
## $Best.partition
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 1 2 1 1 2 2 2 1 2 1 2
## [75] 1 2 1 1 1 1 2 2 1 1 2 2 2 2 2 2 2 1 1 2 1 1 1 1 2 2 2 2 2 2 2 1 1 1 1 2 2
## [112] 2 2 2 2 2 2 1 2 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [149] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
Now that we have seen that two clusters is the preferred solution, let us form them.
Clustering <- kmeans(mydata_clu_std,
centers = 2,
nstart = 25)
Clustering
## K-means clustering with 2 clusters of sizes 86, 88
##
## Cluster means:
## Ash_Alcanity Magnesium Total_Phenols Flavanoids Nonflavanoid_Phenols
## 1 -0.5342117 0.2574513 0.8037372 0.8600769 -0.6073951
## 2 0.5220705 -0.2516001 -0.7854705 -0.8405297 0.5935907
## Proanthocyanins
## 1 0.6301831
## 2 -0.6158607
##
## Clustering vector:
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 1 2 1 1 2 2 2 1 2 1 2
## [75] 1 2 1 1 1 1 2 2 1 1 2 2 2 2 2 2 2 1 1 2 1 1 1 1 2 2 2 2 2 2 2 1 1 1 1 2 2
## [112] 2 2 2 2 2 2 1 2 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [149] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##
## Within cluster sum of squares by cluster:
## [1] 277.4111 334.8888
## (between_SS / total_SS = 41.0 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
As we can see, we have formed two clusters with 86 and 88 units, respectively. 41.0% of the total variability in the Italian wines from our sample is explained by classifying the wines into two groups.
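The reported ratio can also be computed directly from the components of the kmeans object; a quick sketch:
# share of total variability explained by the two-group classification
round(100 * Clustering$betweenss / Clustering$totss, 1)  # matches the 41.0 % above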
rownames(mydata_clu_std) <- mydata$ID
library(factoextra)
fviz_cluster(Clustering,
palette = "Set1",
repel = TRUE,
ggtheme = theme_bw(),
data = mydata_clu_std)
Units with ID 94 and 127 appear to be outliers in the cluster plot, so I will remove them as well.
library(dplyr)
mydata <- mydata %>%
filter(!ID %in% c(94, 127))
mydata$ID <- seq(1, nrow(mydata))
mydata_clu_std <- as.data.frame(scale(mydata[5:10]))
Clustering <- kmeans(mydata_clu_std,
centers = 2,
nstart = 25)
Clustering
## K-means clustering with 2 clusters of sizes 86, 86
##
## Cluster means:
## Ash_Alcanity Magnesium Total_Phenols Flavanoids Nonflavanoid_Phenols
## 1 0.5308095 -0.2925087 -0.7919399 -0.8487812 0.6303733
## 2 -0.5308095 0.2925087 0.7919399 0.8487812 -0.6303733
## Proanthocyanins
## 1 -0.6221845
## 2 0.6221845
##
## Clustering vector:
## [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [38] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 2 1 2 2 1 1 1 2 1 2 1
## [75] 2 1 2 2 2 2 1 1 2 2 1 1 1 1 1 1 1 2 2 2 2 2 2 1 1 1 1 1 1 1 2 2 2 2 1 1 1
## [112] 1 1 1 1 1 2 1 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [149] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##
## Within cluster sum of squares by cluster:
## [1] 314.0324 282.0703
## (between_SS / total_SS = 41.9 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
After removing these outliers, the ratio of between-group variability to total variability increased from 41.0% to 41.9%; this ratio should be as large as possible. Each group now contains 86 units.
library(factoextra)
fviz_cluster(Clustering,
palette = "Set1",
repel = TRUE,
ggtheme = theme_bw(),
data = mydata_clu_std)
Averages <- Clustering$centers
Averages
## Ash_Alcanity Magnesium Total_Phenols Flavanoids Nonflavanoid_Phenols
## 1 0.5308095 -0.2925087 -0.7919399 -0.8487812 0.6303733
## 2 -0.5308095 0.2925087 0.7919399 0.8487812 -0.6303733
## Proanthocyanins
## 1 -0.6221845
## 2 0.6221845
Figure <- as.data.frame(Averages)
Figure$ID <- 1:nrow(Figure)
library(tidyr)
##
## Attaching package: 'tidyr'
## The following object is masked from 'package:pastecs':
##
## extract
Figure <- pivot_longer(Figure,
                       cols = c("Ash_Alcanity", "Magnesium", "Total_Phenols",
                                "Flavanoids", "Nonflavanoid_Phenols",
                                "Proanthocyanins"))
Figure$Group <- factor(Figure$ID,
levels = c(1, 2),
labels = c("1", "2"))
library(ggplot2)
ggplot(Figure, aes(x = name, y = value)) +
geom_hline(yintercept = 0) +
theme_bw() +
geom_point(aes(shape = Group, col = Group), size = 3) +
geom_line(aes(group = ID), linewidth = 1) +
ylab("Averages") +
xlab("Cluster variables")+
ylim(-2.2, 2.2) +
theme(axis.text.x = element_text(angle = 45, vjust = 0.50, size = 10))
Wines from group 2 are above average in flavanoids, magnesium, proanthocyanins, and total phenols. On the other hand, wines from group 1 are above average in ash alcalinity and nonflavanoid phenols.
I do not know enough about wines to say which group's wines are better. However, flavanoids, magnesium, proanthocyanins, and total phenols are apparently attributes often associated with premium wine quality, so I suppose the wines from group 2 are better.
mydata$Group <- Clustering$cluster
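To interpret the clusters in the original (unstandardized) units, it may also help to look at the group means of the six cluster variables; a minimal sketch:
# group means of the cluster variables on the original scale
aggregate(mydata[ , 5:10],
          by = list(Group = mydata$Group),
          FUN = mean)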
Let us test the differences between the group means of the cluster variables. To test this, I will perform a one-way analysis of variance (ANOVA).
I could write the hypotheses for all the variables included in clustering (ash alcalinity, magnesium, total phenols, flavanoids, nonflavanoid phenols, and proanthocyanins), and the formal test below does cover all of them. However, let me write the hypotheses only for magnesium; the hypotheses for the other cluster variables would be written in the same manner:
H0: the mean amount of magnesium is the same in group 1 and group 2.
H1: the mean amount of magnesium is not the same in group 1 and group 2.
fit <- aov(cbind(Ash_Alcanity, Magnesium, Total_Phenols, Flavanoids, Nonflavanoid_Phenols, Proanthocyanins) ~ as.factor(Group),
data = mydata)
summary(fit)
## Response Ash_Alcanity :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 1 500.83 500.83 67.234 5.648e-14 ***
## Residuals 170 1266.34 7.45
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response Magnesium :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 1 2213.3 2213.31 16.008 9.397e-05 ***
## Residuals 170 23504.4 138.26
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response Total_Phenols :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 1 41.759 41.759 290.5 < 2.2e-16 ***
## Residuals 170 24.437 0.144
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response Flavanoids :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 1 119.006 119.006 447.38 < 2.2e-16 ***
## Residuals 170 45.221 0.266
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response Nonflavanoid_Phenols :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 1 1.0315 1.03153 113.19 < 2.2e-16 ***
## Residuals 170 1.5493 0.00911
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response Proanthocyanins :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 1 20.881 20.8814 108.4 < 2.2e-16 ***
## Residuals 170 32.746 0.1926
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
H0 can be rejected for every cluster variable, as all p-values are below 0.001. This means that the mean values of the variables (ash alcalinity, magnesium, total phenols, flavanoids, nonflavanoid phenols, and proanthocyanins) are not the same in group 1 and group 2. If that were not the case, the non-discriminating variables would have to be eliminated or replaced and the classification repeated.
Let us check which group's wines contain more alcohol on average.
aggregate(mydata$Alcohol,
by = list(mydata$Group),
FUN = mean)
## Group.1 x
## 1 1 12.75221
## 2 2 13.29674
Wines in group 2 contain more alcohol on average (13.30% vs. 12.75%).
To check criterion validity, I will use a variable that was not used in the clustering process; again, I will choose alcohol. Before comparing the groups, I will perform Levene's test for homogeneity of variance.
The hypotheses are the following:
H0: the variance of alcohol is the same in group 1 and group 2.
H1: the variance of alcohol is not the same in group 1 and group 2.
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
leveneTest(mydata$Alcohol, as.factor(mydata$Group))
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 1 2.0032 0.1588
## 170
We cannot reject H0 (p = 0.16). This means that we do not have sufficient evidence to conclude that the variance of alcohol differs between group 1 and group 2.
Let us now check whether alcohol is normally distributed in each group. I will test this with the Shapiro-Wilk normality test.
library(dplyr)
library(rstatix)
## Warning: package 'rstatix' was built under R version 4.4.2
##
## Attaching package: 'rstatix'
## The following object is masked from 'package:stats':
##
## filter
mydata %>%
group_by(Group) %>%
shapiro_test(Alcohol)
## # A tibble: 2 × 4
## Group variable statistic p
## <int> <chr> <dbl> <dbl>
## 1 1 Alcohol 0.988 0.615
## 2 2 Alcohol 0.956 0.00514
My hypotheses are the following:
For group 1: H0: alcohol is normally distributed in group 1; H1: alcohol is not normally distributed in group 1.
For group 2: H0: alcohol is normally distributed in group 2; H1: alcohol is not normally distributed in group 2.
H0 can be rejected for group 2 (p = 0.005), but not for group 1 (p = 0.62).
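A visual check can complement the formal test; a sketch using qqPlot() from the already-loaded car package:
# normal Q-Q plots of Alcohol by group; systematic departures from the
# reference line suggest non-normality
qqPlot(mydata$Alcohol[mydata$Group == 1], ylab = "Alcohol, group 1")
qqPlot(mydata$Alcohol[mydata$Group == 2], ylab = "Alcohol, group 2")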
As the assumption of normality is violated in group 2, I will perform the non-parametric alternative to ANOVA, the Kruskal-Wallis rank sum test.
My hypotheses are the following:
H0: the distribution of alcohol has the same location in group 1 and group 2.
H1: the location of the distribution of alcohol differs between group 1 and group 2.
kruskal.test(Alcohol ~ as.factor(Group),
data = mydata)
##
## Kruskal-Wallis rank sum test
##
## data: Alcohol by as.factor(Group)
## Kruskal-Wallis chi-squared = 20.036, df = 1, p-value = 7.6e-06
As the p-value is well below 0.001, H0 can be rejected: the location of alcohol differs between the two groups, which supports the criterion validity of the clustering.
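Since the Kruskal-Wallis test compares locations rather than means, the group medians are the natural accompanying summary; a quick sketch:
# median alcohol content per group
aggregate(mydata$Alcohol,
          by = list(Group = mydata$Group),
          FUN = median)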
After removing six outliers in two steps, I clustered 172 wines into two segments based on six standardized variables. Both groups contain an equal number of units (86).
The wines in group 1 are above average in ash alcalinity and nonflavanoid phenols, which may suggest that they come from regions where the soil is rich in alkaline minerals, such as potassium, calcium, magnesium, and sodium. Additionally, high nonflavanoid phenol values suggest bitterness. On average, these wines contain less alcohol than the wines in group 2.
The wines in group 2 are above average in flavanoids, magnesium, proanthocyanins, and total phenols, attributes that are apparently associated with higher quality. On average, they contain more alcohol than the wines in group 1.