mydata <- read.table("./Wine_HW2.csv",
header = TRUE,
sep = ",",
dec = ".")
mydata <- cbind(ID = 1:nrow(mydata), mydata)
head(mydata)
## ID Alcohol Malic_Acid Ash Ash_Alcanity Magnesium Total_Phenols Flavanoids
## 1 1 14.23 1.71 2.43 15.6 127 2.80 3.06
## 2 2 13.20 1.78 2.14 11.2 100 2.65 2.76
## 3 3 13.16 2.36 2.67 18.6 101 2.80 3.24
## 4 4 14.37 1.95 2.50 16.8 113 3.85 3.49
## 5 5 13.24 2.59 2.87 21.0 118 2.80 2.69
## 6 6 14.20 1.76 2.45 15.2 112 3.27 3.39
## Nonflavanoid_Phenols Proanthocyanins Color_Intensity Hue OD280 Proline
## 1 0.28 2.29 5.64 1.04 3.92 1065
## 2 0.26 1.28 4.38 1.05 3.40 1050
## 3 0.30 2.81 5.68 1.03 3.17 1185
## 4 0.24 2.18 7.80 0.86 3.45 1480
## 5 0.39 1.82 4.32 1.04 2.93 735
## 6 0.34 1.97 6.75 1.05 2.85 1450
I have added the ID column so that individual wines can be referenced later.
The unit of observation is a wine from Italy. The data set contains 13 variables, i.e., 13 constituents found in each of the wines. Unfortunately, the data set provides no description of these constituents, and my knowledge on the matter is very limited, so here is a short explanation of the lesser-known constituents from other sources: flavanoids, nonflavanoid phenols, and proanthocyanins are classes of phenolic compounds (proanthocyanins act as tannin precursors), OD280 is an optical-density measurement used as a proxy for protein content, and proline is an amino acid.
Source of data: https://www.kaggle.com/datasets/harrywang/wine-dataset-for-clustering
library(pastecs)
round(stat.desc(mydata[ , c(-1)]), 2)
## Alcohol Malic_Acid Ash Ash_Alcanity Magnesium Total_Phenols
## nbr.val 178.00 178.00 178.00 178.00 178.00 178.00
## nbr.null 0.00 0.00 0.00 0.00 0.00 0.00
## nbr.na 0.00 0.00 0.00 0.00 0.00 0.00
## min 11.03 0.74 1.36 10.60 70.00 0.98
## max 14.83 5.80 3.23 30.00 162.00 3.88
## range 3.80 5.06 1.87 19.40 92.00 2.90
## sum 2314.11 415.87 421.24 3470.10 17754.00 408.53
## median 13.05 1.87 2.36 19.50 98.00 2.36
## mean 13.00 2.34 2.37 19.49 99.74 2.30
## SE.mean 0.06 0.08 0.02 0.25 1.07 0.05
## CI.mean.0.95 0.12 0.17 0.04 0.49 2.11 0.09
## var 0.66 1.25 0.08 11.15 203.99 0.39
## std.dev 0.81 1.12 0.27 3.34 14.28 0.63
## coef.var 0.06 0.48 0.12 0.17 0.14 0.27
## Flavanoids Nonflavanoid_Phenols Proanthocyanins Color_Intensity
## nbr.val 178.00 178.00 178.00 178.00
## nbr.null 0.00 0.00 0.00 0.00
## nbr.na 0.00 0.00 0.00 0.00
## min 0.34 0.13 0.41 1.28
## max 5.08 0.66 3.58 13.00
## range 4.74 0.53 3.17 11.72
## sum 361.21 64.41 283.18 900.34
## median 2.13 0.34 1.56 4.69
## mean 2.03 0.36 1.59 5.06
## SE.mean 0.07 0.01 0.04 0.17
## CI.mean.0.95 0.15 0.02 0.08 0.34
## var 1.00 0.02 0.33 5.37
## std.dev 1.00 0.12 0.57 2.32
## coef.var 0.49 0.34 0.36 0.46
## Hue OD280 Proline
## nbr.val 178.00 178.00 178.00
## nbr.null 0.00 0.00 0.00
## nbr.na 0.00 0.00 0.00
## min 0.48 1.27 278.00
## max 1.71 4.00 1680.00
## range 1.23 2.73 1402.00
## sum 170.43 464.88 132947.00
## median 0.96 2.78 673.50
## mean 0.96 2.61 746.89
## SE.mean 0.02 0.05 23.60
## CI.mean.0.95 0.03 0.11 46.58
## var 0.05 0.50 99166.72
## std.dev 0.23 0.71 314.91
## coef.var 0.24 0.27 0.42
The median of the variable Alcohol is 13.05. This means that half of the wines in the sample contain more than 13.05% alcohol and half contain less.
The average amount of malic acid found in the wines from the sample is 2.34g/L.
The lowest amount of magnesium in the wines from the sample is 70 mg/L, and the highest is 162 mg/L.
mydata_clu_std <- as.data.frame(scale(mydata[5:10]))
I have standardized the six variables that I will use for clustering (columns 5 to 10, Ash_Alcanity through Proanthocyanins).
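As a quick sanity check (a minimal sketch, not part of the required output), the standardized variables should have a mean of 0 and a standard deviation of 1:
round(colMeans(mydata_clu_std), 2)      # all means should be 0
round(apply(mydata_clu_std, 2, sd), 2)  # all standard deviations should be 1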
# Euclidean distance of each standardized unit from the overall centroid
mydata$Dissimilarity <- sqrt(mydata_clu_std$Ash_Alcanity^2 +
                             mydata_clu_std$Magnesium^2 +
                             mydata_clu_std$Total_Phenols^2 +
                             mydata_clu_std$Flavanoids^2 +
                             mydata_clu_std$Nonflavanoid_Phenols^2 +
                             mydata_clu_std$Proanthocyanins^2)
head(mydata[order(-mydata$Dissimilarity), c("ID", "Dissimilarity")], 10)
## ID Dissimilarity
## 96 96 5.292541
## 74 74 4.761905
## 122 122 4.627028
## 70 70 4.505610
## 15 15 4.036761
## 111 111 3.949371
## 14 14 3.937640
## 60 60 3.843902
## 51 51 3.807776
## 79 79 3.610168
There seems to be a relatively large gap in dissimilarity between units with ID 96, 74, 122, and 70 and the rest. I will therefore remove these four units as outliers.
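As a side note, since the data is standardized, the expression above is simply the Euclidean distance of each unit from the origin, which can be computed more compactly; a minimal sketch verifying the equivalence:
dissimilarity_check <- sqrt(rowSums(mydata_clu_std^2))
all.equal(mydata$Dissimilarity, unname(dissimilarity_check))  # should return TRUE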
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:pastecs':
##
## first, last
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
mydata <- mydata %>%
filter(!ID %in% c(96, 74, 122, 70))
mydata$ID <- seq(1, nrow(mydata))
mydata_clu_std <- as.data.frame(scale(mydata[5:10]))
library(factoextra)
## Warning: package 'factoextra' was built under R version 4.4.2
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
Distances <- get_dist(mydata_clu_std,
method = "euclidean")
fviz_dist(Distances,
gradient = list(low = "darkred",
mid = "grey95",
high = "white"))
Let us calculate the Hopkins statistic to see whether the data is clusterable. Values above 0.5 indicate a clustering tendency; a value close to 0.5 corresponds to uniformly distributed (unclusterable) data.
library(factoextra)
get_clust_tendency(mydata_clu_std,
n = nrow(mydata_clu_std) - 1,
graph = FALSE)
## $hopkins_stat
## [1] 0.6718374
##
## $plot
## NULL
The Hopkins statistic is about 0.67, which is above 0.5, so the data is appropriate for clustering.
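To put this value in context (a sketch with simulated data, not part of the original analysis), the Hopkins statistic of uniformly random data of the same dimensions should land near 0.5:
set.seed(123)  # for reproducibility of the simulated reference data
random_data <- as.data.frame(apply(mydata_clu_std, 2,
                                   function(x) runif(length(x), min(x), max(x))))
get_clust_tendency(random_data,
                   n = nrow(random_data) - 1,
                   graph = FALSE)$hopkins_stat
# expected to be close to 0.5, i.e., no clustering tendency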
The procedures above have helped us determine whether the data is appropriate for clustering. Now that we have learned that it is, we are interested in how many clusters we should form. To answer this question, I will choose the K-Means clustering approach and determine the number of clusters with the elbow method and the silhouette analysis.
library(factoextra)
library(NbClust)
fviz_nbclust(mydata_clu_std, kmeans, method = "wss") + labs(subtitle = "Elbow method")
fviz_nbclust(mydata_clu_std, kmeans, method = "silhouette") + labs(subtitle = "Silhouette analysis")
Both suggest that we should form two clusters. However, let us look at what the other indices suggest as well.
library(NbClust)
NbClust(mydata_clu_std,
distance = "euclidean",
min.nc = 2, max.nc = 10,
method = "kmeans",
index = "all")
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 11 proposed 2 as the best number of clusters
## * 3 proposed 3 as the best number of clusters
## * 4 proposed 4 as the best number of clusters
## * 4 proposed 6 as the best number of clusters
## * 1 proposed 10 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 2
##
##
## *******************************************************************
## $All.index
## KL CH Hartigan CCC Scott Marriot TrCovW TraceW
## 2 34.3560 119.5826 22.9304 0.4269 268.1222 1.359770e+12 11933.035 612.2999
## 3 0.1991 78.7670 27.1163 -1.4892 395.9105 1.467897e+12 8489.215 540.2727
## 4 1.4640 69.4685 19.9563 -1.3533 524.6060 1.245535e+12 6312.740 466.3251
## 5 1.1004 62.8348 16.8889 -1.1738 609.1731 1.197020e+12 4923.840 417.3344
## 6 2.5915 58.3220 11.2848 -0.9710 704.8139 9.948314e+11 4383.481 379.4175
## 7 0.7563 53.4271 10.8272 -1.1190 757.7271 9.990212e+11 3425.170 355.5357
## 8 11.5771 50.0092 6.9229 -1.0764 823.7212 8.929781e+11 3095.027 333.8885
## 9 1.4616 46.1685 5.7111 -1.5456 848.4073 9.806878e+11 2792.972 320.5213
## 10 0.0631 42.8325 8.2764 -2.0602 904.3641 8.777696e+11 2633.831 309.7984
## Friedman Rubin Cindex DB Silhouette Duda Pseudot2 Beale Ratkowsky
## 2 9.4490 1.6952 0.3360 1.1976 0.3527 0.8068 20.5894 0.9105 0.4315
## 3 10.6943 1.9213 0.3143 1.4288 0.2912 1.1162 -11.0356 -0.3958 0.3946
## 4 12.8986 2.2259 0.3279 1.4041 0.2376 1.1210 -9.6089 -0.4089 0.3671
## 5 14.1361 2.4872 0.3330 1.5799 0.2093 1.2559 -12.4277 -0.7675 0.3444
## 6 16.1703 2.7358 0.3602 1.5389 0.2093 1.3499 -13.4776 -0.9728 0.3236
## 7 17.1546 2.9195 0.3860 1.5501 0.2006 1.6371 -19.8474 -1.4473 0.3057
## 8 18.2640 3.1088 0.3714 1.5654 0.1857 2.3964 -26.2222 -2.1618 0.2904
## 9 18.3825 3.2385 0.3657 1.5417 0.1806 1.6302 -13.9162 -1.4277 0.2766
## 10 20.2824 3.3506 0.3621 1.5621 0.1699 1.3855 -8.6261 -0.9941 0.2644
## Ball Ptbiserial Frey McClain Dunn Hubert SDindex Dindex SDbw
## 2 306.1500 0.5939 0.9522 0.6415 0.1048 0.0019 1.6434 1.7651 0.9734
## 3 180.0909 0.5854 1.2315 0.9157 0.1048 0.0020 1.6587 1.6557 0.6535
## 4 116.5813 0.5212 0.4987 1.5841 0.0708 0.0022 1.6253 1.5275 0.6505
## 5 83.4669 0.5028 0.4854 2.1211 0.1285 0.0025 1.6555 1.4490 0.4338
## 6 63.2363 0.4829 0.9004 2.6098 0.1461 0.0027 1.6068 1.3891 0.4073
## 7 50.7908 0.4482 0.2858 3.2406 0.1319 0.0026 1.7164 1.3393 0.3764
## 8 41.7361 0.4381 0.2498 3.6671 0.1375 0.0028 1.7240 1.3041 0.3650
## 9 35.6135 0.4355 1.5777 3.7907 0.1294 0.0028 1.8261 1.2830 0.3476
## 10 30.9798 0.4112 0.2143 4.3343 0.1333 0.0029 2.0855 1.2574 0.3294
##
## $All.CriticalValues
## CritValue_Duda CritValue_PseudoT2 Fvalue_Beale
## 2 0.7107 35.0013 0.4869
## 3 0.7086 43.5868 1.0000
## 4 0.6808 41.7323 1.0000
## 5 0.6484 33.0720 1.0000
## 6 0.6288 30.7008 1.0000
## 7 0.5853 36.1350 1.0000
## 8 0.5748 33.2833 1.0000
## 9 0.5569 28.6401 1.0000
## 10 0.4503 37.8492 1.0000
##
## $Best.nc
## KL CH Hartigan CCC Scott Marriot TrCovW
## Number_clusters 2.000 2.0000 4.0000 2.0000 4.0000 6 3.000
## Value_Index 34.356 119.5826 7.1601 0.4269 128.6955 206377922035 3443.821
## TraceW Friedman Rubin Cindex DB Silhouette Duda
## Number_clusters 4.0000 4.0000 6.0000 3.0000 2.0000 2.0000 2.0000
## Value_Index 24.9568 2.2043 -0.0648 0.3143 1.1976 0.3527 0.8068
## PseudoT2 Beale Ratkowsky Ball PtBiserial Frey McClain
## Number_clusters 2.0000 2.0000 2.0000 3.0000 2.0000 1 2.0000
## Value_Index 20.5894 0.9105 0.4315 126.0591 0.5939 NA 0.6415
## Dunn Hubert SDindex Dindex SDbw
## Number_clusters 6.0000 0 6.0000 0 10.0000
## Value_Index 0.1461 0 1.6068 0 0.3294
##
## $Best.partition
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 1 2 1 1 2 2 2 1 2 1 2
## [75] 1 2 1 1 1 1 2 2 1 1 2 2 2 2 2 2 2 1 1 2 1 1 1 1 2 2 2 2 2 2 2 1 1 1 1 2 2
## [112] 2 2 2 2 2 2 1 2 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [149] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
Now that we have seen that two clusters is the preferred solution, let us form them.
Clustering <- kmeans(mydata_clu_std,
centers = 2,
nstart = 25)
Clustering
## K-means clustering with 2 clusters of sizes 86, 88
##
## Cluster means:
## Ash_Alcanity Magnesium Total_Phenols Flavanoids Nonflavanoid_Phenols
## 1 -0.5342117 0.2574513 0.8037372 0.8600769 -0.6073951
## 2 0.5220705 -0.2516001 -0.7854705 -0.8405297 0.5935907
## Proanthocyanins
## 1 0.6301831
## 2 -0.6158607
##
## Clustering vector:
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 1 2 1 1 2 2 2 1 2 1 2
## [75] 1 2 1 1 1 1 2 2 1 1 2 2 2 2 2 2 2 1 1 2 1 1 1 1 2 2 2 2 2 2 2 1 1 1 1 2 2
## [112] 2 2 2 2 2 2 1 2 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [149] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##
## Within cluster sum of squares by cluster:
## [1] 277.4111 334.8888
## (between_SS / total_SS = 41.0 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
As we can see, we have formed two clusters with 86 and 88 units, respectively. 41.0% of the total variability in the Italian wines from our sample is explained by classifying the wines into two groups.
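The reported ratio can also be computed directly from the components of the kmeans object; a quick sketch:
# share of total variability explained by the two-group classification
round(100 * Clustering$betweenss / Clustering$totss, 1)  # matches the 41.0 % above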
rownames(mydata_clu_std) <- mydata$ID
library(factoextra)
fviz_cluster(Clustering,
palette = "Set1",
repel = TRUE,
ggtheme = theme_bw(),
data = mydata_clu_std)
Units with ID 94 and 127 appear to be outliers in the cluster plot, so I will remove them as well.
library(dplyr)
mydata <- mydata %>%
filter(!ID %in% c(94, 127))
mydata$ID <- seq(1, nrow(mydata))
mydata_clu_std <- as.data.frame(scale(mydata[5:10]))
Clustering <- kmeans(mydata_clu_std,
centers = 2,
nstart = 25)
Clustering
## K-means clustering with 2 clusters of sizes 86, 86
##
## Cluster means:
## Ash_Alcanity Magnesium Total_Phenols Flavanoids Nonflavanoid_Phenols
## 1 0.5308095 -0.2925087 -0.7919399 -0.8487812 0.6303733
## 2 -0.5308095 0.2925087 0.7919399 0.8487812 -0.6303733
## Proanthocyanins
## 1 -0.6221845
## 2 0.6221845
##
## Clustering vector:
## [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [38] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 2 1 2 2 1 1 1 2 1 2 1
## [75] 2 1 2 2 2 2 1 1 2 2 1 1 1 1 1 1 1 2 2 2 2 2 2 1 1 1 1 1 1 1 2 2 2 2 1 1 1
## [112] 1 1 1 1 1 2 1 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [149] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##
## Within cluster sum of squares by cluster:
## [1] 314.0324 282.0703
## (between_SS / total_SS = 41.9 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
After removing these outliers, the ratio of between-group variability to total variability increased from 41.0% to 41.9%; this ratio should be as large as possible. Each group now contains 86 units.
library(factoextra)
fviz_cluster(Clustering,
palette = "Set1",
repel = TRUE,
ggtheme = theme_bw(),
data = mydata_clu_std)
Averages <- Clustering$centers
Averages
## Ash_Alcanity Magnesium Total_Phenols Flavanoids Nonflavanoid_Phenols
## 1 0.5308095 -0.2925087 -0.7919399 -0.8487812 0.6303733
## 2 -0.5308095 0.2925087 0.7919399 0.8487812 -0.6303733
## Proanthocyanins
## 1 -0.6221845
## 2 0.6221845
Figure <- as.data.frame(Averages)
Figure$ID <- 1:nrow(Figure)
library(tidyr)
##
## Attaching package: 'tidyr'
## The following object is masked from 'package:pastecs':
##
## extract
Figure <- pivot_longer(Figure,
                       cols = c("Ash_Alcanity", "Magnesium", "Total_Phenols",
                                "Flavanoids", "Nonflavanoid_Phenols",
                                "Proanthocyanins"))
Figure$Group <- factor(Figure$ID,
levels = c(1, 2),
labels = c("1", "2"))
library(ggplot2)
ggplot(Figure, aes(x = name, y = value)) +
geom_hline(yintercept = 0) +
theme_bw() +
geom_point(aes(shape = Group, col = Group), size = 3) +
geom_line(aes(group = ID), linewidth = 1) +
ylab("Averages") +
xlab("Cluster variables")+
ylim(-2.2, 2.2) +
theme(axis.text.x = element_text(angle = 45, vjust = 0.50, size = 10))
Wines from group 2 are above average in flavanoids, magnesium, proanthocyanins, and total phenols. On the other hand, wines from group 1 are above average in ash alcalinity and nonflavanoid phenols.
I do not know enough about wines to say which group's wines are better. However, flavanoids, magnesium, proanthocyanins, and total phenols are apparently attributes often associated with premium wine quality, so I suppose the wines from group 2 are better.
mydata$Group <- Clustering$cluster
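To interpret the clusters in the original (unstandardized) units, it may also help to look at the group means of the six cluster variables; a minimal sketch:
# group means of the cluster variables on the original scale
aggregate(mydata[ , 5:10],
          by = list(Group = mydata$Group),
          FUN = mean)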
Let us test the differences between the group means of the cluster variables. To test this, I will perform a one-way analysis of variance (ANOVA).
I could write the hypotheses for all the variables included in clustering (ash alcalinity, magnesium, total phenols, flavanoids, nonflavanoid phenols, and proanthocyanins), and the formal test below does cover all of them. However, let me write the hypotheses only for magnesium; the hypotheses for the other cluster variables would be written in the same manner:
H0: the mean amount of magnesium is the same in group 1 and group 2.
H1: the mean amount of magnesium is not the same in group 1 and group 2.
fit <- aov(cbind(Ash_Alcanity, Magnesium, Total_Phenols, Flavanoids, Nonflavanoid_Phenols, Proanthocyanins) ~ as.factor(Group),
data = mydata)
summary(fit)
## Response Ash_Alcanity :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 1 500.83 500.83 67.234 5.648e-14 ***
## Residuals 170 1266.34 7.45
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response Magnesium :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 1 2213.3 2213.31 16.008 9.397e-05 ***
## Residuals 170 23504.4 138.26
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response Total_Phenols :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 1 41.759 41.759 290.5 < 2.2e-16 ***
## Residuals 170 24.437 0.144
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response Flavanoids :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 1 119.006 119.006 447.38 < 2.2e-16 ***
## Residuals 170 45.221 0.266
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response Nonflavanoid_Phenols :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 1 1.0315 1.03153 113.19 < 2.2e-16 ***
## Residuals 170 1.5493 0.00911
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response Proanthocyanins :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 1 20.881 20.8814 108.4 < 2.2e-16 ***
## Residuals 170 32.746 0.1926
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
H0 can be rejected for every cluster variable, as all p-values are below 0.001. This means that the mean values of the variables (ash alcalinity, magnesium, total phenols, flavanoids, nonflavanoid phenols, and proanthocyanins) are not the same in group 1 and group 2. If that were not the case, the non-discriminating variables would have to be eliminated or replaced and the classification repeated.
Let us check which group's wines contain more alcohol on average.
aggregate(mydata$Alcohol,
by = list(mydata$Group),
FUN = mean)
## Group.1 x
## 1 1 12.75221
## 2 2 13.29674
Wines in group 2 contain more alcohol on average (13.30% vs. 12.75%).
To check criterion validity, I will use a variable that was not used in the clustering process; again, I will choose alcohol. Before comparing the groups, I will perform Levene's test for homogeneity of variance.
The hypotheses are the following:
H0: the variance of alcohol is the same in group 1 and group 2.
H1: the variance of alcohol is not the same in group 1 and group 2.
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
leveneTest(mydata$Alcohol, as.factor(mydata$Group))
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 1 2.0032 0.1588
## 170
We cannot reject H0 (p = 0.16). This means that we do not have sufficient evidence to conclude that the variance of alcohol differs between group 1 and group 2.
Let us now check whether alcohol is normally distributed in each group. I will test this with the Shapiro-Wilk normality test.
library(dplyr)
library(rstatix)
## Warning: package 'rstatix' was built under R version 4.4.2
##
## Attaching package: 'rstatix'
## The following object is masked from 'package:stats':
##
## filter
mydata %>%
group_by(Group) %>%
shapiro_test(Alcohol)
## # A tibble: 2 × 4
## Group variable statistic p
## <int> <chr> <dbl> <dbl>
## 1 1 Alcohol 0.988 0.615
## 2 2 Alcohol 0.956 0.00514
My hypotheses are the following:
For group 1: H0: alcohol is normally distributed in group 1; H1: alcohol is not normally distributed in group 1.
For group 2: H0: alcohol is normally distributed in group 2; H1: alcohol is not normally distributed in group 2.
H0 can be rejected for group 2 (p = 0.005), but not for group 1 (p = 0.62).
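A visual check can complement the formal test; a sketch using qqPlot() from the already-loaded car package:
# normal Q-Q plots of Alcohol by group; systematic departures from the
# reference line suggest non-normality
qqPlot(mydata$Alcohol[mydata$Group == 1], ylab = "Alcohol, group 1")
qqPlot(mydata$Alcohol[mydata$Group == 2], ylab = "Alcohol, group 2")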
As the assumption of normality is violated in group 2, I will perform the non-parametric alternative to ANOVA, the Kruskal-Wallis rank sum test.
My hypotheses are the following:
H0: the distribution of alcohol has the same location in group 1 and group 2.
H1: the location of the distribution of alcohol differs between group 1 and group 2.
kruskal.test(Alcohol ~ as.factor(Group),
data = mydata)
##
## Kruskal-Wallis rank sum test
##
## data: Alcohol by as.factor(Group)
## Kruskal-Wallis chi-squared = 20.036, df = 1, p-value = 7.6e-06
As the p-value is well below 0.001, H0 can be rejected: the location of alcohol differs between the two groups, which supports the criterion validity of the clustering.
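Since the Kruskal-Wallis test compares locations rather than means, the group medians are the natural accompanying summary; a quick sketch:
# median alcohol content per group
aggregate(mydata$Alcohol,
          by = list(Group = mydata$Group),
          FUN = median)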
After removing six outliers in two steps, I clustered 172 wines into two segments based on six standardized variables. Both groups contain an equal number of units (86).
The wines in group 1 are above average in ash alcalinity and nonflavanoid phenols, which may suggest that they come from regions where the soil is rich in alkaline minerals, such as potassium, calcium, magnesium, and sodium. Additionally, high nonflavanoid phenol values suggest bitterness. On average, these wines contain less alcohol than the wines in group 2.
The wines in group 2 are above average in flavanoids, magnesium, proanthocyanins, and total phenols, attributes that are apparently associated with higher quality. On average, they contain more alcohol than the wines in group 1.