The dataset contains the results of a chemical analysis of wines from three different cultivars grown in the same region of Italy. The analysis measured the quantities of 13 constituents found in each wine: alcohol, malic acid, ash, alcalinity of ash, magnesium, total phenols, flavanoids, nonflavanoid phenols, proanthocyanins, color intensity, hue, OD280/OD315 of diluted wines, and proline. The original analysis may have produced more variables, but only these 13 are available in this dataset.
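The analysis presumably begins by attaching the packages whose startup messages appear below and reading the data; a sketch of that setup follows (the file name wine.csv is a placeholder, as the actual source is not shown):

# Presumed setup (sketch); "wine.csv" is a placeholder file name
library(tidyverse)   # dplyr, ggplot2, tidyr, readr, ...
library(corrplot)    # correlation matrix plot
library(gridExtra)   # grid.arrange() for multi-panel figures
library(GGally)      # ggplot2 extensions
library(factoextra)  # fviz_* helpers for PCA and clustering
data <- read.csv("wine.csv", check.names = FALSE)  # keep names like 'Malic acid'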
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.0 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.1 ✔ tibble 3.1.8
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## corrplot 0.92 loaded
##
##
## Attaching package: 'gridExtra'
##
##
## The following object is masked from 'package:dplyr':
##
## combine
##
##
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
##
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
summary(data)
## Alcohol Malic acid Ash Alcalinity_of_ash
## Min. :11.03 Min. :0.740 Min. :1.360 Min. :10.60
## 1st Qu.:12.36 1st Qu.:1.603 1st Qu.:2.210 1st Qu.:17.20
## Median :13.05 Median :1.865 Median :2.360 Median :19.50
## Mean :13.00 Mean :2.336 Mean :2.367 Mean :19.49
## 3rd Qu.:13.68 3rd Qu.:3.083 3rd Qu.:2.558 3rd Qu.:21.50
## Max. :14.83 Max. :5.800 Max. :3.230 Max. :30.00
## Magnesium Total_phenols Flavanoids Nonflavanoid phenols
## Min. : 70.00 Min. :0.980 Min. :0.340 Min. :0.1300
## 1st Qu.: 88.00 1st Qu.:1.742 1st Qu.:1.205 1st Qu.:0.2700
## Median : 98.00 Median :2.355 Median :2.135 Median :0.3400
## Mean : 99.74 Mean :2.295 Mean :2.029 Mean :0.3619
## 3rd Qu.:107.00 3rd Qu.:2.800 3rd Qu.:2.875 3rd Qu.:0.4375
## Max. :162.00 Max. :3.880 Max. :5.080 Max. :0.6600
## Proanthocyanins Color intensity Hue OD280/OD315_of_diluted wines
## Min. :0.410 Min. : 1.280 Min. :0.4800 Min. :1.270
## 1st Qu.:1.250 1st Qu.: 3.220 1st Qu.:0.7825 1st Qu.:1.938
## Median :1.555 Median : 4.690 Median :0.9650 Median :2.780
## Mean :1.591 Mean : 5.058 Mean :0.9574 Mean :2.612
## 3rd Qu.:1.950 3rd Qu.: 6.200 3rd Qu.:1.1200 3rd Qu.:3.170
## Max. :3.580 Max. :13.000 Max. :1.7100 Max. :4.000
## Proline
## Min. : 278.0
## 1st Qu.: 500.5
## Median : 673.5
## Mean : 746.9
## 3rd Qu.: 985.0
## Max. :1680.0
The first variable, alcohol, ranges from 11.03 to 14.83 with a mean of 13.00. The second variable, malic acid, ranges from 0.74 to 5.80 with a mean of 2.336. The third variable, ash, ranges from 1.36 to 3.23 with a mean of 2.367. The fourth variable, alcalinity of ash, ranges from 10.60 to 30.00 with a mean of 19.49.
The fifth variable, magnesium, ranges from 70.00 to 162.00 with a mean of 99.74. The sixth variable, total phenols, ranges from 0.98 to 3.88 with a mean of 2.295. The seventh variable, flavanoids, ranges from 0.34 to 5.08 with a mean of 2.029. The eighth variable, nonflavanoid phenols, ranges from 0.13 to 0.66 with a mean of 0.3619.
The ninth variable, proanthocyanins, ranges from 0.41 to 3.58 with a mean of 1.591. The tenth variable, color intensity, ranges from 1.28 to 13.00 with a mean of 5.058. The eleventh variable, hue, ranges from 0.48 to 1.71 with a mean of 0.9574. The twelfth variable, OD280/OD315 of diluted wines, ranges from 1.27 to 4.00 with a mean of 2.612.
Finally, the thirteenth variable, proline, ranges from 278.0 to 1680.0 with a mean of 746.9. Notably, the variables are measured on very different scales: proline values run into the hundreds, while hue, for instance, stays below 2. These scale differences will need to be addressed before clustering.
str(data)
## 'data.frame': 178 obs. of 13 variables:
## $ Alcohol : num 14.2 13.2 13.2 14.4 13.2 ...
## $ Malic acid : num 1.71 1.78 2.36 1.95 2.59 1.76 1.87 2.15 1.64 1.35 ...
## $ Ash : num 2.43 2.14 2.67 2.5 2.87 2.45 2.45 2.61 2.17 2.27 ...
## $ Alcalinity_of_ash : num 15.6 11.2 18.6 16.8 21 15.2 14.6 17.6 14 16 ...
## $ Magnesium : int 127 100 101 113 118 112 96 121 97 98 ...
## $ Total_phenols : num 2.8 2.65 2.8 3.85 2.8 3.27 2.5 2.6 2.8 2.98 ...
## $ Flavanoids : num 3.06 2.76 3.24 3.49 2.69 3.39 2.52 2.51 2.98 3.15 ...
## $ Nonflavanoid phenols : num 0.28 0.26 0.3 0.24 0.39 0.34 0.3 0.31 0.29 0.22 ...
## $ Proanthocyanins : num 2.29 1.28 2.81 2.18 1.82 1.97 1.98 1.25 1.98 1.85 ...
## $ Color intensity : num 5.64 4.38 5.68 7.8 4.32 6.75 5.25 5.05 5.2 7.22 ...
## $ Hue : num 1.04 1.05 1.03 0.86 1.04 1.05 1.02 1.06 1.08 1.01 ...
## $ OD280/OD315_of_diluted wines: num 3.92 3.4 3.17 3.45 2.93 2.85 3.58 3.58 2.85 3.55 ...
## $ Proline : int 1065 1050 1185 1480 735 1450 1290 1295 1045 1045 ...
The data consist of numeric and integer variables, making them suitable for cluster analysis. However, it is important to carefully select only the relevant variables for this type of analysis.
data %>%
  gather(attributes, value, 1:13) %>%
  ggplot(aes(x = value)) +
  geom_histogram(fill = 'lightblue2', color = 'black') +
  facet_wrap(~attributes, scales = 'free_x') +
  labs(x = "Values", y = "Frequency") +
  theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
corrplot(round(cor(data), 1), type = 'upper', method = 'number', tl.cex = 0.9, number.cex = 0.5)
The variables Total_phenols and Flavanoids exhibit a high degree of linear correlation in the plot above, suggesting that their relationship can be summarized by fitting a linear equation to the data.
# Relationship between Phenols and Flavanoids
ggplot(data, aes(x = Total_phenols, y = Flavanoids)) +
  geom_point() +
  geom_smooth(method = 'lm', se = FALSE) +
  ggtitle("Total Phenols vs Flavanoids") +
  theme_bw()
## `geom_smooth()` using formula = 'y ~ x'
data_normalize <- as.data.frame(scale(data))
head(data_normalize)
## Alcohol Malic acid Ash Alcalinity_of_ash Magnesium Total_phenols
## 1 1.5143408 -0.56066822 0.2313998 -1.1663032 1.90852151 0.8067217
## 2 0.2455968 -0.49800856 -0.8256672 -2.4838405 0.01809398 0.5670481
## 3 0.1963252 0.02117152 1.1062139 -0.2679823 0.08810981 0.8067217
## 4 1.6867914 -0.34583508 0.4865539 -0.8069748 0.92829983 2.4844372
## 5 0.2948684 0.22705328 1.8352256 0.4506745 1.27837900 0.8067217
## 6 1.4773871 -0.51591132 0.3043010 -1.2860793 0.85828399 1.5576991
## Flavanoids Nonflavanoid phenols Proanthocyanins Color intensity Hue
## 1 1.0319081 -0.6577078 1.2214385 0.2510088 0.3611585
## 2 0.7315653 -0.8184106 -0.5431887 -0.2924962 0.4049085
## 3 1.2121137 -0.4970050 2.1299594 0.2682629 0.3174085
## 4 1.4623994 -0.9791134 1.0292513 1.1827317 -0.4263410
## 5 0.6614853 0.2261576 0.4002753 -0.3183774 0.3611585
## 6 1.3622851 -0.1755994 0.6623487 0.7298108 0.4049085
## OD280/OD315_of_diluted wines Proline
## 1 1.8427215 1.01015939
## 2 1.1103172 0.96252635
## 3 0.7863692 1.39122370
## 4 1.1807407 2.32800680
## 5 0.4483365 -0.03776747
## 6 0.3356589 2.23274072
To prepare for K-means clustering, we need to address the fact that the variables are on very different scales. The data are therefore normalized, which can be done manually using each variable's mean and standard deviation or, as above, with the scale() function.
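For completeness, here is a minimal sketch of the manual equivalent, assuming the default z-score normalization that scale() performs (data_manual is an illustrative name):

# Manual z-score normalization (sketch): subtract each column's mean,
# then divide by its standard deviation; this is what scale() does by default
means <- colMeans(data)
sds <- apply(data, 2, sd)
data_manual <- as.data.frame(sweep(sweep(data, 2, means, "-"), 2, sds, "/"))
all.equal(data_manual, data_normalize, check.attributes = FALSE)  # TRUE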
Dimension reduction is the process of reducing the number of variables or features in a dataset while preserving as much of the variability in the data as possible. This is typically done to simplify the analysis or modeling process, reduce noise and redundancy, and improve interpretability and visualization of the data.
There are several techniques available in R for dimension reduction, including principal component analysis (PCA), factor analysis, and multidimensional scaling (MDS). These techniques involve transforming the original data into a lower-dimensional space that retains as much of the original information as possible.
PCA is a commonly used technique that involves finding linear combinations of the original variables that explain the most variation in the data. The resulting principal components can be used as new variables for subsequent analysis, and can often explain the majority of the variability in the data with just a few components.
pca <- prcomp(data, center = TRUE, scale = TRUE)
eigen(cor(data))$values
## [1] 4.7058503 2.4969737 1.4460720 0.9189739 0.8532282 0.6416570 0.5510283
## [8] 0.3484974 0.2888799 0.2509025 0.2257886 0.1687702 0.1033779
fviz_eig(pca, choice = "eigenvalue", ncp = 13, barfill = "hotpink3", barcolor = "hotpink4", linecolor = "brown4", addlabels = TRUE, main = "Eigenvalues")
Based on Kaiser's rule, three components should be retained, as only those have eigenvalues above 1.
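As a quick check, Kaiser's rule can be applied directly to the eigenvalues printed above:

# Count the components whose eigenvalues exceed 1 (Kaiser's rule)
ev <- eigen(cor(data))$values
sum(ev > 1)  # 3, matching the eigenvalues listed above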
library("pdp")
##
## Attaching package: 'pdp'
## The following object is masked from 'package:purrr':
##
## partial
pca_var <- get_pca_var(pca)
fviz_contrib(pca, "var", axes = 1:3, fill = "tomato3", color = "tomato4")
K-Means clustering is an unsupervised machine learning algorithm used to group similar data points into clusters. The algorithm first selects k points at random from the dataset to act as the initial cluster centroids. Each point is then assigned to its nearest centroid, with proximity measured by a distance metric such as Euclidean distance. Once all points have been assigned, the centroids are recalculated as the mean of the points in each cluster. This process is repeated until the centroids no longer change significantly or a maximum number of iterations is reached.
The goal of K-Means clustering is to minimize the sum of squared distances between the data points and their assigned centroids. Note that the algorithm does not choose the number of clusters itself; k must be specified by the user. K-Means clustering is commonly used in fields such as image segmentation, customer segmentation, and anomaly detection.
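To make the iteration described above concrete, here is a deliberately minimal sketch of the classic Lloyd-style loop. It is for illustration only: kmeans() in R uses the Hartigan-Wong algorithm by default, and the function and variable names below are made up.

# Illustrative sketch of a Lloyd-style K-means loop (not what kmeans() uses)
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # 1. pick k random data points as the initial centroids
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  cluster <- integer(nrow(x))
  for (i in seq_len(max_iter)) {
    # 2. assign every point to its nearest centroid (Euclidean distance)
    d <- as.matrix(dist(rbind(centers, x)))[-(1:k), 1:k]
    cluster <- max.col(-d)
    # 3. recompute each centroid as the mean of its assigned points
    #    (assumes no cluster becomes empty; kmeans() handles that case)
    new_centers <- apply(x, 2, function(col) tapply(col, cluster, mean))
    # 4. stop once the centroids no longer move
    if (max(abs(new_centers - centers)) < 1e-8) break
    centers <- new_centers
  }
  list(cluster = cluster, centers = centers)
}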
set.seed(123)
data_K2 <- kmeans(data_normalize, centers = 2, nstart = 25)
print(data_K2)
## K-means clustering with 2 clusters of sizes 91, 87
##
## Cluster means:
## Alcohol Malic acid Ash Alcalinity_of_ash Magnesium Total_phenols
## 1 -0.3106038 0.3374209 -0.04979045 0.4684435 -0.3065948 -0.7482598
## 2 0.3248845 -0.3529345 0.05207966 -0.4899811 0.3206911 0.7826625
## Flavanoids Nonflavanoid phenols Proanthocyanins Color intensity Hue
## 1 -0.7873111 0.5661058 -0.6098110 0.0979495 -0.5385525
## 2 0.8235093 -0.5921337 0.6378483 -0.1024529 0.5633135
## OD280/OD315_of_diluted wines Proline
## 1 -0.6832374 -0.5785857
## 2 0.7146506 0.6051873
##
## Clustering vector:
## [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [38] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 2 1 2 2 1 1 2 1 2 1 2
## [75] 2 1 2 1 2 2 2 2 1 1 2 2 1 1 1 1 1 1 1 2 2 2 1 2 2 2 2 1 1 1 2 1 1 1 1 2 2
## [112] 1 1 1 1 2 1 1 1 1 2 2 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [149] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##
## Within cluster sum of squares by cluster:
## [1] 884.3435 765.0965
## (between_SS / total_SS = 28.3 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
fviz_cluster(data_K2, data = data_normalize)
The first cluster (labeled 1) has negative means for Alcohol, Ash, Magnesium, Total_phenols, Flavanoids, Proanthocyanins, Hue, OD280/OD315_of_diluted wines, and Proline, and positive means for Malic acid, Alcalinity_of_ash, Nonflavanoid phenols, and Color intensity. The second cluster (labeled 2) shows the opposite pattern. Because the data are standardized, this means cluster 2 wines are above average in alcohol, phenolic content (total phenols, flavanoids, proanthocyanins), hue, OD280/OD315, and proline, while cluster 1 wines are above average in malic acid, alcalinity of ash, nonflavanoid phenols, and color intensity.
The clustering vector shows the cluster assignment of each observation. For example, the first 59 observations are assigned to cluster 2, while observation 60 is the first assigned to cluster 1.
The within-cluster sum of squares (WSS) is a measure of the variation within each cluster; the WSS values for clusters 1 and 2 are 884.34 and 765.10, respectively. The between-cluster sum of squares (BSS) measures the variation between the cluster means. The ratio of BSS to the total sum of squares (TSS) is 28.3%, indicating that a two-cluster solution captures only a modest share of the variation in the data.
The result also exposes other components, including the total sum of squares (totss), the between-cluster sum of squares (betweenss), the size of each cluster, the number of iterations, and ifault, an indicator of possible algorithm problems (0 means none).
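The printed summary quantities can be read back off the returned object, for example:

# Recovering the printed summary quantities from the result object
data_K2$betweenss / data_K2$totss  # 0.283, the 28.3% reported above
data_K2$withinss                   # per-cluster WSS: 884.34, 765.10
data_K2$size                       # cluster sizes: 91, 87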
data_K3 <- kmeans(data_normalize, centers = 3, nstart = 25)
data_K4 <- kmeans(data_normalize, centers = 4, nstart = 25)
data_K5 <- kmeans(data_normalize, centers = 5, nstart = 25)
It is important to determine the number of clusters (k) before starting the K-means algorithm. To find the optimal k, it is recommended to run the algorithm with different values of k and compare the results.
In this case, we additionally ran the K-means algorithm for 3, 4, and 5 clusters; the solutions for k = 2 through 5 are displayed side by side in the figure below. Comparing them helps determine the most appropriate number of clusters for the given dataset.
p1 <- fviz_cluster(data_K2, geom = "point", data = data_normalize) + ggtitle(" K = 2")
p2 <- fviz_cluster(data_K3, geom = "point", data = data_normalize) + ggtitle(" K = 3")
p3 <- fviz_cluster(data_K4, geom = "point", data = data_normalize) + ggtitle(" K = 4")
p4 <- fviz_cluster(data_K5, geom = "point", data = data_normalize) + ggtitle(" K = 5")
grid.arrange(p1, p2, p3, p4, nrow = 2)
The WSS is the sum of the squared distance between each data point and its assigned cluster centroid. As k increases, the WSS will tend to decrease, since more clusters will allow for better fitting to the data. However, at some point, adding more clusters will only capture noise or minor variations in the data, leading to diminishing returns in terms of WSS reduction.
The elbow method involves plotting the WSS for different values of k and identifying the point where the decrease in WSS starts to level off, resembling an elbow shape. This point indicates a trade-off between capturing more variance in the data (lower WSS) and avoiding overfitting or capturing noise (higher k).
Therefore, the optimal number of clusters can be chosen based on the “elbow point” in the WSS plot, which represents a good balance between model complexity and data fit.
# Determining Optimal clusters (k) Using Elbow method
fviz_nbclust(x = data_normalize, FUNcluster = kmeans, method = 'wss')
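Under the hood, the elbow plot is just the total within-cluster sum of squares evaluated over a range of k; a hand-rolled sketch of the same computation:

# Hand-rolled elbow curve (sketch): tot.withinss for k = 1..10
set.seed(123)
wss <- sapply(1:10, function(k) kmeans(data_normalize, centers = k, nstart = 25)$tot.withinss)
tibble(k = 1:10, wss = wss) %>%
  ggplot(aes(x = k, y = wss)) +
  geom_line() +
  geom_point() +
  labs(x = "Number of clusters k", y = "Total within-cluster SS") +
  theme_bw()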
The silhouette score is a metric used to evaluate the quality of clustering results. It measures how similar an object is to its own cluster compared to other clusters. The score ranges from -1 to 1: a score of 1 indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters, a score of -1 indicates the opposite, and a score close to zero indicates that the object lies on the boundary between two clusters.
The average silhouette score is calculated by taking the mean silhouette score over all objects in the dataset. A high average silhouette score indicates that the clustering is appropriate, while a low score suggests that the clustering may not be optimal. Silhouette score is a useful metric for choosing the number of clusters for K-means clustering, as it can be used to compare the quality of clustering results across different values of K. A higher silhouette score implies better clustering results.
# Determining Optimal clusters (k) Using Average Silhouette Method
fviz_nbclust(x = data_normalize, FUNcluster = kmeans, method = 'silhouette')
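The average silhouette width underlying this plot can also be computed directly with the cluster package; a short sketch for the k = 3 solution fitted earlier:

# Average silhouette width for k = 3, via cluster::silhouette()
library(cluster)
sil <- silhouette(data_K3$cluster, dist(data_normalize))
mean(sil[, "sil_width"])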
# Compute k-means clustering with k = 3
set.seed(123)
final <- kmeans(data_normalize, centers = 3, nstart = 25)
print(final)
## K-means clustering with 3 clusters of sizes 51, 62, 65
##
## Cluster means:
## Alcohol Malic acid Ash Alcalinity_of_ash Magnesium Total_phenols
## 1 0.1644436 0.8690954 0.1863726 0.5228924 -0.07526047 -0.97657548
## 2 0.8328826 -0.3029551 0.3636801 -0.6084749 0.57596208 0.88274724
## 3 -0.9234669 -0.3929331 -0.4931257 0.1701220 -0.49032869 -0.07576891
## Flavanoids Nonflavanoid phenols Proanthocyanins Color intensity Hue
## 1 -1.21182921 0.72402116 -0.77751312 0.9388902 -1.1615122
## 2 0.97506900 -0.56050853 0.57865427 0.1705823 0.4726504
## 3 0.02075402 -0.03343924 0.05810161 -0.8993770 0.4605046
## OD280/OD315_of_diluted wines Proline
## 1 -1.2887761 -0.4059428
## 2 0.7770551 1.1220202
## 3 0.2700025 -0.7517257
##
## Clustering vector:
## [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [38] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 1 3 3 3 3 3 3 3 3 3 3 3 2
## [75] 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [112] 3 3 3 3 3 3 3 1 3 3 2 3 3 3 3 3 3 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [149] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##
## Within cluster sum of squares by cluster:
## [1] 326.3537 385.6983 558.6971
## (between_SS / total_SS = 44.8 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
fviz_cluster(final, data = data_normalize)
The cluster means table shows the mean value of each variable for each of the three clusters: each row corresponds to a cluster and each column to a variable. For example, the mean (standardized) Alcohol value in cluster 1 is 0.16, while in cluster 2 it is 0.83.
The clustering vector shows the cluster to which each observation in the dataset has been assigned, with the numbers 1, 2, and 3 representing the three clusters. For example, the first 59 observations are assigned to cluster 2, observations 60 and 61 to cluster 3, and observation 62 to cluster 1.
The within-cluster sum of squares by cluster shows the sum of squared distances between each observation and its cluster center for each of the three clusters. The first cluster has a within-cluster sum of squares of 326.35, the second cluster has a within-cluster sum of squares of 385.70, and the third cluster has a within-cluster sum of squares of 558.70.
The between-cluster sum of squares is available directly as the betweenss component (or equivalently as the total sum of squares minus the total within-cluster sum of squares). Here between_SS / total_SS = 44.8%, so the three clusters account for about 44.8% of the total variation in the data, with the remaining 55.2% left within clusters.
data_normalize %>%
  mutate(Cluster = final$cluster) %>%
  group_by(Cluster) %>%
  summarize_all('median')
## # A tibble: 3 × 14
## Cluster Alcohol `Malic acid` Ash Alcali…¹ Magne…² Total…³ Flavan…⁴ Nonfl…⁵
## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 0.135 0.836 0.0491 0.451 -0.192 -1.03 -1.33e+0 0.869
## 2 2 0.905 -0.511 0.286 -0.747 0.403 0.847 9.47e-1 -0.577
## 3 3 -0.925 -0.650 -0.461 0.151 -0.822 -0.152 7.31e-4 -0.0952
## # … with 5 more variables: Proanthocyanins <dbl>, `Color intensity` <dbl>,
## # Hue <dbl>, `OD280/OD315_of_diluted wines` <dbl>, Proline <dbl>, and
## # abbreviated variable names ¹Alcalinity_of_ash, ²Magnesium, ³Total_phenols,
## # ⁴Flavanoids, ⁵`Nonflavanoid phenols`
We can conclude that the dataset can be grouped into 3 clusters based on the elbow method and silhouette score analysis, with cluster sizes of 51, 62, and 65, respectively.
Further analysis of the cluster means reveals clear differences between the clusters across the various wine attributes, such as alcohol content, malic acid, ash, and magnesium. The between-cluster sum of squares accounts for 44.8% of the total sum of squares, indicating that the clusters capture a substantial share of the variation in the data.
Therefore, we can conclude that k-means clustering is a useful technique for analyzing and grouping data with multiple variables, and it can provide valuable insights into the underlying structure of the data.