The dataset contains the results of a chemical analysis of wines from three different cultivars grown in the same region of Italy. The analysis measured the quantities of 13 constituents found in each wine: alcohol, malic acid, ash, alcalinity of ash, magnesium, total phenols, flavanoids, nonflavanoid phenols, proanthocyanins, color intensity, hue, OD280/OD315 of diluted wines, and proline. The original analysis may have produced more variables, but only these 13 are available in this dataset.
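The analysis presumably begins by attaching the packages whose startup messages appear below and reading the data; a sketch of that setup follows (the file name wine.csv is a placeholder, as the actual source is not shown):

# Presumed setup (sketch); "wine.csv" is a placeholder file name
library(tidyverse)   # dplyr, ggplot2, tidyr, readr, ...
library(corrplot)    # correlation matrix plot
library(gridExtra)   # grid.arrange() for multi-panel figures
library(GGally)      # ggplot2 extensions
library(factoextra)  # fviz_* helpers for PCA and clustering
data <- read.csv("wine.csv", check.names = FALSE)  # keep names like 'Malic acid'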
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.0 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.1 ✔ tibble 3.1.8
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## corrplot 0.92 loaded
##
##
## Attaching package: 'gridExtra'
##
##
## The following object is masked from 'package:dplyr':
##
## combine
##
##
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
##
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
summary(data)
## Alcohol Malic acid Ash Alcalinity_of_ash
## Min. :11.03 Min. :0.740 Min. :1.360 Min. :10.60
## 1st Qu.:12.36 1st Qu.:1.603 1st Qu.:2.210 1st Qu.:17.20
## Median :13.05 Median :1.865 Median :2.360 Median :19.50
## Mean :13.00 Mean :2.336 Mean :2.367 Mean :19.49
## 3rd Qu.:13.68 3rd Qu.:3.083 3rd Qu.:2.558 3rd Qu.:21.50
## Max. :14.83 Max. :5.800 Max. :3.230 Max. :30.00
## Magnesium Total_phenols Flavanoids Nonflavanoid phenols
## Min. : 70.00 Min. :0.980 Min. :0.340 Min. :0.1300
## 1st Qu.: 88.00 1st Qu.:1.742 1st Qu.:1.205 1st Qu.:0.2700
## Median : 98.00 Median :2.355 Median :2.135 Median :0.3400
## Mean : 99.74 Mean :2.295 Mean :2.029 Mean :0.3619
## 3rd Qu.:107.00 3rd Qu.:2.800 3rd Qu.:2.875 3rd Qu.:0.4375
## Max. :162.00 Max. :3.880 Max. :5.080 Max. :0.6600
## Proanthocyanins Color intensity Hue OD280/OD315_of_diluted wines
## Min. :0.410 Min. : 1.280 Min. :0.4800 Min. :1.270
## 1st Qu.:1.250 1st Qu.: 3.220 1st Qu.:0.7825 1st Qu.:1.938
## Median :1.555 Median : 4.690 Median :0.9650 Median :2.780
## Mean :1.591 Mean : 5.058 Mean :0.9574 Mean :2.612
## 3rd Qu.:1.950 3rd Qu.: 6.200 3rd Qu.:1.1200 3rd Qu.:3.170
## Max. :3.580 Max. :13.000 Max. :1.7100 Max. :4.000
## Proline
## Min. : 278.0
## 1st Qu.: 500.5
## Median : 673.5
## Mean : 746.9
## 3rd Qu.: 985.0
## Max. :1680.0
The first variable, alcohol, ranges from 11.03 to 14.83 with a mean of 13.00. The second variable, malic acid, ranges from 0.74 to 5.80 with a mean of 2.336. The third variable, ash, ranges from 1.36 to 3.23 with a mean of 2.367. The fourth variable, alcalinity of ash, ranges from 10.60 to 30.00 with a mean of 19.49.
The fifth variable, magnesium, ranges from 70.00 to 162.00 with a mean of 99.74. The sixth variable, total phenols, ranges from 0.98 to 3.88 with a mean of 2.295. The seventh variable, flavanoids, ranges from 0.34 to 5.08 with a mean of 2.029. The eighth variable, nonflavanoid phenols, ranges from 0.13 to 0.66 with a mean of 0.3619.
The ninth variable, proanthocyanins, ranges from 0.41 to 3.58 with a mean of 1.591. The tenth variable, color intensity, ranges from 1.28 to 13.00 with a mean of 5.058. The eleventh variable, hue, ranges from 0.48 to 1.71 with a mean of 0.9574. The twelfth variable, OD280/OD315 of diluted wines, ranges from 1.27 to 4.00 with a mean of 2.612.
Finally, the thirteenth variable, proline, ranges from 278.0 to 1680.0 with a mean of 746.9. Notably, the variables are measured on very different scales: proline values run into the hundreds, while hue, for instance, stays below 2. These scale differences will need to be addressed before clustering.
str(data)
## 'data.frame': 178 obs. of 13 variables:
## $ Alcohol : num 14.2 13.2 13.2 14.4 13.2 ...
## $ Malic acid : num 1.71 1.78 2.36 1.95 2.59 1.76 1.87 2.15 1.64 1.35 ...
## $ Ash : num 2.43 2.14 2.67 2.5 2.87 2.45 2.45 2.61 2.17 2.27 ...
## $ Alcalinity_of_ash : num 15.6 11.2 18.6 16.8 21 15.2 14.6 17.6 14 16 ...
## $ Magnesium : int 127 100 101 113 118 112 96 121 97 98 ...
## $ Total_phenols : num 2.8 2.65 2.8 3.85 2.8 3.27 2.5 2.6 2.8 2.98 ...
## $ Flavanoids : num 3.06 2.76 3.24 3.49 2.69 3.39 2.52 2.51 2.98 3.15 ...
## $ Nonflavanoid phenols : num 0.28 0.26 0.3 0.24 0.39 0.34 0.3 0.31 0.29 0.22 ...
## $ Proanthocyanins : num 2.29 1.28 2.81 2.18 1.82 1.97 1.98 1.25 1.98 1.85 ...
## $ Color intensity : num 5.64 4.38 5.68 7.8 4.32 6.75 5.25 5.05 5.2 7.22 ...
## $ Hue : num 1.04 1.05 1.03 0.86 1.04 1.05 1.02 1.06 1.08 1.01 ...
## $ OD280/OD315_of_diluted wines: num 3.92 3.4 3.17 3.45 2.93 2.85 3.58 3.58 2.85 3.55 ...
## $ Proline : int 1065 1050 1185 1480 735 1450 1290 1295 1045 1045 ...
The data consist of numeric and integer variables, making them suitable for cluster analysis. However, it is important to carefully select only the relevant variables for this type of analysis.
data %>%
  gather(attributes, value, 1:13) %>%
  ggplot(aes(x = value)) +
  geom_histogram(fill = 'lightblue2', color = 'black') +
  facet_wrap(~attributes, scales = 'free_x') +
  labs(x = "Values", y = "Frequency") +
  theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
corrplot(round(cor(data), 1), type = 'upper', method = 'number', tl.cex = 0.9, number.cex = 0.5)
The variables Total_phenols and Flavanoids exhibit a high degree of linear correlation in the plot above, suggesting that their relationship can be summarized by fitting a linear equation to the data.
# Relationship between Phenols and Flavanoids
ggplot(data, aes(x = Total_phenols, y = Flavanoids)) +
  geom_point() +
  geom_smooth(method = 'lm', se = FALSE) +
  ggtitle("Total Phenols vs Flavanoids") +
  theme_bw()
## `geom_smooth()` using formula = 'y ~ x'
data_normalize <- as.data.frame(scale(data))
head(data_normalize)
## Alcohol Malic acid Ash Alcalinity_of_ash Magnesium Total_phenols
## 1 1.5143408 -0.56066822 0.2313998 -1.1663032 1.90852151 0.8067217
## 2 0.2455968 -0.49800856 -0.8256672 -2.4838405 0.01809398 0.5670481
## 3 0.1963252 0.02117152 1.1062139 -0.2679823 0.08810981 0.8067217
## 4 1.6867914 -0.34583508 0.4865539 -0.8069748 0.92829983 2.4844372
## 5 0.2948684 0.22705328 1.8352256 0.4506745 1.27837900 0.8067217
## 6 1.4773871 -0.51591132 0.3043010 -1.2860793 0.85828399 1.5576991
## Flavanoids Nonflavanoid phenols Proanthocyanins Color intensity Hue
## 1 1.0319081 -0.6577078 1.2214385 0.2510088 0.3611585
## 2 0.7315653 -0.8184106 -0.5431887 -0.2924962 0.4049085
## 3 1.2121137 -0.4970050 2.1299594 0.2682629 0.3174085
## 4 1.4623994 -0.9791134 1.0292513 1.1827317 -0.4263410
## 5 0.6614853 0.2261576 0.4002753 -0.3183774 0.3611585
## 6 1.3622851 -0.1755994 0.6623487 0.7298108 0.4049085
## OD280/OD315_of_diluted wines Proline
## 1 1.8427215 1.01015939
## 2 1.1103172 0.96252635
## 3 0.7863692 1.39122370
## 4 1.1807407 2.32800680
## 5 0.4483365 -0.03776747
## 6 0.3356589 2.23274072
To prepare for K-means clustering, we need to address the fact that the variables are on very different scales. The data are therefore normalized, which can be done manually using each variable's mean and standard deviation or, as above, with the scale() function.
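For completeness, here is a minimal sketch of the manual equivalent, assuming the default z-score normalization that scale() performs (data_manual is an illustrative name):

# Manual z-score normalization (sketch): subtract each column's mean,
# then divide by its standard deviation; this is what scale() does by default
means <- colMeans(data)
sds <- apply(data, 2, sd)
data_manual <- as.data.frame(sweep(sweep(data, 2, means, "-"), 2, sds, "/"))
all.equal(data_manual, data_normalize, check.attributes = FALSE)  # TRUE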
Dimension reduction is the process of reducing the number of variables or features in a dataset while preserving as much of the variability in the data as possible. This is typically done to simplify the analysis or modeling process, reduce noise and redundancy, and improve interpretability and visualization of the data.
There are several techniques available in R for dimension reduction, including principal component analysis (PCA), factor analysis, and multidimensional scaling (MDS). These techniques involve transforming the original data into a lower-dimensional space that retains as much of the original information as possible.
PCA is a commonly used technique that involves finding linear combinations of the original variables that explain the most variation in the data. The resulting principal components can be used as new variables for subsequent analysis, and can often explain the majority of the variability in the data with just a few components.
pca <- prcomp(data, center = TRUE, scale = TRUE)
eigen(cor(data))$values
## [1] 4.7058503 2.4969737 1.4460720 0.9189739 0.8532282 0.6416570 0.5510283
## [8] 0.3484974 0.2888799 0.2509025 0.2257886 0.1687702 0.1033779
fviz_eig(pca, choice = "eigenvalue", ncp = 13, barfill = "hotpink3", barcolor = "hotpink4", linecolor = "brown4", addlabels = TRUE, main = "Eigenvalues")
Based on Kaiser's rule, three components should be retained, as only those have eigenvalues above 1.
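As a quick check, Kaiser's rule can be applied directly to the eigenvalues printed above:

# Count the components whose eigenvalues exceed 1 (Kaiser's rule)
ev <- eigen(cor(data))$values
sum(ev > 1)  # 3, matching the eigenvalues listed above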
library("pdp")
##
## Attaching package: 'pdp'
## The following object is masked from 'package:purrr':
##
## partial
pca_var <- get_pca_var(pca)
fviz_contrib(pca, "var", axes = 1:3, fill = "tomato3", color = "tomato4")
K-Means clustering is an unsupervised machine learning algorithm used to group similar data points into clusters. The algorithm first selects k points at random from the dataset to act as the initial cluster centroids. Each point is then assigned to its nearest centroid, with proximity measured by a distance metric such as Euclidean distance. Once all points have been assigned, the centroids are recalculated as the mean of the points in each cluster. This process is repeated until the centroids no longer change significantly or a maximum number of iterations is reached.
The goal of K-Means clustering is to minimize the sum of squared distances between the data points and their assigned centroids. Note that the algorithm does not choose the number of clusters itself; k must be specified by the user. K-Means clustering is commonly used in fields such as image segmentation, customer segmentation, and anomaly detection.
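To make the iteration described above concrete, here is a deliberately minimal sketch of the classic Lloyd-style loop. It is for illustration only: kmeans() in R uses the Hartigan-Wong algorithm by default, and the function and variable names below are made up.

# Illustrative sketch of a Lloyd-style K-means loop (not what kmeans() uses)
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # 1. pick k random data points as the initial centroids
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  cluster <- integer(nrow(x))
  for (i in seq_len(max_iter)) {
    # 2. assign every point to its nearest centroid (Euclidean distance)
    d <- as.matrix(dist(rbind(centers, x)))[-(1:k), 1:k]
    cluster <- max.col(-d)
    # 3. recompute each centroid as the mean of its assigned points
    #    (assumes no cluster becomes empty; kmeans() handles that case)
    new_centers <- apply(x, 2, function(col) tapply(col, cluster, mean))
    # 4. stop once the centroids no longer move
    if (max(abs(new_centers - centers)) < 1e-8) break
    centers <- new_centers
  }
  list(cluster = cluster, centers = centers)
}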
set.seed(123)
data_K2 <- kmeans(data_normalize, centers = 2, nstart = 25)
print(data_K2)
## K-means clustering with 2 clusters of sizes 91, 87
##
## Cluster means:
## Alcohol Malic acid Ash Alcalinity_of_ash Magnesium Total_phenols
## 1 -0.3106038 0.3374209 -0.04979045 0.4684435 -0.3065948 -0.7482598
## 2 0.3248845 -0.3529345 0.05207966 -0.4899811 0.3206911 0.7826625
## Flavanoids Nonflavanoid phenols Proanthocyanins Color intensity Hue
## 1 -0.7873111 0.5661058 -0.6098110 0.0979495 -0.5385525
## 2 0.8235093 -0.5921337 0.6378483 -0.1024529 0.5633135
## OD280/OD315_of_diluted wines Proline
## 1 -0.6832374 -0.5785857
## 2 0.7146506 0.6051873
##
## Clustering vector:
## [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [38] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 2 1 2 2 1 1 2 1 2 1 2
## [75] 2 1 2 1 2 2 2 2 1 1 2 2 1 1 1 1 1 1 1 2 2 2 1 2 2 2 2 1 1 1 2 1 1 1 1 2 2
## [112] 1 1 1 1 2 1 1 1 1 2 2 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [149] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##
## Within cluster sum of squares by cluster:
## [1] 884.3435 765.0965
## (between_SS / total_SS = 28.3 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
fviz_cluster(data_K2, data = data_normalize)
The first cluster (labeled 1) has negative means for Alcohol, Ash, Magnesium, Total_phenols, Flavanoids, Proanthocyanins, Hue, OD280/OD315_of_diluted wines, and Proline, and positive means for Malic acid, Alcalinity_of_ash, Nonflavanoid phenols, and Color intensity. The second cluster (labeled 2) shows the opposite pattern. Because the data are standardized, this means cluster 2 wines are above average in alcohol, phenolic content (total phenols, flavanoids, proanthocyanins), hue, OD280/OD315, and proline, while cluster 1 wines are above average in malic acid, alcalinity of ash, nonflavanoid phenols, and color intensity.
The clustering vector shows the cluster assignment of each observation. For example, the first 59 observations are assigned to cluster 2, while observation 60 is the first assigned to cluster 1.
The within-cluster sum of squares (WSS) is a measure of the variation within each cluster; the WSS values for clusters 1 and 2 are 884.34 and 765.10, respectively. The between-cluster sum of squares (BSS) measures the variation between the cluster means. The ratio of BSS to the total sum of squares (TSS) is 28.3%, indicating that a two-cluster solution captures only a modest share of the variation in the data.
The result also exposes other components, including the total sum of squares (totss), the between-cluster sum of squares (betweenss), the size of each cluster, the number of iterations, and ifault, an indicator of possible algorithm problems (0 means none).
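The printed summary quantities can be read back off the returned object, for example:

# Recovering the printed summary quantities from the result object
data_K2$betweenss / data_K2$totss  # 0.283, the 28.3% reported above
data_K2$withinss                   # per-cluster WSS: 884.34, 765.10
data_K2$size                       # cluster sizes: 91, 87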
data_K3 <- kmeans(data_normalize, centers = 3, nstart = 25)
data_K4 <- kmeans(data_normalize, centers = 4, nstart = 25)
data_K5 <- kmeans(data_normalize, centers = 5, nstart = 25)
It is important to determine the number of clusters (k) before starting the K-means algorithm. To find the optimal k, it is recommended to run the algorithm with different values of k and compare the results.
In this case, we additionally ran the K-means algorithm for 3, 4, and 5 clusters; the solutions for k = 2 through 5 are displayed side by side in the figure below. Comparing them helps determine the most appropriate number of clusters for the given dataset.
p1 <- fviz_cluster(data_K2, geom = "point", data = data_normalize) + ggtitle(" K = 2")
p2 <- fviz_cluster(data_K3, geom = "point", data = data_normalize) + ggtitle(" K = 3")
p3 <- fviz_cluster(data_K4, geom = "point", data = data_normalize) + ggtitle(" K = 4")
p4 <- fviz_cluster(data_K5, geom = "point", data = data_normalize) + ggtitle(" K = 5")
grid.arrange(p1, p2, p3, p4, nrow = 2)
The WSS is the sum of the squared distance between each data point and its assigned cluster centroid. As k increases, the WSS will tend to decrease, since more clusters will allow for better fitting to the data. However, at some point, adding more clusters will only capture noise or minor variations in the data, leading to diminishing returns in terms of WSS reduction.
The elbow method involves plotting the WSS for different values of k and identifying the point where the decrease in WSS starts to level off, resembling an elbow shape. This point indicates a trade-off between capturing more variance in the data (lower WSS) and avoiding overfitting or capturing noise (higher k).
Therefore, the optimal number of clusters can be chosen based on the “elbow point” in the WSS plot, which represents a good balance between model complexity and data fit.
# Determining Optimal clusters (k) Using Elbow method
fviz_nbclust(x = data_normalize, FUNcluster = kmeans, method = 'wss')
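Under the hood, the elbow plot is just the total within-cluster sum of squares evaluated over a range of k; a hand-rolled sketch of the same computation:

# Hand-rolled elbow curve (sketch): tot.withinss for k = 1..10
set.seed(123)
wss <- sapply(1:10, function(k) kmeans(data_normalize, centers = k, nstart = 25)$tot.withinss)
tibble(k = 1:10, wss = wss) %>%
  ggplot(aes(x = k, y = wss)) +
  geom_line() +
  geom_point() +
  labs(x = "Number of clusters k", y = "Total within-cluster SS") +
  theme_bw()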
The silhouette score is a metric used to evaluate the quality of clustering results. It measures how similar an object is to its own cluster compared to other clusters. The score ranges from -1 to 1: a score of 1 indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters, a score of -1 indicates the opposite, and a score close to zero indicates that the object lies on the boundary between two clusters.
The average silhouette score is calculated by taking the mean silhouette score over all objects in the dataset. A high average silhouette score indicates that the clustering is appropriate, while a low score suggests that the clustering may not be optimal. Silhouette score is a useful metric for choosing the number of clusters for K-means clustering, as it can be used to compare the quality of clustering results across different values of K. A higher silhouette score implies better clustering results.
# Determining Optimal clusters (k) Using Average Silhouette Method
fviz_nbclust(x = data_normalize, FUNcluster = kmeans, method = 'silhouette')
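The average silhouette width underlying this plot can also be computed directly with the cluster package; a short sketch for the k = 3 solution fitted earlier:

# Average silhouette width for k = 3, via cluster::silhouette()
library(cluster)
sil <- silhouette(data_K3$cluster, dist(data_normalize))
mean(sil[, "sil_width"])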
# Compute k-means clustering with k = 3
set.seed(123)
final <- kmeans(data_normalize, centers = 3, nstart = 25)
print(final)
## K-means clustering with 3 clusters of sizes 51, 62, 65
##
## Cluster means:
## Alcohol Malic acid Ash Alcalinity_of_ash Magnesium Total_phenols
## 1 0.1644436 0.8690954 0.1863726 0.5228924 -0.07526047 -0.97657548
## 2 0.8328826 -0.3029551 0.3636801 -0.6084749 0.57596208 0.88274724
## 3 -0.9234669 -0.3929331 -0.4931257 0.1701220 -0.49032869 -0.07576891
## Flavanoids Nonflavanoid phenols Proanthocyanins Color intensity Hue
## 1 -1.21182921 0.72402116 -0.77751312 0.9388902 -1.1615122
## 2 0.97506900 -0.56050853 0.57865427 0.1705823 0.4726504
## 3 0.02075402 -0.03343924 0.05810161 -0.8993770 0.4605046
## OD280/OD315_of_diluted wines Proline
## 1 -1.2887761 -0.4059428
## 2 0.7770551 1.1220202
## 3 0.2700025 -0.7517257
##
## Clustering vector:
## [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [38] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 1 3 3 3 3 3 3 3 3 3 3 3 2
## [75] 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [112] 3 3 3 3 3 3 3 1 3 3 2 3 3 3 3 3 3 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [149] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##
## Within cluster sum of squares by cluster:
## [1] 326.3537 385.6983 558.6971
## (between_SS / total_SS = 44.8 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
fviz_cluster(final, data = data_normalize)
The cluster means table shows the mean value of each variable for each of the three clusters: each row corresponds to a cluster and each column to a variable. For example, the mean (standardized) Alcohol value in cluster 1 is 0.16, while in cluster 2 it is 0.83.
The clustering vector shows the cluster to which each observation in the dataset has been assigned, with the numbers 1, 2, and 3 representing the three clusters. For example, the first 59 observations are assigned to cluster 2, observations 60 and 61 to cluster 3, and observation 62 to cluster 1.
The within-cluster sum of squares by cluster shows the sum of squared distances between each observation and its cluster center for each of the three clusters. The first cluster has a within-cluster sum of squares of 326.35, the second cluster has a within-cluster sum of squares of 385.70, and the third cluster has a within-cluster sum of squares of 558.70.
The between-cluster sum of squares is available directly as the betweenss component (or equivalently as the total sum of squares minus the total within-cluster sum of squares). Here between_SS / total_SS = 44.8%, so the three clusters account for about 44.8% of the total variation in the data, with the remaining 55.2% left within clusters.
data_normalize %>%
  mutate(Cluster = final$cluster) %>%
  group_by(Cluster) %>%
  summarize_all('median')
## # A tibble: 3 × 14
## Cluster Alcohol `Malic acid` Ash Alcali…¹ Magne…² Total…³ Flavan…⁴ Nonfl…⁵
## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 0.135 0.836 0.0491 0.451 -0.192 -1.03 -1.33e+0 0.869
## 2 2 0.905 -0.511 0.286 -0.747 0.403 0.847 9.47e-1 -0.577
## 3 3 -0.925 -0.650 -0.461 0.151 -0.822 -0.152 7.31e-4 -0.0952
## # … with 5 more variables: Proanthocyanins <dbl>, `Color intensity` <dbl>,
## # Hue <dbl>, `OD280/OD315_of_diluted wines` <dbl>, Proline <dbl>, and
## # abbreviated variable names ¹Alcalinity_of_ash, ²Magnesium, ³Total_phenols,
## # ⁴Flavanoids, ⁵`Nonflavanoid phenols`
We can conclude that the dataset can be grouped into 3 clusters based on the elbow method and silhouette score analysis, with cluster sizes of 51, 62, and 65, respectively.
Further analysis of the cluster means reveals clear differences between the clusters across the various wine attributes, such as alcohol content, malic acid, ash, and magnesium. The between-cluster sum of squares accounts for 44.8% of the total sum of squares, indicating that the clusters capture a substantial share of the variation in the data.
Therefore, we can conclude that k-means clustering is a useful technique for analyzing and grouping data with multiple variables, and it can provide valuable insights into the underlying structure of the data.