The wine company can use unsupervised learning to cluster their wine data set. This will allow them to identify different types of wines and understand the characteristics of each cluster. The company can then use this information to improve their wine production and marketing strategies.
First, we load the required libraries.
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## corrplot 0.92 loaded
##
## Attaching package: 'gridExtra'
##
## The following object is masked from 'package:dplyr':
##
## combine
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
We wrangle the data by making sure that the data types are appropriate and that the data have no missing values or significant outliers.
## Rows: 178
## Columns: 15
## $ id <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", "10"…
## $ Wine <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ Alcohol <dbl> 14.23, 13.20, 13.16, 14.37, 13.24, 14.20, 14.39, …
## $ Malic.acid <dbl> 1.71, 1.78, 2.36, 1.95, 2.59, 1.76, 1.87, 2.15, 1…
## $ Ash <dbl> 2.43, 2.14, 2.67, 2.50, 2.87, 2.45, 2.45, 2.61, 2…
## $ Acl <dbl> 15.6, 11.2, 18.6, 16.8, 21.0, 15.2, 14.6, 17.6, 1…
## $ Mg <int> 127, 100, 101, 113, 118, 112, 96, 121, 97, 98, 10…
## $ Phenols <dbl> 2.80, 2.65, 2.80, 3.85, 2.80, 3.27, 2.50, 2.60, 2…
## $ Flavanoids <dbl> 3.06, 2.76, 3.24, 3.49, 2.69, 3.39, 2.52, 2.51, 2…
## $ Nonflavanoid.phenols <dbl> 0.28, 0.26, 0.30, 0.24, 0.39, 0.34, 0.30, 0.31, 0…
## $ Proanth <dbl> 2.29, 1.28, 2.81, 2.18, 1.82, 1.97, 1.98, 1.25, 1…
## $ Color.int <dbl> 5.64, 4.38, 5.68, 7.80, 4.32, 6.75, 5.25, 5.05, 5…
## $ Hue <dbl> 1.04, 1.05, 1.03, 0.86, 1.04, 1.05, 1.02, 1.06, 1…
## $ OD <dbl> 3.92, 3.40, 3.17, 3.45, 2.93, 2.85, 3.58, 3.58, 2…
## $ Proline <int> 1065, 1050, 1185, 1480, 735, 1450, 1290, 1295, 10…
## [1] 1 2 3
We don’t need the Wine column: k-means is an unsupervised machine learning algorithm and works with unlabeled data.
# index names
rownames(wine) <- wine$id
# drop column `Wine`
# way 1
wine1 <- wine %>%
select(-c(id, Wine))
wine1
Check the first rows of the data
| Alcohol | Malic.acid | Ash | Acl | Mg | Phenols | Flavanoids | Nonflavanoid.phenols | Proanth | Color.int | Hue | OD | Proline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 14.23 | 1.71 | 2.43 | 15.6 | 127 | 2.80 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065 |
| 13.20 | 1.78 | 2.14 | 11.2 | 100 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.40 | 1050 |
| 13.16 | 2.36 | 2.67 | 18.6 | 101 | 2.80 | 3.24 | 0.30 | 2.81 | 5.68 | 1.03 | 3.17 | 1185 |
| 14.37 | 1.95 | 2.50 | 16.8 | 113 | 3.85 | 3.49 | 0.24 | 2.18 | 7.80 | 0.86 | 3.45 | 1480 |
| 13.24 | 2.59 | 2.87 | 21.0 | 118 | 2.80 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735 |
| 14.20 | 1.76 | 2.45 | 15.2 | 112 | 3.27 | 3.39 | 0.34 | 1.97 | 6.75 | 1.05 | 2.85 | 1450 |
Check the last rows of the data
| | Alcohol | Malic.acid | Ash | Acl | Mg | Phenols | Flavanoids | Nonflavanoid.phenols | Proanth | Color.int | Hue | OD | Proline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 173 | 14.16 | 2.51 | 2.48 | 20.0 | 91 | 1.68 | 0.70 | 0.44 | 1.24 | 9.7 | 0.62 | 1.71 | 660 |
| 174 | 13.71 | 5.65 | 2.45 | 20.5 | 95 | 1.68 | 0.61 | 0.52 | 1.06 | 7.7 | 0.64 | 1.74 | 740 |
| 175 | 13.40 | 3.91 | 2.48 | 23.0 | 102 | 1.80 | 0.75 | 0.43 | 1.41 | 7.3 | 0.70 | 1.56 | 750 |
| 176 | 13.27 | 4.28 | 2.26 | 20.0 | 120 | 1.59 | 0.69 | 0.43 | 1.35 | 10.2 | 0.59 | 1.56 | 835 |
| 177 | 13.17 | 2.59 | 2.37 | 20.0 | 120 | 1.65 | 0.68 | 0.53 | 1.46 | 9.3 | 0.60 | 1.62 | 840 |
| 178 | 14.13 | 4.10 | 2.74 | 24.5 | 96 | 2.05 | 0.76 | 0.56 | 1.35 | 9.2 | 0.61 | 1.60 | 560 |
Check the dimensions of the data
## [1] 178 13
## [1] FALSE
## id Wine Alcohol
## 0 0 0
## Malic.acid Ash Acl
## 0 0 0
## Mg Phenols Flavanoids
## 0 0 0
## Nonflavanoid.phenols Proanth Color.int
## 0 0 0
## Hue OD Proline
## 0 0 0
There are no missing values in the data above, so we can move on to the next step.
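The two outputs above look like an overall check followed by a per-column count of missing values, but the calls that produced them are not shown. As a minimal sketch, assuming base R's `anyNA()` and `colSums(is.na())` were used, the same checks on a small hypothetical data frame (`toy`, with one missing value planted on purpose) look like this:

```r
# Hypothetical toy data standing in for `wine`; one NA is planted on purpose
toy <- data.frame(
  Alcohol = c(14.23, 13.20, NA, 14.37),
  Ash     = c(2.43, 2.14, 2.67, 2.50)
)

# TRUE if at least one value is missing anywhere in the data frame
has_na <- anyNA(toy)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))

print(has_na)
print(na_per_column)
```

On the wine data, the overall check returns `FALSE` and every per-column count is 0, which is what the output above reports.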
We check for outliers with ggplot boxplots. Mg (magnesium) and Proline are left out because their values are much larger than those of the other attributes and would compress the rest of the plot.
# Check outlier - ggplot for each Attribute, exclude : `magnesium` and `proline`
wine1 %>%
gather(Attributes, values, c(1:4, 6:12)) %>%
ggplot(aes(x=reorder(Attributes, values, FUN=median), y=values, fill=Attributes)) +
geom_boxplot(show.legend=FALSE) +
labs(title="Wines Attributes - Boxplots") +
theme_bw() +
theme(axis.title.y=element_blank(),
axis.title.x=element_blank()) +
ylim(0, 35) +
coord_flip()
Check outlier with boxplot
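A boxplot flags outliers visually. To list them as numbers, one option is base R's `boxplot.stats()`, which returns the points lying beyond the whiskers (by default 1.5 times the interquartile range from the hinges). This is a sketch on hypothetical values, not part of the original analysis:

```r
# Hypothetical values standing in for a single wine attribute
x <- c(2.1, 2.3, 2.4, 2.5, 2.6, 2.7, 2.8, 5.9)

# `$out` holds the observations that fall beyond the whiskers
# (coef = 1.5 is the default whisker length in units of the IQR)
flagged <- boxplot.stats(x, coef = 1.5)$out

print(flagged)
```

Applied column by column, for example with `lapply(wine1, function(col) boxplot.stats(col)$out)`, this gives a per-attribute list of candidate outliers to inspect.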
In this step, we explore the data further by visualizing it.
# Histogram for each Attribute
wine1 %>%
gather(Attributes, value, 1:13) %>%
mutate(value = as.numeric(value)) %>%
ggplot(aes(x=value, fill=Attributes)) +
geom_histogram(colour="black", show.legend=FALSE) +
facet_wrap(~Attributes, scales="free_x") +
labs(x="Values", y="Frequency",
title="Wines Attributes - Histograms") +
theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
From the histograms above, we can see that Malic.acid has a higher peak
frequency than the other variables: more than 30 observations fall around a value of 1.8.
# Density plot for each Attribute
wine1 %>%
gather(Attributes, value, 1:13) %>%
ggplot(aes(x=value, fill=Attributes)) +
geom_density(colour="black", alpha=0.5, show.legend=FALSE) +
facet_wrap(~Attributes, scales="free_x") +
labs(x="Values", y="Density",
title="Wines Attributes - Density plots") +
theme_bw()
From the density plots above, we can see that Nonflavanoid.phenols has a
denser peak than the other variables. Attributes such as Acl (alcalinity of
ash, 63 unique values), Color.int (color intensity, 101 unique values),
Mg (magnesium, 53 unique values) and Proline (101 unique values) take many
different values, so their density curves are almost flat.
What about the relationships between the different attributes? We can use the ggcorr() function from the GGally package or the corrplot() function from the corrplot package to create a graphical display of the correlation matrix.
There is a strong linear correlation between Phenols and Flavanoids
(0.9). We can model the relationship between these two variables by
fitting a linear equation.
# Relationship between Phenols and Flavanoids
ggplot(wine1, aes(x=Phenols, y=Flavanoids)) +
geom_point() +
geom_smooth(method="lm", se=FALSE) +
labs(title="Wines Attributes",
subtitle="Relationship between Phenols and Flavanoids") +
theme_bw()
## `geom_smooth()` using formula = 'y ~ x'
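The plot draws the fitted line but does not report its coefficients. As a sketch of how the correlation and the linear equation could be obtained numerically, the block below uses base R's `cor()` and `lm()` on simulated data. The vectors `phenols` and `flavanoids` are hypothetical stand-ins generated for illustration, not the wine measurements:

```r
set.seed(1)

# Simulated stand-ins for the two wine attributes (hypothetical values)
phenols    <- runif(50, min = 1, max = 4)
flavanoids <- 1.2 * phenols - 0.5 + rnorm(50, sd = 0.3)

# Pearson correlation coefficient between the two variables
r <- cor(phenols, flavanoids)

# Least-squares line, the same kind of fit that geom_smooth(method = "lm") draws
fit <- lm(flavanoids ~ phenols)

print(r)
print(coef(fit))  # intercept and slope of the fitted line
```

With the real data, `lm(Flavanoids ~ Phenols, data = wine1)` would give the equation of the line shown in the plot.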
Now that we have done an exploratory data analysis, we can prepare the data in order to execute the k-means algorithm.
We have to normalize the variables so that they are expressed on a common scale. Otherwise, attributes with large values, such as Proline, would dominate the distances that k-means relies on. Here, scale() standardizes each variable to a mean of 0 and a standard deviation of 1.
set.seed(123)
# Normalization
wine1_Norm <- as.data.frame(scale(wine1))
# Original data
p1 <- ggplot(wine1, aes(x=Alcohol, y=Malic.acid)) +
geom_point() +
labs(title="Original data") +
theme_bw()
# Normalized data
p2 <- ggplot(wine1_Norm, aes(x=Alcohol, y=Malic.acid)) +
geom_point() +
labs(title="Normalized data") +
theme_bw()
# Subplot
grid.arrange(p1, p2, ncol=2)
Based on the plot above, the points in the normalized data keep the same relative positions as in the original data. The only thing that changes is the scale of the axes.
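To make the normalization concrete, the following sketch checks that `scale()` with its default arguments is the usual z-score, (x - mean) / sd. The vector `x` is a hypothetical example, not taken from the wine data:

```r
# Hypothetical measurements
x <- c(10, 20, 30, 40)

# z-score written out by hand
z_manual <- (x - mean(x)) / sd(x)

# scale() returns a one-column matrix; flatten it for comparison
z_scale <- as.numeric(scale(x))

print(z_manual)
print(z_scale)
```

Because this is a linear transformation of each variable, the ordering and relative spacing of the points are preserved, which is why the two scatter plots look identical apart from their axes.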
We are now going to execute the k-means algorithm. As a first try, we build a clustering model with k = 2.
# Execution of k-means with k=2
set.seed(1234)
wine_km2 <- kmeans(wine1_Norm, centers=2)
summary(wine_km2)
## Length Class Mode
## cluster 178 -none- numeric
## centers 26 -none- numeric
## totss 1 -none- numeric
## withinss 2 -none- numeric
## tot.withinss 1 -none- numeric
## betweenss 1 -none- numeric
## size 2 -none- numeric
## iter 1 -none- numeric
## ifault 1 -none- numeric
# Clustering (with k=2)
ggpairs(cbind(wine1, Cluster=as.factor(wine_km2$cluster)),
columns=1:6, aes(colour=Cluster, alpha=0.5),
lower=list(continuous="points"),
upper=list(continuous="blank"),
axisLabels="none", switch="both") +
theme_bw()
From the result above, we focus on three components: cluster, centers
and size.
Cluster is a vector of integers indicating the cluster to which each point is allocated.
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2
## 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
## 2 2 2 1 2 1 1 2 2 1 2 1 2 1 1 2 1 2 1 1
## 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
## 1 1 2 2 1 1 2 2 2 2 2 2 2 1 1 1 2 1 1 1
## 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120
## 1 2 2 2 1 2 2 2 2 1 1 2 2 2 2 1 2 2 2 2
## 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140
## 1 1 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160
## 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178
## 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
Centers is a matrix of cluster centers.
## Alcohol Malic.acid Ash Acl Mg Phenols Flavanoids
## 1 0.3248845 -0.3529345 0.05207966 -0.4899811 0.3206911 0.7826625 0.8235093
## 2 -0.3106038 0.3374209 -0.04979045 0.4684435 -0.3065948 -0.7482598 -0.7873111
## Nonflavanoid.phenols Proanth Color.int Hue OD Proline
## 1 -0.5921337 0.6378483 -0.1024529 0.5633135 0.7146506 0.6051873
## 2 0.5661058 -0.6098110 0.0979495 -0.5385525 -0.6832374 -0.5785857
Size is the number of points in each cluster.
## [1] 87 91
The kmeans() function also returns several sums of squares (totss, withinss, tot.withinss and betweenss). From these we can derive ratios that tell us how compact each cluster is and how different the clusters are from one another.
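As a sketch of how these quantities relate, the block below runs `kmeans()` on the built-in `iris` measurements, used here only as a stand-in for `wine1_Norm`. The names `compactness` and `separation` and the choice of `nstart = 25` are assumptions made for illustration:

```r
# Built-in data used as a stand-in for the normalized wine data
x <- scale(iris[, 1:4])

set.seed(1234)
km <- kmeans(x, centers = 3, nstart = 25)

# Total within-cluster sum of squares: lower means tighter clusters
compactness <- km$tot.withinss

# Share of the total sum of squares explained by the clustering:
# closer to 1 means the clusters are better separated
separation <- km$betweenss / km$totss

print(compactness)
print(separation)
```

The total sum of squares always splits into the within-cluster and between-cluster parts, so `km$totss` equals `km$tot.withinss + km$betweenss`. The `between_SS / total_SS` percentage that R prints for a kmeans object is this same `separation` ratio.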
We check the class of the object wine_km2.
## [1] "kmeans"
We need to find the optimum k, that is, the optimum number of clusters. We seek to minimize the total within-cluster sum of squares, meaning that the distance between observations in the same cluster is as small as possible.
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
To see the optimum number of clusters more clearly, we can use the
fviz_nbclust() function below.
From the plots above, we can see that k = 3 is the optimum number of clusters. Beyond k = 3,
increasing k does not result in a considerable decrease of
the total within-cluster sum of squares, which measures internal cohesion (lower values mean stronger cohesion).
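The call that produced the elbow plot is not shown in the text. As a rough base-R sketch of the same idea, the total within-cluster sum of squares can be computed for a range of k values and inspected for the point where the decrease levels off. The built-in `iris` measurements stand in for `wine1_Norm`, and `k_values`, `wss` and `nstart = 25` are assumptions for illustration:

```r
# Built-in data used as a stand-in for the normalized wine data
x <- scale(iris[, 1:4])

set.seed(1234)
k_values <- 1:8

# Total within-cluster sum of squares for each candidate k
wss <- sapply(k_values, function(k) {
  kmeans(x, centers = k, nstart = 25)$tot.withinss
})

# Drop in WSS gained by adding one more cluster
elbow <- data.frame(k = k_values, wss = wss, drop = c(NA, -diff(wss)))
print(elbow)
```

Plotting the result with `plot(k_values, wss, type = "b")` gives the elbow curve. The k after which the `drop` column becomes small is the candidate for the optimum number of clusters.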
# Execution of k-means with k=3
set.seed(1234)
wine_km3 <- kmeans(wine1_Norm, centers=3)
# Mean values of each cluster
aggregate(wine1, by=list(wine_km3$cluster), mean)
# Clustering
ggpairs(cbind(wine1, Cluster=as.factor(wine_km3$cluster)),
columns=1:6, aes(colour=Cluster, alpha=0.5),
lower=list(continuous="points"),
upper=list(continuous="blank"),
axisLabels="none", switch="both") +
theme_bw()
On the plot with k = 3 above, we can see that the clusters are already more clearly separated along Mg (magnesium) than along Ash.
# 7 UL: Principal Component Analysis
## 7.1 Build and visualization for PCA
We build the PCA model with the outliers included.
# build PCA with outliers
library(FactoMineR)
wine_pca <- PCA(wine1,
scale.unit = T,
graph = F,
ncp = 6)
We can see the outliers by using the plot.PCA() function below.
# make plot, visualization for 10 external outlier
plot.PCA(wine_pca,
choix = "ind", # plot distribution data
select = "contrib 10")
We take out the outliers to clean the data.
## eigenvalue percentage of variance cumulative percentage of variance
## comp 1 4.7058503 36.1988481 36.19885
## comp 2 2.4969737 19.2074903 55.40634
## comp 3 1.4460720 11.1236305 66.52997
## comp 4 0.9189739 7.0690302 73.59900
## comp 5 0.8532282 6.5632937 80.16229
## comp 6 0.6416570 4.9358233 85.09812
## comp 7 0.5510283 4.2386793 89.33680
## comp 8 0.3484974 2.6807489 92.01754
## comp 9 0.2888799 2.2221534 94.23970
## comp 10 0.2509025 1.9300191 96.16972
## comp 11 0.2257886 1.7368357 97.90655
## comp 12 0.1687702 1.2982326 99.20479
## comp 13 0.1033779 0.7952149 100.00000
The first five principal components (PC 1 to PC 5) together retain about 80% of the variance in the data.
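The cumulative percentages in the table follow directly from the eigenvalues. The sketch below reproduces the calculation with base R's `prcomp()` on the built-in `USArrests` data, used only as a stand-in for `wine1`. The 80% threshold and the names `eigenvalues`, `cum_var` and `n_keep` are illustrative assumptions:

```r
# Built-in data used as a stand-in for wine1; scale. = TRUE standardizes the variables
pca <- prcomp(USArrests, scale. = TRUE)

# Eigenvalues are the variances of the principal components
eigenvalues <- pca$sdev^2

# Cumulative share of the total variance
cum_var <- cumsum(eigenvalues) / sum(eigenvalues)

# Smallest number of components whose cumulative share reaches 80%
n_keep <- which(cum_var >= 0.80)[1]

print(round(100 * cum_var, 2))
print(n_keep)
```

With standardized variables the eigenvalues sum to the number of variables, which is also why the 13 eigenvalues in the table above add up to 13. Applying the same rule to that table picks five components, since the cumulative percentage first exceeds 80% at comp 5 (80.16%).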
# 8 Combining Clustering and PCA
## 8.1 Visualization for PCA
# visualization of PCA + k-means clustering
fviz_pca_biplot(wine_pca,
habillage = 13,
geom.ind = "point",
addEllipses = T)
## Too few points to calculate an ellipse
## Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size): unimplemented
## pch value '30'
This plot is not recommended because the data points are too crowded
and not clearly separated.
# scree (elbow) plot of the explained variance
fviz_eig(wine_pca, ncp = 15, addlabels = T, main = "Explained variance by each dimensions")
80% of the variance can be explained using only the first 5 dimensions, and the first two dimensions alone explain 55% of the total variance.
fviz_pca_var(wine_pca, select.var = list(contrib = 31), col.var = "contrib",
gradient.cols = c("#FF3333", "#666600", "#339999"), repel = TRUE)
This plot covers the 55% of the variance explained by the first two dimensions and describes the characteristics of each variable. Wines with high OD tend to have less Malic.acid, while Phenols, Flavanoids and Proanth share similar characteristics.
Phenols, Flavanoids, Proanth and OD are highly correlated with one another, with correlation values ranging from 0.7 to 0.9.
## K-means clustering with 3 clusters of sizes 62, 65, 51
##
## Cluster means:
## Alcohol Malic.acid Ash Acl Mg Phenols
## 1 0.8328826 -0.3029551 0.3636801 -0.6084749 0.57596208 0.88274724
## 2 -0.9234669 -0.3929331 -0.4931257 0.1701220 -0.49032869 -0.07576891
## 3 0.1644436 0.8690954 0.1863726 0.5228924 -0.07526047 -0.97657548
## Flavanoids Nonflavanoid.phenols Proanth Color.int Hue OD
## 1 0.97506900 -0.56050853 0.57865427 0.1705823 0.4726504 0.7770551
## 2 0.02075402 -0.03343924 0.05810161 -0.8993770 0.4605046 0.2700025
## 3 -1.21182921 0.72402116 -0.77751312 0.9388902 -1.1615122 -1.2887761
## Proline
## 1 1.1220202
## 2 -0.7517257
## 3 -0.4059428
##
## Clustering vector:
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2
## 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
## 2 3 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2
## 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
## 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2
## 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120
## 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2
## 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140
## 2 1 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3
## 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160
## 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178
## 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
##
## Within cluster sum of squares by cluster:
## [1] 385.6983 558.6971 326.3537
## (between_SS / total_SS = 44.8 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
## 1 2 3 4 5 6
## 1 1 1 1 1 1
## NULL
# input label cluster in `wine_km3` to `wine1_Norm`
wine1_Norm$cluster <- wine_km3$cluster
# profiling - way 1
# do profiling with summarise data
wine_profile <- wine1_Norm %>%
group_by(cluster) %>%
summarise_all(mean)
wine_profile
high: x > 0.6; medium: 0.5 <= x < 0.6; low: x < 0.5
Profiling:
The clustering plot above shows that the clusters are clearly
separated.
From the analysis above, we can draw the following conclusions:
- The clustering model can clearly separate the different clusters, making it the best option for profiling each cluster.
- The PCA model can reveal correlations between variables, but it is not sufficient for profiling.