Hi, Welcome to my first LBB : Unsupervised Learning. We are going to use a Wine data set to cluster different types of wines. This data set contains the results of a chemical analysis of wines grown in a specific area of Italy. We will use clustering and PCA model to process analysis data.
First, we need to load Library
# import library
library(tidyverse)
library(corrplot)
library(gridExtra)
library(GGally)
library(knitr)Second, we need to read the data
# read data
wine <- read.csv("data_input/Wine.csv")
wine# insert index row
id <- rownames(wine)
wine <- cbind(id=id, wine)
wineWe do data wrangling by ensuring the data type is appropriate, the data have no missing value and significant outliers.
# check data type
glimpse(wine)#> Rows: 178
#> Columns: 15
#> $ id <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", "10"~
#> $ Alcohol <dbl> 14.23, 13.20, 13.16, 14.37, 13.24, 14.20, 14.39, ~
#> $ Malic_Acid <dbl> 1.71, 1.78, 2.36, 1.95, 2.59, 1.76, 1.87, 2.15, 1~
#> $ Ash <dbl> 2.43, 2.14, 2.67, 2.50, 2.87, 2.45, 2.45, 2.61, 2~
#> $ Ash_Alcanity <dbl> 15.6, 11.2, 18.6, 16.8, 21.0, 15.2, 14.6, 17.6, 1~
#> $ Magnesium <int> 127, 100, 101, 113, 118, 112, 96, 121, 97, 98, 10~
#> $ Total_Phenols <dbl> 2.80, 2.65, 2.80, 3.85, 2.80, 3.27, 2.50, 2.60, 2~
#> $ Flavanoids <dbl> 3.06, 2.76, 3.24, 3.49, 2.69, 3.39, 2.52, 2.51, 2~
#> $ Nonflavanoid_Phenols <dbl> 0.28, 0.26, 0.30, 0.24, 0.39, 0.34, 0.30, 0.31, 0~
#> $ Proanthocyanins <dbl> 2.29, 1.28, 2.81, 2.18, 1.82, 1.97, 1.98, 1.25, 1~
#> $ Color_Intensity <dbl> 5.64, 4.38, 5.68, 7.80, 4.32, 6.75, 5.25, 5.05, 5~
#> $ Hue <dbl> 1.04, 1.05, 1.03, 0.86, 1.04, 1.05, 1.02, 1.06, 1~
#> $ OD280 <dbl> 3.92, 3.40, 3.17, 3.45, 2.93, 2.85, 3.58, 3.58, 2~
#> $ Proline <int> 1065, 1050, 1185, 1480, 735, 1450, 1290, 1295, 10~
#> $ Customer_Segment <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
unique(wine$Customer_Segment)#> [1] 1 2 3
We don’t need Customer_Segment column. K-means is an unsupervised machine learning algorithm and works with unlabeled data.
# index names
rownames(wine) <- wine$id
# drop column `Customer_Segment`
# way 1
wine1 <- wine %>%
select(-c(id, Customer_Segment))
# way 2
# wine1 <- wine[,-14]
wine1Check the first row data
# check data - first row
kable(head(wine1))| Alcohol | Malic_Acid | Ash | Ash_Alcanity | Magnesium | Total_Phenols | Flavanoids | Nonflavanoid_Phenols | Proanthocyanins | Color_Intensity | Hue | OD280 | Proline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 14.23 | 1.71 | 2.43 | 15.6 | 127 | 2.80 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065 |
| 13.20 | 1.78 | 2.14 | 11.2 | 100 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.40 | 1050 |
| 13.16 | 2.36 | 2.67 | 18.6 | 101 | 2.80 | 3.24 | 0.30 | 2.81 | 5.68 | 1.03 | 3.17 | 1185 |
| 14.37 | 1.95 | 2.50 | 16.8 | 113 | 3.85 | 3.49 | 0.24 | 2.18 | 7.80 | 0.86 | 3.45 | 1480 |
| 13.24 | 2.59 | 2.87 | 21.0 | 118 | 2.80 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735 |
| 14.20 | 1.76 | 2.45 | 15.2 | 112 | 3.27 | 3.39 | 0.34 | 1.97 | 6.75 | 1.05 | 2.85 | 1450 |
Check the last row data
# check data - last row
kable(tail(wine1))| Alcohol | Malic_Acid | Ash | Ash_Alcanity | Magnesium | Total_Phenols | Flavanoids | Nonflavanoid_Phenols | Proanthocyanins | Color_Intensity | Hue | OD280 | Proline | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 173 | 14.16 | 2.51 | 2.48 | 20.0 | 91 | 1.68 | 0.70 | 0.44 | 1.24 | 9.7 | 0.62 | 1.71 | 660 |
| 174 | 13.71 | 5.65 | 2.45 | 20.5 | 95 | 1.68 | 0.61 | 0.52 | 1.06 | 7.7 | 0.64 | 1.74 | 740 |
| 175 | 13.40 | 3.91 | 2.48 | 23.0 | 102 | 1.80 | 0.75 | 0.43 | 1.41 | 7.3 | 0.70 | 1.56 | 750 |
| 176 | 13.27 | 4.28 | 2.26 | 20.0 | 120 | 1.59 | 0.69 | 0.43 | 1.35 | 10.2 | 0.59 | 1.56 | 835 |
| 177 | 13.17 | 2.59 | 2.37 | 20.0 | 120 | 1.65 | 0.68 | 0.53 | 1.46 | 9.3 | 0.60 | 1.62 | 840 |
| 178 | 14.13 | 4.10 | 2.74 | 24.5 | 96 | 2.05 | 0.76 | 0.56 | 1.35 | 9.2 | 0.61 | 1.60 | 560 |
Check dimension data
# check dimension data
dim(wine1)#> [1] 178 13
# Check available missing value
anyNA(wine1)#> [1] FALSE
# Check missing value per variable data
wine1 %>%
is.na() %>%
colSums()#> Alcohol Malic_Acid Ash
#> 0 0 0
#> Ash_Alcanity Magnesium Total_Phenols
#> 0 0 0
#> Flavanoids Nonflavanoid_Phenols Proanthocyanins
#> 0 0 0
#> Color_Intensity Hue OD280
#> 0 0 0
#> Proline
#> 0
There is no missing value on the data wine1 above. We can do next progress.
We do check outlier with ggplot. But we haven’t included magnesium and proline, cause their values are very high. They can worsen the visualization data.
# Check outlier - ggplot for each Attribute, exclude : `magnesium` and `proline`
wine1 %>%
gather(Attributes, values, c(1:4, 6:12)) %>%
ggplot(aes(x=reorder(Attributes, values, FUN=median), y=values, fill=Attributes)) +
geom_boxplot(show.legend=FALSE) +
labs(title="Wines Attributes - Boxplots") +
theme_bw() +
theme(axis.title.y=element_blank(),
axis.title.x=element_blank()) +
ylim(0, 35) +
coord_flip()Check outlier with boxplot
# check outlier with boxplot - all variables
boxplot(scale(wine1)) Based on the plot above, there is outlier for variable :
Malic_Acid, Ash, Ash_Alcanity, Proanthocyanins, Color_Intensity and Hue.
In this steps, we can do next level explore data with visualize data.
# Histogram for each Attribute
wine1 %>%
gather(Attributes, value, 1:13) %>%
ggplot(aes(x=value, fill=Attributes)) +
geom_histogram(colour="black", show.legend=FALSE) +
facet_wrap(~Attributes, scales="free_x") +
labs(x="Values", y="Frequency",
title="Wines Attributes - Histograms") +
theme_bw() From histogram on above, we can see that
Malic_Acid have higher frequency around > 30 at point 1.8 compared to other variables.
# Density plot for each Attribute
wine1 %>%
gather(Attributes, value, 1:13) %>%
ggplot(aes(x=value, fill=Attributes)) +
geom_density(colour="black", alpha=0.5, show.legend=FALSE) +
facet_wrap(~Attributes, scales="free_x") +
labs(x="Values", y="Density",
title="Wines Attributes - Density plots") +
theme_bw() From density plot on above, we can see that
Nonflavanoid_Phenols have denser area compared to other variables. For attributes such as Ash_Alcanity (63 unique values), Color_Intensity(101 unique values), Magnesium(53 unique values) and Proline (101 unique values) have many different unique value, so the density of the area almost flat.
How about the relationship between the different attributes? We can use the ggcorr() or corrplot() function from packages GGallyto create a graphical display of a correlation matrix.
# Correlation matrix
# way 1
ggcorr(wine1, label = T, hjust=0.90)# way 2
# corrplot(cor(wine1), type="upper", method="ellipse", tl.cex=0.9)There is a strong linear correlation between Total_Phenols and Flavanoids (0.9). We can build model based on the relationship between the two variables above by fitting a linear equation.
# Relationship between Phenols and Flavanoids
ggplot(wine1, aes(x=Total_Phenols, y=Flavanoids)) +
geom_point() +
geom_smooth(method="lm", se=FALSE) +
labs(title="Wines Attributes",
subtitle="Relationship between Phenols and Flavanoids") +
theme_bw()Now that we have done a exploratory data analysis, we can prepare the data in order to execute the k-means algorithm.
We have to normalize the variables to express them in the same range of values. In other words, normalization means adjusting values measured on different scales to a common scale.
set.seed(123)
# Normalization
wine1_Norm <- as.data.frame(scale(wine1))
# Original data
p1 <- ggplot(wine1, aes(x=Alcohol, y=Malic_Acid)) +
geom_point() +
labs(title="Original data") +
theme_bw()
# Normalized data
p2 <- ggplot(wine1_Norm, aes(x=Alcohol, y=Malic_Acid)) +
geom_point() +
labs(title="Normalized data") +
theme_bw()
# Subplot
grid.arrange(p1, p2, ncol=2)Based on the plot above, the points in the normalized data are the same as the original one. The only thing that changes is the scale of the axis.
We are going to execute the k-means algorithm. On the first try, We can do clustering model with k value = 2.
# Execution of k-means with k=2
set.seed(1234)
wine_km2 <- kmeans(wine1_Norm, centers=2)
summary(wine_km2)#> Length Class Mode
#> cluster 178 -none- numeric
#> centers 26 -none- numeric
#> totss 1 -none- numeric
#> withinss 2 -none- numeric
#> tot.withinss 1 -none- numeric
#> betweenss 1 -none- numeric
#> size 2 -none- numeric
#> iter 1 -none- numeric
#> ifault 1 -none- numeric
# # Clustering (with k=2)
ggpairs(cbind(wine1, Cluster=as.factor(wine_km2$cluster)),
columns=1:6, aes(colour=Cluster, alpha=0.5),
lower=list(continuous="points"),
upper=list(continuous="blank"),
axisLabels="none", switch="both") +
theme_bw() Based on this result above, we focus with component :
cluster, centers and size
Cluster is a vector of integers indicating the cluster to which each point is allocated.
# Cluster to which each point is allocated
wine_km2$cluster#> 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
#> 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#> 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
#> 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#> 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
#> 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2
#> 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
#> 2 2 2 1 2 1 1 2 2 1 2 1 2 1 1 2 1 2 1 1
#> 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
#> 1 1 2 2 1 1 2 2 2 2 2 2 2 1 1 1 2 1 1 1
#> 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120
#> 1 2 2 2 1 2 2 2 2 1 1 2 2 2 2 1 2 2 2 2
#> 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140
#> 1 1 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
#> 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160
#> 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
#> 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178
#> 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
Centers is a matrix of cluster centers.
# Cluster centers
wine_km2$centers#> Alcohol Malic_Acid Ash Ash_Alcanity Magnesium Total_Phenols
#> 1 0.3248845 -0.3529345 0.05207966 -0.4899811 0.3206911 0.7826625
#> 2 -0.3106038 0.3374209 -0.04979045 0.4684435 -0.3065948 -0.7482598
#> Flavanoids Nonflavanoid_Phenols Proanthocyanins Color_Intensity Hue
#> 1 0.8235093 -0.5921337 0.6378483 -0.1024529 0.5633135
#> 2 -0.7873111 0.5661058 -0.6098110 0.0979495 -0.5385525
#> OD280 Proline
#> 1 0.7146506 0.6051873
#> 2 -0.6832374 -0.5785857
Size is the number of points in each cluster.
# Cluster size
wine_km2$size#> [1] 87 91
The kmeans() function returns some ratios that let us know how compact is a cluster and how different are several clusters among themselves.
We check class of object wine_km2.
# check class object `wine_km2`
class(wine_km2)#> [1] "kmeans"
We need to find optimum k to determine optimum cluster. We seek to minimize the total within-cluster sum of squares (meaning that the distance is minimum between observation in the same cluster)
# elbow meter
library(factoextra)
fviz_nbclust(wine1_Norm,#data scaled
kmeans,#function
method = "wss")To see clearly which optimum cluster K, we can use fviz_nbclust() function on the below.
# give the line for optimum cluster K
fviz_nbclust(wine1_Norm , kmeans, method = "silhouette")We can see that 2 plot above is the optimum number of K. After k=3, increasing the number of K does not result in a considerable decrease of the total within the sum of squares (strong internal cohesion).
# Execution of k-means with k=3
set.seed(1234)
wine_km3 <- kmeans(wine1_Norm, centers=3)
# Mean values of each cluster
aggregate(wine1, by=list(wine_km2$cluster), mean)## Group.1 Alcohol Malic_Acid Ash Ash_Alcanity Magnesium
## 1 1 13.67677 1.997903 2.466290 17.46290 107.96774
## 2 2 12.25092 1.897385 2.231231 20.06308 92.73846
## 3 3 13.13412 3.307255 2.417647 21.24118 98.66667
## Total_Phenols Flavanoids Nonflavanoid_Phenols Proanthocyanins
## 1 2.847581 3.0032258 0.2920968 1.922097
## 2 2.247692 2.0500000 0.3576923 1.624154
## 3 1.683922 0.8188235 0.4519608 1.145882
## Color_Intensity Hue OD280 Proline
## 1 5.453548 1.0654839 3.163387 1100.2258
## 2 2.973077 1.0627077 2.803385 510.1692
## 3 7.234706 0.6919608 1.696667 619.0588
# Clustering
ggpairs(cbind(wine1, Cluster=as.factor(wine_km3$cluster)),
columns=1:6, aes(colour=Cluster, alpha=0.5),
lower=list(continuous="points"),
upper=list(continuous="blank"),
axisLabels="none", switch="both") +
theme_bw() We can see on the plot with k value = 3 above,
magnesium is already clearly separated compared to Ash.
We build PCA models with outliers.
# build PCA with outliers
library(FactoMineR)
wine_pca <- PCA(wine1,
scale.unit = T,
graph = F,
ncp = 6)We can see the outlier by use plot.PCA() function on the below.
# make plot, visualization for 10 external outlier
plot.PCA(wine_pca,
choix = "ind", # plot distribution data
select = "contrib 10")We take out the outliers, to clean the data.
# save index outlier in object "outlier"
outlier <- c(4,15,19,116,137,138,147,156,174,178)# drop data outlier
wine_normal <- wine1[-outlier,]wine_normal# check variance for each PC
# get 80% information data
wine_pca$eig#> eigenvalue percentage of variance cumulative percentage of variance
#> comp 1 4.7058503 36.1988481 36.19885
#> comp 2 2.4969737 19.2074903 55.40634
#> comp 3 1.4460720 11.1236305 66.52997
#> comp 4 0.9189739 7.0690302 73.59900
#> comp 5 0.8532282 6.5632937 80.16229
#> comp 6 0.6416570 4.9358233 85.09812
#> comp 7 0.5510283 4.2386793 89.33680
#> comp 8 0.3484974 2.6807489 92.01754
#> comp 9 0.2888799 2.2221534 94.23970
#> comp 10 0.2509025 1.9300191 96.16972
#> comp 11 0.2257886 1.7368357 97.90655
#> comp 12 0.1687702 1.2982326 99.20479
#> comp 13 0.1033779 0.7952149 100.00000
We get 80% data from PC 1, PC 2, PC 3, PC 4 and PC 5.
# visualisasi PCA + kmeans clustering
fviz_pca_biplot(wine_pca,
habillage = 13,
geom.ind = "point",
addEllipses = T) This plot is not recommended, cause the data is not clearly separated and too crowded.
# elbow meter
fviz_eig(wine_pca, ncp = 15, addlabels = T, main = "Explained variance by each dimensions") 80% of the variances can be explained by only using the first 5 dimensions, with the first two dimensions can explain 55% of the total variances.
fviz_pca_var(wine_pca, select.var = list(contrib = 31), col.var = "contrib",
gradient.cols = c("#FF3333", "#666600", "#339999"), repel = TRUE) This plot explains 55% variance from the data and describes each variable characteristics. Wine with high
OD280 tends to have less Malic_Acid, while Total_Phenols, Flavanoids and
Proanthyocyanins share similar characteristics.
Total_Phenols, Flavanoids, Proanthyocyanins and OD280 have high correlation value from 0.7 until 0.9.
# centers = 3 (optimum k= 3)
wine_km3#> K-means clustering with 3 clusters of sizes 62, 65, 51
#>
#> Cluster means:
#> Alcohol Malic_Acid Ash Ash_Alcanity Magnesium Total_Phenols
#> 1 0.8328826 -0.3029551 0.3636801 -0.6084749 0.57596208 0.88274724
#> 2 -0.9234669 -0.3929331 -0.4931257 0.1701220 -0.49032869 -0.07576891
#> 3 0.1644436 0.8690954 0.1863726 0.5228924 -0.07526047 -0.97657548
#> Flavanoids Nonflavanoid_Phenols Proanthocyanins Color_Intensity Hue
#> 1 0.97506900 -0.56050853 0.57865427 0.1705823 0.4726504
#> 2 0.02075402 -0.03343924 0.05810161 -0.8993770 0.4605046
#> 3 -1.21182921 0.72402116 -0.77751312 0.9388902 -1.1615122
#> OD280 Proline
#> 1 0.7770551 1.1220202
#> 2 0.2700025 -0.7517257
#> 3 -1.2887761 -0.4059428
#>
#> Clustering vector:
#> 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
#> 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#> 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
#> 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#> 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
#> 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2
#> 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
#> 2 3 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2
#> 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
#> 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2
#> 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120
#> 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2
#> 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140
#> 2 1 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3
#> 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160
#> 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
#> 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178
#> 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
#>
#> Within cluster sum of squares by cluster:
#> [1] 385.6983 558.6971 326.3537
#> (between_SS / total_SS = 44.8 %)
#>
#> Available components:
#>
#> [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
#> [6] "betweenss" "size" "iter" "ifault"
head(wine_km3$cluster)#> 1 2 3 4 5 6
#> 1 1 1 1 1 1
head(wine1_Norm$cluster)#> NULL
# input label cluster in `wine_km3` to `wine1_Norm`
wine1_Norm$cluster <- wine_km3$cluster
# profiling - way 1
# do profiling with summarise data
wine_profile <- wine1_Norm %>%
group_by(cluster) %>%
summarise_all(mean)
wine_profile# profiling - way 2
# wine_profile %>%
# tidyr::pivot_longer(-cluster) %>%
# group_by(name) %>%
# summarize(cluster_min_val = which.min(value),
# cluster_max_val = which.max(value))high : x > 0.6 medium : 0.5 =< x < 0.6 low : x < 0.5
Profiling:
Alcohol, Total_Phenols, Flavanoids, OD280, Proline, and Ash_AlcanityMagnesium, Nonflavanoid_Phenols, and ProanthocyaninsMalic_Acid, Ash, Color_Intensity, HueAlcohol, Color_Intensity, ProlineMalic_Acid, Ash, Ash_Alcanity, Magnesium, Total_Phenols, Flavanoids, Nonflavanoid_Phenols, Proanthocyanins, Hue, OD280Malic_Acid, Total_Phenols, Flavanoids, Nonflavanoid_Phenols, Proanthocyanins, Color_Intensity, Hue, OD280Ash_AlcanityAlcohol, Ash, Magnesium, Proline# package factoextra
fviz_cluster(object = wine_km3,
data = wine1_Norm %>% select(-cluster)) With plot clustering above, we can see for each cluster clearly separated.
We can get the conclusion from the analysis above, such as : - Clustering model can clearly separated best for profiling for each cluster - PCA Model can see correlations between variables, but not good enough to do profiling