1 Business Problem

Hi, Welcome to my first LBB : Unsupervised Learning. We are going to use a Wine data set to cluster different types of wines. This data set contains the results of a chemical analysis of wines grown in a specific area of Italy. We will use clustering and PCA model to process analysis data.

2 Import Library and Read Data

2.1 Import Library

First, we need to load Library

# import library
library(tidyverse)
library(corrplot)
library(gridExtra)
library(GGally)
library(knitr)

2.2 Read Data

Second, we need to read the data

# read data
wine <- read.csv("data_input/Wine.csv")

wine

# insert index row
id <- rownames(wine)
wine <- cbind(id=id, wine)
wine

3 Data Wrangling

We do data wrangling by ensuring the data type is appropriate, the data have no missing value and significant outliers.

3.1 Check data type and useable variable

# check data type
glimpse(wine)

#> Rows: 178
#> Columns: 15
#> $ id                   <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", "10"~
#> $ Alcohol              <dbl> 14.23, 13.20, 13.16, 14.37, 13.24, 14.20, 14.39, ~
#> $ Malic_Acid           <dbl> 1.71, 1.78, 2.36, 1.95, 2.59, 1.76, 1.87, 2.15, 1~
#> $ Ash                  <dbl> 2.43, 2.14, 2.67, 2.50, 2.87, 2.45, 2.45, 2.61, 2~
#> $ Ash_Alcanity         <dbl> 15.6, 11.2, 18.6, 16.8, 21.0, 15.2, 14.6, 17.6, 1~
#> $ Magnesium            <int> 127, 100, 101, 113, 118, 112, 96, 121, 97, 98, 10~
#> $ Total_Phenols        <dbl> 2.80, 2.65, 2.80, 3.85, 2.80, 3.27, 2.50, 2.60, 2~
#> $ Flavanoids           <dbl> 3.06, 2.76, 3.24, 3.49, 2.69, 3.39, 2.52, 2.51, 2~
#> $ Nonflavanoid_Phenols <dbl> 0.28, 0.26, 0.30, 0.24, 0.39, 0.34, 0.30, 0.31, 0~
#> $ Proanthocyanins      <dbl> 2.29, 1.28, 2.81, 2.18, 1.82, 1.97, 1.98, 1.25, 1~
#> $ Color_Intensity      <dbl> 5.64, 4.38, 5.68, 7.80, 4.32, 6.75, 5.25, 5.05, 5~
#> $ Hue                  <dbl> 1.04, 1.05, 1.03, 0.86, 1.04, 1.05, 1.02, 1.06, 1~
#> $ OD280                <dbl> 3.92, 3.40, 3.17, 3.45, 2.93, 2.85, 3.58, 3.58, 2~
#> $ Proline              <int> 1065, 1050, 1185, 1480, 735, 1450, 1290, 1295, 10~
#> $ Customer_Segment     <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~

unique(wine$Customer_Segment)

#> [1] 1 2 3

We don’t need Customer_Segment column. K-means is an unsupervised machine learning algorithm and works with unlabeled data.

# index names
rownames(wine) <- wine$id

# drop column `Customer_Segment`
# way 1
wine1 <- wine %>% 
  select(-c(id, Customer_Segment))


# way 2
# wine1 <- wine[,-14]

wine1

3.2 Explore data

Check the first row data

# check data - first row
kable(head(wine1))

Alcohol	Malic_Acid	Ash	Ash_Alcanity	Magnesium	Total_Phenols	Flavanoids	Nonflavanoid_Phenols	Proanthocyanins	Color_Intensity	Hue	OD280	Proline
14.23	1.71	2.43	15.6	127	2.80	3.06	0.28	2.29	5.64	1.04	3.92	1065
13.20	1.78	2.14	11.2	100	2.65	2.76	0.26	1.28	4.38	1.05	3.40	1050
13.16	2.36	2.67	18.6	101	2.80	3.24	0.30	2.81	5.68	1.03	3.17	1185
14.37	1.95	2.50	16.8	113	3.85	3.49	0.24	2.18	7.80	0.86	3.45	1480
13.24	2.59	2.87	21.0	118	2.80	2.69	0.39	1.82	4.32	1.04	2.93	735
14.20	1.76	2.45	15.2	112	3.27	3.39	0.34	1.97	6.75	1.05	2.85	1450

Check the last row data

# check data - last row
kable(tail(wine1))

	Alcohol	Malic_Acid	Ash	Ash_Alcanity	Magnesium	Total_Phenols	Flavanoids	Nonflavanoid_Phenols	Proanthocyanins	Color_Intensity	Hue	OD280	Proline
173	14.16	2.51	2.48	20.0	91	1.68	0.70	0.44	1.24	9.7	0.62	1.71	660
174	13.71	5.65	2.45	20.5	95	1.68	0.61	0.52	1.06	7.7	0.64	1.74	740
175	13.40	3.91	2.48	23.0	102	1.80	0.75	0.43	1.41	7.3	0.70	1.56	750
176	13.27	4.28	2.26	20.0	120	1.59	0.69	0.43	1.35	10.2	0.59	1.56	835
177	13.17	2.59	2.37	20.0	120	1.65	0.68	0.53	1.46	9.3	0.60	1.62	840
178	14.13	4.10	2.74	24.5	96	2.05	0.76	0.56	1.35	9.2	0.61	1.60	560

Check dimension data

# check dimension data
dim(wine1)

#> [1] 178  13

3.3 Check Missing Value

# Check available missing value
anyNA(wine1)

#> [1] FALSE

# Check missing value per variable data
wine1 %>% 
  is.na() %>% 
  colSums()

#>              Alcohol           Malic_Acid                  Ash 
#>                    0                    0                    0 
#>         Ash_Alcanity            Magnesium        Total_Phenols 
#>                    0                    0                    0 
#>           Flavanoids Nonflavanoid_Phenols      Proanthocyanins 
#>                    0                    0                    0 
#>      Color_Intensity                  Hue                OD280 
#>                    0                    0                    0 
#>              Proline 
#>                    0

There is no missing value on the data wine1 above. We can do next progress.

3.4 Check Outlier

We do check outlier with ggplot. But we haven’t included magnesium and proline, cause their values are very high. They can worsen the visualization data.

# Check outlier - ggplot for each Attribute, exclude : `magnesium` and `proline`  
wine1 %>%
  gather(Attributes, values, c(1:4, 6:12)) %>%
  ggplot(aes(x=reorder(Attributes, values, FUN=median), y=values, fill=Attributes)) +
  geom_boxplot(show.legend=FALSE) +
  labs(title="Wines Attributes - Boxplots") +
  theme_bw() +
  theme(axis.title.y=element_blank(),
        axis.title.x=element_blank()) +
  ylim(0, 35) +
  coord_flip()

Check outlier with boxplot

# check outlier with boxplot - all variables
boxplot(scale(wine1))

Based on the plot above, there is outlier for variable : Malic_Acid, Ash, Ash_Alcanity, Proanthocyanins, Color_Intensity and Hue.

4 Exploratory Data Analysis

In this steps, we can do next level explore data with visualize data.

4.1 Visualize data with Histogram

# Histogram for each Attribute
wine1 %>%
  gather(Attributes, value, 1:13) %>%
  ggplot(aes(x=value, fill=Attributes)) +
  geom_histogram(colour="black", show.legend=FALSE) +
  facet_wrap(~Attributes, scales="free_x") +
  labs(x="Values", y="Frequency",
       title="Wines Attributes - Histograms") +
  theme_bw()

From histogram on above, we can see that Malic_Acid have higher frequency around > 30 at point 1.8 compared to other variables.

4.2 Visualize data with Density Plot

# Density plot for each Attribute
wine1 %>%
  gather(Attributes, value, 1:13) %>%
  ggplot(aes(x=value, fill=Attributes)) +
  geom_density(colour="black", alpha=0.5, show.legend=FALSE) +
  facet_wrap(~Attributes, scales="free_x") +
  labs(x="Values", y="Density",
       title="Wines Attributes - Density plots") +
  theme_bw()

From density plot on above, we can see that Nonflavanoid_Phenols have denser area compared to other variables. For attributes such as Ash_Alcanity (63 unique values), Color_Intensity(101 unique values), Magnesium(53 unique values) and Proline (101 unique values) have many different unique value, so the density of the area almost flat.

4.3 Check correlation for variable data

How about the relationship between the different attributes? We can use the ggcorr() or corrplot() function from packages GGallyto create a graphical display of a correlation matrix.

# Correlation matrix 

# way 1
ggcorr(wine1, label = T, hjust=0.90)

# way 2
# corrplot(cor(wine1), type="upper", method="ellipse", tl.cex=0.9)

There is a strong linear correlation between Total_Phenols and Flavanoids (0.9). We can build model based on the relationship between the two variables above by fitting a linear equation.

# Relationship between Phenols and Flavanoids
ggplot(wine1, aes(x=Total_Phenols, y=Flavanoids)) +
  geom_point() +
  geom_smooth(method="lm", se=FALSE) +
  labs(title="Wines Attributes",
       subtitle="Relationship between Phenols and Flavanoids") +
  theme_bw()

Now that we have done a exploratory data analysis, we can prepare the data in order to execute the k-means algorithm.

5 Data Pre-processing

We have to normalize the variables to express them in the same range of values. In other words, normalization means adjusting values measured on different scales to a common scale.

set.seed(123)

# Normalization
wine1_Norm <- as.data.frame(scale(wine1))

# Original data
p1 <- ggplot(wine1, aes(x=Alcohol, y=Malic_Acid)) +
  geom_point() +
  labs(title="Original data") +
  theme_bw()

# Normalized data 
p2 <- ggplot(wine1_Norm, aes(x=Alcohol, y=Malic_Acid)) +
  geom_point() +
  labs(title="Normalized data") +
  theme_bw()

# Subplot
grid.arrange(p1, p2, ncol=2)

Based on the plot above, the points in the normalized data are the same as the original one. The only thing that changes is the scale of the axis.

6 k-means execution

We are going to execute the k-means algorithm. On the first try, We can do clustering model with k value = 2.

# Execution of k-means with k=2
set.seed(1234)
wine_km2 <- kmeans(wine1_Norm, centers=2)
summary(wine_km2)

#>              Length Class  Mode   
#> cluster      178    -none- numeric
#> centers       26    -none- numeric
#> totss          1    -none- numeric
#> withinss       2    -none- numeric
#> tot.withinss   1    -none- numeric
#> betweenss      1    -none- numeric
#> size           2    -none- numeric
#> iter           1    -none- numeric
#> ifault         1    -none- numeric

# # Clustering (with k=2)
ggpairs(cbind(wine1, Cluster=as.factor(wine_km2$cluster)),
        columns=1:6, aes(colour=Cluster, alpha=0.5),
        lower=list(continuous="points"),
        upper=list(continuous="blank"),
        axisLabels="none", switch="both") +
        theme_bw()

Based on this result above, we focus with component : cluster, centers and size

6.1 Cluster

Cluster is a vector of integers indicating the cluster to which each point is allocated.

# Cluster to which each point is allocated
wine_km2$cluster

#>   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20 
#>   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
#>  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40 
#>   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
#>  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60 
#>   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   2 
#>  61  62  63  64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80 
#>   2   2   2   1   2   1   1   2   2   1   2   1   2   1   1   2   1   2   1   1 
#>  81  82  83  84  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100 
#>   1   1   2   2   1   1   2   2   2   2   2   2   2   1   1   1   2   1   1   1 
#> 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 
#>   1   2   2   2   1   2   2   2   2   1   1   2   2   2   2   1   2   2   2   2 
#> 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 
#>   1   1   2   2   1   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2 
#> 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 
#>   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2 
#> 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 
#>   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2

6.2 Centers

Centers is a matrix of cluster centers.

# Cluster centers
wine_km2$centers

#>      Alcohol Malic_Acid         Ash Ash_Alcanity  Magnesium Total_Phenols
#> 1  0.3248845 -0.3529345  0.05207966   -0.4899811  0.3206911     0.7826625
#> 2 -0.3106038  0.3374209 -0.04979045    0.4684435 -0.3065948    -0.7482598
#>   Flavanoids Nonflavanoid_Phenols Proanthocyanins Color_Intensity        Hue
#> 1  0.8235093           -0.5921337       0.6378483      -0.1024529  0.5633135
#> 2 -0.7873111            0.5661058      -0.6098110       0.0979495 -0.5385525
#>        OD280    Proline
#> 1  0.7146506  0.6051873
#> 2 -0.6832374 -0.5785857

6.3 Size

Size is the number of points in each cluster.

# Cluster size
wine_km2$size

#> [1] 87 91

The kmeans() function returns some ratios that let us know how compact is a cluster and how different are several clusters among themselves.

6.4 Optimum K value

We check class of object wine_km2.

# check class object `wine_km2`
class(wine_km2)

#> [1] "kmeans"

We need to find optimum k to determine optimum cluster. We seek to minimize the total within-cluster sum of squares (meaning that the distance is minimum between observation in the same cluster)

# elbow meter
library(factoextra)
fviz_nbclust(wine1_Norm,#data scaled
             kmeans,#function
             method = "wss")

To see clearly which optimum cluster K, we can use fviz_nbclust() function on the below.

# give the line for optimum cluster K
fviz_nbclust(wine1_Norm , kmeans, method = "silhouette")

We can see that 2 plot above is the optimum number of K. After k=3, increasing the number of K does not result in a considerable decrease of the total within the sum of squares (strong internal cohesion).

# Execution of k-means with k=3
set.seed(1234)

wine_km3 <- kmeans(wine1_Norm, centers=3)

# Mean values of each cluster
aggregate(wine1, by=list(wine_km2$cluster), mean)

##   Group.1  Alcohol Malic_Acid      Ash Ash_Alcanity Magnesium
## 1       1 13.67677   1.997903 2.466290     17.46290 107.96774
## 2       2 12.25092   1.897385 2.231231     20.06308  92.73846
## 3       3 13.13412   3.307255 2.417647     21.24118  98.66667
##   Total_Phenols Flavanoids Nonflavanoid_Phenols Proanthocyanins
## 1      2.847581  3.0032258            0.2920968        1.922097
## 2      2.247692  2.0500000            0.3576923        1.624154
## 3      1.683922  0.8188235            0.4519608        1.145882
##   Color_Intensity       Hue    OD280   Proline
## 1        5.453548 1.0654839 3.163387 1100.2258
## 2        2.973077 1.0627077 2.803385  510.1692
## 3        7.234706 0.6919608 1.696667  619.0588
# Clustering 
ggpairs(cbind(wine1, Cluster=as.factor(wine_km3$cluster)),
        columns=1:6, aes(colour=Cluster, alpha=0.5),
        lower=list(continuous="points"),
        upper=list(continuous="blank"),
        axisLabels="none", switch="both") +
        theme_bw()

We can see on the plot with k value = 3 above, magnesium is already clearly separated compared to Ash.

7 UL : Principal Component Analysis

7.1 Build and visualization for PCA

We build PCA models with outliers.

# build PCA with outliers
library(FactoMineR)
wine_pca <- PCA(wine1, 
                     scale.unit = T, 
                     graph = F,
                     ncp = 6)

We can see the outlier by use plot.PCA() function on the below.

# make plot, visualization for 10 external outlier
plot.PCA(wine_pca,
         choix = "ind", # plot distribution data
         select = "contrib 10")

We take out the outliers, to clean the data.

# save index outlier in object "outlier"
outlier <- c(4,15,19,116,137,138,147,156,174,178)

# drop data outlier
wine_normal <- wine1[-outlier,]

wine_normal

7.2 Check eigen value

# check variance for each PC
# get 80% information data
wine_pca$eig

#>         eigenvalue percentage of variance cumulative percentage of variance
#> comp 1   4.7058503             36.1988481                          36.19885
#> comp 2   2.4969737             19.2074903                          55.40634
#> comp 3   1.4460720             11.1236305                          66.52997
#> comp 4   0.9189739              7.0690302                          73.59900
#> comp 5   0.8532282              6.5632937                          80.16229
#> comp 6   0.6416570              4.9358233                          85.09812
#> comp 7   0.5510283              4.2386793                          89.33680
#> comp 8   0.3484974              2.6807489                          92.01754
#> comp 9   0.2888799              2.2221534                          94.23970
#> comp 10  0.2509025              1.9300191                          96.16972
#> comp 11  0.2257886              1.7368357                          97.90655
#> comp 12  0.1687702              1.2982326                          99.20479
#> comp 13  0.1033779              0.7952149                         100.00000

We get 80% data from PC 1, PC 2, PC 3, PC 4 and PC 5.

8 Combining Clustering and PCA

8.1 Visualization for PCA

# visualisasi PCA + kmeans clustering

fviz_pca_biplot(wine_pca,
                habillage = 13,
                geom.ind = "point",
                addEllipses = T)

This plot is not recommended, cause the data is not clearly separated and too crowded.

# elbow meter
fviz_eig(wine_pca, ncp = 15, addlabels = T, main = "Explained variance by each dimensions")

80% of the variances can be explained by only using the first 5 dimensions, with the first two dimensions can explain 55% of the total variances.

fviz_pca_var(wine_pca, select.var = list(contrib = 31), col.var = "contrib", 
    gradient.cols = c("#FF3333", "#666600", "#339999"), repel = TRUE)

This plot explains 55% variance from the data and describes each variable characteristics. Wine with high OD280 tends to have less Malic_Acid, while Total_Phenols, Flavanoids and
Proanthyocyanins share similar characteristics.

Total_Phenols, Flavanoids, Proanthyocyanins and OD280 have high correlation value from 0.7 until 0.9.

8.2 Visualization for Cluster and Profiling

# centers = 3 (optimum k= 3)
wine_km3

#> K-means clustering with 3 clusters of sizes 62, 65, 51
#> 
#> Cluster means:
#>      Alcohol Malic_Acid        Ash Ash_Alcanity   Magnesium Total_Phenols
#> 1  0.8328826 -0.3029551  0.3636801   -0.6084749  0.57596208    0.88274724
#> 2 -0.9234669 -0.3929331 -0.4931257    0.1701220 -0.49032869   -0.07576891
#> 3  0.1644436  0.8690954  0.1863726    0.5228924 -0.07526047   -0.97657548
#>    Flavanoids Nonflavanoid_Phenols Proanthocyanins Color_Intensity        Hue
#> 1  0.97506900          -0.56050853      0.57865427       0.1705823  0.4726504
#> 2  0.02075402          -0.03343924      0.05810161      -0.8993770  0.4605046
#> 3 -1.21182921           0.72402116     -0.77751312       0.9388902 -1.1615122
#>        OD280    Proline
#> 1  0.7770551  1.1220202
#> 2  0.2700025 -0.7517257
#> 3 -1.2887761 -0.4059428
#> 
#> Clustering vector:
#>   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20 
#>   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
#>  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40 
#>   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
#>  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60 
#>   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   2 
#>  61  62  63  64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80 
#>   2   3   2   2   2   2   2   2   2   2   2   2   2   1   2   2   2   2   2   2 
#>  81  82  83  84  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100 
#>   2   2   2   3   2   2   2   2   2   2   2   2   2   2   2   1   2   2   2   2 
#> 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 
#>   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   3   2 
#> 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 
#>   2   1   2   2   2   2   2   2   2   2   3   3   3   3   3   3   3   3   3   3 
#> 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 
#>   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3 
#> 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 
#>   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3 
#> 
#> Within cluster sum of squares by cluster:
#> [1] 385.6983 558.6971 326.3537
#>  (between_SS / total_SS =  44.8 %)
#> 
#> Available components:
#> 
#> [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
#> [6] "betweenss"    "size"         "iter"         "ifault"

head(wine_km3$cluster)

#> 1 2 3 4 5 6 
#> 1 1 1 1 1 1

head(wine1_Norm$cluster)

#> NULL

# input label cluster in `wine_km3` to `wine1_Norm`
wine1_Norm$cluster <- wine_km3$cluster

# profiling - way 1
# do profiling with summarise data
wine_profile <- wine1_Norm %>% 
  group_by(cluster) %>% 
  summarise_all(mean)

wine_profile

# profiling - way 2
# wine_profile %>% 
#   tidyr::pivot_longer(-cluster) %>% 
#   group_by(name) %>% 
#   summarize(cluster_min_val = which.min(value),
#             cluster_max_val = which.max(value))

high : x > 0.6 medium : 0.5 =< x < 0.6 low : x < 0.5

Profiling:

Cluster 1 is wine with
- high : Alcohol, Total_Phenols, Flavanoids, OD280, Proline, and Ash_Alcanity
- medium : Magnesium, Nonflavanoid_Phenols, and Proanthocyanins
- LOW : Malic_Acid, Ash, Color_Intensity, Hue
Cluster 2 is wine with
- high : Alcohol, Color_Intensity, Proline
- low : Malic_Acid, Ash, Ash_Alcanity, Magnesium, Total_Phenols, Flavanoids, Nonflavanoid_Phenols, Proanthocyanins, Hue, OD280
Cluster 3 is wine with
- high : Malic_Acid, Total_Phenols, Flavanoids, Nonflavanoid_Phenols, Proanthocyanins, Color_Intensity, Hue, OD280
- medium : Ash_Alcanity
- low : Alcohol, Ash, Magnesium, Proline

# package factoextra
fviz_cluster(object = wine_km3,
             data = wine1_Norm %>% select(-cluster))

With plot clustering above, we can see for each cluster clearly separated.

9 Conclusion

We can get the conclusion from the analysis above, such as : - Clustering model can clearly separated best for profiling for each cluster - PCA Model can see correlations between variables, but not good enough to do profiling

LBB-UL-Melia Sie

Melia Sie

12/26/2021