1 Business Problem

The wine company can use unsupervised learning to cluster their wine data set. This will allow them to identify different types of wines and understand the characteristics of each cluster. The company can then use this information to improve their wine production and marketing strategies.

2 Import Library and Read Data

2.1 Import Library

First, we need to load the required libraries.

# import library
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(corrplot)
## corrplot 0.92 loaded
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(knitr)

2.2 Read Data

Second, we read the data and add an id column taken from the row names.

# read data
wine <- read.csv("wine.csv")

wine
# insert index row
id <- rownames(wine)
wine <- cbind(id=id, wine)
wine

3 Data Wrangling

We perform data wrangling by ensuring that the data types are appropriate and that the data have no missing values or significant outliers.

3.1 Check data types and usable variables

# check data type
glimpse(wine)
## Rows: 178
## Columns: 15
## $ id                   <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", "10"…
## $ Wine                 <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ Alcohol              <dbl> 14.23, 13.20, 13.16, 14.37, 13.24, 14.20, 14.39, …
## $ Malic.acid           <dbl> 1.71, 1.78, 2.36, 1.95, 2.59, 1.76, 1.87, 2.15, 1…
## $ Ash                  <dbl> 2.43, 2.14, 2.67, 2.50, 2.87, 2.45, 2.45, 2.61, 2…
## $ Acl                  <dbl> 15.6, 11.2, 18.6, 16.8, 21.0, 15.2, 14.6, 17.6, 1…
## $ Mg                   <int> 127, 100, 101, 113, 118, 112, 96, 121, 97, 98, 10…
## $ Phenols              <dbl> 2.80, 2.65, 2.80, 3.85, 2.80, 3.27, 2.50, 2.60, 2…
## $ Flavanoids           <dbl> 3.06, 2.76, 3.24, 3.49, 2.69, 3.39, 2.52, 2.51, 2…
## $ Nonflavanoid.phenols <dbl> 0.28, 0.26, 0.30, 0.24, 0.39, 0.34, 0.30, 0.31, 0…
## $ Proanth              <dbl> 2.29, 1.28, 2.81, 2.18, 1.82, 1.97, 1.98, 1.25, 1…
## $ Color.int            <dbl> 5.64, 4.38, 5.68, 7.80, 4.32, 6.75, 5.25, 5.05, 5…
## $ Hue                  <dbl> 1.04, 1.05, 1.03, 0.86, 1.04, 1.05, 1.02, 1.06, 1…
## $ OD                   <dbl> 3.92, 3.40, 3.17, 3.45, 2.93, 2.85, 3.58, 3.58, 2…
## $ Proline              <int> 1065, 1050, 1185, 1480, 735, 1450, 1290, 1295, 10…
unique(wine$Wine)
## [1] 1 2 3

We don’t need the Wine column: k-means is an unsupervised machine learning algorithm and works with unlabeled data.

# index names
rownames(wine) <- wine$id

# drop column `Wine`
# way 1
wine1 <- wine %>% 
  select(-c(id, Wine))



wine1
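
As a second way (a base-R sketch of the same column drop, not part of the original workflow), we can subset the columns by name:

# way 2 (base R): drop `id` and `Wine` by name
wine2 <- wine[, !(names(wine) %in% c("id", "Wine"))]
head(wine2)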

3.2 Explore data

Check the first rows of the data

# check data - first row
kable(head(wine1))
Alcohol Malic.acid Ash Acl Mg Phenols Flavanoids Nonflavanoid.phenols Proanth Color.int Hue OD Proline
14.23 1.71 2.43 15.6 127 2.80 3.06 0.28 2.29 5.64 1.04 3.92 1065
13.20 1.78 2.14 11.2 100 2.65 2.76 0.26 1.28 4.38 1.05 3.40 1050
13.16 2.36 2.67 18.6 101 2.80 3.24 0.30 2.81 5.68 1.03 3.17 1185
14.37 1.95 2.50 16.8 113 3.85 3.49 0.24 2.18 7.80 0.86 3.45 1480
13.24 2.59 2.87 21.0 118 2.80 2.69 0.39 1.82 4.32 1.04 2.93 735
14.20 1.76 2.45 15.2 112 3.27 3.39 0.34 1.97 6.75 1.05 2.85 1450

Check the last rows of the data

# check data - last row
kable(tail(wine1))
Alcohol Malic.acid Ash Acl Mg Phenols Flavanoids Nonflavanoid.phenols Proanth Color.int Hue OD Proline
173 14.16 2.51 2.48 20.0 91 1.68 0.70 0.44 1.24 9.7 0.62 1.71 660
174 13.71 5.65 2.45 20.5 95 1.68 0.61 0.52 1.06 7.7 0.64 1.74 740
175 13.40 3.91 2.48 23.0 102 1.80 0.75 0.43 1.41 7.3 0.70 1.56 750
176 13.27 4.28 2.26 20.0 120 1.59 0.69 0.43 1.35 10.2 0.59 1.56 835
177 13.17 2.59 2.37 20.0 120 1.65 0.68 0.53 1.46 9.3 0.60 1.62 840
178 14.13 4.10 2.74 24.5 96 2.05 0.76 0.56 1.35 9.2 0.61 1.60 560

Check the data dimensions

# check dimension data
dim(wine1)
## [1] 178  13

3.3 Check Missing Value

# Check available missing value
anyNA(wine)
## [1] FALSE
# Check missing value per variable data
wine %>% 
  is.na() %>% 
  colSums()
##                   id                 Wine              Alcohol 
##                    0                    0                    0 
##           Malic.acid                  Ash                  Acl 
##                    0                    0                    0 
##                   Mg              Phenols           Flavanoids 
##                    0                    0                    0 
## Nonflavanoid.phenols              Proanth            Color.int 
##                    0                    0                    0 
##                  Hue                   OD              Proline 
##                    0                    0                    0

There are no missing values in the data above, so we can proceed to the next step.

3.4 Check Outlier

We check for outliers with ggplot. We exclude Mg (magnesium) and Proline here because their values are much larger than those of the other attributes and would distort the boxplot scale.

# Check outlier - ggplot for each Attribute, exclude : `magnesium` and `proline`  
wine1 %>%
  gather(Attributes, values, c(1:4, 6:12)) %>%
  ggplot(aes(x=reorder(Attributes, values, FUN=median), y=values, fill=Attributes)) +
  geom_boxplot(show.legend=FALSE) +
  labs(title="Wines Attributes - Boxplots") +
  theme_bw() +
  theme(axis.title.y=element_blank(),
        axis.title.x=element_blank()) +
  ylim(0, 35) +
  coord_flip()

Check outliers with a boxplot of the scaled data

# check outlier with boxplot - all variables
boxplot(scale(wine1))
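
For a numeric complement to the boxplots, a minimal sketch (our own check, using the 1.5 × IQR rule) counts how many values fall outside the whiskers for each attribute:

# count potential outliers per attribute with the 1.5 * IQR rule
sapply(wine1, function(x) {
  q <- quantile(x, c(0.25, 0.75))
  iqr <- q[2] - q[1]
  sum(x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr)
})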

4 Exploratory Data Analysis

In this step, we explore the data further through visualization.

4.1 Visualize data with Histogram

Histogram for each Attribute

# Histogram for each Attribute
wine1 %>%
  gather(Attributes, value, 1:13) %>%
  mutate(value = as.numeric(value)) %>%
  ggplot(aes(x=value, fill=Attributes)) +
  geom_histogram(colour="black", show.legend=FALSE) +
  facet_wrap(~Attributes, scales="free_x") +
  labs(x="Values", y="Frequency",
       title="Wines Attributes - Histograms") +
  theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

From the histograms above, we can see that Malic.acid has the highest single-bin frequency (above 30, around a value of 1.8) compared to the other variables.

4.2 Visualize data with Density Plot

# Density plot for each Attribute
wine1 %>%
  gather(Attributes, value, 1:13) %>%
  ggplot(aes(x=value, fill=Attributes)) +
  geom_density(colour="black", alpha=0.5, show.legend=FALSE) +
  facet_wrap(~Attributes, scales="free_x") +
  labs(x="Values", y="Density",
       title="Wines Attributes - Density plots") +
  theme_bw()

From the density plots above, we can see that Nonflavanoid.phenols is concentrated in a denser region than the other variables. Attributes such as Acl (alcalinity of ash, 63 unique values), Color.int (101 unique values), Mg (53 unique values) and Proline (101 unique values) take many distinct values, so their density curves are almost flat.
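
The counts of distinct values quoted above can be verified with a quick sketch (using the column names of wine1):

# number of distinct values per attribute
sapply(wine1, function(x) length(unique(x)))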

4.3 Check correlation for variable data

How about the relationships between the different attributes? We can use the ggcorr() function from the GGally package (or corrplot() from corrplot) to create a graphical display of the correlation matrix.

# Correlation matrix 

# way 1
ggcorr(wine1, label = T, hjust=0.90)
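
As a second way (a sketch using the corrplot package we loaded earlier), the same correlation matrix can be drawn with corrplot():

# way 2
corrplot(cor(wine1), method = "circle", type = "upper", tl.col = "black")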

There is a strong linear correlation between Phenols and Flavanoids (0.9). We can visualize the relationship between these two variables by fitting a linear equation.

# Relationship between Phenols and Flavanoids
ggplot(wine1, aes(x=Phenols, y=Flavanoids)) +
  geom_point() +
  geom_smooth(method="lm", se=FALSE) +
  labs(title="Wines Attributes",
       subtitle="Relationship between Phenols and Flavanoids") +
  theme_bw()
## `geom_smooth()` using formula = 'y ~ x'
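
If we want the fitted linear equation itself rather than only the smoothed line, a minimal sketch with lm() (its coefficients are not reported in this analysis) is:

# fit Flavanoids as a linear function of Phenols
lm_pf <- lm(Flavanoids ~ Phenols, data = wine1)
coef(lm_pf)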

Now that we have done the exploratory data analysis, we can prepare the data in order to execute the k-means algorithm.

5 Data Pre-processing

We have to normalize the variables to express them in the same range of values. In other words, normalization means adjusting values measured on different scales to a common scale; here we use scale(), which standardizes each variable to mean 0 and standard deviation 1.
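
As an illustration of what scale() does, the sketch below reproduces the z-score for a single attribute by hand (illustrative only):

# z-score by hand for Alcohol; equivalent to scale(wine1)[, "Alcohol"]
alcohol_z <- (wine1$Alcohol - mean(wine1$Alcohol)) / sd(wine1$Alcohol)
head(alcohol_z)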

set.seed(123)

# Normalization
wine1_Norm <- as.data.frame(scale(wine1))

# Original data
p1 <- ggplot(wine1, aes(x=Alcohol, y=Malic.acid)) +
  geom_point() +
  labs(title="Original data") +
  theme_bw()

# Normalized data 
p2 <- ggplot(wine1_Norm, aes(x=Alcohol, y=Malic.acid)) +
  geom_point() +
  labs(title="Normalized data") +
  theme_bw()

# Subplot
grid.arrange(p1, p2, ncol=2)

Based on the plot above, the arrangement of the points in the normalized data is the same as in the original data; the only thing that changes is the scale of the axes.

6 k-means execution

We are now going to execute the k-means algorithm. As a first try, we fit a clustering model with k = 2.

# Execution of k-means with k=2
set.seed(1234)
wine_km2 <- kmeans(wine1_Norm, centers=2)
summary(wine_km2)
##              Length Class  Mode   
## cluster      178    -none- numeric
## centers       26    -none- numeric
## totss          1    -none- numeric
## withinss       2    -none- numeric
## tot.withinss   1    -none- numeric
## betweenss      1    -none- numeric
## size           2    -none- numeric
## iter           1    -none- numeric
## ifault         1    -none- numeric
# Clustering (with k=2)
ggpairs(cbind(wine1, Cluster=as.factor(wine_km2$cluster)),
        columns=1:6, aes(colour=Cluster, alpha=0.5),
        lower=list(continuous="points"),
        upper=list(continuous="blank"),
        axisLabels="none", switch="both") +
        theme_bw()

Based on the result above, we focus on three components: cluster, centers and size.

6.1 Cluster

Cluster is a vector of integers indicating the cluster to which each point is allocated.

# Cluster to which each point is allocated
wine_km2$cluster
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
##  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
##  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   2 
##  61  62  63  64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80 
##   2   2   2   1   2   1   1   2   2   1   2   1   2   1   1   2   1   2   1   1 
##  81  82  83  84  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100 
##   1   1   2   2   1   1   2   2   2   2   2   2   2   1   1   1   2   1   1   1 
## 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 
##   1   2   2   2   1   2   2   2   2   1   1   2   2   2   2   1   2   2   2   2 
## 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 
##   1   1   2   2   1   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2 
## 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 
##   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2 
## 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 
##   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2

6.2 Centers

Centers is a matrix of cluster centers.

# Cluster centers
wine_km2$centers
##      Alcohol Malic.acid         Ash        Acl         Mg    Phenols Flavanoids
## 1  0.3248845 -0.3529345  0.05207966 -0.4899811  0.3206911  0.7826625  0.8235093
## 2 -0.3106038  0.3374209 -0.04979045  0.4684435 -0.3065948 -0.7482598 -0.7873111
##   Nonflavanoid.phenols    Proanth  Color.int        Hue         OD    Proline
## 1           -0.5921337  0.6378483 -0.1024529  0.5633135  0.7146506  0.6051873
## 2            0.5661058 -0.6098110  0.0979495 -0.5385525 -0.6832374 -0.5785857

6.3 Size

Size is the number of points in each cluster.

# Cluster size
wine_km2$size
## [1] 87 91

The kmeans() function also returns sums-of-squares components (withinss, tot.withinss, betweenss, totss) that let us know how compact each cluster is and how different the clusters are from one another.
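
For example, a short sketch of those components for the k = 2 model (values depend on the run above):

# compactness / separation of the k = 2 model
wine_km2$tot.withinss                 # total within-cluster sum of squares (lower = more compact)
wine_km2$betweenss / wine_km2$totss   # share of variance between clusters (higher = better separated)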

6.4 Optimum K value

We check the class of the wine_km2 object.

# check class object `wine_km2`
class(wine_km2)
## [1] "kmeans"

We need to find the optimal k to determine the best clustering. We seek to minimize the total within-cluster sum of squares, meaning that the distances between observations in the same cluster are as small as possible.

# elbow method (within-cluster sum of squares)
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
fviz_nbclust(wine1_Norm,#data scaled
             kmeans,#function
             method = "wss")
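
The same elbow curve can also be built by hand; the sketch below (an alternative to fviz_nbclust(), with nstart chosen by us) computes the total within-cluster sum of squares for k = 1 to 10:

# manual elbow: total within-cluster SS for k = 1..10
set.seed(1234)
wss <- sapply(1:10, function(k) kmeans(wine1_Norm, centers = k, nstart = 25)$tot.withinss)
plot(1:10, wss, type = "b", xlab = "Number of clusters K", ylab = "Total within-cluster SS")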

To see the optimal number of clusters K more clearly, we can also use the fviz_nbclust() function with the silhouette method, as shown below.

# give the line for optimum cluster K
fviz_nbclust(wine1_Norm , kmeans, method = "silhouette")

From the two plots above we can determine the optimal number of clusters K: after k = 3, increasing K does not result in a considerable decrease of the total within-cluster sum of squares, so k = 3 already gives strong internal cohesion.

# Execution of k-means with k=3
set.seed(1234)

wine_km3 <- kmeans(wine1_Norm, centers=3)

# Mean values of each cluster (k = 3)
aggregate(wine1, by=list(wine_km3$cluster), mean)
# Clustering 
ggpairs(cbind(wine1, Cluster=as.factor(wine_km3$cluster)),
        columns=1:6, aes(colour=Cluster, alpha=0.5),
        lower=list(continuous="points"),
        upper=list(continuous="blank"),
        axisLabels="none", switch="both") +
        theme_bw()

In the plot with k = 3 above, we can see that Mg (magnesium) already separates the clusters more clearly than Ash does.

7 Unsupervised Learning: Principal Component Analysis

7.1 Build and visualize the PCA

We first build a PCA model on the data that still contains the outliers.

# build PCA with outliers
library(FactoMineR)
wine_pca <- PCA(wine1, 
                     scale.unit = T, 
                     graph = F,
                     ncp = 6)

We can see the outliers by using the plot.PCA() function below.

# plot the individuals; highlight the 10 with the highest contribution (potential outliers)
plot.PCA(wine_pca,
         choix = "ind", # plot the individuals
         select = "contrib 10")

We remove these outliers to obtain a cleaner data set.

# save index outlier in object "outlier"
outlier <- c(4,15,19,116,137,138,147,156,174,178)
# drop data outlier
wine_normal <- wine1[-outlier,]
wine_normal

7.2 Check eigenvalues

# check variance for each PC
# get 80% information data
wine_pca$eig
##         eigenvalue percentage of variance cumulative percentage of variance
## comp 1   4.7058503             36.1988481                          36.19885
## comp 2   2.4969737             19.2074903                          55.40634
## comp 3   1.4460720             11.1236305                          66.52997
## comp 4   0.9189739              7.0690302                          73.59900
## comp 5   0.8532282              6.5632937                          80.16229
## comp 6   0.6416570              4.9358233                          85.09812
## comp 7   0.5510283              4.2386793                          89.33680
## comp 8   0.3484974              2.6807489                          92.01754
## comp 9   0.2888799              2.2221534                          94.23970
## comp 10  0.2509025              1.9300191                          96.16972
## comp 11  0.2257886              1.7368357                          97.90655
## comp 12  0.1687702              1.2982326                          99.20479
## comp 13  0.1033779              0.7952149                         100.00000

The first five principal components (PC 1 to PC 5) retain about 80% of the information in the data.
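
If we want to work with this reduced representation, a minimal sketch extracting the first five PC scores from the FactoMineR object is:

# keep the first 5 principal component scores (about 80% of the variance)
wine_pc5 <- as.data.frame(wine_pca$ind$coord[, 1:5])
head(wine_pc5)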

8 Combining Clustering and PCA

8.1 Visualization for PCA

# visualization: PCA + k-means clustering

fviz_pca_biplot(wine_pca,
                habillage = 13,
                geom.ind = "point",
                addEllipses = T)
## (repeated warnings omitted: "Too few points to calculate an ellipse" and "unimplemented pch value")

This plot is not recommended: the data are not clearly separated and the display is far too crowded, because habillage = 13 turns a numeric variable into the grouping factor (one "group" per distinct value), which also produces the warnings above.
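
A less crowded alternative (our own sketch, not part of the original report) is to colour the individuals by the k-means clusters from section 6 instead of by a raw variable:

# biplot coloured by the k = 3 cluster labels
fviz_pca_biplot(wine_pca,
                habillage = as.factor(wine_km3$cluster),
                geom.ind = "point",
                addEllipses = TRUE)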

# scree plot: variance explained by each dimension
fviz_eig(wine_pca, ncp = 15, addlabels = T, main = "Explained variance by each dimensions")

About 80% of the variance can be explained using only the first 5 dimensions, and the first two dimensions alone explain about 55% of the total variance.

fviz_pca_var(wine_pca, select.var = list(contrib = 31), col.var = "contrib", 
    gradient.cols = c("#FF3333", "#666600", "#339999"), repel = TRUE)

This plot captures about 55% of the variance in the data and describes the characteristics of each variable. Wine with high OD tends to have less Malic.acid, while Phenols, Flavanoids and Proanth share similar characteristics.

Phenols, Flavanoids, Proanth and OD have high correlation values, ranging from about 0.7 to 0.9.
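
We can check these values with a quick sketch on the raw attributes:

# pairwise correlations among the strongly related attributes
round(cor(wine1[, c("Phenols", "Flavanoids", "Proanth", "OD")]), 2)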

8.2 Visualization for Cluster and Profiling

We now inspect the full k-means result with centers = 3 (the optimal k = 3).

# centers = 3 (optimum k= 3)
wine_km3
## K-means clustering with 3 clusters of sizes 62, 65, 51
## 
## Cluster means:
##      Alcohol Malic.acid        Ash        Acl          Mg     Phenols
## 1  0.8328826 -0.3029551  0.3636801 -0.6084749  0.57596208  0.88274724
## 2 -0.9234669 -0.3929331 -0.4931257  0.1701220 -0.49032869 -0.07576891
## 3  0.1644436  0.8690954  0.1863726  0.5228924 -0.07526047 -0.97657548
##    Flavanoids Nonflavanoid.phenols     Proanth  Color.int        Hue         OD
## 1  0.97506900          -0.56050853  0.57865427  0.1705823  0.4726504  0.7770551
## 2  0.02075402          -0.03343924  0.05810161 -0.8993770  0.4605046  0.2700025
## 3 -1.21182921           0.72402116 -0.77751312  0.9388902 -1.1615122 -1.2887761
##      Proline
## 1  1.1220202
## 2 -0.7517257
## 3 -0.4059428
## 
## Clustering vector:
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
##  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
##  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   2 
##  61  62  63  64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80 
##   2   3   2   2   2   2   2   2   2   2   2   2   2   1   2   2   2   2   2   2 
##  81  82  83  84  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100 
##   2   2   2   3   2   2   2   2   2   2   2   2   2   2   2   1   2   2   2   2 
## 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 
##   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   3   2 
## 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 
##   2   1   2   2   2   2   2   2   2   2   3   3   3   3   3   3   3   3   3   3 
## 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 
##   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3 
## 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 
##   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3 
## 
## Within cluster sum of squares by cluster:
## [1] 385.6983 558.6971 326.3537
##  (between_SS / total_SS =  44.8 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"
head(wine_km3$cluster)
## 1 2 3 4 5 6 
## 1 1 1 1 1 1
head(wine1_Norm$cluster)
## NULL
# attach the cluster labels from `wine_km3` to `wine1_Norm`
wine1_Norm$cluster <- wine_km3$cluster

# profiling - way 1
# do profiling with summarise data
wine_profile <- wine1_Norm %>% 
  group_by(cluster) %>% 
  summarise_all(mean)

wine_profile

Thresholds for the standardized cluster means: high: x > 0.6; medium: 0.5 ≤ x < 0.6; low: x < 0.5.
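
These cut-offs can be applied to the cluster means programmatically; a small sketch (our own helper, with the boundaries approximated by cut()) is:

# label each standardized cluster mean as low / medium / high
wine_profile %>% 
  mutate(across(-cluster, ~ cut(.x,
                                breaks = c(-Inf, 0.5, 0.6, Inf),
                                labels = c("low", "medium", "high"),
                                right = FALSE)))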

Profiling:

  • Cluster 1 is wine with high Alcohol, Phenols, Flavanoids, OD and Proline; medium Mg, Nonflavanoid.phenols and Proanth; low Malic.acid, Ash, Color.int and Hue.
  • Cluster 2 is wine with high Alcohol, Color.int and Proline; low Malic.acid, Ash, Mg, Phenols, Flavanoids, Nonflavanoid.phenols, Proanth, Hue and OD.
  • Cluster 3 is wine with high Malic.acid, Phenols, Flavanoids, Nonflavanoid.phenols, Proanth, Color.int, Hue and OD; low Alcohol, Ash, Mg and Proline.
# package factoextra
fviz_cluster(object = wine_km3,
             data = wine1_Norm %>% select(-cluster))

With the cluster plot above, we can see that each cluster is clearly separated.

9 Conclusion

We can draw the following conclusions from the analysis above:

  • The clustering model clearly separates the different clusters, making it the better option for profiling each cluster.
  • The PCA model reveals correlations between the variables, but on its own it is not sufficient for profiling.