The objective of this report is to study and practice data visualization with ggplot2, machine learning with caret, and cluster analysis, comparing several different methods. The report is expected to help the company identify the characteristics of high-quality wine and the most important factors for improving wine quality.

1. Background

1.1 The data set

A wine company sent 178 samples for characteristic testing. They want to know how the wines differ between customer segments.

1.2 The method

  • Use tidyverse for data wrangling.
  • Use caret for modeling.
  • Use corrplot to evaluate the correlation between variables.
  • Use GGally for cluster analysis with k-means.
  • Use ggplot2 and gridExtra for data visualization.
  • Use principal component analysis with base R.
  • Use MASS for linear discriminant analysis and its visualization.

2. Import libraries
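
The package-loading chunk is not shown in the source, so the list below is an assumed reconstruction based on the methods described above and the functions used later in the report (autoplot() requires ggfortify; cluster and lfda are loaded later, where they are used).

library(tidyverse)   # data wrangling (dplyr, tidyr) and ggplot2 graphics
library(caret)       # model training and validation
library(corrplot)    # correlation matrix plot
library(GGally)      # additional ggplot2-based plots
library(gridExtra)   # arranging multiple plots
library(MASS)        # lda() and ldahist()
library(ggfortify)   # autoplot() methods for prcomp, lfda and pam objects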

3. Import data

wine <- read.csv("C:/Users/dell/Desktop/Data/Bai tap R/Cluster/Wine.csv")

4. Data visualization

The first step is to get an overview of the data: the number of observations (rows), the number of columns (variables), and the types of variables (factor vs. numeric, continuous vs. discrete, ...).

str(wine)
## 'data.frame':    178 obs. of  14 variables:
##  $ Alcohol             : num  14.2 13.2 13.2 14.4 13.2 ...
##  $ Malic_Acid          : num  1.71 1.78 2.36 1.95 2.59 1.76 1.87 2.15 1.64 1.35 ...
##  $ Ash                 : num  2.43 2.14 2.67 2.5 2.87 2.45 2.45 2.61 2.17 2.27 ...
##  $ Ash_Alcanity        : num  15.6 11.2 18.6 16.8 21 15.2 14.6 17.6 14 16 ...
##  $ Magnesium           : int  127 100 101 113 118 112 96 121 97 98 ...
##  $ Total_Phenols       : num  2.8 2.65 2.8 3.85 2.8 3.27 2.5 2.6 2.8 2.98 ...
##  $ Flavanoids          : num  3.06 2.76 3.24 3.49 2.69 3.39 2.52 2.51 2.98 3.15 ...
##  $ Nonflavanoid_Phenols: num  0.28 0.26 0.3 0.24 0.39 0.34 0.3 0.31 0.29 0.22 ...
##  $ Proanthocyanins     : num  2.29 1.28 2.81 2.18 1.82 1.97 1.98 1.25 1.98 1.85 ...
##  $ Color_Intensity     : num  5.64 4.38 5.68 7.8 4.32 6.75 5.25 5.05 5.2 7.22 ...
##  $ Hue                 : num  1.04 1.05 1.03 0.86 1.04 1.05 1.02 1.06 1.08 1.01 ...
##  $ OD280               : num  3.92 3.4 3.17 3.45 2.93 2.85 3.58 3.58 2.85 3.55 ...
##  $ Proline             : int  1065 1050 1185 1480 735 1450 1290 1295 1045 1045 ...
##  $ Customer_Segment    : int  1 1 1 1 1 1 1 1 1 1 ...

There are 14 variables in total, of two data types: numeric and integer. However, the 14th variable should be a factor instead of an integer, so I will convert it to a factor.

wine$Customer_Segment <- as.factor(wine$Customer_Segment)

Now create some graphs for a quick view of the data.

wine %>% gather(1:13, key = "variables", value = "result") %>%
  ggplot(aes(Customer_Segment, result, fill = Customer_Segment)) +
  geom_jitter(color = "grey", alpha = 0.5)+
  geom_boxplot()+
  theme_classic()+
  facet_wrap(.~variables, scales = "free")

wine %>% gather(1:13, key = "variables", value = "result") %>%
  ggplot(aes(result)) +
  geom_histogram(aes(fill = variables), color = "white")+
  theme_classic()+
  facet_wrap(.~variables, scales = "free") +
  theme(legend.position = "none")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

wine %>% gather(1:13, key = "variables", value = "result") %>%
  ggplot(aes(result, fill = Customer_Segment)) +
  geom_density(alpha = 0.5)+
  theme_classic()+
  facet_wrap(.~variables, scales = "free")

Correlation graph between variables.

corrplot(cor(wine[,-14]),
         method = "color",
         type = "upper",
         addCoef.col = "white", number.cex = 0.7,
         tl.col = "black", tl.srt = 35, tl.cex = 0.7,
         cl.cex = 0.7, order = "hclust",
         mar = c(0, 0, 1, 0),   # leave room at the top so the title is not clipped
         title = "Correlation matrix of wine variables")

Flavanoids is highly correlated with several other variables. I may remove it later, after evaluating its impact on the model accuracy (one possible approach is sketched below).
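
The report does not show how the highly correlated variable would be removed; the sketch below shows one common approach, using caret's findCorrelation() to flag predictors above a chosen correlation cutoff (the 0.75 threshold here is an assumption, not from the original analysis):

# Flag predictors whose pairwise correlation exceeds the (hypothetical) cutoff
high_corr <- findCorrelation(cor(wine[,-14]), cutoff = 0.75, names = TRUE)
high_corr                                               # candidate columns to drop
wine_reduced <- wine[, !(names(wine) %in% high_corr)]   # data set without those columns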

5. Model building

Split the data into two sets: one for training the model and one for validating the model accuracy.

set.seed(123)
index <- createDataPartition(wine$Customer_Segment, times = 1, p = 0.8, list = F)
train_set <- wine[index,]
test_set <- wine[-index,]

5.1 Classification model

I train a random forest classifier on the training set with 3-fold cross-validation, then evaluate it on the held-out test set.

fit_rf <- train(train_set[,-14],
                 train_set$Customer_Segment,
                 method = "rf",
                 trControl = trainControl(method = "cv",
                                          number = 3,
                                          p = 0.8))
y_hat <- predict(fit_rf, test_set)
confusionMatrix(y_hat, test_set$Customer_Segment)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  1  2  3
##          1 11  0  0
##          2  0 14  0
##          3  0  0  9
## 
## Overall Statistics
##                                      
##                Accuracy : 1          
##                  95% CI : (0.8972, 1)
##     No Information Rate : 0.4118     
##     P-Value [Acc > NIR] : 7.908e-14  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3
## Sensitivity            1.0000   1.0000   1.0000
## Specificity            1.0000   1.0000   1.0000
## Pos Pred Value         1.0000   1.0000   1.0000
## Neg Pred Value         1.0000   1.0000   1.0000
## Prevalence             0.3235   0.4118   0.2647
## Detection Rate         0.3235   0.4118   0.2647
## Detection Prevalence   0.3235   0.4118   0.2647
## Balanced Accuracy      1.0000   1.0000   1.0000

A great result: accuracy on the test set is 100%.

So which variables have the most impact on the wine classification?

plot(varImp(fit_rf))

The results show that Flavanoids, Color_Intensity, and Proline are the three most important variables for the classification.

5.2 Cluster analysis with k-means

k-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells. [1]

K-means clustering is one of the simplest and most popular unsupervised machine learning algorithms.

Typically, unsupervised algorithms make inferences from datasets using only input vectors without referring to known, or labelled, outcomes.

AndreyBu, who has more than 5 years of machine learning experience and currently teaches people his skills, says that “the objective of K-means is simple: group similar data points together and discover underlying patterns. To achieve this objective, K-means looks for a fixed number (k) of clusters in a dataset.” You’ll define a target number k, which refers to the number of centroids you need in the dataset. A centroid is the imaginary or real location representing the center of the cluster.

Every data point is allocated to each of the clusters through reducing the in-cluster sum of squares.

In other words, the K-means algorithm identifies k number of centroids, and then allocates every data point to the nearest cluster, while keeping the centroids as small as possible.

The ‘means’ in the K-means refers to averaging of the data; that is, finding the centroid.[2]
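
Before running kmeans(), here is a minimal sketch (an illustration added here, not part of the original analysis) of the assignment step described above: each observation is assigned to the centroid with the smallest squared Euclidean distance.

# Illustrative only: one manual assignment step with 3 randomly chosen centroids
x <- scale(wine[, -14])                       # numeric features, standardized
set.seed(1)
centers <- x[sample(nrow(x), 3), ]            # pick 3 observations as initial centroids
d2 <- sapply(1:nrow(centers), function(k)     # squared distance of every point to centroid k
  rowSums(sweep(x, 2, centers[k, ])^2))
assignment <- max.col(-d2)                    # index of the nearest centroid for each point
table(assignment)                             # how many points fall in each provisional cluster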

# data.matrix() converts the Customer_Segment factor back to integer codes,
# so all columns are numeric; the segment column is therefore part of the clustering here.
wine_kmean <- kmeans(data.matrix(wine), centers = 2)
wine_kmean$centers
##    Alcohol Malic_Acid      Ash Ash_Alcanity Magnesium Total_Phenols
## 1 12.70285   2.544553 2.339106     20.40813  96.81301      2.062114
## 2 13.66655   1.870727 2.427818     17.45273 106.29091      2.816182
##   Flavanoids Nonflavanoid_Phenols Proanthocyanins Color_Intensity
## 1   1.641463            0.3926829        1.454065        4.851382
## 2   2.896545            0.2929091        1.896909        5.520364
##         Hue    OD280   Proline Customer_Segment
## 1 0.9086179 2.408211  565.8699         2.308943
## 2 1.0666545 3.066727 1151.7273         1.109091

What are the cluster sizes?

wine_kmean$size
## [1] 123  55
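
As a quick sanity check (not shown in the original report), we can cross-tabulate the two k-means clusters against the three customer segments:

table(cluster = wine_kmean$cluster, segment = wine$Customer_Segment)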

5.3 Optimize the number of clusters

The kmeans() function returns several statistics that tell us how compact each cluster is and how well separated the clusters are from each other.

betweenss. The between-cluster sum of squares. In an optimal segmentation, one expects this value to be as high as possible, since we would like the clusters to be heterogeneous (well separated from each other).

withinss. The vector of within-cluster sums of squares, one component per cluster. In an optimal segmentation, one expects these values to be as low as possible for each cluster, since we would like homogeneity within the clusters.

tot.withinss. Total within-cluster sum of squares.

totss. The total sum of squares.

In the k-means model above, I chose centers = 2. Let's review the performance of this model.

data.frame(betweenss = wine_kmean$betweenss,
           withinss = wine_kmean$withinss,
           tot.withinss = wine_kmean$tot.withinss,
           totss = wine_kmean$totss)
##   betweenss withinss tot.withinss    totss
## 1  13048601  2566072      4543801 17592403
## 2  13048601  1977729      4543801 17592403
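
A related summary (an added check, not in the original) is the share of the total sum of squares that is explained by the split, betweenss / totss; from the numbers above it is roughly 13048601 / 17592403 ≈ 0.74.

wine_kmean$betweenss / wine_kmean$totss   # proportion of total variation captured between clusters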

Now, to choose the best value for the centers parameter, I will loop over several candidate values (see the sketch below).
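
The loop itself is not shown in the source; the sketch below is one way it could look, computing the total within-cluster sum of squares for centers = 1 to 10 and looking for the "elbow" where it stops dropping quickly (nstart = 25 and the range 1:10 are assumptions):

wine_num <- data.matrix(wine)        # same numeric matrix used for kmeans() above
set.seed(123)
tot_wss <- sapply(1:10, function(k) kmeans(wine_num, centers = k, nstart = 25)$tot.withinss)
plot(1:10, tot_wss, type = "b", xlab = "centers", ylab = "tot.withinss")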

centers = 3 or 4 looks like the best choice in this case. Now build the k-means model with centers = 3 (a sketch follows).
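
A sketch of that model (the original chunk is not shown; the seed and nstart are assumptions), together with a comparison of the three clusters against the three customer segments:

set.seed(123)
wine_kmean3 <- kmeans(data.matrix(wine), centers = 3, nstart = 25)
wine_kmean3$size                                            # observations per cluster
table(cluster = wine_kmean3$cluster, segment = wine$Customer_Segment)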

5.4 Principal component analysis

wine_pca <- prcomp(wine[,-14])
summary(wine_pca)
## Importance of components:
##                             PC1      PC2     PC3     PC4     PC5     PC6
## Standard deviation     314.9632 13.13527 3.07215 2.23409 1.10853 0.91710
## Proportion of Variance   0.9981  0.00174 0.00009 0.00005 0.00001 0.00001
## Cumulative Proportion    0.9981  0.99983 0.99992 0.99997 0.99998 0.99999
##                           PC7    PC8    PC9   PC10   PC11   PC12    PC13
## Standard deviation     0.5282 0.3891 0.3348 0.2678 0.1938 0.1452 0.09057
## Proportion of Variance 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.00000
## Cumulative Proportion  1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.00000

The first PC describes 99.81% of the variance.

plot(wine_pca, col = "steelblue")
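
Note that PC1 dominates mainly because prcomp() was run on unscaled data, where Proline (values in the hundreds to thousands) has by far the largest variance. A scaled PCA is a common cross-check; the sketch below is an addition, not part of the original analysis:

wine_pca_scaled <- prcomp(wine[,-14], scale. = TRUE)   # standardize each variable first
summary(wine_pca_scaled)                               # variance is now spread across more PCs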

head(wine_pca$x)
##             PC1         PC2        PC3        PC4        PC5         PC6
## [1,] -318.56298 -21.4921307  3.1307347  0.2501138  0.6770782 -0.56808104
## [2,] -303.09742   5.3647177  6.8228355  0.8640347 -0.4860960 -0.01433987
## [3,] -438.06113   6.5373094 -1.1132230 -0.9124107  0.3806514 -0.67240375
## [4,] -733.24014  -0.1927290 -0.9172570  0.5412506  0.8586623 -0.59912165
## [5,]   11.57143 -18.4899946 -0.5544221 -1.3608961  0.2764416 -0.76888353
## [6,] -703.23119   0.3321587  0.9493753  0.3599938  0.1568271 -0.06101136
##              PC7         PC8         PC9        PC10        PC11
## [1,]  0.61964183 -0.19955538  0.70128028  0.09500757  0.08873400
## [2,] -0.10886512  0.60471445  0.28671685  0.04578198  0.03977819
## [3,] -0.78581886 -0.50088570  0.02454666  0.20895977  0.23777003
## [4,] -0.01877023  0.19042835  0.05427684 -0.53168357 -0.09604379
## [5,]  0.30997581  0.11909101 -0.19584299 -0.06177064  0.31646644
## [6,] -0.02607580  0.09780938 -0.38809618 -0.10195325 -0.03064081
##              PC12        PC13
## [1,] -0.038547563  0.08026443
## [2,] -0.057191577  0.01359275
## [3,] -0.048797875 -0.03540816
## [4,] -0.166353072  0.01634436
## [5,] -0.007118042  0.01527761
## [6,] -0.031614149  0.07486414
data.frame(PC1 = wine_pca$x[,1], PC2 = wine_pca$x[,2], Customer_Segment = wine[,14])%>%
  ggplot(aes(PC1, PC2, color = Customer_Segment))+
  geom_point(size = 3, alpha = 0.6)+
  stat_ellipse()+
  theme_bw()

autoplot(wine_pca, data = wine, colour = "Customer_Segment",
         frame = T, frame.colour = "Customer_Segment")

5.5 Linear Discriminant Analysis

wlda <- lda(Customer_Segment ~., data = wine)
wlda
## Call:
## lda(Customer_Segment ~ ., data = wine)
## 
## Prior probabilities of groups:
##         1         2         3 
## 0.3314607 0.3988764 0.2696629 
## 
## Group means:
##    Alcohol Malic_Acid      Ash Ash_Alcanity Magnesium Total_Phenols
## 1 13.74475   2.010678 2.455593     17.03729  106.3390      2.840169
## 2 12.27873   1.932676 2.244789     20.23803   94.5493      2.258873
## 3 13.15375   3.333750 2.437083     21.41667   99.3125      1.678750
##   Flavanoids Nonflavanoid_Phenols Proanthocyanins Color_Intensity
## 1  2.9823729             0.290000        1.899322        5.528305
## 2  2.0808451             0.363662        1.630282        3.086620
## 3  0.7814583             0.447500        1.153542        7.396250
##         Hue    OD280   Proline
## 1 1.0620339 3.157797 1115.7119
## 2 1.0562817 2.785352  519.5070
## 3 0.6827083 1.683542  629.8958
## 
## Coefficients of linear discriminants:
##                               LD1           LD2
## Alcohol              -0.403399781  0.8717930699
## Malic_Acid            0.165254596  0.3053797325
## Ash                  -0.369075256  2.3458497486
## Ash_Alcanity          0.154797889 -0.1463807654
## Magnesium            -0.002163496 -0.0004627565
## Total_Phenols         0.618052068 -0.0322128171
## Flavanoids           -1.661191235 -0.4919980543
## Nonflavanoid_Phenols -1.495818440 -1.6309537953
## Proanthocyanins       0.134092628 -0.3070875776
## Color_Intensity       0.355055710  0.2532306865
## Hue                  -0.818036073 -1.5156344987
## OD280                -1.157559376  0.0511839665
## Proline              -0.002691206  0.0028529846
## 
## Proportion of trace:
##    LD1    LD2 
## 0.6875 0.3125
wpredict <- predict(wlda)
head(wpredict$x)
##         LD1       LD2
## 1 -4.700244 1.9791383
## 2 -4.301958 1.1704129
## 3 -3.420720 1.4291014
## 4 -4.205754 4.0028715
## 5 -1.509982 0.4512239
## 6 -4.518689 3.2131376
ldahist(data = wpredict$x[,1], g = wine$Customer_Segment)

ldahist(data = wpredict$x[,2], g = wine$Customer_Segment)

data.frame(LD1 = wpredict$x[,1], LD2 = wpredict$x[,2], Type = wine$Customer_Segment) %>%
  ggplot(aes(LD1, LD2, color = Type))+
  geom_point(size = 3, alpha = 0.6)+
  stat_ellipse()+
  theme_bw()
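
As a quick check (an addition, not in the original report) of how well the LDA model classifies the wines (note this is in-sample accuracy, since wlda was fitted on the full data set):

table(predicted = wpredict$class, actual = wine$Customer_Segment)
mean(wpredict$class == wine$Customer_Segment)   # overall in-sample accuracy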

library(lfda)
## Warning: package 'lfda' was built under R version 3.5.3
m = lfda(wine[,-14],
         wine[,14],
         r = 4,
         metric = "plain")
autoplot(m, data = wine, frame = T, frame.colour = "Customer_Segment")

5.6 Clustering with a dendrogram

d <- dist(wine[,-14])
hc <- hclust(d)
plot(hc, labels = F, hang = -1)
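
Cutting the tree into three groups (a sketch added here; the original does not show this step) lets us compare the hierarchical clusters with the customer segments:

hc_groups <- cutree(hc, k = 3)                 # cut the dendrogram into 3 clusters
table(cluster = hc_groups, segment = wine$Customer_Segment)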

library(cluster)
wpam <- pam(wine[,-14], 3)
autoplot(wpam, frame = TRUE, frame.type = "norm")

6. Conclusion

  1. The data reflect the characteristics of the wines well: we can build a prediction model with 100% accuracy, validated on the test set. Random Forest is a good choice for the prediction model. The company can apply it to classify wine quality and set appropriate prices for different wines.
  2. Based on the three most important variables (Proline, Flavanoids, Color_Intensity), the company can narrow its focus and concentrate research and development resources on improving wine quality through these variables.
  3. For this data set, linear discriminant analysis (LDA) separated the different wines into clusters better than the other methods.

7. References

[1] https://en.wikipedia.org/wiki/K-means_clustering

[2] https://towardsdatascience.com/understanding-k-means-clustering-in-machine-learning-6a6e67336aa1

[3] Other online resources.