Intro

I want to cluster coffees based on their characteristics.

Import the Required Libraries

library(tidyverse)
library(dplyr)
library(factoextra)
library(FactoMineR)

Read Dataset

coffee <- read.csv("coffee.csv")

str(coffee)

#> 'data.frame':    1082 obs. of  13 variables:
#>  $ coffeeId     : int  1 2 3 4 5 6 7 8 9 10 ...
#>  $ Aroma        : num  7.83 8 7.92 8 8.33 8 7.67 7.67 7.67 7.67 ...
#>  $ Flavor       : num  8.08 7.75 7.83 7.92 7.83 7.92 7.75 7.75 7.75 7.83 ...
#>  $ Aftertaste   : num  7.75 7.92 7.92 7.92 7.83 7.67 7.83 7.83 7.58 7.83 ...
#>  $ Acidity      : num  7.92 8 8 7.75 7.75 8 7.83 7.67 7.83 7.83 ...
#>  $ Body         : num  8.25 7.92 7.83 7.83 8.25 7.75 7.92 7.92 7.83 7.92 ...
#>  $ Balance      : num  7.92 7.92 7.92 7.75 7.75 7.92 7.75 7.83 8 7.75 ...
#>  $ Uniformity   : num  10 10 10 10 10 10 10 10 10 10 ...
#>  $ Clean.Cup    : num  10 10 10 10 10 10 10 10 10 10 ...
#>  $ Sweetness    : num  8 8 7.83 7.75 7.58 7.75 8 7.92 7.92 7.75 ...
#>  $ Cupper.Points: num  8 8 8 8.08 7.67 7.75 7.83 7.92 7.92 7.83 ...
#>  $ Moisture     : num  0.12 0 0 0.12 0.12 0 0 0.1 0.09 0.12 ...
#>  $ Quakers      : int  0 0 0 0 0 0 0 0 0 0 ...

coffee

Principal Component Analysis

Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables (entities each of which takes on various numerical values) into a set of values of linearly uncorrelated variables called principal components. This transformation is defined in such a way that the first principal component has the largest possible variance (that is, accounts for as much of the variability in the data as possible), and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to the preceding components. The resulting vectors (each being a linear combination of the variables and containing n observations) are an uncorrelated orthogonal basis set. PCA is sensitive to the relative scaling of the original variables.
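The effect of scaling can be sketched on a small synthetic data set. This is only an illustration: it uses base R's `prcomp()` rather than `FactoMineR::PCA()`, and the data frame `toy` and its columns are made up for the example, not taken from the coffee data.

```r
# Minimal sketch on synthetic data (not the coffee dataset).
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.3)   # strongly correlated with x1
x3 <- rnorm(n, sd = 100)              # independent, but on a much larger scale
toy <- data.frame(x1, x2, x3)

# Without scaling, the large-scale column x3 dominates the first component.
pca_raw <- prcomp(toy, center = TRUE, scale. = FALSE)
# With scaling, every variable enters with unit variance.
pca_std <- prcomp(toy, center = TRUE, scale. = TRUE)

# Share of variance explained by each component
var_raw <- pca_raw$sdev^2 / sum(pca_raw$sdev^2)
var_std <- pca_std$sdev^2 / sum(pca_std$sdev^2)
print(round(var_raw, 3))
print(round(var_std, 3))

# Component scores are uncorrelated: off-diagonal correlations are ~0.
score_cor <- cor(pca_std$x)
print(round(score_cor, 3))
```

In the unscaled fit, almost all of the variance lands on the first component simply because `x3` has the largest units. That is why the variables should be standardized before PCA.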

coffee_scale <- scale(coffee)

pca_coffee <- PCA(coffee_scale, scale.unit = F, ncp = 13, graph = F)
summary(pca_coffee)

#> 
#> Call:
#> PCA(X = coffee_scale, scale.unit = F, ncp = 13, graph = F) 
#> 
#> 
#> Eigenvalues
#>                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6
#> Variance               6.938   1.443   0.996   0.941   0.630   0.474
#> % of var.             53.418  11.114   7.670   7.244   4.853   3.653
#> Cumulative % of var.  53.418  64.531  72.202  79.446  84.299  87.951
#>                        Dim.7   Dim.8   Dim.9  Dim.10  Dim.11  Dim.12
#> Variance               0.353   0.313   0.247   0.230   0.175   0.155
#> % of var.              2.716   2.406   1.902   1.771   1.346   1.192
#> Cumulative % of var.  90.668  93.074  94.976  96.746  98.092  99.284
#>                       Dim.13
#> Variance               0.093
#> % of var.              0.716
#> Cumulative % of var. 100.000
#> 
#> Individuals (the 10 first)
#>                   Dist    Dim.1    ctr   cos2    Dim.2    ctr   cos2  
#> 1             |  4.858 |  2.793  0.104  0.331 | -2.668  0.456  0.302 |
#> 2             |  5.010 |  2.743  0.100  0.300 | -3.450  0.762  0.474 |
#> 3             |  5.123 |  2.623  0.092  0.262 | -3.604  0.832  0.495 |
#> 4             |  4.770 |  2.278  0.069  0.228 | -2.830  0.513  0.352 |
#> 5             |  5.331 |  2.434  0.079  0.208 | -2.997  0.575  0.316 |
#> 6             |  5.049 |  2.281  0.069  0.204 | -3.586  0.823  0.504 |
#> 7             |  4.597 |  1.978  0.052  0.185 | -3.284  0.691  0.510 |
#> 8             |  4.232 |  1.801  0.043  0.181 | -2.679  0.460  0.401 |
#> 9             |  4.208 |  1.792  0.043  0.181 | -2.740  0.481  0.424 |
#> 10            |  4.503 |  1.805  0.043  0.161 | -2.717  0.473  0.364 |
#>                Dim.3    ctr   cos2  
#> 1              0.367  0.012  0.006 |
#> 2              0.076  0.001  0.000 |
#> 3              0.106  0.001  0.000 |
#> 4              0.432  0.017  0.008 |
#> 5              0.446  0.018  0.007 |
#> 6              0.109  0.001  0.000 |
#> 7              0.050  0.000  0.000 |
#> 8              0.315  0.009  0.006 |
#> 9              0.280  0.007  0.004 |
#> 10             0.392  0.014  0.008 |
#> 
#> Variables (the 10 first)
#>                  Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
#> coffeeId      | -0.746  8.025  0.557 |  0.384 10.215  0.148 | -0.083
#> Aroma         |  0.855 10.542  0.732 | -0.066  0.303  0.004 |  0.021
#> Flavor        |  0.940 12.742  0.885 | -0.075  0.388  0.006 |  0.018
#> Aftertaste    |  0.933 12.536  0.871 | -0.087  0.520  0.008 |  0.013
#> Acidity       |  0.874 11.012  0.765 | -0.083  0.473  0.007 | -0.004
#> Body          |  0.854 10.518  0.730 | -0.084  0.490  0.007 | -0.007
#> Balance       |  0.890 11.416  0.793 | -0.076  0.401  0.006 | -0.005
#> Uniformity    |  0.596  5.122  0.356 |  0.519 18.670  0.270 | -0.051
#> Clean.Cup     |  0.534  4.107  0.285 |  0.523 18.952  0.274 | -0.067
#> Sweetness     |  0.412  2.445  0.170 |  0.733 37.229  0.538 | -0.104
#>                  ctr   cos2  
#> coffeeId       0.687  0.007 |
#> Aroma          0.045  0.000 |
#> Flavor         0.031  0.000 |
#> Aftertaste     0.018  0.000 |
#> Acidity        0.001  0.000 |
#> Body           0.005  0.000 |
#> Balance        0.003  0.000 |
#> Uniformity     0.261  0.003 |
#> Clean.Cup      0.455  0.005 |
#> Sweetness      1.077  0.011 |

Plot the PCA as well to check for outliers

plot.PCA(pca_coffee)

fviz_eig(pca_coffee, ncp = 13,  addlabels = T, main = "Variance explained by each dimension")

dimdesc(pca_coffee)

#> $Dim.1
#> $Dim.1$quanti
#>               correlation       p.value
#> Flavor          0.9406636  0.000000e+00
#> Aftertaste      0.9330372  0.000000e+00
#> Balance         0.8903672  0.000000e+00
#> Cupper.Points   0.8775643  0.000000e+00
#> Acidity         0.8744702  0.000000e+00
#> Aroma           0.8555946 4.061215e-311
#> Body            0.8546484 1.049902e-309
#> Uniformity      0.5963953 3.319503e-105
#> Clean.Cup       0.5340393  8.258183e-81
#> Sweetness       0.4120568  1.347442e-45
#> Moisture       -0.1751626  6.631645e-09
#> coffeeId       -0.7464989 2.702566e-193
#> 
#> 
#> $Dim.2
#> $Dim.2$quanti
#>               correlation       p.value
#> Sweetness      0.73340077 3.081049e-183
#> Clean.Cup      0.52327594  4.265727e-77
#> Uniformity     0.51937303  8.785091e-76
#> coffeeId       0.38416983  2.236144e-39
#> Moisture       0.37555249  1.424046e-37
#> Quakers        0.13601584  7.129601e-06
#> Aroma         -0.06616688  2.952929e-02
#> Flavor        -0.07489075  1.373761e-02
#> Balance       -0.07615240  1.222151e-02
#> Acidity       -0.08264565  6.527425e-03
#> Body          -0.08411397  5.630897e-03
#> Aftertaste    -0.08666795  4.332094e-03
#> Cupper.Points -0.13789572  5.302953e-06
#> 
#> 
#> $Dim.3
#> $Dim.3$quanti
#>           correlation      p.value
#> Quakers    0.97902816 0.0000000000
#> Moisture   0.11155973 0.0002361216
#> Clean.Cup -0.06733053 0.0267800937
#> coffeeId  -0.08273990 0.0064662377
#> Sweetness -0.10363297 0.0006398488

The plot shows that observation 1082 is an outlier, so I remove it. I also drop the first column, coffeeId, because it is an identifier rather than a characteristic.

coffee_new <- coffee[-1082,-1]
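The outlier above was picked by eye from the plot. As a rough cross-check, one could rank observations by their distance from the origin on the first two components. The sketch below does this on synthetic data with base R's `prcomp()`. The data frame `toy` and the planted extreme row are hypothetical. For the coffee data, the same idea could presumably be applied to the individual coordinates stored in `pca_coffee$ind$coord`.

```r
set.seed(2)
toy <- data.frame(a = rnorm(50), b = rnorm(50), c = rnorm(50))
toy[50, ] <- c(15, -15, 15)   # plant one extreme row (hypothetical outlier)

pca_toy <- prcomp(toy, center = TRUE, scale. = TRUE)

# Euclidean distance of each observation from the origin of the PC1-PC2 plane
dist_12 <- sqrt(rowSums(pca_toy$x[, 1:2]^2))

# The row with the largest distance is the outlier candidate
suspect <- unname(which.max(dist_12))
print(suspect)

# Remove it, in the same way as coffee[-1082, ] above
toy_clean <- toy[-suspect, ]
print(nrow(toy_clean))
```

A distance ranking like this only suggests candidates. Whether a row really should be removed is still a judgment call.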

K-Means

# Plot the total within-cluster sum of squares for k = 2..maxK (elbow plot)
kmeansTunning <- function(data, maxK = 10) {
  withinall <- NULL
  total_k <- NULL
  for (i in 2:maxK) {
    set.seed(654)  # same seed for every k so the plot is reproducible
    temp <- kmeans(data, i)$tot.withinss
    withinall <- append(withinall, temp)
    total_k <- append(total_k, i)
  }
  plot(x = total_k, y = withinall, type = "o",
       xlab = "Number of clusters", ylab = "Total within-cluster sum of squares")
}

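When using k-means, I have to choose the number of clusters k. One common way to choose it is the elbow method: plot the total within-cluster sum of squares against k and look for the point where the curve stops dropping sharply.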

kmeansTunning(coffee_new, maxK = 10)

From the graph, k = 3 looks like a reasonable choice.
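To make the "elbow" concrete, here is a minimal sketch on synthetic data with three well-separated groups. The matrix `toy`, the blob centres, and the `nstart = 20` setting are assumptions made for the illustration and are not part of the coffee analysis.

```r
set.seed(654)
# Three well-separated 2-D blobs (hypothetical data)
centers <- rbind(c(0, 0), c(10, 0), c(0, 10))
toy <- do.call(rbind, lapply(1:3, function(j) {
  cbind(rnorm(50, mean = centers[j, 1]), rnorm(50, mean = centers[j, 2]))
}))

# Total within-cluster sum of squares for k = 2..6
ks  <- 2:6
wss <- sapply(ks, function(k) {
  kmeans(toy, centers = k, nstart = 20)$tot.withinss
})

# Drop in WSS gained by adding one more cluster
gain <- -diff(wss)
print(round(wss, 1))
print(round(gain, 1))
```

Going from 2 to 3 clusters removes most of the within-cluster variation, while each further cluster gains comparatively little. That flattening is what the elbow plot shows visually. On real data the bend is often less clear-cut, so the choice of k remains partly subjective.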

Making the Clusters

set.seed(654)
coffe_cluster <- kmeans(coffee_new, centers = 3)
coffee_new$cluster <- coffe_cluster$cluster

coffee_new

Visualizing the Clusters

# Drop the cluster label so only the columns used by kmeans() are projected
fviz_cluster(object = coffe_cluster, 
             data = select(coffee_new, -cluster))

Conclusion

coffee_new %>% 
  mutate(cluster = coffe_cluster$cluster) %>% 
  group_by(cluster) %>% 
  summarise(aroma = mean(Aroma),
            sweetness = mean(Sweetness),
            flavor = mean(Flavor),
            body = mean(Body),
            acidity = mean(Acidity))

After clustering the coffees into 3 groups, I want to see the characteristics of each cluster based on aroma, sweetness, flavor, body, and acidity. From the summary we can see that the coffees in cluster 3 have the highest value in every characteristic except sweetness. The coffees in cluster 1 have the lowest value in all characteristics.
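The per-cluster means in a summary like this are also stored in the fitted k-means object itself, in its `centers` component. The sketch below checks this on synthetic data. The data frame `toy` and its two columns are hypothetical stand-ins for the coffee scores.

```r
set.seed(654)
toy <- data.frame(
  aroma = c(rnorm(30, 6), rnorm(30, 8)),
  body  = c(rnorm(30, 6), rnorm(30, 8))
)
fit <- kmeans(toy, centers = 2, nstart = 10)

# Per-cluster means computed by hand ...
by_hand <- aggregate(toy, by = list(cluster = fit$cluster), FUN = mean)
print(by_hand)

# ... match the centres stored in the fitted object.
print(fit$centers)
```

For the coffee data, `coffe_cluster$centers` should give the same kind of table for every column that was passed to `kmeans()`, which makes it easy to compare the clusters on all characteristics at once.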

Clustering Coffee

Toho Dustin

December 11, 2019