I want to cluster coffees based on their characteristics.
#> 'data.frame': 1082 obs. of 13 variables:
#> $ coffeeId : int 1 2 3 4 5 6 7 8 9 10 ...
#> $ Aroma : num 7.83 8 7.92 8 8.33 8 7.67 7.67 7.67 7.67 ...
#> $ Flavor : num 8.08 7.75 7.83 7.92 7.83 7.92 7.75 7.75 7.75 7.83 ...
#> $ Aftertaste : num 7.75 7.92 7.92 7.92 7.83 7.67 7.83 7.83 7.58 7.83 ...
#> $ Acidity : num 7.92 8 8 7.75 7.75 8 7.83 7.67 7.83 7.83 ...
#> $ Body : num 8.25 7.92 7.83 7.83 8.25 7.75 7.92 7.92 7.83 7.92 ...
#> $ Balance : num 7.92 7.92 7.92 7.75 7.75 7.92 7.75 7.83 8 7.75 ...
#> $ Uniformity : num 10 10 10 10 10 10 10 10 10 10 ...
#> $ Clean.Cup : num 10 10 10 10 10 10 10 10 10 10 ...
#> $ Sweetness : num 8 8 7.83 7.75 7.58 7.75 8 7.92 7.92 7.75 ...
#> $ Cupper.Points: num 8 8 8 8.08 7.67 7.75 7.83 7.92 7.92 7.83 ...
#> $ Moisture : num 0.12 0 0 0.12 0.12 0 0 0.1 0.09 0.12 ...
#> $ Quakers : int 0 0 0 0 0 0 0 0 0 0 ...
Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables (entities each of which takes on various numerical values) into a set of values of linearly uncorrelated variables called principal components. This transformation is defined in such a way that the first principal component has the largest possible variance (that is, accounts for as much of the variability in the data as possible), and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to the preceding components. The resulting vectors (each being a linear combination of the variables and containing n observations) are an uncorrelated orthogonal basis set. PCA is sensitive to the relative scaling of the original variables.
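The sensitivity to scaling can be illustrated with a minimal sketch. It uses base R's `prcomp()` on the built-in `USArrests` data, which is an assumption made purely for illustration; the analysis here runs `FactoMineR::PCA()` on the coffee data instead.

```r
# Illustration only: USArrests is a built-in dataset, not the coffee data.
pca_raw    <- prcomp(USArrests, center = TRUE, scale. = FALSE)
pca_scaled <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Share of total variance explained by each component
var_share <- function(p) p$sdev^2 / sum(p$sdev^2)

round(var_share(pca_raw), 3)     # first component dominated by the high-variance variable
round(var_share(pca_scaled), 3)  # variance spread more evenly after standardizing

# Principal component scores are uncorrelated with each other
round(cor(pca_scaled$x), 3)
```

Without standardizing, the variable measured on the largest scale drives the first component, which is why the coffee variables are scaled before PCA.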
#>
#> Call:
#> PCA(X = coffee_scale, scale.unit = F, ncp = 13, graph = F)
#>
#>
#> Eigenvalues
#> Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
#> Variance 6.938 1.443 0.996 0.941 0.630 0.474
#> % of var. 53.418 11.114 7.670 7.244 4.853 3.653
#> Cumulative % of var. 53.418 64.531 72.202 79.446 84.299 87.951
#> Dim.7 Dim.8 Dim.9 Dim.10 Dim.11 Dim.12
#> Variance 0.353 0.313 0.247 0.230 0.175 0.155
#> % of var. 2.716 2.406 1.902 1.771 1.346 1.192
#> Cumulative % of var. 90.668 93.074 94.976 96.746 98.092 99.284
#> Dim.13
#> Variance 0.093
#> % of var. 0.716
#> Cumulative % of var. 100.000
#>
#> Individuals (the 10 first)
#> Dist Dim.1 ctr cos2 Dim.2 ctr cos2
#> 1 | 4.858 | 2.793 0.104 0.331 | -2.668 0.456 0.302 |
#> 2 | 5.010 | 2.743 0.100 0.300 | -3.450 0.762 0.474 |
#> 3 | 5.123 | 2.623 0.092 0.262 | -3.604 0.832 0.495 |
#> 4 | 4.770 | 2.278 0.069 0.228 | -2.830 0.513 0.352 |
#> 5 | 5.331 | 2.434 0.079 0.208 | -2.997 0.575 0.316 |
#> 6 | 5.049 | 2.281 0.069 0.204 | -3.586 0.823 0.504 |
#> 7 | 4.597 | 1.978 0.052 0.185 | -3.284 0.691 0.510 |
#> 8 | 4.232 | 1.801 0.043 0.181 | -2.679 0.460 0.401 |
#> 9 | 4.208 | 1.792 0.043 0.181 | -2.740 0.481 0.424 |
#> 10 | 4.503 | 1.805 0.043 0.161 | -2.717 0.473 0.364 |
#> Dim.3 ctr cos2
#> 1 0.367 0.012 0.006 |
#> 2 0.076 0.001 0.000 |
#> 3 0.106 0.001 0.000 |
#> 4 0.432 0.017 0.008 |
#> 5 0.446 0.018 0.007 |
#> 6 0.109 0.001 0.000 |
#> 7 0.050 0.000 0.000 |
#> 8 0.315 0.009 0.006 |
#> 9 0.280 0.007 0.004 |
#> 10 0.392 0.014 0.008 |
#>
#> Variables (the 10 first)
#> Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
#> coffeeId | -0.746 8.025 0.557 | 0.384 10.215 0.148 | -0.083
#> Aroma | 0.855 10.542 0.732 | -0.066 0.303 0.004 | 0.021
#> Flavor | 0.940 12.742 0.885 | -0.075 0.388 0.006 | 0.018
#> Aftertaste | 0.933 12.536 0.871 | -0.087 0.520 0.008 | 0.013
#> Acidity | 0.874 11.012 0.765 | -0.083 0.473 0.007 | -0.004
#> Body | 0.854 10.518 0.730 | -0.084 0.490 0.007 | -0.007
#> Balance | 0.890 11.416 0.793 | -0.076 0.401 0.006 | -0.005
#> Uniformity | 0.596 5.122 0.356 | 0.519 18.670 0.270 | -0.051
#> Clean.Cup | 0.534 4.107 0.285 | 0.523 18.952 0.274 | -0.067
#> Sweetness | 0.412 2.445 0.170 | 0.733 37.229 0.538 | -0.104
#> ctr cos2
#> coffeeId 0.687 0.007 |
#> Aroma 0.045 0.000 |
#> Flavor 0.031 0.000 |
#> Aftertaste 0.018 0.000 |
#> Acidity 0.001 0.000 |
#> Body 0.005 0.000 |
#> Balance 0.003 0.000 |
#> Uniformity 0.261 0.003 |
#> Clean.Cup 0.455 0.005 |
#> Sweetness 1.077 0.011 |
I also plot the PCA to look for outliers.
#> $Dim.1
#> $Dim.1$quanti
#> correlation p.value
#> Flavor 0.9406636 0.000000e+00
#> Aftertaste 0.9330372 0.000000e+00
#> Balance 0.8903672 0.000000e+00
#> Cupper.Points 0.8775643 0.000000e+00
#> Acidity 0.8744702 0.000000e+00
#> Aroma 0.8555946 4.061215e-311
#> Body 0.8546484 1.049902e-309
#> Uniformity 0.5963953 3.319503e-105
#> Clean.Cup 0.5340393 8.258183e-81
#> Sweetness 0.4120568 1.347442e-45
#> Moisture -0.1751626 6.631645e-09
#> coffeeId -0.7464989 2.702566e-193
#>
#>
#> $Dim.2
#> $Dim.2$quanti
#> correlation p.value
#> Sweetness 0.73340077 3.081049e-183
#> Clean.Cup 0.52327594 4.265727e-77
#> Uniformity 0.51937303 8.785091e-76
#> coffeeId 0.38416983 2.236144e-39
#> Moisture 0.37555249 1.424046e-37
#> Quakers 0.13601584 7.129601e-06
#> Aroma -0.06616688 2.952929e-02
#> Flavor -0.07489075 1.373761e-02
#> Balance -0.07615240 1.222151e-02
#> Acidity -0.08264565 6.527425e-03
#> Body -0.08411397 5.630897e-03
#> Aftertaste -0.08666795 4.332094e-03
#> Cupper.Points -0.13789572 5.302953e-06
#>
#>
#> $Dim.3
#> $Dim.3$quanti
#> correlation p.value
#> Quakers 0.97902816 0.0000000000
#> Moisture 0.11155973 0.0002361216
#> Clean.Cup -0.06733053 0.0267800937
#> coffeeId -0.08273990 0.0064662377
#> Sweetness -0.10363297 0.0006398488
The plot shows that observation 1082 is an outlier, so I remove it.
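Removing a single observation by row index could look like the sketch below. It runs on a small toy data frame with a planted outlier (hypothetical values), since the actual removal step is not shown here.

```r
# Toy stand-in for the scaled coffee data (hypothetical values)
set.seed(1)
toy <- as.data.frame(matrix(rnorm(40), ncol = 4))
toy[10, ] <- c(15, 15, 15, 15)   # plant an obvious outlier in the last row

# Distance of each row from the column means, to confirm which row stands out
centered <- sweep(as.matrix(toy), 2, colMeans(toy))
dist_from_center <- sqrt(rowSums(centered^2))
outlier_row <- which.max(dist_from_center)

# Drop that row with negative indexing
toy_new <- toy[-outlier_row, ]
nrow(toy_new)
```

On the real data the analogous step would presumably be something like `coffee_new <- coffee_scale[-1082, ]`, assuming `coffee_new` is derived from `coffee_scale` and is a data frame.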
# Plot the total within-cluster sum of squares for k = 2..maxK (elbow plot)
kmeansTunning <- function(data, maxK = 10) {
  withinall <- NULL
  total_k <- NULL
  for (i in 2:maxK) {
    set.seed(654)  # same seed for every k so the runs are reproducible
    temp <- kmeans(data, i)$tot.withinss
    withinall <- append(withinall, temp)
    total_k <- append(total_k, i)
  }
  plot(x = total_k, y = withinall, type = "o",
       xlab = "Number of clusters", ylab = "Total within-cluster sum of squares")
}
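To use k-means, I have to choose a suitable k. One common way to do this is the elbow method: plot the total within-cluster sum of squares against k and pick the point where adding more clusters stops reducing it substantially.
From the graph, k = 3 is a reasonable choice.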
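As a rough numeric companion to reading the elbow off the plot, the sketch below computes the relative drop in the total within-cluster sum of squares between consecutive values of k. It uses synthetic data with three groups (an assumption for illustration, not the coffee data), and `nstart = 10` is an added choice to reduce the effect of random starts.

```r
# Synthetic data with three well-separated groups (illustration only)
set.seed(654)
toy <- rbind(
  matrix(rnorm(100, mean = 0),  ncol = 2),
  matrix(rnorm(100, mean = 6),  ncol = 2),
  matrix(rnorm(100, mean = 12), ncol = 2)
)

# Total within-cluster sum of squares for k = 2..8
ks  <- 2:8
wss <- sapply(ks, function(k) {
  set.seed(654)
  kmeans(toy, centers = k, nstart = 10)$tot.withinss
})

# Relative drop in WSS when going from k to k + 1;
# the elbow is the k after which the drops become small
drops <- -diff(wss) / wss[-length(wss)]
round(setNames(drops, paste0(ks[-length(ks)], "->", ks[-1])), 2)
```

In this toy case the drop from k = 2 to k = 3 is large and the later drops are small, which is the pattern that points to k = 3.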
set.seed(654)
coffe_cluster <- kmeans(coffee_new, centers = 3)
coffee_new$cluster <- coffe_cluster$cluster
coffee_new
Visualizing the cluster
library(dplyr)  # provides %>%, mutate(), group_by(), summarise()

coffee_new %>%
mutate(cluster = coffe_cluster$cluster) %>%
group_by(cluster) %>%
summarise(aroma = mean(Aroma),
sweetness = mean(Sweetness),
flavor = mean(Flavor),
body = mean(Body),
acidity = mean(Acidity))
After clustering into 3 classes, I want to see the characteristics of each cluster based on aroma, sweetness, flavor, body, and acidity. The cluster means show that the coffees in cluster 3 have the highest values for every characteristic except sweetness, while the coffees in cluster 1 have the lowest values for all of them.