Data Exploration & Preparation
## Observations: 1,082
## Variables: 13
## $ coffeeId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16...
## $ Aroma <dbl> 7.83, 8.00, 7.92, 8.00, 8.33, 8.00, 7.67, 7.67, 7.67,...
## $ Flavor <dbl> 8.08, 7.75, 7.83, 7.92, 7.83, 7.92, 7.75, 7.75, 7.75,...
## $ Aftertaste <dbl> 7.75, 7.92, 7.92, 7.92, 7.83, 7.67, 7.83, 7.83, 7.58,...
## $ Acidity <dbl> 7.92, 8.00, 8.00, 7.75, 7.75, 8.00, 7.83, 7.67, 7.83,...
## $ Body <dbl> 8.25, 7.92, 7.83, 7.83, 8.25, 7.75, 7.92, 7.92, 7.83,...
## $ Balance <dbl> 7.92, 7.92, 7.92, 7.75, 7.75, 7.92, 7.75, 7.83, 8.00,...
## $ Uniformity <dbl> 10.00, 10.00, 10.00, 10.00, 10.00, 10.00, 10.00, 10.0...
## $ Clean.Cup <dbl> 10.00, 10.00, 10.00, 10.00, 10.00, 10.00, 10.00, 10.0...
## $ Sweetness <dbl> 8.00, 8.00, 7.83, 7.75, 7.58, 7.75, 8.00, 7.92, 7.92,...
## $ Cupper.Points <dbl> 8.00, 8.00, 8.00, 8.08, 7.67, 7.75, 7.83, 7.92, 7.92,...
## $ Moisture <dbl> 0.12, 0.00, 0.00, 0.12, 0.12, 0.00, 0.00, 0.10, 0.09,...
## $ Quakers <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
This dataset consists of 13 variables and 1,082 rows. It contains reviews of Arabica coffee varieties by highly trained individuals from the Coffee Quality Institute. The glossary below describes each variable:
- coffeeId: ID of the coffee.
- Aroma: the smell of the coffee after adding hot water (e.g. floral, spicy, fruity, winey, sweet, earthy, or nutty).
- Flavor: the taste characteristics (e.g. fruity, sour, bitter, rich, or balanced).
- Aftertaste: the overall impression of the coffee that remains in the mouth.
- Acidity: the sharpness and liveliness of the acidity (e.g. sharp, thin, flat, mild, or neutral).
- Body: the tactile feeling of the coffee in the mouth (e.g. full, thick, balanced, buttery, or thin).
- Balance: no single flavor dominates the others.
- Uniformity: how similarly sized the ground coffee particles are.
- Clean cup: no flavor defects present.
- Sweetness: a mild, smooth taste sensation with no harsh flavors.
- Cupper points: points awarded by a cupper (a person who objectively reviews the taste and aroma of a brewed coffee to judge whether it is a specialty grade coffee).
- Moisture: the amount of water held within the green coffee beans; if humidity is stable, the beans retain that moisture until roasting.
- Quakers: unripe coffee beans, often with a wrinkled surface, that do not darken well when roasted.
## [1] "Aroma" "Flavor" "Aftertaste" "Acidity"
## [5] "Body" "Balance" "Uniformity" "Clean.Cup"
## [9] "Sweetness" "Cupper.Points" "Moisture" "Quakers"
1. Principal Component Analysis (PCA)
Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables (entities each of which takes on various numerical values) into a set of values of linearly uncorrelated variables called principal components. This transformation is defined in such a way that the first principal component has the largest possible variance (that is, accounts for as much of the variability in the data as possible), and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to the preceding components. The resulting vectors (each being a linear combination of the variables and containing n observations) are an uncorrelated orthogonal basis set. PCA is sensitive to the relative scaling of the original variables.
Data Pre-Processing
PCA is very useful for retaining information while reducing the dimension of the data. However, because PCA is sensitive to the relative scale of the variables, we need to make sure that our data is properly scaled first.
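A minimal sketch of the scaling step, using base R's `scale()`. The small data frame `coffee_toy` is a made-up stand-in (only the column names come from the dataset, and the values are illustrative), so treat this as an assumption about how the preparation could look rather than the exact code used here.

```r
# Toy stand-in for the coffee data (values are illustrative, not the real reviews)
coffee_toy <- data.frame(
  coffeeId   = 1:5,
  Aroma      = c(7.83, 8.00, 7.92, 8.00, 8.33),
  Uniformity = c(10, 10, 9.33, 10, 8.67),
  Moisture   = c(0.12, 0.00, 0.00, 0.12, 0.12)
)

# Drop the identifier, then center and scale every remaining column
coffee_toy_scale <- scale(coffee_toy[, names(coffee_toy) != "coffeeId"])

# After scaling, each column has mean ~0 and standard deviation 1
round(colMeans(coffee_toy_scale), 10)
apply(coffee_toy_scale, 2, sd)
```

The identifier is dropped before scaling because it carries no information about the coffee's characteristics and would otherwise be treated as a feature.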
Build Principal Component
We have prepared the scaled data to be used for PCA. Next, we generate the principal components from coffee_scale. Because the data is already scaled, scale.unit is set to FALSE in the call.
##
## Call:
## PCA(X = coffee_scale, scale.unit = F)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
## Variance 6.422 1.326 0.990 0.939 0.557 0.473 0.320
## % of var. 53.565 11.061 8.261 7.829 4.647 3.946 2.665
## Cumulative % of var. 53.565 64.626 72.887 80.716 85.363 89.310 91.975
## Dim.8 Dim.9 Dim.10 Dim.11 Dim.12
## Variance 0.287 0.245 0.181 0.156 0.093
## % of var. 2.391 2.045 1.509 1.304 0.776
## Cumulative % of var. 94.366 96.411 97.920 99.224 100.000
##
## Individuals (the 10 first)
## Dist Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | 4.540 | 2.342 0.079 0.266 | -2.367 0.391 0.272 | 0.350
## 2 | 4.703 | 2.281 0.075 0.235 | -3.275 0.748 0.485 | 0.291
## 3 | 4.824 | 2.151 0.067 0.199 | -3.403 0.807 0.497 | 0.319
## 4 | 4.449 | 1.800 0.047 0.164 | -2.428 0.411 0.298 | 0.386
## 5 | 5.047 | 1.965 0.056 0.152 | -2.667 0.496 0.279 | 0.434
## 6 | 4.749 | 1.799 0.047 0.144 | -3.352 0.783 0.498 | 0.311
## 7 | 4.267 | 1.492 0.032 0.122 | -3.016 0.634 0.500 | 0.234
## 8 | 3.873 | 1.314 0.025 0.115 | -2.259 0.356 0.340 | 0.292
## 9 | 3.848 | 1.305 0.024 0.115 | -2.336 0.380 0.369 | 0.278
## 10 | 4.169 | 1.318 0.025 0.100 | -2.275 0.361 0.298 | 0.335
## ctr cos2
## 1 0.011 0.006 |
## 2 0.008 0.004 |
## 3 0.009 0.004 |
## 4 0.014 0.008 |
## 5 0.018 0.007 |
## 6 0.009 0.004 |
## 7 0.005 0.003 |
## 8 0.008 0.006 |
## 9 0.007 0.005 |
## 10 0.010 0.006 |
##
## Variables (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3 ctr
## Aroma | 0.856 11.414 0.734 | -0.119 1.063 0.014 | 0.034 0.117
## Flavor | 0.936 13.641 0.877 | -0.116 1.012 0.013 | 0.027 0.076
## Aftertaste | 0.927 13.367 0.859 | -0.124 1.151 0.015 | 0.025 0.063
## Acidity | 0.873 11.865 0.763 | -0.134 1.348 0.018 | 0.010 0.009
## Body | 0.859 11.495 0.739 | -0.158 1.888 0.025 | 0.021 0.046
## Balance | 0.888 12.276 0.789 | -0.127 1.216 0.016 | 0.015 0.023
## Uniformity | 0.617 5.922 0.381 | 0.497 18.630 0.247 | -0.075 0.574
## Clean.Cup | 0.549 4.698 0.302 | 0.528 21.045 0.279 | -0.102 1.053
## Sweetness | 0.454 3.204 0.206 | 0.668 33.635 0.446 | -0.123 1.519
## Cupper.Points | 0.866 11.685 0.751 | -0.162 1.982 0.026 | 0.032 0.105
## cos2
## Aroma 0.001 |
## Flavor 0.001 |
## Aftertaste 0.001 |
## Acidity 0.000 |
## Body 0.000 |
## Balance 0.000 |
## Uniformity 0.006 |
## Clean.Cup 0.010 |
## Sweetness 0.015 |
## Cupper.Points 0.001 |
- Based on the summary, if we tolerate no more than 20% information loss, we need 4 principal components, which together retain 80.7% of the variance.
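The choice of 4 components can be checked with a short sketch. The eigenvalues are copied from the "Variance" row of the PCA summary; the variable names `eig`, `cum_var`, and `n_pc` are only illustrative.

```r
# Eigenvalues ("Variance" row) copied from the PCA summary
eig <- c(6.422, 1.326, 0.990, 0.939, 0.557, 0.473,
         0.320, 0.287, 0.245, 0.181, 0.156, 0.093)

# Cumulative share of variance retained by the first k components
cum_var <- cumsum(eig) / sum(eig)

# Smallest k that keeps at least 80% of the variance
n_pc <- which(cum_var >= 0.80)[1]
n_pc
```

Changing the 0.80 threshold shows how many components a stricter or looser tolerance would require.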
Another useful application of PCA is visualizing high-dimensional data in a 2-dimensional plot, for purposes such as cluster analysis or outlier detection.
library(FactoMineR)

# PCA visualization - Individuals Factor Map
plot.PCA(x = pca_coffee,
         choix = "ind",          # plot the individuals (observations)
         invisible = "quali",    # hide qualitative variables
         select = "contrib 5",   # highlight the 5 individuals with the largest contribution
         habillage = 8)          # color individuals by variable 8

- Judging from the plot, three observations can be considered outliers: observations 1080, 1081, and 1082.
Extracting information from the loadings using the dimdesc() function:
## $quanti
## correlation p.value
## Flavor 0.9363936 0.000000e+00
## Aftertaste 0.9269447 0.000000e+00
## Balance 0.8882928 0.000000e+00
## Acidity 0.8732907 0.000000e+00
## Cupper.Points 0.8666618 0.000000e+00
## Body 0.8595808 3.508914e-317
## Aroma 0.8565349 1.566517e-312
## Uniformity 0.6169480 1.775321e-114
## Clean.Cup 0.5495376 2.132945e-86
## Sweetness 0.4538191 4.405880e-56
## Moisture -0.1661858 3.831785e-08
##
## attr(,"class")
## [1] "condes" "list "
We can see that the 3 variables contributing most to PC 1, based on their correlation with PC 1, are Flavor, Aftertaste, and Balance.
In principal component analysis, each PC has an eigenvalue obtained from the covariance matrix. The greater the eigenvalue, the greater the variance captured by that PC.
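The link between eigenvalues and captured variance can be illustrated with base R. This sketch uses simulated data and `prcomp()` rather than the coffee reviews and FactoMineR, so the numbers are unrelated to the summary in this report.

```r
set.seed(1)

# Simulated correlated data (hypothetical, not the coffee reviews)
x1 <- rnorm(200)
x2 <- 0.8 * x1 + rnorm(200, sd = 0.5)
x3 <- rnorm(200)
X  <- scale(cbind(x1, x2, x3))

# Eigen-decomposition of the covariance matrix
eig <- eigen(cov(X))

# PCA via base R; sdev^2 is the variance of each principal component
pca <- prcomp(X)

# The eigenvalues match the PC variances, sorted in decreasing order
all.equal(eig$values, pca$sdev^2)
```

Because `x1` and `x2` are strongly correlated, the first eigenvalue is noticeably larger than the others, which is the same pattern that makes Dim.1 dominate the coffee PCA.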
2. K-Means Clustering
Data clustering is a common data mining technique that groups observations so that members of the same cluster share similar characteristics. Before clustering, we need to remove the outlier identified in the individual PCA plot. The observation with coffeeId 1082 is an extreme outlier compared to the rest of the observations. We remove this observation from the initial dataset and scale the data again.
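A minimal sketch of the removal and re-scaling, assuming the data sits in a data frame with a coffeeId column. The data frame `coffee_toy` and the ID stored in `outlier_id` are hypothetical stand-ins for the real data and for coffeeId 1082.

```r
# Toy stand-in for the coffee data (values are illustrative)
coffee_toy <- data.frame(
  coffeeId = 1:6,
  Aroma    = c(7.8, 8.0, 7.9, 8.0, 8.3, 0.0),
  Flavor   = c(8.1, 7.8, 7.8, 7.9, 7.8, 0.0)
)
outlier_id <- 6  # hypothetical; the text removes coffeeId 1082

# Keep every row except the outlier, matched by ID rather than row position
coffee_toy2 <- coffee_toy[coffee_toy$coffeeId != outlier_id, ]

# Re-scale, since the outlier distorted the old means and standard deviations
coffee_toy_scale2 <- scale(coffee_toy2[, c("Aroma", "Flavor")])

nrow(coffee_toy2)
```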
2.1 Choosing Optimum K
The next step in building a K-means clustering is to find the optimal number of clusters for our data. We will use the function below to find the optimum K using the elbow method.
# Use the pre-3.6.0 sampling algorithm so seeded results are reproducible
RNGkind(sample.kind = "Rounding")

# Plot the total within-cluster sum of squares for K = 2..maxK (elbow method)
kmeansTunning <- function(data, maxK) {
  withinall <- NULL
  total_k <- NULL
  for (i in 2:maxK) {
    set.seed(101)  # same seed for every K so the runs are comparable
    temp <- kmeans(data, i)$tot.withinss
    withinall <- append(withinall, temp)
    total_k <- append(total_k, i)
  }
  plot(x = total_k, y = withinall, type = "o",
       xlab = "Number of Cluster", ylab = "Total within")
}

kmeansTunning(coffee_scale2, maxK = 5)

Based on the elbow plot generated by the function above, the optimal number of clusters to use is 4.
K-means is a clustering algorithm that groups the data based on distance. The resulting clusters are considered good when the distance between observations in the same cluster is small and the distance between observations in different clusters is large.
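These two notions of distance correspond to quantities that `kmeans()` reports. The sketch below uses two simulated, well-separated groups (not the coffee data) to show them.

```r
set.seed(101)

# Two simulated, well-separated groups of 50 points each (hypothetical data)
X <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 5), ncol = 2))

km <- kmeans(X, centers = 2, nstart = 10)

# Within-cluster sum of squares: small when members sit close to their centroid
km$tot.withinss

# Between-cluster sum of squares: large when the centroids are far apart
km$betweenss

# The two always add up to the total sum of squares
all.equal(km$tot.withinss + km$betweenss, km$totss)
```

The elbow method tracks the first of these quantities, `tot.withinss`, as K grows.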
2.2 Building Cluster
We will run K-means clustering on our data and store the result.
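A minimal sketch of this step, assuming 4 clusters as chosen from the elbow plot. The matrix `X`, the data frame `toy`, and the object name `coffee_km` are illustrative assumptions; the data is random, so the clusters themselves are meaningless.

```r
set.seed(101)

# Random stand-in for the scaled coffee scores (hypothetical data)
X <- scale(matrix(rnorm(400), ncol = 4,
                  dimnames = list(NULL, c("Aroma", "Flavor", "Body", "Acidity"))))

# Fit K-means with 4 clusters and keep the fitted object
coffee_km <- kmeans(X, centers = 4)

# Store the cluster label next to each observation's ID
toy <- data.frame(coffeeId = 1:100, X)
toy$cluster <- coffee_km$cluster

# Number of observations per cluster
table(toy$cluster)
```

Keeping the labels in the same data frame as the IDs makes the profiling and recommendation steps simple lookups.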
2.3 Clusters Profiling
- Suppose a customer enjoys a coffee with a particular coffeeId. By knowing which cluster that coffee falls into, we can recommend other coffee beans from the same cluster, since they should be similar enough in character to warrant a recommendation.
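The recommendation idea can be sketched as a lookup by cluster. The table `coffee_clusters` and the helper `recommend_similar()` are hypothetical names introduced for illustration, and the cluster labels are made up.

```r
# Hypothetical lookup table: coffee IDs with their assigned cluster
coffee_clusters <- data.frame(
  coffeeId = 1:8,
  cluster  = c(1, 2, 2, 3, 1, 2, 4, 1)
)

# Return the other coffees in the same cluster as the one the customer liked
recommend_similar <- function(liked_id, data) {
  liked_cluster <- data$cluster[data$coffeeId == liked_id]
  data$coffeeId[data$cluster == liked_cluster & data$coffeeId != liked_id]
}

# Coffees that share a cluster with coffeeId 2
recommend_similar(2, coffee_clusters)
```

A coffee that is alone in its cluster yields an empty result, so a real recommender would need a fallback such as the nearest neighboring cluster.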
Visualizing Cluster
Conclusion
## # A tibble: 4 x 13
## cluster Aroma Flavor Aftertaste Acidity Body Balance Uniformity Clean.Cup
## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 7.63 7.59 7.50 7.59 7.62 7.67 9.92 9.95
## 2 2 7.78 7.78 7.64 7.75 7.70 7.73 9.93 9.96
## 3 3 7.11 6.85 6.70 7.07 7.16 6.85 8.72 7.05
## 4 4 7.41 7.33 7.20 7.36 7.33 7.31 9.91 9.97
## # ... with 4 more variables: Sweetness <dbl>, Cupper.Points <dbl>,
## # Moisture <dbl>, Quakers <dbl>
After clustering the coffee into 4 classes, I want to see the characteristics of each cluster based on aroma, sweetness, flavor, body, and acidity. From the cluster means, we can see that coffee in cluster 2 has the highest value in all characteristics except clean cup and sweetness.
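The comparison can be checked programmatically for the columns visible in the summary table. The means below are copied from that table; Sweetness and the other truncated columns are not shown there, so they are left out rather than guessed.

```r
# Cluster means copied from the summary table (the eight visible score columns)
profile <- data.frame(
  cluster    = 1:4,
  Aroma      = c(7.63, 7.78, 7.11, 7.41),
  Flavor     = c(7.59, 7.78, 6.85, 7.33),
  Aftertaste = c(7.50, 7.64, 6.70, 7.20),
  Acidity    = c(7.59, 7.75, 7.07, 7.36),
  Body       = c(7.62, 7.70, 7.16, 7.33),
  Balance    = c(7.67, 7.73, 6.85, 7.31),
  Uniformity = c(9.92, 9.93, 8.72, 9.91),
  Clean.Cup  = c(9.95, 9.96, 7.05, 9.97)
)

# For each characteristic, find the cluster with the highest mean
best <- sapply(profile[, -1], function(col) profile$cluster[which.max(col)])
best
```

Among these columns, cluster 2 leads everywhere except Clean.Cup, where cluster 4 is marginally higher.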