Data Exploration & Preparation

# Load dplyr for glimpse() and the %>% pipe, then read and inspect the data
library(dplyr)
coffee <- read.csv("coffee.csv")
glimpse(coffee)

## Observations: 1,082
## Variables: 13
## $ coffeeId      <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16...
## $ Aroma         <dbl> 7.83, 8.00, 7.92, 8.00, 8.33, 8.00, 7.67, 7.67, 7.67,...
## $ Flavor        <dbl> 8.08, 7.75, 7.83, 7.92, 7.83, 7.92, 7.75, 7.75, 7.75,...
## $ Aftertaste    <dbl> 7.75, 7.92, 7.92, 7.92, 7.83, 7.67, 7.83, 7.83, 7.58,...
## $ Acidity       <dbl> 7.92, 8.00, 8.00, 7.75, 7.75, 8.00, 7.83, 7.67, 7.83,...
## $ Body          <dbl> 8.25, 7.92, 7.83, 7.83, 8.25, 7.75, 7.92, 7.92, 7.83,...
## $ Balance       <dbl> 7.92, 7.92, 7.92, 7.75, 7.75, 7.92, 7.75, 7.83, 8.00,...
## $ Uniformity    <dbl> 10.00, 10.00, 10.00, 10.00, 10.00, 10.00, 10.00, 10.0...
## $ Clean.Cup     <dbl> 10.00, 10.00, 10.00, 10.00, 10.00, 10.00, 10.00, 10.0...
## $ Sweetness     <dbl> 8.00, 8.00, 7.83, 7.75, 7.58, 7.75, 8.00, 7.92, 7.92,...
## $ Cupper.Points <dbl> 8.00, 8.00, 8.00, 8.08, 7.67, 7.75, 7.83, 7.92, 7.92,...
## $ Moisture      <dbl> 0.12, 0.00, 0.00, 0.12, 0.12, 0.00, 0.00, 0.10, 0.09,...
## $ Quakers       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...

This dataset consists of 13 variables and 1,082 rows and contains reviews of Arabica coffee varieties by highly trained individuals from the Coffee Quality Institute. Take a look at the following glossary:

coffeeId : the ID of the coffee.
Aroma : the smell of the coffee after adding hot water (e.g. floral, spicy, fruity, winey, sweet, earthy, or nutty).
Flavor : the taste characteristics (e.g. fruity, sour, bitter, rich, or balanced).
Aftertaste : the overall impression the coffee leaves in the mouth.
Acidity : the sharpness and liveliness of the acidity (e.g. sharp, thin, flat, mild, or neutral).
Body : the tactile feeling of the coffee in the mouth (e.g. full, thick, balanced, buttery, or thin).
Balance : no single flavor dominates the others.
Uniformity : how similarly sized the ground coffee particles are.
Clean cup : no flavor defects present.
Sweetness : mild, smooth taste sensation with no harsh flavors.
Cupper points : points awarded by a cupper (a person who objectively reviews the taste and aroma of a brewed coffee to judge whether it is a Specialty Grade coffee).
Moisture : the amount of water retained within green coffee beans; if humidity is stable, the beans keep that moisture until roasting.
Quakers : unripened coffee beans, often with a wrinkled surface, that do not darken well when roasted.

# Drop the ID column, which carries no cupping information
coffee <- coffee %>% 
  select(-coffeeId)
colnames(coffee)

##  [1] "Aroma"         "Flavor"        "Aftertaste"    "Acidity"      
##  [5] "Body"          "Balance"       "Uniformity"    "Clean.Cup"    
##  [9] "Sweetness"     "Cupper.Points" "Moisture"      "Quakers"

1. Principal Component Analysis (PCA)

Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables (entities each of which takes on various numerical values) into a set of values of linearly uncorrelated variables called principal components. This transformation is defined in such a way that the first principal component has the largest possible variance (that is, accounts for as much of the variability in the data as possible), and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to the preceding components. The resulting vectors (each being a linear combination of the variables and containing n observations) are an uncorrelated orthogonal basis set. PCA is sensitive to the relative scaling of the original variables.
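The properties described above can be checked numerically. The following is a minimal sketch using base R's prcomp() on simulated data (the `toy` matrix and its columns are assumptions made for illustration, not the coffee data): the loadings form an orthonormal basis, the component scores are uncorrelated, and the component variances decrease.

```r
# Simulated data (an assumption for illustration, not the coffee dataset)
set.seed(42)
x1 <- rnorm(200)
x2 <- 0.8 * x1 + rnorm(200, sd = 0.5)  # deliberately correlated with x1
x3 <- rnorm(200)
toy <- cbind(x1, x2, x3)

pca_toy <- prcomp(toy, center = TRUE, scale. = TRUE)

# Loadings are orthonormal: t(R) %*% R should be the identity matrix
orthonormal_check <- crossprod(pca_toy$rotation)

# Component scores should be linearly uncorrelated
score_cor <- cor(pca_toy$x)

# Variance captured by each component, largest first
pc_var <- pca_toy$sdev^2

print(round(orthonormal_check, 6))
print(round(score_cor, 6))
print(round(pc_var, 3))
```

Both printed matrices should be identity matrices up to rounding, which is what "uncorrelated orthogonal basis set" means in practice.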

Data Pre-Processing

PCA is useful for reducing the dimension of the data while retaining most of its information. However, because PCA is sensitive to the relative scale of the variables, we need to standardize each column (mean 0, standard deviation 1) in order to get a useful PCA.

coffee_scale <- scale(coffee)
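To see why scaling matters, here is a small sketch on simulated data (the `score` and `moisture` columns are hypothetical stand-ins for variables measured on very different scales). Without scaling, the first component is driven almost entirely by the variable with the larger variance.

```r
# Two independent simulated variables on very different scales (illustrative only)
set.seed(1)
toy <- data.frame(
  score    = rnorm(100, mean = 7.5, sd = 0.3),
  moisture = rnorm(100, mean = 0.1, sd = 0.02)
)

toy_scaled <- scale(toy)

# After scale(), every column has mean 0 and standard deviation 1
col_means <- colMeans(toy_scaled)
col_sds   <- apply(toy_scaled, 2, sd)

# Share of variance captured by each component, without and with scaling
var_unscaled   <- prcomp(toy)$sdev^2
var_scaled     <- prcomp(toy_scaled)$sdev^2
unscaled_share <- var_unscaled / sum(var_unscaled)
scaled_share   <- var_scaled / sum(var_scaled)

print(round(unscaled_share, 3))
print(round(scaled_share, 3))
```

In the unscaled case PC 1 takes nearly all of the variance simply because `score` has the larger spread, not because it is more informative.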

Build Principal Component

We have prepared the scaled data to be used for PCA. Next, we generate the principal components from coffee_scale.

library(FactoMineR)  # provides PCA()
pca_coffee <- PCA(coffee_scale, scale.unit = F)

summary(pca_coffee)

## 
## Call:
## PCA(X = coffee_scale, scale.unit = F) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               6.422   1.326   0.990   0.939   0.557   0.473   0.320
## % of var.             53.565  11.061   8.261   7.829   4.647   3.946   2.665
## Cumulative % of var.  53.565  64.626  72.887  80.716  85.363  89.310  91.975
##                        Dim.8   Dim.9  Dim.10  Dim.11  Dim.12
## Variance               0.287   0.245   0.181   0.156   0.093
## % of var.              2.391   2.045   1.509   1.304   0.776
## Cumulative % of var.  94.366  96.411  97.920  99.224 100.000
## 
## Individuals (the 10 first)
##                   Dist    Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1             |  4.540 |  2.342  0.079  0.266 | -2.367  0.391  0.272 |  0.350
## 2             |  4.703 |  2.281  0.075  0.235 | -3.275  0.748  0.485 |  0.291
## 3             |  4.824 |  2.151  0.067  0.199 | -3.403  0.807  0.497 |  0.319
## 4             |  4.449 |  1.800  0.047  0.164 | -2.428  0.411  0.298 |  0.386
## 5             |  5.047 |  1.965  0.056  0.152 | -2.667  0.496  0.279 |  0.434
## 6             |  4.749 |  1.799  0.047  0.144 | -3.352  0.783  0.498 |  0.311
## 7             |  4.267 |  1.492  0.032  0.122 | -3.016  0.634  0.500 |  0.234
## 8             |  3.873 |  1.314  0.025  0.115 | -2.259  0.356  0.340 |  0.292
## 9             |  3.848 |  1.305  0.024  0.115 | -2.336  0.380  0.369 |  0.278
## 10            |  4.169 |  1.318  0.025  0.100 | -2.275  0.361  0.298 |  0.335
##                  ctr   cos2  
## 1              0.011  0.006 |
## 2              0.008  0.004 |
## 3              0.009  0.004 |
## 4              0.014  0.008 |
## 5              0.018  0.007 |
## 6              0.009  0.004 |
## 7              0.005  0.003 |
## 8              0.008  0.006 |
## 9              0.007  0.005 |
## 10             0.010  0.006 |
## 
## Variables (the 10 first)
##                  Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3    ctr
## Aroma         |  0.856 11.414  0.734 | -0.119  1.063  0.014 |  0.034  0.117
## Flavor        |  0.936 13.641  0.877 | -0.116  1.012  0.013 |  0.027  0.076
## Aftertaste    |  0.927 13.367  0.859 | -0.124  1.151  0.015 |  0.025  0.063
## Acidity       |  0.873 11.865  0.763 | -0.134  1.348  0.018 |  0.010  0.009
## Body          |  0.859 11.495  0.739 | -0.158  1.888  0.025 |  0.021  0.046
## Balance       |  0.888 12.276  0.789 | -0.127  1.216  0.016 |  0.015  0.023
## Uniformity    |  0.617  5.922  0.381 |  0.497 18.630  0.247 | -0.075  0.574
## Clean.Cup     |  0.549  4.698  0.302 |  0.528 21.045  0.279 | -0.102  1.053
## Sweetness     |  0.454  3.204  0.206 |  0.668 33.635  0.446 | -0.123  1.519
## Cupper.Points |  0.866 11.685  0.751 | -0.162  1.982  0.026 |  0.032  0.105
##                 cos2  
## Aroma          0.001 |
## Flavor         0.001 |
## Aftertaste     0.001 |
## Acidity        0.000 |
## Body           0.000 |
## Balance        0.000 |
## Uniformity     0.006 |
## Clean.Cup      0.010 |
## Sweetness      0.015 |
## Cupper.Points  0.001 |

Based on the summary, the first four principal components capture 80.7% of the total variance. If we choose to tolerate no more than 20% of information loss, we will therefore use 4 principal components.
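The same choice can be made programmatically. The sketch below copies the eigenvalues from the "Variance" row of the summary output above and picks the smallest number of components whose cumulative share reaches the threshold (the variable names are assumptions for illustration).

```r
# Eigenvalues copied from the "Variance" row of summary(pca_coffee)
eigenvalues <- c(6.422, 1.326, 0.990, 0.939, 0.557, 0.473,
                 0.320, 0.287, 0.245, 0.181, 0.156, 0.093)

# Cumulative share of variance explained by the first k components
cum_share <- cumsum(eigenvalues) / sum(eigenvalues)

# Maximum information loss we are willing to tolerate
max_loss <- 0.20

# Smallest k whose cumulative share reaches 1 - max_loss
n_components <- which(cum_share >= 1 - max_loss)[1]

print(round(cum_share, 3))
print(n_components)
```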

Another useful application of PCA is to visualize high-dimensional data in a two-dimensional plot, for example to support cluster analysis or to detect outliers.

library(factoextra)  # provides fviz_eig()
fviz_eig(pca_coffee, ncp = 12, addlabels = TRUE, main = "Variance explained by each dimension")

library(FactoMineR)

# PCA visualization - Individuals Factor Map
plot.PCA(x = pca_coffee, 
         choix = "ind", 
         invisible = "quali", 
         select = "contrib 5", 
         habillage = 8)

Judging from the plot, three observations can be considered outliers: observations 1080, 1081, and 1082.
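Outlier candidates can also be ranked numerically rather than read off the plot. The following is a rough sketch on simulated data with three planted outliers (an assumption for illustration): it measures each row's distance from the center of the scaled data, which plays a role similar in spirit to the Dist column of the summary above, and lists the three most distant rows.

```r
# Simulated data with three planted outliers in rows 198 to 200 (illustrative only)
set.seed(123)
toy <- matrix(rnorm(200 * 4), ncol = 4)
toy[198, ] <- 8
toy[199, ] <- 9
toy[200, ] <- 10

toy_scaled <- scale(toy)

# Euclidean distance of each row from the origin of the scaled space
dist_from_center <- sqrt(rowSums(toy_scaled^2))

# Row numbers of the three most distant observations
suspects <- order(dist_from_center, decreasing = TRUE)[1:3]

print(sort(suspects))
```

A ranking like this is only a screening aid; whether a distant observation should actually be removed remains a judgment call.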

Extracting information from the loadings using the dimdesc() function

pca_dimdesc <- dimdesc(pca_coffee)
pca_dimdesc$Dim.1

## $quanti
##               correlation       p.value
## Flavor          0.9363936  0.000000e+00
## Aftertaste      0.9269447  0.000000e+00
## Balance         0.8882928  0.000000e+00
## Acidity         0.8732907  0.000000e+00
## Cupper.Points   0.8666618  0.000000e+00
## Body            0.8595808 3.508914e-317
## Aroma           0.8565349 1.566517e-312
## Uniformity      0.6169480 1.775321e-114
## Clean.Cup       0.5495376  2.132945e-86
## Sweetness       0.4538191  4.405880e-56
## Moisture       -0.1661858  3.831785e-08
## 
## attr(,"class")
## [1] "condes" "list "

We can see that the three variables most strongly correlated with PC 1, and therefore contributing most to it, are Flavor, Aftertaste, and Balance.

In principal component analysis, each PC has an eigenvalue obtained from the covariance matrix. The greater the eigenvalue, the greater the variance captured by that PC.
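This relationship can be verified directly. The sketch below uses simulated data and base R (eigen() and prcomp(), rather than the FactoMineR call used above) to show that the eigenvalues of the covariance matrix of the scaled data equal the variances of the principal components.

```r
# Simulated data with one pair of correlated columns (illustrative only)
set.seed(7)
toy <- matrix(rnorm(300 * 4), ncol = 4)
toy[, 2] <- toy[, 1] + 0.5 * toy[, 2]
toy_scaled <- scale(toy)

# Eigen decomposition of the covariance matrix of the scaled data
eig <- eigen(cov(toy_scaled))
eigenvalues <- eig$values

# Variances of the principal components from prcomp()
pca_toy <- prcomp(toy_scaled)
pc_variances <- pca_toy$sdev^2

# Share of total variance captured by each PC
share <- eigenvalues / sum(eigenvalues)

print(round(eigenvalues, 4))
print(round(pc_variances, 4))
print(round(share, 3))
```

Because the data are scaled, the eigenvalues sum to the number of variables (4 here), so each share is simply the eigenvalue divided by that count.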

2. K-Means Clustering

Data clustering is a common data mining technique that groups observations sharing similar characteristics. Before performing data clustering, we need to remove the outlier identified in the individuals PCA plot above. The observation with coffeeId 1082 is an extreme outlier compared to the rest of the observations. We remove it from our initial dataset and scale the data once again.

# Remove the outlying row 1082, then rescale the remaining data
coffee <- coffee[-1082, ]

coffee_scale2 <- scale(coffee)

2.1 Choosing Optimum K

The next step in building a K-means clustering is to find the optimum number of clusters for our data. We will use the function below to find the optimum K using the elbow method.

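RNGkind(sample.kind = "Rounding")  # pre-3.6.0 sampling, for reproducibility
kmeansTunning <- function(data, maxK) {
  withinall <- NULL  # total within-cluster sum of squares for each k
  total_k <- NULL    # the k values tried
  for (i in 2:maxK) {
    set.seed(101)
    temp <- kmeans(data, i)$tot.withinss
    withinall <- append(withinall, temp)
    total_k <- append(total_k, i)
  }
  # Elbow plot: look for the k after which the curve flattens
  plot(x = total_k, y = withinall, type = "o", xlab = "Number of clusters", ylab = "Total within-cluster sum of squares")
}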

kmeansTunning(coffee_scale2, maxK = 5)

Based on the elbow plot generated by the function above, the optimal number of clusters is 4.

K-means is a clustering algorithm that groups the data based on distance. The resulting clusters are considered good when the distance between data points in the same cluster is low and the distance between data points in different clusters is high.
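These two notions of distance correspond to quantities that kmeans() reports. As a minimal sketch on simulated data with two well-separated groups (an assumption for illustration), the total sum of squares splits into a within-cluster part and a between-cluster part, and a good clustering has a high between-to-total ratio.

```r
# Two well-separated simulated groups of 50 points each (illustrative only)
set.seed(101)
toy <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 5), ncol = 2)
)

km <- kmeans(toy, centers = 2, nstart = 10)

# Total SS = within-cluster SS (compactness) + between-cluster SS (separation)
within_ss  <- km$tot.withinss
between_ss <- km$betweenss
total_ss   <- km$totss

# Share of the total variation explained by the cluster assignment
ratio <- between_ss / total_ss

print(c(within = within_ss, between = between_ss, total = total_ss))
print(round(ratio, 3))
```

The elbow method used earlier tracks only the within-cluster part as k grows.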

2.2 Building Cluster

We will run K-means clustering on our data and store the result.

# Fit K-means with 4 clusters and attach the cluster labels to the data
set.seed(101)
coffee_km2 <- kmeans(coffee_scale2, 4)
coffee$cluster <- coffee_km2$cluster

2.3 Clusters Profiling

Suppose a customer enjoys a coffee with a particular ID. We can then recommend coffee beans that are characteristically similar by finding which cluster that coffeeId falls into and suggesting other coffees from the same cluster.
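A recommendation lookup of this kind could be sketched as follows. The `recommend_similar()` helper and the `toy` table are hypothetical, and the sketch assumes the coffeeId column has been kept alongside the cluster labels (in the analysis above it was dropped before clustering, so it would need to be stored separately).

```r
# Hypothetical lookup table of coffee IDs and their assigned clusters
toy <- data.frame(
  coffeeId = 1:8,
  cluster  = c(1, 1, 2, 2, 2, 3, 3, 1)
)

# Hypothetical helper: return the other coffees in the same cluster as liked_id
recommend_similar <- function(data, liked_id) {
  liked_cluster <- data$cluster[data$coffeeId == liked_id]
  data$coffeeId[data$cluster == liked_cluster & data$coffeeId != liked_id]
}

recommendations <- recommend_similar(toy, liked_id = 3)
print(recommendations)
```

In practice the candidates within a cluster could be ranked further, for example by distance to the liked coffee in the scaled feature space.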

Visualizing Cluster

Combining PCA with K-Means

library(factoextra)

# Use the scaled matrix the model was fitted on (coffee now also holds the cluster column)
fviz_cluster(object = coffee_km2, 
             data = coffee_scale2)

Conclusion

coffee %>% 
  group_by(cluster) %>% 
  summarise_all("mean")

## # A tibble: 4 x 13
##   cluster Aroma Flavor Aftertaste Acidity  Body Balance Uniformity Clean.Cup
##     <int> <dbl>  <dbl>      <dbl>   <dbl> <dbl>   <dbl>      <dbl>     <dbl>
## 1       1  7.63   7.59       7.50    7.59  7.62    7.67       9.92      9.95
## 2       2  7.78   7.78       7.64    7.75  7.70    7.73       9.93      9.96
## 3       3  7.11   6.85       6.70    7.07  7.16    6.85       8.72      7.05
## 4       4  7.41   7.33       7.20    7.36  7.33    7.31       9.91      9.97
## # ... with 4 more variables: Sweetness <dbl>, Cupper.Points <dbl>,
## #   Moisture <dbl>, Quakers <dbl>

After clustering the coffee into 4 classes, I want to see the characteristics of each cluster based on aroma, sweetness, flavor, body, and acidity. From the cluster means we can see that the coffees in cluster 2 have the highest values in all characteristics except clean cup and sweetness.

Coffee Clustering

Ezra Soterion Nugroho

April 8, 2020