1 Introduction

This project is part of the Harvard: PH125.9 course Data science: Capstone course. The aim of this project is to predict the quality of arabica coffee on a scale of 0-10 or low-high-medium given a set of features as inputs.

Coffee is consumed on daily basis and it’s one of the most popular non-alcoholic drinks. It’s the second traded commodity in the world market (Andrew Menke 2018) and the sector is expected to continue growing and the revenue in the coffee segment is worth \(US\$ 362,601m\) in 2020.(Satista n.d.)

From growing to brewing, a lot of factors affect the flavor, intensity and the quality of coffee (Co n.d.). Mainly, we have \(3\) species: Arabica, Canephora (known as Robusta) and Liberica (Co n.d.). The most popular are Arbica and Robusta. Arabica grows on high altitude while Robusta grows on lower altitude. Each of these two species have some different varieties. Most of the commercial roasters use Arabica beans (Andrea Trevisan n.d.).

This project is organized as follows: In section 2, we present the dataset used and detailed explanation on all of the models. In section 3, we explore the Coffee.data dataset and its statistical properties. In addition we present the algorithms of the models used in this project. In Section 4, we discuss the results and we conclude in section 5.

The evaluation of the validity of our model is based on the Residual Mean Squared Error RMSE for the linear regression models and the confusion matrix for the categorical algorithms.

2 Methodology

2.1 Dataset

First, we load libraries and data. Data is retrieved from Kaggle website who redirect to https://github.com/jldbc/coffee-quality-database. The source of this data is the Coffee Quality Institute1. There is two datasets created for arabica and robusta coffee. The quality measures used as predictors in this project for coffee are:

  • Altitude,

  • Aroma,

  • Flavor,

  • Aftertaste,

  • Acidity,

  • Body,

  • Balance,

  • Uniformity,

  • Clean Cup,

  • Sweetness,

  • Cupper Points. For predicting the type of coffee, we used species as outcome. For predicting quality of coffee, we created two column:

  • score : we divided the total cup points by \(10\) and we rounded to nearest integer so we have a score range between \(0\) and \(10\).

  • quality: This column is added by classifying score into \(3\) categories:

    • low with score below \(8\)
    • medium with score equals to \(8\)
    • high with score greater to \(8\).

We remove data that contains NA values and we make sure that our data has no missing information. One missing country has been renamed missing since only its name is missed. We removed all observations that have an altitude over \(5000 m\) since its clear that there is an error in data entry. We kept all observations having an average over \(4\) on aroma and cleaning cup. The reason is to reduce outliers.

## [1] FALSE

Now,Data is clean, tidy and ready for exploration and analysis.

2.2 Model Evaluation

2.2.1 Confusion Matrix

To evaluate the performance of our classifications models, we are going use the confusion matrix.

Actually Positive Actually Negative
Predicted positive True positives (TP) False positives (FP)
Predicted negative False negatives (FN) True negatives (TN)

From this matrix, we compute some rates:

  • Accuracy is the ability of the model to predict correctly and it is calculated as follows: \[ Accuracy=\frac{TP+TN}{TP+TN+FP+FN} \]
  • Sensitivity is the probability to predict a positive outcome when the actual outcome is positive and it is calculated as follows: \[ Sensitivity=\frac{TP}{TP+FN} \]
  • Specificity : is the probability to not predict a positive outcome when the actual outcome is not a positive and its calculated as follows:
    \[ Specificity=\frac{TN}{TN+FP} \]

2.2.2 Root Mean Square Error

Validation of our model is based on the value of Root Mean Square Error RMSE. The RMSE loss function is calculated as follows: \[\begin{equation} RMSE=\sqrt{\frac{1}{N}\sum_{i=1}^{N}\left( y-\hat{y}\right)^2} \end{equation}\] with

  • \(y\) the actual outcome,
  • \(\hat{y}\) the predicted outcome.
  • \(N\) number of observations.

The code that compute the RMSE is :

3 Data analysis and model preparation

Below the top \(6\) rows of the dataset.

##   Species Country.of.Origin altitude_mean_meters Aroma Flavor Aftertaste
## 1 Arabica          Ethiopia                 2075  8.67   8.83       8.67
## 2 Arabica          Ethiopia                 2075  8.75   8.67       8.50
## 3 Arabica         Guatemala                 1700  8.42   8.50       8.42
## 4 Arabica          Ethiopia                 2000  8.17   8.58       8.42
## 5 Arabica          Ethiopia                 2075  8.25   8.50       8.25
## 6 Arabica          Ethiopia                 1635  8.25   8.33       8.50
##   Acidity Body Balance Uniformity Clean.Cup Sweetness Cupper.Points
## 1    8.75 8.50    8.42         10        10     10.00          8.75
## 2    8.58 8.42    8.42         10        10     10.00          8.58
## 3    8.42 8.33    8.42         10        10     10.00          9.25
## 4    8.42 8.50    8.25         10        10     10.00          8.67
## 5    8.50 8.42    8.33         10        10     10.00          8.58
## 6    8.42 8.33    8.50         10        10      9.33          9.00
##   Total.Cup.Points score quality
## 1            90.58     9    high
## 2            89.92     9    high
## 3            89.75     9    high
## 4            89.00     9    high
## 5            88.83     9    high
## 6            88.67     9    high

3.1 Data exploration and visualization

Before we start, let’s have a look on the structure on our data and observe a statistical summary.

## 'data.frame':    1076 obs. of  16 variables:
##  $ Species             : Factor w/ 2 levels "Arabica","Robusta": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Country.of.Origin   : Factor w/ 36 levels "Brazil","Burundi",..: 9 9 10 9 9 9 9 9 9 32 ...
##  $ altitude_mean_meters: num  2075 2075 1700 2000 2075 ...
##  $ Aroma               : num  8.67 8.75 8.42 8.17 8.25 8.25 8.67 8.08 8.17 8.25 ...
##  $ Flavor              : num  8.83 8.67 8.5 8.58 8.5 8.33 8.67 8.58 8.67 8.42 ...
##  $ Aftertaste          : num  8.67 8.5 8.42 8.42 8.25 8.5 8.58 8.5 8.25 8.17 ...
##  $ Acidity             : num  8.75 8.58 8.42 8.42 8.5 8.42 8.42 8.5 8.5 8.33 ...
##  $ Body                : num  8.5 8.42 8.33 8.5 8.42 8.33 8.33 7.67 7.75 8.08 ...
##  $ Balance             : num  8.42 8.42 8.42 8.25 8.33 8.5 8.42 8.42 8.17 8.17 ...
##  $ Uniformity          : num  10 10 10 10 10 10 9.33 10 10 10 ...
##  $ Clean.Cup           : num  10 10 10 10 10 10 10 10 10 10 ...
##  $ Sweetness           : num  10 10 10 10 10 9.33 9.33 10 10 10 ...
##  $ Cupper.Points       : num  8.75 8.58 9.25 8.67 8.58 9 8.67 8.5 8.58 8.5 ...
##  $ Total.Cup.Points    : num  90.6 89.9 89.8 89 88.8 ...
##  $ score               : num  9 9 9 9 9 9 9 9 9 9 ...
##  $ quality             : Factor w/ 3 levels "high","low","medium": 1 1 1 1 1 1 1 1 1 1 ...
##     Species     Country.of.Origin altitude_mean_meters     Aroma      
##  Arabica:1052   Mexico   :227     Min.   : 110         Min.   :5.080  
##  Robusta:  24   Guatemala:150     1st Qu.:1000         1st Qu.:7.420  
##                 Colombia :149     Median :1300         Median :7.580  
##                 Brazil   : 91     Mean   :1263         Mean   :7.582  
##                 Taiwan   : 69     3rd Qu.:1600         3rd Qu.:7.750  
##                 Honduras : 50     Max.   :4287         Max.   :8.750  
##                 (Other)  :340                                         
##      Flavor        Aftertaste       Acidity           Body      
##  Min.   :6.170   Min.   :6.170   Min.   :5.250   Min.   :6.330  
##  1st Qu.:7.330   1st Qu.:7.250   1st Qu.:7.330   1st Qu.:7.330  
##  Median :7.580   Median :7.420   Median :7.500   Median :7.500  
##  Mean   :7.531   Mean   :7.404   Mean   :7.538   Mean   :7.517  
##  3rd Qu.:7.750   3rd Qu.:7.580   3rd Qu.:7.750   3rd Qu.:7.670  
##  Max.   :8.830   Max.   :8.670   Max.   :8.750   Max.   :8.580  
##                                                                 
##     Balance        Uniformity       Clean.Cup        Sweetness     
##  Min.   :6.080   Min.   : 6.000   Min.   : 5.330   Min.   : 6.000  
##  1st Qu.:7.330   1st Qu.:10.000   1st Qu.:10.000   1st Qu.:10.000  
##  Median :7.500   Median :10.000   Median :10.000   Median :10.000  
##  Mean   :7.516   Mean   : 9.876   Mean   : 9.888   Mean   : 9.889  
##  3rd Qu.:7.750   3rd Qu.:10.000   3rd Qu.:10.000   3rd Qu.:10.000  
##  Max.   :8.750   Max.   :10.000   Max.   :10.000   Max.   :10.000  
##                                                                    
##  Cupper.Points    Total.Cup.Points     score         quality   
##  Min.   : 5.170   Min.   :63.08    Min.   :6.000   high  : 70  
##  1st Qu.: 7.250   1st Qu.:81.25    1st Qu.:8.000   low   : 17  
##  Median : 7.500   Median :82.50    Median :8.000   medium:989  
##  Mean   : 7.496   Mean   :82.24    Mean   :8.048               
##  3rd Qu.: 7.750   3rd Qu.:83.60    3rd Qu.:8.000               
##  Max.   :10.000   Max.   :90.58    Max.   :9.000               
## 

## `summarise()` ungrouping output (override with `.groups` argument)

Overall, We have 1076 observations dominated by arabica species with 1052 observations and 24 observations for robusta species.

Almost all variables have outliers as shown by the plot below.

## NULL

Most of coffee samples in these dataset come from Mexico.

## `summarise()` regrouping output by 'Species' (override with `.groups` argument)

## `summarise()` ungrouping output (override with `.groups` argument)

As shown above, coffee originated from United States (Hawaii) has the highest score among others.

Assessing correlation analysis is important before any type of modeling to check the importance of each predictor variable in our analysis. Sweetness is less correlated to other variables.

Adding the altitude variable does not have an impact on the correlation matrix. Thus, its not correlated to any qaulity measure variable.

We clearly observe from these density distributions that variables : Clean.Cup, Sweetness and Uniformity are skewed. The skew in these distributions is explained by the extreme values and score in the data.

3.2 Importance of variables

Arabica coffee tend to have a sweeter taste (Co n.d.) and that what makes Sweetness the most important factor in predicting the species of beans followed by Body.

##                        Arabica   Robusta
## altitude_mean_meters 0.5285171 0.5285171
## Aroma                0.6902329 0.6902329
## Flavor               0.6946293 0.6946293
## Aftertaste           0.7335036 0.7335036
## Acidity              0.7008278 0.7008278
## Body                 0.7446728 0.7446728
## Balance              0.6781527 0.6781527
## Uniformity           0.5165558 0.5165558
## Clean.Cup            0.5136050 0.5136050
## Sweetness            0.9957818 0.9957818
## Cupper.Points        0.7248693 0.7248693
## Total.Cup.Points     0.6289013 0.6289013
## score                0.5251901 0.5251901

As all coffee are Arabica, it’s clear that the variable altitude doesn’t play an important role in predicting the quality of the coffee. As mentioned above, Arabica coffee tend to be sweetness so Sweetness is a not an important predictor of quality of a bean when all coffee beans are Arabica.

##                           high       low    medium
## altitude_mean_meters 0.7752101 0.7752101 0.7239564
## Aroma                1.0000000 1.0000000 0.9665735
## Flavor               1.0000000 1.0000000 0.9805210
## Aftertaste           1.0000000 1.0000000 0.9768334
## Acidity              1.0000000 1.0000000 0.9616207
## Body                 0.9995798 0.9995798 0.9304059
## Balance              1.0000000 1.0000000 0.9812348
## Uniformity           0.8428571 0.8428571 0.8267115
## Clean.Cup            0.9701681 0.9701681 0.9630346
## Sweetness            0.7512605 0.7512605 0.7374948
## Cupper.Points        1.0000000 1.0000000 0.9764171
## Total.Cup.Points     1.0000000 1.0000000 1.0000000
## score                1.0000000 1.0000000 1.0000000
## `summarise()` ungrouping output (override with `.groups` argument)

## `summarise()` ungrouping output (override with `.groups` argument)

All variables are somehow related with exception of altitude. The altitude which does not have a direct effect of the quality of coffee it affects mainly the cultivation of beans.

We visualize the probability distributions or density functions of each variable. We observe from these density distributions that variables : Clean.Cup, Sweetness and Uniformity are skewed. The skew in these distributions is explained by the extreme values and score in the data.

3.3 Modeling

Dataset is partitioned in train set (\(70\%\)) and test set (\(30\%\)). These two sets are standardized train_s and test_s and normalized train_n and train_n. Linear regression are sensitive to data so we trained our models on both standardized and normalized sets. Then, we compare RMSE, r-squared and adjusted r-squared results. The model with the lowest RMSE and highest r-squared and adjusted r-squared is selected.

3.3.1 Model 1: Predicting species with linear regression

We will start by predicting the species of coffee using linear regression models that take the form of \(Y=f(X)+\epsilon\) for an unknown function \(f\) where \(\epsilon\) is a mean-zero random error. If \(f\) is a linear function then we can write our multiple regression model as: \[Y=\beta_0+\beta_1X_1+\beta_2X_2+\dots+\beta_nX_n+\epsilon\] where \(X_i\) represents the \(i\)th predictor and \(\beta_i\) represents the association between the variable and the response. \(\beta_i\) represent the slope of \(i\)th predictor so we can interpret it as the variation of \(Y\) moving \(X_i\) one unit assuming all other predictors as fixed.

Models are resumed in the following table:

Model Data Predictors Outcome
Model 1 Standardized Quality measures without altitude Species
Model 2 Standardized Quality measures with altitude Species
Model 3 Standardized Quality measures with altitude and score Species
Model 4 Normalized Quality measures without altitude Species
Model 5 Normalized Quality measures with altitude Species
Model 6 Normalized Quality measures with altitude and score Species

3.3.2 Model : Predicting score using linear regression

In the follow, we remove Countries, total points species columns from the data. In addition, we remove all data related to robusta coffee. We standardized and normalized data for regression models. In classified algorithms, we used raw data.

Multiple regression models have been described above and have been applied to predict the score of coffee. The table below resumes data used and predictors.

Model Data Predictors Outcome
Model 7 Standardized Altitude Score
Model 8 Standardized Quality measures Score
Model 9 Standardized Quality measures and altitude Score
Model 10 Normalized Altitude Score
Model 11 Normalized Quality measures Score
Model 12 Normalized Quality measures and altitude Score

3.3.3 Model 13: Predicting quality using Knn model

K-Nearest Neighbors algorithm is a supervised machine learning algorithm simple and easy to use. Knn assume that similar data points are neighbors (close to each others). Knn calculated the distance between neighbor data points called proximity. To choose the best k that minimizes the number of errors, we run KNN algorithm several times with different k.

3.3.4 Model 14: Predicting quality using rpart model

A decision tree model is a tree-like graph including probability event outcomes. It’s an intuitive algorithm that is easily applied in modeling2. It uses tree representation to solve the problem. Starting from the root and going down each leaf node represents a decision or terminal node.

3.3.5 Model 15: Predicting quality using random forest

Random forests are generally an improved version of decision trees. The forest builds multiple decision trees and merge them to get more accurate and stable outcome. Random forests’ biggest advantage is that it can be used for both regression and classification. Here, we used it as a classifier and applied it to predict the quality of the coffee.

4 Results

4.1 Linear regression for predicting coffee species

All predicted scores shows the same RMSE. As we know, RMSE is a measure data concentration and R-squared is a measure of fit. From the observations of all RMSE values and R-squared, we have concluded that the best linear regression model. Noteworthy, the regression model provides a reasonable but not perfect fit to the data because \(0.67\) is not to close too \(1\).

Model Data RMSE r.squared Adjusted.r.squared
1 Standardized 0.1111111 0.6739291 0.6695287
2 Standardized 0.1111111 0.6740123 0.6691666
3 Standardized 0.1111111 0.6740362 0.6687431
4 Normalized 0.1111111 0.6739291 0.6695287
5 Normalized 0.1111111 0.6740123 0.6691666
6 Normalized 0.1111111 0.6740362 0.6687431

4.2 Linear regression for predicting coffee score

All predicted scores shows different RMSE. The best model has the lowest one. Therefore, when the data is normalized, predictors including quality measures and altitude provides a perfect fit to the data. About \(99\%\) of the variability in the data can be explained by the fitted linear regression model. The high adjusted R-squared value means adding new predictor improveed our regression model.

Model Data RMSE r.squared Adjusted.r.squared
7 Standardized 0.9837560 0.0168657 0.0155244
8 Standardized 0.7643157 0.4466669 0.4390242
9 Standardized 0.7622242 0.4475609 0.4391558
10 Normalized 0.9837560 0.0168657 0.0155244
11 Normalized 0.7643157 0.4466669 0.4390242
12 Normalized 0.0648084 0.9879322 0.9877146

4.3 Knn

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction high low medium
##     high     15   0      0
##     low       0   2      0
##     medium    6   4    290
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9685          
##                  95% CI : (0.9428, 0.9848)
##     No Information Rate : 0.9148          
##     P-Value [Acc > NIR] : 0.00009951      
##                                           
##                   Kappa : 0.7592          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: high Class: low Class: medium
## Sensitivity              0.71429   0.333333        1.0000
## Specificity              1.00000   1.000000        0.6296
## Pos Pred Value           1.00000   1.000000        0.9667
## Neg Pred Value           0.98013   0.987302        1.0000
## Prevalence               0.06625   0.018927        0.9148
## Detection Rate           0.04732   0.006309        0.9148
## Detection Prevalence     0.04732   0.006309        0.9464
## Balanced Accuracy        0.85714   0.666667        0.8148

In Knn model, we get a high accuracy but sensitivity for low quality is low. It might be the result of the number of low quality coffee in our dataset.

4.4 Regression tree-rpart

##       Accuracy          Kappa  AccuracyLower  AccuracyUpper   AccuracyNull 
##      0.9463722      0.6296729      0.9155234      0.9684531      0.9148265 
## AccuracyPValue  McnemarPValue 
##      0.0226431            NaN

The most important variables in the decision tree model are the body and flavor. Other variables seem with no effect on predicting quality of coffee. Routes for predicting quality are shown in the graph above.

4.5 Random Forest

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction high low medium
##     high     13   0      2
##     low       0   3      1
##     medium    8   3    287
## 
## Overall Statistics
##                                          
##                Accuracy : 0.9558         
##                  95% CI : (0.927, 0.9756)
##     No Information Rate : 0.9148         
##     P-Value [Acc > NIR] : 0.003389       
##                                          
##                   Kappa : 0.6768         
##                                          
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: high Class: low Class: medium
## Sensitivity              0.61905   0.500000        0.9897
## Specificity              0.99324   0.996785        0.5926
## Pos Pred Value           0.86667   0.750000        0.9631
## Neg Pred Value           0.97351   0.990415        0.8421
## Prevalence               0.06625   0.018927        0.9148
## Detection Rate           0.04101   0.009464        0.9054
## Detection Prevalence     0.04732   0.012618        0.9401
## Balanced Accuracy        0.80615   0.748392        0.7911

Random forest algorithm predicted the quality of coffee with high accuracy. The error for medium decrease and stabilize faster than high and low quality. Number of observations might be the reason.

5 Conclusion and discussion

This project started by exploring and visualizing coffee distributions. We built our models starting from the baseline till a more complicated model. Two datasets were used: Arabica and Robusta.

The machine learning models used predicted species of coffee with accuracy. This is due to a large number of Arabica observations. In addition, predicting the quality of coffee was attained with high accuracy. The performance of each model predicting the quality of coffee were discussed in the results section.

The data is imbalanced between species and quality. Thus, this report felt limited in the amount of data being trained and tested. It’s interesting to gather more data and balance samples significantly to improve accuracy for low and high quality. A large size of observations might be bring advantages in preventing overfitting, identifying outliers and give more reliable results with more precision.

Since data were gathered from a website and it was created and cleaned by a person and not an official organization thus the results might have been impacted. Though, it’s important to include machine learning in the coffee industry as it’s a growing sector. Further work on a project such as this should include sales and prices and comparing more models on large data sets.

References

Andrea Trevisan. n.d. “What Are the Factors That Affect Coffee Quality?” Accessed June 22, 2020. https://blog.eureka.co.it/en/factors-that-affect-coffee-quality.

Andrew Menke. 2018. “The Global Coffee Industry.” https://globaledge.msu.edu/blog/post/55607/the-global-coffee-industry.

Co, Blackout Coffee. n.d. “Coffee Characteristics: What Affects the Quality of Coffee Before You Brew It.” Blackout Coffee Co. Accessed June 22, 2020. https://www.blackoutcoffee.com/blogs/the-reading-room/coffee-characteristics-what-affects-the-quality-of-coffee-before-you-brew-it.

Satista. n.d. “Coffee - Worldwide Statista Market Forecast.” Statista. Accessed June 22, 2020. https://www.statista.com/outlook/30010000/100/coffee/worldwide.