Executive Summary

In this project, the data science team uses supervised machine learning models to predict coffee quality ratings and to identify the factors most strongly associated with them. The analysis was conducted to provide business insights to the executive team at the coffee chain Coffee Crew before they venture into producing their own trademarked coffee.

Business tasks:

  1. Determine whether certain characteristics of coffee beans are associated with customer preference. The executive team would ideally like to use this information in future coffee production and sales investments.
  2. Provide recommendations, based on the data findings, on what factors to consider before starting a coffee production business.

Findings

Our best model for predicting coffee quality, a Random Forest classifier, performed moderately when tested on unseen data, with a score of 0.67 (1.0 being the highest possible score). Our regression models performed worse: the best cross-validated \({R}^2\) score was about 0.17, meaning that only about 17% of the variability in the coffee quality rating was explained by the model, while the remaining 83% was unaccounted for. The three most influential factors in the classifier's predictions were altitude, moisture, and Colombia as the coffee bean's country of origin.

Recommendations

We recommend refining the methodology to improve this prediction model before it is applied. Incorrectly classifying the quality of coffee, or using these results in their present form, may have a large economic impact on the company. We elaborate on the limitations of this project and the next steps for our team at the end of the report.

Introduction

With worldwide demand for coffee growing and the Coffee Crew franchise expanding, the company has decided to start its own coffee production. Recent statistics show that world coffee production decreased by 2.2% in 2019-20 while global consumption increased by 0.3% (Economic Cooperation and Deutsche Gesellschaft für Internationale Zusammenarbeit (GIZ) GmbH 2020). Coffee export prices also rose by 7.5% in November 2021 (195.17 US cents/lb) compared with October 2021 (181.57 US cents/lb). Before venturing into the coffee production business, it is important to understand what factors are associated with the quality of coffee. Prior literature identifies several such factors, including raw bean size, bean colour and shape, altitude, weather, and preprocessing techniques (Njoroge 1998).

Business tasks:

  1. In order to proceed with this business plan, the executive team would like to know whether certain characteristics of coffee beans are associated with customer preference. They would ideally like to use this information in future coffee production and sales investments.
  2. Provide recommendations based on data findings on what factors to consider before they start a coffee production business.

Research Questions:

The data science team proposed two research questions to help develop recommendations for the business tasks at hand.

  1. Given a set of characteristics, what is the quality of a cup of arabica coffee?
  2. Which features are most influential in determining coffee quality?

In this analysis, we use machine learning techniques to build a model that predicts the quality of coffee. If the model does this effectively, it could be helpful to Coffee Crew’s new business venture. We note that quality is not the only thing that makes a coffee successful. However, having this information available may support decisions about where to start production and which factors to consider.

Methods

Data collection and cleaning

A public dataset initially collected by the Coffee Quality Institute in January of 2018 was used for this data analysis. The dataset was retrieved from tidytuesday (Mock 2022), courtesy of James LeDoux, a Data Scientist at Buzzfeed.

We chose this dataset as it contains data on 1312 Arabica coffee beans, which is the coffee bean species of interest for the executives at Coffee Crew. The dataset contains information on quality measures, metadata on bean preprocessing methods, and coffee bean farming data. The data is collected on Arabica coffee beans from across the world and professionally rated on a scale of 0-100 based on factors like acidity, sweetness, fragrance, balance, etc.

While we trust the integrity of the tidytuesday platform and the cleaned dataset provided by James LeDoux, we performed additional data cleaning and processing as necessary to answer our research questions.

Exploratory data analysis

Before running our machine learning models, we performed a preliminary exploratory data analysis to look for interesting patterns in the data. We asked whether the coffee quality rating differs by bean colour, by country of origin, and by coffee growing region of the world.
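
The comparison of average ratings across groups can be sketched as below. This is a minimal illustration on made-up data, not the code used in the analysis; the column names (`country_of_origin`, `color`, `total_cup_points`) and the helper `mean_rating_by` are assumptions introduced for the example.

```python
import pandas as pd

# Hypothetical toy data; values and column names are assumptions for illustration only.
toy = pd.DataFrame({
    "country_of_origin": ["Ethiopia", "Ethiopia", "Haiti", "Haiti", "Mexico"],
    "color": ["Green", "Bluish-Green", "Green", "Blue-Green", "Green"],
    "total_cup_points": [88.0, 86.0, 78.0, 80.0, 82.0],
})


def mean_rating_by(df, group_col, rating_col="total_cup_points"):
    """Return the average rating per group, sorted from highest to lowest."""
    return df.groupby(group_col)[rating_col].mean().sort_values(ascending=False)


by_country = mean_rating_by(toy, "country_of_origin")
by_colour = mean_rating_by(toy, "color")
print(by_country)
print(by_colour)
```

The same helper can be reused for any grouping column, so the three exploratory questions differ only in the column passed to it.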

Machine learning methodology

To address the business task, i.e., to identify whether certain characteristics of coffee beans are associated with customer preference, we trained several machine learning models to see whether they could learn these relationships from our dataset and predict the customer preference rating. We then identified the top three factors that influenced the prediction.

We selected two regression models (Ridge and Random Forest Regressor) (Pedregosa et al. 2011), which are established methods for identifying which production, farming, or quality measures have an impact on coffee rating. These models were applicable because the variable of interest, the coffee quality rating, is numerical. We scaled all numerical variables to avoid long training times and reduced model accuracy. Variables that were not numerical were converted to numerical values based on categories assigned by the team analysts.

While regression modelling was the pre-determined methodology, we later re-processed our data and explored a classification model to see whether we could improve our coffee rating prediction score. We divided the coffee rating variable into “Good” and “Poor” categories based on the median coffee rating and then trained the classification model on this data. Finally, we applied validation tests to see how well these models perform on unseen data.
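
As a rough illustration of this workflow, the sketch below scales numeric features, encodes a categorical feature, and cross-validates a Ridge and a Random Forest regressor with scikit-learn. It is a sketch under stated assumptions rather than the pipeline used in the analysis: the data are synthetic, the column names are hypothetical, one-hot encoding stands in for the analysts' categorization, and reporting the spread as the standard deviation across folds is our assumption.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data; column names and values are assumptions.
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "altitude": rng.normal(1500, 300, n),
    "moisture": rng.uniform(0.08, 0.14, n),
    "country": rng.choice(["Colombia", "Ethiopia", "Mexico"], n),
})
y = 82 + 0.002 * (X["altitude"] - 1500) + rng.normal(0, 1.0, n)

# Scale numeric columns; one-hot encode the categorical column (an assumption).
preprocessor = make_column_transformer(
    (StandardScaler(), ["altitude", "moisture"]),
    (OneHotEncoder(handle_unknown="ignore"), ["country"]),
)

models = {
    "Ridge": make_pipeline(preprocessor, Ridge()),
    "RForest_Regressor": make_pipeline(
        preprocessor, RandomForestRegressor(n_estimators=50, random_state=0)
    ),
}

# cross_validate scores regressors with R^2 by default.
results = {}
for name, model in models.items():
    cv = cross_validate(model, X, y, cv=5, return_train_score=True)
    results[name] = {
        key: f"{vals.mean():.3f} (+/- {vals.std():.3f})" for key, vals in cv.items()
    }

summary = pd.DataFrame(results)
print(summary)
```

Keeping the preprocessing inside the pipeline means the scaler and encoder are refit on each training fold, so no information from the validation fold leaks into the model.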

Results

We performed some exploratory analysis to identify any specific patterns in the data. Three main questions were explored and data visualizations are provided below.

Question 1: Does coffee quality rating differ between the different colours of coffee beans?

Coffee quality rating by colour of coffee beans

The average coffee quality rating did not differ much by the colour of the coffee beans. Green coffee beans had a slightly lower average rating than the blue-green and bluish-green groups.

Question 2: Does coffee quality rating differ by country of origin of coffee beans?

Coffee quality rating by country of origin of coffee beans

The average coffee rating differed between countries of origin. Coffee beans from Ethiopia had the highest average rating, and those from Haiti had the lowest.

Question 3: Does coffee quality rating differ by coffee growing regions of the world?

Coffee quality rating by coffee growing regions of the world

The average coffee quality rating seems to be slightly higher in East Africa and the Arabian Peninsula.

Our initial regression models did not perform well on our dataset. In cross-validation, the Ridge and Random Forest models gave \({R}^2\) scores of -0.065 and 0.169 respectively (Table 1). This score is used to evaluate the performance of the models, with 1.0 being the highest. For the Random Forest model, the score means that about 17% of the variability in the coffee rating variable is explained by the model, while the remaining 83% is unaccounted for. The negative score for Ridge regression indicates that it performed worse than a baseline that always predicts the mean rating.
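
To make the interpretation of \({R}^2\) concrete, the sketch below computes it by hand on a few made-up ratings and shows how a negative value arises. The numbers and the helper name `r2_manual` are illustrative assumptions and are unrelated to the coffee dataset.

```python
import numpy as np
from sklearn.metrics import r2_score


def r2_manual(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares / total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


# Made-up ratings for illustration only.
y_true = [80.0, 82.0, 84.0, 86.0]
mean_pred = [83.0, 83.0, 83.0, 83.0]  # always predict the mean rating
bad_pred = [86.0, 84.0, 82.0, 80.0]   # predictions in reverse order

r2_mean = r2_manual(y_true, mean_pred)  # baseline: explains none of the variance
r2_bad = r2_manual(y_true, bad_pred)    # worse than the baseline, so negative

print(r2_mean, r2_bad)
print(r2_score(y_true, bad_pred))  # scikit-learn gives the same value
```

Predicting the mean for every sample gives a score of exactly 0, so any model with larger squared errors than that baseline ends up below zero.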

Table 1. Regression and Classification Cross-validation Results.

Metric        Ridge                RForest_Regressor    RF_classification
fit_time      0.003 (+/- 0.007)    0.349 (+/- 0.022)    0.130 (+/- 0.005)
score_time    0.003 (+/- 0.007)    0.008 (+/- 0.006)    0.009 (+/- 0.004)
test_score    -0.065 (+/- 0.594)   0.169 (+/- 0.128)    0.722 (+/- 0.058)
train_score   0.430 (+/- 0.019)    0.887 (+/- 0.008)    0.999 (+/- 0.000)

We tuned our models, which gave an \({R}^2\) score of 0.25 for the random forest regression model. With this model, we then tried to answer our main business question: what characteristics of coffee beans are most important in predicting coffee quality? As shown in Figure 4, the top three features from the random forest model were altitude, moisture, and Mexico as country of origin.
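
One common way to tune a random forest and then rank its features is sketched below. The report does not state which search method or hyperparameters were used, so `GridSearchCV` and the small grid here are assumptions, as are the synthetic data and the feature names.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data in which "altitude" matters most and "noise" not at all (assumption).
rng = np.random.default_rng(1)
n = 300
X = pd.DataFrame({
    "altitude": rng.normal(0, 1, n),
    "moisture": rng.normal(0, 1, n),
    "noise": rng.normal(0, 1, n),
})
y = 3.0 * X["altitude"] + 1.0 * X["moisture"] + rng.normal(0, 0.5, n)

# Hypothetical hyperparameter grid; regressors are scored with R^2 by default.
search = GridSearchCV(
    RandomForestRegressor(n_estimators=50, random_state=0),
    param_grid={"max_depth": [3, None], "min_samples_leaf": [1, 5]},
    cv=3,
)
search.fit(X, y)

# Impurity-based importances of the refit best model, highest first.
importances = pd.Series(
    search.best_estimator_.feature_importances_, index=X.columns
).sort_values(ascending=False)

print(search.best_params_)
print(importances)
```

Impurity-based importances sum to 1 and describe how much each feature contributed to the splits in the fitted trees; they rank features by their usefulness to the model rather than by any causal effect on quality.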

Random Forest Regressor Feature Importances

Although optimization improved the model score slightly, the \({R}^2\) scores remained below 0.5, so we also applied a Random Forest classification model. To do so, we first converted the numeric coffee rating variable into two categories, “Good” and “Poor”, based on the median coffee rating of 0.82. This model’s performance was more acceptable, with a score of 0.72 in cross-validation (Table 1) and 0.67 on the unseen test data. The top three features identified by the model were altitude, moisture, and Colombia as country of origin.
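
The median split and classification step can be sketched as follows. This is an illustration on synthetic data with hypothetical feature names; computing the median on the training portion only is our assumption, as the report does not say which data the threshold was computed on. Note that scikit-learn's `score` method for classifiers returns mean accuracy.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic ratings driven mostly by "altitude" (assumption for illustration).
rng = np.random.default_rng(2)
n = 400
X = pd.DataFrame({
    "altitude": rng.normal(0, 1, n),
    "moisture": rng.normal(0, 1, n),
})
rating = 82 + 2.0 * X["altitude"] + rng.normal(0, 1.0, n)

X_train, X_test, r_train, r_test = train_test_split(
    X, rating, test_size=0.2, random_state=0
)

# Split ratings at the training median into two classes.
threshold = r_train.median()
y_train = np.where(r_train > threshold, "Good", "Poor")
y_test = np.where(r_test > threshold, "Good", "Poor")

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)

# For classifiers, .score() is mean accuracy on the given data.
test_accuracy = clf.score(X_test, y_test)
print(f"threshold={threshold:.2f}, test accuracy={test_accuracy:.2f}")
```

Because the split is made at the median, the two classes are balanced, which means a model that guesses at random would be expected to score about 0.5.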

Random Forest Classifier Feature Importances

Limitations

Our analysis had several limitations, including the small size of the dataset and the limited types of features available for feature engineering and modelling. We removed several features because they were used in the calculation of the coffee rating itself: aroma, flavour, aftertaste, acidity, body, balance, uniformity, and sweetness are all individual contributors to the target coffee rating variable. In addition, many other features had to be discarded because they were not relevant to our models.

Conclusion

In summary, when predicting the coffee quality rating from a set of coffee bean characteristics, our best model achieved a score of only 0.67 on new, unseen data. The most important features associated with coffee quality according to this model were altitude, moisture, and Colombia as country of origin.

While these factors may be a starting point to consider before venturing into the coffee production business, we believe there are several improvements to be made to the models. They could be improved by including more relevant predictive factors or more data. Further review of the literature on coffee production and quality is needed to identify these factors and add them to our model. Additionally, our data science team plans to explore more classification models (Naive Bayes, Logistic Regression, etc.), as classification appeared to have a clear advantage over regression in this analysis.

Attribution

The data analysis for this project was completed as part of the course requirements for DSCI 522: Data Science Workflows at the University of British Columbia’s Master of Data Science Program. Kristin Bunyan, Michelle Wang, and Berkay Bulut are co-contributors to this project. Detailed code and data analytical reports can be found here.

This report was constructed using Rmarkdown (Allaire et al. 2021), Knitr (Xie 2021) and kableExtra (Zhu 2021) packages in R.

References

Allaire, JJ, Yihui Xie, Jonathan McPherson, Javier Luraschi, Kevin Ushey, Aron Atkins, Hadley Wickham, Joe Cheng, Winston Chang, and Richard Iannone. 2021. Rmarkdown: Dynamic Documents for R.
Economic Cooperation, German Federal Ministry for, and Development through the Deutsche Gesellschaft für Internationale Zusammenarbeit (GIZ) GmbH. 2020. “Coffee Development Report 2020.” International Coffee Organization. https://www.ncausa.org/Research-Trends/Economic-Impact.
Mock, Thomas. 2022. “Tidy Tuesday: A Weekly Data Project Aimed at the R Ecosystem.” https://github.com/rfordatascience/tidytuesday.
Njoroge, J. M. 1998. “Agronomic and Processing Factors Affecting Coffee Quality.” Outlook on Agriculture 27 (3): 163–66. https://doi.org/10.1177/003072709802700306.
Pedregosa, F., G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, et al. 2011. “Scikit-Learn: Machine Learning in Python.” Journal of Machine Learning Research 12: 2825–30.
Xie, Yihui. 2021. Knitr: A General-Purpose Package for Dynamic Report Generation in R. https://yihui.org/knitr/.
Zhu, Hao. 2021. kableExtra: Construct Complex Table with ’Kable’ and Pipe Syntax. https://CRAN.R-project.org/package=kableExtra.