Following on from the exploratory data analysis of reviews of coffee quality from different suppliers, we will now attempt to model the main quality metric, Total.Cup.Points, based on the characteristics of the beans used to make the coffee.
We have 1338 rows of data, of which the first 10 rows are given below.
We would like to develop a random forest model to describe how Total.Cup.Points depends on the input variables of this data. To choose the optimal parameters to use with our model, we would like to crossvalidate using training data, before testing the chosen parameters using test data.
Feature Importance
To estimate the relative importance of the features, we fit an unoptimised random forest model to all the data and determine variable importance from the model.

The feature variables with greatest influence on quality appear to be:
- Altitude
- Category.Two.Defects
- Owner
- Moisture
- Weight of Sample Bag
- Grading Month
- Variety
- Category.One.Defects
- Harvest and Grading Year
Processing method, Quakers, Species and Color appear to not be as important.
We have difficulties with this data set in that some of the categorical variables have low frequency values within them. These factor levels may not appear in all folds of the crossvalidation data, and may not be present in both the training and test data. If we use our workflow to just create dummies for these variables as is, we will encounter errors during prediction steps when the training and testing variables don’t match.
To compensate for this, we will identify our top 14 countries of origin, top 6 Owners, top 5 varieties and the top 3 processing methods and change the other values in these categories to “Other”. We are reducing the number of distinct categories for each of these variables.
After removing rows with missing data for Owner, Country or Variety, a random 20% will be removed to act as the test set. We will also randomly define the 4-fold crossvalidation splits for the training data.
Training data rows = 890
Testing data rows = 222
Set up Pre-treatment Recipes
We now define the pretreatment recipes. We are going to impute some data into missing values for crossvalidation. We will impute the mean altitude, the median Harvest.Year, and the mode for the categorical variables processing method and Color. We will also normalize numeric variables and remove those that are highly correlated. We will create variables for grading month and year and create dummy variables for the remaining categorical variables.
recipe_1 <-
train %>%
recipe(Total.Cup.Points ~ .) %>%
step_meanimpute(mean_alt) %>%
step_medianimpute(Harvest.Year) %>%
step_modeimpute(pm, Color) %>%
step_normalize(all_numeric(), -all_outcomes(), -Harvest.Year) %>%
step_date(new_date, features = c("month", "year")) %>%
step_rm(new_date) %>%
step_dummy(Owner, Species, Color, coo, var, pm, new_date_month) %>%
step_corr(all_numeric(), -all_outcomes())
What does the subsequent training frame look like?
Random Forest Model
We will create a tuning workflow in order to optimise hyperparameters for a random forest regression model.
We will be optimising trees and mtry values for our ranger model, using crossvalidation results of our training data. Mean absolute error (MAE) will be evaluated for each set of predictions.
rf_model <-
rand_forest(trees = tune(), mtry = tune()) %>%
set_engine("ranger", importance = "impurity") %>%
set_mode("regression")
grid_rf <-
grid_max_entropy(
trees(range = c(100, 2000)),
mtry(range = c(3, 15)),
size = 30)
my_metrics <- metric_set(mae)
After crossvalidating for a range of trees and mtry values, let’s plot values of MAE and list the best models.

We will pick the values of trees and mtry with the lowest error. Now we will choose these values to train a model using the training data and fit the model to the test data. Let’s see how the predictions compare with reality for the test set and calculate the test MAE. The line on the plot is the line of equality.

Mean Absolute Error for Test set = 1.421002
Our test set error is better than our best crossvalidation error.
Linear Model
Let’s compare this with a linear regression model. We will be optimising mixture and penalty values for our glmnet model, using crossvalidation results of our training data. Mean absolute error (MAE) will be evaluated for each set of predictions.
model_glm <-
linear_reg(penalty = tune(), mixture = tune()) %>%
set_engine("glmnet")
grid_glm <-
grid_max_entropy(
penalty(range = c(0.01, 0.3), trans = NULL),
mixture(range = c(0, 1)),
size = 30)
my_metrics <- metric_set(mae)
After crossvalidating for a range of mixture and penalty values, let’s plot values of MAE and list the best models.

Again we will pick the values of mixture and penalty with the lowest error. Now we will choose these values to train a model using the training data and fit the model to the test data. Again let’s see how the predictions compare with reality for the test set and calculate the test MAE. The line on the plot is the line of equality.

Mean Absolute Error for Test set = 1.479737
Again, the error on the test set is better than that of the crossvalidation results.
Overall, it appears that the random forest model is better at predicting coffee quality than the linear regression model. It may be possible to reduce the prediction error further by combining the two model predictions.
End
---
title: "Coffee Quality Score Modelling"
output: html_notebook
---

Following on from the exploratory data analysis of reviews of coffee quality from different suppliers, we will now attempt to model the main quality metric, Total.Cup.Points, based on the characteristics of the beans used to make the coffee.  

```{r Initialise and load data, include=FALSE}
rm(list = ls())

library(here)
library(tidyverse)
library(tidymodels)
library(tune)
library(workflows)

options(tidymodels.dark = TRUE)

load(here("data/interim/bean_data_clean.RData"))

bean_data <- bean_data %>%
  rename(coo = Country.of.Origin,
         var = Variety,
         pm = Processing.Method) %>%
  mutate(coo = str_replace_all(coo, " ", "_"),
         coo = str_replace_all(coo, "\\(", ""),
         coo = str_replace_all(coo, "\\)", ""),
         var = str_replace_all(var, " ", "_"),
         pm = str_replace_all(pm, " ", "_"),
         pm = str_replace_all(pm, "_\\/", ""),
         pm = str_replace_all(pm, "-", ""))

model_data <- bean_data %>%
  filter(is.na(coo) == FALSE,
         is.na(var) == FALSE,
         is.na(Owner) == FALSE)

```

```{r Functions, include=FALSE}
metrics_plot <- function(model_fit, metric){
  
  data <- model_fit$.metrics
  params <- names(data[[1]])[!(names(data[[1]]) %in% names(data[[1]])[grep("^\\.", names(data[[1]]))])]
  
  fit_metrics <- unique(data[[1]]$.metric)
  
  if (!(metric %in% fit_metrics)){
    return("no metric by that name")
  }
  
  mean_data <- data[[1]] %>% select(-.estimate)
  
  mean_data$.estimate <- data %>%
    map(".estimate") %>%
    as_tibble(.name_repair = "unique") %>% 
    rowMeans()
  
  plot_data <- mean_data %>%
    filter(.metric == metric) %>%
    pivot_longer(cols = all_of(params), names_to = "param", values_to = "value")
  
  plot <- ggplot(plot_data, aes(value, .estimate)) + geom_point() +
    theme_bw() + facet_wrap(~ param, scales = "free_x") + labs(y = metric)
  
  return(plot)
}

check_dummy <- function(df, starts, keep){
  
  temp_df <- df %>%
    select(starts_with(starts))
  
  var_sum <- colSums(temp_df, na.rm = TRUE)
  
  keep_cols <- which(var_sum >= keep)
  
  temp_df <- temp_df %>%
    select(all_of(keep_cols))
  
  return_df <- df %>%
    select(-starts_with(starts)) %>%
    bind_cols(temp_df)
  
  return(return_df)
}


```

We have `r nrow(bean_data)` rows of data, of which the first 10 rows are given below.  

```{r Data example, echo=FALSE}
head(bean_data, 10)
```

We would like to develop a random forest model to describe how Total.Cup.Points depends on the input variables of this data. To choose the optimal parameters to use with our model, we would like to crossvalidate using training data, before testing the chosen parameters using test data.  

## <span style="color:teal;">Feature Importance</span>  

To estimate the relative importance of the features, we fit an unoptimised random forest model to all the data and determine variable importance from the model.  

```{r Define vars, include=FALSE}
var_set_imp <- c("Owner", "Species", "Color", "coo", "var", "pm", "mean_alt", "Harvest.Year", "new_wt", "new_date", 
                 "Moisture", "Category.One.Defects", "Category.Two.Defects", "Quakers")

imp_model_data <- bean_data %>%
  select(Total.Cup.Points, var_set_imp)
```

```{r Variable Importance, echo=FALSE, message=FALSE}
recipe_all <-
  imp_model_data %>%
  recipe(Total.Cup.Points ~ .) %>%
  step_naomit(var, pm, Color) %>%
  step_meanimpute(mean_alt) %>%
  step_medianimpute(Harvest.Year) %>%
  step_date(new_date, features = c("month", "year")) %>%
  step_rm(new_date) %>%
  step_zv(all_predictors()) %>%
  step_normalize(mean_alt, new_wt, Moisture, Category.One.Defects, Category.Two.Defects, Quakers)

all_prep <- recipe_all %>% prep() %>% juice()

simple_rf_model <- 
  rand_forest(trees = 200) %>% 
  set_engine("ranger", importance = "impurity") %>% 
  set_mode("regression")

rf_model_fit <- simple_rf_model %>%
  fit(Total.Cup.Points ~ ., data = all_prep)

imp_terms <- rf_model_fit$fit$variable.importance

imp_plot_df <- data.frame(variable = names(imp_terms), importance = imp_terms)
imp_plot_df$variable <- as.character(imp_plot_df$variable)

imp_plot <- ggplot(imp_plot_df, aes(reorder(variable, importance), importance)) + geom_point() + coord_flip() +
  theme_bw() + ggtitle("Relative Importance of Variables", subtitle = "Simple Random Forest Model") +
  labs(x = "Variable")
ggsave(here("plots/imp_plot.png"), height = 6, dpi = 200)

knitr::include_graphics(here("plots/imp_plot.png"))
```

The feature variables with greatest influence on quality appear to be:

* Altitude
* Category.Two.Defects
* Owner
* Moisture
* Weight of Sample Bag
* Grading Month
* Variety
* Category.One.Defects
* Harvest and Grading Year

Processing method, Quakers, Species and Color appear to not be as important.  

We have difficulties with this data set in that some of the categorical variables have low frequency values within them.  These factor levels may not appear in all folds of the crossvalidation data, and may not be present in both the training and test data. If we use our workflow to just create dummies for these variables as is, we will encounter errors during prediction steps when the training and testing variables don't match.  

To compensate for this, we will identify our top 14 countries of origin, top 6 Owners, top 5 varieties and the top 3 processing methods and change the other values in these categories to "Other". We are reducing the number of distinct categories for each of these variables.  


```{r Create dummies, include=FALSE}
own_list <- c("juan luis alvarado romero", "racafe  cia sca", "exportadora de cafe condor sa", 
              "kona pacific farmers cooperative", "ipanema coffees", "cqi taiwan icp cqi")
coo_list <- c("Mexico", "Colombia", "Guatemala", "Brazil", "Taiwan", "United_States_Hawaii", "Honduras",
              "Costa Rica", "Ethiopia", "Tanzania", "Uganda", "Thailand", "Nicaragua", "Kenya")
var_list <- c("Caturra", "Bourbon", "Typica", "Catuai", "Hawaiian Kona")
pm_list <- c("Washed_Wet", "Natural_Dry", "Semiwashed_Semipulped")

own_notNA <- which(is.na(model_data$Owner) == FALSE)
own_inlist <- which(model_data$Owner %in% own_list)
model_data$Owner[setdiff(own_notNA, own_inlist)] <- "Other"

coo_notNA <- which(is.na(model_data$coo) == FALSE)
coo_inlist <- which(model_data$coo %in% coo_list)
model_data$coo[setdiff(coo_notNA, coo_inlist)] <- "Other"

var_notNA <- which(is.na(model_data$var) == FALSE)
var_inlist <- which(model_data$var %in% var_list)
model_data$var[setdiff(var_notNA, var_inlist)] <- "Other"

pm_notNA <- which(is.na(model_data$pm) == FALSE)
pm_inlist <- which(model_data$pm %in% pm_list)
model_data$pm[setdiff(pm_notNA, pm_inlist)] <- "Other"

```


```{r Define crossval vars, include=FALSE}
var_set1 <- "coo"
var_set2 <- "var"
var_set3 <- "pm"

var_set4 <- c("Owner", "Species", "Color", "mean_alt", "Harvest.Year", "new_wt", "new_date", "Moisture",
              "Category.One.Defects", "Category.Two.Defects", "Quakers")

var_set <- c(var_set1, var_set2, var_set3, var_set4)

model_data <- model_data %>%
  select(Total.Cup.Points, var_set)
```


After removing rows with missing data for Owner, Country or Variety, a random 20% will be removed to act as the test set. We will also randomly define the 4-fold crossvalidation splits for the training data.


```{r Data split and cv split, echo=FALSE}
data_split <- model_data %>%
  initial_split(prop = 4/5)

test <- testing(data_split)
train <- training(data_split)
cv <- vfold_cv(train, v = 4)

cat("Training data rows =", nrow(train), "\n")
cat("Testing data rows =", nrow(test), "\n")

```

## <span style="color:teal;">Set up Pre-treatment Recipes</span>  

We now define the pretreatment recipes.  We are going to impute some data into missing values for crossvalidation. We will impute the mean altitude, the median Harvest.Year, and the mode for the categorical variables processing method and Color. We will also normalize numeric variables and remove those that are highly correlated. We will create variables for grading month and year and create dummy variables for the remaining categorical variables.  

```{r Pretreat 1}
recipe_1 <-
  train %>%
  recipe(Total.Cup.Points ~ .) %>%
  step_meanimpute(mean_alt) %>%
  step_medianimpute(Harvest.Year) %>%
  step_modeimpute(pm, Color) %>%
  step_normalize(all_numeric(), -all_outcomes(), -Harvest.Year) %>%
  step_date(new_date, features = c("month", "year")) %>%
  step_rm(new_date) %>% 
  step_dummy(Owner, Species, Color, coo, var, pm, new_date_month) %>% 
  step_corr(all_numeric(), -all_outcomes())


```

What does the subsequent training frame look like?  

```{r Training frame, echo=FALSE, message=FALSE, warning=FALSE}
train_prep <- recipe_1 %>% prep() %>% juice()
head(train_prep, 10)
```


## <span style="color:teal;">Random Forest Model</span>  

We will create a tuning workflow in order to optimise hyperparameters for a random forest regression model.  

We will be optimising **trees** and **mtry** values for our ranger model, using crossvalidation results of our training data. Mean absolute error (MAE) will be evaluated for each set of predictions.  


```{r Workflow RF}
rf_model <- 
  rand_forest(trees = tune(), mtry = tune()) %>% 
  set_engine("ranger", importance = "impurity") %>% 
  set_mode("regression")

grid_rf <- 
  grid_max_entropy(
    trees(range = c(100, 2000)),
    mtry(range = c(3, 15)),
    size = 30)

my_metrics <- metric_set(mae)

```


```{r Tune RF, include=FALSE}
workflow_rf <- 
  workflow() %>% 
  add_recipe(recipe_1) %>%
  add_model(rf_model)

rf_fit <- tune_grid(
  workflow_rf,
  resamples = cv,
  grid = grid_rf,
  metrics = my_metrics,
  control = control_grid(verbose = TRUE)
)
```

After crossvalidating for a range of **trees** and **mtry** values, let's plot values of MAE and list the best models.

```{r Plot metrics RF, echo=FALSE, message=FALSE}
mae_plot <- metrics_plot(rf_fit, metric = "mae")
ggsave(here::here("plots/metrics_1.png"), height = 5, dpi = 200)

knitr::include_graphics(here::here("plots/metrics_1.png"))

show_best(rf_fit, metric = "mae")

```

We will pick the values of **trees** and **mtry** with the lowest error. Now we will choose these values to train a model using the training data and fit the model to the test data. Let's see how the predictions compare with reality for the test set and calculate the test MAE. The line on the plot is the line of equality.  


```{r Test prediction RF, echo=FALSE, message=FALSE}
tuned_rf <-
  workflow_rf %>% 
  finalize_workflow(select_best(rf_fit, metric = "mae")) %>% 
  fit(data = train)

pred_rf <- predict(tuned_rf, test) %>%
  bind_cols(test %>% select(Total.Cup.Points))

pred_plot <- ggplot(pred_rf, aes(.pred, Total.Cup.Points)) + geom_point(colour = "tan3") + 
  geom_abline(slope = 1, intercept = 0) +
  theme_bw() + ggtitle("Test Set Predictions for Total.Cup.Points", 
                       subtitle = "Final Random Forest Model") + 
  labs(x = "Prediction", y = "Actual")
ggsave(here("plots/pred_plot.png"), height = 5, dpi = 200)

knitr::include_graphics(here("plots/pred_plot.png"))

cat("Mean Absolute Error for Test set =", mae_vec(truth = pred_rf$Total.Cup.Points, estimate = pred_rf$.pred))
```
Our test set error is better than our best crossvalidation error.  

## <span style="color:teal;">Linear Model</span>  

Let's compare this with a linear regression model. We will be optimising **mixture** and **penalty** values for our glmnet model, using crossvalidation results of our training data. Mean absolute error (MAE) will be evaluated for each set of predictions.  


```{r Workflow GLM}
model_glm <- 
  linear_reg(penalty = tune(), mixture = tune()) %>% 
  set_engine("glmnet")

grid_glm <- 
  grid_max_entropy(
    penalty(range = c(0.01, 0.3), trans = NULL),
    mixture(range = c(0, 1)),
    size = 30)

my_metrics <- metric_set(mae)

```


```{r Tune GLM, include=FALSE}
workflow_glm <- 
  workflow() %>% 
  add_recipe(recipe_1) %>%
  add_model(model_glm)

glm_fit <- tune_grid(
  workflow_glm,
  resamples = cv,
  grid = grid_glm,
  metrics = my_metrics,
  control = control_grid(verbose = TRUE)
)
```

After crossvalidating for a range of **mixture** and **penalty** values, let's plot values of MAE and list the best models.

```{r Plot metrics GLM, echo=FALSE, message=FALSE}
mae_plot <- metrics_plot(glm_fit, metric = "mae")
ggsave(here::here("plots/metrics_2.png"), height = 5, dpi = 200)

knitr::include_graphics(here::here("plots/metrics_2.png"))

show_best(glm_fit, metric = "mae")

```

Again we will pick the values of **mixture** and **penalty** with the lowest error. Now we will choose these values to train a model using the training data and fit the model to the test data. Again let's see how the predictions compare with reality for the test set and calculate the test MAE. The line on the plot is the line of equality.  


```{r Test prediction GLM, echo=FALSE, message=FALSE}
tuned_glm <-
  workflow_glm %>% 
  finalize_workflow(select_best(glm_fit, metric = "mae")) %>% 
  fit(data = train)

pred_glm <- predict(tuned_glm, test) %>%
  bind_cols(test %>% select(Total.Cup.Points))

pred_plot <- ggplot(pred_glm, aes(.pred, Total.Cup.Points)) + geom_point(colour = "tan4") + 
  geom_abline(slope = 1, intercept = 0) +
  theme_bw() + ggtitle("Test Set Predictions for Total.Cup.Points", 
                       subtitle = "Final Linear Model") + 
  labs(x = "Prediction", y = "Actual")
ggsave(here("plots/pred_plot.png"), height = 5, dpi = 200)

knitr::include_graphics(here("plots/pred_plot.png"))

cat("Mean Absolute Error for Test set =", mae_vec(truth = pred_glm$Total.Cup.Points, estimate = pred_glm$.pred))
```

Again, the error on the test set is better than that of the crossvalidation results.  

Overall, it appears that the random forest model is better at predicting coffee quality than the linear regression model. It may be possible to reduce the prediction error further by combining the two model predictions.   

***

End





