Assignment

Based on the latest topics presented, bring a dataset of your choice and create a Decision Tree where you can solve a classification or regression problem and predict the outcome of a particular feature or detail of the data used. Switch variables to generate 2 decision trees and compare the results. Create a random forest for regression and analyze the results. Based on real cases where decision trees went wrong, and ‘the bad & ugly’ aspects of decision trees https://decizone.com/blog/the-good-the-bad-the-ugly-of-using-decision-trees, how can you change this perception when using the decision tree you created to solve a real problem? Format: document with screen captures & analysis.

Data

For this assignment, I used the “Flavors of Cacao” dataset from Kaggle: https://www.kaggle.com/datasets/techiesid01/flavours-of-cacao?resource=download. With this dataset, I will create a decision tree that solves a regression problem: predicting the final grade (final_grade) of a chocolate bar. The dataset will be split into training and testing partitions.

The following code sets up our libraries and imports the dataset.

library(tidymodels)
## Registered S3 method overwritten by 'tune':
##   method                   from   
##   required_pkgs.model_spec parsnip
## -- Attaching packages -------------------------------------- tidymodels 0.1.4 --
## v broom        0.7.12     v recipes      0.2.0 
## v dials        0.1.0      v rsample      0.1.1 
## v dplyr        1.0.8      v tibble       3.1.6 
## v ggplot2      3.3.5      v tidyr        1.2.0 
## v infer        1.0.0      v tune         0.1.6 
## v modeldata    0.1.1      v workflows    0.2.4 
## v parsnip      0.2.0      v workflowsets 0.1.0 
## v purrr        0.3.4      v yardstick    0.0.9
## Warning: package 'parsnip' was built under R version 4.1.3
## -- Conflicts ----------------------------------------- tidymodels_conflicts() --
## x purrr::discard() masks scales::discard()
## x dplyr::filter()  masks stats::filter()
## x dplyr::lag()     masks stats::lag()
## x recipes::step()  masks stats::step()
## x tune::tune()     masks parsnip::tune()
## * Dig deeper into tidy modeling with R at https://www.tmwr.org
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v readr   2.1.2     v forcats 0.5.1
## v stringr 1.4.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x readr::col_factor() masks scales::col_factor()
## x purrr::discard()    masks scales::discard()
## x dplyr::filter()     masks stats::filter()
## x stringr::fixed()    masks recipes::fixed()
## x dplyr::lag()        masks stats::lag()
## x readr::spec()       masks yardstick::spec()
library(readr)
library(rpart.plot)
## Loading required package: rpart
## 
## Attaching package: 'rpart'
## The following object is masked from 'package:dials':
## 
##     prune
# Used for Random Forest
library(ranger)
## Warning: package 'ranger' was built under R version 4.1.3
chocolate <- read_csv('https://raw.githubusercontent.com/logicalschema/spring2022/main/data622/hw2/flavors_of_cacao.csv')
## Rows: 1795 Columns: 9
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (6): company, specific_bean_origin_or_bar_name, cocoa_percent, company_l...
## dbl (3): ref, review_date, final_grade
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

This is a view of the imported dataset.

names(chocolate)
## [1] "company"                          "specific_bean_origin_or_bar_name"
## [3] "ref"                              "review_date"                     
## [5] "cocoa_percent"                    "company_location"                
## [7] "final_grade"                      "bean_type"                       
## [9] "broad_bean_origin"
head(chocolate)

Cleaning

Rows with missing values were dropped with drop_na(). The column cocoa_percent was read as a chr data type, so it was parsed to a number and converted to a proportion (double). The column ref was removed. The columns broad_bean_origin and bean_type were converted to factors.

chocolate <- chocolate %>% drop_na()


# Parses cocoa_percent and converts it to a proportion between 0 and 1
chocolate$cocoa_percent <- parse_number(chocolate$cocoa_percent)/100

# Removes the ref column by name rather than by position
chocolate <- chocolate %>% select(-ref)

# Convert some columns to factors
chocolate$broad_bean_origin  <- as.factor(chocolate$broad_bean_origin )
chocolate$bean_type  <- as.factor(chocolate$bean_type)
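
The summary of the cleaned data shows an unnamed bean_type level with 887 rows and an unnamed broad_bean_origin level with 73 rows, which drop_na() does not catch because the values are blank strings rather than NA. The following is a minimal sketch of the cleaning steps on hypothetical toy rows. It assumes the blank labels are empty or whitespace-only strings (if they turn out to be non-breaking spaces, the whitespace pattern would need to be widened), and the "Unknown" label is an illustrative choice, not part of the original analysis.

```r
library(readr)

# Hypothetical toy rows standing in for two columns of the raw CSV
raw <- data.frame(
  cocoa_percent = c("70%", "63.5%", "100%"),
  bean_type     = c("Criollo", " ", ""),
  stringsAsFactors = FALSE
)

# parse_number() drops the percent sign: "70%" -> 70, then 70 / 100 -> 0.70
raw$cocoa_percent <- parse_number(raw$cocoa_percent) / 100

# Assumption: blank labels are empty or whitespace-only strings.
# Give them an explicit label so they are visible as their own level.
blank <- trimws(raw$bean_type) == ""
raw$bean_type[blank] <- "Unknown"
raw$bean_type <- as.factor(raw$bean_type)

raw
```

Relabeling keeps the rows in the data. Converting the blanks to NA and calling drop_na() afterwards would instead remove roughly half of the dataset.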

This is a view of the data after our modification.

head(chocolate)
summary(chocolate)
##    company          specific_bean_origin_or_bar_name  review_date  
##  Length:1793        Length:1793                      Min.   :2006  
##  Class :character   Class :character                 1st Qu.:2010  
##  Mode  :character   Mode  :character                 Median :2013  
##                                                      Mean   :2012  
##                                                      3rd Qu.:2015  
##                                                      Max.   :2017  
##                                                                    
##  cocoa_percent   company_location    final_grade                   bean_type  
##  Min.   :0.420   Length:1793        Min.   :1.000                       :887  
##  1st Qu.:0.700   Class :character   1st Qu.:3.000   Trinitario          :418  
##  Median :0.700   Mode  :character   Median :3.250   Criollo             :153  
##  Mean   :0.717                      Mean   :3.186   Forastero           : 87  
##  3rd Qu.:0.750                      3rd Qu.:3.500   Forastero (Nacional): 52  
##  Max.   :1.000                      Max.   :5.000   Blend               : 41  
##                                                     (Other)             :155  
##           broad_bean_origin
##  Venezuela         :214    
##  Ecuador           :193    
##  Peru              :165    
##  Madagascar        :145    
##  Dominican Republic:141    
##                    : 73    
##  (Other)           :862

Partition

# Splitting the data 80/20
set.seed(32022)

data_split <- initial_split(chocolate, prop = 0.8, strata = 'bean_type')
chocolate_train <- training(data_split)
chocolate_test <- testing(data_split)
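
With factors that have many rare levels, it can be worth checking whether the test partition contains levels that never occur in the training partition. The sketch below does this on hypothetical toy data. The toy_ names are illustrative, and pooling rare levels is only one possible remedy, not the approach used in this report.

```r
library(rsample)

set.seed(32022)

# Hypothetical toy data: one common bean type and three that occur once each
toy <- data.frame(
  final_grade = round(runif(60, 1, 5), 2),
  bean_type   = factor(c(rep("Trinitario", 57), "Matina", "EET", "Beniano"))
)

toy_split <- initial_split(toy, prop = 0.8)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

# Levels that occur in the test rows but never in the training rows
unseen <- setdiff(as.character(unique(toy_test$bean_type)),
                  as.character(unique(toy_train$bean_type)))
unseen

# One possible remedy (an assumption, not the author's method): pool rare
# levels before splitting, e.g. forcats::fct_lump_min(bean_type, min = 10)
```

An empty result means every test level was seen during training. Any names it returns are levels the model has no information about.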

Regression Decision Tree

Constructing the First Model

I will construct a regression decision tree using the training dataset.

# Build the model specification for a decision tree
model_spec <- decision_tree() %>%
  set_mode("regression") %>%
  set_engine("rpart")

model_spec
## Decision Tree Model Specification (regression)
## 
## Computational engine: rpart

Fit the Data

The following code fits the model to chocolate_train, the training partition of the original dataset. It models final_grade as a function of cocoa_percent and bean_type.

# Train the model
model <- model_spec %>%
  fit(formula = final_grade ~ cocoa_percent + bean_type,
      data = chocolate_train)

# Information about the model
model
## parsnip model object
## 
## n= 1433 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 1433 334.539700 3.183182  
##    2) cocoa_percent>=0.905 22  10.536930 2.284091 *
##    3) cocoa_percent< 0.905 1411 305.941400 3.197201  
##      6) bean_type= ,Criollo (Amarru),Criollo (Ocumare),Forastero,Forastero (Arriba),Forastero (Arriba) ASS,Forastero(Arriba, CCN),Forastero, Trinitario,Nacional,Trinitario (Amelonado),Trinitario, Criollo,Trinitario, Forastero 812 182.707100 3.136392  
##       12) cocoa_percent>=0.755 92  16.887910 2.942935 *
##       13) cocoa_percent< 0.755 720 161.936100 3.161111 *
##      7) bean_type=Amazon,Amazon mix,Amazon, ICS,Beniano,Blend,Blend-Forastero,Criollo,CCN51,Criollo,Criollo (Ocumare 61),Criollo (Ocumare 77),Criollo (Porcelana),Criollo (Wild),Criollo, +,Criollo, Forastero,Criollo, Trinitario,EET,Forastero (Amelonado),Forastero (Arriba) ASSS,Forastero (Catongo),Forastero (Nacional),Forastero (Parazinho),Matina,Nacional (Arriba),Trinitario,Trinitario (85% Criollo),Trinitario, TCGA 599 116.161500 3.279633  
##       14) bean_type=Amazon,Blend,Criollo,Criollo (Ocumare 61),Criollo (Porcelana),Criollo, Trinitario,Forastero (Nacional),Nacional (Arriba),Trinitario 572 110.526700 3.263112 *
##       15) bean_type=Amazon mix,Amazon, ICS,Beniano,Blend-Forastero,Criollo,CCN51,Criollo (Ocumare 77),Criollo (Wild),Criollo, +,Criollo, Forastero,EET,Forastero (Amelonado),Forastero (Arriba) ASSS,Forastero (Catongo),Forastero (Parazinho),Matina,Trinitario (85% Criollo),Trinitario, TCGA 27   2.171296 3.629630 *

Predict the Data

The following code will use the dataset chocolate_test to make predictions about final_grade using the model that was built.

# Make predictions with the model.
predictions <- predict(model,
                       new_data = chocolate_test) %>%
  bind_cols(chocolate_test)

predictions
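
A single error metric makes the predictions easier to compare across models. The sketch below computes the RMSE with yardstick on a hypothetical stand-in for the predictions tibble. The values are made up for illustration, and the only assumption about the real object is that parsnip stores the predicted value in a column named .pred.

```r
library(yardstick)

# Hypothetical stand-in for `predictions`: the predicted value in `.pred`
# next to the observed final_grade
toy_predictions <- data.frame(
  .pred       = c(3.25, 3.25, 3.00, 3.50),
  final_grade = c(3.00, 3.50, 2.50, 4.00)
)

# Root mean squared error between observed and predicted grades
tree_rmse <- rmse(toy_predictions, truth = final_grade, estimate = .pred)
tree_rmse
```

Applied to the real tibble, the same call would be rmse(predictions, truth = final_grade, estimate = .pred).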

Visualization

# Visualization of the model
model$fit %>% rpart.plot(box.palette="RdBu", shadow.col="gray", nn=TRUE)
## Warning: Cannot retrieve the data used to build the model (so cannot determine roundint and is.binary for the variables).
## To silence this warning:
##     Call rpart.plot with roundint=FALSE,
##     or rebuild the rpart model with model=TRUE.

Analysis

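The root split is on cocoa_percent, but it only separates out the extreme: the 22 training rows with a cocoa_percent of 0.905 or higher have a mean final_grade of about 2.3, against about 3.2 for the remaining 1,411 rows. Below that threshold, surprisingly, cocoa_percent has only a modest influence on final_grade, and the tree splits mainly on bean_type. The highest-rated leaf (node 15, 27 training rows, mean final_grade of about 3.6) groups the following bean types: Amazon mix,Amazon, ICS,Beniano,Blend-Forastero,Criollo,CCN51,Criollo (Ocumare 77),Criollo (Wild),Criollo, +,Criollo, Forastero,EET,Forastero (Amelonado),Forastero (Arriba) ASSS,Forastero (Catongo),Forastero (Parazinho),Matina,Trinitario (85% Criollo),Trinitario, TCGA.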

Constructing the Second Model

I will construct a second regression decision tree, with additional predictors, using the training dataset.

Fit the Data

The following code fits the second model to chocolate_train, the training partition of the original dataset. It models final_grade as a function of cocoa_percent, bean_type, broad_bean_origin, and review_date.

# Train the model
model2 <- model_spec %>%
  fit(formula = final_grade ~ cocoa_percent + bean_type + broad_bean_origin + review_date,
      data = chocolate_train)

# Information about the model
model2
## parsnip model object
## 
## n= 1433 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 1433 334.539700 3.183182  
##    2) cocoa_percent>=0.905 22  10.536930 2.284091 *
##    3) cocoa_percent< 0.905 1411 305.941400 3.197201  
##      6) broad_bean_origin= ,Africa, Carribean, C. Am.,Burma,Carribean,Central and S. America,Colombia, Ecuador,Cost Rica, Ven,Costa Rica,Ecuador, Mad., PNG,El Salvador,Ghana,Ghana & Madagascar,Ghana, Domin. Rep,Ghana, Panama, Ecuador,Grenada,India,Ivory Coast,Jamaica,Liberia,Martinique,Mexico,Peru,Peru(SMartin,Pangoa,nacional),Peru, Ecuador, Venezuela,Peru, Madagascar,Philippines,Principe,Puerto Rico,Sao Tome,South America, Africa,Sri Lanka,St. Lucia,Togo,Trinidad,Trinidad-Tobago,Uganda,Ven., Trinidad, Mad.,Venezuela, Carribean,Venezuela, Ghana,Venezuela, Trinidad,West Africa 404 107.531400 3.056312  
##       12) bean_type= ,Criollo,Criollo (Amarru),Criollo, Trinitario,Forastero,Forastero, Trinitario,Trinitario,Trinitario, Criollo,Trinitario, Forastero 359  92.007310 3.018106  
##         24) broad_bean_origin= ,Africa, Carribean, C. Am.,Colombia, Ecuador,Cost Rica, Ven,Ghana, Domin. Rep,Ivory Coast,Martinique,Peru, Madagascar,Philippines,Principe,Puerto Rico,Sri Lanka,Trinidad-Tobago,Uganda,Venezuela, Trinidad,West Africa 76  23.289470 2.802632  
##           48) review_date< 2007.5 12   5.807292 2.145833 *
##           49) review_date>=2007.5 64  11.334960 2.925781 *
##         25) broad_bean_origin=Burma,Carribean,Central and S. America,Costa Rica,El Salvador,Ghana,Grenada,India,Jamaica,Liberia,Mexico,Peru,Peru(SMartin,Pangoa,nacional),Peru, Ecuador, Venezuela,Sao Tome,South America, Africa,St. Lucia,Togo,Trinidad,Venezuela, Ghana 283  64.241610 3.075972 *
##       13) bean_type=Amazon mix,Blend,Forastero (Nacional),Matina 45  10.819440 3.361111 *
##      7) broad_bean_origin=Australia,Belize,Bolivia,Brazil,Cameroon,Colombia,Congo,Cuba,Dom. Rep., Madagascar,Domincan Republic,Dominican Rep., Bali,Dominican Republic,Ecuador,Fiji,Gabon,Gre., PNG, Haw., Haiti, Mad,Guat., D.R., Peru, Mad., PNG,Guatemala,Haiti,Hawaii,Honduras,Indonesia,Mad., Java, PNG,Madagascar,Malaysia,Nicaragua,Nigeria,Panama,Papua New Guinea,PNG, Vanuatu, Mad,Samoa,Sao Tome & Principe,Solomon Islands,South America,Suriname,Tanzania,Tobago,Trinidad, Ecuador,Trinidad, Tobago,Vanuatu,Ven., Indonesia, Ecuad.,Ven.,Ecu.,Peru,Nic.,Venez,Africa,Brasil,Peru,Mex,Venezuela,Venezuela, Java,Venezuela/ Ghana,Vietnam 1007 187.173500 3.253724  
##       14) cocoa_percent>=0.745 264  51.625000 3.125000  
##         28) broad_bean_origin=Bolivia,Colombia,Congo,Domincan Republic,Dominican Republic,Ecuador,Guatemala,Honduras,Madagascar,Nicaragua,Papua New Guinea,Tanzania,Vanuatu,Venezuela 198  39.865210 3.059343  
##           56) review_date>=2006.5 188  35.089100 3.029255 *
##           57) review_date< 2006.5 10   1.406250 3.625000 *
##         29) broad_bean_origin=Belize,Brazil,Cuba,Fiji,Gabon,Guat., D.R., Peru, Mad., PNG,Haiti,Hawaii,Indonesia,Panama,Sao Tome & Principe,Solomon Islands,Trinidad, Ecuador,Ven.,Ecu.,Peru,Nic.,Venezuela/ Ghana,Vietnam 66   8.345644 3.321970 *
##       15) cocoa_percent< 0.745 743 129.619800 3.299462  
##         30) bean_type= ,Amazon,Criollo,Criollo (Ocumare 61),Criollo (Porcelana),Forastero,Forastero (Arriba),Forastero (Arriba) ASS,Forastero (Nacional),Nacional,Nacional (Arriba),Trinitario,Trinitario (Amelonado),Trinitario, Criollo,Trinitario, Forastero 698 121.212800 3.280802 *
##         31) bean_type=Amazon, ICS,Beniano,Blend,Blend-Forastero,Criollo,CCN51,Criollo (Ocumare 77),Criollo (Wild),Criollo, +,Criollo, Trinitario,EET,Forastero (Amelonado),Forastero (Arriba) ASSS,Forastero (Catongo),Forastero (Parazinho),Trinitario (85% Criollo),Trinitario, TCGA 45   4.394444 3.588889 *

Predict the Data

The following code will use the dataset chocolate_test to make predictions about final_grade using the second model.

# Make predictions with the model.
predictions <- predict(model2,
                       new_data = chocolate_test) %>%
  bind_cols(chocolate_test)

predictions

Visualization

# Visualization of the model
model2$fit %>% rpart.plot(box.palette="RdBu", shadow.col="gray", nn=TRUE)
## Warning: Cannot retrieve the data used to build the model (so cannot determine roundint and is.binary for the variables).
## To silence this warning:
##     Call rpart.plot with roundint=FALSE,
##     or rebuild the rpart model with model=TRUE.
## Warning: labs do not fit even at cex 0.15, there may be some overplotting
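
The two warnings above point to plot options rather than to problems with the model. The following is a sketch on the built-in mtcars data, used only as a stand-in. roundint = FALSE is the fix suggested by the warning itself, and faclen is an rpart.plot argument that abbreviates factor level names in split labels. The value 3 is an arbitrary illustrative choice.

```r
library(tidymodels)
library(rpart.plot)

# Built-in mtcars is used here only as a stand-in dataset
toy_fit <- decision_tree() %>%
  set_mode("regression") %>%
  set_engine("rpart") %>%
  fit(mpg ~ wt + hp + cyl, data = mtcars)

# roundint = FALSE silences the "cannot retrieve the data" warning.
# faclen = 3 abbreviates factor level names in split labels; mtcars has no
# factor columns, so it has no visible effect in this toy plot.
rpart.plot(toy_fit$fit, roundint = FALSE, faclen = 3,
           box.palette = "RdBu", shadow.col = "gray", nn = TRUE)
```

For a tree with as many factor levels as the second model, abbreviated labels may still overlap, so pooling rare levels before fitting could be the more effective way to get a readable plot.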

Analysis

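Both trees start with the same split on cocoa_percent at 0.905. Below it, the second tree splits first on broad_bean_origin and then on bean_type and cocoa_percent. The review_date splits only isolate small leaves (12 and 10 training rows), so the year most likely contributed noise rather than signal.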
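
The printed trees also allow a rough numeric comparison. For a regression tree fitted by rpart, the deviance of a node is its sum of squared errors, so one minus the summed deviance of the terminal nodes divided by the root deviance gives the share of training variance each tree explains. The sketch below uses only numbers copied from the two printouts. The variable names are illustrative.

```r
# Deviances copied from the two printed trees (root node and terminal nodes)
root_dev   <- 334.539700
leaves_one <- c(10.536930, 16.887910, 161.936100, 110.526700, 2.171296)
leaves_two <- c(10.536930, 5.807292, 11.334960, 64.241610, 10.819440,
                35.089100, 1.406250, 8.345644, 121.212800, 4.394444)

# Share of training variance explained: 1 - SSE(leaves) / SSE(root)
r2_one <- 1 - sum(leaves_one) / root_dev
r2_two <- 1 - sum(leaves_two) / root_dev

round(c(tree_one = r2_one, tree_two = r2_two), 3)
```

This works out to roughly 0.10 for the first tree and 0.18 for the second. Both are training-set figures, which favor the larger tree, so a fair comparison still needs an error metric computed on chocolate_test.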

Random Forest

Creating the Model

The following code creates a random forest model.

# creates the random forest model
(chocolate_rf <- ranger(final_grade ~ bean_type + cocoa_percent + broad_bean_origin,
                         chocolate_train,
                         num.trees=500,
                         respect.unordered.factors = "order",
                         seed=32022))
## Ranger result
## 
## Call:
##  ranger(final_grade ~ bean_type + cocoa_percent + broad_bean_origin,      chocolate_train, num.trees = 500, respect.unordered.factors = "order",      seed = 32022) 
## 
## Type:                             Regression 
## Number of trees:                  500 
## Sample size:                      1433 
## Number of independent variables:  3 
## Mtry:                             1 
## Target node size:                 5 
## Variable importance mode:         none 
## Splitrule:                        variance 
## OOB prediction error (MSE):       0.2051888 
## R squared (OOB):                  0.1216875
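
To put the out-of-bag (OOB) error in context, it can be converted to an RMSE and compared with the error of always predicting the training mean. The sketch below uses only numbers copied from the ranger printout and from the root node of the decision trees. The variable names are illustrative.

```r
# Numbers copied from the ranger printout and the root node of the trees
oob_mse  <- 0.2051888   # OOB prediction error (MSE)
root_dev <- 334.539700  # sum of squared deviations of final_grade in training
n_train  <- 1433        # training sample size

oob_rmse      <- sqrt(oob_mse)             # typical OOB error, in rating points
baseline_rmse <- sqrt(root_dev / n_train)  # error of always predicting the mean

round(c(forest_oob = oob_rmse, mean_only = baseline_rmse), 3)
```

The forest's OOB RMSE is about 0.45 rating points against about 0.48 for the mean-only baseline, which is consistent with the modest OOB R squared of about 0.12.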

Predictions and RMSE

The following code runs predictions on chocolate_test using the random forest model, computes the RMSE (Root Mean Square Error), and plots the actual grades against the predictions.

# Create a column called pred to store the prediction from the random forest model
chocolate_test$pred <- predict(chocolate_rf, chocolate_test)$predictions

# Calculate the RMSE of the predictions
chocolate_test %>% 
  mutate(residual = final_grade - pred) %>%
  summarize(rmse = sqrt(mean(residual^2)))

# Plot the actual outcome vs predictions (prediction on the x-axis)
ggplot(chocolate_test, aes(x = pred, y = final_grade)) +
  geom_point() + 
  geom_abline()

Summary

The decision trees were quick to compute and provided preliminary information. However, care had to be taken when splitting the data into training and testing partitions, because prediction would throw errors when the test partition contained values, such as factor levels, that did not appear in the training partition. I also found it more difficult to use a decision tree for a continuous prediction (regression) than for a regular classification.