Based on the latest topics presented, bring a dataset of your choice and create a Decision Tree where you can solve a classification or regression problem and predict the outcome of a particular feature or detail of the data used. Switch variables to generate 2 decision trees and compare the results. Create a random forest for regression and analyze the results. Based on real cases where decision trees went wrong, and ‘the bad & ugly’ aspects of decision trees https://decizone.com/blog/the-good-the-bad-the-ugly-of-using-decision-trees, how can you change this perception when using the decision tree you created to solve a real problem? Format: document with screen captures & analysis.
For this assignment, I used the “Flavors of Cacao” dataset from Kaggle: https://www.kaggle.com/datasets/techiesid01/flavours-of-cacao?resource=download. With this dataset, I will create a decision tree that solves a regression problem: predicting the final grade for a cocoa bean. The dataset will be split into training and testing partitions.
The following code sets up our libraries and imports the dataset.
library(tidymodels)
## Registered S3 method overwritten by 'tune':
## method from
## required_pkgs.model_spec parsnip
## -- Attaching packages -------------------------------------- tidymodels 0.1.4 --
## v broom 0.7.12 v recipes 0.2.0
## v dials 0.1.0 v rsample 0.1.1
## v dplyr 1.0.8 v tibble 3.1.6
## v ggplot2 3.3.5 v tidyr 1.2.0
## v infer 1.0.0 v tune 0.1.6
## v modeldata 0.1.1 v workflows 0.2.4
## v parsnip 0.2.0 v workflowsets 0.1.0
## v purrr 0.3.4 v yardstick 0.0.9
## Warning: package 'parsnip' was built under R version 4.1.3
## -- Conflicts ----------------------------------------- tidymodels_conflicts() --
## x purrr::discard() masks scales::discard()
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## x recipes::step() masks stats::step()
## x tune::tune() masks parsnip::tune()
## * Dig deeper into tidy modeling with R at https://www.tmwr.org
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v readr 2.1.2 v forcats 0.5.1
## v stringr 1.4.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x readr::col_factor() masks scales::col_factor()
## x purrr::discard() masks scales::discard()
## x dplyr::filter() masks stats::filter()
## x stringr::fixed() masks recipes::fixed()
## x dplyr::lag() masks stats::lag()
## x readr::spec() masks yardstick::spec()
library(readr)
library(rpart.plot)
## Loading required package: rpart
##
## Attaching package: 'rpart'
## The following object is masked from 'package:dials':
##
## prune
# Used for Random Forest
library(ranger)
## Warning: package 'ranger' was built under R version 4.1.3
chocolate <- read_csv('https://raw.githubusercontent.com/logicalschema/spring2022/main/data622/hw2/flavors_of_cacao.csv')
## Rows: 1795 Columns: 9
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (6): company, specific_bean_origin_or_bar_name, cocoa_percent, company_l...
## dbl (3): ref, review_date, final_grade
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
This is a view of the imported dataset.
names(chocolate)
## [1] "company" "specific_bean_origin_or_bar_name"
## [3] "ref" "review_date"
## [5] "cocoa_percent" "company_location"
## [7] "final_grade" "bean_type"
## [9] "broad_bean_origin"
head(chocolate)
The column cocoa_percent has a chr data type, so it needs to be converted to a double. The column ref was removed, and the columns broad_bean_origin and bean_type were converted to factor variables. I also dropped rows with missing values using drop_na().
chocolate <- chocolate %>% drop_na()
# Parses the cocoa_percent and converts it to percentage
chocolate$cocoa_percent <- parse_number(chocolate$cocoa_percent)/100
# Removes the ref column
chocolate <- chocolate %>% select(-ref)
# Convert some columns to factors
chocolate$broad_bean_origin <- as.factor(chocolate$broad_bean_origin )
chocolate$bean_type <- as.factor(chocolate$bean_type)
This is a view of the data after our modification.
head(chocolate)
summary(chocolate)
## company specific_bean_origin_or_bar_name review_date
## Length:1793 Length:1793 Min. :2006
## Class :character Class :character 1st Qu.:2010
## Mode :character Mode :character Median :2013
## Mean :2012
## 3rd Qu.:2015
## Max. :2017
##
## cocoa_percent company_location final_grade bean_type
## Min. :0.420 Length:1793 Min. :1.000 :887
## 1st Qu.:0.700 Class :character 1st Qu.:3.000 Trinitario :418
## Median :0.700 Mode :character Median :3.250 Criollo :153
## Mean :0.717 Mean :3.186 Forastero : 87
## 3rd Qu.:0.750 3rd Qu.:3.500 Forastero (Nacional): 52
## Max. :1.000 Max. :5.000 Blend : 41
## (Other) :155
## broad_bean_origin
## Venezuela :214
## Ecuador :193
## Peru :165
## Madagascar :145
## Dominican Republic:141
## : 73
## (Other) :862
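The summary shows 887 rows with a blank bean_type and 73 rows with a blank broad_bean_origin that drop_na() did not remove, which suggests those cells hold blank strings rather than NA. A minimal sketch of one way to recode them follows. It assumes the blanks are empty strings, spaces, or non-breaking spaces, which I have not verified against the file, and blank_to_na is a hypothetical helper that is not part of the original analysis.

```r
# Hypothetical helper (not in the original report): treat empty or
# whitespace-only strings, including non-breaking spaces, as missing values.
blank_to_na <- function(x) {
  x <- gsub("\u00a0", " ", as.character(x), fixed = TRUE)  # non-breaking space -> space
  x <- trimws(x)                                           # strip surrounding whitespace
  x[!is.na(x) & x == ""] <- NA_character_                  # empty string -> NA
  x
}

# Toy stand-in for the bean_type column
bean_type <- c("Criollo", "\u00a0", "", "Trinitario", " Blend ")
cleaned <- blank_to_na(bean_type)
cleaned
sum(is.na(cleaned))  # number of values recoded to NA
```

Applied before drop_na(), a step like this would remove the blank rows as well, so whether to use it depends on whether "unknown bean type" should count as a category of its own.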
# Splitting the data 80/20
set.seed(32022)
data_split <- initial_split(chocolate, prop = 0.8, strata = 'bean_type')
chocolate_train <- training(data_split)
chocolate_test <- testing(data_split)
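A model fitted on the training partition may fail at prediction time if the test partition contains a factor level that never appears in training. A minimal sketch of a check that could be run right after the split follows. The function unseen_levels is a hypothetical helper, not part of rsample or the original analysis, and the toy data frames stand in for chocolate_train and chocolate_test.

```r
# Hypothetical helper (not in the original report): list the values of a
# column that occur in the test set but never in the training set.
unseen_levels <- function(train, test, column) {
  setdiff(unique(as.character(test[[column]])),
          unique(as.character(train[[column]])))
}

# Toy stand-ins for chocolate_train and chocolate_test
train <- data.frame(bean_type = c("Criollo", "Trinitario", "Criollo", "Blend"))
test  <- data.frame(bean_type = c("Criollo", "Matina"))

new_levels <- unseen_levels(train, test, "bean_type")
new_levels
```

With the objects from this report the call would be unseen_levels(chocolate_train, chocolate_test, "bean_type"). Any levels it returns could be lumped into an "Other" category before splitting, or the affected rows could be handled separately.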
I will construct a decision tree regression model using the training dataset.
# Build the model specification for a decision tree
model_spec <- decision_tree() %>%
set_mode("regression") %>%
set_engine("rpart")
model_spec
## Decision Tree Model Specification (regression)
##
## Computational engine: rpart
The following code fits the model on the chocolate_train dataset, which we partitioned from the original dataset. It models final_grade as a function of cocoa_percent and bean_type.
# Train the model
model <- model_spec %>%
fit(formula = final_grade ~ cocoa_percent + bean_type,
data = chocolate_train)
# Information about the model
model
## parsnip model object
##
## n= 1433
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 1433 334.539700 3.183182
## 2) cocoa_percent>=0.905 22 10.536930 2.284091 *
## 3) cocoa_percent< 0.905 1411 305.941400 3.197201
## 6) bean_type= ,Criollo (Amarru),Criollo (Ocumare),Forastero,Forastero (Arriba),Forastero (Arriba) ASS,Forastero(Arriba, CCN),Forastero, Trinitario,Nacional,Trinitario (Amelonado),Trinitario, Criollo,Trinitario, Forastero 812 182.707100 3.136392
## 12) cocoa_percent>=0.755 92 16.887910 2.942935 *
## 13) cocoa_percent< 0.755 720 161.936100 3.161111 *
## 7) bean_type=Amazon,Amazon mix,Amazon, ICS,Beniano,Blend,Blend-Forastero,Criollo,CCN51,Criollo,Criollo (Ocumare 61),Criollo (Ocumare 77),Criollo (Porcelana),Criollo (Wild),Criollo, +,Criollo, Forastero,Criollo, Trinitario,EET,Forastero (Amelonado),Forastero (Arriba) ASSS,Forastero (Catongo),Forastero (Nacional),Forastero (Parazinho),Matina,Nacional (Arriba),Trinitario,Trinitario (85% Criollo),Trinitario, TCGA 599 116.161500 3.279633
## 14) bean_type=Amazon,Blend,Criollo,Criollo (Ocumare 61),Criollo (Porcelana),Criollo, Trinitario,Forastero (Nacional),Nacional (Arriba),Trinitario 572 110.526700 3.263112 *
## 15) bean_type=Amazon mix,Amazon, ICS,Beniano,Blend-Forastero,Criollo,CCN51,Criollo (Ocumare 77),Criollo (Wild),Criollo, +,Criollo, Forastero,EET,Forastero (Amelonado),Forastero (Arriba) ASSS,Forastero (Catongo),Forastero (Parazinho),Matina,Trinitario (85% Criollo),Trinitario, TCGA 27 2.171296 3.629630 *
The following code uses the chocolate_test dataset to make predictions about final_grade with the model that was built.
# Make predictions with the model.
predictions <- predict(model,
new_data = chocolate_test) %>%
bind_cols(chocolate_test)
predictions
# Visualization of the model
model$fit %>% rpart.plot(box.palette="RdBu", shadow.col="gray", nn=TRUE)
## Warning: Cannot retrieve the data used to build the model (so cannot determine roundint and is.binary for the variables).
## To silence this warning:
## Call rpart.plot with roundint=FALSE,
## or rebuild the rpart model with model=TRUE.
Surprisingly, cocoa_percent did not have a strong influence on final_grade. Its main effect is at the extreme: a cocoa_percent greater than or equal to 0.905 (0.91 when rounded) resulted in a predicted grade of about 2.3, but only 22 of the 1,433 training rows fall in that leaf. The grade also depended upon the type of bean. The highest predicted grade, about 3.6, went to a leaf of 27 training rows with these bean types: Amazon mix,Amazon, ICS,Beniano,Blend-Forastero,Criollo,CCN51,Criollo (Ocumare 77),Criollo (Wild),Criollo, +,Criollo, Forastero,EET,Forastero (Amelonado),Forastero (Arriba) ASSS,Forastero (Catongo),Forastero (Parazinho),Matina,Trinitario (85% Criollo),Trinitario, TCGA.
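One way to put a number on how much each split matters is to look at the deviance column of the printed tree. For a regression tree, rpart reports the deviance of a node as the sum of squared differences between final_grade and the node mean, so the drop in deviance from a parent to its two children is what the split explains. The sketch below only redoes that arithmetic with values copied from the output above; the variable names are my own.

```r
# Deviance values copied from the printed rpart output of the first tree
root_dev     <- 334.5397               # node 1 (root)
root_kids    <- c(10.53693, 305.9414)  # nodes 2 and 3, split on cocoa_percent
node3_dev    <- 305.9414               # node 3
node3_kids   <- c(182.7071, 116.1615)  # nodes 6 and 7, split on bean_type

# Share of the total deviance removed by each split
share_cocoa <- (root_dev - sum(root_kids)) / root_dev
share_bean  <- (node3_dev - sum(node3_kids)) / root_dev

round(c(cocoa_percent = share_cocoa, bean_type = share_bean), 3)
# cocoa_percent about 0.054, bean_type about 0.021
```

Both shares are small, which fits the impression that neither variable explains much of the variation in final_grade on its own.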
I will construct a second decision tree regression model using the training dataset.
The following code fits the second model on the chocolate_train dataset, which we partitioned from the original dataset. It models final_grade as a function of cocoa_percent, bean_type, broad_bean_origin, and review_date.
# Train the model
model2 <- model_spec %>%
fit(formula = final_grade ~ cocoa_percent + bean_type + broad_bean_origin + review_date,
data = chocolate_train)
# Information about the model
model2
## parsnip model object
##
## n= 1433
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 1433 334.539700 3.183182
## 2) cocoa_percent>=0.905 22 10.536930 2.284091 *
## 3) cocoa_percent< 0.905 1411 305.941400 3.197201
## 6) broad_bean_origin= ,Africa, Carribean, C. Am.,Burma,Carribean,Central and S. America,Colombia, Ecuador,Cost Rica, Ven,Costa Rica,Ecuador, Mad., PNG,El Salvador,Ghana,Ghana & Madagascar,Ghana, Domin. Rep,Ghana, Panama, Ecuador,Grenada,India,Ivory Coast,Jamaica,Liberia,Martinique,Mexico,Peru,Peru(SMartin,Pangoa,nacional),Peru, Ecuador, Venezuela,Peru, Madagascar,Philippines,Principe,Puerto Rico,Sao Tome,South America, Africa,Sri Lanka,St. Lucia,Togo,Trinidad,Trinidad-Tobago,Uganda,Ven., Trinidad, Mad.,Venezuela, Carribean,Venezuela, Ghana,Venezuela, Trinidad,West Africa 404 107.531400 3.056312
## 12) bean_type= ,Criollo,Criollo (Amarru),Criollo, Trinitario,Forastero,Forastero, Trinitario,Trinitario,Trinitario, Criollo,Trinitario, Forastero 359 92.007310 3.018106
## 24) broad_bean_origin= ,Africa, Carribean, C. Am.,Colombia, Ecuador,Cost Rica, Ven,Ghana, Domin. Rep,Ivory Coast,Martinique,Peru, Madagascar,Philippines,Principe,Puerto Rico,Sri Lanka,Trinidad-Tobago,Uganda,Venezuela, Trinidad,West Africa 76 23.289470 2.802632
## 48) review_date< 2007.5 12 5.807292 2.145833 *
## 49) review_date>=2007.5 64 11.334960 2.925781 *
## 25) broad_bean_origin=Burma,Carribean,Central and S. America,Costa Rica,El Salvador,Ghana,Grenada,India,Jamaica,Liberia,Mexico,Peru,Peru(SMartin,Pangoa,nacional),Peru, Ecuador, Venezuela,Sao Tome,South America, Africa,St. Lucia,Togo,Trinidad,Venezuela, Ghana 283 64.241610 3.075972 *
## 13) bean_type=Amazon mix,Blend,Forastero (Nacional),Matina 45 10.819440 3.361111 *
## 7) broad_bean_origin=Australia,Belize,Bolivia,Brazil,Cameroon,Colombia,Congo,Cuba,Dom. Rep., Madagascar,Domincan Republic,Dominican Rep., Bali,Dominican Republic,Ecuador,Fiji,Gabon,Gre., PNG, Haw., Haiti, Mad,Guat., D.R., Peru, Mad., PNG,Guatemala,Haiti,Hawaii,Honduras,Indonesia,Mad., Java, PNG,Madagascar,Malaysia,Nicaragua,Nigeria,Panama,Papua New Guinea,PNG, Vanuatu, Mad,Samoa,Sao Tome & Principe,Solomon Islands,South America,Suriname,Tanzania,Tobago,Trinidad, Ecuador,Trinidad, Tobago,Vanuatu,Ven., Indonesia, Ecuad.,Ven.,Ecu.,Peru,Nic.,Venez,Africa,Brasil,Peru,Mex,Venezuela,Venezuela, Java,Venezuela/ Ghana,Vietnam 1007 187.173500 3.253724
## 14) cocoa_percent>=0.745 264 51.625000 3.125000
## 28) broad_bean_origin=Bolivia,Colombia,Congo,Domincan Republic,Dominican Republic,Ecuador,Guatemala,Honduras,Madagascar,Nicaragua,Papua New Guinea,Tanzania,Vanuatu,Venezuela 198 39.865210 3.059343
## 56) review_date>=2006.5 188 35.089100 3.029255 *
## 57) review_date< 2006.5 10 1.406250 3.625000 *
## 29) broad_bean_origin=Belize,Brazil,Cuba,Fiji,Gabon,Guat., D.R., Peru, Mad., PNG,Haiti,Hawaii,Indonesia,Panama,Sao Tome & Principe,Solomon Islands,Trinidad, Ecuador,Ven.,Ecu.,Peru,Nic.,Venezuela/ Ghana,Vietnam 66 8.345644 3.321970 *
## 15) cocoa_percent< 0.745 743 129.619800 3.299462
## 30) bean_type= ,Amazon,Criollo,Criollo (Ocumare 61),Criollo (Porcelana),Forastero,Forastero (Arriba),Forastero (Arriba) ASS,Forastero (Nacional),Nacional,Nacional (Arriba),Trinitario,Trinitario (Amelonado),Trinitario, Criollo,Trinitario, Forastero 698 121.212800 3.280802 *
## 31) bean_type=Amazon, ICS,Beniano,Blend,Blend-Forastero,Criollo,CCN51,Criollo (Ocumare 77),Criollo (Wild),Criollo, +,Criollo, Trinitario,EET,Forastero (Amelonado),Forastero (Arriba) ASSS,Forastero (Catongo),Forastero (Parazinho),Trinitario (85% Criollo),Trinitario, TCGA 45 4.394444 3.588889 *
The following code uses the chocolate_test dataset to make predictions about final_grade with the second model.
# Make predictions with the model.
predictions <- predict(model2,
new_data = chocolate_test) %>%
bind_cols(chocolate_test)
predictions
# Visualization of the model
model2$fit %>% rpart.plot(box.palette="RdBu", shadow.col="gray", nn=TRUE)
## Warning: Cannot retrieve the data used to build the model (so cannot determine roundint and is.binary for the variables).
## To silence this warning:
## Call rpart.plot with roundint=FALSE,
## or rebuild the rpart model with model=TRUE.
## Warning: labs do not fit even at cex 0.15, there may be some overplotting
Overall, final_grade came down to bean_type and cocoa_percent. The year and the other variables contributed noise: the review_date splits, for example, isolate leaves of only 10 and 12 training rows.
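To compare the two trees on the training data, the leaf deviances in the printed output can be summed and set against the root deviance. This gives a training-set R squared for each tree. The sketch below uses only values copied from the two outputs above; the variable names are my own, and because these are training figures they are optimistic compared with what the test set would show.

```r
# Root deviance (total sum of squares), identical for both trees
root_dev <- 334.5397

# Deviance of every terminal node (lines marked with *) in each printed tree
leaves_tree1 <- c(10.53693, 16.88791, 161.9361, 110.5267, 2.171296)
leaves_tree2 <- c(10.53693, 5.807292, 11.33496, 64.24161, 10.81944,
                  35.0891, 1.40625, 8.345644, 121.2128, 4.394444)

# Training R squared = 1 - residual deviance / total deviance
r2_tree1 <- 1 - sum(leaves_tree1) / root_dev
r2_tree2 <- 1 - sum(leaves_tree2) / root_dev

round(c(tree1 = r2_tree1, tree2 = r2_tree2), 3)
# tree1 about 0.097, tree2 about 0.183
```

The second tree fits the training data more closely, but a larger tree with more predictors tends to do that regardless of whether the extra splits generalize, so the comparison that matters is the one on chocolate_test.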
The following code creates a random forest model.
# creates the random forest model
(chocolate_rf <- ranger(final_grade ~ bean_type + cocoa_percent + broad_bean_origin,
chocolate_train,
num.trees=500,
respect.unordered.factors = "order",
seed=32022))
## Ranger result
##
## Call:
## ranger(final_grade ~ bean_type + cocoa_percent + broad_bean_origin, chocolate_train, num.trees = 500, respect.unordered.factors = "order", seed = 32022)
##
## Type: Regression
## Number of trees: 500
## Sample size: 1433
## Number of independent variables: 3
## Mtry: 1
## Target node size: 5
## Variable importance mode: none
## Splitrule: variance
## OOB prediction error (MSE): 0.2051888
## R squared (OOB): 0.1216875
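The two out-of-bag (OOB) figures in this output are linked: as far as I understand ranger's regression output, R squared is one minus the OOB mean squared error divided by the sample variance of the response. The sketch below checks this using only numbers printed in this report. It assumes the root deviance from the decision tree output can stand in for the total sum of squares of final_grade, since both models were trained on the same 1,433 rows.

```r
oob_mse  <- 0.2051888   # OOB prediction error (MSE) from the ranger output
root_dev <- 334.5397    # root deviance from the rpart output (total sum of squares)
n_train  <- 1433        # training sample size

oob_rmse <- sqrt(oob_mse)            # error in grade points, about 0.453
var_y    <- root_dev / (n_train - 1) # sample variance of final_grade
r2_check <- 1 - oob_mse / var_y      # about 0.1217, matching the reported value

round(c(oob_rmse = oob_rmse, r2_check = r2_check), 4)
```

Read this way, the forest's typical out-of-bag error is a little under half a grade point, and it explains roughly 12% of the variance in final_grade.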
The following code runs predictions using the random forest model and calculates the RMSE (Root Mean Square Error).
# Create a column called pred to store the prediction from the random forest model
chocolate_test$pred <- predict(chocolate_rf, chocolate_test)$predictions
# Calculate the RMSE of the predictions
chocolate_test %>%
mutate(residual = final_grade - pred) %>%
summarize(rmse = sqrt(mean(residual^2)))
# Plot the actual outcome vs predictions (prediction on the x-axis)
ggplot(chocolate_test, aes(x = pred, y = final_grade)) +
geom_point() +
geom_abline()
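The RMSE of the random forest is easier to judge next to the same metric for the decision trees. The sketch below shows one way to put models side by side on a held-out set. It is not the code used in this report: it uses the built-in mtcars data and rpart directly so that it runs on its own, and rmse here is a hypothetical helper rather than the yardstick function of the same name.

```r
library(rpart)

# Hypothetical helper (not in the original report): root mean square error
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Stand-in data: built-in mtcars split 80/20
# (the report itself uses chocolate_train and chocolate_test)
set.seed(32022)
idx   <- sample(nrow(mtcars), size = floor(0.8 * nrow(mtcars)))
train <- mtcars[idx, ]
test  <- mtcars[-idx, ]

# Two regression trees that differ only in their predictors
tree_small <- rpart(mpg ~ wt + hp, data = train, method = "anova")
tree_large <- rpart(mpg ~ wt + hp + disp + qsec, data = train, method = "anova")

# Test-set RMSE for each model, collected in one table
results <- data.frame(
  model = c("tree_small", "tree_large"),
  rmse  = c(rmse(test$mpg, predict(tree_small, newdata = test)),
            rmse(test$mpg, predict(tree_large, newdata = test)))
)
print(results)
```

For the parsnip models in this report, the prediction column returned by predict() is named .pred, so the equivalent call would be rmse(chocolate_test$final_grade, predict(model, new_data = chocolate_test)$.pred), and likewise for model2.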
The decision trees were quick to compute and provided preliminary information. However, care had to be taken in splitting the train and test data, because the prediction step would throw errors when the test data contained values, such as factor levels, that the model had not seen during training. I also found it more difficult to use a decision tree for a continuous prediction than for a regular classification.