library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
library(corrplot)
## corrplot 0.92 loaded
library(e1071)
library(ggplot2)
library(ggforce)
library(labelled)
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(mlbench)
library(Metrics)
##
## Attaching package: 'Metrics'
## The following objects are masked from 'package:caret':
##
## precision, recall
library(ModelMetrics)
##
## Attaching package: 'ModelMetrics'
## The following objects are masked from 'package:Metrics':
##
## auc, ce, logLoss, mae, mse, msle, precision, recall, rmse, rmsle
## The following objects are masked from 'package:caret':
##
## confusionMatrix, precision, recall, sensitivity, specificity
## The following object is masked from 'package:base':
##
## kappa
library(psych)
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following object is masked from 'package:ModelMetrics':
##
## auc
## The following object is masked from 'package:Metrics':
##
## auc
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:psych':
##
## outlier
## The following object is masked from 'package:ggplot2':
##
## margin
library(readr)
library(RColorBrewer)
library(rpart)
library(rpart.plot)
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──
## ✔ broom 1.0.5 ✔ rsample 1.2.0
## ✔ dials 1.2.1 ✔ tibble 3.2.1
## ✔ dplyr 1.1.4 ✔ tidyr 1.3.1
## ✔ infer 1.0.6 ✔ tune 1.1.2
## ✔ modeldata 1.3.0 ✔ workflows 1.1.4
## ✔ parsnip 1.2.0 ✔ workflowsets 1.0.1
## ✔ purrr 1.0.2 ✔ yardstick 1.3.0
## ✔ recipes 1.0.10
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ psych::%+%() masks ggplot2::%+%()
## ✖ yardstick::accuracy() masks Metrics::accuracy()
## ✖ scales::alpha() masks psych::alpha(), ggplot2::alpha()
## ✖ dplyr::combine() masks randomForest::combine()
## ✖ purrr::discard() masks scales::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ purrr::lift() masks caret::lift()
## ✖ yardstick::mae() masks ModelMetrics::mae(), Metrics::mae()
## ✖ yardstick::mape() masks Metrics::mape()
## ✖ randomForest::margin() masks ggplot2::margin()
## ✖ yardstick::mase() masks Metrics::mase()
## ✖ yardstick::mcc() masks ModelMetrics::mcc()
## ✖ yardstick::npv() masks ModelMetrics::npv()
## ✖ rsample::permutations() masks e1071::permutations()
## ✖ yardstick::ppv() masks ModelMetrics::ppv()
## ✖ yardstick::precision() masks ModelMetrics::precision(), Metrics::precision(), caret::precision()
## ✖ dials::prune() masks rpart::prune()
## ✖ yardstick::recall() masks ModelMetrics::recall(), Metrics::recall(), caret::recall()
## ✖ yardstick::rmse() masks ModelMetrics::rmse(), Metrics::rmse()
## ✖ yardstick::sensitivity() masks ModelMetrics::sensitivity(), caret::sensitivity()
## ✖ yardstick::smape() masks Metrics::smape()
## ✖ yardstick::spec() masks readr::spec()
## ✖ yardstick::specificity() masks ModelMetrics::specificity(), caret::specificity()
## ✖ recipes::step() masks stats::step()
## ✖ tune::tune() masks parsnip::tune(), e1071::tune()
## • Dig deeper into tidy modeling with R at https://www.tmwr.org
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ psych::%+%() masks ggplot2::%+%()
## ✖ scales::alpha() masks psych::alpha(), ggplot2::alpha()
## ✖ scales::col_factor() masks readr::col_factor()
## ✖ dplyr::combine() masks randomForest::combine()
## ✖ purrr::discard() masks scales::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ stringr::fixed() masks recipes::fixed()
## ✖ dplyr::lag() masks stats::lag()
## ✖ purrr::lift() masks caret::lift()
## ✖ randomForest::margin() masks ggplot2::margin()
## ✖ yardstick::spec() masks readr::spec()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Read this blog: https://decizone.com/blog/the-good-the-bad-the-ugly-of-using-decision-trees which shows some of the issues with decision trees.
Choose a dataset from a source in Assignment #1, or another dataset of your choice.
Based on the latest topics presented, choose a dataset of your choice and create a Decision Tree where you can solve a classification problem and predict the outcome of a particular feature or detail of the data used.
Switch variables* to generate 2 decision trees and compare the results. Create a random forest and analyze the results.
Based on real cases where decision trees went wrong, and ‘the bad & ugly’ aspects of decision trees (https://decizone.com/blog/the-good-the-bad-the-ugly-of-using-decision-trees), how can you change this perception when using the decision tree you created to solve a real problem?
Essay (minimum 500-word document): Write a short essay explaining your analysis and how you would address the concerns in the blog (listed in pre-work).
Exploratory Analysis using R or Python (submit code + errors + analysis as notebook or copy/paste to document)
* Note: 1. We are trying to train 2 different decision trees to compare bias and variance - so switch the features used for the first node (split) to force a different decision tree (How did the performance change?) 2. You will create 3 models: 2 x decision trees (to compare variance) and a random forest
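df_1k is used throughout the analysis below, but its load step is not shown. A minimal sketch, assuming the data come from a “1000 Sales Records” sample sales CSV (the file name is an assumption):
# load the raw sales data (file name hypothetical)
df_1k <- read.csv("1000 Sales Records.csv", stringsAsFactors = FALSE)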
head(df_1k)
describe(df_1k)
str(df_1k)
## 'data.frame': 1000 obs. of 14 variables:
## $ Region : chr "Middle East and North Africa" "North America" "Middle East and North Africa" "Asia" ...
## $ Country : chr "Libya" "Canada" "Libya" "Japan" ...
## $ Item.Type : chr "Cosmetics" "Vegetables" "Baby Food" "Cereal" ...
## $ Sales.Channel : chr "Offline" "Online" "Offline" "Offline" ...
## $ Order.Priority: chr "M" "M" "C" "C" ...
## $ Order.Date : chr "10/18/2014" "11/7/2011" "10/31/2016" "4/10/2010" ...
## $ Order.ID : int 686800706 185941302 246222341 161442649 645713555 683458888 679414975 208630645 266467225 118598544 ...
## $ Ship.Date : chr "10/31/2014" "12/8/2011" "12/9/2016" "5/12/2010" ...
## $ Units.Sold : int 8446 3018 1517 3322 9845 9528 2844 7299 2428 4800 ...
## $ Unit.Price : num 437.2 154.06 255.28 205.7 9.33 ...
## $ Unit.Cost : num 263.33 90.93 159.42 117.11 6.92 ...
## $ Total.Revenue : num 3692591 464953 387260 683335 91854 ...
## $ Total.Cost : num 2224085 274427 241840 389039 68127 ...
## $ Total.Profit : num 1468506 190526 145420 294296 23726 ...
summary(df_1k)
## Region Country Item.Type Sales.Channel
## Length:1000 Length:1000 Length:1000 Length:1000
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Order.Priority Order.Date Order.ID Ship.Date
## Length:1000 Length:1000 Min. :102928006 Length:1000
## Class :character Class :character 1st Qu.:328074026 Class :character
## Mode :character Mode :character Median :556609714 Mode :character
## Mean :549681325
## 3rd Qu.:769694483
## Max. :995529830
## Units.Sold Unit.Price Unit.Cost Total.Revenue
## Min. : 13 Min. : 9.33 Min. : 6.92 Min. : 2043
## 1st Qu.:2420 1st Qu.: 81.73 1st Qu.: 56.67 1st Qu.: 281192
## Median :5184 Median :154.06 Median : 97.44 Median : 754939
## Mean :5054 Mean :262.11 Mean :184.97 Mean :1327322
## 3rd Qu.:7537 3rd Qu.:421.89 3rd Qu.:263.33 3rd Qu.:1733503
## Max. :9998 Max. :668.27 Max. :524.96 Max. :6617210
## Total.Cost Total.Profit
## Min. : 1417 Min. : 532.6
## 1st Qu.: 164932 1st Qu.: 98376.1
## Median : 464726 Median : 277226.0
## Mean : 936119 Mean : 391202.6
## 3rd Qu.:1141750 3rd Qu.: 548456.8
## Max. :5204978 Max. :1726181.4
glimpse(df_1k)
## Rows: 1,000
## Columns: 14
## $ Region <chr> "Middle East and North Africa", "North America", "Middl…
## $ Country <chr> "Libya", "Canada", "Libya", "Japan", "Chad", "Armenia",…
## $ Item.Type <chr> "Cosmetics", "Vegetables", "Baby Food", "Cereal", "Frui…
## $ Sales.Channel <chr> "Offline", "Online", "Offline", "Offline", "Offline", "…
## $ Order.Priority <chr> "M", "M", "C", "C", "H", "H", "H", "M", "H", "H", "M", …
## $ Order.Date <chr> "10/18/2014", "11/7/2011", "10/31/2016", "4/10/2010", "…
## $ Order.ID <int> 686800706, 185941302, 246222341, 161442649, 645713555, …
## $ Ship.Date <chr> "10/31/2014", "12/8/2011", "12/9/2016", "5/12/2010", "8…
## $ Units.Sold <int> 8446, 3018, 1517, 3322, 9845, 9528, 2844, 7299, 2428, 4…
## $ Unit.Price <dbl> 437.20, 154.06, 255.28, 205.70, 9.33, 205.70, 205.70, 1…
## $ Unit.Cost <dbl> 263.33, 90.93, 159.42, 117.11, 6.92, 117.11, 117.11, 35…
## $ Total.Revenue <dbl> 3692591.20, 464953.08, 387259.76, 683335.40, 91853.85, …
## $ Total.Cost <dbl> 2224085.18, 274426.74, 241840.14, 389039.42, 68127.40, …
## $ Total.Profit <dbl> 1468506.02, 190526.34, 145419.62, 294295.98, 23726.45, …
look_for(df_1k)
apply(df_1k, 2, function(x) sum(is.na(x)))
## Region Country Item.Type Sales.Channel Order.Priority
## 0 0 0 0 0
## Order.Date Order.ID Ship.Date Units.Sold Unit.Price
## 0 0 0 0 0
## Unit.Cost Total.Revenue Total.Cost Total.Profit
## 0 0 0 0
unique(df_1k$Region)
## [1] "Middle East and North Africa" "North America"
## [3] "Asia" "Sub-Saharan Africa"
## [5] "Europe" "Central America and the Caribbean"
## [7] "Australia and Oceania"
#unique(df_1k$Country)
length(unique(df_1k$Country))
## [1] 185
table(df_1k$Item.Type)
##
## Baby Food Beverages Cereal Clothes Cosmetics
## 87 101 79 78 75
## Fruits Household Meat Office Supplies Personal Care
## 70 77 78 89 87
## Snacks Vegetables
## 82 97
table(df_1k$Sales.Channel)
##
## Offline Online
## 520 480
unique(df_1k$Order.Priority)
## [1] "M" "C" "H" "L"
#select numeric columns 1k
df_1k_num <- df_1k %>%
keep(is.numeric)
#stats
describe(df_1k_num, fast=TRUE) %>%
select(c(-vars,-n))
#distributions
df_1k_num %>%
pivot_longer(cols = 1:6, names_to = "variable", values_to = "value") %>%
ggplot(aes(value)) +
facet_wrap(~variable, scales = "free") +
geom_density() +
geom_histogram(aes(y = after_stat(density)), bins = 40, alpha = 0.2, fill = "lightblue", color = "darkgreen")
From the initial EDA we see the following: the dataset is complete (no missing values in any of the 14 columns); the categorical fields have a manageable number of levels (7 regions, 185 countries, 12 item types, 2 sales channels, 4 order priorities); and the numeric sales figures span several orders of magnitude, suggesting skewed distributions. Next, the correlations among the numeric variables:
corr_matrix <- cor(df_1k_num)
corrplot(corr_matrix,
type = "lower",
order = "hclust",
tl.col = "blue",
addCoef.col = "white",
diag = FALSE,
title = "Corrplot",
mar = c(0, 0, 1, 0),
col = brewer.pal(10, "RdYlBu"))
Looking at the correlation plot, the monetary totals (Total.Revenue, Total.Cost, Total.Profit) and the unit figures (Unit.Price, Unit.Cost) are strongly correlated with one another, which is expected since revenue, cost, and profit are all derived from units sold and unit prices. I suspect multicollinearity, but will use an additional method (VIF) to confirm.
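caret also offers a quick programmatic screen for highly correlated columns; a minimal sketch using findCorrelation (the 0.9 cutoff is an assumption):
# names of numeric columns whose pairwise correlation exceeds the cutoff
findCorrelation(corr_matrix, cutoff = 0.9, names = TRUE)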
set.seed(321)
sample_1k_train <- df_1k_num$Total.Revenue %>%
createDataPartition(p = 0.8, list = FALSE)
df_train_1k <- df_1k_num[sample_1k_train, ]
df_test_1k <- df_1k_num[-sample_1k_train, ]
model <- lm(Total.Revenue ~ ., data = df_train_1k)
vif_values<-car::vif(model)
print(vif_values)
## Order.ID Units.Sold Unit.Price Unit.Cost Total.Cost Total.Profit
## 1.002970 3.041373 167.637725 167.273600 11.445468 14.149909
The VIF values can be interpreted as follows:
* Order.ID and Units.Sold show low multicollinearity
* Unit.Price and Unit.Cost show very high multicollinearity (a common rule of thumb flags VIF above 10)
* Total.Cost and Total.Profit show moderate multicollinearity
The only transformations needed are:
* parsing the date strings into Date objects (month/day/year format)
* converting the categorical variables to factors
* centering and scaling as pre-processing for modelling
* feature selection, excluding the identifier-like Order.ID and the high-cardinality Country (185 levels)
df_1k_norm <- df_1k %>%
mutate(
`Order.Date` = mdy(`Order.Date`), # Convert to Date using lubridate's mdy function
`Ship.Date` = mdy(`Ship.Date`),
`Sales.Channel` = as.factor(`Sales.Channel`),
`Order.Priority` = as.factor(`Order.Priority`),
`Item.Type` = as.factor(`Item.Type`),
`Region` = as.factor(`Region`),
`Country` = as.factor(`Country`),
`Order.ID` = as.character(`Order.ID`)
) %>%
predict(preProcess(., method = c("center", "scale")), .)
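The final pipe step relies on magrittr substituting the dot into both arguments of predict(); written out explicitly, it is equivalent to the two-step form below (pp_1k and df_1k_mut are hypothetical names for the fitted transform and the mutate() result):
# learn the centering/scaling parameters, then apply them
pp_1k <- preProcess(df_1k_mut, method = c("center", "scale"))
df_1k_norm <- predict(pp_1k, df_1k_mut)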
levels(df_1k_norm$Sales.Channel)
## [1] "Offline" "Online"
df_1k_norm %>%
keep(is.numeric) %>%
describe(fast=TRUE) %>%
select(-c(vars,n))
#distributions
df_1k_norm %>%
  keep(is.numeric) %>%
  pivot_longer(cols = 1:6, names_to = "variable", values_to = "value") %>%
  ggplot(aes(value)) +
  facet_wrap(~variable, scales = "free") +
  geom_density() +
  geom_histogram(aes(y = after_stat(density)), bins = 40, alpha = 0.2,
                 fill = "lightblue", color = "darkgreen", position = "identity")
df_1k_norm <- df_1k_norm %>%
  select(-c(Country, Order.ID))
set.seed(1234)
df_1k_norm_svm <- df_1k_norm
#split
training.samples <- df_1k_norm_svm$Total.Revenue %>%
createDataPartition(p = 0.8, list = FALSE)
train_df <- df_1k_norm_svm[training.samples, ]
test_df <- df_1k_norm_svm[-training.samples, ]
svm_model <- svm(formula = Total.Revenue ~ ., data = train_df,
                 type = 'eps-regression')
print(svm_model)
##
## Call:
## svm(formula = Total.Revenue ~ ., data = train_df, type = "eps-regression")
##
##
## Parameters:
## SVM-Type: eps-regression
## SVM-Kernel: radial
## cost: 1
## gamma: 0.03448276
## epsilon: 0.1
##
##
## Number of Support Vectors: 70
Note that the radial kernel is the default for svm() when none is specified.
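The cost, gamma, and epsilon parameters were also left at their defaults. A tuning sketch using e1071’s grid search (the grid values are assumptions; the namespace is spelled out because tune::tune() masks e1071::tune(), as shown in the conflicts above):
# small illustrative grid search over cost and gamma with cross-validation
svm_tuned <- e1071::tune.svm(Total.Revenue ~ ., data = train_df,
                             cost = 2^(0:3), gamma = c(0.01, 0.05, 0.1))
summary(svm_tuned)  # CV error for each parameter combination
best_svm <- svm_tuned$best.model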
predictions_SVM <- predict(svm_model, newdata = test_df) %>%
  bind_cols(test_df)  # the unnamed prediction column gets the default name `...1`
## New names:
## • `` -> `...1`
predictions_SVM$...1 <- as.numeric(predictions_SVM$...1)
MAE <- MAE(predictions_SVM$Total.Revenue, predictions_SVM$...1)
RMSE <- RMSE(predictions_SVM$Total.Revenue, predictions_SVM$...1)
R2 <- R2(predictions_SVM$Total.Revenue, predictions_SVM$...1)
# Create a data frame to store the results
a_svm <- data.frame(Model = "SVM",
MAE = MAE,
RMSE = RMSE,
R2 = R2)
# Print the results
print(a_svm)
## Model MAE RMSE R2
## 1 SVM 0.04818865 0.06337936 0.9961623
df_1k_2 <- df_1k
preproc1 <-preProcess(df_1k_2, method=c("center", "scale"))
norm1 <- predict(preproc1,df_1k_2)
training.samples_slr <- norm1$Total.Revenue %>%
createDataPartition(p = 0.8, list = FALSE)
train.data_slr <- norm1[training.samples_slr, ]
test.data_slr <- norm1[-training.samples_slr, ]
model_slr <- lm(Total.Revenue ~ Units.Sold, data = train.data_slr)
# Make predictions
predictions_slr <- model_slr %>% predict(test.data_slr)
# Calculate the metrics
MAE_slr <- MAE(predictions_slr, test.data_slr$Total.Revenue)
RMSE_slr <- RMSE(predictions_slr, test.data_slr$Total.Revenue)
R2_slr <- R2(predictions_slr, test.data_slr$Total.Revenue)
# Create a data frame to store the results
a_slr <- data.frame(Model = "Simple Linear Regression",
MAE = MAE_slr,
RMSE = RMSE_slr,
R2 = R2_slr)
# Print the results
print(a_slr)
## Model MAE RMSE R2
## 1 Simple Linear Regression 0.6444445 0.8850026 0.2945233
norm2 <- norm1 %>%
select(-c(Country, Unit.Cost, Unit.Price, Total.Cost, Total.Profit, Order.ID, Ship.Date, Order.Date))
set.seed(146)
training.samples_mlr <- norm2$Total.Revenue %>%
createDataPartition(p = 0.8, list = FALSE)
train.data_mlr <- norm2[training.samples_mlr, ]
test.data_mlr <- norm2[-training.samples_mlr, ]
model_mlr <- lm(Total.Revenue ~ ., data = train.data_mlr)
test.data_mlr <- na.omit(test.data_mlr)
# Make predictions
predictions_mlr <- model_mlr %>% predict(test.data_mlr)
# Calculate the metrics
MAE_mlr <- MAE(predictions_mlr, test.data_mlr$Total.Revenue)
RMSE_mlr <- RMSE(predictions_mlr, test.data_mlr$Total.Revenue)
R2_mlr <- R2(predictions_mlr, test.data_mlr$Total.Revenue)
# Create a data frame to store the results
a_mlr <- data.frame(Model = "Multiple Linear Regression",
MAE = MAE_mlr,
RMSE = RMSE_mlr,
R2 = R2_mlr)
# Print the results
print(a_mlr)
## Model MAE RMSE R2
## 1 Multiple Linear Regression 0.3009253 0.4055539 0.8059421
set.seed(1234)
tree <- rpart(Total.Revenue ~., data = train_df, cp = 0.004, method = 'anova')
predictions <- predict(tree, newdata = test_df) %>%
bind_cols(test_df)
## New names:
## • `` -> `...1`
predictions$...1 <- as.numeric(predictions$...1)
a_tree <- data.frame(Model = "Decision Tree 1",
#mean absolute error
MAE = MAE(predictions$Total.Revenue, predictions$...1),
#rmse Root Mean Squared Error
RMSE = RMSE(predictions$Total.Revenue, predictions$...1),
#r squared
R2 = R2(predictions$Total.Revenue, predictions$...1)
)
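rpart.plot is loaded at the top but the fitted trees are never drawn; a quick sketch to inspect the splits of the first tree:
# visualize the splits and leaf predictions of Decision Tree 1
rpart.plot(tree, main = "Decision Tree 1")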
set.seed(123456)
#colnames(df1k_norm1)
df_1k_norm3 <- df_1k_norm %>%
select(-c("Unit.Price","Unit.Cost","Total.Cost", "Total.Profit"))
#split
training.samples3 <- df_1k_norm3$Total.Revenue %>%
createDataPartition(p = 0.8, list = FALSE)
train3 <- df_1k_norm3[training.samples3, ]
test3 <- df_1k_norm3[-training.samples3, ]
# Train with rpart. cp is the complexity parameter: a smaller value permits a
# more complex tree. method = 'anova' fits a regression tree.
tree3 <- rpart(Total.Revenue ~ ., data = train3, cp = 0.004, method = 'anova')
predictions3 <- predict(tree3, newdata = test3) %>%
bind_cols(test3)
## New names:
## • `` -> `...1`
predictions3$...1 <- as.numeric(predictions3$...1)
a_tree2 <- data.frame(Model = "Decision Tree 2",
#mean absolute error
MAE = MAE(predictions3$Total.Revenue, predictions3$...1),
#rmse Root Mean Squared Error
RMSE = RMSE(predictions3$Total.Revenue, predictions3$...1),
#r squared
R2 = R2(predictions3$Total.Revenue, predictions3$...1)
)
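Since the point of the exercise is to force a different first split, a quick way to verify that the two trees actually start from different root variables (the first row of an rpart frame is the root node):
tree$frame$var[1]   # root split of Decision Tree 1 (all predictors available)
tree3$frame$var[1]  # root split of Decision Tree 2 (price/cost/profit removed)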
set.seed(201)
rf <- randomForest(formula = Total.Revenue ~ .,
data = train_df, importance=TRUE)
predictions4 <- predict(rf, newdata = test_df) %>%
bind_cols(test_df)
## New names:
## • `` -> `...1`
predictions4$...1 <- as.numeric(predictions4$...1)
a_rf <- data.frame(Model = "Random Forest",
#mean absolute error
MAE = MAE(predictions4$Total.Revenue, predictions4$...1),
#rmse Root Mean Squared Error
RMSE = RMSE(predictions4$Total.Revenue, predictions4$...1),
#r squared
R2 = R2(predictions4$Total.Revenue, predictions4$...1)
)
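importance = TRUE is set above but never inspected; a quick look at which predictors drive the forest:
# permutation (%IncMSE) and node-purity importance for each predictor
importance(rf)
varImpPlot(rf, main = "Random Forest Variable Importance")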
set.seed(908)
train1 <- train_df %>%
dplyr::select(-Total.Revenue)
bestmtry <- tuneRF(train1,train_df$Total.Revenue, stepFactor = 2, improve = 0.01,
trace=T, plot= F, doBest=TRUE, importance=TRUE)
## mtry = 3 OOB error = 0.002839379
## Searching left ...
## mtry = 2 OOB error = 0.007202836
## -1.536765 0.01
## Searching right ...
## mtry = 6 OOB error = 0.00095211
## 0.6646766 0.01
## mtry = 11 OOB error = 0.002508913
## -1.635108 0.01
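Because doBest = TRUE, tuneRF returns a randomForest object refit at the best mtry found (6 here, per the OOB errors above) rather than just the error table, which is why bestmtry can be passed directly to predict() below. A quick check:
bestmtry$mtry  # mtry of the returned forest; 6 had the lowest OOB error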
predictions5 <- predict(bestmtry, newdata = test_df) %>%
bind_cols(test_df)
## New names:
## • `` -> `...1`
predictions5$...1 <- as.numeric(predictions5$...1)
a_trf <- data.frame(Model = "Tuned Random Forest",
#mean absolute error
MAE = MAE(predictions5$Total.Revenue, predictions5$...1),
#rmse Root Mean Squared Error
RMSE = RMSE(predictions5$Total.Revenue, predictions5$...1),
#r squared
R2 = R2(predictions5$Total.Revenue, predictions5$...1)
)
# compare all fitted models side by side
rbind(a_svm, a_slr, a_mlr, a_tree, a_tree2, a_rf, a_trf)
The article Decision Tree Ensembles to Predict Coronavirus Disease 2019 Infection: A Comparative Study discusses machine learning algorithms, specifically decision tree ensembles, for predicting COVID-19-positive cases based on common lab tests. The article highlights the importance of classifiers designed for imbalanced datasets, with supporting evidence that decision tree ensembles built with imbalance-aware classifiers outperform standard methods. In the examples given, age was a major feature in the tree ensemble, but future studies would incorporate other features to understand their impact. A novel approach to predict COVID-19 using support vector machine presents a method using SVM classification that predicts whether a person has COVID-19 based on the patient’s symptoms. The SVM classifier notably achieved an accuracy of 87% according to the article. The paper also highlighted challenges in adoption, alongside the need for ongoing analysis.
For my three academic papers I chose Decision tree learning through a Predictive Model for Student Academic Performance in Intelligent M-Learning environments, Decision Tree-Based Predictive Models for Academic Achievement Using College Students’ Support Networks, and Using Decision Trees and Random Forest Algorithms to Predict and Determine Factors Contributing to First-Year University Students’ Learning Performance. The common theme across the articles was the impact of the features chosen for the decision tree. In one use case, economic and demographic factors were used to identify high-risk students, which allowed the school to focus on ensuring their success. In another, e-learning tools and their impact were a major factor, while a third used decision trees alongside CHAID and cforest algorithms to examine gender and other demographic factors. Overall, what these papers highlight is that a thorough understanding of the use case helps in designing a model with the right features to drive results.
For the data used in this assignment, the Tuned Random Forest performed best, with an MAE of 0.01200354, an RMSE of 0.02333389, and an R^2 of 0.9995702. Much of this near-perfect fit is expected, however: Total.Revenue is by construction Total.Cost + Total.Profit, so models that see those columns can recover revenue almost exactly, whereas Decision Tree 2 (which excludes them) gives a more realistic picture of out-of-sample performance.