library(caret)
library(corrplot)
library(dplyr)
library(e1071)
library(forecast)
library(ggforce)
library(ggplot2)
library(labelled)
library(Metrics)
library(mlbench)
library(ModelMetrics)
library(pROC)
library(psych)
library(RColorBrewer)
library(readr)
library(readxl)
library(randomForest)
library(rpart)
library(rpart.plot)
library(tidymodels)
library(tidyr)
library(tidyverse)
library(tsibble)
Based on the latest topics presented, choose a dataset of your choice and create a decision tree to solve a classification problem and predict the outcome of a particular feature or detail of the data used. Switch variables* to generate 2 decision trees and compare the results. Create a random forest and analyze the results. Based on real cases where decision trees went wrong, and ‘the bad & ugly’ aspects of decision trees (https://decizone.com/blog/the-good-the-bad-the-ugly-of-using-decision-trees), how can you change this perception when using the decision tree you created to solve a real problem?
Write a short essay explaining your analysis and how you would address the concerns in the blog (listed in the pre-work). Exploratory analysis using R or Python (submit code + errors + analysis as a notebook, or copy/paste into a document).
Note: originally attempted with the 100k dataset, but the randomForest function would not finish computing.
head(df_1k)
describe(df_1k)
str(df_1k)
'data.frame': 1000 obs. of 14 variables:
$ Region : chr "Middle East and North Africa" "North America" "Middle East and North Africa" "Asia" ...
$ Country : chr "Libya" "Canada" "Libya" "Japan" ...
$ Item.Type : chr "Cosmetics" "Vegetables" "Baby Food" "Cereal" ...
$ Sales.Channel : chr "Offline" "Online" "Offline" "Offline" ...
$ Order.Priority: chr "M" "M" "C" "C" ...
$ Order.Date : chr "10/18/2014" "11/7/2011" "10/31/2016" "4/10/2010" ...
$ Order.ID : int 686800706 185941302 246222341 161442649 645713555 683458888 679414975 208630645 266467225 118598544 ...
$ Ship.Date : chr "10/31/2014" "12/8/2011" "12/9/2016" "5/12/2010" ...
$ Units.Sold : int 8446 3018 1517 3322 9845 9528 2844 7299 2428 4800 ...
$ Unit.Price : num 437.2 154.06 255.28 205.7 9.33 ...
$ Unit.Cost : num 263.33 90.93 159.42 117.11 6.92 ...
$ Total.Revenue : num 3692591 464953 387260 683335 91854 ...
$ Total.Cost : num 2224085 274427 241840 389039 68127 ...
$ Total.Profit : num 1468506 190526 145420 294296 23726 ...
summary(df_1k)
Region Country Item.Type Sales.Channel
Length:1000 Length:1000 Length:1000 Length:1000
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
Order.Priority Order.Date Order.ID Ship.Date
Length:1000 Length:1000 Min. :102928006 Length:1000
Class :character Class :character 1st Qu.:328074026 Class :character
Mode :character Mode :character Median :556609714 Mode :character
Mean :549681325
3rd Qu.:769694483
Max. :995529830
Units.Sold Unit.Price Unit.Cost Total.Revenue
Min. : 13 Min. : 9.33 Min. : 6.92 Min. : 2043
1st Qu.:2420 1st Qu.: 81.73 1st Qu.: 56.67 1st Qu.: 281192
Median :5184 Median :154.06 Median : 97.44 Median : 754939
Mean :5054 Mean :262.11 Mean :184.97 Mean :1327322
3rd Qu.:7537 3rd Qu.:421.89 3rd Qu.:263.33 3rd Qu.:1733503
Max. :9998 Max. :668.27 Max. :524.96 Max. :6617210
Total.Cost Total.Profit
Min. : 1417 Min. : 532.6
1st Qu.: 164932 1st Qu.: 98376.1
Median : 464726 Median : 277226.0
Mean : 936119 Mean : 391202.6
3rd Qu.:1141750 3rd Qu.: 548456.8
Max. :5204978 Max. :1726181.4
glimpse(df_1k)
Rows: 1,000
Columns: 14
$ Region <chr> "Middle East and North Africa", "North America", "Middle…
$ Country <chr> "Libya", "Canada", "Libya", "Japan", "Chad", "Armenia", …
$ Item.Type <chr> "Cosmetics", "Vegetables", "Baby Food", "Cereal", "Fruit…
$ Sales.Channel <chr> "Offline", "Online", "Offline", "Offline", "Offline", "O…
$ Order.Priority <chr> "M", "M", "C", "C", "H", "H", "H", "M", "H", "H", "M", "…
$ Order.Date <chr> "10/18/2014", "11/7/2011", "10/31/2016", "4/10/2010", "8…
$ Order.ID <int> 686800706, 185941302, 246222341, 161442649, 645713555, 6…
$ Ship.Date <chr> "10/31/2014", "12/8/2011", "12/9/2016", "5/12/2010", "8/…
$ Units.Sold <int> 8446, 3018, 1517, 3322, 9845, 9528, 2844, 7299, 2428, 48…
$ Unit.Price <dbl> 437.20, 154.06, 255.28, 205.70, 9.33, 205.70, 205.70, 10…
$ Unit.Cost <dbl> 263.33, 90.93, 159.42, 117.11, 6.92, 117.11, 117.11, 35.…
$ Total.Revenue <dbl> 3692591.20, 464953.08, 387259.76, 683335.40, 91853.85, 1…
$ Total.Cost <dbl> 2224085.18, 274426.74, 241840.14, 389039.42, 68127.40, 1…
$ Total.Profit <dbl> 1468506.02, 190526.34, 145419.62, 294295.98, 23726.45, 8…
look_for(df_1k)
pos variable label col_type missing values
1 Region — chr 0
2 Country — chr 0
3 Item.Type — chr 0
4 Sales.Channel — chr 0
5 Order.Priority — chr 0
6 Order.Date — chr 0
7 Order.ID — int 0
8 Ship.Date — chr 0
9 Units.Sold — int 0
10 Unit.Price — dbl 0
11 Unit.Cost — dbl 0
12 Total.Revenue — dbl 0
13 Total.Cost — dbl 0
14 Total.Profit — dbl 0
apply(df_1k, 2, function(x) sum(is.na(x)))
Region Country Item.Type Sales.Channel Order.Priority
0 0 0 0 0
Order.Date Order.ID Ship.Date Units.Sold Unit.Price
0 0 0 0 0
Unit.Cost Total.Revenue Total.Cost Total.Profit
0 0 0 0
unique(df_1k$Region)
[1] "Middle East and North Africa" "North America"
[3] "Asia" "Sub-Saharan Africa"
[5] "Europe" "Central America and the Caribbean"
[7] "Australia and Oceania"
#unique(df_1k$Country)
length(unique(df_1k$Country))
[1] 185
table(df_1k$Item.Type)
Baby Food Beverages Cereal Clothes Cosmetics
87 101 79 78 75
Fruits Household Meat Office Supplies Personal Care
70 77 78 89 87
Snacks Vegetables
82 97
table(df_1k$Sales.Channel)
Offline Online
520 480
unique(df_1k$Order.Priority)
[1] "M" "C" "H" "L"
#select numeric columns 1k
df_1k_num <- df_1k %>%
keep(is.numeric)
#stats
describe(df_1k_num, fast=TRUE) %>%
select(c(-vars,-n))
#distributions
df_1k_num %>%
  pivot_longer(cols = everything(), names_to = "variable", values_to = "value") %>% #pivot all numeric columns (cols = 1:6 silently left the last one out)
ggplot(aes(value)) +
facet_wrap(~variable, scales = "free") +
geom_density() +
geom_histogram(aes(y = after_stat(density)), bins = 40, alpha = 0.2, fill = "lightblue", color = "darkgreen")
From the initial EDA we see that several of the numeric variables are right-skewed (means well above medians for the revenue, cost, and profit fields) and that the monetary fields span very wide ranges.
corr_matrix <- cor(df_1k_num)
corrplot(corr_matrix,
type = "lower",
order = "hclust",
tl.col = "blue",
addCoef.col = "white",
diag = FALSE,
title = "Corrplot",
mar = c(0, 0, 1, 0),
col = brewer.pal(10, "RdYlBu"))
Looking at the correlation plot, the unit price, cost, revenue, and profit variables appear strongly correlated with one another. I suspect multicollinearity, but will use an additional method (VIF) to confirm.
set.seed(321)
sample_1k_train <- df_1k_num$Total.Revenue %>%
createDataPartition(p = 0.8, list = FALSE)
df_train_1k <- df_1k_num[sample_1k_train, ]
df_test_1k <- df_1k_num[-sample_1k_train, ]
model <- lm(Total.Revenue ~ ., data = df_train_1k)
vif_values<-car::vif(model)
print(vif_values)
Order.ID Units.Sold Unit.Price Unit.Cost Total.Cost Total.Profit
1.002970 3.041373 167.637725 167.273600 11.445468 14.149909
The values are interpreted as follows: * Order.ID has low multicollinearity * Units.Sold has low multicollinearity * Unit.Price and Unit.Cost have very high multicollinearity (VIF ≈ 167) * Total.Cost and Total.Profit have moderate multicollinearity, though both sit above the conventional cutoff of 10.
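As a quick sanity check (a minimal sketch beyond the original workflow; which column to drop from each collinear pair is an illustrative choice), removing one variable from each highly correlated pair should pull the remaining VIF scores back down:
#illustrative: drop one member of each collinear pair and re-check VIF
model_reduced <- lm(Total.Revenue ~ ., data = df_train_1k %>% select(-Unit.Cost, -Total.Profit))
car::vif(model_reduced) #remaining VIFs should fall well below 10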
The only transformations needed are: * parsing the date values into Date type * defining levels for the categorical values * scaling during pre-processing for modelling * selecting the relevant attributes.
df_1k[['Order.Date']] <- as.Date(df_1k[['Order.Date']], "%m/%d/%Y")
df_1k[['Ship.Date']] <- as.Date(df_1k[['Ship.Date']], "%m/%d/%Y")
df_1k[['Sales.Channel']] <- as.factor(df_1k[['Sales.Channel']])
df_1k[['Order.Priority']] <- as.factor(df_1k[['Order.Priority']])
df_1k[['Item.Type']] <- as.factor(df_1k[['Item.Type']])
df_1k[['Region']] <- as.factor(df_1k[['Region']])
df_1k[['Country']] <- as.factor(df_1k[['Country']])
df_1k[['Order.ID']] <- as.character(df_1k[['Order.ID']])
df_1k_norm<-predict(preProcess(df_1k, method=c("center", "scale")),df_1k)
df_1k_norm %>%
keep(is.numeric) %>%
describe(fast=TRUE) %>%
select(-c(vars,n))
df_1k_norm %>%
select(where(is.numeric)) %>% # keep numeric columns
{list(summary = summary(.),
plot = ggplot(tidyr::pivot_longer(., cols = everything()),
aes(value)) +
facet_wrap(~name, scales = "free") +
geom_density() +
geom_histogram(aes(y=after_stat(density)), alpha=0.2, fill = "lightblue",
color="darkgreen", position="identity", bins = 40))
}
$summary
Units.Sold Unit.Price Unit.Cost Total.Revenue
Min. :-1.73745 Min. :-1.1701 Min. :-1.0157 Min. :-0.8915
1st Qu.:-0.90775 1st Qu.:-0.8350 1st Qu.:-0.7319 1st Qu.:-0.7037
Median : 0.04481 Median :-0.5002 Median :-0.4993 Median :-0.3851
Mean : 0.00000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
3rd Qu.: 0.85572 3rd Qu.: 0.7397 3rd Qu.: 0.4471 3rd Qu.: 0.2732
Max. : 1.70402 Max. : 1.8802 Max. : 1.9396 Max. : 3.5586
Total.Cost Total.Profit
Min. :-0.8040 Min. :-1.0183
1st Qu.:-0.6633 1st Qu.:-0.7633
Median :-0.4055 Median :-0.2971
Mean : 0.0000 Mean : 0.0000
3rd Qu.: 0.1769 3rd Qu.: 0.4099
Max. : 3.6719 Max. : 3.4798
$plot
#drop identifiers: Order.ID is unique per order, and randomForest cannot handle factors with more than 53 levels (Country has 185)
df_1k_norm <- df_1k_norm %>%
  select(-c(Country, Order.ID))
set.seed(1234)
df1k_norm1 <- df_1k_norm
#split
training_1k_samples <- df1k_norm1$Total.Revenue %>%
createDataPartition(p = 0.8, list = FALSE)
train_1k1 <- df1k_norm1[training_1k_samples, ]
test_1k1 <- df1k_norm1[-training_1k_samples, ]
#train with rpart; cp controls complexity (smaller cp = a more complex tree)
#method = 'anova' fits a regression tree
tree_1k1 <- rpart(Total.Revenue ~., data = train_1k1, cp = 0.004, method = 'anova')
#visualize
rpart.plot(tree_1k1)
print(tree_1k1)
n= 800
node), split, n, deviance, yval
* denotes terminal node
1) root 800 789.9383000 -0.006752507
2) Total.Cost< 0.6147312 654 108.3324000 -0.410574000
4) Total.Cost< -0.3106351 456 16.8432000 -0.634194800
8) Total.Cost< -0.6429517 216 1.1617710 -0.802689100 *
9) Total.Cost>=-0.6429517 240 4.0300370 -0.482550000 *
5) Total.Cost>=-0.3106351 198 16.1706800 0.104431600
10) Total.Cost< 0.002629725 109 1.8064560 -0.101871600 *
11) Total.Cost>=0.002629725 89 4.0434160 0.357095100 *
3) Total.Cost>=0.6147312 146 97.2280900 1.802146000
6) Total.Cost< 2.244415 102 19.0550000 1.348138000
12) Total.Profit< 0.1788319 27 1.5098580 0.846133500 *
13) Total.Profit>=0.1788319 75 8.2913770 1.528860000
26) Total.Cost< 1.124162 25 1.4624880 1.217733000 *
27) Total.Cost>=1.124162 50 3.1988910 1.684424000 *
7) Total.Cost>=2.244415 44 8.4097410 2.854619000
14) Total.Cost< 2.891512 18 0.9282189 2.399053000 *
15) Total.Cost>=2.891512 26 1.1594990 3.170012000 *
Predictions
predictions <- predict(tree_1k1, newdata = test_1k1) %>%
bind_cols(test_1k1 )
predictions$...1 <- as.numeric(predictions$...1)
Performance
decision_tree_model <- data.frame(Model = "Decision Tree 1",
MAE = ModelMetrics::mae(predictions$Total.Revenue, predictions$...1),
#rmse Root Mean Squared Error
RMSE = ModelMetrics::rmse(predictions$Total.Revenue, predictions$...1),
#r squared
R2 = caret::R2(predictions$Total.Revenue, predictions$...1)
)
decision_tree_model
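One way to back the cp = 0.004 choice (a sketch, not run as part of the assignment) is to inspect the cross-validation table that rpart grows automatically and prune at the cp with the lowest cross-validated error:
printcp(tree_1k1) #cross-validated error for each candidate cp
best_cp <- tree_1k1$cptable[which.min(tree_1k1$cptable[, "xerror"]), "CP"]
tree_1k1_pruned <- prune(tree_1k1, cp = best_cp)
rpart.plot(tree_1k1_pruned)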
set.seed(4321)
df_1k_norm2 <- df_1k_norm %>%
select(-c("Unit.Price","Unit.Cost","Total.Cost", "Total.Profit"))
#split
training_1k_samples2 <- df_1k_norm2$Total.Revenue %>%
createDataPartition(p = 0.8, list = FALSE)
train_1k2 <- df_1k_norm2[training_1k_samples2, ]
test_1k2 <- df_1k_norm2[-training_1k_samples2, ]
#train with rpart; cp controls complexity (smaller cp = a more complex tree)
#method = 'anova' fits a regression tree
tree_1k2 <- rpart(Total.Revenue ~., data = train_1k2, cp = 0.004, method = 'anova')
#visualize
rpart.plot(tree_1k2)
print(tree_1k2)
n= 800
node), split, n, deviance, yval
* denotes terminal node
1) root 800 822.8269000 0.003789579
2) Item.Type=Baby Food,Beverages,Cereal,Clothes,Fruits,Personal Care,Snacks,Vegetables 539 75.4121000 -0.474707900
4) Item.Type=Beverages,Clothes,Fruits,Personal Care 270 8.8690330 -0.685383800 *
5) Item.Type=Baby Food,Cereal,Snacks,Vegetables 269 42.5309600 -0.263248700
10) Units.Sold< -0.08719589 131 4.9589340 -0.587279500
20) Units.Sold< -0.9037052 64 0.4596006 -0.751204600 *
21) Units.Sold>=-0.9037052 67 1.1367960 -0.430694400 *
11) Units.Sold>=-0.08719589 138 10.7607700 0.044345760
22) Item.Type=Snacks,Vegetables 69 1.2076570 -0.148450800 *
23) Item.Type=Baby Food,Cereal 69 4.4235870 0.237142300 *
3) Item.Type=Cosmetics,Household,Meat,Office Supplies 261 369.1487000 0.991951000
6) Units.Sold< 0.1121923 133 47.2663700 0.039783430
12) Units.Sold< -0.7953083 75 6.9743660 -0.381010600
24) Units.Sold< -1.301275 32 0.6975396 -0.671731000 *
25) Units.Sold>=-1.301275 43 1.5595200 -0.164660400 *
13) Units.Sold>=-0.7953083 58 9.8394350 0.583913600
26) Item.Type=Cosmetics,Meat 32 1.5942220 0.310831200 *
27) Item.Type=Household,Office Supplies 26 2.9217760 0.920015000 *
7) Units.Sold>=0.1121923 128 76.0103700 1.981313000
14) Item.Type=Cosmetics,Meat 63 7.7936750 1.394445000
28) Units.Sold< 0.9454178 28 0.7889117 1.056832000 *
29) Units.Sold>=0.9454178 35 1.2600410 1.664536000 *
15) Item.Type=Household,Office Supplies 65 25.4882900 2.550122000
30) Units.Sold< 0.9302526 32 3.0820170 1.987527000 *
31) Units.Sold>=0.9302526 33 2.4563190 3.095669000 *
Predictions
predictions2 <- predict(tree_1k2, newdata = test_1k2) %>%
bind_cols(test_1k2)
predictions2$...1 <- as.numeric(predictions2$...1)
Performance
decision_tree_model2 <- data.frame(Model = "Decision Tree 2",
#mean absolute error
MAE = ModelMetrics::mae(predictions2$Total.Revenue, predictions2$...1),
#rmse Root Mean Squared Error
RMSE = ModelMetrics::rmse(predictions2$Total.Revenue, predictions2$...1),
#r squared
R2 = caret::R2(predictions2$Total.Revenue, predictions2$...1)
)
decision_tree_model2
set.seed(222)
rf <- randomForest::randomForest(formula = Total.Revenue ~ .,
data = train_1k1, importance=TRUE)
rf
Call:
randomForest(formula = Total.Revenue ~ ., data = train_1k1, importance = TRUE)
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 3
Mean of squared residuals: 0.001567821
% Var explained: 99.84
ImpData <- as.data.frame(importance(rf))
ImpData$Var.Names <- row.names(ImpData)
ggplot(ImpData, aes(x=Var.Names, y=`%IncMSE`)) +
geom_segment( aes(x=Var.Names, xend=Var.Names, y=0, yend=`%IncMSE`), color="lightgreen") +
geom_point(aes(size = IncNodePurity), color="darkgreen", alpha=1) +
theme_light() +
coord_flip() +
theme(
legend.position="bottom",
panel.grid.major.y = element_blank(),
panel.border = element_blank(),
axis.ticks.y = element_blank()
)
Predictions
predictions3 <- predict(rf, newdata = test_1k1) %>%
bind_cols(test_1k1)
predictions3$...1 <- as.numeric(predictions3$...1)
Performance
random_forest_model <- data.frame(Model = "Random Forest",
#mean absolute error
MAE = ModelMetrics::mae(predictions3$Total.Revenue, predictions3$...1),
#rmse Root Mean Squared Error
RMSE = ModelMetrics::rmse(predictions3$Total.Revenue, predictions3$...1),
#r squared
R2 = caret::R2(predictions3$Total.Revenue, predictions3$...1)
)
random_forest_model
set.seed(333)
train_tuned_rf <- train_1k1 %>%
select(-Total.Revenue)
bestmtry <- tuneRF(train_tuned_rf,train_1k1$Total.Revenue, stepFactor = 2, improve = 0.01,
trace=T, plot= T, doBest=TRUE, importance=TRUE)
mtry = 3 OOB error = 0.003143373
Searching left ...
mtry = 2 OOB error = 0.01035999
-2.29582 0.01
Searching right ...
mtry = 6 OOB error = 0.001037668
0.6698873 0.01
mtry = 11 OOB error = 0.002151529
-1.073428 0.01
bestmtry
Call:
randomForest(x = x, y = y, mtry = res[which.min(res[, 2]), 1], importance = TRUE)
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 6
Mean of squared residuals: 0.0005806283
% Var explained: 99.94
#importance(bestmtry)
# Get variable importance from the model fit
ImpData <- as.data.frame(importance(bestmtry))
ImpData$Var.Names <- row.names(ImpData)
ggplot(ImpData, aes(x=Var.Names, y=`%IncMSE`)) +
geom_segment( aes(x=Var.Names, xend=Var.Names, y=0, yend=`%IncMSE`), color="lightgreen") +
geom_point(aes(size = IncNodePurity), color="darkgreen", alpha=1) +
theme_light() +
coord_flip() +
theme(
legend.position="bottom",
panel.grid.major.y = element_blank(),
panel.border = element_blank(),
axis.ticks.y = element_blank()
)
Predictions
predictions4 <- predict(bestmtry, newdata = test_1k1) %>%
bind_cols(test_1k1)
predictions4$...1 <- as.numeric(predictions4$...1)
Model Performance
random_forest_tuned_model <- data.frame(Model = "Tuned Random Forest",
#mean absolute error
MAE = ModelMetrics::mae(predictions4$Total.Revenue, predictions4$...1),
#rmse Root Mean Squared Error
RMSE = ModelMetrics::rmse(predictions4$Total.Revenue, predictions4$...1),
#r squared
R2 = caret::R2(predictions4$Total.Revenue, predictions4$...1)
)
random_forest_tuned_model
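To view all four models side by side (assuming the four result data frames above are still in scope):
bind_rows(decision_tree_model, decision_tree_model2, random_forest_model, random_forest_tuned_model)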
This assignment builds on HW1, adding an implementation of the randomForest algorithm. My original goal was to use the 100k-observation dataset from HW1: after working out the best way to transform the data, I would assess the performance and practicality of the random forest and decision tree, and then, for my own benefit, compare the results against my original assignment. The issue that arose was the combination of the randomForest method and the large dataset: the computational load was too great and the function cycled without returning a result. Given the submission deadline, I chose to use the 1k dataset instead. On my own time I will rerun the function to gauge how long the computation actually takes; understanding that cost would be useful if I choose randomForest again in the future.
In this assignment I also used VIF scores to better assess the level of multicollinearity, which in HW1 was assessed only with a correlation plot. During the EDA stage, a few transformations were identified before modelling: categorical values were converted to factors, and the date columns were parsed as dates. The distributions of several variables were skewed, and the correlation plot showed that the numeric attributes would be the most useful for the model. There was very little correlation involving the categorical or date values, so those attributes were removed. All numeric attributes were kept, even those the VIF scores flagged for multicollinearity. The preProcess function was used for scaling; the motivation was to ensure all values contribute equally to the analysis, which can be distorted when ranges vary widely across attributes.
Models 1 and 2 are decision trees; for Model 2, the highly correlated variables were removed to assess the impact. I expected Model 2 to outperform on all measures, but it only retained a higher R2, meaning a higher proportion of the variance is explained by the model, while Model 1 had the lower MAE and RMSE, indicating better precision and accuracy. The random forest had similar results: a higher R2, but also higher RMSE and MAE, so a larger proportion of the dependent variable is explained while accuracy and precision are technically lower. Tuning the random forest gave some improvement in RMSE and MAE and performed best as measured by R2, but its RMSE and MAE were still slightly higher than the decision tree's. I imagine these results would differ with a larger dataset, and I intend to rerun this work on my own after the assignment is submitted.
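If I do rerun the 100k dataset, a few randomForest arguments should reduce the computational load substantially (a hedged sketch: train_100k is a hypothetical training set built the same way as train_1k1, and the parameter values are illustrative starting points, not tuned choices):
rf_big <- randomForest(Total.Revenue ~ ., data = train_100k, #hypothetical 100k training set
                       ntree = 100, #fewer trees than the default of 500
                       sampsize = 10000, #subsample used to grow each tree
                       nodesize = 50) #larger terminal nodes = shallower trees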