DATA 622 HW2
Introduction
This assignment creates two decision trees from a single dataset, downloaded from https://excelbianalytics.com/wp/downloads-18-sample-csv-files-data-sets-for-testing-sales/, to solve a classification problem: predicting a particular feature of the data. We then create a random forest, analyze the results, and consider how our results might change negative perceptions of decision trees. We will use the following packages for this study,
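The chunk loading them was not echoed in the rendered output; based on the functions called throughout, a plausible set (the exact list is an assumption) is:

library(dplyr)        # select(), mutate(), group_by(), summarize()
library(lubridate)    # month(), year()
library(ggplot2)      # density plot
library(ggthemes)     # theme_few()
library(caret)        # createDataPartition(), train(), confusionMatrix()
library(rpart)        # decision trees
library(rpart.plot)   # rpart.plot()
library(randomForest) # backend for caret's method = "rf"
library(doParallel)   # makeCluster(), registerDoParallel()
library(ROSE)         # assumed here for the class-balancing step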
The Data
We’ll use the 1000 sales records dataset for this assignment.
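The load step was not echoed; a minimal sketch, assuming the downloaded file is named 1000 Sales Records.csv (the file name is hypothetical):

# read the sales records and inspect the raw columns
df <- read.csv("1000 Sales Records.csv")
glimpse(df)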
## Rows: 1,000
## Columns: 14
## $ Region <chr> "Middle East and North Africa", "North America", "Middl…
## $ Country <chr> "Libya", "Canada", "Libya", "Japan", "Chad", "Armenia",…
## $ Item.Type <chr> "Cosmetics", "Vegetables", "Baby Food", "Cereal", "Frui…
## $ Sales.Channel <chr> "Offline", "Online", "Offline", "Offline", "Offline", "…
## $ Order.Priority <chr> "M", "M", "C", "C", "H", "H", "H", "M", "H", "H", "M", …
## $ Order.Date <chr> "10/18/2014", "11/7/2011", "10/31/2016", "4/10/2010", "…
## $ Order.ID <int> 686800706, 185941302, 246222341, 161442649, 645713555, …
## $ Ship.Date <chr> "10/31/2014", "12/8/2011", "12/9/2016", "5/12/2010", "8…
## $ Units.Sold <int> 8446, 3018, 1517, 3322, 9845, 9528, 2844, 7299, 2428, 4…
## $ Unit.Price <dbl> 437.20, 154.06, 255.28, 205.70, 9.33, 205.70, 205.70, 1…
## $ Unit.Cost <dbl> 263.33, 90.93, 159.42, 117.11, 6.92, 117.11, 117.11, 35…
## $ Total.Revenue <dbl> 3692591.20, 464953.08, 387259.76, 683335.40, 91853.85, …
## $ Total.Cost <dbl> 2224085.18, 274426.74, 241840.14, 389039.42, 68127.40, …
## $ Total.Profit <dbl> 1468506.02, 190526.34, 145419.62, 294295.98, 23726.45, …
The dataset contains 1000 observations and 14 dimensions, a mix of numerical and categorical values.
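The chunk checking for missing values was not echoed; the per-column counts below are what colSums(is.na(df)) produces:

# count missing values in each column
colSums(is.na(df))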
## Region Country Item.Type Sales.Channel Order.Priority
## 0 0 0 0 0
## Order.Date Order.ID Ship.Date Units.Sold Unit.Price
## 0 0 0 0 0
## Unit.Cost Total.Revenue Total.Cost Total.Profit
## 0 0 0 0
There are no missing values in the dataset.
Data Prep
We’ll make the following changes to the data:
- convert all column names to lowercase.
- drop the order.id and country columns.
- convert order.date and ship.date to Date.
- convert the remaining character columns to factors.
- create a column time as the difference in days between order.date and ship.date, along with month and year factors derived from order.date.
- drop order.date and ship.date.
- drop total.cost, total.profit, and total.revenue, since these are derived directly from units.sold, unit.price, and unit.cost.
# convert all column names to lowercase
colnames(df) <- tolower(colnames(df))
# drop columns and convert column types
df <- df |>
select(-c(order.id, country)) |>
mutate(order.date = as.Date(order.date, format = "%m/%d/%Y"),
ship.date = as.Date(ship.date, format = "%m/%d/%Y")) |>
mutate(time = as.integer(ship.date - order.date),
month = as.factor(month(order.date)),
year = as.factor(year(order.date))) |>
mutate_if(is.character, as.factor) |>
select(-c(order.date, ship.date, total.cost, total.profit, total.revenue))
glimpse(df)
## Rows: 1,000
## Columns: 10
## $ region <fct> Middle East and North Africa, North America, Middle Eas…
## $ item.type <fct> Cosmetics, Vegetables, Baby Food, Cereal, Fruits, Cerea…
## $ sales.channel <fct> Offline, Online, Offline, Offline, Offline, Online, Onl…
## $ order.priority <fct> M, M, C, C, H, H, H, M, H, H, M, M, C, C, L, H, H, M, C…
## $ units.sold <int> 8446, 3018, 1517, 3322, 9845, 9528, 2844, 7299, 2428, 4…
## $ unit.price <dbl> 437.20, 154.06, 255.28, 205.70, 9.33, 205.70, 205.70, 1…
## $ unit.cost <dbl> 263.33, 90.93, 159.42, 117.11, 6.92, 117.11, 117.11, 35…
## $ time <int> 13, 31, 39, 32, 15, 34, 44, 42, 37, 26, 21, 19, 35, 16,…
## $ month <fct> 10, 11, 10, 4, 8, 11, 3, 5, 1, 12, 12, 2, 11, 12, 1, 6,…
## $ year <fct> 2014, 2011, 2016, 2010, 2011, 2014, 2015, 2012, 2015, 2…
df <- df |>
group_by(region,
year,
month,
item.type,
sales.channel,
order.priority) |>
summarize(sold = sum(units.sold),
avg.cost = mean(unit.cost),
avg.price = mean(unit.price),
avg.time = mean(time)) |>
rename(item = item.type,
channel = sales.channel,
priority = order.priority)
glimpse(df)
## Rows: 991
## Columns: 10
## Groups: region, year, month, item, channel [962]
## $ region <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
## $ year <fct> 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, …
## $ month <fct> 2, 3, 3, 3, 4, 4, 4, 5, 7, 7, 7, 7, 7, 9, 9, 10, 10, 10, 11,…
## $ item <fct> Beverages, Cereal, Fruits, Snacks, Beverages, Cereal, Office…
## $ channel <fct> Offline, Online, Offline, Offline, Online, Offline, Online, …
## $ priority <fct> L, C, L, H, C, C, H, C, H, M, M, M, L, C, H, H, H, M, L, C, …
## $ sold <int> 4571, 5132, 3972, 8929, 1547, 3322, 702, 1237, 8862, 6569, 4…
## $ avg.cost <dbl> 31.79, 117.11, 6.92, 97.44, 31.79, 117.11, 524.96, 263.33, 6…
## $ avg.price <dbl> 47.45, 205.70, 9.33, 152.58, 47.45, 205.70, 651.21, 437.20, …
## $ avg.time <dbl> 33, 10, 10, 41, 6, 32, 7, 1, 24, 39, 3, 31, 45, 41, 16, 50, …
We’ve now reduced our dataset to 991 observations and 10 dimensions.
Let’s take a look at the distribution of average turnaround times, as this could be a possible outcome variable,
df |>
ggplot(aes(x=avg.time)) +
geom_density(fill="lightblue") +
theme_few() +
labs(x="average time")
The distribution of average times is fairly uniform.
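The summary chunk was not echoed; the five-number summary below is presumably from summary(df$avg.time):

# five-number summary of average turnaround time
summary(df$avg.time)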
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 13.00 25.00 25.03 38.00 50.00
The mean and the median are both approximately 25. We create a new variable delayed for all orders that took longer than 30 days to ship:
df <- df |>
mutate(delayed = as.factor(ifelse(
avg.time > 30,
"yes",
"no"))) |>
select(-avg.time)
table(df$delayed)
##
## yes no
## 383 608
We see there is some class imbalance. This could cause problems for our decision trees, so we resample the data to balance the classes (a sketch of this step follows the counts below).
##
## no yes
## 503 488
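The balancing chunk was not echoed. A minimal sketch, assuming ROSE’s ovun.sample with method = "both" (over-sampling one class and under-sampling the other, which keeps the total near the original 991 rows, consistent with the counts above):

set.seed(1)
# resample both classes toward a 50/50 split, keeping the row count constant
balanced.data <- ovun.sample(delayed ~ ., data = as.data.frame(df),
                             method = "both", N = nrow(df))$data
table(balanced.data$delayed)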
Data Modeling
We’re going to predict whether an order was delayed. The delayed variable is our response variable.
First we’ll split the data 80/20 into training and testing sets.
set.seed(1)
# shuffle the dataset
bd <- balanced.data[sample(nrow(balanced.data)), ]
# create split point
split.index <- createDataPartition(bd$delayed, p = .8, list = FALSE)
# split the data
train.set <- bd[split.index, ]
test.set <- bd[-split.index, ]
Decision Trees
We generate the first decision tree, then hold the feature chosen at its first split out of the formula for the second tree.
We observe that the root node in Decision Tree 1 splits on the item feature, so we hold that feature out of Decision Tree 2 and get a very different tree.
Decision Tree 1 splits first on the item feature and achieves a depth of 8, then splits on two different features depending on the outcome of the root split. In this tree, the most important features are item, year, and month.
Decision Tree 2 splits first on the year feature and achieves a depth of 6. In this tree, the most important features are year, month, and region.
Decision Tree 1 is far more complex than Decision Tree 2, indicating low bias, yet it is actually slightly less accurate on the test set, which suggests it is overfitting the data, a characteristic of high variance.
Decision Tree 1
set.seed(1)
tree.1 <- rpart(delayed ~ .,
data = train.set,
method = "class")
rpart.plot(tree.1, tweak = 1.5)
We check the model against our test set,
set.seed(1)
tree1.preds <- predict(tree.1, test.set, type="class")
t1 <- confusionMatrix(tree1.preds, test.set$delayed)
t1.results <- c(t1$overall['Accuracy'],
t1$byClass['Sensitivity'],
t1$byClass['Specificity'],
t1$byClass['Precision'],
t1$byClass['F1'])
t1.results
## Accuracy Sensitivity Specificity Precision F1
## 0.6345178 0.6600000 0.6082474 0.6346154 0.6470588
Decision Tree 2
set.seed(1)
tree.2 <- rpart(delayed ~ . -item,
data = train.set,
method = "class")
rpart.plot(tree.2, tweak=1.5)
We check the model against our test set,
set.seed(1)
tree2.preds <- predict(tree.2, test.set, type="class")
t2 <- confusionMatrix(tree2.preds, test.set$delayed)
t2.results <- c(t2$overall['Accuracy'],
t2$byClass['Sensitivity'],
t2$byClass['Specificity'],
t2$byClass['Precision'],
t2$byClass['F1'])
t2.results
## Accuracy Sensitivity Specificity Precision F1
## 0.6649746 0.6800000 0.6494845 0.6666667 0.6732673
Random Forest
Now we’ll create a random forest classifier and analyze the results.
This can be computationally expensive so we prepare additional cores for parallel processing,
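The chunk creating the cluster was not echoed; the cl object registered below is assumed to be set up along these lines:

# use all but one core for the parallel backend
cl <- makeCluster(detectCores() - 1)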
Next we’ll train our model using 10-fold cross-validation,
set.seed(1)
registerDoParallel(cl)
ctrl <- trainControl(
method = "cv",
number = 10,
allowParallel = TRUE
)
rf.model <- train(delayed ~ .,
                  data = train.set,
                  ntree = 1000,    # randomForest's argument is ntree; ntrees would be silently ignored
                  method = "rf",
                  trControl = ctrl # allowParallel = TRUE already enables the doParallel backend
                  )
stopCluster(cl)
## Random Forest
##
## 794 samples
## 9 predictor
## 2 classes: 'no', 'yes'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 715, 715, 715, 714, 715, 714, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.6664082 0.3317786
## 22 0.6789399 0.3573030
## 42 0.6740032 0.3476397
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 22.
Now we can check our model against the test data,
set.seed(1)
rf.preds <- predict(rf.model, test.set)
rfm <- confusionMatrix(rf.preds, test.set$delayed)
rf.results <- c(rfm$overall['Accuracy'],
rfm$byClass['Sensitivity'],
rfm$byClass['Specificity'],
rfm$byClass['Precision'],
rfm$byClass['F1'])
rf.results
## Accuracy Sensitivity Specificity Precision F1
## 0.7411168 0.7300000 0.7525773 0.7525773 0.7411168
Our model performs better on the testing data than in cross-validation on the training data, achieving 74% accuracy, which suggests low variance. The sensitivity (true positive rate) is 73%: the model correctly identifies 73% of the positive cases, which here are orders that weren’t delayed. The specificity (true negative rate) is a bit higher at 75%, indicating the model correctly identifies 75% of the negative cases, the delayed orders.
We can observe the most important variables to our model:
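The chunk producing the importance plot was not echoed; with caret it would look something like this:

# rank predictors by their importance in the final forest
plot(varImp(rf.model))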
Interestingly, these are not the variables that our individual decision trees chose. In our random forest, the size of the order (sold) and the average price and cost of the items in the order are by far the biggest predictors of whether an order will be on time or delayed. This makes intuitive sense: large orders may require restocks, and the price of an order may indicate its popularity, which can also affect restocking, both factors that might contribute to whether an order is shipped quickly.
Comparing results across models,
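The table was presumably assembled from the three result vectors along these lines:

# stack the per-model metrics into one comparison table
rbind(tree1 = t1.results, tree2 = t2.results, rf = rf.results) |>
  knitr::kable()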
|       | Accuracy  | Sensitivity | Specificity | Precision | F1        |
|-------|-----------|-------------|-------------|-----------|-----------|
| tree1 | 0.6345178 | 0.66        | 0.6082474   | 0.6346154 | 0.6470588 |
| tree2 | 0.6649746 | 0.68        | 0.6494845   | 0.6666667 | 0.6732673 |
| rf    | 0.7411168 | 0.73        | 0.7525773   | 0.7525773 | 0.7411168 |
Discussion
To summarize this analysis: we feature-engineered a dataset of sales records, created a binary response variable representing whether an order was delayed, and resampled the data to achieve class balance in that response. Splitting the data into training and testing sets (using an 80/20 split), we built a decision tree that identified the items contained in the order as a predictor, then built a second tree with the items removed as a predictor and observed that it selected specific years instead. We next used both models to make predictions on the testing data and observed that the first tree was more complex and exhibited low bias and high variance, as it was likely overfitting the data, while the second tree was simpler and proved to be more accurate.

This highlights the bias-variance tradeoff. Bias refers to the error introduced when simplifying a problem, for example by making assumptions about the data, whereas variance refers to the model’s sensitivity to small changes in the training data. High-bias models may have lower variance, but they may be underfitting the data. In this case, Decision Tree 2 has higher bias and lower variance than Decision Tree 1, and is more accurate. One way to balance bias and variance is to use an ensemble method such as a random forest, which randomly samples the data and aggregates the predictions of multiple trees; such a model is less likely to overfit and more likely to generalize to unseen data, providing more accurate predictions. Our next step illustrates this: we trained a random forest model using 10-fold cross-validation to further reduce variance and achieved a predictive accuracy of 74% on the remaining 20% of our data, identifying orders that weren’t delayed with a sensitivity of 73%.

We observed that the most important predictors in our random forest were the number of items sold, the average cost, and the average price, not, as identified by our earlier individual trees, the specific items ordered or the years the orders were placed. This highlights what the https://decizone.com/blog/the-good-the-bad-the-ugly-of-using-decision-trees article describes as “The BAD”, or disadvantages, of using decision trees: the individual trees failed to identify the variables that were most important to the overall predictive model. Our individual decision trees also highlight what the article refers to as “The UGLY”: the trees are complex and hard to interpret. One way to change this perception when presenting the decision trees we created would be to contextualize them: describe them as part of the exploratory analysis, simplify the labeling so the plots are more legible, explain to the audience the decision to exclude the key feature from the second tree, and compare the final result with the earlier iterations, explaining how the random forest model uses cross-validation to reduce variance.

The key takeaway from this assignment is that a random forest, by bootstrapping samples from the data and aggregating the predictions of multiple trees, is able to capture complex patterns in the data that are not apparent in any individual tree. The result is a model with less variance than the individual models that generalizes better to unseen data.
One way to illustrate this to stakeholders when walking through the concept of a random forest would be to recreate the steps we just took: build a few decision trees with different predictors, which will look quite different from one another, and then explain how a random forest does this n times (in our case, 1,000) and aggregates the results to make more accurate predictions.
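A minimal sketch of that walkthrough, reusing train.set and test.set from above (the five-tree ensemble and majority vote are purely illustrative):

set.seed(1)
# grow a handful of trees, each on its own bootstrap sample
votes <- replicate(5, {
  boot <- train.set[sample(nrow(train.set), replace = TRUE), ]
  fit <- rpart(delayed ~ ., data = boot, method = "class")
  as.character(predict(fit, test.set, type = "class"))
})
# majority vote across the trees, as a random forest does ntree times
majority <- apply(votes, 1, function(v) names(which.max(table(v))))
mean(majority == as.character(test.set$delayed))  # ensemble accuracy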