In this project, I implement decision tree models on the Carseats data set.
Decision Tree: Decision tree methodology is a commonly used data mining method for establishing classification systems based on multiple covariates or for developing prediction algorithms for a target variable. This method classifies a population into branch-like segments that form an inverted tree with a root node, internal nodes, and leaf nodes. The algorithm is non-parametric and can efficiently handle large, complicated data sets without imposing a rigid parametric structure. The main components of a decision tree model are nodes and branches, and the most important steps in building a model are splitting, stopping, and pruning.
Tree Pruning: An alternative way to build a decision tree model is to grow a large tree first and then prune it back to an optimal size by removing nodes that provide little additional information. A smaller tree with fewer splits may have lower variance and be easier to interpret, at the cost of a little bias.
Bootstrapping: Bootstrapping is random sampling with replacement from the available training data set.
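A minimal sketch of a single bootstrap sample in R, using a small hypothetical data frame dat (an assumption for illustration only):
# Draw one bootstrap sample by sampling row indices with replacement;
# some rows appear more than once, others not at all
dat <- data.frame(x = 1:10, y = rnorm(10))
boot_rows <- sample(nrow(dat), size = nrow(dat), replace = TRUE)
boot_sample <- dat[boot_rows, ]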
Bagging (Bootstrap aggregation): Bagging is an ensemble machine learning technique (one that combines several models) that helps reduce variance. We train a separate model on each bootstrapped training data set and then average their predictions (or take a majority vote for classification) to obtain the final outcome.
Random Forest: In bagging, we build a number of decision trees on bootstrapped training samples. A random forest additionally chooses a random subset of the features as split candidates at each split. Because only a subset of features is considered, the individual trees are less correlated with one another, which further reduces the variance of the averaged prediction.
Boosting: Boosting is another ensemble method that improves prediction accuracy. In boosting, trees are grown sequentially: each tree is grown using information from the previously grown trees. Boosting does not involve bootstrap sampling; instead, each tree is fit to a modified version of the original data set, typically the residuals of the current model.
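As a rough illustration of this idea (not the gbm implementation used later), the sketch below fits small regression trees to the residuals of a simulated toy data set and accumulates them with a shrinkage factor; the data, the number of trees, and the shrinkage value are assumptions chosen only for illustration.
# Toy sketch of boosting for regression: each tree is fit to the current
# residuals and added to the model, scaled by a shrinkage factor lambda
library(tree)
set.seed(1)
toy <- data.frame(x = runif(200))
toy$y <- sin(2 * pi * toy$x) + rnorm(200, sd = 0.2)   # simulated response
toy$resid <- toy$y                                     # residuals start at y (model = 0)
lambda <- 0.1                                          # shrinkage (learning rate)
pred <- rep(0, nrow(toy))
for (b in 1:100) {
  fit <- tree(resid ~ x, data = toy)                   # fit a tree to the residuals
  pred <- pred + lambda * predict(fit, newdata = toy)  # add the shrunken tree
  toy$resid <- toy$y - pred                            # update residuals
}
mean((toy$y - pred)^2)                                 # training MSE of the boosted fit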
Data:
Carseats data set has 400 observations on the following 11 variables.
Sales: Unit sales (in thousands) at each location
CompPrice: Price charged by competitor at each location
Income: Community income level (in thousands of dollars)
Advertising: Local advertising budget for company at each location (in thousands of dollars)
Population: Population size in region (in thousands)
Price: Price company charges for car seats at each site
ShelveLoc: A factor with levels Bad, Good and Medium indicating the quality of the shelving location for the car seats at each site
Age: Average age of the local population
Education: Education level at each location
Urban: A factor with levels No and Yes to indicate whether the store is in an urban or rural location
US: A factor with levels No and Yes to indicate whether the store is in the US or not
Objective:
Analyze the data
Fit a decision tree and report the test error
Perform Cross Validation to determine the optimal level of tree complexity and find MSE
Perform Random Forest Model and find MSE
Perform Bagging and find MSE
Perform Boosting and Find MSE
Find the Most Important Features and Build the final model
Compare the models
Loading Libraries
#Loading necessary libraries
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.4 v stringr 1.4.0
## v tidyr 1.1.3 v forcats 0.5.1
## v readr 2.0.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
library(formattable)
library(pls) #PCR & PLS regression
##
## Attaching package: 'pls'
## The following object is masked from 'package:caret':
##
## R2
## The following object is masked from 'package:stats':
##
## loadings
library(ISLR)
library(ggplot2)
library(tree)
## Registered S3 method overwritten by 'tree':
## method from
## print.tree cli
library(gbm)
## Loaded gbm 2.1.8
Load Data Set
attach(Carseats)
head(Carseats,2)
## Sales CompPrice Income Advertising Population Price ShelveLoc Age Education
## 1 9.50 138 73 11 276 120 Bad 42 17
## 2 11.22 111 48 16 260 83 Good 65 10
## Urban US
## 1 Yes Yes
## 2 Yes Yes
#Structure of the data
str(Carseats)
## 'data.frame': 400 obs. of 11 variables:
## $ Sales : num 9.5 11.22 10.06 7.4 4.15 ...
## $ CompPrice : num 138 111 113 117 141 124 115 136 132 132 ...
## $ Income : num 73 48 35 100 64 113 105 81 110 113 ...
## $ Advertising: num 11 16 10 4 3 13 0 15 0 0 ...
## $ Population : num 276 260 269 466 340 501 45 425 108 131 ...
## $ Price : num 120 83 80 97 128 72 108 120 124 124 ...
## $ ShelveLoc : Factor w/ 3 levels "Bad","Good","Medium": 1 2 3 3 1 1 3 2 3 3 ...
## $ Age : num 42 65 59 55 38 78 71 67 76 76 ...
## $ Education : num 17 10 12 14 13 16 15 10 10 17 ...
## $ Urban : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 2 1 1 ...
## $ US : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 1 2 1 2 ...
#Check for missing values
sum(is.na(Carseats))
## [1] 0
There is no missing data in the dataset.
# Minimum Number of Carseat Sales
head(arrange(Carseats, Sales),5)
## Sales CompPrice Income Advertising Population Price ShelveLoc Age Education
## 175 0.00 139 24 0 358 185 Medium 79 15
## 107 0.16 102 33 0 217 139 Medium 70 18
## 166 0.37 147 58 7 100 191 Bad 27 15
## 144 0.53 122 88 7 36 159 Bad 28 17
## 58 0.91 93 91 0 22 117 Bad 75 11
## Urban US
## 175 No No
## 107 No No
## 166 Yes Yes
## 144 Yes Yes
## 58 Yes No
# Maximum Number of Carseat Sales
head(arrange(Carseats, desc(Sales)),5)
## Sales CompPrice Income Advertising Population Price ShelveLoc Age Education
## 377 16.27 141 60 19 319 92 Good 44 11
## 317 15.63 122 36 5 369 72 Good 35 10
## 26 14.90 139 32 0 176 82 Good 54 11
## 368 14.37 95 106 0 256 53 Good 52 17
## 19 13.91 110 110 0 408 68 Good 46 17
## Urban US
## 377 Yes Yes
## 317 Yes Yes
## 26 No No
## 368 Yes No
## 19 No Yes
# Highest advertising budgets
head(arrange(Carseats, desc(Advertising )))
## Sales CompPrice Income Advertising Population Price ShelveLoc Age Education
## 311 9.53 175 65 29 419 166 Medium 53 12
## 306 8.03 115 29 26 394 132 Medium 33 13
## 290 8.75 143 77 25 448 156 Medium 43 17
## 99 12.49 122 77 24 382 127 Good 36 16
## 76 8.55 88 111 23 480 92 Bad 36 16
## 255 9.58 108 104 23 353 129 Good 37 17
## Urban US
## 311 Yes Yes
## 306 Yes Yes
## 290 Yes Yes
## 99 No Yes
## 76 No Yes
## 255 Yes Yes
# Car seat sales and Advertising by Shelve Location
ggplot(Carseats, aes(Sales, Advertising, colour = ShelveLoc)) +
geom_point()
# Car seat sales vs Population by Shelve Location
ggplot(Carseats, aes(Sales, Population, colour = ShelveLoc)) +
geom_point()
# Car seat sales by City type
ggplot(Carseats, aes(Urban, Sales)) +
geom_boxplot(outlier.colour = "red", outlier.shape = 19)
# Car seats Price by City type
ggplot(Carseats, aes(Urban,Price)) +
geom_boxplot(outlier.colour = "red", outlier.shape = 19)
# Car seat sales vs Price, for US and non-US stores
ggplot(Carseats, aes(Sales, Price, colour = US)) +
geom_point() +
stat_smooth(method = "lm",fill = NA)
## `geom_smooth()` using formula 'y ~ x'
# Market value vs sales
# Note that market value is defined here as the company's price minus the competitor's price
Market_Value <- Carseats$Price-Carseats$CompPrice
newdat <- data.frame(Carseats, Market_Value)
newdf <- arrange(newdat, desc(Sales))
mean_value<-mean(newdf$Market_Value)
ggplot(data=newdf, aes(x=Market_Value, y=Sales)) +
ggtitle("Market Value Vs Sales")+
geom_bar(stat="identity", width=0.6, fill="steelblue")+
geom_segment(aes(x = mean_value, y = -13, xend = mean_value, yend = 100), colour = "red") +
geom_text(aes(x=mean_value, y= -15, label = round(mean_value, 2)), colour = "red", vjust = 1, size = 2.6)+
geom_text(aes(x=mean_value, y= 101, label = "Mean"), colour = "red", vjust = 1, size = 3)+
theme_minimal()
Data Partition
set.seed(111)
ind <- sample(2,nrow(Carseats), replace=TRUE, prob = c(0.8,0.2))
train <- Carseats[ind==1,]
test <- Carseats[ind==2,]
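Since caret is loaded but otherwise unused, an alternative (though not identical) 80/20 split could be drawn with createDataPartition(); this is only a sketch and is not used in the analysis below.
# Alternative split using caret (sketch only; the sample()-based split above is kept)
set.seed(111)
idx <- caret::createDataPartition(Carseats$Sales, p = 0.8, list = FALSE)
# train <- Carseats[idx, ]
# test  <- Carseats[-idx, ]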
#Decision Tree
library(tree)
tree.carseats <-tree(Sales~., data = train)
#Summary Table
summary(tree.carseats)
##
## Regression tree:
## tree(formula = Sales ~ ., data = train)
## Variables actually used in tree construction:
## [1] "ShelveLoc" "Price" "CompPrice" "Age" "Population"
## [6] "Advertising" "Income"
## Number of terminal nodes: 18
## Residual mean deviance: 2.597 = 779.1 / 300
## Distribution of residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -4.56200 -1.08600 -0.08932 0.00000 1.19200 4.51400
The summary function also reports which variables are actually used in the tree.
#Plot tree
plot(tree.carseats)
text(tree.carseats, pretty = 0)
#Prediction
pred.sales <- predict(tree.carseats, newdata = test)
#Test Error
tree_err <- mean((pred.sales - test$Sales)^2)
tree_err
## [1] 3.714836
#Cross Validation Method
cv.carseats <- cv.tree(tree.carseats)
#Plot CV
plot(cv.carseats$size, cv.carseats$dev,xlab = "Size of Tree", ylab = "Deviance", type = "b")
tree.min <- which.min(cv.carseats$dev)
points(cv.carseats$size[tree.min], cv.carseats$dev[tree.min], col = "red", cex = 2, pch = 20)
#Pruning Method
prune.carseats <- prune.tree(tree.carseats, best = 6)
plot(prune.carseats)
text(prune.carseats, pretty = 0)
#Pruning the tree makes the plot much easier to read.
#Prediction
pred.sales2<- predict(prune.carseats, newdata = test)
#Test Error after pruning
cv_err<-mean((pred.sales2 - test$Sales)^2)
cv_err
## [1] 4.738796
However, pruning to 6 terminal nodes increases the test error (from 3.71 to 4.74).
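Instead of hard-coding best = 6, the CV-optimal size could be read directly from cv.carseats; a small sketch (not run above, so its test error is not reported here):
# Pick the tree size with the lowest cross-validated deviance and re-check the test error
best.size <- cv.carseats$size[which.min(cv.carseats$dev)]
prune.opt <- prune.tree(tree.carseats, best = best.size)
mean((predict(prune.opt, newdata = test) - test$Sales)^2)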
#Random Forest
# Use m = sqrt(p) randomly selected predictors at each split (p excludes the response)
m <- round(sqrt(ncol(Carseats) - 1))
rf.carseats <- randomForest(Sales ~ ., data = train,
mtry = m, ntree = 500, importance = TRUE)
yhat.rf <- predict(rf.carseats, newdata = test)
#Test Error
rf_err <- mean((yhat.rf - test$Sales)^2)
rf_err
## [1] 2.439138
importance(rf.carseats)
## %IncMSE IncNodePurity
## CompPrice 16.7404564 235.46198
## Income 5.5674035 189.86313
## Advertising 17.1316753 221.04856
## Population 0.5630675 161.63041
## Price 45.3255688 603.65342
## ShelveLoc 51.4543796 642.41121
## Age 16.8224116 280.15512
## Education 1.7932076 102.42041
## Urban -2.1603225 23.36524
## US 2.8066489 26.37619
# ShelveLoc and Price are the most important variables for predicting Sales
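The choice mtry = sqrt(p) is a heuristic; a quick sketch for comparing a few mtry values on the same test set is shown below (the candidate values are arbitrary, and results depend on the random seed).
# Compare test MSE across a few mtry values (sketch only)
sapply(c(3, 5, 7, 10), function(m) {
  fit <- randomForest(Sales ~ ., data = train, mtry = m, ntree = 500)
  mean((predict(fit, newdata = test) - test$Sales)^2)
})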
#Bagging
# Total number of predictors; bagging uses all p of them at each split
p <- ncol(Carseats) - 1
# The number of trees I choose is 500
bag.carseats <- randomForest(Sales ~ ., data = train,
mtry = p, ntree = 500, importance = TRUE)
yhat.bag <- predict(bag.carseats, newdata = test)
#Test Error
bag_err <- mean((yhat.bag - test$Sales)^2)
bag_err
## [1] 2.031078
dim(Carseats)
## [1] 400 11
#Boosting
# The number of trees I choose is 5000
# interaction.depth limits the depth of each tree
boost.carseats = gbm(Sales~., data=train, distribution= "gaussian",
n.trees=5000, interaction.depth=4 )
yhat.boost <- predict(boost.carseats, newdata = test, n.trees=5000)
#Test Error
boost_err <- mean((yhat.boost - test$Sales)^2)
boost_err
## [1] 1.874585
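As a brief follow-up, gbm can report each predictor's relative influence via summary(); a minimal sketch (output not shown):
# Relative influence of each predictor in the boosted model
summary(boost.carseats, plotit = FALSE)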
importance(bag.carseats)
## %IncMSE IncNodePurity
## CompPrice 31.9242021 268.01916
## Income 12.0772967 144.21809
## Advertising 21.0065904 174.78007
## Population -0.2151486 88.09067
## Price 69.9622675 723.69072
## ShelveLoc 76.7780005 863.07981
## Age 21.5749280 224.57575
## Education 1.0830188 60.03488
## Urban -2.2458234 13.32689
## US 2.6382137 10.43693
varImpPlot(bag.carseats)
# ShelveLoc and Price are the most important variables, since they have the largest effect on MSE.
# CompPrice, Advertising, and Age are also considerably more important than the remaining predictors.
Final Model
#Decision Tree Model with selected features
library(tree)
tree.final<-tree(Sales~ShelveLoc + Price + CompPrice + Advertising + Age, data = train)
#Plot tree
plot(tree.final)
text(tree.final, pretty = 0)
#Prediction
pred.sale <- predict(tree.final, newdata = test)
#Test Error
fin_err <-mean((pred.sale - test$Sales)^2)
fin_err
## [1] 3.806637
df <- tibble(
  id = 1:6,
  Model = c('Decision Tree', 'Pruned Tree (CV)', 'Random Forest', 'Bagging', 'Boosting', 'Final'),
  MSE = c(tree_err, cv_err, rf_err, bag_err, boost_err, fin_err)
)
formattable(df, list(
Model = formatter("span",
style = ~ style(color = ifelse((MSE)<mean(MSE), "green", "red"))),
area(col = c(MSE)) ~ normalize_bar("pink", 0.2)
))
| id | Model | MSE |
|---|---|---|
| 1 | Decision Tree | 3.714836 |
| 2 | Pruned Tree (CV) | 4.738796 |
| 3 | Random Forest | 2.439138 |
| 4 | Bagging | 2.031078 |
| 5 | Boosting | 1.874585 |
| 6 | Final | 3.806637 |
The ensemble methods (random forest, bagging, and boosting) return the lowest test MSEs, with boosting performing best on this test set.