Introduction

In this project, I implement decision tree models and tree-based ensemble methods (bagging, random forests, and boosting) on the Carseats data set.

Decision Tree: Decision tree methodology is a commonly used data mining method for establishing classification systems based on multiple covariates or for developing prediction algorithms for a target variable. This method classifies a population into branch-like segments that form an inverted tree with a root node, internal nodes, and leaf nodes. The algorithm is non-parametric and can efficiently handle large, complicated data sets without imposing a complicated parametric structure. The main components of a decision tree model are nodes and branches, and the most important steps in building a model are splitting, stopping, and pruning.

Tree Pruning: An alternative way to build a decision tree model is to grow a large tree first and then prune it back to an optimal size by removing nodes that provide little additional information. A smaller tree with fewer splits can lead to lower variance and easier interpretation at the cost of a little bias.

Bootstrapping: Bootstrapping is random sampling with replacement from the available training data set.
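
To make this concrete, the sketch below draws a single bootstrap sample from the Carseats data (assuming the ISLR package loaded later is available); on average roughly a third of the original rows are left out of any one sample.

library(ISLR)
set.seed(1)
boot_idx  <- sample(nrow(Carseats), replace = TRUE)   # draw n row indices with replacement
boot_data <- Carseats[boot_idx, ]                     # one bootstrap training set
mean(!(1:nrow(Carseats) %in% boot_idx))               # share of rows never drawn ("out-of-bag"), about 1/3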

Bagging (Bootstrap aggregation): Bagging is an ensemble machine learning technique (one that combines multiple models) that helps reduce variance. We train a model on each bootstrapped training set and then aggregate the predictions: averaging them for regression, or taking a majority vote for classification.
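
The sketch below is a rough from-scratch illustration of bagging for regression (not how the randomForest package implements it); the number of trees B = 25 is arbitrary.

library(tree); library(ISLR)
set.seed(1)
B <- 25
preds <- sapply(1:B, function(b) {
  idx <- sample(nrow(Carseats), replace = TRUE)    # bootstrap sample
  fit <- tree(Sales ~ ., data = Carseats[idx, ])   # tree grown on that sample
  predict(fit, newdata = Carseats)                 # predictions on the full data
})
bagged_pred <- rowMeans(preds)                     # average the B trees' predictions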

Random Forest: In bagging, we build a number of decision trees on bootstrapped training samples. A random forest also builds many trees on bootstrapped samples, but at each split it considers only a random subset of the predictors. Because a few strong predictors cannot dominate every tree, the trees are less correlated with one another, which makes averaging them more effective.
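
In code, the only change from bagging is mtry, the number of predictors considered at each split; a sketch with the randomForest package (loaded later), where the ntree and mtry values are purely illustrative:

library(randomForest); library(ISLR)
set.seed(1)
bag_demo <- randomForest(Sales ~ ., data = Carseats, mtry = 10, ntree = 100)  # bagging: all 10 predictors at every split
rf_demo  <- randomForest(Sales ~ ., data = Carseats, mtry = 3,  ntree = 100)  # random forest: a random subset at each split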

Boosting: Boosting is another ensemble method that improves prediction. In boosting, the trees are grown sequentially: each tree is grown using information from the previously grown trees. Boosting does not involve bootstrap sampling; instead, each tree is fit to a modified version of the original data set (for regression, the residuals left by the current model).
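
The sketch below is a rough from-scratch illustration of boosting a regression tree (it is not how the gbm package works internally); each small tree is fit to the current residuals and added with a shrinkage factor. The predictor subset and the number of rounds are arbitrary.

library(tree); library(ISLR)
set.seed(1)
lambda <- 0.1                               # shrinkage (learning rate)
df     <- Carseats
df$r   <- df$Sales                          # residuals start as the raw response
fhat   <- rep(0, nrow(df))                  # running boosted prediction
for (b in 1:100) {
  fit  <- tree(r ~ Price + ShelveLoc + Age + CompPrice, data = df)
  step <- lambda * predict(fit, df)
  fhat <- fhat + step                       # add the shrunken tree to the model
  df$r <- df$r - step                       # the next tree is fit to what is left over
}
mean((fhat - df$Sales)^2)                   # training MSE shrinks as trees accumulate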

Data:

The Carseats data set has 400 observations on the following 11 variables.

Sales: Unit sales (in thousands) at each location

CompPrice: Price charged by competitor at each location

Income: Community income level (in thousands of dollars)

Advertising: Local advertising budget for company at each location (in thousands of dollars)

Population: Population size in region (in thousands)

Price: Price company charges for car seats at each site

ShelveLoc: A factor with levels Bad, Good and Medium indicating the quality of the shelving location for the car seats at each site

Age: Average age of the local population

Education: Education level at each location

Urban: A factor with levels No and Yes to indicate whether the store is in an urban or rural location

US: A factor with levels No and Yes to indicate whether the store is in the US or not

Objective:

  1. Analyze the data

  2. Fit a decision tree and report the test error

  3. Perform cross-validation to determine the optimal level of tree complexity and find the MSE

  4. Fit a random forest model and find the MSE

  5. Perform bagging and find the MSE

  6. Perform boosting and find the MSE

  7. Find the most important features and build the final model

  8. Compare the models



Loading Libraries

#Loading necessary libraries
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.4     v stringr 1.4.0
## v tidyr   1.1.3     v forcats 0.5.1
## v readr   2.0.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(caret)  
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
## The following object is masked from 'package:dplyr':
## 
##     combine
library(formattable)
library(pls) #PCR & PLS regression
## 
## Attaching package: 'pls'
## The following object is masked from 'package:caret':
## 
##     R2
## The following object is masked from 'package:stats':
## 
##     loadings
library(ISLR)
library(ggplot2)
library(tree)
## Registered S3 method overwritten by 'tree':
##   method     from
##   print.tree cli
library(gbm)
## Loaded gbm 2.1.8
1. Analyze the data

Load Data Set

attach(Carseats)

head(Carseats,2)
##   Sales CompPrice Income Advertising Population Price ShelveLoc Age Education
## 1  9.50       138     73          11        276   120       Bad  42        17
## 2 11.22       111     48          16        260    83      Good  65        10
##   Urban  US
## 1   Yes Yes
## 2   Yes Yes
#Structure of the data
str(Carseats)
## 'data.frame':    400 obs. of  11 variables:
##  $ Sales      : num  9.5 11.22 10.06 7.4 4.15 ...
##  $ CompPrice  : num  138 111 113 117 141 124 115 136 132 132 ...
##  $ Income     : num  73 48 35 100 64 113 105 81 110 113 ...
##  $ Advertising: num  11 16 10 4 3 13 0 15 0 0 ...
##  $ Population : num  276 260 269 466 340 501 45 425 108 131 ...
##  $ Price      : num  120 83 80 97 128 72 108 120 124 124 ...
##  $ ShelveLoc  : Factor w/ 3 levels "Bad","Good","Medium": 1 2 3 3 1 1 3 2 3 3 ...
##  $ Age        : num  42 65 59 55 38 78 71 67 76 76 ...
##  $ Education  : num  17 10 12 14 13 16 15 10 10 17 ...
##  $ Urban      : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 2 1 1 ...
##  $ US         : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 1 2 1 2 ...
#Check for missing values
sum(is.na(Carseats))
## [1] 0

There is no missing data in the dataset.

# The five locations with the lowest carseat sales
head(arrange(Carseats, Sales),5)
##     Sales CompPrice Income Advertising Population Price ShelveLoc Age Education
## 175  0.00       139     24           0        358   185    Medium  79        15
## 107  0.16       102     33           0        217   139    Medium  70        18
## 166  0.37       147     58           7        100   191       Bad  27        15
## 144  0.53       122     88           7         36   159       Bad  28        17
## 58   0.91        93     91           0         22   117       Bad  75        11
##     Urban  US
## 175    No  No
## 107    No  No
## 166   Yes Yes
## 144   Yes Yes
## 58    Yes  No
# The five locations with the highest carseat sales
head(arrange(Carseats, desc(Sales)),5)
##     Sales CompPrice Income Advertising Population Price ShelveLoc Age Education
## 377 16.27       141     60          19        319    92      Good  44        11
## 317 15.63       122     36           5        369    72      Good  35        10
## 26  14.90       139     32           0        176    82      Good  54        11
## 368 14.37        95    106           0        256    53      Good  52        17
## 19  13.91       110    110           0        408    68      Good  46        17
##     Urban  US
## 377   Yes Yes
## 317   Yes Yes
## 26     No  No
## 368   Yes  No
## 19     No Yes
# The locations with the highest advertising budgets
head(arrange(Carseats, desc(Advertising )))
##     Sales CompPrice Income Advertising Population Price ShelveLoc Age Education
## 311  9.53       175     65          29        419   166    Medium  53        12
## 306  8.03       115     29          26        394   132    Medium  33        13
## 290  8.75       143     77          25        448   156    Medium  43        17
## 99  12.49       122     77          24        382   127      Good  36        16
## 76   8.55        88    111          23        480    92       Bad  36        16
## 255  9.58       108    104          23        353   129      Good  37        17
##     Urban  US
## 311   Yes Yes
## 306   Yes Yes
## 290   Yes Yes
## 99     No Yes
## 76     No Yes
## 255   Yes Yes
# Car seat sales and Advertising by Shelve Location
ggplot(Carseats, aes(Sales, Advertising, colour = ShelveLoc)) +
  geom_point()

# Car seat sales vs Population by Shelve Location
ggplot(Carseats, aes(Sales, Population, colour = ShelveLoc)) +
  geom_point()

# Car seat sales by City type
ggplot(Carseats, aes(Urban, Sales)) +
  geom_boxplot(outlier.colour = "red", outlier.shape = 19)  

# Car seats Price by City type
ggplot(Carseats, aes(Urban,Price)) +
  geom_boxplot(outlier.colour = "red", outlier.shape = 19)  

# Car seat sales vs price, by US and non-US stores
ggplot(Carseats, aes(Sales, Price, colour = US)) +
  geom_point() +
  stat_smooth(method = "lm",fill = NA)
## `geom_smooth()` using formula 'y ~ x'

# Market value vs sales
# Note that market value here is calculated as the company's price minus the competitor's price
Market_Value <- Carseats$Price-Carseats$CompPrice
newdat <- data.frame(Carseats, Market_Value)
newdf <- arrange(newdat, desc(Sales))
mean_value<-mean(newdf$Market_Value)


ggplot(data = newdf, aes(x = Market_Value, y = Sales)) +
  ggtitle("Market Value Vs Sales") +
  geom_bar(stat = "identity", width = 0.6, fill = "steelblue") +
  annotate("segment", x = mean_value, xend = mean_value, y = -13, yend = 100, colour = "red") +
  annotate("text", x = mean_value, y = -15, label = round(mean_value, 2), colour = "red", vjust = 1, size = 2.6) +
  annotate("text", x = mean_value, y = 101, label = "Mean", colour = "red", vjust = 1, size = 3) +
  theme_minimal() 

2. Fit a Decision Tree

Data Partition

set.seed(111)
# Randomly assign each row to the training set (1) or test set (2) with 80/20 probabilities
ind <- sample(2,nrow(Carseats), replace=TRUE, prob = c(0.8,0.2))

train <- Carseats[ind==1,]
test <- Carseats[ind==2,]
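
The split above assigns each row to the training or test set at random, so the proportions are only approximately 80/20. As a side note, caret (already loaded) can draw an exact split; a small sketch with hypothetical object names that are not used in the rest of the analysis:

set.seed(111)
idx    <- createDataPartition(Carseats$Sales, p = 0.8, list = FALSE)  # exactly 80% of the rows
train2 <- Carseats[idx, ]    # alternative training set (illustration only)
test2  <- Carseats[-idx, ]   # alternative test set (illustration only)
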
#Decision Tree
library(tree)
tree.carseats <-tree(Sales~., data = train)

#Summary Table
summary(tree.carseats)
## 
## Regression tree:
## tree(formula = Sales ~ ., data = train)
## Variables actually used in tree construction:
## [1] "ShelveLoc"   "Price"       "CompPrice"   "Age"         "Population" 
## [6] "Advertising" "Income"     
## Number of terminal nodes:  18 
## Residual mean deviance:  2.597 = 779.1 / 300 
## Distribution of residuals:
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -4.56200 -1.08600 -0.08932  0.00000  1.19200  4.51400

The summary function also reports which variables are actually used in constructing the tree.

#Plot tree
plot(tree.carseats)
text(tree.carseats, pretty = 0)

#Prediction
pred.sales <- predict(tree.carseats, newdata = test)
#Test Error
tree_err <- mean((pred.sales - test$Sales)^2)
tree_err
## [1] 3.714836
3. Perform Cross-Validation and Find MSE
#Cross Validation Method
cv.carseats <- cv.tree(tree.carseats)

#Plot CV
plot(cv.carseats$size, cv.carseats$dev, xlab = "Size of Tree", ylab = "Deviance", type = "b")
tree.min <- which.min(cv.carseats$dev)
points(cv.carseats$size[tree.min], cv.carseats$dev[tree.min], col = "red", cex = 2, pch = 20)

#Pruning Method
prune.carseats <- prune.tree(tree.carseats, best = 6)
plot(prune.carseats)
text(prune.carseats, pretty = 0)

#Pruning the tree to 6 terminal nodes makes the plot much easier to read.

#Prediction
pred.sales2<- predict(prune.carseats, newdata = test)

#Test Error after pruning

cv_err<-mean((pred.sales2 - test$Sales)^2)
cv_err
## [1] 4.738796

However, pruning to 6 terminal nodes increases the test error.
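
The size of 6 terminal nodes above was chosen mainly for readability. A sketch of pruning instead to the size that minimizes the cross-validated deviance (the resulting test error depends on the CV run):

best_size <- cv.carseats$size[which.min(cv.carseats$dev)]   # CV-selected number of terminal nodes
prune.cv  <- prune.tree(tree.carseats, best = best_size)
mean((predict(prune.cv, newdata = test) - test$Sales)^2)    # test MSE at the CV-chosen size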

4. Perform Random Forest and Find MSE
# mtry: roughly the square root of the number of predictors (10)
m=round(sqrt(ncol(Carseats)-1))

rf.carseats <- randomForest(Sales ~ ., data = train,
                            mtry = m, ntree = 500, importance = TRUE)
yhat.rf <- predict(rf.carseats, newdata = test)

#Test Error
rf_err <- mean((yhat.rf - test$Sales)^2)
rf_err
## [1] 2.439138
importance(rf.carseats)
##                %IncMSE IncNodePurity
## CompPrice   16.7404564     235.46198
## Income       5.5674035     189.86313
## Advertising 17.1316753     221.04856
## Population   0.5630675     161.63041
## Price       45.3255688     603.65342
## ShelveLoc   51.4543796     642.41121
## Age         16.8224116     280.15512
## Education    1.7932076     102.42041
## Urban       -2.1603225      23.36524
## US           2.8066489      26.37619
# ShelveLoc and Price are the most important variables for predicting Sales
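
The value mtry = 3 above follows the square-root rule of thumb. A sketch of checking a small grid of mtry values against the test set (the numbers will vary with the random seed):

mtry_grid   <- 2:8
mse_by_mtry <- sapply(mtry_grid, function(m) {
  fit <- randomForest(Sales ~ ., data = train, mtry = m, ntree = 500)
  mean((predict(fit, newdata = test) - test$Sales)^2)   # test MSE for this mtry
})
data.frame(mtry = mtry_grid, test_MSE = round(mse_by_mtry, 3))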
5. Perform Bagging and Find MSE
# Total number of predictors; using all of them at each split makes the random forest equivalent to bagging
p=ncol(Carseats)-1 

# The number of trees I choose is 500
bag.carseats <- randomForest(Sales ~ ., data = train, 
                             mtry = p, ntree = 500, importance = TRUE)

yhat.bag <- predict(bag.carseats, newdata = test)

#Test Error
bag_err <- mean((yhat.bag - test$Sales)^2)
bag_err
## [1] 2.031078
6. Perform Boosting and Find MSE
dim(Carseats)
## [1] 400  11
# The number of trees I choose is 5000 
#Interaction depth limits the depth of each tree
boost.carseats = gbm(Sales~., data=train, distribution= "gaussian",
                     n.trees=5000, interaction.depth=4 )


yhat.boost <- predict(boost.carseats, newdata = test, n.trees=5000)

#Test Error
boost_err <- mean((yhat.boost - test$Sales)^2)
boost_err
## [1] 1.874585
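
The 5,000 trees and default shrinkage above were fixed by hand. A sketch of letting gbm pick the number of trees by internal cross-validation with a smaller shrinkage (the exact error will vary):

boost.cv  <- gbm(Sales ~ ., data = train, distribution = "gaussian",
                 n.trees = 5000, interaction.depth = 4,
                 shrinkage = 0.01, cv.folds = 5)
best.iter <- gbm.perf(boost.cv, method = "cv")   # CV-optimal number of trees
mean((predict(boost.cv, newdata = test, n.trees = best.iter) - test$Sales)^2)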
7. Find the Most Important Features and Build the Final Model
importance(bag.carseats)
##                %IncMSE IncNodePurity
## CompPrice   31.9242021     268.01916
## Income      12.0772967     144.21809
## Advertising 21.0065904     174.78007
## Population  -0.2151486      88.09067
## Price       69.9622675     723.69072
## ShelveLoc   76.7780005     863.07981
## Age         21.5749280     224.57575
## Education    1.0830188      60.03488
## Urban       -2.2458234      13.32689
## US           2.6382137      10.43693
varImpPlot(bag.carseats)

# ShelveLoc and Price are the most important variables, because removing them increases the MSE the most. 
# CompPrice, Advertising, and Age are also considerably more important than the remaining predictors.

Final Model

#Decision Tree Model with selected features
library(tree)
tree.final<-tree(Sales~ShelveLoc + Price + CompPrice + Advertising + Age, data = train)

#Plot tree
plot(tree.final)
text(tree.final, pretty = 0)

#Prediction
pred.sale <- predict(tree.final, newdata = test)
#Test Error
fin_err <-mean((pred.sale - test$Sales)^2)
fin_err
## [1] 3.806637
8. Compare the models
df <- tibble(
  id = 1:6,
  Model = c('Decision Tree', 'CV', 'Random Forest', 'Bagging', 'Boosting', 'Final'),
  MSE = c(tree_err, cv_err, rf_err, bag_err, boost_err, fin_err)
)

formattable(df, list(
                Model = formatter("span", 
                        style = ~ style(color = ifelse((MSE)<mean(MSE), "green", "red"))),
                area(col = c(MSE)) ~ normalize_bar("pink", 0.2)
                
    ))
id  Model          MSE
 1  Decision Tree  3.714836
 2  CV             4.738796
 3  Random Forest  2.439138
 4  Bagging        2.031078
 5  Boosting       1.874585
 6  Final          3.806637

The ensemble methods (random forest, bagging, and boosting) return the lowest test MSEs, with boosting performing best on this split.



References

  1. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4466856/
  2. James, G., Witten, D., Hastie, T., and Tibshirani, R. An Introduction to Statistical Learning. Springer, 2013.
  3. Hastie, T., Tibshirani, R., and Friedman, J. The Elements of Statistical Learning. Springer, 2001.



