Here is an example of random forests. We load the Carseats data from the “ISLR” package, and to fit random forests we use the “party” package.
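If either package is not yet installed, it can be installed first (a one-time step, shown here only for completeness):
install.packages(c("ISLR", "party"))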
require(ISLR)
require(party)
data(Carseats)
head(Carseats)
## Sales CompPrice Income Advertising Population Price ShelveLoc Age
## 1 9.50 138 73 11 276 120 Bad 42
## 2 11.22 111 48 16 260 83 Good 65
## 3 10.06 113 35 10 269 80 Medium 59
## 4 7.40 117 100 4 466 97 Medium 55
## 5 4.15 141 64 3 340 128 Bad 38
## 6 10.81 124 113 13 501 72 Bad 78
## Education Urban US
## 1 17 Yes Yes
## 2 10 Yes Yes
## 3 12 Yes Yes
## 4 14 Yes Yes
## 5 13 Yes No
## 6 16 No Yes
Since our response variable Sales is continuous, we convert it into a binary variable high.
high <- factor(ifelse(Carseats$Sales <= 8, "No", "Yes"))  # "No" if Sales <= 8, else "Yes"; factor() so ctree treats it as classification
Carseats <- data.frame(Carseats, high)
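As a quick sanity check (an extra step, output not shown here), we could look at how the stores split between the two classes:
table(Carseats$high)  # counts of "No" and "Yes" stores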
Since high is now the response variable, we remove the Sales variable from the data.
carseats <- Carseats[, c(12, 2:11)]
names(carseats)
## [1] "high" "CompPrice" "Income" "Advertising" "Population"
## [6] "Price" "ShelveLoc" "Age" "Education" "Urban"
## [11] "US"
dim(carseats)
## [1] 400 11
We divide the data into training and test sets, using 70% of the observations for training.
set.seed(500)
train <- sample(1:nrow(carseats), .7*nrow(carseats))
carseats.train <- carseats[train, ]
carseats.test <- carseats[-train, ]
dim(carseats.train)
## [1] 280 11
dim(carseats.test)
## [1] 120 11
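Since the split is random, it can be worth verifying (an optional check, output not shown) that the two classes appear in similar proportions in both sets:
prop.table(table(carseats.train$high))  # class proportions in the training set
prop.table(table(carseats.test$high))   # class proportions in the test set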
First, we build a single tree model; then we build a random forests model and see whether it improves the predictions.
carseats.tree <- ctree(high ~., data = carseats.train)
carseats.tree
##
## Conditional inference tree with 9 terminal nodes
##
## Response: high
## Inputs: CompPrice, Income, Advertising, Population, Price, ShelveLoc, Age, Education, Urban, US
## Number of observations: 280
##
## 1) ShelveLoc == {Good}; criterion = 1, statistic = 48.659
## 2) Price <= 134; criterion = 0.999, statistic = 15.62
## 3) US == {Yes}; criterion = 0.998, statistic = 13.559
## 4)* weights = 37
## 3) US == {No}
## 5)* weights = 14
## 2) Price > 134
## 6)* weights = 10
## 1) ShelveLoc == {Bad, Medium}
## 7) Price <= 87; criterion = 1, statistic = 31.944
## 8)* weights = 21
## 7) Price > 87
## 9) Advertising <= 13; criterion = 1, statistic = 20.076
## 10) Price <= 126; criterion = 0.987, statistic = 10.335
## 11) CompPrice <= 126; criterion = 0.999, statistic = 14.531
## 12)* weights = 64
## 11) CompPrice > 126
## 13) ShelveLoc == {Bad}; criterion = 0.977, statistic = 9.235
## 14)* weights = 8
## 13) ShelveLoc == {Medium}
## 15)* weights = 33
## 10) Price > 126
## 16)* weights = 61
## 9) Advertising > 13
## 17)* weights = 32
plot(carseats.tree)
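With 9 terminal nodes the default plot can get crowded; if needed, party also offers a more compact display (an optional variant of the same plot):
plot(carseats.tree, type = "simple")  # compact node labels instead of full terminal panels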
actuals <- carseats.test$high
predicted <- predict(carseats.tree, newdata = carseats.test)
table(true = actuals, pred = predicted)
## pred
## true No Yes
## No 51 20
## Yes 13 36
mean(carseats.test$high != predicted)
## [1] 0.275
Now we build the random forests model.
carseats.forests <- cforest(high ~., data = carseats.train)
carseats.forests
##
## Random Forest using Conditional Inference Trees
##
## Number of trees: 500
##
## Response: high
## Inputs: CompPrice, Income, Advertising, Population, Price, ShelveLoc, Age, Education, Urban, US
## Number of observations: 280
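The printout shows that cforest grew 500 trees with its default settings. If we wanted to change the number of trees or the number of candidate variables per split, the controls argument accepts a cforest_unbiased() specification; the sketch below is only an illustration (mtry = 3 is an arbitrary choice, and we keep using the default model carseats.forests for the predictions that follow):
carseats.forests2 <- cforest(high ~ ., data = carseats.train,
                             controls = cforest_unbiased(ntree = 500, mtry = 3))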
predicted <- predict(carseats.forests, newdata = carseats.test)
table(true = actuals, pred = predicted)
## pred
## true No Yes
## No 65 6
## Yes 14 35
mean(carseats.test$high != predicted)
## [1] 0.1666667
As we can see, with the random forests model the test misclassification rate came down from 27.5% to 16.7%. Hence, the random forests model improves prediction accuracy compared to the single tree model.
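Beyond the error rate, we could also ask which predictors the forest relies on most. The sketch below uses party's permutation-based variable importance; it can take a while on 500 trees, and the exact values depend on the random seed (output not shown):
vi <- varimp(carseats.forests)  # permutation importance of each predictor
sort(vi, decreasing = TRUE)     # predictors ordered from most to least important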