Decision Trees

Regression Trees

Growing a regression tree using anonymous company employee data available on GitHub.

cp <- read.csv("~/GitHub/R-Pubs-Projects/Data/company2.csv")

We will predict the employees’ salaries based on the following independent variables:

gender, education level, job, and job time
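
Before fitting, it is worth a quick look at how R has read the file (a minimal inspection sketch; nothing is assumed beyond the salary column used as the target below):

str(cp)            # check which variables were read as factors vs. numerics
summary(cp$salary) # distribution of the target variable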

require(rpart)
## Loading required package: rpart
require(rpart.plot)
## Loading required package: rpart.plot

Create the training set and the test set

i1 <- sample(474, 237) # randomly select 237 of the 474 rows for training

cp_train <- cp[i1, 1:5]

cp_test <- cp[-i1, 1:5]
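
Note that sample() draws a different random split on every run, so the numbers reported below reflect one particular split. For a reproducible split, one could fix the seed before sampling (a sketch; no seed was set in the original run):

set.seed(1) # any fixed value; call before sample() to reproduce the same split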

Grow the regression tree with the rpart() function. The method parameter must be set to “anova”

The rpart() function has built-in cross-validation; by default it performs 10-fold cross-validation

fit1 <- rpart(salary~., data = cp_train, method = "anova")
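
The number of cross-validation folds can be changed through the control argument (a sketch using the standard rpart.control() parameter xval, whose default is 10):

fit1b <- rpart(salary~., data = cp_train, method = "anova",
               control = rpart.control(xval = 5)) # 5-fold CV instead of 10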

Plot the tree with the prp() function

prp(fit1)

Plot the tree with the rpart.plot() function

rpart.plot(fit1)
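
rpart.plot() accepts several display options; for example (a sketch using two of its standard arguments):

# extra = 1 adds the number of observations in each node;
# digits raises the number of significant digits shown
rpart.plot(fit1, extra = 1, digits = 4)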

Print the complexity parameter table

printcp(fit1)
## 
## Regression tree:
## rpart(formula = salary ~ ., data = cp_train, method = "anova")
## 
## Variables actually used in tree construction:
## [1] educ   gender job   
## 
## Root node error: 6.1759e+10/237 = 260585007
## 
## n= 237 
## 
##         CP nsplit rel error  xerror     xstd
## 1 0.689880      0   1.00000 1.00891 0.165939
## 2 0.030626      1   0.31012 0.31307 0.049974
## 3 0.013541      2   0.27949 0.29572 0.053197
## 4 0.010345      3   0.26595 0.30872 0.060556
## 5 0.010000      4   0.25561 0.30353 0.060350
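
The same table can be inspected visually with plotcp(), which plots the cross-validation error against the complexity parameter:

plotcp(fit1)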

Compute the goodness-of-fit in the training set

pred1 <- predict(fit1, cp_train)

pred1[1:10]
##       19      191       75       16       36      268      134      184 
## 32103.57 23993.49 23993.49 32103.57 23993.49 28672.67 23993.49 23993.49 
##      160      252 
## 23993.49 32103.57
mse1 <- sum((pred1 - cp_train$salary)^2)/237 # mean squared error (n = 237)

var1.y <- sum((cp_train$salary - mean(cp_train$salary))^2)/236 # sample variance of salary (n - 1 = 236)

rsq1 <- 1 - mse1/var1.y

rsq1
## [1] 0.7454698
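
The same R-squared computation is repeated for the test set below; a small helper (hypothetical, not part of the original code) could wrap it:

# R-squared as 1 - MSE/variance, mirroring the manual steps above
rsq_fun <- function(pred, actual) {
  mse <- sum((pred - actual)^2)/length(actual)
  v <- sum((actual - mean(actual))^2)/(length(actual) - 1)
  1 - mse/v
}
rsq_fun(pred1, cp_train$salary) # should match rsq1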

Compute the goodness-of-fit in the TEST set

pred2 <- predict(fit1, cp_test)

mse2 <- sum((pred2 - cp_test$salary)^2)/237

var2.y <- sum((cp_test$salary - mean(cp_test$salary))^2)/236

rsq2 <- 1 - mse2/var2.y

rsq2
## [1] 0.7023751

Classification Tree

Growing a classification tree

phone <- read.csv("~/Desktop/BUAN 6356/Data/phone.csv")

We will predict whether a customer will abandon the company. The target variable is churn (1 = yes, 0 = no), and the predictors are tenure, age, income, education, and number of family members

require(rpart)

require(rpart.plot)

Create the training set and the test set

i2 <- sample(1000, 500) # randomly select 500 of the 1000 rows for training

phone_train <- phone[i2,]

phone_test <- phone[-i2,]

Grow the classification tree with the rpart() function. The method parameter must be set to “class”

fit2 <- rpart(churn~., data = phone_train, method = "class")

Plot the tree

prp(fit2)

rpart.plot(fit2)

Print the CP table

printcp(fit2)
## 
## Classification tree:
## rpart(formula = churn ~ ., data = phone_train, method = "class")
## 
## Variables actually used in tree construction:
## [1] age     educ    income  members tenure 
## 
## Root node error: 140/500 = 0.28
## 
## n= 500 
## 
##         CP nsplit rel error  xerror     xstd
## 1 0.082143      0   1.00000 1.00000 0.071714
## 2 0.021429      2   0.83571 0.83571 0.067621
## 3 0.019048      7   0.71429 0.85000 0.068018
## 4 0.010000     11   0.63571 0.82143 0.067215

Compute the predictive accuracy in the training set. The type parameter must be set to “class”

pred3 <- predict(fit2, phone_train, type = "class")

mean(pred3 == phone_train$churn)
## [1] 0.822
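
Accuracy alone can hide class-specific errors; a confusion matrix (a base-R sketch) shows where the misclassifications occur:

# rows: predicted class, columns: observed churn
table(predicted = pred3, actual = phone_train$churn)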

Compute the predictive accuracy in the test set

pred4 <- predict(fit2, phone_test, type = "class")

mean(pred4 == phone_test$churn)
## [1] 0.72
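
If class probabilities are needed rather than hard labels (for example, to rank customers by churn risk), predict() can return them with type = "prob":

prob4 <- predict(fit2, phone_test, type = "prob") # one column per class; rows sum to 1
prob4[1:5,]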

Pruning the Regression Tree

We will prune the regression tree grown above using the cost-complexity method. As a baseline, recall the goodness-of-fit of the initial (unpruned) tree in the training set

pred1 <- predict(fit1, cp_train)

pred1[1:10]
##       19      191       75       16       36      268      134      184 
## 32103.57 23993.49 23993.49 32103.57 23993.49 28672.67 23993.49 23993.49 
##      160      252 
## 23993.49 32103.57
mse1 <- sum((pred1 - cp_train$salary)^2)/237

var1.y <- sum((cp_train$salary - mean(cp_train$salary))^2)/236

rsq1 <- 1 - mse1/var1.y

rsq1
## [1] 0.7454698

Print the complexity parameter table to identify the lowest cross-validation error

printcp(fit1)
## 
## Regression tree:
## rpart(formula = salary ~ ., data = cp_train, method = "anova")
## 
## Variables actually used in tree construction:
## [1] educ   gender job   
## 
## Root node error: 6.1759e+10/237 = 260585007
## 
## n= 237 
## 
##         CP nsplit rel error  xerror     xstd
## 1 0.689880      0   1.00000 1.00891 0.165939
## 2 0.030626      1   0.31012 0.31307 0.049974
## 3 0.013541      2   0.27949 0.29572 0.053197
## 4 0.010345      3   0.26595 0.30872 0.060556
## 5 0.010000      4   0.25561 0.30353 0.060350

To prune the tree we use the prune() function. It has two main arguments: the fitted tree and the complexity parameter (cp) value

Extract the cp value corresponding to the lowest cross-validation error (xerror)

ocp <- fit1$cptable[which.min(fit1$cptable[,"xerror"]),"CP"]

ocp
## [1] 0.01354073
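
This picks the cp with the minimum cross-validation error. A common, slightly more conservative alternative (not used in the original code) is the one-standard-error rule: take the simplest tree whose xerror is within one xstd of that minimum. A sketch on the same cptable:

cptab <- fit1$cptable
best <- which.min(cptab[,"xerror"])
thresh <- cptab[best,"xerror"] + cptab[best,"xstd"] # minimum xerror plus 1 SE
# rows run from simplest to most complex, so take the first one under the threshold
ocp1se <- cptab[which(cptab[,"xerror"] <= thresh)[1],"CP"]
ocp1se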

Prune the tree

prfit <- prune(fit1, ocp)

rpart.plot(prfit)

Compute the prediction accuracy of the pruned tree in the test set

prpred <- predict(prfit, cp_test)

mse <- sum((prpred - cp_test$salary)^2)/237

var.y <- sum((cp_test$salary - mean(cp_test$salary))^2)/236

pr_rsq <- 1 - mse/var.y

pr_rsq
## [1] 0.6454298

Prune with a hand-picked cp value (here 0.03, which keeps only the first split according to the CP table) to see whether we can get good prediction accuracy with a less complex tree

printcp(fit1)
## 
## Regression tree:
## rpart(formula = salary ~ ., data = cp_train, method = "anova")
## 
## Variables actually used in tree construction:
## [1] educ   gender job   
## 
## Root node error: 6.1759e+10/237 = 260585007
## 
## n= 237 
## 
##         CP nsplit rel error  xerror     xstd
## 1 0.689880      0   1.00000 1.00891 0.165939
## 2 0.030626      1   0.31012 0.31307 0.049974
## 3 0.013541      2   0.27949 0.29572 0.053197
## 4 0.010345      3   0.26595 0.30872 0.060556
## 5 0.010000      4   0.25561 0.30353 0.060350
prfit2 <- prune(fit1, 0.03)

rpart.plot(prfit2)

Get the predicted values and compute the R-squared

prpred2 <- predict(prfit2, cp_test)

mse <- sum((prpred2 - cp_test$salary)^2)/237

var.y <- sum((cp_test$salary - mean(cp_test$salary))^2)/236

rsq <- 1 - mse/var.y

rsq
## [1] 0.6454298