Growing a regression tree using anonymous company employee data available on GitHub.
cp <- read.csv("~/GitHub/R-Pubs-Projects/Data/company2.csv")
We will predict the employees’ salaries based on the following independent variables:
gender, education level, job, and job time
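Before growing the tree, it is worth a quick look at the data to confirm the columns loaded with the expected names and types:
str(cp)  # column names, types, and a few sample values
summary(cp$salary)  # the target variable; it must be numeric for a regression tree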
require(rpart)
## Loading required package: rpart
require(rpart.plot)
## Loading required package: rpart.plot
Create the training set and the test set
i1 <- sample(474, 237)  # randomly select 237 of the 474 row indices
cp_train <- cp[i1, 1:5]  # columns 1:5 hold salary and the four predictors
cp_test <- cp[-i1, 1:5]
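Because sample() is random, the split (and every number below) changes from run to run. A minimal sketch of a reproducible version of the same split, assuming an arbitrary seed is acceptable:
set.seed(1)  # arbitrary seed, fixed only for reproducibility
i1 <- sample(nrow(cp), nrow(cp) %/% 2)  # same 50/50 split without hard-coding 474
cp_train <- cp[i1, 1:5]
cp_test <- cp[-i1, 1:5]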
Grow the regression tree with the rpart() function. The method parameter must be set to “anova” for a regression tree.
The rpart() function has built-in cross-validation: by default it performs 10-fold cross-validation, which produces the xerror and xstd columns in the CP table below.
fit1 <- rpart(salary~., data = cp_train, method = "anova")
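The cross-validation and stopping behaviour can be tuned through rpart.control(). As a sketch, the call above is equivalent to spelling out the defaults explicitly: xval is the number of cross-validation folds, cp the complexity threshold for attempting a split, and minsplit the minimum node size eligible for splitting
fit1 <- rpart(salary ~ ., data = cp_train, method = "anova",
              control = rpart.control(xval = 10, cp = 0.01, minsplit = 20))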
Plot the tree with the prp() function
prp(fit1)
Plot the tree with the rpart.plot() function
rpart.plot(fit1)
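rpart.plot() also takes a number of cosmetic arguments; the values below are only one possible choice: type = 2 draws the split labels below the node boxes, extra = 101 shows the number and percentage of observations in each node, and under = TRUE places that text under the boxes
rpart.plot(fit1, type = 2, extra = 101, under = TRUE)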
Print the complexity parameter table
printcp(fit1)
##
## Regression tree:
## rpart(formula = salary ~ ., data = cp_train, method = "anova")
##
## Variables actually used in tree construction:
## [1] educ gender job
##
## Root node error: 6.1759e+10/237 = 260585007
##
## n= 237
##
## CP nsplit rel error xerror xstd
## 1 0.689880 0 1.00000 1.00891 0.165939
## 2 0.030626 1 0.31012 0.31307 0.049974
## 3 0.013541 2 0.27949 0.29572 0.053197
## 4 0.010345 3 0.26595 0.30872 0.060556
## 5 0.010000 4 0.25561 0.30353 0.060350
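The same information can be inspected graphically with plotcp(), which plots the cross-validated relative error (xerror) against the complexity parameter and tree size
plotcp(fit1)  # visual aid for choosing a cp value to prune with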
Compute the goodness-of-fit in the training set
pred1 <- predict(fit1, cp_train)
pred1[1:10]
## 19 191 75 16 36 268 134 184
## 32103.57 23993.49 23993.49 32103.57 23993.49 28672.67 23993.49 23993.49
## 160 252
## 23993.49 32103.57
mse1 <- sum((pred1 - cp_train$salary)^2)/237  # mean squared error (n = 237)
var1.y <- sum((cp_train$salary - mean(cp_train$salary))^2)/236  # sample variance of salary (n - 1)
rsq1 <- 1 - mse1/var1.y  # R-squared: share of salary variance explained by the tree
rsq1
## [1] 0.7454698
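Note that mse1 divides by n = 237 while var1.y divides by n - 1 = 236, so rsq1 differs very slightly from the textbook R-squared, 1 - SSE/SST. A sketch of the plain sum-of-squares version (rsq1_ss is an illustrative name):
rsq1_ss <- 1 - sum((pred1 - cp_train$salary)^2) /
  sum((cp_train$salary - mean(cp_train$salary))^2)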
Compute the goodness-of-fit in the TEST set
pred2 <- predict(fit1, cp_test)
mse2 <- sum((pred2 - cp_test$salary)^2)/237  # the test set also has 474 - 237 = 237 rows
var2.y <- sum((cp_test$salary - mean(cp_test$salary))^2)/236
rsq2 <- 1 - mse2/var2.y
rsq2
## [1] 0.7023751
Growing a classification tree
phone <- read.csv("~/Desktop/BUAN 6356/Data/phone.csv")
We will predict whether a customer will leave the company. The target variable is churn (1 = yes, 0 = no) and the predictors are tenure, age, income, education and number of family members
require(rpart)
require(rpart.plot)
Create the training set and the test set
i2 <- sample(1000, 500)  # randomly select 500 of the 1000 row indices
phone_train <- phone[i2,]
phone_test <- phone[-i2,]
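Because the target is categorical, it can be worth stratifying the split so that both halves keep the same churn rate. A minimal sketch, assuming churn is coded 0/1 and again fixing the seed only for reproducibility:
set.seed(1)
idx <- unlist(lapply(split(seq_len(nrow(phone)), phone$churn),
                     function(g) sample(g, length(g) %/% 2)))  # sample half of each churn class
phone_train <- phone[idx, ]
phone_test <- phone[-idx, ]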
Grow the classification tree with the rpart() function. The method parameter must be set to “class”
fit2 <- rpart(churn~., data = phone_train, method = "class")
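rpart() accepts a 0/1 numeric target when method = “class” is supplied explicitly, as above, but converting the target to a factor first is the more idiomatic style and makes the intent unambiguous. A purely illustrative equivalent (fit2f is a hypothetical name):
fit2f <- rpart(factor(churn) ~ ., data = phone_train, method = "class")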
Plot the tree
prp(fit2)
rpart.plot(fit2)
Print the CP table
printcp(fit2)
##
## Classification tree:
## rpart(formula = churn ~ ., data = phone_train, method = "class")
##
## Variables actually used in tree construction:
## [1] age educ income members tenure
##
## Root node error: 140/500 = 0.28
##
## n= 500
##
## CP nsplit rel error xerror xstd
## 1 0.082143 0 1.00000 1.00000 0.071714
## 2 0.021429 2 0.83571 0.83571 0.067621
## 3 0.019048 7 0.71429 0.85000 0.068018
## 4 0.010000 11 0.63571 0.82143 0.067215
Compute the predictive accuracy in the training set. The type parameter must be set to “class” so that predict() returns class labels rather than class probabilities
pred3 <- predict(fit2, phone_train, type = "class")
mean(pred3 == phone_train$churn)
## [1] 0.822
Compute the predictive accuracy in the test set
pred4 <- predict(fit2, phone_test, type = "class")
mean(pred4 == phone_test$churn)
## [1] 0.72
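An accuracy of 0.72 should be read against the base rate: the root node error above shows that 28% of the training customers churn, so always predicting “no churn” would already be right roughly 72% of the time. A confusion matrix makes the error structure visible (assuming churn is coded 0/1):
table(Predicted = pred4, Actual = phone_test$churn)  # rows are predictions, columns the true labels
1 - mean(phone_test$churn)  # majority-class baseline accuracy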
We will now prune the regression tree grown in the previous lecture, using the cost-complexity method, and compare its performance with that of the initial tree. First, recompute the goodness-of-fit of the initial tree in the training set
pred1 <- predict(fit1, cp_train)
pred1[1:10]
## 19 191 75 16 36 268 134 184
## 32103.57 23993.49 23993.49 32103.57 23993.49 28672.67 23993.49 23993.49
## 160 252
## 23993.49 32103.57
mse1 <- sum((pred1 - cp_train$salary)^2)/237
var1.y <- sum((cp_train$salary - mean(cp_train$salary))^2)/236
rsq1 <- 1 - mse1/var1.y
rsq1
## [1] 0.7454698
Print the complexity parameter table to identify the lowest cross-validation error (xerror)
printcp(fit1)
##
## Regression tree:
## rpart(formula = salary ~ ., data = cp_train, method = "anova")
##
## Variables actually used in tree construction:
## [1] educ gender job
##
## Root node error: 6.1759e+10/237 = 260585007
##
## n= 237
##
## CP nsplit rel error xerror xstd
## 1 0.689880 0 1.00000 1.00891 0.165939
## 2 0.030626 1 0.31012 0.31307 0.049974
## 3 0.013541 2 0.27949 0.29572 0.053197
## 4 0.010345 3 0.26595 0.30872 0.060556
## 5 0.010000 4 0.25561 0.30353 0.060350
To prune the tree we use the prune() function. It has two main arguments: the tree object and the complexity parameter (cp) value.
Extract the cp value corresponding to the lowest cross-validation error (xerror)
ocp <- fit1$cptable[which.min(fit1$cptable[,"xerror"]),"CP"]  # cp of the row with the smallest xerror
ocp
## [1] 0.01354073
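A popular alternative to taking the minimum is the one-standard-error rule: pick the simplest tree whose xerror lies within one xstd of the minimum. A minimal sketch (ocp_1se is an illustrative name):
xerr <- fit1$cptable[, "xerror"]
xstd <- fit1$cptable[, "xstd"]
thr <- min(xerr) + xstd[which.min(xerr)]  # one-standard-error threshold
ocp_1se <- fit1$cptable[which(xerr <= thr)[1], "CP"]  # first (simplest) tree under the threshold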
prfit <- prune(fit1, ocp)  # snip off the splits whose complexity falls below ocp
rpart.plot(prfit)
Compute the goodness-of-fit of the pruned tree in the TEST set
prpred <- predict(prfit, cp_test)
mse <- sum((prpred - cp_test$salary)^2)/237
var.y <- sum((cp_test$salary - mean(cp_test$salary))^2)/236
pr_rsq <- 1 - mse/var.y
pr_rsq
## [1] 0.6454298
The cp value can also be chosen by hand from the table printed above. Any value between the third and second CP entries (0.013541 and 0.030626) selects the same two-split tree, so pruning at 0.03 reproduces the tree obtained with ocp and yields the same test-set R-squared
prfit2 <- prune(fit1, 0.03)
rpart.plot(prfit2)
prpred2 <- predict(prfit2, cp_test)
mse <- sum((prpred2 - cp_test$salary)^2)/237
var.y <- sum((cp_test$salary - mean(cp_test$salary))^2)/236
rsq <- 1 - mse/var.y
rsq
## [1] 0.6454298