Introduction

We will introduce the rpart package, which builds regression and classification trees for analyzing and predicting data. According to its CRAN description, rpart performs recursive partitioning for classification, regression, and survival trees.


Content Overview

Specifically, we’ll explain how to use rpart to build decision trees and show some examples of how the package works. We will use the built-in iris dataset as a supplementary example.
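As a quick preview of that supplementary example, here is a minimal sketch (not part of the main walkthrough below) that fits and plots a classification tree on iris; the object name iris.ct is just an illustrative choice.

library(rpart)
library(rpart.plot)

# fit a classification tree predicting Species from the four measurements
iris.ct <- rpart(Species ~ ., data = iris, method = "class")

# plot the fitted tree
rpart.plot(iris.ct)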


Why You Should Care

Decision trees are among the most important predictive models used in statistics, data mining, and machine learning. The decision tree method can be applied to both regression and classification problems (see Wikipedia). The rpart package lets us build these models in R.


Learning Objectives

You’ll learn how to use the rpart package to build decision tree models and see what kinds of decision trees the package can produce and display.



Rpart Example

Here, we’ll show how to use the rpart package to analyze and predict data. We will split the data into training and validation sets, build the model on the training data, and then predict on the validation data.


Further Exposition

We will use a confusion matrix to check the results, and the accuracy, sensitivity, and specificity values it reports to evaluate the model.
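For reference, these three metrics can be computed directly from the four counts of a 2x2 confusion matrix. The sketch below uses made-up counts purely for illustration (tp, fn, fp, tn are placeholder names, not values from the examples that follow).

# placeholder counts for a 2x2 confusion matrix (illustration only)
tp <- 50;  fn <- 10    # positives predicted correctly / missed
tn <- 900; fp <- 40    # negatives predicted correctly / missed

accuracy    <- (tp + tn) / (tp + tn + fp + fn)  # overall fraction correct
sensitivity <- tp / (tp + fn)                   # true positive rate
specificity <- tn / (tn + fp)                   # true negative rate
c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity)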


Basic Example

We analyze the Default dataset from the ISLR package and build a simple decision tree. The rpart package provides the rpart() function for fitting trees, and the rpart.plot package provides rpart.plot() for displaying the result. CNblog has a great example using the Kaggle “UniversalBank” dataset; we build a decision tree on the Default dataset following a similar process.

library(rpart)       # recursive partitioning trees
library(ISLR)        # contains the Default dataset
library(rpart.plot)  # plotting functions for rpart trees


# split the data: 60% training, 40% validation
set.seed(123)
train.index=sample(1:dim(Default)[1],dim(Default)[1]*0.6)
train.df=Default[train.index,]
valid.df=Default[-train.index,]

# fit a classification tree with maximum depth 4 and plot it
default.ct=rpart(default~.,data=train.df,method="class",control=rpart.control(maxdepth=4))
rpart.plot(default.ct,type=1,extra=2,under=TRUE,split.font=1,varlen=-5)

You may find similar code in the CNblog post.
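Before growing a deeper tree, it can also be helpful to inspect the fitted model in text form. A minimal sketch, assuming default.ct has been fit as above:

# print the tree rules as text and show the complexity-parameter (cp) table
print(default.ct)
printcp(default.ct)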


Advanced Examples

We can grow a deeper decision tree with more nodes from the same data by setting cp = 0 and minsplit = 1.

# deeper tree: cp = 0 and minsplit = 1 let the tree keep splitting on the training data
deeper.ct=rpart(default~.,data=train.df,method="class",cp=0,minsplit=1)

# count the number of terminal (leaf) nodes
length(deeper.ct$frame$var[deeper.ct$frame$var=="<leaf>"])
## [1] 225

# plot the deeper tree, coloring leaf nodes blue and internal nodes red
prp(deeper.ct,type=1,extra=1,under=TRUE,split.font=1,varlen=-5,box.col=ifelse(deeper.ct$frame$var=="<leaf>",'blue','red'))

You may find the same code in the CNblog post.
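Because it was grown with cp = 0 and minsplit = 1, the deeper tree almost certainly overfits the training data. A common follow-up, shown here only as a hedged sketch rather than part of the original example, is to prune it back with prune() using the cp value that minimizes the cross-validated error (xerror) in the cp table; best.cp and pruned.ct are illustrative names.

# pick the cp value with the lowest cross-validated error
best.cp <- deeper.ct$cptable[which.min(deeper.ct$cptable[, "xerror"]), "CP"]

# prune the deeper tree back to that complexity and plot the result
pruned.ct <- prune(deeper.ct, cp = best.cp)
prp(pruned.ct, type = 1, extra = 1, under = TRUE, split.font = 1, varlen = -5)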


We predict default for the training data using the decision tree model default.ct and build a confusion matrix to check the result.

# prediction on the training data
library(caret)  # provides confusionMatrix()

# actual and predicted classes, both as factors
train.df.Loan=as.factor(train.df$default)
default.ct.pred.train=as.factor(predict(default.ct,train.df,type="class"))

# confusion matrix for the training data
confusionMatrix(default.ct.pred.train,train.df.Loan)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No  5768  120
##        Yes   33   79
##                                           
##                Accuracy : 0.9745          
##                  95% CI : (0.9702, 0.9783)
##     No Information Rate : 0.9668          
##     P-Value [Acc > NIR] : 0.0003338       
##                                           
##                   Kappa : 0.496           
##                                           
##  Mcnemar's Test P-Value : 3.584e-12       
##                                           
##             Sensitivity : 0.9943          
##             Specificity : 0.3970          
##          Pos Pred Value : 0.9796          
##          Neg Pred Value : 0.7054          
##              Prevalence : 0.9668          
##          Detection Rate : 0.9613          
##    Detection Prevalence : 0.9813          
##       Balanced Accuracy : 0.6956          
##                                           
##        'Positive' Class : No              
## 

You may find the same code in the CNblog post.


Since the accuracy on the training data is high (0.9745), we make predictions on the validation data and check its confusion matrix.

# actual and predicted classes for the validation data
valid.df.Loan=as.factor(valid.df$default)
default.ct.pred.valid=as.factor(predict(default.ct,valid.df,type="class"))

# confusion matrix for the validation data
confusionMatrix(default.ct.pred.valid,valid.df.Loan)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No  3844   89
##        Yes   22   45
##                                           
##                Accuracy : 0.9722          
##                  95% CI : (0.9667, 0.9771)
##     No Information Rate : 0.9665          
##     P-Value [Acc > NIR] : 0.02166         
##                                           
##                   Kappa : 0.4351          
##                                           
##  Mcnemar's Test P-Value : 3.742e-10       
##                                           
##             Sensitivity : 0.9943          
##             Specificity : 0.3358          
##          Pos Pred Value : 0.9774          
##          Neg Pred Value : 0.6716          
##              Prevalence : 0.9665          
##          Detection Rate : 0.9610          
##    Detection Prevalence : 0.9832          
##       Balanced Accuracy : 0.6651          
##                                           
##        'Positive' Class : No              
## 

You may find the same code in the CNblog post.
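Beyond hard class labels, predict() on an rpart classification tree can also return class probabilities (type = "prob"), which is useful if you want to experiment with a cutoff other than the default. A minimal sketch, assuming the objects above are still in the workspace; default.ct.prob.valid is an illustrative name.

# predicted probability of each class ("No"/"Yes") for the validation data
default.ct.prob.valid <- predict(default.ct, valid.df, type = "prob")
head(default.ct.prob.valid)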



Further Resources

Learn more about the rpart package, decision trees, and the ISLR Default dataset with the following:




Works Cited

This code through references and cites the following sources: