Introduction

We will introduce the rpart package, which builds regression and classification trees for analyzing and predicting data. According to its CRAN description, rpart performs recursive partitioning for classification, regression, and survival trees.


Content Overview

Specifically, we’ll explain how to use rpart to build decision trees and show some examples of how the package works. We will use the built-in iris dataset as a supplementary example.
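As a quick preview of that supplementary example, here is a minimal sketch (not part of the main walkthrough below) that fits and plots a classification tree on iris; the object name iris.ct is just an illustrative choice.

library(rpart)
library(rpart.plot)

# fit a classification tree predicting Species from the four measurements
iris.ct <- rpart(Species ~ ., data = iris, method = "class")

# plot the fitted tree
rpart.plot(iris.ct)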


Why You Should Care

Decision trees are among the most important predictive models used in statistics, data mining, and machine learning. The decision tree method can be applied to both regression and classification problems (see Wikipedia). The rpart package lets us build these models in R.


Learning Objectives

You’ll learn how to use the rpart package to build decision tree models and see what kinds of decision trees the package can produce and display.



Rpart Example

Here, we’ll show how to use the rpart package to analyze and predict data. We will split the data into training and validation sets, build the model on the training data, and then predict on the validation data.


Further Exposition

We will use a confusion matrix to check the results, and the accuracy, sensitivity, and specificity values it reports to evaluate the model.
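For reference, these three metrics can be computed directly from the four counts of a 2x2 confusion matrix. The sketch below uses made-up counts purely for illustration (tp, fn, fp, tn are placeholder names, not values from the examples that follow).

# placeholder counts for a 2x2 confusion matrix (illustration only)
tp <- 50;  fn <- 10    # positives predicted correctly / missed
tn <- 900; fp <- 40    # negatives predicted correctly / missed

accuracy    <- (tp + tn) / (tp + tn + fp + fn)  # overall fraction correct
sensitivity <- tp / (tp + fn)                   # true positive rate
specificity <- tn / (tn + fp)                   # true negative rate
c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity)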


Basic Example

We analyze the Default dataset from the ISLR package and build a simple decision tree. The rpart package provides the rpart() function for fitting trees, and the rpart.plot package provides rpart.plot() for displaying the result. CNblog has a great example using the Kaggle “UniversalBank” dataset; we build a decision tree on the Default dataset following a similar process.

library(rpart)       # recursive partitioning trees
library(ISLR)        # contains the Default dataset
library(rpart.plot)  # plotting functions for rpart trees


# split the data: 60% training, 40% validation
set.seed(123)
train.index=sample(1:dim(Default)[1],dim(Default)[1]*0.6)
train.df=Default[train.index,]
valid.df=Default[-train.index,]

# fit a classification tree with maximum depth 4 and plot it
default.ct=rpart(default~.,data=train.df,method="class",control=rpart.control(maxdepth=4))
rpart.plot(default.ct,type=1,extra=2,under=TRUE,split.font=1,varlen=-5)

You may find similar code in the CNblog post.
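Before growing a deeper tree, it can also be helpful to inspect the fitted model in text form. A minimal sketch, assuming default.ct has been fit as above:

# print the tree rules as text and show the complexity-parameter (cp) table
print(default.ct)
printcp(default.ct)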


Advanced Examples

We can grow a deeper decision tree with more nodes from the same data by setting cp = 0 and minsplit = 1.

# deeper tree: cp = 0 and minsplit = 1 let the tree keep splitting on the training data
deeper.ct=rpart(default~.,data=train.df,method="class",cp=0,minsplit=1)

# count the number of terminal (leaf) nodes
length(deeper.ct$frame$var[deeper.ct$frame$var=="<leaf>"])
## [1] 225

# plot the deeper tree, coloring leaf nodes blue and internal nodes red
prp(deeper.ct,type=1,extra=1,under=TRUE,split.font=1,varlen=-5,box.col=ifelse(deeper.ct$frame$var=="<leaf>",'blue','red'))

You may find the same code in the CNblog post.
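Because it was grown with cp = 0 and minsplit = 1, the deeper tree almost certainly overfits the training data. A common follow-up, shown here only as a hedged sketch rather than part of the original example, is to prune it back with prune() using the cp value that minimizes the cross-validated error (xerror) in the cp table; best.cp and pruned.ct are illustrative names.

# pick the cp value with the lowest cross-validated error
best.cp <- deeper.ct$cptable[which.min(deeper.ct$cptable[, "xerror"]), "CP"]

# prune the deeper tree back to that complexity and plot the result
pruned.ct <- prune(deeper.ct, cp = best.cp)
prp(pruned.ct, type = 1, extra = 1, under = TRUE, split.font = 1, varlen = -5)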


We predict default for the training data using the decision tree model default.ct and build a confusion matrix to check the result.

# prediction on the training data
library(caret)  # provides confusionMatrix()

# actual and predicted classes, both as factors
train.df.Loan=as.factor(train.df$default)
default.ct.pred.train=as.factor(predict(default.ct,train.df,type="class"))

# confusion matrix for the training data
confusionMatrix(default.ct.pred.train,train.df.Loan)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No  5768  120
##        Yes   33   79
##                                           
##                Accuracy : 0.9745          
##                  95% CI : (0.9702, 0.9783)
##     No Information Rate : 0.9668          
##     P-Value [Acc > NIR] : 0.0003338       
##                                           
##                   Kappa : 0.496           
##                                           
##  Mcnemar's Test P-Value : 3.584e-12       
##                                           
##             Sensitivity : 0.9943          
##             Specificity : 0.3970          
##          Pos Pred Value : 0.9796          
##          Neg Pred Value : 0.7054          
##              Prevalence : 0.9668          
##          Detection Rate : 0.9613          
##    Detection Prevalence : 0.9813          
##       Balanced Accuracy : 0.6956          
##                                           
##        'Positive' Class : No              
## 

You may find the same code in the CNblog post.


Since the accuracy on the training data is high (0.9745), we make predictions on the validation data and check its confusion matrix.

# actual and predicted classes for the validation data
valid.df.Loan=as.factor(valid.df$default)
default.ct.pred.valid=as.factor(predict(default.ct,valid.df,type="class"))

# confusion matrix for the validation data
confusionMatrix(default.ct.pred.valid,valid.df.Loan)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No  3844   89
##        Yes   22   45
##                                           
##                Accuracy : 0.9722          
##                  95% CI : (0.9667, 0.9771)
##     No Information Rate : 0.9665          
##     P-Value [Acc > NIR] : 0.02166         
##                                           
##                   Kappa : 0.4351          
##                                           
##  Mcnemar's Test P-Value : 3.742e-10       
##                                           
##             Sensitivity : 0.9943          
##             Specificity : 0.3358          
##          Pos Pred Value : 0.9774          
##          Neg Pred Value : 0.6716          
##              Prevalence : 0.9665          
##          Detection Rate : 0.9610          
##    Detection Prevalence : 0.9832          
##       Balanced Accuracy : 0.6651          
##                                           
##        'Positive' Class : No              
## 

You may find the same code in the CNblog post.
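Beyond hard class labels, predict() on an rpart classification tree can also return class probabilities (type = "prob"), which is useful if you want to experiment with a cutoff other than the default. A minimal sketch, assuming the objects above are still in the workspace; default.ct.prob.valid is an illustrative name.

# predicted probability of each class ("No"/"Yes") for the validation data
default.ct.prob.valid <- predict(default.ct, valid.df, type = "prob")
head(default.ct.prob.valid)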



Further Resources

Learn more about the rpart package, decision trees, and the ISLR Default dataset with the following:




Works Cited

This code through references and cites the following sources: