We will introduce the rpart package, which builds regression trees for analyzing and predicting data. According to its CRAN page, rpart implements recursive partitioning for classification, regression, and survival trees. Specifically, we’ll explain how to use rpart to build regression trees and walk through examples that illustrate the package, with the built-in iris dataset as a supplement (see the sketch below).
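As a quick supplement before the classification examples, here is a minimal sketch of a regression tree on the built-in iris data; the outcome (Petal.Length) is numeric, so we pass method = "anova". The formula and plot settings here are illustrative choices, not the only way to do it.
library(rpart)
library(rpart.plot)

# regression tree: predict the numeric Petal.Length from the other columns
iris.rt <- rpart(Petal.Length ~ ., data = iris, method = "anova")
rpart.plot(iris.rt)           # each leaf shows a predicted mean Petal.Length
predict(iris.rt, head(iris))  # numeric predictions for the first six rows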
A decision tree is one of the most important predictive models used in statistics, data mining, and machine learning. Decision tree methods can be applied to both regression and classification problems (Wikipedia). We can use the rpart package to build decision tree models.
You’ll learn how to use rpart to build a decision tree model and see what kinds of trees the package can produce.
Here, we’ll show how to use rpart to analyze and predict data. We will split the data into a training set and a validation set, build the model on the training data, and then predict on the validation data.
We will use a confusion matrix to check the result, evaluating the model with its accuracy, sensitivity, and specificity values.
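With “No” as the positive class (the convention caret uses in the output below), all three metrics come straight from the 2×2 table. A minimal sketch with made-up counts:
# toy 2x2 confusion matrix (counts are invented for illustration)
cm <- matrix(c(90, 5,    # predicted No:  90 actual No, 5 actual Yes
               10, 45),  # predicted Yes: 10 actual No, 45 actual Yes
             nrow = 2, byrow = TRUE,
             dimnames = list(Prediction = c("No", "Yes"),
                             Reference  = c("No", "Yes")))
accuracy    <- sum(diag(cm)) / sum(cm)              # (90 + 45) / 150 = 0.9
sensitivity <- cm["No", "No"]   / sum(cm[, "No"])   # 90 / 100 = 0.9
specificity <- cm["Yes", "Yes"] / sum(cm[, "Yes"])  # 45 / 50  = 0.9
c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity)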
We analyze the Default dataset from ISLR and build a simple decision tree. The rpart package provides the rpart() function for fitting trees, and the rpart.plot package provides rpart.plot() for displaying them. A CNblog post has a great example using the Kaggle dataset “UniversalBank”; we will build a decision tree from the Default dataset in the ISLR package following a similar process.
library(rpart)
library(ISLR)
library(rpart.plot)

set.seed(123)
# 60/40 split into training and validation sets
train.index <- sample(1:nrow(Default), nrow(Default) * 0.6)
train.df <- Default[train.index, ]
valid.df <- Default[-train.index, ]

# classification tree, limited to four levels of splits
default.ct <- rpart(default ~ ., data = train.df, method = "class",
                    control = rpart.control(maxdepth = 4))
rpart.plot(default.ct, type = 1, extra = 2, under = TRUE,
           split.font = 1, varlen = -5)
You may find similar code in the CNblog post (Source IV).
We can grow a deeper tree, with many more nodes, from the same data by turning off the complexity penalty and the minimum-split restriction.
# deeper tree: cp = 0 removes the complexity penalty and minsplit = 1
# allows splits down to single observations, so the tree grows very large
deeper.ct <- rpart(default ~ ., data = train.df, method = "class", cp = 0, minsplit = 1)
# count the terminal nodes (leaves)
length(deeper.ct$frame$var[deeper.ct$frame$var == "<leaf>"])
## [1] 225
prp(deeper.ct, type = 1, extra = 1, under = TRUE, split.font = 1, varlen = -5,
    box.col = ifelse(deeper.ct$frame$var == "<leaf>", 'blue', 'red'))
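A tree with 225 leaves is almost certainly overfitting. One common follow-up, sketched here, is to use rpart’s built-in cross-validation results to prune back: the fitted object’s cptable stores a cross-validated error (xerror) for each complexity-parameter value, and prune() cuts the tree at a chosen cp.
# pick the cp with the lowest cross-validated error and prune to it
printcp(deeper.ct)  # cp table: CP, nsplit, rel error, xerror, xstd
best.cp <- deeper.ct$cptable[which.min(deeper.ct$cptable[, "xerror"]), "CP"]
pruned.ct <- prune(deeper.ct, cp = best.cp)
# the pruned tree typically has far fewer leaves than the full 225
length(pruned.ct$frame$var[pruned.ct$frame$var == "<leaf>"])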
We predict default for the training data using the decision tree model fitted above (default.ct) and build a confusion matrix to check the result.
# prediction on the training data
library(caret)
train.df.Loan <- as.factor(train.df$default)
default.ct.pred.train <- as.factor(predict(default.ct, train.df, type = "class"))
confusionMatrix(default.ct.pred.train, train.df.Loan)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 5768 120
## Yes 33 79
##
## Accuracy : 0.9745
## 95% CI : (0.9702, 0.9783)
## No Information Rate : 0.9668
## P-Value [Acc > NIR] : 0.0003338
##
## Kappa : 0.496
##
## Mcnemar's Test P-Value : 3.584e-12
##
## Sensitivity : 0.9943
## Specificity : 0.3970
## Pos Pred Value : 0.9796
## Neg Pred Value : 0.7054
## Prevalence : 0.9668
## Detection Rate : 0.9613
## Detection Prevalence : 0.9813
## Balanced Accuracy : 0.6956
##
## 'Positive' Class : No
##
Since the accuracy on the training data is high (0.9745), we now predict on the validation data and check its confusion matrix.
valid.df.Loan <- as.factor(valid.df$default)
default.ct.pred.valid <- as.factor(predict(default.ct, valid.df, type = "class"))
confusionMatrix(default.ct.pred.valid, valid.df.Loan)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 3844 89
## Yes 22 45
##
## Accuracy : 0.9722
## 95% CI : (0.9667, 0.9771)
## No Information Rate : 0.9665
## P-Value [Acc > NIR] : 0.02166
##
## Kappa : 0.4351
##
## Mcnemar's Test P-Value : 3.742e-10
##
## Sensitivity : 0.9943
## Specificity : 0.3358
## Pos Pred Value : 0.9774
## Neg Pred Value : 0.6716
## Prevalence : 0.9665
## Detection Rate : 0.9610
## Detection Prevalence : 0.9832
## Balanced Accuracy : 0.6651
##
## 'Positive' Class : No
##
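The specificity on the validation data (0.3358) is low: the model misses most actual defaulters. One way to trade some accuracy for specificity, sketched below, is to work with class probabilities instead of hard labels; predict() with type = "prob" returns them, and the 0.2 cutoff here is purely illustrative.
# class probabilities for the validation data
default.prob <- predict(default.ct, valid.df, type = "prob")[, "Yes"]
# flag a default whenever P(Yes) > 0.2 instead of the implicit 0.5
pred.02 <- factor(ifelse(default.prob > 0.2, "Yes", "No"), levels = c("No", "Yes"))
table(Prediction = pred.02, Reference = valid.df.Loan)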
Learn more about the rpart package and decision trees with the following:
Resource I: rpart on CRAN
Resource II: rpart documentation
This code through references and cites the following sources:
Rebecca C. Steorts, Duke University (2017). Source I. Tree-Based Methods: Regression Trees.
Wikipedia (2020). Source II. Decision Tree Learning.
ISLR (2017). Source III. ISLR Manual.
CNblog (2019). Source IV. CART Classification Tree (CART分类树).