The data used in this project can be found at the UCI Machine Learning Repository.
The data contains 16 variables, a mix of character and numeric vectors.
Please note that all attribute names and values have been changed to meaningless symbols to protect the confidentiality of the data.
I am going to use a classification tree to predict credit approval.
Let’s load the packages needed for this project.
library(readr)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
library(rattle)
## Rattle: A free graphical interface for data mining with R.
## Version 4.1.0 Copyright (c) 2006-2015 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
Now let’s store the URL of the data file in R.
url <- "http://archive.ics.uci.edu/ml/machine-learning-databases/credit-screening/crx.data"
rawData <- read_csv(url, col_names = FALSE, na = "?")
I just read the data into R using the readr package. The file has no header row, so col_names is set to FALSE. The question mark symbols represent missing values in the data set, so they are mapped to NA through the na argument.
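As a rough illustration of what the na argument does, here is a base R sketch on a two-row inline sample. The sample values and the names sampleText and toy are made up for the example, and read.csv stands in for read_csv only so that the snippet runs without a download.

```r
# Two made-up rows in the style of crx.data; "?" marks a missing value.
sampleText <- "b,30.83\na,?"

# header = FALSE because there is no header row; na.strings maps "?" to NA.
toy <- read.csv(text = sampleText, header = FALSE,
                na.strings = "?", stringsAsFactors = FALSE)

str(toy)            # V2 is parsed as numeric because "?" became NA
sum(is.na(toy$V2))  # number of missing values in the second column
```

Without the missing-value mapping, the "?" would force the whole column to be read as text.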
The next step after loading the data is to take a look at it.
str(rawData)
## Classes 'tbl_df', 'tbl' and 'data.frame': 690 obs. of 16 variables:
## $ X1 : chr "b" "a" "a" "b" ...
## $ X2 : num 30.8 58.7 24.5 27.8 20.2 ...
## $ X3 : num 0 4.46 0.5 1.54 5.62 ...
## $ X4 : chr "u" "u" "u" "u" ...
## $ X5 : chr "g" "g" "g" "g" ...
## $ X6 : chr "w" "q" "q" "w" ...
## $ X7 : chr "v" "h" "h" "v" ...
## $ X8 : num 1.25 3.04 1.5 3.75 1.71 ...
## $ X9 : chr "t" "t" "t" "t" ...
## $ X10: chr "t" "t" "f" "t" ...
## $ X11: chr "01" "06" "0" "05" ...
## $ X12: chr "f" "f" "f" "t" ...
## $ X13: chr "g" "g" "g" "g" ...
## $ X14: chr "00202" "00043" "00280" "00100" ...
## $ X15: int 0 560 824 3 0 0 31285 1349 314 1442 ...
## $ X16: chr "+" "+" "+" "+" ...
summary(rawData)
##       X1                 X2              X3               X4
## Length:690         Min.   :13.75   Min.   : 0.000   Length:690
## Class :character   1st Qu.:22.60   1st Qu.: 1.000   Class :character
## Mode  :character   Median :28.46   Median : 2.750   Mode  :character
##                    Mean   :31.57   Mean   : 4.759
##                    3rd Qu.:38.23   3rd Qu.: 7.207
##                    Max.   :80.25   Max.   :28.000
##                    NA's   :12
##
##       X5                 X6                 X7                 X8
## Length:690         Length:690         Length:690         Min.   : 0.000
## Class :character   Class :character   Class :character   1st Qu.: 0.165
## Mode  :character   Mode  :character   Mode  :character   Median : 1.000
##                                                          Mean   : 2.223
##                                                          3rd Qu.: 2.625
##                                                          Max.   :28.500
##
##       X9                 X10                X11
## Length:690         Length:690         Length:690
## Class :character   Class :character   Class :character
## Mode  :character   Mode  :character   Mode  :character
##
##       X12                X13                X14
## Length:690         Length:690         Length:690
## Class :character   Class :character   Class :character
## Mode  :character   Mode  :character   Mode  :character
##
##       X15                X16
## Min.   :     0.0   Length:690
## 1st Qu.:     0.0   Class :character
## Median :     5.0   Mode  :character
## Mean   :  1017.4
## 3rd Qu.:   395.5
## Max.   :100000.0
head(rawData)
##   X1    X2    X3 X4 X5 X6 X7   X8 X9 X10 X11 X12 X13   X14 X15 X16
## 1  b 30.83 0.000  u  g  w  v 1.25  t   t  01   f   g 00202   0   +
## 2  a 58.67 4.460  u  g  q  h 3.04  t   t  06   f   g 00043 560   +
## 3  a 24.50 0.500  u  g  q  h 1.50  t   f   0   f   g 00280 824   +
## 4  b 27.83 1.540  u  g  w  v 3.75  t   t  05   t   g 00100   3   +
## 5  b 20.17 5.625  u  g  w  v 1.71  t   f   0   f   s 00120   0   +
## 6  b 32.08 4.000  u  g  m  v 2.50  t   f   0   t   g 00360   0   +
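We discovered that the variable X14 is represented as a character vector instead of an integer. This usually happens when a non-numeric string, such as a missing-value symbol, appears in the column. Here the question marks were already mapped to NA on import, so the zero-padded values (for example "00202") are the more likely cause.
Let’s convert it to an integer vector.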
rawData$X14 <- as.integer(rawData$X14)
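Before converting, it can be worth checking that every non-missing entry really is a whole number, since as.integer silently turns anything else into NA (with a warning). A minimal sketch on made-up zero-padded codes in the style of X14 (the vector codes is hypothetical, not taken from the data):

```r
# Made-up zero-padded codes in the style of X14, with one missing value.
codes <- c("00202", "00043", NA, "00280")

# Flag entries that are present but consist of anything other than digits.
notNumeric <- !is.na(codes) & !grepl("^[0-9]+$", codes)
any(notNumeric)

# The conversion drops the leading zeros and keeps NA as NA.
codesInt <- as.integer(codes)
codesInt
```

On the real data the same check would be run on rawData$X14 before the conversion.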
We also noticed that our data contains some missing values (for example, 12 in X2). We need not worry about that here, since classification trees fitted with rpart can handle missing predictor values through surrogate splits.
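To see how much is actually missing, the NA values can be counted per column. The sketch below uses a small made-up data frame (toy) so that it runs on its own; on the real data the same calls would be applied to rawData.

```r
# Small made-up data frame with a few missing entries.
toy <- data.frame(
  X1 = c("b", "a", NA, "b"),
  X2 = c(30.83, NA, 24.50, NA),
  X3 = c(0, 4.46, 0.5, 1.54),
  stringsAsFactors = FALSE
)

# Number of missing values in each column.
naPerColumn <- colSums(is.na(toy))
naPerColumn

# Number of rows with at least one missing value.
incompleteRows <- sum(!complete.cases(toy))
incompleteRows
```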
Let’s give the class variable readable labels and convert it to a factor.
rawData$X16 <- ifelse(rawData$X16 == "+", "approved", "disapproved")
rawData$X16 <- factor(rawData$X16)
We now examine the distribution of the class variable to check that it is reasonably balanced.
ggplot(rawData, aes(x = X16)) + geom_bar()
The classes appear to be roughly balanced according to the bar plot, so there is no need for subsampling.
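The balance can also be checked numerically rather than by eye. A minimal sketch, using a made-up factor (approval, with invented counts of 45 and 55) in place of rawData$X16:

```r
# Made-up class labels; the counts are for illustration only.
approval <- factor(rep(c("approved", "disapproved"), times = c(45, 55)))

# Share of each class.
classShare <- prop.table(table(approval))
round(classShare, 2)
```

On the real data, prop.table(table(rawData$X16)) gives the corresponding shares.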
Now, let’s extract the class variable from rawData.
class <- rawData$X16
Let’s extract our predictor variables from rawData
predictors <- rawData[, -16]
Since our predictors contain character vectors, let’s use the dummyVars function in the caret package to create dummy-variable encodings for them.
dmy <- dummyVars(~ ., data = predictors)
predictors <- predict(dmy, predictors)
I just generated dummy variables for the full set of predictors, before splitting the data.
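To make the encoding concrete, here is a small sketch of dummyVars on a made-up data frame (toy) with one categorical and one numeric column. The categorical column is stored as a factor here, which is an assumption of the example rather than what the original data uses.

```r
library(caret)

# Made-up predictors: one two-level categorical column, one numeric column.
toy <- data.frame(
  X1 = c("a", "b", "a"),
  X2 = c(30.8, 58.7, 24.5),
  stringsAsFactors = TRUE
)

# Learn the encoding, then apply it to get a numeric matrix.
dmyToy  <- dummyVars(~ ., data = toy)
encoded <- predict(dmyToy, newdata = toy)

dim(encoded)       # one indicator column per level of X1, plus X2
colnames(encoded)
```

Each level of a categorical column becomes its own 0/1 column, which is how 15 predictors expand to 68 columns later on.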
We are going to split our data into a training set and a validation set in order to test our model on the validation set.
To create stratified random splits of the data (based on the classes), the createDataPartition function in the caret package can be used.
indx <- createDataPartition(class, p = .70, list = FALSE)
trainX <- predictors[indx, ]
testX <- predictors[-indx, ]
trainY <- class[indx]
testY <- class[-indx]
Check the dimensions of the training and test sets.
dim(trainX)
## [1] 484 68
length(trainY)
## [1] 484
dim(testX)
## [1] 206 68
length(testY)
## [1] 206
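Besides the dimensions, it is worth confirming that the split kept the class proportions similar in both sets. The sketch below does this on a made-up outcome vector (y, with invented counts). The set.seed call is an addition of this example to make the split reproducible, and the seed value is arbitrary.

```r
library(caret)

set.seed(123)  # assumption: any fixed seed, used only for reproducibility

# Made-up outcome with 45 "approved" and 55 "disapproved" cases.
y <- factor(rep(c("approved", "disapproved"), times = c(45, 55)))

# Stratified 70/30 split, as in the text.
idx <- createDataPartition(y, p = 0.70, list = FALSE)

# Class shares in each part should be close to each other.
trainProp <- prop.table(table(y[idx]))
testProp  <- prop.table(table(y[-idx]))
rbind(train = trainProp, test = testProp)
```

On the real data the same comparison would use trainY and testY.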
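We now tune our classification tree model over the complexity parameter (cp), using 10-fold cross-validation and the area under the ROC curve as the selection metric.
# 10-fold CV; twoClassSummary reports ROC, Sens and Spec and needs class probabilities
ctrl <- trainControl(method = "cv", number = 10,
                     summaryFunction = twoClassSummary,
                     classProbs = TRUE,
                     savePredictions = TRUE)
# Fit rpart trees over 15 candidate cp values and keep the one with the highest ROC
treeModel <- train(x = trainX, y = trainY,
                   method = "rpart",
                   trControl = ctrl,
                   metric = "ROC",
                   tuneLength = 15)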
## Loading required package: rpart
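With 10-fold cross-validation on 484 training rows, each model is fitted on roughly nine tenths of the data and evaluated on the remaining tenth. The arithmetic can be sketched as follows. This assumes evenly sized folds, which only approximates caret's stratified fold assignment.

```r
n <- 484  # training rows
k <- 10   # number of folds

# Spread n rows over k folds as evenly as possible.
foldSizes <- rep(n %/% k, k) + (seq_len(k) <= n %% k)
foldSizes

# Rows left for fitting when each fold is held out in turn.
trainSizes <- n - foldSizes
trainSizes
```

These fitting sizes are the kind of values caret reports under "Summary of sample sizes".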
Now let’s check the results of our model.
treeModel
## CART
##
## 484 samples
## 68 predictor
## 2 classes: 'approved', 'disapproved'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 435, 435, 436, 436, 435, 436, ...
## Resampling results across tuning parameters:
##
## cp ROC Sens Spec
## 0.00000000 0.8958406 0.8285714 0.8659544
## 0.04651163 0.8527685 0.9207792 0.7847578
## 0.09302326 0.8527685 0.9207792 0.7847578
## 0.13953488 0.8527685 0.9207792 0.7847578
## 0.18604651 0.8527685 0.9207792 0.7847578
## 0.23255814 0.8527685 0.9207792 0.7847578
## 0.27906977 0.8527685 0.9207792 0.7847578
## 0.32558140 0.8527685 0.9207792 0.7847578
## 0.37209302 0.8527685 0.9207792 0.7847578
## 0.41860465 0.8527685 0.9207792 0.7847578
## 0.46511628 0.8527685 0.9207792 0.7847578
## 0.51162791 0.8527685 0.9207792 0.7847578
## 0.55813953 0.8527685 0.9207792 0.7847578
## 0.60465116 0.8527685 0.9207792 0.7847578
## 0.65116279 0.6237614 0.3623377 0.8851852
##
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.
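The selection rule in the last two lines of the output is simply "take the row with the largest ROC". A minimal sketch of that rule, using three rows copied from the table above (the data frame name results is chosen to mirror treeModel$results, where caret keeps the full table):

```r
# Three rows copied from the resampling table above.
results <- data.frame(
  cp  = c(0.00000000, 0.04651163, 0.65116279),
  ROC = c(0.8958406, 0.8527685, 0.6237614)
)

# Pick the candidate with the largest cross-validated ROC.
best <- results[which.max(results$ROC), ]
best
```

On the fitted object, treeModel$bestTune holds the selected cp directly.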
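The held-out predictions saved during resampling (treeModel$pred) can then be used to generate the ROC curve and calculate the area under it. Note that with savePredictions = TRUE this object holds predictions for all 15 candidate cp values (15 × 484 = 7,260 rows), so the curve pools every candidate rather than only the selected model. This largely explains why its AUC (0.8281) is lower than the cross-validated 0.8958 reported for cp = 0.
rocCurve <- roc(response = treeModel$pred$obs,
                predictor = treeModel$pred$approved,
                levels = rev(levels(treeModel$pred$obs)))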
plot(rocCurve, legacy.axes = TRUE)
##
## Call:
## roc.default(response = treeModel$pred$obs, predictor = treeModel$pred$approved, levels = rev(levels(treeModel$pred$obs)))
##
## Data: treeModel$pred$approved in 4035 controls (treeModel$pred$obs disapproved) < 3225 cases (treeModel$pred$obs approved).
## Area under the curve: 0.8281
auc(rocCurve)
## Area under the curve: 0.8281
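Because treeModel$pred stores held-out predictions for every candidate cp, one option is to keep only the rows for the selected cp before building the curve. The sketch below shows the idea on simulated data. toyPred is a made-up stand-in for treeModel$pred with just the obs, approved and cp columns, and its scores are random numbers, not results from the model.

```r
library(pROC)

set.seed(1)  # arbitrary seed for the simulated scores
n   <- 200
obs <- factor(sample(c("approved", "disapproved"), n, replace = TRUE))

# Made-up held-out predictions for two candidate cp values:
# informative scores for cp = 0, pure noise for cp = 0.65.
toyPred <- rbind(
  data.frame(obs = obs,
             approved = ifelse(obs == "approved", 0.7, 0.3) + runif(n, -0.3, 0.3),
             cp = 0),
  data.frame(obs = obs, approved = runif(n), cp = 0.65)
)

bestCp   <- 0  # on the real model this would be treeModel$bestTune$cp
bestPred <- toyPred[toyPred$cp == bestCp, ]

# ROC for the selected cp only, and for all candidates pooled together.
rocBest <- roc(response = bestPred$obs, predictor = bestPred$approved,
               levels = c("disapproved", "approved"), direction = "<")
rocAll  <- roc(response = toyPred$obs, predictor = toyPred$approved,
               levels = c("disapproved", "approved"), direction = "<")

auc(rocBest)
auc(rocAll)
```

In this simulation the pooled curve is pulled down by the uninformative candidate, which is the same effect that pooling has on the real predictions.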
Now test the model on the validation set.
testResult <- predict(treeModel, testX)
testResult[1:10]
## [1] approved approved approved approved approved
## [6] disapproved approved approved approved disapproved
## Levels: approved disapproved
To get the class probabilities, pass type = "prob" to predict.
testResult2 <- predict(treeModel, testX, type = "prob")
head(testResult2)
## approved disapproved
## 1 0.9041916 0.09580838
## 2 0.9041916 0.09580838
## 7 0.6111111 0.38888889
## 8 0.8125000 0.18750000
## 10 0.8125000 0.18750000
## 14 0.0745614 0.92543860
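The predictions only become a test result once they are compared with the true labels. A minimal sketch of that comparison in base R, on made-up vectors (predicted and actual stand in for testResult and testY, and their values are invented):

```r
lv <- c("approved", "disapproved")

# Made-up predictions and true labels for six cases.
predicted <- factor(c("approved", "approved", "disapproved",
                      "disapproved", "approved", "disapproved"), levels = lv)
actual    <- factor(c("approved", "disapproved", "disapproved",
                      "disapproved", "approved", "approved"), levels = lv)

# Cross-tabulate predictions against the truth.
confTable <- table(predicted = predicted, actual = actual)
confTable

# Share of cases classified correctly.
accuracy <- mean(predicted == actual)
accuracy
```

caret also provides confusionMatrix, which reports sensitivity, specificity and related statistics in addition to the table, so confusionMatrix(testResult, testY) would be the natural call on the real data.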
Let’s plot the tree.
fancyRpartPlot(treeModel$finalModel)