The data used in this project can be found at the UCI Machine Learning Repository.

The data contains 16 variables, a mix of character and numeric (integer and double) vectors.

Please note that all attribute names and values have been changed to meaningless symbols to protect the confidentiality of the data.

I am going to use a classification tree to predict credit approval.

Let’s load the packages needed for this project.

library(readr)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(pROC)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
library(rattle)
## Rattle: A free graphical interface for data mining with R.
## Version 4.1.0 Copyright (c) 2006-2015 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
  1. The readr package lets us read the data into R
  2. The caret package is used to tune our model
  3. The pROC package is used to draw the ROC curve
  4. The rattle package is used to plot the classification tree

Now let’s store the URL of the data file in R.

url <- "http://archive.ics.uci.edu/ml/machine-learning-databases/credit-screening/crx.data"
rawData <- read_csv(url, col_names = FALSE, na = "?")

This reads the data into R using the readr package. The data does not include headers, hence col_names is set to FALSE. The question mark symbols represent missing values in the data set, so they are passed as na = "?".
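As a small illustration of these two options, the sketch below parses two made-up rows with base R's read.csv rather than readr (an assumption, made only so the example runs without extra packages). Here header = FALSE and na.strings = "?" play the roles of col_names = FALSE and na = "?".

```r
# Two made-up rows in the same comma-separated, header-less layout
toy_text <- "b,30.83,0,+\na,?,4.46,-"

# header = FALSE: no header row; na.strings = "?": treat "?" as missing
toy <- read.csv(text = toy_text, header = FALSE, na.strings = "?",
                stringsAsFactors = FALSE)

str(toy)              # columns are auto-named V1..V4
sum(is.na(toy$V2))    # number of "?" entries that became NA in column 2
```

Because the "?" is converted to NA before type guessing, the second column is still read as numeric.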

Our next task after loading the data is to inspect it.

str(rawData)
## Classes 'tbl_df', 'tbl' and 'data.frame':    690 obs. of  16 variables:
##  $ X1 : chr  "b" "a" "a" "b" ...
##  $ X2 : num  30.8 58.7 24.5 27.8 20.2 ...
##  $ X3 : num  0 4.46 0.5 1.54 5.62 ...
##  $ X4 : chr  "u" "u" "u" "u" ...
##  $ X5 : chr  "g" "g" "g" "g" ...
##  $ X6 : chr  "w" "q" "q" "w" ...
##  $ X7 : chr  "v" "h" "h" "v" ...
##  $ X8 : num  1.25 3.04 1.5 3.75 1.71 ...
##  $ X9 : chr  "t" "t" "t" "t" ...
##  $ X10: chr  "t" "t" "f" "t" ...
##  $ X11: chr  "01" "06" "0" "05" ...
##  $ X12: chr  "f" "f" "f" "t" ...
##  $ X13: chr  "g" "g" "g" "g" ...
##  $ X14: chr  "00202" "00043" "00280" "00100" ...
##  $ X15: int  0 560 824 3 0 0 31285 1349 314 1442 ...
##  $ X16: chr  "+" "+" "+" "+" ...
summary(rawData)
##       X1                  X2              X3              X4           
##  Length:690         Min.   :13.75   Min.   : 0.000   Length:690        
##  Class :character   1st Qu.:22.60   1st Qu.: 1.000   Class :character  
##  Mode  :character   Median :28.46   Median : 2.750   Mode  :character  
##                     Mean   :31.57   Mean   : 4.759                     
##                     3rd Qu.:38.23   3rd Qu.: 7.207                     
##                     Max.   :80.25   Max.   :28.000                     
##                     NA's   :12                                         
##       X5                 X6                 X7                  X8        
##  Length:690         Length:690         Length:690         Min.   : 0.000  
##  Class :character   Class :character   Class :character   1st Qu.: 0.165  
##  Mode  :character   Mode  :character   Mode  :character   Median : 1.000  
##                                                           Mean   : 2.223  
##                                                           3rd Qu.: 2.625  
##                                                           Max.   :28.500  
##                                                                           
##       X9                X10                X11           
##  Length:690         Length:690         Length:690        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##      X12                X13                X14           
##  Length:690         Length:690         Length:690        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##       X15               X16           
##  Min.   :     0.0   Length:690        
##  1st Qu.:     0.0   Class :character  
##  Median :     5.0   Mode  :character  
##  Mean   :  1017.4                     
##  3rd Qu.:   395.5                     
##  Max.   :100000.0                     
## 
head(rawData)
##   X1    X2    X3 X4 X5 X6 X7   X8 X9 X10 X11 X12 X13   X14 X15 X16
## 1  b 30.83 0.000  u  g  w  v 1.25  t   t  01   f   g 00202   0   +
## 2  a 58.67 4.460  u  g  q  h 3.04  t   t  06   f   g 00043 560   +
## 3  a 24.50 0.500  u  g  q  h 1.50  t   f   0   f   g 00280 824   +
## 4  b 27.83 1.540  u  g  w  v 3.75  t   t  05   t   g 00100   3   +
## 5  b 20.17 5.625  u  g  w  v 1.71  t   f   0   f   s 00120   0   +
## 6  b 32.08 4.000  u  g  m  v 2.50  t   f   0   t   g 00360   0   +

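We can see that the variable X14 is stored as a character vector instead of an integer vector. This usually happens when a non-numeric string appears in the column. Here the missing-value symbol is already handled by na = "?", so the more likely cause is the leading zeros in values such as "00202".

Let’s convert it to an integer vector.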

rawData$X14 <- as.integer(rawData$X14)
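To show what this conversion does, here is a minimal sketch on a few made-up strings in the same zero-padded style as X14 (the vector names are hypothetical). Leading zeros are dropped and missing values stay missing.

```r
# Made-up values in the same zero-padded style as X14
x14_chr <- c("00202", "00043", NA, "00100")

# as.integer() parses the digits and ignores the leading zeros
x14_int <- as.integer(x14_chr)

x14_int               # 202  43  NA 100
is.integer(x14_int)   # TRUE
```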

We also notice that the data contains some missing values (for example, 12 in X2). We need not worry about that here, since classification tree models can handle missing data.
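If you want to see how much is missing before modelling, a per-column count is a quick check. The sketch below uses a small made-up data frame; on the real data the same call would be colSums(is.na(rawData)).

```r
# Made-up data frame with a few missing entries
toy <- data.frame(
  X2 = c(30.83, NA, 24.50, NA),
  X3 = c(0, 4.46, 0.5, 1.54),
  X4 = c("u", NA, "u", "y")
)

colSums(is.na(toy))        # missing values per column: X2 = 2, X3 = 0, X4 = 1
sum(complete.cases(toy))   # rows with no missing value at all: 2
```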

Let’s label our class variable and convert it to a factor.

rawData$X16 <- ifelse(rawData$X16 == "+", "approved", "disapproved")
rawData$X16 <- factor(rawData$X16)
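The order of the factor levels matters later on, because caret's twoClassSummary treats the first level as the event of interest when it computes sensitivity. A minimal sketch on made-up symbols (the vector names are hypothetical) shows that factor() sorts the labels alphabetically, so "approved" comes first.

```r
# Made-up class symbols in the same "+" / "-" coding as X16
symbols <- c("+", "-", "-", "+")

# Same relabelling as above, applied to the toy vector
outcome <- factor(ifelse(symbols == "+", "approved", "disapproved"))

levels(outcome)   # "approved" "disapproved" (alphabetical order)
table(outcome)
```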

We now examine the distribution of the class variable to check that it is reasonably balanced.

ggplot(rawData, aes(x = X16)) + geom_bar()

The class appears to be reasonably balanced according to the bar plot shown above, so there is no need for subsampling.
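A numeric check can complement the bar plot. The counts below are not stated directly in the text. They are inferred, as an assumption, from the pROC output later in this document, which reports 3225 approved and 4035 disapproved held-out predictions pooled over 15 cp values, i.e. 215 and 269 rows of the 484-row training set. On the full data the equivalent check would be prop.table(table(rawData$X16)).

```r
# Training-set class counts inferred from the pooled pROC output (assumption)
train_counts <- c(approved = 3225 / 15, disapproved = 4035 / 15)

train_counts                         # 215 and 269
round(prop.table(train_counts), 3)   # class shares of the training set
```

Neither class dominates, which is consistent with the reading of the bar plot.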

Now let’s extract the class variable from rawData

class <- rawData$X16

Let’s extract our predictor variables from rawData

predictors <- rawData[, -16]

Since our predictors contain character vectors, let’s use the dummyVars function in the caret package to encode them as dummy variables

dmy <- dummyVars(~ ., data = predictors)
predictors <- predict(dmy, predictors)

This generates dummy variables for the whole predictor set, not just the training set; the split into training and validation sets comes next.
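To make the idea of dummy encoding concrete, here is a minimal sketch that uses base R's model.matrix as a stand-in for dummyVars (an assumption made for illustration; the column names it produces may differ from caret's). Each level of the categorical column becomes its own 0/1 column, while the numeric column passes through unchanged.

```r
# Made-up predictors: one categorical column and one numeric column
toy <- data.frame(
  X4 = factor(c("u", "y", "u", "l")),
  X3 = c(0, 4.46, 0.5, 1.54)
)

# "- 1" drops the intercept so every level of X4 gets an indicator column
encoded <- model.matrix(~ X4 + X3 - 1, data = toy)

colnames(encoded)   # "X4l" "X4u" "X4y" "X3"
encoded
```

This expansion of categorical columns into indicators is why the 15 original predictors turn into many more columns after encoding.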

We are going to split our data into a training set and a validation set in order to test our model on the validation set.

To create stratified random splits of the data (based on the classes), the createDataPartition function in the caret package can be used. Because the split is random, the exact rows selected will differ between runs unless a seed is set with set.seed beforehand.

indx <- createDataPartition(class, p = .70, list = FALSE)

trainX  <- predictors[indx, ]
testX <- predictors[-indx, ]

trainY <- class[indx]
testY <- class[-indx]
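The idea behind a stratified split can be sketched in base R: sample 70% of the rows within each class separately, so both subsets keep roughly the original class ratio. This is an illustration on made-up labels, not caret's actual implementation, and the seed value is an arbitrary assumption.

```r
set.seed(123)  # arbitrary seed (assumption), only to make the sketch repeatable

# Made-up, imbalanced class labels
y <- factor(rep(c("approved", "disapproved"), times = c(30, 70)))

# Sample 70% of the row indices within each class
rows_by_class <- split(seq_along(y), y)
idx <- unlist(lapply(rows_by_class, function(rows) {
  sample(rows, size = round(0.7 * length(rows)))
}), use.names = FALSE)

train_y <- y[idx]
test_y  <- y[-idx]

table(train_y)   # 21 approved, 49 disapproved
table(test_y)    #  9 approved, 21 disapproved
```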

Check the dimensions of the training and test sets

dim(trainX)
## [1] 484  68
length(trainY)
## [1] 484
dim(testX)
## [1] 206  68
length(testY)
## [1] 206

We now tune our classification tree model over the complexity parameter (cp)

ctrl <- trainControl(method = "cv", number = 10, summaryFunction = twoClassSummary,
                     classProbs = TRUE, 
                     savePredictions = TRUE)

treeModel <- train(x = trainX, y = trainY,
                   method = "rpart",
                   trControl = ctrl,
                   metric = "ROC",
                   tuneLength = 15)
## Loading required package: rpart

Now let’s check the result of our model

treeModel
## CART 
## 
## 484 samples
##  68 predictor
##   2 classes: 'approved', 'disapproved' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 435, 435, 436, 436, 435, 436, ... 
## Resampling results across tuning parameters:
## 
##   cp          ROC        Sens       Spec     
##   0.00000000  0.8958406  0.8285714  0.8659544
##   0.04651163  0.8527685  0.9207792  0.7847578
##   0.09302326  0.8527685  0.9207792  0.7847578
##   0.13953488  0.8527685  0.9207792  0.7847578
##   0.18604651  0.8527685  0.9207792  0.7847578
##   0.23255814  0.8527685  0.9207792  0.7847578
##   0.27906977  0.8527685  0.9207792  0.7847578
##   0.32558140  0.8527685  0.9207792  0.7847578
##   0.37209302  0.8527685  0.9207792  0.7847578
##   0.41860465  0.8527685  0.9207792  0.7847578
##   0.46511628  0.8527685  0.9207792  0.7847578
##   0.51162791  0.8527685  0.9207792  0.7847578
##   0.55813953  0.8527685  0.9207792  0.7847578
##   0.60465116  0.8527685  0.9207792  0.7847578
##   0.65116279  0.6237614  0.3623377  0.8851852
## 
## ROC was used to select the optimal model using  the largest value.
## The final value used for the model was cp = 0.
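The selection rule reported in the last two lines can be reproduced by hand. The sketch below copies three rows of the table above into a data frame (the name `results` is hypothetical) and picks the cp with the largest cross-validated ROC.

```r
# Three rows copied from the resampling table above
results <- data.frame(
  cp  = c(0.00000000, 0.04651163, 0.65116279),
  ROC = c(0.8958406, 0.8527685, 0.6237614)
)

# "Largest value" rule: choose the cp whose ROC is highest
best_cp <- results$cp[which.max(results$ROC)]
best_cp   # 0
```

On the fitted object itself, the chosen value is stored in treeModel$bestTune.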

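The held-out predictions saved during resampling (savePredictions = TRUE) can then be used to generate the ROC curve and calculate the area under the curve. Note that treeModel$pred holds predictions for all 15 candidate cp values, not only the selected one: the 4035 + 3225 = 7260 rows reported below equal 484 × 15. The resulting AUC of 0.8281 is therefore pooled across the whole tuning grid, which is why it differs from the 0.8958 reported for cp = 0.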

rocCurve <- roc(response = treeModel$pred$obs,
 predictor = treeModel$pred$approved,
 levels = rev(levels(treeModel$pred$obs)))

plot(rocCurve, legacy.axes = TRUE)

## 
## Call:
## roc.default(response = treeModel$pred$obs, predictor = treeModel$pred$approved,     levels = rev(levels(treeModel$pred$obs)))
## 
## Data: treeModel$pred$approved in 4035 controls (treeModel$pred$obs disapproved) < 3225 cases (treeModel$pred$obs approved).
## Area under the curve: 0.8281
auc(rocCurve)
## Area under the curve: 0.8281
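Because savePredictions = TRUE keeps held-out predictions for every candidate cp, an AUC computed on all rows mixes good and bad tuning values. The sketch below illustrates the effect on a made-up stand-in for treeModel$pred, using a rank-based AUC helper (`auc_rank` is a hypothetical function written for this example, not part of pROC).

```r
# Rank-based (Mann-Whitney) AUC: probability that an event outranks a non-event
auc_rank <- function(score, is_event) {
  r  <- rank(score)          # ties get average ranks
  n1 <- sum(is_event)
  n0 <- sum(!is_event)
  (sum(r[is_event]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Made-up stand-in for treeModel$pred with two candidate cp values
pred <- data.frame(
  obs      = factor(rep(c("approved", "disapproved"), each = 4, times = 2)),
  approved = c(0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1,   # cp = 0: well separated
               0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5),  # cp = 0.65: constant score
  cp       = rep(c(0, 0.65), each = 8)
)

# AUC over all rows versus AUC for the selected cp only
pooled_auc <- auc_rank(pred$approved, pred$obs == "approved")
best       <- subset(pred, cp == 0)
best_auc   <- auc_rank(best$approved, best$obs == "approved")

pooled_auc   # 0.875
best_auc     # 1
```

On the real object, the analogous subset would be subset(treeModel$pred, cp == treeModel$bestTune$cp), which could then be passed to roc() in place of the full treeModel$pred.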

Now we test the model on the validation set

testResult <- predict(treeModel, testX)
testResult[1:10]
##  [1] approved    approved    approved    approved    approved   
##  [6] disapproved approved    approved    approved    disapproved
## Levels: approved disapproved
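Predicted classes are usually compared with the true labels in a confusion matrix. The sketch below does this in base R on made-up vectors (both `predicted` and `actual` are hypothetical). On the real objects, caret's confusionMatrix(data = testResult, reference = testY) reports the same kind of table along with further statistics.

```r
lvls <- c("approved", "disapproved")

# Made-up predictions and true labels
predicted <- factor(c("approved", "approved", "disapproved",
                      "approved", "disapproved", "disapproved"), levels = lvls)
actual    <- factor(c("approved", "disapproved", "disapproved",
                      "approved", "approved", "disapproved"), levels = lvls)

# Rows are predictions, columns are the true classes
cm <- table(Predicted = predicted, Actual = actual)
cm

accuracy    <- sum(diag(cm)) / sum(cm)                          # 4 of 6 correct
sensitivity <- cm["approved", "approved"] / sum(cm[, "approved"])  # 2 of 3 approved found
accuracy
sensitivity
```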

To get the class probabilities, pass type = "prob" to predict

testResult2 <- predict(treeModel, testX, type = "prob")
head(testResult2)
##     approved disapproved
## 1  0.9041916  0.09580838
## 2  0.9041916  0.09580838
## 7  0.6111111  0.38888889
## 8  0.8125000  0.18750000
## 10 0.8125000  0.18750000
## 14 0.0745614  0.92543860
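Class predictions and class probabilities are linked by a cutoff. As a minimal sketch, assuming the usual 0.5 cutoff for a two-class problem, the six probabilities printed above can be turned back into class labels. The result agrees with the first six class predictions shown earlier.

```r
# "approved" probabilities copied from the head(testResult2) output above
prob_approved <- c(0.9041916, 0.9041916, 0.6111111,
                   0.8125000, 0.8125000, 0.0745614)

cutoff <- 0.5  # assumed default cutoff; other values trade sensitivity for specificity

predicted <- ifelse(prob_approved > cutoff, "approved", "disapproved")
predicted   # five "approved" followed by one "disapproved"
```

Raising the cutoff makes the model more conservative about approving, which is the trade-off the ROC curve summarises.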

Let’s plot the tree

fancyRpartPlot(treeModel$finalModel)