The idea behind this dataset is to identify factors that are predictive of a higher risk of default. The dataset, contributed by Hans Hofmann of the University of Hamburg, is available for download from the UCI Machine Learning Data Repository at http://archive.ics.uci.edu/ml. It contains information on loans obtained from a credit agency in Germany. Because the data was collected in Germany, the currency is recorded in Deutsche Marks (DM) (Lantz, 2015).
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
# read in the local copy of the credit dataset
credit <- read.csv("C:/Users/KEVIN/Downloads/credit.csv")
Suppose we have a data frame named credit with 1000 rows of data. We can divide it into three partitions as follows. First, we create a vector of randomly ordered row IDs from 1 to 1000 using the runif() function, which by default generates a specified number of random values between 0 and 1. The runif() function gets its name from the random uniform distribution.
random_ids <- order(runif(1000))
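Because runif() draws fresh random numbers each time the code is run, the resulting split will differ across runs. If a reproducible split is needed, a seed can be set before generating the IDs (a minimal sketch; the seed value 123 is arbitrary):
set.seed(123)                     # fix the random number stream so the shuffle repeats
random_ids <- order(runif(1000))  # same random ordering of row IDs on every run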
We can use the resulting random IDs to divide the credit data frame into 500, 250, and 250 records comprising the training, validation, and test datasets:
credit_train <- credit[random_ids[1:500], ]
credit_validate <- credit[random_ids[501:750], ]
credit_test <- credit[random_ids[751:1000], ]
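Because this simple random holdout ignores the class labels, the proportion of defaults can drift between partitions by chance. A quick check (a sketch, assuming default is the class column, as used throughout this chapter):
# prop.table() turns the class counts from table() into proportions,
# so the default rate in each partition can be compared at a glance
prop.table(table(credit$default))
prop.table(table(credit_train$default))
prop.table(table(credit_validate$default))
prop.table(table(credit_test$default))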
The caret package provides a createDataPartition() function that will create partitions based on stratified holdout sampling. The code to create a stratified sample of training and test data for the credit dataset is shown in the following commands. To use the function, a vector of the class values must be specified (here, default refers to whether a loan went into default) in addition to a parameter p, which specifies the proportion of instances to be included in the partition. The list = FALSE parameter prevents the result from being stored in the list format:
in_train <- createDataPartition(credit$default, p = 0.75, list = FALSE)
credit_train <- credit[in_train, ]
credit_test <- credit[-in_train, ]
The in_train vector indicates row numbers included in the training sample. We can use these row numbers to select examples for the credit_train data frame. Similarly, by using a negative symbol, we can use the rows not found in the in_train vector for the credit_test dataset.
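Applying the same prop.table() check to the stratified partitions should now show nearly identical default rates in the training and test sets (a brief sketch):
# stratified sampling keeps the class balance of the original data
prop.table(table(credit_train$default))
prop.table(table(credit_test$default))
nrow(credit_train)   # roughly 75 percent of the 1000 rows
nrow(credit_test)    # the remaining 25 percent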
Datasets for cross-validation can be created using the createFolds() function in the caret package. Similar to the stratified random holdout sampling, this function will attempt to maintain the same class balance in each of the folds as in the original dataset. The following is the command to create 10 folds:
folds <- createFolds(credit$default, k = 10)
The result of the createFolds() function is a list of vectors storing the row numbers for each of the requested k = 10 folds. We can peek at the contents, using str():
str(folds)
## List of 10
## $ Fold01: int [1:100] 5 8 26 35 57 67 82 86 89 105 ...
## $ Fold02: int [1:100] 2 24 29 32 43 45 48 49 56 87 ...
## $ Fold03: int [1:100] 39 42 52 68 72 94 97 101 107 119 ...
## $ Fold04: int [1:100] 7 13 28 51 63 64 93 98 116 117 ...
## $ Fold05: int [1:100] 4 30 33 50 71 78 90 95 100 103 ...
## $ Fold06: int [1:100] 3 19 22 38 62 96 110 114 115 123 ...
## $ Fold07: int [1:100] 6 12 37 53 69 73 79 81 109 111 ...
## $ Fold08: int [1:100] 11 18 21 25 31 74 76 77 80 83 ...
## $ Fold09: int [1:100] 1 9 15 34 41 44 46 54 55 59 ...
## $ Fold10: int [1:100] 10 14 16 17 20 23 27 36 40 47 ...
Here, we see that the first fold is named Fold01 and stores 100 integers, indicating the 100 rows in the credit data frame for the first fold. To create training and test datasets to build and evaluate a model, an additional step is needed. The following commands show how to create data for the first fold. We’ll assign the selected 10 percent to the test dataset, and use the negative symbol to assign the remaining 90 percent to the training dataset:
credit01_test <- credit[folds$Fold01, ]
credit01_train <- credit[-folds$Fold01, ]
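Because createFolds() also stratifies on the class labels, each fold should contain about one tenth of the rows with a similar default rate; a quick sanity check (a sketch):
sapply(folds, length)                        # about 100 rows per fold
prop.table(table(credit01_test$default))     # default rate in the held-out fold
prop.table(table(credit01_train$default))    # default rate in the remaining 90 percent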
library(caret)
library(C50)
library(irr)
## Loading required package: lpSolve
Next, we’ll create a list of 10 folds as we have done previously. The set.seed() function is used here to ensure that the results are consistent if the same code is run again:
set.seed(123)
folds <- createFolds(credit$default, k = 10)
Finally, we will apply a series of identical steps to the list of folds using the lapply() function. As shown in the following code, because there is no existing function that does exactly what we need, we must define our own function to pass to lapply(). Our custom function divides the credit data frame into training and test data, builds a decision tree using the C5.0() function on the training data, generates a set of predictions from the test data, and compares the predicted and actual values using the kappa2() function:
cv_results <- lapply(folds, function(x) {
  credit_train <- credit[-x, ]     # rows outside the fold: 90 percent for training
  credit_test <- credit[x, ]       # rows inside the fold: 10 percent for testing
  credit_model <- C5.0(default ~ ., data = credit_train)
  credit_pred <- predict(credit_model, credit_test)
  credit_actual <- credit_test$default
  kappa <- kappa2(data.frame(credit_actual, credit_pred))$value
  return(kappa)
})
The resulting kappa statistics are compiled into a list stored in the cv_results object, which we can examine using str():
str(cv_results)
## List of 10
## $ Fold01: num 0.343
## $ Fold02: num 0.255
## $ Fold03: num 0.109
## $ Fold04: num 0.107
## $ Fold05: num 0.338
## $ Fold06: num 0.474
## $ Fold07: num 0.245
## $ Fold08: num 0.0365
## $ Fold09: num 0.425
## $ Fold10: num 0.505
There’s just one more step remaining in the 10-fold CV process: we must calculate the average of these 10 values. Although you might be tempted to type mean(cv_results), because cv_results is a list rather than a numeric vector, the call would return NA with a warning instead of the average. Instead, use the unlist() function, which eliminates the list structure and reduces cv_results to a numeric vector. From here, we can calculate the mean kappa as expected:
mean(unlist(cv_results))
## [1] 0.283796
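For comparison, caret can automate the same 10-fold cross-validation through its train() function, which creates the folds, fits the model on each training split, and averages the resampled statistics internally. The following is a hedged sketch rather than part of the workflow above; it assumes default is a factor (as required by C5.0() earlier), and its kappa will differ somewhat from the manual loop because train() also tunes the model's parameters across a small grid:
# trainControl() describes the resampling scheme; train() carries it out
ctrl <- trainControl(method = "cv", number = 10)
set.seed(123)
credit_cv <- train(default ~ ., data = credit, method = "C5.0",
                   metric = "Kappa", trControl = ctrl)
credit_cv$results   # resampled Accuracy and Kappa for each candidate tuning value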
This chapter presented a number of the most common measures and techniques for evaluating the performance of machine learning classification models. Although accuracy provides a simple method to examine how often a model is correct, this can be misleading in the case of rare events because the real-life cost of such events may be inversely proportional to how frequently they appear.
A number of measures based on the confusion matrix better capture the balance among the costs of various types of errors. Closely examining the tradeoffs between sensitivity and specificity, or between precision and recall, can be a useful way to think about the implications of errors in the real world. Visualizations such as the ROC curve are also helpful to this end.
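As a brief illustration of these confusion matrix measures, caret's confusionMatrix() function reports accuracy, kappa, sensitivity, and specificity from a set of predicted and actual class values. The sketch below reuses the first fold's training and test sets created earlier; the model and prediction objects are new names introduced only for this example:
# fit a decision tree on the first fold's training data and evaluate it
credit01_model <- C5.0(default ~ ., data = credit01_train)
credit01_pred <- predict(credit01_model, credit01_test)
confusionMatrix(credit01_pred, credit01_test$default)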