1.0 Overview:

For banks, credit scoring is an essential instrument for assessing credit risk. In a constantly changing financial landscape, banks have access to a wealth of data that risk analysts can use to assess an applicant’s creditworthiness. Credit scoring uses statistical methods to assign a score to a borrower based on their characteristics, which helps classify them as a likely good or bad risk. According to Flaman (1997), credit scoring is the process of assigning a score to a potential borrower in order to predict how well the borrower will repay the loan in the future. Thomas et al. (2002) describe credit scoring as a collection of decision models and underlying procedures for approving consumer credit. To support informed decisions about granting or denying credit, the procedure converts qualitative and quantitative information into numerical indicators. R. Anderson (2007) describes this technique, a form of data mining, as a collection of statistical models. Scoring systems of this kind are also used outside the credit market, for example in insurance and epidemiology. Overall, credit scoring is a crucial part of managing financial risk because it enables banks to identify default risks and gauge the creditworthiness of both individuals and businesses.

The financial sector relies heavily on the credit scoring algorithms used by banks to decide whether or not to approve a loan. People may use loans to buy a home, a car, or to pay for home improvement tasks when they are on a tight budget. Banks are well aware of the possibility that borrowers could stop making payments on their loans, or default. As a result, one of the most important factors for financial institutions is the appraisal of the risks that the loan represents. Banks utilize credit scoring, a statistical technique, in addition to the fundamental assurances needed for loan acceptance, to judge a borrower’s dependability and viability.

Prospective borrowers are given a credit score based on a variety of factors, including their demographics, credit history, and financial situation. This score represents the strength of the borrower’s file and lets the bank gauge the level of risk it is taking on. Because the score also reflects the likelihood of non-repayment, the bank uses it to decide whether to offer a loan. Each bank develops its own credit rating methodology, so the approach differs from one institution to another.

The goal of this project is to develop a credit scoring algorithm that predicts the probability of financial distress. The project uses statistical learning methods and a provided dataset of individuals’ credit information. The data consist of observable characteristics of borrowers, such as age, monthly income, and other financial information. The algorithm uses these characteristics to predict the probability of financial distress. Once the algorithm is developed, a cross-validation exercise is conducted to evaluate its accuracy and robustness.

We now load the data and begin the analysis, which is detailed in the R code. Five credit risk models are built for credit default analysis.

2.0 Installing and Launching R Packages

library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
library(stats)
library(caret)
## Loading required package: ggplot2
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
## 
##     margin
## Loading required package: lattice
library(lattice)
library(DataExplorer)
library(MASS)
library(jtools)
library(caTools)
library(boot)
## 
## Attaching package: 'boot'
## The following object is masked from 'package:lattice':
## 
##     melanoma
library(ROCR)
library(ranger)
## 
## Attaching package: 'ranger'
## The following object is masked from 'package:randomForest':
## 
##     importance
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.1-6
library(pls)
## 
## Attaching package: 'pls'
## The following object is masked from 'package:caret':
## 
##     R2
## The following object is masked from 'package:stats':
## 
##     loadings
library(brnn)
## Loading required package: Formula
## Loading required package: truncnorm
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:randomForest':
## 
##     combine
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(magrittr)
library(knitr)
library(ggplot2)
library(reshape2)
library(Amelia)
## Loading required package: Rcpp
## ## 
## ## Amelia II: Multiple Imputation
## ## (Version 1.8.1, built: 2022-11-18)
## ## Copyright (C) 2005-2023 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##
library(ROSE)
## Loaded ROSE 0.0-4
library(cvAUC)
library(xgboost)
## 
## Attaching package: 'xgboost'
## The following object is masked from 'package:dplyr':
## 
##     slice
library(creditmodel)
## Package 'creditmodel' version 1.3.1
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ lubridate 1.9.2     ✔ tibble    3.1.8
## ✔ purrr     1.0.1     ✔ tidyr     1.3.0
## ✔ readr     2.1.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::combine()     masks randomForest::combine()
## ✖ tidyr::expand()      masks Matrix::expand()
## ✖ tidyr::extract()     masks magrittr::extract()
## ✖ dplyr::filter()      masks stats::filter()
## ✖ dplyr::lag()         masks stats::lag()
## ✖ purrr::lift()        masks caret::lift()
## ✖ ggplot2::margin()    masks randomForest::margin()
## ✖ tidyr::pack()        masks Matrix::pack()
## ✖ dplyr::select()      masks MASS::select()
## ✖ purrr::set_names()   masks magrittr::set_names()
## ✖ xgboost::slice()     masks dplyr::slice()
## ✖ stringr::str_match() masks creditmodel::str_match()
## ✖ tidyr::unpack()      masks Matrix::unpack()
## ℹ Use the conflicted package (http://conflicted.r-lib.org/) to force all conflicts to become errors
library(Rcpp)
library(reshape)
## 
## Attaching package: 'reshape'
## 
## The following object is masked from 'package:lubridate':
## 
##     stamp
## 
## The following objects are masked from 'package:tidyr':
## 
##     expand, smiths
## 
## The following objects are masked from 'package:reshape2':
## 
##     colsplit, melt, recast
## 
## The following object is masked from 'package:dplyr':
## 
##     rename
## 
## The following object is masked from 'package:Matrix':
## 
##     expand
library(stargazer)
## 
## Please cite as: 
## 
##  Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
##  R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
library(broom)
## Registered S3 methods overwritten by 'broom':
##   method            from  
##   tidy.glht         jtools
##   tidy.summary.glht jtools
library(ggcorrplot)
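
The library() calls above assume that every package is already installed. A minimal sketch of a pre-flight check follows; the helper name missing_packages and the shortened package vector are illustrative and are not part of the original script.

```r
# Hypothetical helper: report which of the required packages are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Shortened, illustrative subset of the packages loaded above.
required <- c("caret", "ranger", "stargazer", "ggcorrplot")
to_install <- missing_packages(required)

# Install only what is missing (uncomment to run):
# if (length(to_install) > 0) install.packages(to_install)
```

Checking first avoids reinstalling packages on every run of the report.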

Before conducting the statistical analysis and interpreting the results, we import the train and test data sets.

3.0 Importing data:

train <- read.csv("~/Desktop/Data Mining/train.csv", row.names=1)
test <- read.csv("~/Desktop/Data Mining/test.csv", row.names=1)
cs.test.response <- read.csv("~/Desktop/Data Mining/cs-test.response.csv", row.names=1)
head(train)
##   y       Rev age nT30.59D  DebtRatio MonIn nOpenCandL nT90DLate nRELOrL
## 1 1 0.7661266  45        2 0.80298213  9120         13         0       6
## 2 0 0.9571510  40        0 0.12187620  2600          4         0       0
## 3 0 0.6581801  38        1 0.08511338  3042          2         1       0
## 4 0 0.2338098  30        0 0.03604968  3300          5         0       0
## 5 0 0.9072394  49        1 0.02492570 63588          7         0       1
## 6 0 0.2131787  74        0 0.37560697  3500          3         0       1
##   nT60.89D nDependents
## 1        0           2
## 2        0           1
## 3        0           0
## 4        0           0
## 5        0           0
## 6        0           1
head(test)
##    y        Rev age nT30.59D  DebtRatio MonIn nOpenCandL nT90DLate nRELOrL
## 1 NA 0.88551908  43        0 0.17751272  5700          4         0       0
## 2 NA 0.46329527  57        0 0.52723693  9141         15         0       4
## 3 NA 0.04327504  59        0 0.68764752  5083         12         0       1
## 4 NA 0.28030823  38        1 0.92596064  3200          7         0       2
## 5 NA 0.99999990  27        0 0.01991723  3865          4         0       0
## 6 NA 0.50979145  63        0 0.34242936  4140          4         0       0
##   nT60.89D nDependents
## 1        0           0
## 2        0           2
## 3        0           2
## 4        0           0
## 5        0           1
## 6        0           1

4.0 Descriptive Statistics:

4.1 Train dataset:

This section explores the properties of the variables before the data manipulation process. The graph below shows that the dataset contains missing values (NAs), which account for about 2% of all observations.

plot_intro(train)

The NAs are also visible in the descriptive statistics below: the columns with missing data are MonIn and nDependents.

stargazer(train, type = "text", title = "Descriptive Statistics", digits = 3, out = "table.txt")
## 
## Descriptive Statistics
## ==========================================================
## Statistic      N      Mean     St. Dev.   Min      Max    
## ----------------------------------------------------------
## y           150,000   0.067     0.250      0        1     
## Rev         150,000   6.048    249.755   0.000 50,708.000 
## age         150,000  52.295     14.772     0       109    
## nT30.59D    150,000   0.421     4.193      0       98     
## DebtRatio   150,000  353.005  2,037.819  0.000 329,664.000
## MonIn       120,269 6,670.221 14,384.670   0    3,008,750 
## nOpenCandL  150,000   8.453     5.146      0       58     
## nT90DLate   150,000   0.266     4.169      0       98     
## nRELOrL     150,000   1.018     1.130      0       54     
## nT60.89D    150,000   0.240     4.155      0       98     
## nDependents 146,076   0.757     1.115      0       20     
## ----------------------------------------------------------

The interdependence between the variables is another issue to examine when building a statistical model. The correlation matrix visualization shows three strong associations, all among three explanatory variables: Number of Times 30-59 Days Past Due Not Worse (nT30.59D), Number of Times 90 Days Late (nT90DLate), and Number of Times 60-89 Days Past Due Not Worse (nT60.89D). A common practice is to drop highly correlated variables and keep a single one. However, because we have few variables, we decided not to drop them.

corr_matrix <- cor(train[,-1], use='complete.obs')
ggcorrplot(corr_matrix, type='lower', hc.order=TRUE, lab=TRUE, method = c("circle"))
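
Strongly correlated pairs can also be listed numerically instead of being read off the plot. The sketch below uses simulated data and an assumed cutoff of 0.9; both are illustrative choices, not values taken from the report.

```r
# Toy data: x2 is nearly a copy of x1, x3 is unrelated (illustrative values only).
set.seed(1)
x1 <- rnorm(200)
toy <- data.frame(x1 = x1, x2 = x1 + rnorm(200, sd = 0.05), x3 = rnorm(200))

cm <- cor(toy, use = "complete.obs")

# Keep each pair once (upper triangle) and flag |r| above the assumed cutoff.
idx <- which(abs(cm) > 0.9 & upper.tri(cm), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  r    = cm[idx]
)
print(high_pairs)
```

Replacing toy with train[,-1] would apply the same check to the explanatory variables of the report.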

The visuals produced by the next blocks of code show the box plot and the histogram for each variable. The age variable spans a range of 0 to 109, which seems wide given that banks want to ensure the customer’s ability to repay the loan and are therefore typically not interested in the youngest and oldest borrowers.

To give a better view of the data, the outline argument of the boxplot function is set to FALSE for some columns so that outliers are not drawn.

boxplot(train$y, col = "lightblue", xlab="SeriousDlqin2yrs", outline = TRUE)

boxplot(train$Rev, col = "lightblue", xlab="RevolvingUtilizationOfUnsecuredLines", outline = FALSE)

boxplot(train$age, col = "lightblue", xlab="age", outline = FALSE)

boxplot(train$nT30.59D, col = "lightblue", xlab="NumberOfTime30-59DaysPastDueNotWorse", outline = TRUE)

boxplot(train$DebtRatio, col = "lightblue", xlab="DebtRatio", outline = FALSE)

boxplot(train$MonIn, col = "lightblue", xlab="MonthlyIncome", outline = FALSE)

boxplot(train$nOpenCandL, col = "lightblue", xlab="NumberOfOpenCreditLinesAndLoans", outline = FALSE)

boxplot(train$nT90DLate, col = "lightblue", xlab="NumberOfTimes90DaysLate", outline = TRUE)

boxplot(train$nRELOrL, col = "lightblue", xlab="NumberRealEstateLoansOrLines", outline = FALSE)

boxplot(train$nT60.89D, col = "lightblue", xlab="NumberOfTime60-89DaysPastDueNotWorse", outline = TRUE)

boxplot(train$nDependents, col = "lightblue", xlab="NumberOfDependents", outline = FALSE)
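
The points hidden by outline = FALSE can be inspected with boxplot.stats(), which applies the same whisker rule as boxplot(). The vector below is made up for illustration.

```r
# Illustrative values only: one extreme observation among ordinary ones.
x <- c(1, 2, 2, 3, 3, 3, 4, 4, 5, 98)

bs <- boxplot.stats(x)           # default whiskers reach 1.5 times the hinge spread
bs$out                           # points that outline = FALSE would not draw
length(bs$out) / length(x)       # share of observations treated as outliers
```

Running this on a column such as train$DebtRatio would show how many observations the trimmed plots leave out.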

histogram(train$y)

histogram(train$Rev)

histogram(train$age)

histogram(train$nT30.59D)

histogram(train$DebtRatio)

histogram(train$MonIn)

histogram(train$nOpenCandL)

histogram(train$nT90DLate)

histogram(train$nRELOrL)

histogram(train$nT60.89D)

histogram(train$nDependents)

4.2 Test dataset:

We now repeat the same descriptive statistics for the test data set. NAs appear in the same variables, again at about 2 percent. The y column has no values because it is the dependent variable to be predicted.

plot_intro(test)

stargazer(test, type = "text", title = "Descriptive Statistics", digits = 3, out = "table.txt")
## 
## Descriptive Statistics
## ==========================================================
## Statistic      N      Mean     St. Dev.   Min      Max    
## ----------------------------------------------------------
## Rev         101,503   5.310    196.156   0.000 21,821.000 
## age         101,503  52.405     14.780    21       104    
## nT30.59D    101,503   0.454     4.538      0       98     
## DebtRatio   101,503  344.475  1,632.595  0.000 268,326.000
## MonIn       81,400  6,855.036 36,508.600   0    7,727,000 
## nOpenCandL  101,503   8.454     5.144      0       85     
## nT90DLate   101,503   0.297     4.516      0       98     
## nRELOrL     101,503   1.013     1.110      0       37     
## nT60.89D    101,503   0.270     4.504      0       98     
## nDependents 98,877    0.769     1.137      0       43     
## ----------------------------------------------------------
corr_matrix_test <- cor(test[,-1], use='complete.obs')
ggcorrplot(corr_matrix_test, type='lower', hc.order=TRUE, lab=TRUE, method = c("circle"))

boxplot(test$Rev, col = "lightblue", xlab="RevolvingUtilizationOfUnsecuredLines", outline = FALSE)

boxplot(test$age, col = "lightblue", xlab="age", outline = FALSE)

boxplot(test$nT30.59D, col = "lightblue", xlab="NumberOfTime30-59DaysPastDueNotWorse", outline = TRUE)

boxplot(test$DebtRatio, col = "lightblue", xlab="DebtRatio", outline = FALSE)

boxplot(test$MonIn, col = "lightblue", xlab="MonthlyIncome", outline = FALSE)

boxplot(test$nOpenCandL, col = "lightblue", xlab="NumberOfOpenCreditLinesAndLoans", outline = FALSE)

boxplot(test$nT90DLate, col = "lightblue", xlab="NumberOfTimes90DaysLate", outline = TRUE)

boxplot(test$nRELOrL, col = "lightblue", xlab="NumberRealEstateLoansOrLines", outline = FALSE)

boxplot(test$nT60.89D, col = "lightblue", xlab="NumberOfTime60-89DaysPastDueNotWorse", outline = TRUE)

boxplot(test$nDependents, col = "lightblue", xlab="NumberOfDependents", outline = FALSE)

# histograms of the variables; only age is close to a normal distribution:
histogram(test$Rev)

histogram(test$age)

histogram(test$nT30.59D)

histogram(test$DebtRatio)

histogram(test$MonIn)

histogram(test$nOpenCandL)

histogram(test$nT90DLate)

histogram(test$nRELOrL)

histogram(test$nT60.89D)

histogram(test$nDependents)

4.3 Data manipulation:

After the statistical analysis of the data, we now clean the data and handle the missing values so that the models can be fitted and yield robust results. This step is required before moving on to the machine learning models. Here we replace the missing values with the median of each column in the train and test datasets. More generally, this phase prepares the data for the algorithms by cleaning the acquired data, which can also include reformatting and sorting specific fields and standardizing values.

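# cleaning the data: replace NAs in each column of train with the column median
# (a for loop returns NULL, so its result is not assigned to a variable)
for (i in seq_len(ncol(train))) {
  train[ , i][is.na(train[ , i])] <- median(train[ , i], na.rm = TRUE)
}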

test$MonIn <- ifelse(is.na(test$MonIn), round(median(test$MonIn, na.rm = TRUE), 2), test$MonIn)
test$nDependents <- ifelse(is.na(test$nDependents), round(median(test$nDependents, na.rm = TRUE), 2), test$nDependents)
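
The code above imputes each data set with its own medians. A common alternative, sketched here as an assumption rather than as the method used in this report, is to compute the medians on the training data only and reuse them for the test data. The helper impute_with and the toy data frames are hypothetical.

```r
# Hypothetical helper: fill NAs in the named columns with the supplied medians.
impute_with <- function(df, medians) {
  for (col in names(medians)) {
    missing <- is.na(df[[col]])
    df[[col]][missing] <- medians[[col]]
  }
  df
}

# Toy stand-ins for train and test (made-up values).
toy_train <- data.frame(MonIn = c(1000, NA, 3000), nDependents = c(0, 2, NA))
toy_test  <- data.frame(MonIn = c(NA, 5000),       nDependents = c(NA, 1))

# Medians are learned from the training data only, then applied to both sets.
train_medians <- vapply(toy_train, median, numeric(1), na.rm = TRUE)
toy_train <- impute_with(toy_train, train_medians)
toy_test  <- impute_with(toy_test, train_medians)
```

Reusing the training medians keeps the test data from influencing the preprocessing step.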

The next step is to convert the numeric values in the dependent variable to factors where number 1 is represented as D (default) and 0 as N.

# converting the train response to a factor with levels D and N, where 1 indicates default:
train$y <- factor(ifelse(train$y == 1, "D", "N"), levels = c("D", "N"))
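
After the conversion it is worth checking the level order and the class balance, since the descriptive statistics report a mean of 0.067 for y, that is, roughly 6.7% defaults. The sketch below uses a made-up response vector in place of train$y.

```r
# Toy response vector (made-up values); in the report this would be train$y.
y_num <- c(1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
y <- factor(ifelse(y_num == 1, "D", "N"), levels = c("D", "N"))

levels(y)                       # "D" comes first, so it is treated as the positive class
round(prop.table(table(y)), 3)  # class shares
```

With such an imbalance, accuracy alone says little, which is one reason the models are compared on ROC.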

5.0 Models Analysis:

5.1 Tuning the models:

When applying machine learning to a data set, it is critical to choose the best-performing model to deploy for prediction on the test data. We ran five different models: glm, pls, rpart, random forest (via ranger), and nnet. The caret package and its functions for tuning and building model specifications are used to run the models. The trainControl function changes the resampling strategy. The resampling type is controlled by the method option, which defaults to “boot”. Setting it to “repeatedcv” specifies repeated K-fold cross-validation, with the repeats argument controlling the number of repetitions. K is set by the number argument; we leave it at its default of 10.

ctrl <- trainControl(
  method = "repeatedcv", 
  repeats = 3,
  classProbs = TRUE, 
  summaryFunction = twoClassSummary
)
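
To make the resampling scheme concrete, the base-R sketch below builds repeated 10-fold splits by hand. It only illustrates the idea and is not caret's internal code (caret additionally stratifies the folds on the outcome); the function name repeated_kfold and the seed are assumptions.

```r
# Base-R illustration of repeated K-fold splitting (not caret's internal code).
repeated_kfold <- function(n, k = 10, repeats = 3, seed = 123) {
  set.seed(seed)
  splits <- list()
  for (r in seq_len(repeats)) {
    fold_id <- sample(rep(seq_len(k), length.out = n))  # random fold label per row
    for (f in seq_len(k)) {
      name <- sprintf("Fold%02d.Rep%d", f, r)
      splits[[name]] <- which(fold_id != f)             # training rows for this resample
    }
  }
  splits
}

splits <- repeated_kfold(n = 1000)
length(splits)          # 10 folds x 3 repeats = 30 resamples
range(lengths(splits))  # each training set holds 90% of the rows
```

This is consistent with the model output in this report, which lists 30 resamples with training sets of about 135,000 of the 150,000 rows.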

With the resampling method set, we pass ctrl to each model. We fit the five models here and compare them in the following section to choose the best-performing one for the test data. The tuneLength argument controls how many candidate values of each tuning parameter the train function evaluates. We set it to 3 because larger values are very time consuming. The method argument specifies the type of model. The function then selects the tuning parameters associated with the best results. Because we use a custom performance summary (twoClassSummary), the criterion to optimize must be stated explicitly, which we do with metric = “ROC” in the call to train.

We run all the models and select the one with the highest ROC value for the next phase, the prediction phase.
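
Instead of letting tuneLength pick the candidates, the grid can be written out explicitly. The sketch below reproduces the ranger candidates that appear in the random forest results of this report; passing the grid through the tuneGrid argument of train is shown only as a commented-out, assumed usage.

```r
# Candidate values copied from the ranger results table in this report.
rf_grid <- expand.grid(
  mtry          = c(2, 6, 10),
  splitrule     = c("gini", "extratrees"),
  min.node.size = 1,
  stringsAsFactors = FALSE
)
nrow(rf_grid)  # 6 candidate settings, each evaluated on every resample

# Assumed usage (not run here): pass the grid instead of tuneLength, e.g.
# train(y ~ ., data = train, method = "ranger", tuneGrid = rf_grid,
#       trControl = ctrl, metric = "ROC")
```

An explicit grid makes the search reproducible and easy to extend with further values.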

5.2 Running the models:

Fitglm <- train(
  y ~ .,
  data = train,
  method = "glm",
  preProc = c("center", "scale"),
  tuneLength = 3,
  trControl = ctrl,
  metric = "ROC"
)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Fitglm
## Generalized Linear Model 
## 
## 150000 samples
##     10 predictor
##      2 classes: 'D', 'N' 
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 134999, 135000, 134999, 135001, 135001, 135000, ... 
## Resampling results:
## 
##   ROC        Sens       Spec     
##   0.6983188  0.0432205  0.9976686
Fitpls <- train(
  y ~ .,
  data = train,
  method = "pls",
  preProc = c("center", "scale"),
  tuneLength = 3,
  trControl = ctrl,
  metric = "ROC"
)
## Warning in fitFunc(X, Y, ncomp, Y.add = Y.add, center = center, ...): No convergence in 100 iterations
Fitpls
## Partial Least Squares 
## 
## 150000 samples
##     10 predictor
##      2 classes: 'D', 'N' 
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 135000, 135000, 135001, 135001, 135000, 135000, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  ROC        Sens        Spec     
##   1      0.7121447  0.01466124  0.9991284
##   2      0.6761050  0.01466124  0.9991284
##   3      0.6854481  0.01466124  0.9991284
## 
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was ncomp = 1.
rpartFit <- train(
  y ~ .,
  data = train,
  method = "rpart",
  preProc = c("center", "scale"),
  tuneLength = 3,
  trControl = ctrl,
  metric = "ROC"
)
rpartFit
## CART 
## 
## 150000 samples
##     10 predictor
##      2 classes: 'D', 'N' 
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 134999, 135000, 135000, 135001, 134999, 134999, ... 
## Resampling results across tuning parameters:
## 
##   cp           ROC        Sens        Spec     
##   0.005186515  0.6570666  0.13627630  0.9926081
##   0.005984440  0.6570361  0.14558590  0.9917675
##   0.015908638  0.5711474  0.07441813  0.9951515
## 
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.005186515.
randomForestFit <- train(
  y ~ .,
  data = train,
  method = "ranger",
  preProc = c("center", "scale"),
  tuneLength = 3,
  trControl = ctrl,
  metric = "ROC"
)
randomForestFit
## Random Forest 
## 
## 150000 samples
##     10 predictor
##      2 classes: 'D', 'N' 
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 135000, 135000, 134999, 134999, 135000, 135000, ... 
## Resampling results across tuning parameters:
## 
##   mtry  splitrule   ROC        Sens       Spec     
##    2    gini        0.8615680  0.1674975  0.9919390
##    2    extratrees  0.8515919  0.1392387  0.9937107
##    6    gini        0.8445050  0.2015761  0.9872714
##    6    extratrees  0.8400120  0.1968206  0.9878906
##   10    gini        0.8407042  0.2051326  0.9864189
##   10    extratrees  0.8360981  0.2032374  0.9863737
## 
## Tuning parameter 'min.node.size' was held constant at a value of 1
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 2, splitrule = gini
##  and min.node.size = 1.
nnetFit <- train(
  y ~ .,
  data = train,
  method = "nnet",
  preProc = c("center", "scale"),
  tuneLength = 3,
  trControl = ctrl,
  metric = "ROC"
)
nnetFit
## Neural Network 
## 
## 150000 samples
##     10 predictor
##      2 classes: 'D', 'N' 
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 135000, 135000, 134999, 135001, 134999, 135001, ... 
## Resampling results across tuning parameters:
## 
##   size  decay  ROC        Sens        Spec     
##   1     0e+00  0.7924204  0.08683666  0.9948847
##   1     1e-04  0.7998071  0.07431461  0.9959040
##   1     1e-01  0.8377845  0.19516139  0.9885526
##   3     0e+00  0.8321541  0.10493291  0.9942632
##   3     1e-04  0.8320548  0.11975225  0.9933012
##   3     1e-01  0.8327676  0.17444891  0.9908364
##   5     0e+00  0.8392777  0.15893142  0.9914175
##   5     1e-04  0.8206215  0.12398115  0.9932487
##   5     1e-01  0.8332195  0.16783173  0.9911460
## 
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were size = 5 and decay = 0.

6.0 Models Comparison, ROC:

6.1 Summary of the models:

In this section we collect the models in a list to compare their performance. The twoClassSummary function computes measures specific to two-class problems, such as the area under the ROC curve, sensitivity, and specificity. These measures are available because classProbs was set to TRUE in trainControl.

# Creating a list of the models
model_list <- list(pls = Fitpls, glm = Fitglm, rf = randomForestFit, nnet = nnetFit, rpart = rpartFit)
# Passing the model_list to the resamples function:
resamples <- resamples(model_list)
# Summarizing the results of the models performed:
summary(resamples)
## 
## Call:
## summary.resamples(object = resamples)
## 
## Models: pls, glm, rf, nnet, rpart 
## Number of resamples: 30 
## 
## ROC 
##            Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## pls   0.6895343 0.7047903 0.7146217 0.7121447 0.7183862 0.7257409    0
## glm   0.6840607 0.6932489 0.6972747 0.6983188 0.7068069 0.7240385    0
## rf    0.8473742 0.8589586 0.8625180 0.8615680 0.8662044 0.8693813    0
## nnet  0.7996646 0.8265280 0.8358267 0.8392777 0.8554732 0.8660762    0
## rpart 0.6355080 0.6519274 0.6567414 0.6570666 0.6610814 0.6718036    0
## 
## Sens 
##              Min.    1st Qu.     Median       Mean    3rd Qu.       Max. NA's
## pls   0.005988024 0.01121635 0.01446360 0.01466124 0.01670388 0.02492522    0
## glm   0.027916251 0.03816395 0.04388979 0.04322050 0.04861613 0.05782652    0
## rf    0.150548355 0.15914159 0.16708209 0.16749751 0.17447657 0.18743769    0
## nnet  0.000000000 0.16155575 0.18145563 0.15893142 0.20059880 0.24426720    0
## rpart 0.100798403 0.11714855 0.13010967 0.13627630 0.15119760 0.18843470    0
## 
## Spec 
##            Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## pls   0.9986427 0.9989998 0.9991070 0.9991284 0.9993570 0.9995714    0
## glm   0.9961420 0.9974461 0.9977138 0.9976686 0.9979997 0.9984997    0
## rf    0.9891405 0.9914983 0.9920703 0.9919390 0.9926417 0.9930699    0
## nnet  0.9845681 0.9890518 0.9903551 0.9914175 0.9921239 1.0000000    0
## rpart 0.9883555 0.9917482 0.9934274 0.9926081 0.9939276 0.9952136    0

6.2 ROC plots:

The plots below show the resampled ROC values of the five models under study. The median ROC values are approximately 66%, 84%, 86%, 70%, and 71% for rpart, nnet, rf, glm, and pls, respectively. The random forest model has the highest value, so we choose it to predict y in the test data set.

# bwplot
bwplot(resamples, metric = "ROC")

# dotplot
dotplot(resamples, metric = "ROC")

# densityplot
densityplot(resamples, metric = "ROC", auto.key=TRUE)

############################################################
####### Applying randomForest model to the test data #######
############################################################

randomForestClasses <- predict(randomForestFit, newdata = test)
str(randomForestClasses)
##  Factor w/ 2 levels "D","N": 2 2 2 2 2 2 2 2 2 2 ...

6.3 Prediction based on the RandomForest model:

A random forest combines a large number of simple decision-tree models. The trees are trained independently of one another, and their predictions are combined to make the final prediction. Some randomness must therefore be introduced into the tree-building process so that the trees differ from one another.

That is what mtry does. In our case the selected model uses mtry = 2. The parameter controls how much randomness enters the tree-building process: it sets the number of input features a tree may consider at each split. Because different splits see different random subsets of features, it is (almost) impossible for all the trees to be identical.
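
The effect of mtry can be sketched in a few lines: at each split, only a random subset of the predictors is eligible. The example below draws such subsets for three hypothetical splits; the seed is arbitrary and the code is an illustration, not ranger's implementation.

```r
# Predictor names taken from the data set used in this report.
predictors <- c("Rev", "age", "nT30.59D", "DebtRatio", "MonIn",
                "nOpenCandL", "nT90DLate", "nRELOrL", "nT60.89D", "nDependents")
mtry <- 2

set.seed(42)
# Each split draws its own random subset of `mtry` candidate predictors.
candidates_per_split <- replicate(3, sample(predictors, mtry), simplify = FALSE)
print(candidates_per_split)
```

With mtry = 2 out of 10 predictors, any given predictor is available at only about one split in five, which is what decorrelates the trees.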

ggplot(randomForestFit)

rfProbs <- predict(randomForestFit, newdata = test, type = "prob")
# convert probabilities to class labels: probability > 0.5 = default (D):
cs.test.response$Probability <- factor(ifelse(cs.test.response$Probability > 0.5, "D", "N"))
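
The thresholding step above can be sketched on toy values to show how the cut-off changes the number of applicants flagged as default. The probabilities and the helper to_class are made up for illustration.

```r
# Made-up default probabilities for five applicants.
p_default <- c(0.05, 0.40, 0.55, 0.72, 0.10)

# Hypothetical helper: label "D" when the probability exceeds the cut-off.
to_class <- function(p, cutoff = 0.5) {
  factor(ifelse(p > cutoff, "D", "N"), levels = c("D", "N"))
}

table(to_class(p_default, cutoff = 0.5))  # cut-off used in this report
table(to_class(p_default, cutoff = 0.3))  # a stricter lender flags more applicants
```

Fixing the factor levels keeps both classes in the table even when one of them is empty.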

A confusion matrix is a helpful tool for examining the outcomes of a model’s predictions (true positives, true negatives, false positives, and false negatives).

Before creating the confusion matrix, we cut the probabilities in cs.test.response at a 50% threshold to convert them into a factor of class labels, using the ifelse() and factor() functions as shown above. The threshold depends on the lender’s risk tolerance and can vary from one lender to another. The confusion matrix is then computed as follows:

# model accuracy and sensitivity analysis:
confusionMatrix(data = randomForestClasses, cs.test.response$Probability)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     D     N
##          D  1423   428
##          N   282 99370
##                                           
##                Accuracy : 0.993           
##                  95% CI : (0.9925, 0.9935)
##     No Information Rate : 0.9832          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7968          
##                                           
##  Mcnemar's Test P-Value : 5.276e-08       
##                                           
##             Sensitivity : 0.83460         
##             Specificity : 0.99571         
##          Pos Pred Value : 0.76877         
##          Neg Pred Value : 0.99717         
##              Prevalence : 0.01680         
##          Detection Rate : 0.01402         
##    Detection Prevalence : 0.01824         
##       Balanced Accuracy : 0.91516         
##                                           
##        'Positive' Class : D               
## 

Sensitivity, the true positive rate, is 83%: the model identifies 83% of the cases labeled as default in the reference, which we consider acceptable. Specificity, the true negative rate, is 99.6%, meaning that almost all cases labeled as non-default in the reference are predicted as non-default by the model. Accuracy, the share of all cases classified correctly, is also high at 99.3%.
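
These figures can be recomputed by hand from the four cells of the confusion matrix, which makes the definitions explicit. The counts below are copied from the matrix above, with D as the positive class.

```r
# Counts copied from the confusion matrix above (positive class = "D").
tp <- 1423; fp <- 428
fn <- 282;  tn <- 99370

sensitivity <- tp / (tp + fn)                   # true positive rate
specificity <- tn / (tn + fp)                   # true negative rate
accuracy    <- (tp + tn) / (tp + fp + fn + tn)  # share classified correctly
ppv         <- tp / (tp + fp)                   # positive predictive value

round(c(sensitivity = sensitivity, specificity = specificity,
        accuracy = accuracy, ppv = ppv), 4)
```

The results agree with the Sensitivity, Specificity, Accuracy, and Pos Pred Value lines reported by confusionMatrix.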