options(warn =  -1) # Deactivating warnings

Context

The data comes from the “Give Me Some Credit” competition from Kaggle. The goal of this competition is to improve on the state of the art in credit scoring, by predicting the probability that somebody will experience financial distress in the next two years.

Loading libraries and datasets

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)
library(Amelia)

## Loading required package: Rcpp

## ## 
## ## Amelia II: Multiple Imputation
## ## (Version 1.8.1, built: 2022-11-18)
## ## Copyright (C) 2005-2023 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##

library(corrplot)

## corrplot 0.92 loaded

library(gridExtra)

## 
## Attaching package: 'gridExtra'

## The following object is masked from 'package:dplyr':
## 
##     combine

library(caret)

## Loading required package: lattice

library(ROSE)

## Loaded ROSE 0.0-4

library(randomForest)

## randomForest 4.7-1.1

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:gridExtra':
## 
##     combine

## The following object is masked from 'package:ggplot2':
## 
##     margin

## The following object is masked from 'package:dplyr':
## 
##     combine

train_set <- read.csv("cs-training.csv")
test_set <- read.csv("cs-test.csv")
test_labels <- read.csv("sampleEntry.csv")

Examining the structure of the dataset for training

str(train_set)

## 'data.frame':    150000 obs. of  12 variables:
##  $ X                                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ SeriousDlqin2yrs                    : int  1 0 0 0 0 0 0 0 0 0 ...
##  $ RevolvingUtilizationOfUnsecuredLines: num  0.766 0.957 0.658 0.234 0.907 ...
##  $ age                                 : int  45 40 38 30 49 74 57 39 27 57 ...
##  $ NumberOfTime30.59DaysPastDueNotWorse: int  2 0 1 0 1 0 0 0 0 0 ...
##  $ DebtRatio                           : num  0.803 0.1219 0.0851 0.036 0.0249 ...
##  $ MonthlyIncome                       : int  9120 2600 3042 3300 63588 3500 NA 3500 NA 23684 ...
##  $ NumberOfOpenCreditLinesAndLoans     : int  13 4 2 5 7 3 8 8 2 9 ...
##  $ NumberOfTimes90DaysLate             : int  0 0 1 0 0 0 0 0 0 0 ...
##  $ NumberRealEstateLoansOrLines        : int  6 0 0 0 1 1 3 0 0 4 ...
##  $ NumberOfTime60.89DaysPastDueNotWorse: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ NumberOfDependents                  : int  2 1 0 0 0 1 0 0 NA 2 ...

sum(duplicated(train_set$X)) # Looking for duplicated records in the dataset

## [1] 0

summary(train_set) # Calculating basic statistics for the dataset

##        X          SeriousDlqin2yrs  RevolvingUtilizationOfUnsecuredLines
##  Min.   :     1   Min.   :0.00000   Min.   :    0.00                    
##  1st Qu.: 37501   1st Qu.:0.00000   1st Qu.:    0.03                    
##  Median : 75001   Median :0.00000   Median :    0.15                    
##  Mean   : 75001   Mean   :0.06684   Mean   :    6.05                    
##  3rd Qu.:112500   3rd Qu.:0.00000   3rd Qu.:    0.56                    
##  Max.   :150000   Max.   :1.00000   Max.   :50708.00                    
##                                                                         
##       age        NumberOfTime30.59DaysPastDueNotWorse   DebtRatio       
##  Min.   :  0.0   Min.   : 0.000                       Min.   :     0.0  
##  1st Qu.: 41.0   1st Qu.: 0.000                       1st Qu.:     0.2  
##  Median : 52.0   Median : 0.000                       Median :     0.4  
##  Mean   : 52.3   Mean   : 0.421                       Mean   :   353.0  
##  3rd Qu.: 63.0   3rd Qu.: 0.000                       3rd Qu.:     0.9  
##  Max.   :109.0   Max.   :98.000                       Max.   :329664.0  
##                                                                         
##  MonthlyIncome     NumberOfOpenCreditLinesAndLoans NumberOfTimes90DaysLate
##  Min.   :      0   Min.   : 0.000                  Min.   : 0.000         
##  1st Qu.:   3400   1st Qu.: 5.000                  1st Qu.: 0.000         
##  Median :   5400   Median : 8.000                  Median : 0.000         
##  Mean   :   6670   Mean   : 8.453                  Mean   : 0.266         
##  3rd Qu.:   8249   3rd Qu.:11.000                  3rd Qu.: 0.000         
##  Max.   :3008750   Max.   :58.000                  Max.   :98.000         
##  NA's   :29731                                                            
##  NumberRealEstateLoansOrLines NumberOfTime60.89DaysPastDueNotWorse
##  Min.   : 0.000               Min.   : 0.0000                     
##  1st Qu.: 0.000               1st Qu.: 0.0000                     
##  Median : 1.000               Median : 0.0000                     
##  Mean   : 1.018               Mean   : 0.2404                     
##  3rd Qu.: 2.000               3rd Qu.: 0.0000                     
##  Max.   :54.000               Max.   :98.0000                     
##                                                                   
##  NumberOfDependents
##  Min.   : 0.000    
##  1st Qu.: 0.000    
##  Median : 0.000    
##  Mean   : 0.757    
##  3rd Qu.: 1.000    
##  Max.   :20.000    
##  NA's   :3924

The dataset has 12 variables and 150000 observations.The variables can be described as follows:

SeriousDlqin2yrs which indicates if a person experienced 90 days past due delinquency or worse.
RevolvingUtilizationOfUnsecuredLines which indicates the total balance on credit cards and personal lines of credit except real estate and no installment debt like car loans divided by the sum of credit limits.
Age which indicates the borrower’s age in years.
NumberOfTime30-59DaysPastDueNotWorse which indicates the number of times borrower has been 30-59 days past due but no worse in the last 2 years.
DebtRatio which indicates the monthly debt payments, alimony,living costs divided by monthy gross income.
MonthlyIncome which is self explanatory.
NumberOfOpenCreditLinesAndLoans which indicates the number of open loans (installment like car loan or mortgage) and lines of credit (e.g. credit cards).
NumberOfTimes90DaysLate which indicates the number of times the borrower has been 90 days or more past due.
NumberRealEstateLoansOrLines which indicates the number of mortgage and real estate loans including home equity lines of credit.
NumberOfTime60-89DaysPastDueNotWorse which indicates the number of times the borrower has been 60-89 days past due but no worse in the last 2 years.
NumberOfDependents which indicates the number of dependents in family excluding themselves (spouse, children etc.).

Some of the variables are not correctly identified as factors, there are missing values for Monthly Income and Number of Depedents, there are no duplicated records but there are outliers in the dataset.

Examining the structure of the data set for testing

str(test_set)

## 'data.frame':    101503 obs. of  12 variables:
##  $ X                                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ SeriousDlqin2yrs                    : logi  NA NA NA NA NA NA ...
##  $ RevolvingUtilizationOfUnsecuredLines: num  0.8855 0.4633 0.0433 0.2803 1 ...
##  $ age                                 : int  43 57 59 38 27 63 50 79 68 23 ...
##  $ NumberOfTime30.59DaysPastDueNotWorse: int  0 0 0 1 0 0 0 1 0 98 ...
##  $ DebtRatio                           : num  0.1775 0.5272 0.6876 0.926 0.0199 ...
##  $ MonthlyIncome                       : int  5700 9141 5083 3200 3865 4140 0 3301 NA 0 ...
##  $ NumberOfOpenCreditLinesAndLoans     : int  4 15 12 7 4 4 5 8 4 0 ...
##  $ NumberOfTimes90DaysLate             : int  0 0 0 0 0 0 0 0 0 98 ...
##  $ NumberRealEstateLoansOrLines        : int  0 4 1 2 0 0 0 1 1 0 ...
##  $ NumberOfTime60.89DaysPastDueNotWorse: int  0 0 0 0 0 0 0 0 0 98 ...
##  $ NumberOfDependents                  : int  0 2 2 0 1 1 3 1 0 0 ...

summary(train_set) # Calculating basic statistics

##        X          SeriousDlqin2yrs  RevolvingUtilizationOfUnsecuredLines
##  Min.   :     1   Min.   :0.00000   Min.   :    0.00                    
##  1st Qu.: 37501   1st Qu.:0.00000   1st Qu.:    0.03                    
##  Median : 75001   Median :0.00000   Median :    0.15                    
##  Mean   : 75001   Mean   :0.06684   Mean   :    6.05                    
##  3rd Qu.:112500   3rd Qu.:0.00000   3rd Qu.:    0.56                    
##  Max.   :150000   Max.   :1.00000   Max.   :50708.00                    
##                                                                         
##       age        NumberOfTime30.59DaysPastDueNotWorse   DebtRatio       
##  Min.   :  0.0   Min.   : 0.000                       Min.   :     0.0  
##  1st Qu.: 41.0   1st Qu.: 0.000                       1st Qu.:     0.2  
##  Median : 52.0   Median : 0.000                       Median :     0.4  
##  Mean   : 52.3   Mean   : 0.421                       Mean   :   353.0  
##  3rd Qu.: 63.0   3rd Qu.: 0.000                       3rd Qu.:     0.9  
##  Max.   :109.0   Max.   :98.000                       Max.   :329664.0  
##                                                                         
##  MonthlyIncome     NumberOfOpenCreditLinesAndLoans NumberOfTimes90DaysLate
##  Min.   :      0   Min.   : 0.000                  Min.   : 0.000         
##  1st Qu.:   3400   1st Qu.: 5.000                  1st Qu.: 0.000         
##  Median :   5400   Median : 8.000                  Median : 0.000         
##  Mean   :   6670   Mean   : 8.453                  Mean   : 0.266         
##  3rd Qu.:   8249   3rd Qu.:11.000                  3rd Qu.: 0.000         
##  Max.   :3008750   Max.   :58.000                  Max.   :98.000         
##  NA's   :29731                                                            
##  NumberRealEstateLoansOrLines NumberOfTime60.89DaysPastDueNotWorse
##  Min.   : 0.000               Min.   : 0.0000                     
##  1st Qu.: 0.000               1st Qu.: 0.0000                     
##  Median : 1.000               Median : 0.0000                     
##  Mean   : 1.018               Mean   : 0.2404                     
##  3rd Qu.: 2.000               3rd Qu.: 0.0000                     
##  Max.   :54.000               Max.   :98.0000                     
##                                                                   
##  NumberOfDependents
##  Min.   : 0.000    
##  1st Qu.: 0.000    
##  Median : 0.000    
##  Mean   : 0.757    
##  3rd Qu.: 1.000    
##  Max.   :20.000    
##  NA's   :3924

sum(duplicated(test_set$X)) # Looking for duplicated records in the dataset

## [1] 0

In the dataset for training there are 12 variables and 101503 observations, once again the categorical variables are not correctly identified as factors, there are no duplicated records but there are outliers in the data for all variables.

Data Cleaning

train_set$X <- NULL # Dropping the X variable since is not relevant for the analysis in the data set for training.

Cleaning the age variable

summary(train_set$age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0    41.0    52.0    52.3    63.0   109.0

summary(test_set$age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   21.00   41.00   52.00   52.41   63.00  104.00

# Distribution of the Borrower's age in the training and testing datasets

p1 <- ggplot(train_set, aes(age,)) + 
  geom_bar() +coord_cartesian(x=c(0,110), ylim = c(0,3800)) + 
  labs(title="Distribution of the borrower's age in the dataset for training", x="Age", y="Number of   borrowers")

p2 <- ggplot(test_set, aes(age,)) + 
  geom_bar()+ coord_cartesian(x=c(0,110), ylim = c(0,2800)) +
  labs(title = "Distribution of the borrower's age in the dataset for training", x="age", y="Number of borrowers")

grid.arrange(p1, p2, ncol=1, nrow=2)

# Outliers for age in the dataset for training and testing
par(mfrow=c(2,1))
boxplot(train_set$age,main="Boxplot: Outliers of age in the dataset for training",
        xlab="Age",  horizontal = TRUE)
boxplot(test_set$age,main="Boxplot: Outliers of age in the dataset for testing",
        xlab="Age",  horizontal = TRUE)

boxplot.stats(train_set$age)$stats

## [1] 21 41 52 63 96

boxplot.stats(test_set$age)$stats

## [1] 21 41 52 63 96

The descriptive statistics show that the average age is 52 years for both datasets and as we can see 75% of the borrowers have less than 63 years. With that in mind, the analysis is made for the population between 21-96 years old.

# Selecting the borrowers between 21 and 96 years old. 
train_set <- train_set[train_set$age<=96,] 
train_set <- train_set[train_set$age>=21,] # Since 21 years old is the minimum age to get a credit card in the US.  
test_set <- test_set[test_set$age<= 96,] # Dropping the outlier for age in the dataset for testing
test_set <- test_set[test_set$age>=21,] # Since 21 years old is the minimum age to get a credit card in the US.

# Double checking that outliers for age were removed

par(mfrow=c(2,1))
boxplot(train_set$age,main="Boxplo for age in the dataset for training",
        xlab="Age",  horizontal = TRUE)
boxplot(test_set$age,main="Boxplot for age in the dataset for testing",
        xlab="Age",  horizontal = TRUE)

Cleaning the Revolving Utilization Of Unsecured Lines Variable

# Outliers for Revolving Utilization in the dataset for training and testing
par(mfrow=c(2,1))
boxplot(train_set$RevolvingUtilizationOfUnsecuredLines,main="Boxplot: Outliers of Revolving Utilization in the dataset for training",
        xlab="Revolving Utilization of Unsecured Lines",  horizontal = TRUE)
boxplot(test_set$RevolvingUtilizationOfUnsecuredLines,main="Boxplot: Outliers of Revolving Utilization in the dataset for testing",
        xlab="Revolving Utilization of Unsecured Lines",  horizontal = TRUE)

boxplot.stats(train_set$RevolvingUtilizationOfUnsecuredLines)$stats # Retrieving the outliers of Revolving Utilization Of Unsecured Lines in the dataset for training

## [1] 0.00000000 0.02990146 0.15431089 0.55913626 1.35215947

boxplot.stats(test_set$RevolvingUtilizationOfUnsecuredLines)$stats # Retrieving the outliers of Revolving Utilization Of Unsecured Lines in the dataset for testing

## [1] 0.00000000 0.03014561 0.15261957 0.56425639 1.36502077

Since revolving utilization refers to how much of the credit limit is used, it makes sense to assume that a borrower can not exceed the 100% credit limit.

# Dropping outliers of Revolving Utilization  in the training and testing datasets.

train_set <- train_set[train_set$RevolvingUtilizationOfUnsecuredLines <= 1,] # Since a 100%  revolving utilization could be the maximum amount that a borrower could be in debt. 
test_set <- test_set[test_set$RevolvingUtilizationOfUnsecuredLines<= 1,] # Since a 100%  revolving utilization could be the maximum amount that a borrower could be in debt. 

# Checking out that all the outliers were removed.
par(mfrow=c(2,1))
boxplot(train_set$RevolvingUtilizationOfUnsecuredLines,main="Boxplot:Revolving Utilization of Unsecurd Lines in the dataset for training",
        xlab="Percentage of Revolving",  horizontal = TRUE)

boxplot(test_set$RevolvingUtilizationOfUnsecuredLines,main="Boxplot:Revolving Utilization of Unsecurd Lines in the dataset for testing",
        xlab="Percentage of Revolving Utilization",  horizontal = TRUE)

Cleaning the Number of times that a borrowers have default in a 30-59 days period in two years

summary(train_set$NumberOfTime30.59DaysPastDueNotWorse)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.4077  0.0000 98.0000

summary(test_set$NumberOfTime30.59DaysPastDueNotWorse)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.4421  0.0000 98.0000

As we can see the maximum number of defaults in a 30-59 days period is 98, so let’s investigate further.

train_set%>% filter(NumberOfTime30.59DaysPastDueNotWorse>=98)

test_set %>% filter(NumberOfTime30.59DaysPastDueNotWorse>=98)

We see a clear pattern, the number of credit lines or real estate loans for these observations is 0 , we see as well that there are missing values for the monthly income. With that in mind it is reasonable to assume that they are errors in the datasets.

# Dropping extreme values for the  number of defaults in a 30-59 day period for the training and testing datasets
train_set  <-train_set[train_set$NumberOfTime30.59DaysPastDueNotWorse  < 98,]
test_set   <- test_set[test_set$NumberOfTime30.59DaysPastDueNotWorse < 98, ]

# Looking for outliers in the number of defaults in a 30-59 day period
par(mfrow =c(1,2))
plot(train_set$NumberOfTime30.59DaysPastDueNotWorse , ylab="Number of defaults", main = "# of defaults in a 30-59 day period(training)")
plot(test_set$NumberOfTime30.59DaysPastDueNotWorse, ylab=" Number of defaults", main = "# of defaults in a 30-59 day period(testing)")

# Dropping extreme values for the  number of defaults in a 30-59 day period for the training and testing datasets
train_set  <-train_set[train_set$NumberOfTime30.59DaysPastDueNotWorse  < 20,]
test_set   <- test_set[test_set$NumberOfTime30.59DaysPastDueNotWorse < 20, ]

# Checking out that all the outliers were removed.
p1 <- ggplot(train_set, aes(NumberOfTime30.59DaysPastDueNotWorse,)) + geom_bar() +coord_cartesian(x=c(0,15), ylim = c(0,125000)) + labs(title="Distribution of defaulters in a 30-59 day period in the dataset for training", x="Number of defaults", y="Number of borrowers")

p2 <- ggplot(test_set, aes(NumberOfTime30.59DaysPastDueNotWorse,)) + geom_bar()+
  coord_cartesian(x=c(0,15), ylim = c(0,90000)) + labs(title = "Distribution of defaulters in a 30-59 day period in the dataset for testing", x="Number of defaults", y="Number of borrowers")
grid.arrange(p1, p2, ncol=1, nrow=2)

The plots show that most borrowers have between 0 and 3 defaults. With that in mind, borrowers with 4 or more defaults will be grouped together to get a more representative category.

Cleaning the Number of times that a borrowers had defaulted in a 60-89 day period

summary(train_set$NumberOfTime60.89DaysPastDueNotWorse)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.00000  0.05649  0.00000 11.00000

summary(test_set$NumberOfTime60.89DaysPastDueNotWorse)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.05594 0.00000 9.00000

# Looking for outliers in the number of defaults in a 30-59 day period
par(mfrow =c(1,2))
plot(train_set$NumberOfTime60.89DaysPastDueNotWorse , ylab="Number of defaults", main = "# of defaults in a 60-89 day period(training)")
plot(test_set$NumberOfTime60.89DaysPastDueNotWorse, ylab=" Number of defaults", main = "# of defaults in a 60-89 day period(testing)")

# Distribution of defaulters in a 60-89 day period.
p3 <- ggplot(train_set, aes(NumberOfTime60.89DaysPastDueNotWorse,)) + geom_bar() +coord_cartesian(x=c(0,11), ylim = c(0,150000)) + labs(title="Distribution of defaulters in a 60-89 day period in the dataset for training", x="Number of defaults", y="Number of borrowers")

p4 <- ggplot(test_set, aes(NumberOfTime60.89DaysPastDueNotWorse,)) + geom_bar() +
  coord_cartesian(x=c(0,8), ylim = c(0,100000)) + labs(title = "Distribution of defaulters in a 60-89 day period in the dataset for testing", x="Number of defaults", y="Number of borrowers")

grid.arrange(p3, p4, ncol=1, nrow=2)

The plots show that most borrowers have between 0 and 2 defaults. With that in mind, borrowers with 3 or more defaults will be grouped together to get a more representative category.

Cleaning the Number of times that a borrower had defaulted in a 90 day period

summary(train_set$NumberOfTimes90DaysLate)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.00000  0.07806  0.00000 17.00000

summary(test_set$NumberOfTimes90DaysLate)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.00000  0.07856  0.00000 18.00000

# Looking for outliers in the number of defaults in a 30-59 day period
par(mfrow =c(1,2))
plot(train_set$NumberOfTimes90DaysLate , ylab="Number of defaults", main = "# of defaults in a 90 day period(training)")
plot(test_set$NumberOfTimes90DaysLate, ylab=" Number of defaults", main = "# of defaults in a 90 day period(testing)")

# Distribution of defaulters in a 90 day period.
p3 <- ggplot(train_set, aes(NumberOfTime60.89DaysPastDueNotWorse,)) + geom_bar() +coord_cartesian(x=c(0,8), ylim = c(0,150000)) + labs(title="Distribution of defaulters in a 90 day period in the dataset for training", x="Number of defaults", y="Number of borrowers")

p4 <- ggplot(test_set, aes(NumberOfTime60.89DaysPastDueNotWorse,)) + geom_bar()+
  coord_cartesian(x=c(0,8), ylim = c(0,100000)) + labs(title = "Distribution of defaulters in a 90 day period in the dataset for testing", x="Number of defaults", y="Number of borrowers")
grid.arrange(p3, p4, ncol=1, nrow=2)

The plots show that most borrowers have between 0 and 2 defaults. With that in mind, borrowers with more than 3 defaults will be grouped together to get a more representative category.

Cleaning the number of open credit lines and loans open variables

summary(train_set$NumberOfOpenCreditLinesAndLoans)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   5.000   8.000   8.516  11.000  58.000

summary(test_set$NumberOfOpenCreditLinesAndLoans)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   5.000   8.000   8.519  11.000  85.000

# Distribution of open credit lines and loans open
p3 <- ggplot(train_set, aes(NumberOfOpenCreditLinesAndLoans,)) + geom_bar() +coord_cartesian(x=c(0,58), ylim = c(0,14000)) + labs(title="Distribution of credit lines and loans open", x="Number of credit lines and loans", y="Number of borrowers")

p4 <- ggplot(test_set, aes(NumberOfOpenCreditLinesAndLoans,)) + geom_bar()+
  coord_cartesian(x=c(0,85), ylim = c(0,10000)) + labs(title = "Distribution of credit lines and loans open", x="Number of credit lines and loans", y="Number of borrowers")
grid.arrange(p3, p4, ncol=1, nrow=2)

The descriptive statistics shows that only 25% of the borrowers have more than 11 credit lines or loans open. In order to simplify the datasets, the maximum value for credit lines or loans is set to 40 and borrers with 20 or more credit lines are grouped together to get a more representative category.

# Dropping outliers of the  number of credit lines and loans open for the training and testing datasets
train_set  <-train_set[train_set$NumberOfOpenCreditLinesAndLoans <=40,]
test_set   <- test_set[test_set$NumberOfOpenCreditLinesAndLoans <= 40, ]

# Distribution of open credit lines and loans open
p3 <- ggplot(train_set, aes(NumberOfOpenCreditLinesAndLoans,)) + geom_bar() +coord_cartesian(x=c(0,30), ylim = c(0,15000)) + labs(title="Distribution of credit lines and loans open", x="Number of credit lines and loans open for the training dataset", y="Number of borrowers")

p4 <- ggplot(test_set, aes(NumberOfOpenCreditLinesAndLoans,)) + geom_bar()+
  coord_cartesian(x=c(0,30), ylim = c(0,10000)) + labs(title = "Distribution of credit lines and loans open for the testing", x="Number of credit lines and loans open", y="Number of borrowers")

grid.arrange(p3, p4, ncol=1, nrow=2)

Cleaning the number of mortgage and real estate loans including home equity lines of credit

summary(train_set$NumberRealEstateLoansOrLines)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   1.000   1.027   2.000  32.000

summary(test_set$NumberRealEstateLoansOrLines)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   1.000   1.023   2.000  29.000

# Distribution of mortgage and real estate loans
p3 <- ggplot(train_set, aes(NumberRealEstateLoansOrLines,)) + geom_bar() +coord_cartesian(x=c(0,18), ylim = c(0,60000)) + labs(title=" Distribution of mortgage and real estate loans for the training dataset", x="Number of mortgage and real estate loans", y="Number of borrowers")

p4 <- ggplot(test_set, aes(NumberRealEstateLoansOrLines,)) + geom_bar()+
  coord_cartesian(x=c(0,14), ylim = c(0,40000)) + labs(title = " Distribution of mortgage and real estate loans for testing dataset", x="Number of mortgage and real estate loans", y="Number of borrowers")

grid.arrange(p3, p4, ncol=1, nrow=2)

The descriptive statistics shows that only 25% of the borrowers have more than two mortgage or real state loans open. With that in mind borrowers with 3 or more mortage loan are grouped together to get a more representative category.

Cleaning the debt ratio variable

summary(train_set$DebtRatio)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##      0.0      0.2      0.4    356.2      0.9 329664.0

summary(test_set$DebtRatio)

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##      0.00      0.17      0.36    346.79      0.85 268326.00

For this variables it’s important to keep in mind, that it measures the proportion of debt in comparison to applicant’s income. The median it’s 0.40% for the training set and 36% for the testing set. With that in mind the values 329664 and 268326.00 are clearly errors in the dataset.

train_set  <-train_set[train_set$DebtRatio != 329664,] # Dropping extreme value in the dataset for training
test_set <- test_set[test_set$DebtRatio != 268326.00, ] # Dropping extreme value in the dataset for testing

# Outliers of debt ratio for the training and testing datasets
par(mfrow=c(2,1))
boxplot(train_set$DebtRatio,main="Boxplot:Debt Ratio for the training dataset",
        xlab="Debt Ratio",  horizontal = TRUE)
boxplot(test_set$DebtRatio,main="Boxplot:Debt Ratio for the testing dataset",
        xlab="Debt ratio",  horizontal = TRUE)

boxplot.stats(train_set$DebtRatio)$stats  # Retrieving the outliers of the debt ratio in the dataset for training

## [1] 0.0000000 0.1759553 0.3669957 0.8718086 1.9140893

boxplot.stats(test_set$DebtRatio)$stats # Retrieving the outlier of the debt ratio in the dataset for testing

## [1] 0.0000000 0.1746405 0.3648610 0.8546931 1.8747084

# Dropping outliers of the debt ratio variable for the training and testing datasets
train_set  <-train_set[train_set$DebtRatio <= 1.9140893,]  # Dropping outliers of debt ratio in the dataset for training
test_set <- test_set[test_set$DebtRatio <= 1.8747084, ]    # Dropping outliers of debt ratio in the dataset for testing

par(mfrow=c(2,1))
boxplot(train_set$DebtRatio,main="Boxplot:Debt Ratio for the training dataset",
        xlab="Debt Ratio",  horizontal = TRUE)

boxplot(test_set$DebtRatio,main="Boxplot:Debt Ratio for the testing dataset",
        xlab="Debt ratio",  horizontal = TRUE)

As we can see there are still outliers in the dataset so the process is going to be repeated.

# Retrieving outliers for the training and testing datasets
boxplot.stats(train_set$DebtRatio)$stats  # Retrieving the outliers of the debt ratio in the dataset for training

## [1] 0.0000000 0.1325928 0.2843993 0.4592201 0.9491500

boxplot.stats(test_set$DebtRatio)$stats # Retrieving the outlier of the debt ratio in the dataset for testing

## [1] 0.0000000 0.1319718 0.2841397 0.4579237 0.9468047

# Dropping outliers of the debt ratio variable for the training and testing datasets
train_set  <-train_set[train_set$DebtRatio <= 9491500,]  # Dropping outliers of debt ratio in the dataset for training
test_set <- test_set[test_set$DebtRatio <= 0.9468047, ]    # Dropping outliers of debt ratio in the dataset for testing

par(mfrow=c(2,1))
boxplot(train_set$DebtRatio,main="Boxplot:Debt Ratio for the training dataset",
        xlab="Debt Ratio",  horizontal = TRUE)

boxplot(test_set$DebtRatio,main="Boxplot:Debt Ratio for the testing dataset",
        xlab="Debt ratio",  horizontal = TRUE)

There are still outliers in the dataset so the process is going to be reapeated.

# Retrieving outliers for the training and testing datasets
boxplot.stats(train_set$DebtRatio)$stats  # Retrieving the outliers of the debt ratio in the dataset for training

## [1] 0.0000000 0.1325928 0.2843993 0.4592201 0.9491500

boxplot.stats(test_set$DebtRatio)$stats # Retrieving the outlier of the debt ratio in the dataset for testing

## [1] 0.0000000 0.1254341 0.2726197 0.4322744 0.8925144

# Dropping outliers of the debt ratio variable for the training and testing datasets
train_set  <-train_set[train_set$DebtRatio <= 0.9491500,]  # Dropping outliers of debt ratio in the dataset for training
test_set <- test_set[test_set$DebtRatio <= 0.8925144, ]    # Dropping outliers of debt ratio in the dataset for testing

# Checking out that all the outliers were removed.
par(mfrow=c(2,1))
boxplot(train_set$DebtRatio,main="Boxplot:Debt Ratio for the training dataset",
        xlab="Debt Ratio",  horizontal = TRUE)

boxplot(test_set$DebtRatio,main="Boxplot:Debt Ratio for the testing dataset",
        xlab="Debt ratio",  horizontal = TRUE)

There are still outliers in the dataset so the process is going to be repeated.

# Retrieving outliers for the training and testing datasets
boxplot.stats(train_set$DebtRatio)$stats  # Retrieving the outliers of the debt ratio in the dataset for training

## [1] 0.0000000 0.1253605 0.2727415 0.4328667 0.8941059

boxplot.stats(test_set$DebtRatio)$stats # Retrieving the outlier of the debt ratio in the dataset for testing

## [1] 0.0000000 0.1241627 0.2704673 0.4275277 0.8824974

# Dropping outliers of the debt ratio variable for the training and testing datasets
train_set  <-train_set[train_set$DebtRatio <= 0.8941059,]  # Dropping outliers of debt ratio in the dataset for training
test_set <- test_set[test_set$DebtRatio <=0.8824974, ]    # Dropping outliers of debt ratio in the dataset for testing

# Checking out that all the outliers were removed.
par(mfrow=c(2,1))
boxplot(train_set$DebtRatio,main="Boxplot:Debt Ratio for the training dataset",
        xlab="Debt Ratio",  horizontal = TRUE)

boxplot(test_set$DebtRatio,main="Boxplot:Debt Ratio for the testing dataset",
        xlab="Debt ratio",  horizontal = TRUE)

There are still outliers in the dataset so the process is going to be reapeated.

# Retrieving outliers for the training and testing datasets
boxplot.stats(train_set$DebtRatio)$stats  # Retrieving the outliers of the debt ratio in the dataset for training

## [1] 0.0000000 0.1238715 0.2703160 0.4278875 0.8839050

boxplot.stats(test_set$DebtRatio)$stats # Retrieving the outlier of the debt ratio in the dataset for testing

## [1] 0.0000000 0.1239277 0.2699323 0.4266657 0.8806882

# Dropping outliers of the debt ratio variable for the training and testing datasets
train_set  <-train_set[train_set$DebtRatio <= 0.8839050,]  # Dropping outliers of debt ratio in the dataset for training
test_set <- test_set[test_set$DebtRatio <=0.8806882, ]    # Dropping outliers of debt ratio in the dataset for testing

# Checking out that all the outliers were removed.
par(mfrow=c(2,1))
boxplot(train_set$DebtRatio,main="Boxplot:Debt Ratio for the training dataset",
        xlab="Debt Ratio",  horizontal = TRUE)

boxplot(test_set$DebtRatio,main="Boxplot:Debt Ratio for the testing dataset",
        xlab="Debt ratio",  horizontal = TRUE)

There are still outliers in the dataset so the process is going to be reapeated one more time.

# Retrieving outliers for the training and testing datasets
boxplot.stats(train_set$DebtRatio)$stats  # Retrieving the outliers of the debt ratio in the dataset for training

## [1] 0.0000000 0.1235546 0.2697925 0.4268852 0.8818719

boxplot.stats(test_set$DebtRatio)$stats # Retrieving the outlier of the debt ratio in the dataset for testing

## [1] 0.0000000 0.1238924 0.2698811 0.4265703 0.8805249

# Dropping outliers of the debt ratio variable for the training and testing datasets
train_set  <-train_set[train_set$DebtRatio <= 0.8,]  # Dropping outliers of debt ratio in the dataset for training
test_set <- test_set[test_set$DebtRatio <= 0.8, ]    # Dropping outliers of debt ratio in the dataset for testing

# Checking out that all the outliers were removed.
par(mfrow=c(2,1))
boxplot(train_set$DebtRatio,main="Boxplot:Debt Ratio for the training dataset",
        xlab="Debt Ratio",  horizontal = TRUE)

boxplot(test_set$DebtRatio,main="Boxplot:Debt Ratio for the testing dataset",
        xlab="Debt ratio",  horizontal = TRUE)

Cleaning the monthly income variable

summary(train_set$MonthlyIncome) # Basic statistics of the monthly income in the dataset for training

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       0    3750    5750    7078    8555 3008750    1520

summary(test_set$MonthlyIncome) # Basic statistics of the monthly income in the dataset for testing

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       0    3750    5750    7276    8500 7727000    1035

The descriptive statistics show that the average monthly income is $7078 for the training dataset and $7276 for the testing dataset. Only 25% of the borrowers have a monthly income greater than $8555 for the training dataset and $8500 for the testing dataset, respectively. There are missing values for this variable and since we have a lot of observations it seems reasonable to drop these observations before applying any sophisticated imputation method.

train_set <- train_set %>% filter(!is.na(MonthlyIncome)) # Droping NA'S in the dataset for training
test_set <- test_set %>% filter(!is.na(MonthlyIncome)) # Dropping NA'S in the dataset for testing

summary(train_set$MonthlyIncome) # Basic statistics of the monthly income in the dataset for training

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    3750    5750    7078    8555 3008750

summary(test_set$MonthlyIncome) # Basic statistics of the monthly income in the dataset for testing

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    3750    5750    7276    8500 7727000

# Looking for outliers in the monthly income for the training and testing datasets
par(mfrow=c(2,1))
boxplot(train_set$MonthlyIncome,main="Boxplot:Monthly Income for the training dataset",
        xlab="Monthly Income",  horizontal = TRUE)

boxplot(test_set$MonthlyIncome,main="Boxplot:Monthly Income for the testing dataset",
        xlab="Monthly Income",  horizontal = TRUE)

boxplot.stats(train_set$MonthlyIncome)$stats

## [1]     0  3750  5750  8555 15761

boxplot.stats(test_set$MonthlyIncome)$stats

## [1]     0  3750  5750  8500 15623

# Dropping outliers of the monthly income variable for the training and testing datasets
train_set  <-train_set[train_set$MonthlyIncome<=  15761,]  # Dropping outliers of the monthly income in the dataset for training
test_set <- test_set[test_set$MonthlyIncome <= 15623, ]    # Dropping outliers of the monthly income in the dataset for testing

# Checking out that all the outliers were removed.

par(mfrow=c(2,1))
boxplot(train_set$MonthlyIncome,main="Boxplot:Monthly Income for the training dataset",
        xlab="Monthly Income",  horizontal = TRUE)

boxplot(test_set$MonthlyIncome,main="Boxplot:Monthly Income for the testing dataset",
        xlab="Monthly Income",  horizontal = TRUE)

There are missing values so the process is reapeated.

boxplot.stats(train_set$MonthlyIncome)$stats

## [1]     0  3640  5500  8083 14746

boxplot.stats(test_set$MonthlyIncome)$stats

## [1]     0  3650  5500  8024 14584

# Dropping outliers of the monthly income variable for the training and testing datasets
train_set  <-train_set[train_set$MonthlyIncome<= 14746,]  # Dropping outliers of debt ratio in the dataset for training
test_set <- test_set[test_set$MonthlyIncome <= 14584, ]    # Dropping outliers of debt ratio in the dataset for testing

# Checking out that all the outliers were removed.

par(mfrow=c(2,1))
boxplot(train_set$MonthlyIncome,main="Boxplot:Monthly Income for the training dataset",
        xlab="Monthly Income",  horizontal = TRUE)

boxplot(test_set$MonthlyIncome,main="Boxplot:Monthly Income for the testing dataset",
        xlab="Monthly Income",  horizontal = TRUE)

There are missing values so the process is reapeated.

boxplot.stats(train_set$MonthlyIncome)$stats

## [1]     0  3600  5500  8000 14600

boxplot.stats(test_set$MonthlyIncome)$stats

## [1]     0  3600  5500  8000 14584

# Dropping outliers of the monthly income variable for the training and testing datasets
train_set  <-train_set[train_set$MonthlyIncome<= 14600,]  # Dropping outliers of monthly income in the dataset for training
test_set <- test_set[test_set$MonthlyIncome <= 14584, ]    # Dropping outliers of monthly income in the dataset for testing

# Checking out that all the outliers were removed.

par(mfrow=c(2,1))
boxplot(train_set$MonthlyIncome,main="Boxplot:Monthly Income for the training dataset",
        xlab="Monthly Income",  horizontal = TRUE)

boxplot(test_set$MonthlyIncome,main="Boxplot:Monthly Income for the testing dataset",
        xlab="Monthly Income",  horizontal = TRUE)

Cleaning the number of dependents variable

summary(train_set$NumberOfDependents)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.8267  1.0000 20.0000

summary(test_set$NumberOfDependents)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.8463  2.0000 43.0000

# Distribution of number of dependents for the training and testing datasets

p3 <- ggplot(train_set, aes(NumberOfDependents,)) + geom_bar() +coord_cartesian(x=c(0,20), ylim = c(0,50000)) + labs(title=" Distribution of the borrower's dependents", x="Number of dependents", y="Number of borrowers")

p4 <- ggplot(test_set, aes(NumberOfDependents,)) + geom_bar() +
  coord_cartesian(x=c(0,20), ylim = c(0,40000)) + labs(title = " Distribution of the borrower's dependents", x="Number of dependets", y="Number of borrowers")

grid.arrange(p3, p4, ncol=1, nrow=2)

to simplify the analisis the number of depedents is fixed to 5 for both datasets.

# Dropping outliers of the number of dependents for the training and testing datasets
train_set  <-train_set[train_set$NumberOfDependents<= 5,]
test_set <- test_set[test_set$NumberOfDependents <= 5, ]

# Distribution of number of dependents for the training and testing datasets

p3 <- ggplot(train_set, aes(NumberOfDependents,)) + geom_bar() +coord_cartesian(x=c(0,5), ylim = c(0,60000)) + labs(title=" Distribution of the borrower's dependents", x="Number of dependents", y="Number of borrowers")

p4 <- ggplot(test_set, aes(NumberOfDependents,)) + geom_bar() +
  coord_cartesian(x=c(0,5), ylim = c(0,40000)) + labs(title = " Distribution of the borrower's dependents", x="Number of dependets", y="Number of borrowers")

grid.arrange(p3, p4, ncol=1, nrow=2)

Exploratory Data Analysis

# Proportion of defaults in the dataset for training 

ggplot(data = train_set, aes(x=factor(SeriousDlqin2yrs),y = prop.table(stat(count)), fill = factor(SeriousDlqin2yrs),label = scales::percent(prop.table(stat(count))))) + geom_bar(position = "dodge") + geom_text(stat="count",
position = position_dodge(0.9),vjust=-0.5,size = 3) + scale_x_discrete(labels=c("No Default", "Default")) +scale_y_continuous(labels = scales::percent) +ggtitle("Proportion of Credit Defaults") + labs(x="", y = "Proportion of Borrowers")

As we can the there a huge class imbalance in the dataset, taking that into account let’s see if the model can generalize correctly without applying any method for imbalance data.

# Distribution of borrowers by age

p5<- ggplot(data=train_set[train_set$SeriousDlqin2yrs==0,], aes(x=age)) + 
  geom_bar(aes(fill='No Default'), show.legend = FALSE) +
  labs(title='Distribution of borrowers by age for no defaulters', y = 'Number of borrowers') + theme_classic()

p6<-ggplot(data=train_set[train_set$SeriousDlqin2yrs==1,], aes(x=age)) + 
  geom_bar(aes(fill='Default'), show.legend=FALSE) +
  labs(title='Distribution of borrowers by age for defaulters', y = 'Number of borrowers') +
  theme_classic()

grid.arrange(p5, p6, ncol=1, nrow=2)

The plots show that defaulters are concentrated in the left side of the distribution, we could expect that age could be a predictor of whether a borrower will default or not.

ggplot(train_set, aes(x=factor(SeriousDlqin2yrs), y=RevolvingUtilizationOfUnsecuredLines)) + geom_boxplot() + coord_cartesian(ylim = c(0,2)) + labs(title = "Box Plot: Revoling Utilization by Type of Borrower")

It seems that median of the Revolving Utilization is different between defaulters and no defaulters. Of course a median test could be more appropriate to taste if there’s any difference statistically significant.

# Distribution of borrowers by number of defaults in a 30-59 day period

p5<- ggplot(data=train_set[train_set$SeriousDlqin2yrs==0,], aes(x=NumberOfTime30.59DaysPastDueNotWorse)) + 
  geom_bar(aes(fill='No Default'), show.legend = FALSE) +
  labs(title='Number of defaults in a 30-59 day period for defaulters', y = 'Number of borrowers', x="Number of defaults") + theme_classic()

p6<-ggplot(data=train_set[train_set$SeriousDlqin2yrs==1,], aes(x=NumberOfTime30.59DaysPastDueNotWorse)) + 
  geom_bar(aes(fill='Default'), show.legend=FALSE) +
  labs(title='Number of defaults in a 30-59 day period for no defaulters', y = 'Number of borrowers', x= "Number of defaults") +
  theme_classic()

grid.arrange(p5, p6, ncol=1, nrow=2)

# Distribution of borrowers by number of defaults in a 60-89 day period

p5<- ggplot(data=train_set[train_set$SeriousDlqin2yrs==0,], aes(x=NumberOfTime60.89DaysPastDueNotWorse)) + 
  geom_bar(aes(fill='No Default'), show.legend = FALSE) +
  labs(title='Number of defaults in a 60-89 day period for defaulters', y = 'Number of borrowers', x="Number of defaults") + theme_classic()

p6<-ggplot(data=train_set[train_set$SeriousDlqin2yrs==1,], aes(x=NumberOfTime60.89DaysPastDueNotWorse)) + 
  geom_bar(aes(fill='Default'), show.legend=FALSE) +
  labs(title='Number of defaults in a 60-89 day period for defaulters', y = 'Number of borrowers', x= "Number of defaults") +
  theme_classic()

grid.arrange(p5, p6, ncol=1, nrow=2)

# Distribution of borrowers by number of defaults in a 30-59 day period

p5<- ggplot(data=train_set[train_set$SeriousDlqin2yrs==0,], aes(x=NumberOfTimes90DaysLate)) + 
  geom_bar(aes(fill='No Default'), show.legend = FALSE) +
  labs(title='Number of defaults in a 90 day period for defaulters', y = 'Number of borrowers', x="Number of defaults") + theme_classic()

p6<-ggplot(data=train_set[train_set$SeriousDlqin2yrs==1,], aes(x=NumberOfTimes90DaysLate)) + 
  geom_bar(aes(fill='Default'), show.legend=FALSE) +
  labs(title='Number of defaults in a 90 day period for no defaulters', y = 'Number of borrowers', x= "Number of defaults") +
  theme_classic()

grid.arrange(p5, p6, ncol=1, nrow=2)

# Visualizing the correlations between variables
correlations <- cor(train_set, method="pearson")
corrplot(correlations, number.cex = .9, method = "color", type = "full", tl.cex=0.8,tl.col = "black")

Data Preparation

# Transforming the target variable to a factor
train_set$SeriousDlqin2yrs <- factor(train_set$SeriousDlqin2yrs, levels = c(0,1), labels = c("No Default", "Default"))

# Double Cheking that the category of reference is Default
contrasts(train_set$SeriousDlqin2yrs)

##            Default
## No Default       0
## Default          1

# Transforming the age to a categorical variable in the training dataset
train_set$age_category <- with(train_set,
  ifelse(age >= 21 & age < 30, "(20+ to 29+)",
  ifelse(age >= 30 & age < 40, "(30+ to 39+)",
  ifelse(age >= 40 & age < 50, "(40+ to 49+)",
  ifelse(age >= 50 & age < 60, "(50+ to 59+)",
  "60+ years")))))

train_set$age_category <- as.factor(train_set$age_category)

# Transforming the age to a categorical variable in the testing dataset
test_set$age_category <- with(test_set,
  ifelse(age >= 21 & age < 30, "(20+ to 29+)",
  ifelse(age >= 30 & age < 40, "(30+ to 39+)",
  ifelse(age >= 40 & age < 50, "(40+ to 49+)",
  ifelse(age >= 50 & age < 60, "(50+ to 59+)",
  "60+ years")))))
test_set$age_category <- as.factor(test_set$age_category)

# Setting the category of reference for age
test_set$age_category <- relevel(test_set$age_category, ref="60+ years")
train_set$age_category <- relevel(train_set$age_category, ref="60+ years")

# Simplifying the datasets
train_set$age <- NULL
test_set$age <- NULL

# Transforming the NumberOfTime30.59DaysPastDueNotWorse to a categorical variable in the training dataset
train_set$defaults_30_59_days <- with(train_set,
  ifelse(NumberOfTime30.59DaysPastDueNotWorse == 0 , "(Zero)",
  ifelse(NumberOfTime30.59DaysPastDueNotWorse > 0 & NumberOfTime30.59DaysPastDueNotWorse <= 1, "(One)",
  ifelse(NumberOfTime30.59DaysPastDueNotWorse > 1 & NumberOfTime30.59DaysPastDueNotWorse <= 2, "(Two)",
  "(Tree or more)"))))

train_set$defaults_30_59_days <- as.factor(train_set$defaults_30_59_days) 




# Transforming the NumberOfTime30.59DaysPastDueNotWorse to a categorical variable in the training dataset
test_set$defaults_30_59_days <- with(test_set,
  ifelse(NumberOfTime30.59DaysPastDueNotWorse == 0 , "(Zero)",
  ifelse(NumberOfTime30.59DaysPastDueNotWorse > 0 & NumberOfTime30.59DaysPastDueNotWorse <= 1, "(One)",
  ifelse(NumberOfTime30.59DaysPastDueNotWorse > 1 & NumberOfTime30.59DaysPastDueNotWorse <= 2, "(Two)",
  "(Tree or more)"))))

test_set$defaults_30_59_days <- as.factor(test_set$defaults_30_59_days) 


train_set$defaults_30_59_days <- relevel(train_set$defaults_30_59_days, ref="(Zero)")

test_set$defaults_30_59_days <- relevel(test_set$defaults_30_59_days, ref="(Zero)")

# Simplifying the dataset
train_set$NumberOfTime30.59DaysPastDueNotWorse <- NULL
test_set$NumberOfTime30.59DaysPastDueNotWorse <- NULL

# Transforming the NumberOfTime60.89DaysPastDueNotWorse to a categorical variable in the training dataset

train_set$defaults_60_89_days <- with(train_set,
  ifelse(NumberOfTime60.89DaysPastDueNotWorse == 0 , "(Zero)",
  ifelse(NumberOfTime60.89DaysPastDueNotWorse > 0 & NumberOfTime60.89DaysPastDueNotWorse <= 1, "(One)",
  ifelse(NumberOfTime60.89DaysPastDueNotWorse > 1 & NumberOfTime60.89DaysPastDueNotWorse <= 2, "(Two)",
  "(Tree or More)"))))

train_set$defaults_60_89_days <- as.factor(train_set$defaults_60_89_days)         

# Transforming the NumberOfTime60.89DaysPastDueNotWorse to a categorical variable in the testing dataset


test_set$defaults_60_89_days <- with(test_set,
  ifelse(NumberOfTime60.89DaysPastDueNotWorse == 0 , "(Zero)",
  ifelse(NumberOfTime60.89DaysPastDueNotWorse > 0 & NumberOfTime60.89DaysPastDueNotWorse <= 1, "(One)",
  ifelse(NumberOfTime60.89DaysPastDueNotWorse > 1 & NumberOfTime60.89DaysPastDueNotWorse <= 2, "(Two)",
  "(Tree or More)"))))

test_set$defaults_60_89_days <- as.factor(test_set$defaults_60_89_days) 



# Setting the category of reference for the number of defaults in a 60-89 day period

train_set$defaults_60_89_days <- relevel(train_set$defaults_60_89_days, ref="(Zero)")

test_set$defaults_60_89_days <- relevel(test_set$defaults_60_89_days, ref="(Zero)")


# Simplifying the dataset
train_set$NumberOfTime60.89DaysPastDueNotWorse  <- NULL
test_set$ NumberOfTime60.89DaysPastDueNotWorse  <- NULL

# Transforming the NumberOfTimes90DaysLate to a categorical variable in the training dataset
train_set$defaults_90_days <- with(train_set,
  ifelse(NumberOfTimes90DaysLate == 0 , "(Zero)",
  ifelse(NumberOfTimes90DaysLate > 0 & NumberOfTimes90DaysLate <= 1, "(One)",
  ifelse(NumberOfTimes90DaysLate > 1 & NumberOfTimes90DaysLate <= 2, "(Two)",
  "(Tree or more)"))))

train_set$defaults_90_days  <- as.factor(train_set$defaults_90_days )         

# Transforming the NumberOfTimes90DaysLate to a categorical variable in the testing dataset

test_set$defaults_90_days <- with(test_set,
  ifelse(NumberOfTimes90DaysLate == 0 , "(Zero)",
  ifelse(NumberOfTimes90DaysLate > 0 & NumberOfTimes90DaysLate <= 1, "(One)",
  ifelse(NumberOfTimes90DaysLate > 1 & NumberOfTimes90DaysLate <= 2, "(Two)",
  "(Tree or more)"))))

test_set$defaults_90_days  <- as.factor(test_set$defaults_90_days )         


# Setting the category of reference for the number of defaults in a 90 day period

train_set$defaults_90_days <- relevel(train_set$defaults_90_days, ref="(Zero)")

test_set$defaults_90_days <- relevel(test_set$defaults_90_days, ref="(Zero)")


# Simplifying the dataset
train_set$NumberOfTimes90DaysLate  <- NULL
test_set$NumberOfTimes90DaysLate <- NULL

# Transforming the NumberOfDependents to a categorical variable in the training dataset

train_set$Dependents <- with(train_set,
  ifelse(NumberOfDependents == 0 , "(Zero)",
  ifelse(NumberOfDependents > 0 & NumberOfDependents <= 1, "(One)",
  ifelse(NumberOfDependents > 1 & NumberOfDependents <= 2, "(Two)",
  "(Tree or more)"))))

train_set$Dependents <- as.factor(train_set$Dependents)

# Transforming the NumberOfDependents to a categorical variable in the testing dataset

test_set$Dependents <- with(test_set,
  ifelse(NumberOfDependents == 0 , "(Zero)",
  ifelse(NumberOfDependents > 0 & NumberOfDependents <= 1, "(One)",
  ifelse(NumberOfDependents > 1 & NumberOfDependents <= 2, "(Two)",
  "(Tree or more)"))))

test_set$Dependents <- as.factor(test_set$Dependents)


# Setting the category of reference for the number of dependents

train_set$Dependents <- relevel(train_set$Dependents, ref="(Zero)")

test_set$Dependents <- relevel(test_set$Dependents, ref="(Zero)")


# Simplifying the dataset
train_set$NumberOfDependents  <- NULL
test_set$NumberOfDependents <- NULL

# Transforming the NumberRealEstateLoansOrLines to a categorical variable in the training dataset

train_set$Real_Estate_Loans <- with(train_set,
  ifelse(NumberRealEstateLoansOrLines == 0 , "(Zero)",
  ifelse(NumberRealEstateLoansOrLines > 0 & NumberRealEstateLoansOrLines <= 3, "(One to Tree)",
  "(Four or more)")))



train_set$Real_Estate_Loans <- as.factor(train_set$Real_Estate_Loans)

# Transforming the NumberRealEstateLoansOrLines to a categorical variable in the testing dataset

test_set$Real_Estate_Loans <- with(test_set,
  ifelse(NumberRealEstateLoansOrLines == 0 , "(Zero)",
  ifelse(NumberRealEstateLoansOrLines > 0 & NumberRealEstateLoansOrLines <= 3, "(One to Tree)",
  "(Four or more)")))



test_set$Real_Estate_Loans <- as.factor(test_set$Real_Estate_Loans)


# Setting the category of reference for the number of real estate loans

train_set$Real_Estate_Loans <- relevel(train_set$Real_Estate_Loans, ref="(Zero)")

test_set$Real_Estate_Loans <- relevel(test_set$Real_Estate_Loans, ref="(Zero)")


# Simplifying the dataset
train_set$NumberRealEstateLoansOrLines  <- NULL
test_set$NumberRealEstateLoansOrLines <- NULL

# Transforming the NumberOfOpenCreditLinesAndLoans to a categorical variable in the training dataset

train_set$Credit_or_Loans <- with(train_set,
  ifelse(NumberOfOpenCreditLinesAndLoans >= 0 & NumberOfOpenCreditLinesAndLoans <= 11, "(0-11)",
  ifelse(NumberOfOpenCreditLinesAndLoans >11 & NumberOfOpenCreditLinesAndLoans<= 21, "(11-21)",
         "(22 or more)")))

train_set$Credit_or_Loans <- as.factor(train_set$Credit_or_Loans)



# Transforming the NumberOfOpenCreditLinesAndLoans to a categorical variable in the testing dataset


test_set$Credit_or_Loans <- with(test_set,
  ifelse(NumberOfOpenCreditLinesAndLoans >= 0 & NumberOfOpenCreditLinesAndLoans <= 11, "(0-11)",
  ifelse(NumberOfOpenCreditLinesAndLoans >11 & NumberOfOpenCreditLinesAndLoans<= 21, "(11-21)",
         "(22 or more)")))

test_set$Credit_or_Loans <- as.factor(test_set$Credit_or_Loans)


# Setting the category of reference for the number of credit lines
train_set$Credit_or_Loans <- relevel(train_set$Credit_or_Loans, ref="(0-11)")

test_set$Credit_or_Loans <- relevel(test_set$Credit_or_Loans, ref="(0-11)")

# Simplifying the dataset
train_set$NumberOfOpenCreditLinesAndLoans  <- NULL
test_set$NumberOfOpenCreditLinesAndLoans <- NULL

# Double Checking that the data is complete
colSums(is.na(train_set))

##                     SeriousDlqin2yrs RevolvingUtilizationOfUnsecuredLines 
##                                    0                                    0 
##                            DebtRatio                        MonthlyIncome 
##                                    0                                    0 
##                         age_category                  defaults_30_59_days 
##                                    0                                    0 
##                  defaults_60_89_days                     defaults_90_days 
##                                    0                                    0 
##                           Dependents                    Real_Estate_Loans 
##                                    0                                    0 
##                      Credit_or_Loans 
##                                    0

# Retrieving the same of observations for test label 

names(test_labels) <- c("X", "Probability" )
test_labels_clean <- dplyr::left_join(test_set, test_labels, by = "X")
test_labels_clean <- test_labels_clean %>% dplyr::select(X, Probability )
str(test_labels_clean)

## 'data.frame':    68314 obs. of  2 variables:
##  $ X          : int  1 2 3 5 6 8 11 13 14 15 ...
##  $ Probability: num  0.0808 0.0407 0.012 0.1083 0.0275 ...

Data Modeling

Model I: All variables

model1_complete  <- glm(SeriousDlqin2yrs~., data = train_set, family = "binomial" )
summary(model1_complete)

## 
## Call:
## glm(formula = SeriousDlqin2yrs ~ ., family = "binomial", data = train_set)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2745  -0.3080  -0.2115  -0.1662   3.1660  
## 
## Coefficients:
##                                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                          -4.421e+00  5.885e-02 -75.131  < 2e-16 ***
## RevolvingUtilizationOfUnsecuredLines  1.609e+00  4.482e-02  35.897  < 2e-16 ***
## DebtRatio                             8.408e-01  9.915e-02   8.480  < 2e-16 ***
## MonthlyIncome                        -3.530e-05  6.321e-06  -5.585 2.34e-08 ***
## age_category(20+ to 29+)              6.238e-01  6.267e-02   9.953  < 2e-16 ***
## age_category(30+ to 39+)              4.404e-01  5.078e-02   8.673  < 2e-16 ***
## age_category(40+ to 49+)              4.106e-01  4.908e-02   8.366  < 2e-16 ***
## age_category(50+ to 59+)              2.936e-01  4.913e-02   5.977 2.28e-09 ***
## defaults_30_59_days(One)              7.529e-01  3.767e-02  19.984  < 2e-16 ***
## defaults_30_59_days(Tree or more)     1.417e+00  6.287e-02  22.544  < 2e-16 ***
## defaults_30_59_days(Two)              1.133e+00  5.454e-02  20.767  < 2e-16 ***
## defaults_60_89_days(One)              9.281e-01  4.737e-02  19.593  < 2e-16 ***
## defaults_60_89_days(Tree or More)     1.205e+00  1.532e-01   7.867 3.63e-15 ***
## defaults_60_89_days(Two)              1.201e+00  9.687e-02  12.399  < 2e-16 ***
## defaults_90_days(One)                 1.358e+00  4.641e-02  29.262  < 2e-16 ***
## defaults_90_days(Tree or more)        2.127e+00  8.693e-02  24.466  < 2e-16 ***
## defaults_90_days(Two)                 1.715e+00  7.662e-02  22.388  < 2e-16 ***
## Dependents(One)                       6.072e-02  3.925e-02   1.547  0.12186    
## Dependents(Tree or more)              1.948e-01  4.977e-02   3.914 9.07e-05 ***
## Dependents(Two)                       7.975e-02  4.418e-02   1.805  0.07106 .  
## Real_Estate_Loans(Four or more)       3.375e-01  1.197e-01   2.820  0.00480 ** 
## Real_Estate_Loans(One to Tree)       -2.505e-01  4.445e-02  -5.636 1.74e-08 ***
## Credit_or_Loans(11-21)                3.187e-01  4.045e-02   7.879 3.31e-15 ***
## Credit_or_Loans(22 or more)           3.810e-01  1.173e-01   3.248  0.00116 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 45032  on 100609  degrees of freedom
## Residual deviance: 34861  on 100586  degrees of freedom
## AIC: 34909
## 
## Number of Fisher Scoring iterations: 6

# Confusion Matrix
threshold <- 0.5
pred_glm <- predict(model1_complete, newdata = test_set, type = 'response')

pred <- ifelse(pred_glm > threshold, 1, 0)
test_class <- ifelse(test_labels_clean$Probability > threshold, 1, 0)
pred[is.na(pred)] <- 1

observ_glm <- as.factor(ifelse(test = test_class == 1, yes = 'Default', no = 'No_Default'))
predic_glm <- as.factor(ifelse(test = pred == 1, yes = 'Default', no = 'No_Default'))
matriz_glm <- confusionMatrix(predic_glm, observ_glm, positive='Default')
matriz_glm

## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   Default No_Default
##   Default        621        442
##   No_Default      75      67176
##                                           
##                Accuracy : 0.9924          
##                  95% CI : (0.9918, 0.9931)
##     No Information Rate : 0.9898          
##     P-Value [Acc > NIR] : 5.476e-13       
##                                           
##                   Kappa : 0.7024          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.89224         
##             Specificity : 0.99346         
##          Pos Pred Value : 0.58420         
##          Neg Pred Value : 0.99888         
##              Prevalence : 0.01019         
##          Detection Rate : 0.00909         
##    Detection Prevalence : 0.01556         
##       Balanced Accuracy : 0.94285         
##                                           
##        'Positive' Class : Default         
##

Model II: Without number of dependents

model2  <- glm(SeriousDlqin2yrs~MonthlyIncome+defaults_60_89_days+Real_Estate_Loans+RevolvingUtilizationOfUnsecuredLines+age_category+defaults_90_days+Credit_or_Loans+DebtRatio+defaults_30_59_days, data = train_set, family = "binomial" )
summary(model2)

## 
## Call:
## glm(formula = SeriousDlqin2yrs ~ MonthlyIncome + defaults_60_89_days + 
##     Real_Estate_Loans + RevolvingUtilizationOfUnsecuredLines + 
##     age_category + defaults_90_days + Credit_or_Loans + DebtRatio + 
##     defaults_30_59_days, family = "binomial", data = train_set)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2635  -0.3083  -0.2115  -0.1665   3.1519  
## 
## Coefficients:
##                                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                          -4.428e+00  5.853e-02 -75.643  < 2e-16 ***
## MonthlyIncome                        -3.217e-05  6.260e-06  -5.140 2.75e-07 ***
## defaults_60_89_days(One)              9.287e-01  4.736e-02  19.610  < 2e-16 ***
## defaults_60_89_days(Tree or More)     1.209e+00  1.531e-01   7.894 2.93e-15 ***
## defaults_60_89_days(Two)              1.205e+00  9.689e-02  12.433  < 2e-16 ***
## Real_Estate_Loans(Four or more)       3.325e-01  1.197e-01   2.777  0.00548 ** 
## Real_Estate_Loans(One to Tree)       -2.471e-01  4.440e-02  -5.564 2.63e-08 ***
## RevolvingUtilizationOfUnsecuredLines  1.615e+00  4.475e-02  36.076  < 2e-16 ***
## age_category(20+ to 29+)              6.336e-01  6.257e-02  10.126  < 2e-16 ***
## age_category(30+ to 39+)              4.774e-01  4.958e-02   9.629  < 2e-16 ***
## age_category(40+ to 49+)              4.617e-01  4.684e-02   9.856  < 2e-16 ***
## age_category(50+ to 59+)              3.179e-01  4.857e-02   6.545 5.94e-11 ***
## defaults_90_days(One)                 1.362e+00  4.638e-02  29.364  < 2e-16 ***
## defaults_90_days(Tree or more)        2.132e+00  8.687e-02  24.538  < 2e-16 ***
## defaults_90_days(Two)                 1.722e+00  7.657e-02  22.490  < 2e-16 ***
## Credit_or_Loans(11-21)                3.161e-01  4.044e-02   7.818 5.36e-15 ***
## Credit_or_Loans(22 or more)           3.720e-01  1.173e-01   3.172  0.00151 ** 
## DebtRatio                             8.524e-01  9.897e-02   8.613  < 2e-16 ***
## defaults_30_59_days(One)              7.558e-01  3.766e-02  20.071  < 2e-16 ***
## defaults_30_59_days(Tree or more)     1.419e+00  6.288e-02  22.564  < 2e-16 ***
## defaults_30_59_days(Two)              1.136e+00  5.452e-02  20.832  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 45032  on 100609  degrees of freedom
## Residual deviance: 34877  on 100589  degrees of freedom
## AIC: 34919
## 
## Number of Fisher Scoring iterations: 6

# Confusion Matrix
threshold <- 0.5
pred_glm <- predict(model2, newdata = test_set, type = 'response')

pred <- ifelse(pred_glm > threshold, 1, 0)
test_class <- ifelse(test_labels_clean$Probability > threshold, 1, 0)
pred[is.na(pred)] <- 1

observ_glm <- as.factor(ifelse(test = test_class == 1, yes = 'Default', no = 'No_Default'))
predic_glm <- as.factor(ifelse(test = pred == 1, yes = 'Default', no = 'No_Default'))
matriz_glm <- confusionMatrix(predic_glm, observ_glm, positive='Default')
matriz_glm

## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   Default No_Default
##   Default        620        444
##   No_Default      76      67174
##                                          
##                Accuracy : 0.9924         
##                  95% CI : (0.9917, 0.993)
##     No Information Rate : 0.9898         
##     P-Value [Acc > NIR] : 1.353e-12      
##                                          
##                   Kappa : 0.7009         
##                                          
##  Mcnemar's Test P-Value : < 2.2e-16      
##                                          
##             Sensitivity : 0.890805       
##             Specificity : 0.993434       
##          Pos Pred Value : 0.582707       
##          Neg Pred Value : 0.998870       
##              Prevalence : 0.010188       
##          Detection Rate : 0.009076       
##    Detection Prevalence : 0.015575       
##       Balanced Accuracy : 0.942119       
##                                          
##        'Positive' Class : Default        
##

Model II: Without number of dependents and real estate loan

model3  <- glm(SeriousDlqin2yrs~MonthlyIncome+defaults_60_89_days+RevolvingUtilizationOfUnsecuredLines+age_category+defaults_90_days+Credit_or_Loans+DebtRatio+defaults_30_59_days, data = train_set, family = "binomial" )
summary(model3)

## 
## Call:
## glm(formula = SeriousDlqin2yrs ~ MonthlyIncome + defaults_60_89_days + 
##     RevolvingUtilizationOfUnsecuredLines + age_category + defaults_90_days + 
##     Credit_or_Loans + DebtRatio + defaults_30_59_days, family = "binomial", 
##     data = train_set)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2335  -0.3076  -0.2129  -0.1686   3.1412  
## 
## Coefficients:
##                                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                          -4.451e+00  5.835e-02 -76.284  < 2e-16 ***
## MonthlyIncome                        -4.119e-05  5.662e-06  -7.274 3.48e-13 ***
## defaults_60_89_days(One)              9.293e-01  4.732e-02  19.636  < 2e-16 ***
## defaults_60_89_days(Tree or More)     1.208e+00  1.530e-01   7.896 2.87e-15 ***
## defaults_60_89_days(Two)              1.200e+00  9.685e-02  12.388  < 2e-16 ***
## RevolvingUtilizationOfUnsecuredLines  1.656e+00  4.397e-02  37.662  < 2e-16 ***
## age_category(20+ to 29+)              6.865e-01  6.222e-02  11.033  < 2e-16 ***
## age_category(30+ to 39+)              4.994e-01  4.941e-02  10.109  < 2e-16 ***
## age_category(40+ to 49+)              4.641e-01  4.682e-02   9.912  < 2e-16 ***
## age_category(50+ to 59+)              3.107e-01  4.854e-02   6.400 1.55e-10 ***
## defaults_90_days(One)                 1.367e+00  4.635e-02  29.490  < 2e-16 ***
## defaults_90_days(Tree or more)        2.145e+00  8.690e-02  24.685  < 2e-16 ***
## defaults_90_days(Two)                 1.733e+00  7.653e-02  22.641  < 2e-16 ***
## Credit_or_Loans(11-21)                3.327e-01  4.022e-02   8.273  < 2e-16 ***
## Credit_or_Loans(22 or more)           4.216e-01  1.168e-01   3.611 0.000305 ***
## DebtRatio                             5.685e-01  7.776e-02   7.311 2.66e-13 ***
## defaults_30_59_days(One)              7.590e-01  3.764e-02  20.167  < 2e-16 ***
## defaults_30_59_days(Tree or more)     1.414e+00  6.275e-02  22.539  < 2e-16 ***
## defaults_30_59_days(Two)              1.136e+00  5.450e-02  20.848  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 45032  on 100609  degrees of freedom
## Residual deviance: 34937  on 100591  degrees of freedom
## AIC: 34975
## 
## Number of Fisher Scoring iterations: 6

# Confusion Matrix
threshold <- 0.5
pred_glm <- predict(model3, newdata = test_set, type = 'response')

pred <- ifelse(pred_glm > threshold, 1, 0)
test_class <- ifelse(test_labels_clean$Probability > threshold, 1, 0)
pred[is.na(pred)] <- 1

observ_glm <- as.factor(ifelse(test = test_class == 1, yes = 'Default', no = 'No_Default'))
predic_glm <- as.factor(ifelse(test = pred == 1, yes = 'Default', no = 'No_Default'))
matriz_glm <- confusionMatrix(predic_glm, observ_glm, positive='Default')
matriz_glm

## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   Default No_Default
##   Default        626        443
##   No_Default      70      67175
##                                           
##                Accuracy : 0.9925          
##                  95% CI : (0.9918, 0.9931)
##     No Information Rate : 0.9898          
##     P-Value [Acc > NIR] : 1.596e-13       
##                                           
##                   Kappa : 0.7057          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.899425        
##             Specificity : 0.993448        
##          Pos Pred Value : 0.585594        
##          Neg Pred Value : 0.998959        
##              Prevalence : 0.010188        
##          Detection Rate : 0.009164        
##    Detection Prevalence : 0.015648        
##       Balanced Accuracy : 0.946437        
##                                           
##        'Positive' Class : Default         
##

Conclusions

Logistic Regression is appropriate to calculate the probabilities of default, however, it is necessary to have a clean dataset because this method is sensible to outliers. Cleaning a large dataset can be time consuming and for that reason a method like decision trees could be more appropriate.
The best model seems to be model4. This model is very good at predicting defaults and no defaults. The specificity of this model is about 99% with a sensitivity of 89%. The kappa coefficient indicates that the model has a very good agreement between classes considering that the dataset is unbalanced.
The age it’s important to predict whether a borrower will default or not, we see that the probability of default decreases as the borrower becomes older.
We see that if a borrower has defaulted in the past, the probability of default increases.
On the other hand, as the monthly income increases the probability of default decreases.
Finally, as the borrowers gets into debt, the probability of default increases as well.

Credit Default Analysis with Logistic Regression