options(warn = -1) # Deactivating warnings
The data comes from the “Give Me Some Credit” competition from Kaggle. The goal of this competition is to improve on the state of the art in credit scoring, by predicting the probability that somebody will experience financial distress in the next two years.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(Amelia)
## Loading required package: Rcpp
## ##
## ## Amelia II: Multiple Imputation
## ## (Version 1.8.1, built: 2022-11-18)
## ## Copyright (C) 2005-2023 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##
library(corrplot)
## corrplot 0.92 loaded
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
library(caret)
## Loading required package: lattice
library(ROSE)
## Loaded ROSE 0.0-4
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:gridExtra':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
train_set <- read.csv("cs-training.csv")
test_set <- read.csv("cs-test.csv")
test_labels <- read.csv("sampleEntry.csv")
str(train_set)
## 'data.frame': 150000 obs. of 12 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ SeriousDlqin2yrs : int 1 0 0 0 0 0 0 0 0 0 ...
## $ RevolvingUtilizationOfUnsecuredLines: num 0.766 0.957 0.658 0.234 0.907 ...
## $ age : int 45 40 38 30 49 74 57 39 27 57 ...
## $ NumberOfTime30.59DaysPastDueNotWorse: int 2 0 1 0 1 0 0 0 0 0 ...
## $ DebtRatio : num 0.803 0.1219 0.0851 0.036 0.0249 ...
## $ MonthlyIncome : int 9120 2600 3042 3300 63588 3500 NA 3500 NA 23684 ...
## $ NumberOfOpenCreditLinesAndLoans : int 13 4 2 5 7 3 8 8 2 9 ...
## $ NumberOfTimes90DaysLate : int 0 0 1 0 0 0 0 0 0 0 ...
## $ NumberRealEstateLoansOrLines : int 6 0 0 0 1 1 3 0 0 4 ...
## $ NumberOfTime60.89DaysPastDueNotWorse: int 0 0 0 0 0 0 0 0 0 0 ...
## $ NumberOfDependents : int 2 1 0 0 0 1 0 0 NA 2 ...
sum(duplicated(train_set$X)) # Looking for duplicated records in the dataset
## [1] 0
summary(train_set) # Calculating basic statistics for the dataset
## X SeriousDlqin2yrs RevolvingUtilizationOfUnsecuredLines
## Min. : 1 Min. :0.00000 Min. : 0.00
## 1st Qu.: 37501 1st Qu.:0.00000 1st Qu.: 0.03
## Median : 75001 Median :0.00000 Median : 0.15
## Mean : 75001 Mean :0.06684 Mean : 6.05
## 3rd Qu.:112500 3rd Qu.:0.00000 3rd Qu.: 0.56
## Max. :150000 Max. :1.00000 Max. :50708.00
##
## age NumberOfTime30.59DaysPastDueNotWorse DebtRatio
## Min. : 0.0 Min. : 0.000 Min. : 0.0
## 1st Qu.: 41.0 1st Qu.: 0.000 1st Qu.: 0.2
## Median : 52.0 Median : 0.000 Median : 0.4
## Mean : 52.3 Mean : 0.421 Mean : 353.0
## 3rd Qu.: 63.0 3rd Qu.: 0.000 3rd Qu.: 0.9
## Max. :109.0 Max. :98.000 Max. :329664.0
##
## MonthlyIncome NumberOfOpenCreditLinesAndLoans NumberOfTimes90DaysLate
## Min. : 0 Min. : 0.000 Min. : 0.000
## 1st Qu.: 3400 1st Qu.: 5.000 1st Qu.: 0.000
## Median : 5400 Median : 8.000 Median : 0.000
## Mean : 6670 Mean : 8.453 Mean : 0.266
## 3rd Qu.: 8249 3rd Qu.:11.000 3rd Qu.: 0.000
## Max. :3008750 Max. :58.000 Max. :98.000
## NA's :29731
## NumberRealEstateLoansOrLines NumberOfTime60.89DaysPastDueNotWorse
## Min. : 0.000 Min. : 0.0000
## 1st Qu.: 0.000 1st Qu.: 0.0000
## Median : 1.000 Median : 0.0000
## Mean : 1.018 Mean : 0.2404
## 3rd Qu.: 2.000 3rd Qu.: 0.0000
## Max. :54.000 Max. :98.0000
##
## NumberOfDependents
## Min. : 0.000
## 1st Qu.: 0.000
## Median : 0.000
## Mean : 0.757
## 3rd Qu.: 1.000
## Max. :20.000
## NA's :3924
The dataset has 12 variables and 150000 observations.The variables can be described as follows:
SeriousDlqin2yrs which indicates if a person experienced 90 days past due delinquency or worse.
RevolvingUtilizationOfUnsecuredLines which indicates the total balance on credit cards and personal lines of credit except real estate and no installment debt like car loans divided by the sum of credit limits.
Age which indicates the borrower’s age in years.
NumberOfTime30-59DaysPastDueNotWorse which indicates the number of times borrower has been 30-59 days past due but no worse in the last 2 years.
DebtRatio which indicates the monthly debt payments, alimony,living costs divided by monthy gross income.
MonthlyIncome which is self explanatory.
NumberOfOpenCreditLinesAndLoans which indicates the number of open loans (installment like car loan or mortgage) and lines of credit (e.g. credit cards).
NumberOfTimes90DaysLate which indicates the number of times the borrower has been 90 days or more past due.
NumberRealEstateLoansOrLines which indicates the number of mortgage and real estate loans including home equity lines of credit.
NumberOfTime60-89DaysPastDueNotWorse which indicates the number of times the borrower has been 60-89 days past due but no worse in the last 2 years.
NumberOfDependents which indicates the number of dependents in family excluding themselves (spouse, children etc.).
Some of the variables are not correctly identified as factors, there are missing values for Monthly Income and Number of Depedents, there are no duplicated records but there are outliers in the dataset.
str(test_set)
## 'data.frame': 101503 obs. of 12 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ SeriousDlqin2yrs : logi NA NA NA NA NA NA ...
## $ RevolvingUtilizationOfUnsecuredLines: num 0.8855 0.4633 0.0433 0.2803 1 ...
## $ age : int 43 57 59 38 27 63 50 79 68 23 ...
## $ NumberOfTime30.59DaysPastDueNotWorse: int 0 0 0 1 0 0 0 1 0 98 ...
## $ DebtRatio : num 0.1775 0.5272 0.6876 0.926 0.0199 ...
## $ MonthlyIncome : int 5700 9141 5083 3200 3865 4140 0 3301 NA 0 ...
## $ NumberOfOpenCreditLinesAndLoans : int 4 15 12 7 4 4 5 8 4 0 ...
## $ NumberOfTimes90DaysLate : int 0 0 0 0 0 0 0 0 0 98 ...
## $ NumberRealEstateLoansOrLines : int 0 4 1 2 0 0 0 1 1 0 ...
## $ NumberOfTime60.89DaysPastDueNotWorse: int 0 0 0 0 0 0 0 0 0 98 ...
## $ NumberOfDependents : int 0 2 2 0 1 1 3 1 0 0 ...
summary(train_set) # Calculating basic statistics
## X SeriousDlqin2yrs RevolvingUtilizationOfUnsecuredLines
## Min. : 1 Min. :0.00000 Min. : 0.00
## 1st Qu.: 37501 1st Qu.:0.00000 1st Qu.: 0.03
## Median : 75001 Median :0.00000 Median : 0.15
## Mean : 75001 Mean :0.06684 Mean : 6.05
## 3rd Qu.:112500 3rd Qu.:0.00000 3rd Qu.: 0.56
## Max. :150000 Max. :1.00000 Max. :50708.00
##
## age NumberOfTime30.59DaysPastDueNotWorse DebtRatio
## Min. : 0.0 Min. : 0.000 Min. : 0.0
## 1st Qu.: 41.0 1st Qu.: 0.000 1st Qu.: 0.2
## Median : 52.0 Median : 0.000 Median : 0.4
## Mean : 52.3 Mean : 0.421 Mean : 353.0
## 3rd Qu.: 63.0 3rd Qu.: 0.000 3rd Qu.: 0.9
## Max. :109.0 Max. :98.000 Max. :329664.0
##
## MonthlyIncome NumberOfOpenCreditLinesAndLoans NumberOfTimes90DaysLate
## Min. : 0 Min. : 0.000 Min. : 0.000
## 1st Qu.: 3400 1st Qu.: 5.000 1st Qu.: 0.000
## Median : 5400 Median : 8.000 Median : 0.000
## Mean : 6670 Mean : 8.453 Mean : 0.266
## 3rd Qu.: 8249 3rd Qu.:11.000 3rd Qu.: 0.000
## Max. :3008750 Max. :58.000 Max. :98.000
## NA's :29731
## NumberRealEstateLoansOrLines NumberOfTime60.89DaysPastDueNotWorse
## Min. : 0.000 Min. : 0.0000
## 1st Qu.: 0.000 1st Qu.: 0.0000
## Median : 1.000 Median : 0.0000
## Mean : 1.018 Mean : 0.2404
## 3rd Qu.: 2.000 3rd Qu.: 0.0000
## Max. :54.000 Max. :98.0000
##
## NumberOfDependents
## Min. : 0.000
## 1st Qu.: 0.000
## Median : 0.000
## Mean : 0.757
## 3rd Qu.: 1.000
## Max. :20.000
## NA's :3924
sum(duplicated(test_set$X)) # Looking for duplicated records in the dataset
## [1] 0
In the dataset for training there are 12 variables and 101503 observations, once again the categorical variables are not correctly identified as factors, there are no duplicated records but there are outliers in the data for all variables.
train_set$X <- NULL # Dropping the X variable since is not relevant for the analysis in the data set for training.
summary(train_set$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 41.0 52.0 52.3 63.0 109.0
summary(test_set$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 21.00 41.00 52.00 52.41 63.00 104.00
# Distribution of the Borrower's age in the training and testing datasets
p1 <- ggplot(train_set, aes(age,)) +
geom_bar() +coord_cartesian(x=c(0,110), ylim = c(0,3800)) +
labs(title="Distribution of the borrower's age in the dataset for training", x="Age", y="Number of borrowers")
p2 <- ggplot(test_set, aes(age,)) +
geom_bar()+ coord_cartesian(x=c(0,110), ylim = c(0,2800)) +
labs(title = "Distribution of the borrower's age in the dataset for training", x="age", y="Number of borrowers")
grid.arrange(p1, p2, ncol=1, nrow=2)
# Outliers for age in the dataset for training and testing
par(mfrow=c(2,1))
boxplot(train_set$age,main="Boxplot: Outliers of age in the dataset for training",
xlab="Age", horizontal = TRUE)
boxplot(test_set$age,main="Boxplot: Outliers of age in the dataset for testing",
xlab="Age", horizontal = TRUE)
boxplot.stats(train_set$age)$stats
## [1] 21 41 52 63 96
boxplot.stats(test_set$age)$stats
## [1] 21 41 52 63 96
The descriptive statistics show that the average age is 52 years for both datasets and as we can see 75% of the borrowers have less than 63 years. With that in mind, the analysis is made for the population between 21-96 years old.
# Selecting the borrowers between 21 and 96 years old.
train_set <- train_set[train_set$age<=96,]
train_set <- train_set[train_set$age>=21,] # Since 21 years old is the minimum age to get a credit card in the US.
test_set <- test_set[test_set$age<= 96,] # Dropping the outlier for age in the dataset for testing
test_set <- test_set[test_set$age>=21,] # Since 21 years old is the minimum age to get a credit card in the US.
# Double checking that outliers for age were removed
par(mfrow=c(2,1))
boxplot(train_set$age,main="Boxplo for age in the dataset for training",
xlab="Age", horizontal = TRUE)
boxplot(test_set$age,main="Boxplot for age in the dataset for testing",
xlab="Age", horizontal = TRUE)
# Outliers for Revolving Utilization in the dataset for training and testing
par(mfrow=c(2,1))
boxplot(train_set$RevolvingUtilizationOfUnsecuredLines,main="Boxplot: Outliers of Revolving Utilization in the dataset for training",
xlab="Revolving Utilization of Unsecured Lines", horizontal = TRUE)
boxplot(test_set$RevolvingUtilizationOfUnsecuredLines,main="Boxplot: Outliers of Revolving Utilization in the dataset for testing",
xlab="Revolving Utilization of Unsecured Lines", horizontal = TRUE)
boxplot.stats(train_set$RevolvingUtilizationOfUnsecuredLines)$stats # Retrieving the outliers of Revolving Utilization Of Unsecured Lines in the dataset for training
## [1] 0.00000000 0.02990146 0.15431089 0.55913626 1.35215947
boxplot.stats(test_set$RevolvingUtilizationOfUnsecuredLines)$stats # Retrieving the outliers of Revolving Utilization Of Unsecured Lines in the dataset for testing
## [1] 0.00000000 0.03014561 0.15261957 0.56425639 1.36502077
Since revolving utilization refers to how much of the credit limit is used, it makes sense to assume that a borrower can not exceed the 100% credit limit.
# Dropping outliers of Revolving Utilization in the training and testing datasets.
train_set <- train_set[train_set$RevolvingUtilizationOfUnsecuredLines <= 1,] # Since a 100% revolving utilization could be the maximum amount that a borrower could be in debt.
test_set <- test_set[test_set$RevolvingUtilizationOfUnsecuredLines<= 1,] # Since a 100% revolving utilization could be the maximum amount that a borrower could be in debt.
# Checking out that all the outliers were removed.
par(mfrow=c(2,1))
boxplot(train_set$RevolvingUtilizationOfUnsecuredLines,main="Boxplot:Revolving Utilization of Unsecurd Lines in the dataset for training",
xlab="Percentage of Revolving", horizontal = TRUE)
boxplot(test_set$RevolvingUtilizationOfUnsecuredLines,main="Boxplot:Revolving Utilization of Unsecurd Lines in the dataset for testing",
xlab="Percentage of Revolving Utilization", horizontal = TRUE)
summary(train_set$NumberOfTime30.59DaysPastDueNotWorse)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.4077 0.0000 98.0000
summary(test_set$NumberOfTime30.59DaysPastDueNotWorse)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.4421 0.0000 98.0000
As we can see the maximum number of defaults in a 30-59 days period is 98, so let’s investigate further.
train_set%>% filter(NumberOfTime30.59DaysPastDueNotWorse>=98)
test_set %>% filter(NumberOfTime30.59DaysPastDueNotWorse>=98)
We see a clear pattern, the number of credit lines or real estate loans for these observations is 0 , we see as well that there are missing values for the monthly income. With that in mind it is reasonable to assume that they are errors in the datasets.
# Dropping extreme values for the number of defaults in a 30-59 day period for the training and testing datasets
train_set <-train_set[train_set$NumberOfTime30.59DaysPastDueNotWorse < 98,]
test_set <- test_set[test_set$NumberOfTime30.59DaysPastDueNotWorse < 98, ]
# Looking for outliers in the number of defaults in a 30-59 day period
par(mfrow =c(1,2))
plot(train_set$NumberOfTime30.59DaysPastDueNotWorse , ylab="Number of defaults", main = "# of defaults in a 30-59 day period(training)")
plot(test_set$NumberOfTime30.59DaysPastDueNotWorse, ylab=" Number of defaults", main = "# of defaults in a 30-59 day period(testing)")
# Dropping extreme values for the number of defaults in a 30-59 day period for the training and testing datasets
train_set <-train_set[train_set$NumberOfTime30.59DaysPastDueNotWorse < 20,]
test_set <- test_set[test_set$NumberOfTime30.59DaysPastDueNotWorse < 20, ]
# Checking out that all the outliers were removed.
p1 <- ggplot(train_set, aes(NumberOfTime30.59DaysPastDueNotWorse,)) + geom_bar() +coord_cartesian(x=c(0,15), ylim = c(0,125000)) + labs(title="Distribution of defaulters in a 30-59 day period in the dataset for training", x="Number of defaults", y="Number of borrowers")
p2 <- ggplot(test_set, aes(NumberOfTime30.59DaysPastDueNotWorse,)) + geom_bar()+
coord_cartesian(x=c(0,15), ylim = c(0,90000)) + labs(title = "Distribution of defaulters in a 30-59 day period in the dataset for testing", x="Number of defaults", y="Number of borrowers")
grid.arrange(p1, p2, ncol=1, nrow=2)
The plots show that most borrowers have between 0 and 3 defaults. With
that in mind, borrowers with 4 or more defaults will be grouped together
to get a more representative category.
summary(train_set$NumberOfTime60.89DaysPastDueNotWorse)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.05649 0.00000 11.00000
summary(test_set$NumberOfTime60.89DaysPastDueNotWorse)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.05594 0.00000 9.00000
# Looking for outliers in the number of defaults in a 30-59 day period
par(mfrow =c(1,2))
plot(train_set$NumberOfTime60.89DaysPastDueNotWorse , ylab="Number of defaults", main = "# of defaults in a 60-89 day period(training)")
plot(test_set$NumberOfTime60.89DaysPastDueNotWorse, ylab=" Number of defaults", main = "# of defaults in a 60-89 day period(testing)")
# Distribution of defaulters in a 60-89 day period.
p3 <- ggplot(train_set, aes(NumberOfTime60.89DaysPastDueNotWorse,)) + geom_bar() +coord_cartesian(x=c(0,11), ylim = c(0,150000)) + labs(title="Distribution of defaulters in a 60-89 day period in the dataset for training", x="Number of defaults", y="Number of borrowers")
p4 <- ggplot(test_set, aes(NumberOfTime60.89DaysPastDueNotWorse,)) + geom_bar() +
coord_cartesian(x=c(0,8), ylim = c(0,100000)) + labs(title = "Distribution of defaulters in a 60-89 day period in the dataset for testing", x="Number of defaults", y="Number of borrowers")
grid.arrange(p3, p4, ncol=1, nrow=2)
The plots show that most borrowers have between 0 and 2 defaults. With
that in mind, borrowers with 3 or more defaults will be grouped together
to get a more representative category.
summary(train_set$NumberOfTimes90DaysLate)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.07806 0.00000 17.00000
summary(test_set$NumberOfTimes90DaysLate)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.07856 0.00000 18.00000
# Looking for outliers in the number of defaults in a 30-59 day period
par(mfrow =c(1,2))
plot(train_set$NumberOfTimes90DaysLate , ylab="Number of defaults", main = "# of defaults in a 90 day period(training)")
plot(test_set$NumberOfTimes90DaysLate, ylab=" Number of defaults", main = "# of defaults in a 90 day period(testing)")
# Distribution of defaulters in a 90 day period.
p3 <- ggplot(train_set, aes(NumberOfTime60.89DaysPastDueNotWorse,)) + geom_bar() +coord_cartesian(x=c(0,8), ylim = c(0,150000)) + labs(title="Distribution of defaulters in a 90 day period in the dataset for training", x="Number of defaults", y="Number of borrowers")
p4 <- ggplot(test_set, aes(NumberOfTime60.89DaysPastDueNotWorse,)) + geom_bar()+
coord_cartesian(x=c(0,8), ylim = c(0,100000)) + labs(title = "Distribution of defaulters in a 90 day period in the dataset for testing", x="Number of defaults", y="Number of borrowers")
grid.arrange(p3, p4, ncol=1, nrow=2)
The plots show that most borrowers have between 0 and 2 defaults. With
that in mind, borrowers with more than 3 defaults will be grouped
together to get a more representative category.
summary(train_set$NumberOfOpenCreditLinesAndLoans)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 5.000 8.000 8.516 11.000 58.000
summary(test_set$NumberOfOpenCreditLinesAndLoans)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 5.000 8.000 8.519 11.000 85.000
# Distribution of open credit lines and loans open
p3 <- ggplot(train_set, aes(NumberOfOpenCreditLinesAndLoans,)) + geom_bar() +coord_cartesian(x=c(0,58), ylim = c(0,14000)) + labs(title="Distribution of credit lines and loans open", x="Number of credit lines and loans", y="Number of borrowers")
p4 <- ggplot(test_set, aes(NumberOfOpenCreditLinesAndLoans,)) + geom_bar()+
coord_cartesian(x=c(0,85), ylim = c(0,10000)) + labs(title = "Distribution of credit lines and loans open", x="Number of credit lines and loans", y="Number of borrowers")
grid.arrange(p3, p4, ncol=1, nrow=2)
The descriptive statistics shows that only 25% of the borrowers have
more than 11 credit lines or loans open. In order to simplify the
datasets, the maximum value for credit lines or loans is set to 40 and
borrers with 20 or more credit lines are grouped together to get a more
representative category.
# Dropping outliers of the number of credit lines and loans open for the training and testing datasets
train_set <-train_set[train_set$NumberOfOpenCreditLinesAndLoans <=40,]
test_set <- test_set[test_set$NumberOfOpenCreditLinesAndLoans <= 40, ]
# Distribution of open credit lines and loans open
p3 <- ggplot(train_set, aes(NumberOfOpenCreditLinesAndLoans,)) + geom_bar() +coord_cartesian(x=c(0,30), ylim = c(0,15000)) + labs(title="Distribution of credit lines and loans open", x="Number of credit lines and loans open for the training dataset", y="Number of borrowers")
p4 <- ggplot(test_set, aes(NumberOfOpenCreditLinesAndLoans,)) + geom_bar()+
coord_cartesian(x=c(0,30), ylim = c(0,10000)) + labs(title = "Distribution of credit lines and loans open for the testing", x="Number of credit lines and loans open", y="Number of borrowers")
grid.arrange(p3, p4, ncol=1, nrow=2)
summary(train_set$NumberRealEstateLoansOrLines)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 1.000 1.027 2.000 32.000
summary(test_set$NumberRealEstateLoansOrLines)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 1.000 1.023 2.000 29.000
# Distribution of mortgage and real estate loans
p3 <- ggplot(train_set, aes(NumberRealEstateLoansOrLines,)) + geom_bar() +coord_cartesian(x=c(0,18), ylim = c(0,60000)) + labs(title=" Distribution of mortgage and real estate loans for the training dataset", x="Number of mortgage and real estate loans", y="Number of borrowers")
p4 <- ggplot(test_set, aes(NumberRealEstateLoansOrLines,)) + geom_bar()+
coord_cartesian(x=c(0,14), ylim = c(0,40000)) + labs(title = " Distribution of mortgage and real estate loans for testing dataset", x="Number of mortgage and real estate loans", y="Number of borrowers")
grid.arrange(p3, p4, ncol=1, nrow=2)
The descriptive statistics shows that only 25% of the borrowers have more than two mortgage or real state loans open. With that in mind borrowers with 3 or more mortage loan are grouped together to get a more representative category.
summary(train_set$DebtRatio)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 0.2 0.4 356.2 0.9 329664.0
summary(test_set$DebtRatio)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.17 0.36 346.79 0.85 268326.00
For this variables it’s important to keep in mind, that it measures the proportion of debt in comparison to applicant’s income. The median it’s 0.40% for the training set and 36% for the testing set. With that in mind the values 329664 and 268326.00 are clearly errors in the dataset.
train_set <-train_set[train_set$DebtRatio != 329664,] # Dropping extreme value in the dataset for training
test_set <- test_set[test_set$DebtRatio != 268326.00, ] # Dropping extreme value in the dataset for testing
# Outliers of debt ratio for the training and testing datasets
par(mfrow=c(2,1))
boxplot(train_set$DebtRatio,main="Boxplot:Debt Ratio for the training dataset",
xlab="Debt Ratio", horizontal = TRUE)
boxplot(test_set$DebtRatio,main="Boxplot:Debt Ratio for the testing dataset",
xlab="Debt ratio", horizontal = TRUE)
boxplot.stats(train_set$DebtRatio)$stats # Retrieving the outliers of the debt ratio in the dataset for training
## [1] 0.0000000 0.1759553 0.3669957 0.8718086 1.9140893
boxplot.stats(test_set$DebtRatio)$stats # Retrieving the outlier of the debt ratio in the dataset for testing
## [1] 0.0000000 0.1746405 0.3648610 0.8546931 1.8747084
# Dropping outliers of the debt ratio variable for the training and testing datasets
train_set <-train_set[train_set$DebtRatio <= 1.9140893,] # Dropping outliers of debt ratio in the dataset for training
test_set <- test_set[test_set$DebtRatio <= 1.8747084, ] # Dropping outliers of debt ratio in the dataset for testing
par(mfrow=c(2,1))
boxplot(train_set$DebtRatio,main="Boxplot:Debt Ratio for the training dataset",
xlab="Debt Ratio", horizontal = TRUE)
boxplot(test_set$DebtRatio,main="Boxplot:Debt Ratio for the testing dataset",
xlab="Debt ratio", horizontal = TRUE)
As we can see there are still outliers in the dataset so the process is going to be repeated.
# Retrieving outliers for the training and testing datasets
boxplot.stats(train_set$DebtRatio)$stats # Retrieving the outliers of the debt ratio in the dataset for training
## [1] 0.0000000 0.1325928 0.2843993 0.4592201 0.9491500
boxplot.stats(test_set$DebtRatio)$stats # Retrieving the outlier of the debt ratio in the dataset for testing
## [1] 0.0000000 0.1319718 0.2841397 0.4579237 0.9468047
# Dropping outliers of the debt ratio variable for the training and testing datasets
train_set <-train_set[train_set$DebtRatio <= 9491500,] # Dropping outliers of debt ratio in the dataset for training
test_set <- test_set[test_set$DebtRatio <= 0.9468047, ] # Dropping outliers of debt ratio in the dataset for testing
par(mfrow=c(2,1))
boxplot(train_set$DebtRatio,main="Boxplot:Debt Ratio for the training dataset",
xlab="Debt Ratio", horizontal = TRUE)
boxplot(test_set$DebtRatio,main="Boxplot:Debt Ratio for the testing dataset",
xlab="Debt ratio", horizontal = TRUE)
There are still outliers in the dataset so the process is going to be reapeated.
# Retrieving outliers for the training and testing datasets
boxplot.stats(train_set$DebtRatio)$stats # Retrieving the outliers of the debt ratio in the dataset for training
## [1] 0.0000000 0.1325928 0.2843993 0.4592201 0.9491500
boxplot.stats(test_set$DebtRatio)$stats # Retrieving the outlier of the debt ratio in the dataset for testing
## [1] 0.0000000 0.1254341 0.2726197 0.4322744 0.8925144
# Dropping outliers of the debt ratio variable for the training and testing datasets
train_set <-train_set[train_set$DebtRatio <= 0.9491500,] # Dropping outliers of debt ratio in the dataset for training
test_set <- test_set[test_set$DebtRatio <= 0.8925144, ] # Dropping outliers of debt ratio in the dataset for testing
# Checking out that all the outliers were removed.
par(mfrow=c(2,1))
boxplot(train_set$DebtRatio,main="Boxplot:Debt Ratio for the training dataset",
xlab="Debt Ratio", horizontal = TRUE)
boxplot(test_set$DebtRatio,main="Boxplot:Debt Ratio for the testing dataset",
xlab="Debt ratio", horizontal = TRUE)
There are still outliers in the dataset so the process is going to be repeated.
# Retrieving outliers for the training and testing datasets
boxplot.stats(train_set$DebtRatio)$stats # Retrieving the outliers of the debt ratio in the dataset for training
## [1] 0.0000000 0.1253605 0.2727415 0.4328667 0.8941059
boxplot.stats(test_set$DebtRatio)$stats # Retrieving the outlier of the debt ratio in the dataset for testing
## [1] 0.0000000 0.1241627 0.2704673 0.4275277 0.8824974
# Dropping outliers of the debt ratio variable for the training and testing datasets
train_set <-train_set[train_set$DebtRatio <= 0.8941059,] # Dropping outliers of debt ratio in the dataset for training
test_set <- test_set[test_set$DebtRatio <=0.8824974, ] # Dropping outliers of debt ratio in the dataset for testing
# Checking out that all the outliers were removed.
par(mfrow=c(2,1))
boxplot(train_set$DebtRatio,main="Boxplot:Debt Ratio for the training dataset",
xlab="Debt Ratio", horizontal = TRUE)
boxplot(test_set$DebtRatio,main="Boxplot:Debt Ratio for the testing dataset",
xlab="Debt ratio", horizontal = TRUE)
There are still outliers in the dataset so the process is going to be
reapeated.
# Retrieving outliers for the training and testing datasets
boxplot.stats(train_set$DebtRatio)$stats # Retrieving the outliers of the debt ratio in the dataset for training
## [1] 0.0000000 0.1238715 0.2703160 0.4278875 0.8839050
boxplot.stats(test_set$DebtRatio)$stats # Retrieving the outlier of the debt ratio in the dataset for testing
## [1] 0.0000000 0.1239277 0.2699323 0.4266657 0.8806882
# Dropping outliers of the debt ratio variable for the training and testing datasets
train_set <-train_set[train_set$DebtRatio <= 0.8839050,] # Dropping outliers of debt ratio in the dataset for training
test_set <- test_set[test_set$DebtRatio <=0.8806882, ] # Dropping outliers of debt ratio in the dataset for testing
# Checking out that all the outliers were removed.
par(mfrow=c(2,1))
boxplot(train_set$DebtRatio,main="Boxplot:Debt Ratio for the training dataset",
xlab="Debt Ratio", horizontal = TRUE)
boxplot(test_set$DebtRatio,main="Boxplot:Debt Ratio for the testing dataset",
xlab="Debt ratio", horizontal = TRUE)
There are still outliers in the dataset so the process is going to be reapeated one more time.
# Retrieving outliers for the training and testing datasets
boxplot.stats(train_set$DebtRatio)$stats # Retrieving the outliers of the debt ratio in the dataset for training
## [1] 0.0000000 0.1235546 0.2697925 0.4268852 0.8818719
boxplot.stats(test_set$DebtRatio)$stats # Retrieving the outlier of the debt ratio in the dataset for testing
## [1] 0.0000000 0.1238924 0.2698811 0.4265703 0.8805249
# Dropping outliers of the debt ratio variable for the training and testing datasets
train_set <-train_set[train_set$DebtRatio <= 0.8,] # Dropping outliers of debt ratio in the dataset for training
test_set <- test_set[test_set$DebtRatio <= 0.8, ] # Dropping outliers of debt ratio in the dataset for testing
# Checking out that all the outliers were removed.
par(mfrow=c(2,1))
boxplot(train_set$DebtRatio,main="Boxplot:Debt Ratio for the training dataset",
xlab="Debt Ratio", horizontal = TRUE)
boxplot(test_set$DebtRatio,main="Boxplot:Debt Ratio for the testing dataset",
xlab="Debt ratio", horizontal = TRUE)
summary(train_set$MonthlyIncome) # Basic statistics of the monthly income in the dataset for training
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 3750 5750 7078 8555 3008750 1520
summary(test_set$MonthlyIncome) # Basic statistics of the monthly income in the dataset for testing
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 3750 5750 7276 8500 7727000 1035
The descriptive statistics show that the average monthly income is $7078 for the training dataset and $7276 for the testing dataset. Only 25% of the borrowers have a monthly income greater than $8555 for the training dataset and $8500 for the testing dataset, respectively. There are missing values for this variable and since we have a lot of observations it seems reasonable to drop these observations before applying any sophisticated imputation method.
train_set <- train_set %>% filter(!is.na(MonthlyIncome)) # Droping NA'S in the dataset for training
test_set <- test_set %>% filter(!is.na(MonthlyIncome)) # Dropping NA'S in the dataset for testing
summary(train_set$MonthlyIncome) # Basic statistics of the monthly income in the dataset for training
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 3750 5750 7078 8555 3008750
summary(test_set$MonthlyIncome) # Basic statistics of the monthly income in the dataset for testing
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 3750 5750 7276 8500 7727000
# Looking for outliers in the monthly income for the training and testing datasets
par(mfrow=c(2,1))
boxplot(train_set$MonthlyIncome,main="Boxplot:Monthly Income for the training dataset",
xlab="Monthly Income", horizontal = TRUE)
boxplot(test_set$MonthlyIncome,main="Boxplot:Monthly Income for the testing dataset",
xlab="Monthly Income", horizontal = TRUE)
boxplot.stats(train_set$MonthlyIncome)$stats
## [1] 0 3750 5750 8555 15761
boxplot.stats(test_set$MonthlyIncome)$stats
## [1] 0 3750 5750 8500 15623
# Dropping outliers of the monthly income variable for the training and testing datasets
train_set <-train_set[train_set$MonthlyIncome<= 15761,] # Dropping outliers of the monthly income in the dataset for training
test_set <- test_set[test_set$MonthlyIncome <= 15623, ] # Dropping outliers of the monthly income in the dataset for testing
# Checking out that all the outliers were removed.
par(mfrow=c(2,1))
boxplot(train_set$MonthlyIncome,main="Boxplot:Monthly Income for the training dataset",
xlab="Monthly Income", horizontal = TRUE)
boxplot(test_set$MonthlyIncome,main="Boxplot:Monthly Income for the testing dataset",
xlab="Monthly Income", horizontal = TRUE)
There are missing values so the process is reapeated.
boxplot.stats(train_set$MonthlyIncome)$stats
## [1] 0 3640 5500 8083 14746
boxplot.stats(test_set$MonthlyIncome)$stats
## [1] 0 3650 5500 8024 14584
# Dropping outliers of the monthly income variable for the training and testing datasets
train_set <-train_set[train_set$MonthlyIncome<= 14746,] # Dropping outliers of debt ratio in the dataset for training
test_set <- test_set[test_set$MonthlyIncome <= 14584, ] # Dropping outliers of debt ratio in the dataset for testing
# Checking out that all the outliers were removed.
par(mfrow=c(2,1))
boxplot(train_set$MonthlyIncome,main="Boxplot:Monthly Income for the training dataset",
xlab="Monthly Income", horizontal = TRUE)
boxplot(test_set$MonthlyIncome,main="Boxplot:Monthly Income for the testing dataset",
xlab="Monthly Income", horizontal = TRUE)
There are missing values so the process is reapeated.
boxplot.stats(train_set$MonthlyIncome)$stats
## [1] 0 3600 5500 8000 14600
boxplot.stats(test_set$MonthlyIncome)$stats
## [1] 0 3600 5500 8000 14584
# Dropping outliers of the monthly income variable for the training and testing datasets
train_set <-train_set[train_set$MonthlyIncome<= 14600,] # Dropping outliers of monthly income in the dataset for training
test_set <- test_set[test_set$MonthlyIncome <= 14584, ] # Dropping outliers of monthly income in the dataset for testing
# Checking out that all the outliers were removed.
par(mfrow=c(2,1))
boxplot(train_set$MonthlyIncome,main="Boxplot:Monthly Income for the training dataset",
xlab="Monthly Income", horizontal = TRUE)
boxplot(test_set$MonthlyIncome,main="Boxplot:Monthly Income for the testing dataset",
xlab="Monthly Income", horizontal = TRUE)
summary(train_set$NumberOfDependents)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.8267 1.0000 20.0000
summary(test_set$NumberOfDependents)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.8463 2.0000 43.0000
# Distribution of number of dependents for the training and testing datasets
p3 <- ggplot(train_set, aes(NumberOfDependents,)) + geom_bar() +coord_cartesian(x=c(0,20), ylim = c(0,50000)) + labs(title=" Distribution of the borrower's dependents", x="Number of dependents", y="Number of borrowers")
p4 <- ggplot(test_set, aes(NumberOfDependents,)) + geom_bar() +
coord_cartesian(x=c(0,20), ylim = c(0,40000)) + labs(title = " Distribution of the borrower's dependents", x="Number of dependets", y="Number of borrowers")
grid.arrange(p3, p4, ncol=1, nrow=2)
to simplify the analisis the number of depedents is fixed to 5 for both
datasets.
# Dropping outliers of the number of dependents for the training and testing datasets
train_set <-train_set[train_set$NumberOfDependents<= 5,]
test_set <- test_set[test_set$NumberOfDependents <= 5, ]
# Distribution of number of dependents for the training and testing datasets
p3 <- ggplot(train_set, aes(NumberOfDependents,)) + geom_bar() +coord_cartesian(x=c(0,5), ylim = c(0,60000)) + labs(title=" Distribution of the borrower's dependents", x="Number of dependents", y="Number of borrowers")
p4 <- ggplot(test_set, aes(NumberOfDependents,)) + geom_bar() +
coord_cartesian(x=c(0,5), ylim = c(0,40000)) + labs(title = " Distribution of the borrower's dependents", x="Number of dependets", y="Number of borrowers")
grid.arrange(p3, p4, ncol=1, nrow=2)
# Proportion of defaults in the dataset for training
ggplot(data = train_set, aes(x=factor(SeriousDlqin2yrs),y = prop.table(stat(count)), fill = factor(SeriousDlqin2yrs),label = scales::percent(prop.table(stat(count))))) + geom_bar(position = "dodge") + geom_text(stat="count",
position = position_dodge(0.9),vjust=-0.5,size = 3) + scale_x_discrete(labels=c("No Default", "Default")) +scale_y_continuous(labels = scales::percent) +ggtitle("Proportion of Credit Defaults") + labs(x="", y = "Proportion of Borrowers")
As we can the there a huge class imbalance in the dataset, taking that into account let’s see if the model can generalize correctly without applying any method for imbalance data.
# Distribution of borrowers by age
p5<- ggplot(data=train_set[train_set$SeriousDlqin2yrs==0,], aes(x=age)) +
geom_bar(aes(fill='No Default'), show.legend = FALSE) +
labs(title='Distribution of borrowers by age for no defaulters', y = 'Number of borrowers') + theme_classic()
p6<-ggplot(data=train_set[train_set$SeriousDlqin2yrs==1,], aes(x=age)) +
geom_bar(aes(fill='Default'), show.legend=FALSE) +
labs(title='Distribution of borrowers by age for defaulters', y = 'Number of borrowers') +
theme_classic()
grid.arrange(p5, p6, ncol=1, nrow=2)
The plots show that defaulters are concentrated in the left side of the distribution, we could expect that age could be a predictor of whether a borrower will default or not.
ggplot(train_set, aes(x=factor(SeriousDlqin2yrs), y=RevolvingUtilizationOfUnsecuredLines)) + geom_boxplot() + coord_cartesian(ylim = c(0,2)) + labs(title = "Box Plot: Revoling Utilization by Type of Borrower")
It seems that median of the Revolving Utilization is different between defaulters and no defaulters. Of course a median test could be more appropriate to taste if there’s any difference statistically significant.
# Distribution of borrowers by number of defaults in a 30-59 day period
p5<- ggplot(data=train_set[train_set$SeriousDlqin2yrs==0,], aes(x=NumberOfTime30.59DaysPastDueNotWorse)) +
geom_bar(aes(fill='No Default'), show.legend = FALSE) +
labs(title='Number of defaults in a 30-59 day period for defaulters', y = 'Number of borrowers', x="Number of defaults") + theme_classic()
p6<-ggplot(data=train_set[train_set$SeriousDlqin2yrs==1,], aes(x=NumberOfTime30.59DaysPastDueNotWorse)) +
geom_bar(aes(fill='Default'), show.legend=FALSE) +
labs(title='Number of defaults in a 30-59 day period for no defaulters', y = 'Number of borrowers', x= "Number of defaults") +
theme_classic()
grid.arrange(p5, p6, ncol=1, nrow=2)
# Distribution of borrowers by number of defaults in a 60-89 day period
p5<- ggplot(data=train_set[train_set$SeriousDlqin2yrs==0,], aes(x=NumberOfTime60.89DaysPastDueNotWorse)) +
geom_bar(aes(fill='No Default'), show.legend = FALSE) +
labs(title='Number of defaults in a 60-89 day period for defaulters', y = 'Number of borrowers', x="Number of defaults") + theme_classic()
p6<-ggplot(data=train_set[train_set$SeriousDlqin2yrs==1,], aes(x=NumberOfTime60.89DaysPastDueNotWorse)) +
geom_bar(aes(fill='Default'), show.legend=FALSE) +
labs(title='Number of defaults in a 60-89 day period for defaulters', y = 'Number of borrowers', x= "Number of defaults") +
theme_classic()
grid.arrange(p5, p6, ncol=1, nrow=2)
# Distribution of borrowers by number of defaults in a 30-59 day period
p5<- ggplot(data=train_set[train_set$SeriousDlqin2yrs==0,], aes(x=NumberOfTimes90DaysLate)) +
geom_bar(aes(fill='No Default'), show.legend = FALSE) +
labs(title='Number of defaults in a 90 day period for defaulters', y = 'Number of borrowers', x="Number of defaults") + theme_classic()
p6<-ggplot(data=train_set[train_set$SeriousDlqin2yrs==1,], aes(x=NumberOfTimes90DaysLate)) +
geom_bar(aes(fill='Default'), show.legend=FALSE) +
labs(title='Number of defaults in a 90 day period for no defaulters', y = 'Number of borrowers', x= "Number of defaults") +
theme_classic()
grid.arrange(p5, p6, ncol=1, nrow=2)
# Visualizing the correlations between variables
correlations <- cor(train_set, method="pearson")
corrplot(correlations, number.cex = .9, method = "color", type = "full", tl.cex=0.8,tl.col = "black")
# Transforming the target variable to a factor
train_set$SeriousDlqin2yrs <- factor(train_set$SeriousDlqin2yrs, levels = c(0,1), labels = c("No Default", "Default"))
# Double Cheking that the category of reference is Default
contrasts(train_set$SeriousDlqin2yrs)
## Default
## No Default 0
## Default 1
# Transforming the age to a categorical variable in the training dataset
train_set$age_category <- with(train_set,
ifelse(age >= 21 & age < 30, "(20+ to 29+)",
ifelse(age >= 30 & age < 40, "(30+ to 39+)",
ifelse(age >= 40 & age < 50, "(40+ to 49+)",
ifelse(age >= 50 & age < 60, "(50+ to 59+)",
"60+ years")))))
train_set$age_category <- as.factor(train_set$age_category)
# Transforming the age to a categorical variable in the testing dataset
test_set$age_category <- with(test_set,
ifelse(age >= 21 & age < 30, "(20+ to 29+)",
ifelse(age >= 30 & age < 40, "(30+ to 39+)",
ifelse(age >= 40 & age < 50, "(40+ to 49+)",
ifelse(age >= 50 & age < 60, "(50+ to 59+)",
"60+ years")))))
test_set$age_category <- as.factor(test_set$age_category)
# Setting the category of reference for age
test_set$age_category <- relevel(test_set$age_category, ref="60+ years")
train_set$age_category <- relevel(train_set$age_category, ref="60+ years")
# Simplifying the datasets
train_set$age <- NULL
test_set$age <- NULL
# Transforming the NumberOfTime30.59DaysPastDueNotWorse to a categorical variable in the training dataset
train_set$defaults_30_59_days <- with(train_set,
ifelse(NumberOfTime30.59DaysPastDueNotWorse == 0 , "(Zero)",
ifelse(NumberOfTime30.59DaysPastDueNotWorse > 0 & NumberOfTime30.59DaysPastDueNotWorse <= 1, "(One)",
ifelse(NumberOfTime30.59DaysPastDueNotWorse > 1 & NumberOfTime30.59DaysPastDueNotWorse <= 2, "(Two)",
"(Tree or more)"))))
train_set$defaults_30_59_days <- as.factor(train_set$defaults_30_59_days)
# Transforming the NumberOfTime30.59DaysPastDueNotWorse to a categorical variable in the training dataset
test_set$defaults_30_59_days <- with(test_set,
ifelse(NumberOfTime30.59DaysPastDueNotWorse == 0 , "(Zero)",
ifelse(NumberOfTime30.59DaysPastDueNotWorse > 0 & NumberOfTime30.59DaysPastDueNotWorse <= 1, "(One)",
ifelse(NumberOfTime30.59DaysPastDueNotWorse > 1 & NumberOfTime30.59DaysPastDueNotWorse <= 2, "(Two)",
"(Tree or more)"))))
test_set$defaults_30_59_days <- as.factor(test_set$defaults_30_59_days)
train_set$defaults_30_59_days <- relevel(train_set$defaults_30_59_days, ref="(Zero)")
test_set$defaults_30_59_days <- relevel(test_set$defaults_30_59_days, ref="(Zero)")
# Simplifying the dataset
train_set$NumberOfTime30.59DaysPastDueNotWorse <- NULL
test_set$NumberOfTime30.59DaysPastDueNotWorse <- NULL
# Transforming the NumberOfTime60.89DaysPastDueNotWorse to a categorical variable in the training dataset
train_set$defaults_60_89_days <- with(train_set,
ifelse(NumberOfTime60.89DaysPastDueNotWorse == 0 , "(Zero)",
ifelse(NumberOfTime60.89DaysPastDueNotWorse > 0 & NumberOfTime60.89DaysPastDueNotWorse <= 1, "(One)",
ifelse(NumberOfTime60.89DaysPastDueNotWorse > 1 & NumberOfTime60.89DaysPastDueNotWorse <= 2, "(Two)",
"(Tree or More)"))))
train_set$defaults_60_89_days <- as.factor(train_set$defaults_60_89_days)
# Transforming the NumberOfTime60.89DaysPastDueNotWorse to a categorical variable in the testing dataset
test_set$defaults_60_89_days <- with(test_set,
ifelse(NumberOfTime60.89DaysPastDueNotWorse == 0 , "(Zero)",
ifelse(NumberOfTime60.89DaysPastDueNotWorse > 0 & NumberOfTime60.89DaysPastDueNotWorse <= 1, "(One)",
ifelse(NumberOfTime60.89DaysPastDueNotWorse > 1 & NumberOfTime60.89DaysPastDueNotWorse <= 2, "(Two)",
"(Tree or More)"))))
test_set$defaults_60_89_days <- as.factor(test_set$defaults_60_89_days)
# Setting the category of reference for the number of defaults in a 60-89 day period
train_set$defaults_60_89_days <- relevel(train_set$defaults_60_89_days, ref="(Zero)")
test_set$defaults_60_89_days <- relevel(test_set$defaults_60_89_days, ref="(Zero)")
# Simplifying the dataset
train_set$NumberOfTime60.89DaysPastDueNotWorse <- NULL
test_set$ NumberOfTime60.89DaysPastDueNotWorse <- NULL
# Transforming the NumberOfTimes90DaysLate to a categorical variable in the training dataset
train_set$defaults_90_days <- with(train_set,
ifelse(NumberOfTimes90DaysLate == 0 , "(Zero)",
ifelse(NumberOfTimes90DaysLate > 0 & NumberOfTimes90DaysLate <= 1, "(One)",
ifelse(NumberOfTimes90DaysLate > 1 & NumberOfTimes90DaysLate <= 2, "(Two)",
"(Tree or more)"))))
train_set$defaults_90_days <- as.factor(train_set$defaults_90_days )
# Transforming the NumberOfTimes90DaysLate to a categorical variable in the testing dataset
test_set$defaults_90_days <- with(test_set,
ifelse(NumberOfTimes90DaysLate == 0 , "(Zero)",
ifelse(NumberOfTimes90DaysLate > 0 & NumberOfTimes90DaysLate <= 1, "(One)",
ifelse(NumberOfTimes90DaysLate > 1 & NumberOfTimes90DaysLate <= 2, "(Two)",
"(Tree or more)"))))
test_set$defaults_90_days <- as.factor(test_set$defaults_90_days )
# Setting the category of reference for the number of defaults in a 90 day period
train_set$defaults_90_days <- relevel(train_set$defaults_90_days, ref="(Zero)")
test_set$defaults_90_days <- relevel(test_set$defaults_90_days, ref="(Zero)")
# Simplifying the dataset
train_set$NumberOfTimes90DaysLate <- NULL
test_set$NumberOfTimes90DaysLate <- NULL
# Transforming the NumberOfDependents to a categorical variable in the training dataset
train_set$Dependents <- with(train_set,
ifelse(NumberOfDependents == 0 , "(Zero)",
ifelse(NumberOfDependents > 0 & NumberOfDependents <= 1, "(One)",
ifelse(NumberOfDependents > 1 & NumberOfDependents <= 2, "(Two)",
"(Tree or more)"))))
train_set$Dependents <- as.factor(train_set$Dependents)
# Transforming the NumberOfDependents to a categorical variable in the testing dataset
test_set$Dependents <- with(test_set,
ifelse(NumberOfDependents == 0 , "(Zero)",
ifelse(NumberOfDependents > 0 & NumberOfDependents <= 1, "(One)",
ifelse(NumberOfDependents > 1 & NumberOfDependents <= 2, "(Two)",
"(Tree or more)"))))
test_set$Dependents <- as.factor(test_set$Dependents)
# Setting the category of reference for the number of dependents
train_set$Dependents <- relevel(train_set$Dependents, ref="(Zero)")
test_set$Dependents <- relevel(test_set$Dependents, ref="(Zero)")
# Simplifying the dataset
train_set$NumberOfDependents <- NULL
test_set$NumberOfDependents <- NULL
# Transforming the NumberRealEstateLoansOrLines to a categorical variable in the training dataset
train_set$Real_Estate_Loans <- with(train_set,
ifelse(NumberRealEstateLoansOrLines == 0 , "(Zero)",
ifelse(NumberRealEstateLoansOrLines > 0 & NumberRealEstateLoansOrLines <= 3, "(One to Tree)",
"(Four or more)")))
train_set$Real_Estate_Loans <- as.factor(train_set$Real_Estate_Loans)
# Transforming the NumberRealEstateLoansOrLines to a categorical variable in the testing dataset
test_set$Real_Estate_Loans <- with(test_set,
ifelse(NumberRealEstateLoansOrLines == 0 , "(Zero)",
ifelse(NumberRealEstateLoansOrLines > 0 & NumberRealEstateLoansOrLines <= 3, "(One to Tree)",
"(Four or more)")))
test_set$Real_Estate_Loans <- as.factor(test_set$Real_Estate_Loans)
# Setting the category of reference for the number of real estate loans
train_set$Real_Estate_Loans <- relevel(train_set$Real_Estate_Loans, ref="(Zero)")
test_set$Real_Estate_Loans <- relevel(test_set$Real_Estate_Loans, ref="(Zero)")
# Simplifying the dataset
train_set$NumberRealEstateLoansOrLines <- NULL
test_set$NumberRealEstateLoansOrLines <- NULL
# Transforming the NumberOfOpenCreditLinesAndLoans to a categorical variable in the training dataset
train_set$Credit_or_Loans <- with(train_set,
ifelse(NumberOfOpenCreditLinesAndLoans >= 0 & NumberOfOpenCreditLinesAndLoans <= 11, "(0-11)",
ifelse(NumberOfOpenCreditLinesAndLoans >11 & NumberOfOpenCreditLinesAndLoans<= 21, "(11-21)",
"(22 or more)")))
train_set$Credit_or_Loans <- as.factor(train_set$Credit_or_Loans)
# Transforming the NumberOfOpenCreditLinesAndLoans to a categorical variable in the testing dataset
test_set$Credit_or_Loans <- with(test_set,
ifelse(NumberOfOpenCreditLinesAndLoans >= 0 & NumberOfOpenCreditLinesAndLoans <= 11, "(0-11)",
ifelse(NumberOfOpenCreditLinesAndLoans >11 & NumberOfOpenCreditLinesAndLoans<= 21, "(11-21)",
"(22 or more)")))
test_set$Credit_or_Loans <- as.factor(test_set$Credit_or_Loans)
# Setting the category of reference for the number of credit lines
train_set$Credit_or_Loans <- relevel(train_set$Credit_or_Loans, ref="(0-11)")
test_set$Credit_or_Loans <- relevel(test_set$Credit_or_Loans, ref="(0-11)")
# Simplifying the dataset
train_set$NumberOfOpenCreditLinesAndLoans <- NULL
test_set$NumberOfOpenCreditLinesAndLoans <- NULL
# Double Checking that the data is complete
colSums(is.na(train_set))
## SeriousDlqin2yrs RevolvingUtilizationOfUnsecuredLines
## 0 0
## DebtRatio MonthlyIncome
## 0 0
## age_category defaults_30_59_days
## 0 0
## defaults_60_89_days defaults_90_days
## 0 0
## Dependents Real_Estate_Loans
## 0 0
## Credit_or_Loans
## 0
# Retrieving the same of observations for test label
names(test_labels) <- c("X", "Probability" )
test_labels_clean <- dplyr::left_join(test_set, test_labels, by = "X")
test_labels_clean <- test_labels_clean %>% dplyr::select(X, Probability )
str(test_labels_clean)
## 'data.frame': 68314 obs. of 2 variables:
## $ X : int 1 2 3 5 6 8 11 13 14 15 ...
## $ Probability: num 0.0808 0.0407 0.012 0.1083 0.0275 ...
model1_complete <- glm(SeriousDlqin2yrs~., data = train_set, family = "binomial" )
summary(model1_complete)
##
## Call:
## glm(formula = SeriousDlqin2yrs ~ ., family = "binomial", data = train_set)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2745 -0.3080 -0.2115 -0.1662 3.1660
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.421e+00 5.885e-02 -75.131 < 2e-16 ***
## RevolvingUtilizationOfUnsecuredLines 1.609e+00 4.482e-02 35.897 < 2e-16 ***
## DebtRatio 8.408e-01 9.915e-02 8.480 < 2e-16 ***
## MonthlyIncome -3.530e-05 6.321e-06 -5.585 2.34e-08 ***
## age_category(20+ to 29+) 6.238e-01 6.267e-02 9.953 < 2e-16 ***
## age_category(30+ to 39+) 4.404e-01 5.078e-02 8.673 < 2e-16 ***
## age_category(40+ to 49+) 4.106e-01 4.908e-02 8.366 < 2e-16 ***
## age_category(50+ to 59+) 2.936e-01 4.913e-02 5.977 2.28e-09 ***
## defaults_30_59_days(One) 7.529e-01 3.767e-02 19.984 < 2e-16 ***
## defaults_30_59_days(Tree or more) 1.417e+00 6.287e-02 22.544 < 2e-16 ***
## defaults_30_59_days(Two) 1.133e+00 5.454e-02 20.767 < 2e-16 ***
## defaults_60_89_days(One) 9.281e-01 4.737e-02 19.593 < 2e-16 ***
## defaults_60_89_days(Tree or More) 1.205e+00 1.532e-01 7.867 3.63e-15 ***
## defaults_60_89_days(Two) 1.201e+00 9.687e-02 12.399 < 2e-16 ***
## defaults_90_days(One) 1.358e+00 4.641e-02 29.262 < 2e-16 ***
## defaults_90_days(Tree or more) 2.127e+00 8.693e-02 24.466 < 2e-16 ***
## defaults_90_days(Two) 1.715e+00 7.662e-02 22.388 < 2e-16 ***
## Dependents(One) 6.072e-02 3.925e-02 1.547 0.12186
## Dependents(Tree or more) 1.948e-01 4.977e-02 3.914 9.07e-05 ***
## Dependents(Two) 7.975e-02 4.418e-02 1.805 0.07106 .
## Real_Estate_Loans(Four or more) 3.375e-01 1.197e-01 2.820 0.00480 **
## Real_Estate_Loans(One to Tree) -2.505e-01 4.445e-02 -5.636 1.74e-08 ***
## Credit_or_Loans(11-21) 3.187e-01 4.045e-02 7.879 3.31e-15 ***
## Credit_or_Loans(22 or more) 3.810e-01 1.173e-01 3.248 0.00116 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 45032 on 100609 degrees of freedom
## Residual deviance: 34861 on 100586 degrees of freedom
## AIC: 34909
##
## Number of Fisher Scoring iterations: 6
# Confusion Matrix
threshold <- 0.5
pred_glm <- predict(model1_complete, newdata = test_set, type = 'response')
pred <- ifelse(pred_glm > threshold, 1, 0)
test_class <- ifelse(test_labels_clean$Probability > threshold, 1, 0)
pred[is.na(pred)] <- 1
observ_glm <- as.factor(ifelse(test = test_class == 1, yes = 'Default', no = 'No_Default'))
predic_glm <- as.factor(ifelse(test = pred == 1, yes = 'Default', no = 'No_Default'))
matriz_glm <- confusionMatrix(predic_glm, observ_glm, positive='Default')
matriz_glm
## Confusion Matrix and Statistics
##
## Reference
## Prediction Default No_Default
## Default 621 442
## No_Default 75 67176
##
## Accuracy : 0.9924
## 95% CI : (0.9918, 0.9931)
## No Information Rate : 0.9898
## P-Value [Acc > NIR] : 5.476e-13
##
## Kappa : 0.7024
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.89224
## Specificity : 0.99346
## Pos Pred Value : 0.58420
## Neg Pred Value : 0.99888
## Prevalence : 0.01019
## Detection Rate : 0.00909
## Detection Prevalence : 0.01556
## Balanced Accuracy : 0.94285
##
## 'Positive' Class : Default
##
model2 <- glm(SeriousDlqin2yrs~MonthlyIncome+defaults_60_89_days+Real_Estate_Loans+RevolvingUtilizationOfUnsecuredLines+age_category+defaults_90_days+Credit_or_Loans+DebtRatio+defaults_30_59_days, data = train_set, family = "binomial" )
summary(model2)
##
## Call:
## glm(formula = SeriousDlqin2yrs ~ MonthlyIncome + defaults_60_89_days +
## Real_Estate_Loans + RevolvingUtilizationOfUnsecuredLines +
## age_category + defaults_90_days + Credit_or_Loans + DebtRatio +
## defaults_30_59_days, family = "binomial", data = train_set)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2635 -0.3083 -0.2115 -0.1665 3.1519
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.428e+00 5.853e-02 -75.643 < 2e-16 ***
## MonthlyIncome -3.217e-05 6.260e-06 -5.140 2.75e-07 ***
## defaults_60_89_days(One) 9.287e-01 4.736e-02 19.610 < 2e-16 ***
## defaults_60_89_days(Tree or More) 1.209e+00 1.531e-01 7.894 2.93e-15 ***
## defaults_60_89_days(Two) 1.205e+00 9.689e-02 12.433 < 2e-16 ***
## Real_Estate_Loans(Four or more) 3.325e-01 1.197e-01 2.777 0.00548 **
## Real_Estate_Loans(One to Tree) -2.471e-01 4.440e-02 -5.564 2.63e-08 ***
## RevolvingUtilizationOfUnsecuredLines 1.615e+00 4.475e-02 36.076 < 2e-16 ***
## age_category(20+ to 29+) 6.336e-01 6.257e-02 10.126 < 2e-16 ***
## age_category(30+ to 39+) 4.774e-01 4.958e-02 9.629 < 2e-16 ***
## age_category(40+ to 49+) 4.617e-01 4.684e-02 9.856 < 2e-16 ***
## age_category(50+ to 59+) 3.179e-01 4.857e-02 6.545 5.94e-11 ***
## defaults_90_days(One) 1.362e+00 4.638e-02 29.364 < 2e-16 ***
## defaults_90_days(Tree or more) 2.132e+00 8.687e-02 24.538 < 2e-16 ***
## defaults_90_days(Two) 1.722e+00 7.657e-02 22.490 < 2e-16 ***
## Credit_or_Loans(11-21) 3.161e-01 4.044e-02 7.818 5.36e-15 ***
## Credit_or_Loans(22 or more) 3.720e-01 1.173e-01 3.172 0.00151 **
## DebtRatio 8.524e-01 9.897e-02 8.613 < 2e-16 ***
## defaults_30_59_days(One) 7.558e-01 3.766e-02 20.071 < 2e-16 ***
## defaults_30_59_days(Tree or more) 1.419e+00 6.288e-02 22.564 < 2e-16 ***
## defaults_30_59_days(Two) 1.136e+00 5.452e-02 20.832 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 45032 on 100609 degrees of freedom
## Residual deviance: 34877 on 100589 degrees of freedom
## AIC: 34919
##
## Number of Fisher Scoring iterations: 6
# Confusion Matrix
threshold <- 0.5
pred_glm <- predict(model2, newdata = test_set, type = 'response')
pred <- ifelse(pred_glm > threshold, 1, 0)
test_class <- ifelse(test_labels_clean$Probability > threshold, 1, 0)
pred[is.na(pred)] <- 1
observ_glm <- as.factor(ifelse(test = test_class == 1, yes = 'Default', no = 'No_Default'))
predic_glm <- as.factor(ifelse(test = pred == 1, yes = 'Default', no = 'No_Default'))
matriz_glm <- confusionMatrix(predic_glm, observ_glm, positive='Default')
matriz_glm
## Confusion Matrix and Statistics
##
## Reference
## Prediction Default No_Default
## Default 620 444
## No_Default 76 67174
##
## Accuracy : 0.9924
## 95% CI : (0.9917, 0.993)
## No Information Rate : 0.9898
## P-Value [Acc > NIR] : 1.353e-12
##
## Kappa : 0.7009
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.890805
## Specificity : 0.993434
## Pos Pred Value : 0.582707
## Neg Pred Value : 0.998870
## Prevalence : 0.010188
## Detection Rate : 0.009076
## Detection Prevalence : 0.015575
## Balanced Accuracy : 0.942119
##
## 'Positive' Class : Default
##
model3 <- glm(SeriousDlqin2yrs~MonthlyIncome+defaults_60_89_days+RevolvingUtilizationOfUnsecuredLines+age_category+defaults_90_days+Credit_or_Loans+DebtRatio+defaults_30_59_days, data = train_set, family = "binomial" )
summary(model3)
##
## Call:
## glm(formula = SeriousDlqin2yrs ~ MonthlyIncome + defaults_60_89_days +
## RevolvingUtilizationOfUnsecuredLines + age_category + defaults_90_days +
## Credit_or_Loans + DebtRatio + defaults_30_59_days, family = "binomial",
## data = train_set)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2335 -0.3076 -0.2129 -0.1686 3.1412
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.451e+00 5.835e-02 -76.284 < 2e-16 ***
## MonthlyIncome -4.119e-05 5.662e-06 -7.274 3.48e-13 ***
## defaults_60_89_days(One) 9.293e-01 4.732e-02 19.636 < 2e-16 ***
## defaults_60_89_days(Tree or More) 1.208e+00 1.530e-01 7.896 2.87e-15 ***
## defaults_60_89_days(Two) 1.200e+00 9.685e-02 12.388 < 2e-16 ***
## RevolvingUtilizationOfUnsecuredLines 1.656e+00 4.397e-02 37.662 < 2e-16 ***
## age_category(20+ to 29+) 6.865e-01 6.222e-02 11.033 < 2e-16 ***
## age_category(30+ to 39+) 4.994e-01 4.941e-02 10.109 < 2e-16 ***
## age_category(40+ to 49+) 4.641e-01 4.682e-02 9.912 < 2e-16 ***
## age_category(50+ to 59+) 3.107e-01 4.854e-02 6.400 1.55e-10 ***
## defaults_90_days(One) 1.367e+00 4.635e-02 29.490 < 2e-16 ***
## defaults_90_days(Tree or more) 2.145e+00 8.690e-02 24.685 < 2e-16 ***
## defaults_90_days(Two) 1.733e+00 7.653e-02 22.641 < 2e-16 ***
## Credit_or_Loans(11-21) 3.327e-01 4.022e-02 8.273 < 2e-16 ***
## Credit_or_Loans(22 or more) 4.216e-01 1.168e-01 3.611 0.000305 ***
## DebtRatio 5.685e-01 7.776e-02 7.311 2.66e-13 ***
## defaults_30_59_days(One) 7.590e-01 3.764e-02 20.167 < 2e-16 ***
## defaults_30_59_days(Tree or more) 1.414e+00 6.275e-02 22.539 < 2e-16 ***
## defaults_30_59_days(Two) 1.136e+00 5.450e-02 20.848 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 45032 on 100609 degrees of freedom
## Residual deviance: 34937 on 100591 degrees of freedom
## AIC: 34975
##
## Number of Fisher Scoring iterations: 6
# Confusion Matrix
threshold <- 0.5
pred_glm <- predict(model3, newdata = test_set, type = 'response')
pred <- ifelse(pred_glm > threshold, 1, 0)
test_class <- ifelse(test_labels_clean$Probability > threshold, 1, 0)
pred[is.na(pred)] <- 1
observ_glm <- as.factor(ifelse(test = test_class == 1, yes = 'Default', no = 'No_Default'))
predic_glm <- as.factor(ifelse(test = pred == 1, yes = 'Default', no = 'No_Default'))
matriz_glm <- confusionMatrix(predic_glm, observ_glm, positive='Default')
matriz_glm
## Confusion Matrix and Statistics
##
## Reference
## Prediction Default No_Default
## Default 626 443
## No_Default 70 67175
##
## Accuracy : 0.9925
## 95% CI : (0.9918, 0.9931)
## No Information Rate : 0.9898
## P-Value [Acc > NIR] : 1.596e-13
##
## Kappa : 0.7057
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.899425
## Specificity : 0.993448
## Pos Pred Value : 0.585594
## Neg Pred Value : 0.998959
## Prevalence : 0.010188
## Detection Rate : 0.009164
## Detection Prevalence : 0.015648
## Balanced Accuracy : 0.946437
##
## 'Positive' Class : Default
##
Logistic Regression is appropriate to calculate the probabilities of default, however, it is necessary to have a clean dataset because this method is sensible to outliers. Cleaning a large dataset can be time consuming and for that reason a method like decision trees could be more appropriate.
The best model seems to be model4. This model is very good at predicting defaults and no defaults. The specificity of this model is about 99% with a sensitivity of 89%. The kappa coefficient indicates that the model has a very good agreement between classes considering that the dataset is unbalanced.
The age it’s important to predict whether a borrower will default or not, we see that the probability of default decreases as the borrower becomes older.
We see that if a borrower has defaulted in the past, the probability of default increases.
On the other hand, as the monthly income increases the probability of default decreases.
Finally, as the borrowers gets into debt, the probability of default increases as well.