the following is a step by step procedure for predicting credit score in using Random Forest Algorithm in R. We first talk about the random forest algorithm and how it works. We then start with analyzing the data and tune it to better suit our solution.
This is an R Markdown Notebook. When you execute code within the notebook, the results appear beneath the code.
Try executing this chunk by clicking the Run button within the chunk or by placing your cursor inside it and pressing Ctrl+Shift+Enter.
Using Random Forest For Credit Scoring: “Improve on the state of the art in credit scoring by predicting the probability that somebody will experience financial distress in the next two years.”
We first set the working directory and upload the test and train files into the local R database.
Random forest is a type of decision tree algorithm. A decision tree algorithm is a supervised learning algorithm which has a predefined target variable that is mostly used in classification problems. It works for both categorical and continuous input and output variables and thus is the panacea of the data science problem. R has a package called RandomForest which makes life pretty easy to solve such a problem but as it is with Data Science Problems, the true zest lies in tuning and woking with the data to come to a point where randomForest can efficiently solve it. Here in this problem the code is quite readable but having said that, I have done my best to go step byu step to explain each procedure.
Add a new chunk by clicking the Insert Chunk button on the toolbar or by pressing Ctrl+Alt+I.
setwd("C:/Users/Atul/Desktop/risk analysis")
train1 <- read.csv("cs-training.csv")
test1 <- read.csv("cs-test.csv")
head(test1)
head(train1)
We find explore both the databases and try to understand the underlying problem. We check data structures of each of the data features and check which of the following features are NA’s.
combi <- rbind(train1,test1)
str(combi)
'data.frame': 251503 obs. of 12 variables:
$ X : int 1 2 3 4 5 6 7 8 9 10 ...
$ SeriousDlqin2yrs : int 1 0 0 0 0 0 0 0 0 0 ...
$ RevolvingUtilizationOfUnsecuredLines: num 0.766 0.957 0.658 0.234 0.907 ...
$ age : int 45 40 38 30 49 74 57 39 27 57 ...
$ NumberOfTime30.59DaysPastDueNotWorse: int 2 0 1 0 1 0 0 0 0 0 ...
$ DebtRatio : num 0.803 0.1219 0.0851 0.036 0.0249 ...
$ MonthlyIncome : int 9120 2600 3042 3300 63588 3500 NA 3500 NA 23684 ...
$ NumberOfOpenCreditLinesAndLoans : int 13 4 2 5 7 3 8 8 2 9 ...
$ NumberOfTimes90DaysLate : int 0 0 1 0 0 0 0 0 0 0 ...
$ NumberRealEstateLoansOrLines : int 6 0 0 0 1 1 3 0 0 4 ...
$ NumberOfTime60.89DaysPastDueNotWorse: int 0 0 0 0 0 0 0 0 0 0 ...
$ NumberOfDependents : int 2 1 0 0 0 1 0 0 NA 2 ...
summary(combi)
X SeriousDlqin2yrs RevolvingUtilizationOfUnsecuredLines age
Min. : 1 Min. :0.00 Min. : 0.00 Min. : 0.00
1st Qu.: 31439 1st Qu.:0.00 1st Qu.: 0.03 1st Qu.: 41.00
Median : 62876 Median :0.00 Median : 0.15 Median : 52.00
Mean : 65214 Mean :0.07 Mean : 5.75 Mean : 52.34
3rd Qu.: 94314 3rd Qu.:0.00 3rd Qu.: 0.56 3rd Qu.: 63.00
Max. :150000 Max. :1.00 Max. :50708.00 Max. :109.00
NA's :101503
NumberOfTime30.59DaysPastDueNotWorse DebtRatio MonthlyIncome NumberOfOpenCreditLinesAndLoans
Min. : 0.0000 Min. : 0.0 Min. : 0 Min. : 0.000
1st Qu.: 0.0000 1st Qu.: 0.2 1st Qu.: 3400 1st Qu.: 5.000
Median : 0.0000 Median : 0.4 Median : 5400 Median : 8.000
Mean : 0.4343 Mean : 349.6 Mean : 6745 Mean : 8.453
3rd Qu.: 0.0000 3rd Qu.: 0.9 3rd Qu.: 8212 3rd Qu.:11.000
Max. :98.0000 Max. :329664.0 Max. :7727000 Max. :85.000
NA's :49834
NumberOfTimes90DaysLate NumberRealEstateLoansOrLines NumberOfTime60.89DaysPastDueNotWorse
Min. : 0.0000 Min. : 0.000 Min. : 0.0000
1st Qu.: 0.0000 1st Qu.: 0.000 1st Qu.: 0.0000
Median : 0.0000 Median : 1.000 Median : 0.0000
Mean : 0.2784 Mean : 1.016 Mean : 0.2525
3rd Qu.: 0.0000 3rd Qu.: 2.000 3rd Qu.: 0.0000
Max. :98.0000 Max. :54.000 Max. :98.0000
NumberOfDependents
Min. : 0.000
1st Qu.: 0.000
Median : 0.000
Mean : 0.762
3rd Qu.: 1.000
Max. :43.000
NA's :6550
What we did is we combined both test and train datasets, and got the strusture and summary of the combi datatset to get some good insights about the data. The features are self explanatory and can tell a good thing or two about the data itself.
library(rpart)
combi$AgeClass <- '64+'
combi$AgeClass[combi$age > 0 & combi$age <= 40] <- '0-40'
combi$AgeClass[combi$age > 40 & combi$age <=52] <- '41-52'
combi$AgeClass[combi$age > 52 & combi$age <=64] <- '43-54'
combi$AgeClass <- as.factor(combi$AgeClass)
#combi$AgeClass[is.na(combi$age)] <- NA
we now divide age into different cattegories to make our classification tree more efficient. Here, one can see that we havw now divided the feature age in 4 classes.
library(gmodels)
package <U+393C><U+3E31>gmodels<U+393C><U+3E32> was built under R version 3.3.3
CrossTable(combi$SeriousDlqin2yrs)
Cell Contents
|-------------------------|
| N |
| N / Table Total |
|-------------------------|
Total Observations in Table: 150000
| 0 | 1 |
|-----------|-----------|
| 139974 | 10026 |
| 0.933 | 0.067 |
|-----------|-----------|
The following table gives a good insight about the number of 0s and 1s in the datasets. We know that around 93.3% have had serious deliquencies in two years.
DependentsFit <- rpart(NumberOfDependents ~ RevolvingUtilizationOfUnsecuredLines + age + NumberOfOpenCreditLinesAndLoans
+ MonthlyIncome + NumberRealEstateLoansOrLines + DebtRatio,
data=combi[!is.na(combi$NumberOfDependents),],
method="anova")
combi$NumberOfDependents[is.na(combi$NumberOfDependents)] <- predict(DependentsFit, combi[is.na(combi$NumberOfDependents),])
sum(is.na(combi$NumberOfDependents))
[1] 0
In the above chunk of code we have successfully replaced all the Na’s with predicted values acquired from the anova method. It we could also exclude the rows with Na’s but this seems like a better choice as we have a lot of data to learn from and edting a little would do no harm.
Further, Lets now look at another predictor variable, lets look at the number of times people were lateb by 30-59 dats, lets change the datatype of the variable from integer to factor and change the levels.
combi$NumberOfTime30.59DaysPastDueNotWorse <- as.factor(combi$NumberOfTime30.59DaysPastDueNotWorse)
Lets apply the same the numberoftimepast 60-89 days column. And for deliquencies more than 90 days.
combi$NumberOfTime60.89DaysPastDueNotWorse <- as.factor(combi$NumberOfTime60.89DaysPastDueNotWorse)
combi$NumberOfTimes90DaysLate <- as.factor(combi$NumberOfTimes90DaysLate)
Let’s now fill Na’s in the monthy income column by regression using anova technique.
IncomeFit <- rpart(MonthlyIncome ~ RevolvingUtilizationOfUnsecuredLines + age + NumberOfOpenCreditLinesAndLoans + NumberOfDependents + NumberRealEstateLoansOrLines + DebtRatio,
data=combi[!is.na(combi$MonthlyIncome),],
method="anova")
combi$MonthlyIncome[is.na(combi$MonthlyIncome)] <- predict(IncomeFit, combi[is.na(combi$MonthlyIncome),])
sum(is.na(combi$MonthlyIncome))
[1] 0
Let’s now go ahead to the salary variable and create differnet classes for the salary variable data.
combi$IncomeClass[combi$MonthlyIncome >= 0 & combi$MonthlyIncome <= 1000] <- '0-1000'
combi$IncomeClass[combi$MonthlyIncome > 1000 & combi$MonthlyIncome <= 2000] <- '1001-2000'
combi$IncomeClass[combi$MonthlyIncome > 2000 & combi$MonthlyIncome <= 3000] <- '2001-3000'
combi$IncomeClass[combi$MonthlyIncome > 3000 & combi$MonthlyIncome <= 4000] <- '3001-4000'
combi$IncomeClass[combi$MonthlyIncome > 4000 & combi$MonthlyIncome <= 6000] <- '4001-6000'
combi$IncomeClass[combi$MonthlyIncome > 6001 & combi$MonthlyIncome <= 8000] <- '6001-8000'
invalid factor level, NA generated
combi$IncomeClass[combi$MonthlyIncome > 6000 & combi$MonthlyIncome <= 10000] <- '8001-10000'
combi$IncomeClass[combi$MonthlyIncome > 10000 & combi$MonthlyIncome <= 20000] <- '10001-20000'
combi$IncomeClass[combi$MonthlyIncome > 20000] <- '20000+'
combi$IncomeClass <- as.factor(combi$IncomeClass)
One of the most insightful variable in my opinion is the debt ratrio, this has to make a lo0t of sense once we model the Cross Table of Debt Ratio.
combi$DebtRatioClass <- '100+'
combi$DebtRatioClass[combi$DebtRatio >= 0 & combi$DebtRatio <= 0.5] <- '0-0.5'
combi$DebtRatioClass[combi$DebtRatio > 0.5 & combi$DebtRatio <= 1] <- '0.5-1'
combi$DebtRatioClass[combi$DebtRatio > 1 & combi$DebtRatio <= 2] <- '1-2'
combi$DebtRatioClass[combi$DebtRatio > 2 & combi$DebtRatio <= 10] <- '2-10'
combi$DebtRatioClass[combi$DebtRatio > 10 & combi$DebtRatio <= 100] <- '10-100'
combi$DebtRatioClass <- as.factor(combi$DebtRatioClass)
summary(combi$DebtRatioClass)
0-0.5 0.5-1 1-2 10-100 100+ 2-10
157240 35548 6855 7521 40753 3586
CrossTable(combi$DebtRatioClass)
Cell Contents
|-------------------------|
| N |
| N / Table Total |
|-------------------------|
Total Observations in Table: 251503
| 0-0.5 | 0.5-1 | 1-2 | 10-100 | 100+ |
|-----------|-----------|-----------|-----------|-----------|
| 157240 | 35548 | 6855 | 7521 | 40753 |
| 0.625 | 0.141 | 0.027 | 0.030 | 0.162 |
|-----------|-----------|-----------|-----------|-----------|
| 2-10 |
|-----------|
| 3586 |
| 0.014 |
|-----------|
CrossTable(combi$SeriousDlqin2yrs,combi$DebtRatioClass, prop.r = TRUE, prop.c = FALSE, prop.t = FALSE, prop.chisq = FALSE)
Cell Contents
|-------------------------|
| N |
| N / Row Total |
|-------------------------|
Total Observations in Table: 150000
| combi$DebtRatioClass
combi$SeriousDlqin2yrs | 0-0.5 | 0.5-1 | 1-2 | 10-100 | 100+ | 2-10 | Row Total |
-----------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
0 | 88053 | 19075 | 3553 | 4298 | 22970 | 2025 | 139974 |
| 0.629 | 0.136 | 0.025 | 0.031 | 0.164 | 0.014 | 0.933 |
-----------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
1 | 5655 | 2080 | 539 | 199 | 1410 | 143 | 10026 |
| 0.564 | 0.207 | 0.054 | 0.020 | 0.141 | 0.014 | 0.067 |
-----------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
Column Total | 93708 | 21155 | 4092 | 4497 | 24380 | 2168 | 150000 |
-----------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
As can be seen using the comparison, this makes perfect sense..
The next feature of the dataset to be considered, is the RevolvingUtilizationOfUnsecuredLine, this should be on the same lines of that of debt ratio.
summary(combi$NumberOfOpenCreditLinesAndLoans)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 5.000 8.000 8.453 11.000 85.000
plot(density(combi$NumberOfOpenCreditLinesAndLoans))
#Let us model categorically for this variable
combi$OpenCreditClass <- '20+'
combi$OpenCreditClass[combi$NumberOfOpenCreditLinesAndLoans >= 0 & combi$NumberOfOpenCreditLinesAndLoans<=5] <- '0-5'
combi$OpenCreditClass[combi$NumberOfOpenCreditLinesAndLoans > 5 & combi$NumberOfOpenCreditLinesAndLoans<=10] <- '5-10'
combi$OpenCreditClass[combi$NumberOfOpenCreditLinesAndLoans > 10 & combi$NumberOfOpenCreditLinesAndLoans<=15] <- '10-15'
combi$OpenCreditClass[combi$NumberOfOpenCreditLinesAndLoans > 15 & combi$NumberOfOpenCreditLinesAndLoans<= 20] <- '15-20'
combi$OpenCreditClass <- as.factor(combi$OpenCreditClass)
CrossTable(combi$OpenCreditClass)
Cell Contents
|-------------------------|
| N |
| N / Table Total |
|-------------------------|
Total Observations in Table: 251503
| 0-5 | 10-15 | 15-20 | 20+ | 5-10 |
|-----------|-----------|-----------|-----------|-----------|
| 78018 | 48912 | 16442 | 6679 | 101452 |
| 0.310 | 0.194 | 0.065 | 0.027 | 0.403 |
|-----------|-----------|-----------|-----------|-----------|
CrossTable(combi$SeriousDlqin2yrs,combi$OpenCreditClass, prop.r = TRUE, prop.c = FALSE, prop.t = FALSE, prop.chisq = FALSE)
Cell Contents
|-------------------------|
| N |
| N / Row Total |
|-------------------------|
Total Observations in Table: 150000
| combi$OpenCreditClass
combi$SeriousDlqin2yrs | 0-5 | 10-15 | 15-20 | 20+ | 5-10 | Row Total |
-----------------------|-----------|-----------|-----------|-----------|-----------|-----------|
0 | 42668 | 27380 | 9170 | 3701 | 57055 | 139974 |
| 0.305 | 0.196 | 0.066 | 0.026 | 0.408 | 0.933 |
-----------------------|-----------|-----------|-----------|-----------|-----------|-----------|
1 | 3922 | 1804 | 676 | 279 | 3345 | 10026 |
| 0.391 | 0.180 | 0.067 | 0.028 | 0.334 | 0.067 |
-----------------------|-----------|-----------|-----------|-----------|-----------|-----------|
Column Total | 46590 | 29184 | 9846 | 3980 | 60400 | 150000 |
-----------------------|-----------|-----------|-----------|-----------|-----------|-----------|
#independently, this variable does not seem as insightful as other, but i am certain collectively it will be very important.
summary(combi$NumberRealEstateLoansOrLines)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 0.000 1.000 1.016 2.000 54.000
combi$RealtyLinesClass <- '3+'
combi$RealtyLinesClass[combi$NumberRealEstateLoansOrLines >=0 & combi$NumberRealEstateLoansOrLines <= 1] <- '0-1'
combi$RealtyLinesClass[combi$NumberRealEstateLoansOrLines >1 & combi$NumberRealEstateLoansOrLines <= 2] <- '1-2'
combi$RealtyLinesClass[combi$NumberRealEstateLoansOrLines >2 & combi$NumberRealEstateLoansOrLines <= 3] <- '2-3'
combi$RealtyLinesClass <- as.factor(combi$RealtyLinesClass)
CrossTable(combi$RealtyLinesClass)
Cell Contents
|-------------------------|
| N |
| N / Table Total |
|-------------------------|
Total Observations in Table: 251503
| 0-1 | 1-2 | 2-3 | 3+ |
|-----------|-----------|-----------|-----------|
| 182262 | 52477 | 10723 | 6041 |
| 0.725 | 0.209 | 0.043 | 0.024 |
|-----------|-----------|-----------|-----------|
CrossTable(combi$SeriousDlqin2yrs,combi$RealtyLinesClass, prop.r = TRUE, prop.c = FALSE, prop.t = FALSE, prop.chisq = FALSE)
Cell Contents
|-------------------------|
| N |
| N / Row Total |
|-------------------------|
Total Observations in Table: 150000
| combi$RealtyLinesClass
combi$SeriousDlqin2yrs | 0-1 | 1-2 | 2-3 | 3+ | Row Total |
-----------------------|-----------|-----------|-----------|-----------|-----------|
0 | 101106 | 29757 | 5878 | 3233 | 139974 |
| 0.722 | 0.213 | 0.042 | 0.023 | 0.933 |
-----------------------|-----------|-----------|-----------|-----------|-----------|
1 | 7420 | 1765 | 422 | 419 | 10026 |
| 0.740 | 0.176 | 0.042 | 0.042 | 0.067 |
-----------------------|-----------|-----------|-----------|-----------|-----------|
Column Total | 108526 | 31522 | 6300 | 3652 | 150000 |
-----------------------|-----------|-----------|-----------|-----------|-----------|
str(combi)
'data.frame': 251503 obs. of 17 variables:
$ X : int 1 2 3 4 5 6 7 8 9 10 ...
$ SeriousDlqin2yrs : int 1 0 0 0 0 0 0 0 0 0 ...
$ RevolvingUtilizationOfUnsecuredLines: num 0.766 0.957 0.658 0.234 0.907 ...
$ age : int 45 40 38 30 49 74 57 39 27 57 ...
$ NumberOfTime30.59DaysPastDueNotWorse: Factor w/ 17 levels "0","1","2","3",..: 3 1 2 1 2 1 1 1 1 1 ...
$ DebtRatio : num 0.803 0.1219 0.0851 0.036 0.0249 ...
$ MonthlyIncome : num 9120 2600 3042 3300 63588 ...
$ NumberOfOpenCreditLinesAndLoans : int 13 4 2 5 7 3 8 8 2 9 ...
$ NumberOfTimes90DaysLate : Factor w/ 21 levels "0","1","2","3",..: 1 1 2 1 1 1 1 1 1 1 ...
$ NumberRealEstateLoansOrLines : int 6 0 0 0 1 1 3 0 0 4 ...
$ NumberOfTime60.89DaysPastDueNotWorse: Factor w/ 13 levels "0","1","2","3",..: 1 1 1 1 1 1 1 1 1 1 ...
$ NumberOfDependents : num 2 1 0 0 0 ...
$ AgeClass : Factor w/ 4 levels "0-40","41-52",..: 2 1 1 1 2 4 3 1 1 3 ...
$ IncomeClass : Factor w/ 8 levels "0-1000","10001-20000",..: 8 5 6 6 4 6 8 6 6 4 ...
$ DebtRatioClass : Factor w/ 6 levels "0-0.5","0.5-1",..: 2 1 1 1 1 1 5 1 4 2 ...
$ OpenCreditClass : Factor w/ 5 levels "0-5","10-15",..: 2 1 1 1 5 1 5 5 1 5 ...
$ RealtyLinesClass : Factor w/ 4 levels "0-1","1-2","2-3",..: 4 1 1 1 1 1 3 1 1 4 ...
After turing all the parameters that we think are important, we now go ahead to solve the predictive model and using Random Forest Algorithm.
library(randomForest)
randomForest 4.6-12
Type rfNews() to see new features/changes/bug fixes.
set.seed(888)
nrow(combi)
[1] 251503
train <- combi[1:150000,]
test <- combi[150001:251503,]
To do this we have again divided the combi dataset into train and test datasets.
fit <- randomForest(as.factor(SeriousDlqin2yrs) ~ NumberOfTimes90DaysLate + NumberOfTime60.89DaysPastDueNotWorse
+ NumberOfTime30.59DaysPastDueNotWorse + NumberOfDependents + AgeClass + DebtRatioClass +
RevolvingUtilizationOfUnsecuredLines+ OpenCreditClass + RealtyLinesClass,
data=train,
importance=TRUE,
ntree=25, keep.forest = TRUE)
sum(is.na(train$SeriousDlqin2yrs))
[1] 0
Prediction <- predict(fit, test, type = "prob")
submit <- data.frame(Id = test$X, Probability = Prediction)
entry <- data.frame(Id = submit$Id, Probability = submit$Probability.1)
write.csv(entry, file = "Entry.csv", row.names = FALSE)
nrow(test)
[1] 101503
Let’s plot ctree..
library("partykit")
x <- ctree(as.factor(SeriousDlqin2yrs) ~ NumberOfTimes90DaysLate + NumberOfTime60.89DaysPastDueNotWorse
+ NumberOfTime30.59DaysPastDueNotWorse + NumberOfDependents + AgeClass + DebtRatioClass +
RevolvingUtilizationOfUnsecuredLines + OpenCreditClass + RealtyLinesClass,
data=train)
plot(x, gp = gpar(fontsize = 5), # font size changed to 6
inner_panel=node_inner,
ip_args=list(
abbreviate = FALSE,
id = FALSE)
)
When you save the notebook, an HTML file containing the code and output will be saved alongside it (click the Preview button or press Ctrl+Shift+K to preview the HTML file).