Description

The purpose of this database is to provide information about a bank’s customers so that machine learning models can be developed that can predict whether a particular customer will repay the loan or not. We investigated 4 algorithms: Logistic Regression, Decision Tree, Random Forest, and Support Vector Machine (SVM).

The dataset used in this report is Credit Risk data in Kaggle. The dataset can be downloaded here. Link to dataset: https://www.kaggle.com/upadorprofzs/credit-risk

Variable Name Description Type

clientid = Client id Person
income = Income of the client
age = Age of client
loan = Load value
default = target 0 good client, 1 not.

Report Outline

Data Extraction
Exploratory Data Analysis
Data Preparation
Modeling
Evaluation
Recommendation

1. Data Extraction

The dataset is downloaded from Kaggle and saved in the data folder. We use read.csv() function to read the dataset and put in CustomerCredit_df data frame.

CustomerCredit_df<-read.csv("C:/Users/dedis/Documents/Data Science Course/Portofolio/data/credit_risk.csv")

See the data dimension. The dataset has 2000 rows and 5 columns.

dim(CustomerCredit_df)

## [1] 2000    5

2. Exploratory Data Analysis

To find out the column names and types, we used str() function.

str(CustomerCredit_df)

## 'data.frame':    2000 obs. of  5 variables:
##  $ clientid: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ income  : num  66156 34415 57317 42710 66953 ...
##  $ age     : num  59 48.1 63.1 45.8 18.6 ...
##  $ loan    : num  8107 6565 8021 6104 8770 ...
##  $ default : int  0 0 0 0 1 0 0 1 0 0 ...

From the result above, we know the following:
1. The first column is id. It is unique and unnecessary for prediction. So, it should be removed.
2. The fifth column is default Currently the type is int and it should be converted to factor and change variable name to be payment_status

# remove unnecessary columns
CustomerCredit_df$clientid <- NULL

# change to factor for target variable 
colnames(CustomerCredit_df)[4] <- "payment_status"  
CustomerCredit_df$payment_status <- factor(CustomerCredit_df$payment_status,
                                           levels = c(0,1),
                                           labels = c("Paid", "Unpaid"))

2.1. Univariate Data Analysis

Analysis of a single variable.
Number of paid (P) and unpaid (UP) in payment_status column.

library(ggplot2)
ggplot(data=CustomerCredit_df, aes(x=payment_status)) + geom_bar(color="white",fill = "#49B3E8")+ 
    geom_text(aes(y = ..count.. -50,label = paste0(round(prop.table(..count..),4) * 100, '%')), 
              stat = 'count', position = position_stack(vjust=0.5), size = 5,color= "white") +
    labs(title="Payment Status Of Client", x="Payment Status")

From the result above, we know that the number of customer Paid is more than Unpaid.

Distribution of loan variable in Bar

ggplot(data=CustomerCredit_df, aes(x=loan))+
    geom_histogram(color="white",fill = "#49B3E8")+
    scale_x_continuous(breaks = c(0,2000,4000,6000,8000,10000,12000,14000),
                       labels = c("0","2000","4000","6000","8000","10000","12000",
                                  "14000"),
                       limits = c(1000,14000))+
    labs(title = "Loan Of Client",
         x="Loan", y="Count")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

From the result above, we know that data distribution of client loan is uneven, it’s leaning more to the left

Distribution of age variable in Bar

ggplot(data=CustomerCredit_df, aes(x=age))+
    geom_histogram(color="white",fill = "#49B3E8")+
    labs(title = "Age Of Client",
         x="Age", y="Count")

Distribution of income variable in Bar

ggplot(data=CustomerCredit_df, aes(x=income))+
    geom_histogram(color="white",fill = "#49B3E8")+
    scale_x_continuous(breaks = c(20000,30000,40000,50000
                                  ,60000,70000),
                       labels = c("20000","30000",
                                  "40000","50000","60000"
                                  ,"70000"),
                       limits = c(20000,70000))+
    labs(title = "Income Of Client",
         x="Income", y="Count")

From the result above, we know that data distribution of Income Clientis normal

2.2. Bivariate Data Analysis

Analysis of two variables.
Distribution of loan variable based on payment_status.

ggplot(data=CustomerCredit_df, aes(x=payment_status, 
                                        y=loan,
                                        color = payment_status)) + 
    scale_y_continuous(breaks = c(0,2000,4000,6000,8000,10000,12000,14000),
                       labels = c("0","2000","4000","6000","8000","10000","12000",
                                  "14000"),
                       limits = c(1000,14000))+
    geom_boxplot(alpha=.3) + 

    geom_jitter(alpha = 0.3,
                color = "blue", 
                width = 0.2) + 
    labs(title="Payment Status based on Loan", 
         x="Payment Status", y="Loan")

From the result above, we know the following:
1. Based on Loan , the number of paid is above 2000 and below 7000
2. Based on Loan , the number of unpaid is is above 5000 and below 10000

Distribution of income variable based on payment_status.

ggplot(data=CustomerCredit_df, aes(x=payment_status, 
                                        y=income,
                                        color = payment_status)) + 
    scale_y_continuous(breaks = c(20000,30000,40000,50000
                                  ,60000,70000),
                       labels = c("20000","30000",
                                  "40000","50000","60000"
                                  ,"70000"),
                       limits = c(20000,70000))+
    geom_boxplot(alpha=.3) + 
    geom_jitter(alpha = 0.3,
                color = "blue", 
                width = 0.2) + 
    labs(title="Payment Status based on Income", 
         x="Payment Status", y="Income")

From the result above, we know the following:
Based on Income of Customer, we see that customer who have a high salary or low salary has a potency to pay or unpaid their loan

Distribution of age variable based on payment_status.

ggplot(data=CustomerCredit_df, aes(x=age, 
                                        fill=payment_status)) +
    geom_density(alpha=.3)+
    labs(title="Payment Status based on Age", 
         x="Age")

## Warning: Removed 3 rows containing non-finite values (stat_density).

2.3. Multivariate Data Analysis

Compute and visualize correlation coefficient of each measurement.

Visualize Pearson’s Correlation Coefficient variables.

library(corrgram)
corrgram(CustomerCredit_df[,c("income", "loan", "age")],
         main="Pearson’s Correlation Coefficient variables")

From the result above, we know the following:
variable income has a strong corelation with variable loan than variable age

3. Data Preparation

3.1 Feature Selection

# remove unnecessary columns
# CustomerCredit_df$clientid <- NULL

3.2 Remove Outlier

3.2.1 Get Outliers Values

In this section, it is a part of getting the outliers value.

Get Outliers Values

out_age <- boxplot.stats(CustomerCredit_df$age)$out
out_age

## [1] -28.21836 -52.42328 -36.49698

Get Outliers Index

out_idx <- which(CustomerCredit_df$age %in% c(out_age))
out_idx

## [1] 16 22 27

Data Without Outliers

CustomerCredit_clean <-CustomerCredit_df[-out_idx,]
dim(CustomerCredit_clean)

## [1] 1997    4

3.2.2 Handling Missing Value

Detect Missing Value on Data

Distribution of Missing Value in Pattern

library(mice)
md.pattern(CustomerCredit_clean)

##      income loan payment_status age  
## 1994      1    1              1   1 0
## 3         1    1              1   0 1
##           0    0              0   3 3

Imputation Missing Value With Mean

CustomerCredit_df.imput = CustomerCredit_clean
for(i in which(sapply(CustomerCredit_df.imput, is.numeric))){
    CustomerCredit_df.imput[is.na(CustomerCredit_df.imput[, i]), i] <- mean(CustomerCredit_df.imput[, i], na.rm = TRUE)
}

3.3 Training and Test Division

set.seed(2021)
m = nrow(CustomerCredit_df.imput)
train_ind <- sample(m, 0.8 * m)
TrainingSet <- CustomerCredit_df.imput[train_ind, ]
TestingSet <- CustomerCredit_df.imput[-train_ind, ]

4. Modelling

4.1 Logistic Regression

library(caret)
Model_LG <- train(payment_status ~ ., data = TrainingSet,
               method = "glm",
               na.action = na.omit,
               preProcess=c("scale","center"),
               trControl= trainControl(method="none")
)
summary(Model_LG)

## 
## Call:
## NULL
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.26416  -0.08227  -0.00899  -0.00036   2.71212  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -7.6447     0.5937  -12.88   <2e-16 ***
## income       -3.6388     0.3460  -10.52   <2e-16 ***
## age          -4.7082     0.3880  -12.13   <2e-16 ***
## loan          5.3884     0.4544   11.86   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1316.55  on 1596  degrees of freedom
## Residual deviance:  360.52  on 1593  degrees of freedom
## AIC: 368.52
## 
## Number of Fisher Scoring iterations: 9

# Feature importance
print(varImp(Model_LG))

## glm variable importance
## 
##        Overall
## age     100.00
## loan     82.88
## income    0.00

Importance <- varImp(Model_LG)
plot(Importance, col = "red")

4.2 Classification Tree

library(party)
Model_CT <- train(payment_status ~ ., data = TrainingSet,
               method = "LMT",
               na.action = na.omit,
               preProcess=c("scale","center"),
               trControl= trainControl(method="none")
)
summary(Model_CT)

## 
## === Summary ===
## 
## Correctly Classified Instances        1594               99.8121 %
## Incorrectly Classified Instances         3                0.1879 %
## Kappa statistic                          0.9924
## Mean absolute error                      0.0481
## Root mean squared error                  0.0655
## Relative absolute error                 19.4711 %
## Root relative squared error             18.6587 %
## Total Number of Instances             1597     
## 
## === Confusion Matrix ===
## 
##     a    b   <-- classified as
##  1365    2 |    a = Paid
##     1  229 |    b = Unpaid

# Feature importance
print(varImp(Model_CT))

## ROC curve variable importance
## 
##        Importance
## age        100.00
## loan        85.41
## income       0.00

Importance <- varImp(Model_CT)
plot(Importance, col = "red")

4.3 Random Forest

library(randomForest)
set.seed(2021)
fit.forest <- randomForest(formula = payment_status ~ .,
                           data = TrainingSet,
                           na.action = na.roughfix)
fit.forest

## 
## Call:
##  randomForest(formula = payment_status ~ ., data = TrainingSet,      na.action = na.roughfix) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 1
## 
##         OOB estimate of  error rate: 1.13%
## Confusion matrix:
##        Paid Unpaid class.error
## Paid   1363      4 0.002926116
## Unpaid   14    216 0.060869565

summary(fit.forest)

##                 Length Class  Mode     
## call               4   -none- call     
## type               1   -none- character
## predicted       1597   factor numeric  
## err.rate        1500   -none- numeric  
## confusion          6   -none- numeric  
## votes           3194   matrix numeric  
## oob.times       1597   -none- numeric  
## classes            2   -none- character
## importance         3   -none- numeric  
## importanceSD       0   -none- NULL     
## localImportance    0   -none- NULL     
## proximity          0   -none- NULL     
## ntree              1   -none- numeric  
## mtry               1   -none- numeric  
## forest            14   -none- list     
## y               1597   factor numeric  
## test               0   -none- NULL     
## inbag              0   -none- NULL     
## terms              3   terms  call

# Feature importance
varImpPlot(fit.forest)

4.4 Support Vector Machine

library(e1071)
library(caret )


Model_SVM <-   train(payment_status ~ ., data = TrainingSet,
               method = "svmPoly",
               na.action = na.omit,
               preProcess=c("scale","center"),
               trControl= trainControl(method="none"),
               tuneGrid = data.frame(degree=1,scale=1,C=1)
)

summary(Model_SVM)

## Length  Class   Mode 
##      1   ksvm     S4

# Feature importance
print(varImp(Model_SVM))

## ROC curve variable importance
## 
##        Importance
## age        100.00
## loan        85.41
## income       0.00

Importance <- varImp(Model_SVM)
plot(Importance, col = "red")

5. Evaluate The Model

5.1 Logistic Regression

Model.Testing.LG <- predict(Model_LG, TestingSet)
Model.testing.confusion.LG <-confusionMatrix(Model.Testing.LG,                                         TestingSet$payment_status,
                                             dnn = c("Predict","Actual"))
Model.testing.confusion.LG

## Confusion Matrix and Statistics
## 
##         Actual
## Predict  Paid Unpaid
##   Paid    336      8
##   Unpaid   11     45
##                                           
##                Accuracy : 0.9525          
##                  95% CI : (0.9268, 0.9712)
##     No Information Rate : 0.8675          
##     P-Value [Acc > NIR] : 1.362e-08       
##                                           
##                   Kappa : 0.7982          
##                                           
##  Mcnemar's Test P-Value : 0.6464          
##                                           
##             Sensitivity : 0.9683          
##             Specificity : 0.8491          
##          Pos Pred Value : 0.9767          
##          Neg Pred Value : 0.8036          
##              Prevalence : 0.8675          
##          Detection Rate : 0.8400          
##    Detection Prevalence : 0.8600          
##       Balanced Accuracy : 0.9087          
##                                           
##        'Positive' Class : Paid            
##

ggplot(data=TestingSet, aes(x=payment_status, 
                                   y=Model.Testing.LG)) + 
    
    geom_boxplot(alpha=.3) + 
    
    geom_jitter(alpha = 0.3,
                color = "blue", 
                width = 0.2) + 
    labs(title="Actual VS Predicted", 
         x="Actual", y="Predicted")

#### 5.2 Support Vector Machine

Model.Testing.SV <- predict(Model_SVM, TestingSet)
Model.testing.confusion <-confusionMatrix(Model.Testing.SV,
                                   TestingSet$payment_status)
Model.testing.confusion

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Paid Unpaid
##     Paid    335      8
##     Unpaid   12     45
##                                           
##                Accuracy : 0.95            
##                  95% CI : (0.9238, 0.9692)
##     No Information Rate : 0.8675          
##     P-Value [Acc > NIR] : 4.066e-08       
##                                           
##                   Kappa : 0.7892          
##                                           
##  Mcnemar's Test P-Value : 0.5023          
##                                           
##             Sensitivity : 0.9654          
##             Specificity : 0.8491          
##          Pos Pred Value : 0.9767          
##          Neg Pred Value : 0.7895          
##              Prevalence : 0.8675          
##          Detection Rate : 0.8375          
##    Detection Prevalence : 0.8575          
##       Balanced Accuracy : 0.9072          
##                                           
##        'Positive' Class : Paid            
##

ggplot(data=TestingSet, aes(x=payment_status, 
                            y=Model.Testing.SV)) + 
    
    geom_boxplot(alpha=.3) + 
    
    geom_jitter(alpha = 0.3,
                color = "blue", 
                width = 0.2) + 
    labs(title="Actual VS Predicted", 
         x="Actual", y="Predicted")

5.3 Random Forest

Model.Testing.RF <- predict(fit.forest, TestingSet)
Model.testing.confusion <-confusionMatrix(Model.Testing.RF,
                                          TestingSet$payment_status)
Model.testing.confusion

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Paid Unpaid
##     Paid    347      3
##     Unpaid    0     50
##                                           
##                Accuracy : 0.9925          
##                  95% CI : (0.9782, 0.9985)
##     No Information Rate : 0.8675          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9666          
##                                           
##  Mcnemar's Test P-Value : 0.2482          
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.9434          
##          Pos Pred Value : 0.9914          
##          Neg Pred Value : 1.0000          
##              Prevalence : 0.8675          
##          Detection Rate : 0.8675          
##    Detection Prevalence : 0.8750          
##       Balanced Accuracy : 0.9717          
##                                           
##        'Positive' Class : Paid            
##

ggplot(data=TestingSet, aes(x=payment_status, 
                            y=Model.Testing.RF)) + 
    
    geom_boxplot(alpha=.3) + 
    
    geom_jitter(alpha = 0.3,
                color = "blue", 
                width = 0.2) + 
    labs(title="Actual VS Predicted", 
         x="Actual", y="Predicted")

5.4 Classification Tree

Model.Testing.CT <- predict(Model_CT, TestingSet)
Model.testing.confusion <-confusionMatrix(Model.Testing.CT,
                                          TestingSet$payment_status)
Model.testing.confusion

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Paid Unpaid
##     Paid    346      1
##     Unpaid    1     52
##                                           
##                Accuracy : 0.995           
##                  95% CI : (0.9821, 0.9994)
##     No Information Rate : 0.8675          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9783          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.9971          
##             Specificity : 0.9811          
##          Pos Pred Value : 0.9971          
##          Neg Pred Value : 0.9811          
##              Prevalence : 0.8675          
##          Detection Rate : 0.8650          
##    Detection Prevalence : 0.8675          
##       Balanced Accuracy : 0.9891          
##                                           
##        'Positive' Class : Paid            
##

ggplot(data=TestingSet, aes(x=payment_status, 
                            y=Model.Testing.CT)) + 
    
    geom_boxplot(alpha=.3) + 
    
    geom_jitter(alpha = 0.3,
                color = "blue", 
                width = 0.2) + 
    labs(title="Actual VS Predicted", 
         x="Actual", y="Predicted")

6. Reccomendation

From the result above, we know the following:
1. Classification Tree is the best among all the tested algortihms
2. Based Classification Tree Model , the most important variables are age, loan and income
3. For age around 25 to 30 the value of the loan may be lowered The results can be improved by better data preparation

Credit Risk Classification

Dedi Irwanto Samosir

05/01/2021