Group 11:

  • Charroogesinee a/p Radhakrishnan (17060405)
  • Hii Yew Han (S2037987)
  • Lim Chin Woon (S2124307)
  • Liang Chen Chen (S2113246)
  • Tan Shi Ling (S2115562)

Machine Learning in Banking Loan using R Programming

Part 1: Introduction

1.1 Introduction of Bank Loan Default

Taking up loans is common these days as people's needs keep increasing (Anant et al., 2022). Often, a borrower is unable to service the loan and defaults; in this case, the bank makes a loss. For the defaulter, the credit score drops drastically, which affects their ability to obtain loans in the future. Hence, in this project we use machine learning to predict whether a loan applicant will become a defaulter, which helps the bank minimise its default risk. In addition, we suggest a loan amount to applicants that increases their chance of approval and reduces the risk of them becoming defaulters in the future.

1.2 Objectives

  1. To predict whether a person has the ability to repay the loan
  2. To predict the loan amount that a person can take without having issues repaying it

Part 2: Data understanding

The dataset we chose can be found at the link below. It consists of 307,511 rows × 122 columns. https://www.kaggle.com/code/shrutidandagi/eda-bank-loan-default/notebook

2.1 Overview of Data Attributes

There are 122 variables in the dataset. Only the 57 variables that were later found to be meaningful are displayed in this section.

Attribute Information:
1. SK_ID_CURR: ID of the loan in our sample
2. TARGET: Target variable (1 = client with payment difficulties, 0 = client with no payment difficulties)
3. NAME_CONTRACT_TYPE: Identification if loan is cash or revolving
4. CODE_GENDER: Gender of the client
5. FLAG_OWN_CAR: Flag if the client owns a car
6. FLAG_OWN_REALTY: Flag if client owns a house or flat
7. CNT_CHILDREN: Number of children the client has
8. AMT_INCOME_TOTAL: Income of the client
9. AMT_CREDIT: Credit amount of the loan
10. AMT_ANNUITY: Loan annuity
11. AMT_GOODS_PRICE: For consumer loans it is the price of the goods for which the loan is given
12. NAME_TYPE_SUITE: Who was accompanying client when he was applying for the loan
13. NAME_INCOME_TYPE: Client's income type (businessman, working, maternity leave, …)
14. NAME_EDUCATION_TYPE: Level of highest education the client achieved
15. NAME_FAMILY_STATUS: Family status of the client
16. NAME_HOUSING_TYPE: What is the housing situation of the client (renting, living with parents, …)
17. REGION_POPULATION_RELATIVE: Normalized population of region where client lives (higher number means the client lives in more populated region)
18. DAYS_BIRTH: Client’s age in days at the time of application
19. DAYS_EMPLOYED: How many days before the application the person started current employment
20. DAYS_REGISTRATION: How many days before the application did client change his registration
21. DAYS_ID_PUBLISH: How many days before the application did client change the identity document
22. FLAG_MOBIL: Did client provide mobile phone (1=YES, 0=NO)
23. FLAG_EMP_PHONE: Did client provide work phone (1=YES, 0=NO)
24. FLAG_WORK_PHONE: Did client provide home phone (1=YES, 0=NO)
25. FLAG_CONT_MOBILE: Was mobile phone reachable (1=YES, 0=NO)
26. FLAG_PHONE: Did client provide home phone (1=YES, 0=NO)
27. FLAG_EMAIL: Did client provide email (1=YES, 0=NO)
28. OCCUPATION_TYPE: What kind of occupation does the client have
29. CNT_FAM_MEMBERS: How many family members does client have
30. REGION_RATING_CLIENT: Our rating of the region where client lives (1,2,3)
31. REGION_RATING_CLIENT_W_CITY: Our rating of the region where client lives, taking the city into account (1,2,3)
32. WEEKDAY_APPR_PROCESS_START: On which day of the week did the client apply for the loan
33. HOUR_APPR_PROCESS_START: Approximately at what hour did the client apply for the loan
34. REG_REGION_NOT_LIVE_REGION: Flag if client’s permanent address does not match contact address (1=different, 0=same, at region level)
35. REG_REGION_NOT_WORK_REGION: Flag if client’s contact address does not match work address (1=different, 0=same, at region level)
36. LIVE_REGION_NOT_WORK_REGION: Flag if client’s contact address does not match work address (1=different, 0=same, at region level)
37. REG_CITY_NOT_LIVE_CITY: Flag if client’s permanent address does not match contact address (1=different, 0=same, at city level)
38. REG_CITY_NOT_WORK_CITY: Flag if client’s permanent address does not match work address (1=different, 0=same, at city level)
39. LIVE_CITY_NOT_WORK_CITY: Flag if client’s contact address does not match work address (1=different, 0=same, at city level)
40. ORGANIZATION_TYPE: Type of organization where client works
41. EXT_SOURCE_2: Normalized score from external data source
42. EXT_SOURCE_3: Normalized score from external data source
43. FONDKAPREMONT_MODE: Normalized information about the building where the client lives: average (_AVG suffix), mode (_MODE suffix) or median (_MEDI suffix) of apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building and number of floors
44. HOUSETYPE_MODE: Normalized information about the building where the client lives (see 43)
45. WALLSMATERIAL_MODE: Normalized information about the building where the client lives (see 43)
46. EMERGENCYSTATE_MODE: Normalized information about the building where the client lives (see 43)
47. OBS_30_CNT_SOCIAL_CIRCLE: How many observations of the client's social surroundings with observable 30 DPD (days past due) default
48. DEF_30_CNT_SOCIAL_CIRCLE: How many observations of the client's social surroundings defaulted on 30 DPD (days past due)
49. OBS_60_CNT_SOCIAL_CIRCLE: How many observations of the client's social surroundings with observable 60 DPD (days past due) default
50. DEF_60_CNT_SOCIAL_CIRCLE: How many observations of the client's social surroundings defaulted on 60 DPD (days past due)
51. DAYS_LAST_PHONE_CHANGE: How many days before application did client change phone
52. AMT_REQ_CREDIT_BUREAU_HOUR: Number of enquiries to Credit Bureau about the client one hour before application
53. AMT_REQ_CREDIT_BUREAU_DAY: Number of enquiries to Credit Bureau about the client one day before application (excluding one hour before application)
54. AMT_REQ_CREDIT_BUREAU_WEEK: Number of enquiries to Credit Bureau about the client one week before application (excluding one day before application)
55. AMT_REQ_CREDIT_BUREAU_MON: Number of enquiries to Credit Bureau about the client one month before application (excluding one week before application)
56. AMT_REQ_CREDIT_BUREAU_QRT: Number of enquiries to the Credit Bureau about the client 3 months before application (excluding one month before application)
57. AMT_REQ_CREDIT_BUREAU_YEAR: Number of enquiries to the Credit Bureau about the client one year before application (excluding the last 3 months before application)

2.2 Import the necessary libraries

library(plyr)       # data manipulation (loaded before dplyr/tidyverse to avoid masking its verbs)
library(tidyverse)  # data manipulation (includes dplyr, ggplot2, readr, forcats)
library(ggpubr)     # arranging ggplots (ggarrange)
library(corrplot)   # correlation matrix plots
library(naniar)     # replace_with_na()
library(forcats)    # factor handling
library(readxl)     # reading Excel files
library(dplyr)      # data manipulation
library(ggplot2)    # plotting
library(knitr)      # report tables
library(data.table) # fast file reading
library(caret)      # model training and pre-processing
library(ROCR)       # ROC analysis
library(kableExtra) # nice HTML table formatting
library(gridExtra)  # arranging ggplots in a grid
library(rpart)      # decision tree
library(rpart.plot) # decision tree plotting
library(caTools)    # train/test split
library(hrbrthemes) # ggplot themes (theme_ipsum)
library(viridis)    # colour scales
library(e1071)      # SVM / naive Bayes
library(ROSE)       # sampling for imbalanced data
library(pROC)       # ROC curves
library(Metrics)    # evaluation metrics
library(glmnet)     # regularised regression
library(readr)      # reading rectangular data
library(repr)       # plot/output options in notebooks

2.3 Reading the original dataset

OriginalDF = read.csv("C:\\Users\\Joshua HII\\Desktop\\7004 assignment\\application_data.csv")
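
Since data.table is loaded for fast file reading, fread() could be used instead; a minimal sketch that keeps the same local path and converts the result back to a plain data frame so the rest of the code is unchanged:

# Sketch: faster alternative read using data.table::fread()
OriginalDF = as.data.frame(fread("C:\\Users\\Joshua HII\\Desktop\\7004 assignment\\application_data.csv"))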

2.4 Basic understanding of the dataset

dim(OriginalDF) #Shows the dimensions of the dataset
## [1] 307511    122
str(OriginalDF) #Shows the structure of the dataset
## 'data.frame':    307511 obs. of  122 variables:
##  $ SK_ID_CURR                  : int  100002 100003 100004 100006 100007 100008 100009 100010 100011 100012 ...
##  $ TARGET                      : int  1 0 0 0 0 0 0 0 0 0 ...
##  $ NAME_CONTRACT_TYPE          : chr  "Cash loans" "Cash loans" "Revolving loans" "Cash loans" ...
##  $ CODE_GENDER                 : chr  "M" "F" "M" "F" ...
##  $ FLAG_OWN_CAR                : chr  "N" "N" "Y" "N" ...
##  $ FLAG_OWN_REALTY             : chr  "Y" "N" "Y" "Y" ...
##  $ CNT_CHILDREN                : int  0 0 0 0 0 0 1 0 0 0 ...
##  $ AMT_INCOME_TOTAL            : num  202500 270000 67500 135000 121500 ...
##  $ AMT_CREDIT                  : num  406598 1293503 135000 312683 513000 ...
##  $ AMT_ANNUITY                 : num  24701 35699 6750 29687 21866 ...
##  $ AMT_GOODS_PRICE             : num  351000 1129500 135000 297000 513000 ...
##  $ NAME_TYPE_SUITE             : chr  "Unaccompanied" "Family" "Unaccompanied" "Unaccompanied" ...
##  $ NAME_INCOME_TYPE            : chr  "Working" "State servant" "Working" "Working" ...
##  $ NAME_EDUCATION_TYPE         : chr  "Secondary / secondary special" "Higher education" "Secondary / secondary special" "Secondary / secondary special" ...
##  $ NAME_FAMILY_STATUS          : chr  "Single / not married" "Married" "Single / not married" "Civil marriage" ...
##  $ NAME_HOUSING_TYPE           : chr  "House / apartment" "House / apartment" "House / apartment" "House / apartment" ...
##  $ REGION_POPULATION_RELATIVE  : num  0.0188 0.00354 0.01003 0.00802 0.02866 ...
##  $ DAYS_BIRTH                  : int  -9461 -16765 -19046 -19005 -19932 -16941 -13778 -18850 -20099 -14469 ...
##  $ DAYS_EMPLOYED               : int  -637 -1188 -225 -3039 -3038 -1588 -3130 -449 365243 -2019 ...
##  $ DAYS_REGISTRATION           : num  -3648 -1186 -4260 -9833 -4311 ...
##  $ DAYS_ID_PUBLISH             : int  -2120 -291 -2531 -2437 -3458 -477 -619 -2379 -3514 -3992 ...
##  $ OWN_CAR_AGE                 : int  NA NA 26 NA NA NA 17 8 NA NA ...
##  $ FLAG_MOBIL                  : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ FLAG_EMP_PHONE              : int  1 1 1 1 1 1 1 1 0 1 ...
##  $ FLAG_WORK_PHONE             : int  0 0 1 0 0 1 0 1 0 0 ...
##  $ FLAG_CONT_MOBILE            : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ FLAG_PHONE                  : int  1 1 1 0 0 1 1 0 0 0 ...
##  $ FLAG_EMAIL                  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ OCCUPATION_TYPE             : chr  "Laborers" "Core staff" "Laborers" "Laborers" ...
##  $ CNT_FAM_MEMBERS             : int  1 2 1 2 1 2 3 2 2 1 ...
##  $ REGION_RATING_CLIENT        : int  2 1 2 2 2 2 2 3 2 2 ...
##  $ REGION_RATING_CLIENT_W_CITY : int  2 1 2 2 2 2 2 3 2 2 ...
##  $ WEEKDAY_APPR_PROCESS_START  : chr  "WEDNESDAY" "MONDAY" "MONDAY" "WEDNESDAY" ...
##  $ HOUR_APPR_PROCESS_START     : int  10 11 9 17 11 16 16 16 14 8 ...
##  $ REG_REGION_NOT_LIVE_REGION  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ REG_REGION_NOT_WORK_REGION  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ LIVE_REGION_NOT_WORK_REGION : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ REG_CITY_NOT_LIVE_CITY      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ REG_CITY_NOT_WORK_CITY      : int  0 0 0 0 1 0 0 1 0 0 ...
##  $ LIVE_CITY_NOT_WORK_CITY     : int  0 0 0 0 1 0 0 1 0 0 ...
##  $ ORGANIZATION_TYPE           : chr  "Business Entity Type 3" "School" "Government" "Business Entity Type 3" ...
##  $ EXT_SOURCE_1                : num  0.083 0.311 NA NA NA ...
##  $ EXT_SOURCE_2                : num  0.263 0.622 0.556 0.65 0.323 ...
##  $ EXT_SOURCE_3                : num  0.139 NA 0.73 NA NA ...
##  $ APARTMENTS_AVG              : num  0.0247 0.0959 NA NA NA NA NA NA NA NA ...
##  $ BASEMENTAREA_AVG            : num  0.0369 0.0529 NA NA NA NA NA NA NA NA ...
##  $ YEARS_BEGINEXPLUATATION_AVG : num  0.972 0.985 NA NA NA ...
##  $ YEARS_BUILD_AVG             : num  0.619 0.796 NA NA NA ...
##  $ COMMONAREA_AVG              : num  0.0143 0.0605 NA NA NA NA NA NA NA NA ...
##  $ ELEVATORS_AVG               : num  0 0.08 NA NA NA NA NA NA NA NA ...
##  $ ENTRANCES_AVG               : num  0.069 0.0345 NA NA NA NA NA NA NA NA ...
##  $ FLOORSMAX_AVG               : num  0.0833 0.2917 NA NA NA ...
##  $ FLOORSMIN_AVG               : num  0.125 0.333 NA NA NA ...
##  $ LANDAREA_AVG                : num  0.0369 0.013 NA NA NA NA NA NA NA NA ...
##  $ LIVINGAPARTMENTS_AVG        : num  0.0202 0.0773 NA NA NA NA NA NA NA NA ...
##  $ LIVINGAREA_AVG              : num  0.019 0.0549 NA NA NA NA NA NA NA NA ...
##  $ NONLIVINGAPARTMENTS_AVG     : num  0 0.0039 NA NA NA NA NA NA NA NA ...
##  $ NONLIVINGAREA_AVG           : num  0 0.0098 NA NA NA NA NA NA NA NA ...
##  $ APARTMENTS_MODE             : num  0.0252 0.0924 NA NA NA NA NA NA NA NA ...
##  $ BASEMENTAREA_MODE           : num  0.0383 0.0538 NA NA NA NA NA NA NA NA ...
##  $ YEARS_BEGINEXPLUATATION_MODE: num  0.972 0.985 NA NA NA ...
##  $ YEARS_BUILD_MODE            : num  0.634 0.804 NA NA NA ...
##  $ COMMONAREA_MODE             : num  0.0144 0.0497 NA NA NA NA NA NA NA NA ...
##  $ ELEVATORS_MODE              : num  0 0.0806 NA NA NA NA NA NA NA NA ...
##  $ ENTRANCES_MODE              : num  0.069 0.0345 NA NA NA NA NA NA NA NA ...
##  $ FLOORSMAX_MODE              : num  0.0833 0.2917 NA NA NA ...
##  $ FLOORSMIN_MODE              : num  0.125 0.333 NA NA NA ...
##  $ LANDAREA_MODE               : num  0.0377 0.0128 NA NA NA NA NA NA NA NA ...
##  $ LIVINGAPARTMENTS_MODE       : num  0.022 0.079 NA NA NA NA NA NA NA NA ...
##  $ LIVINGAREA_MODE             : num  0.0198 0.0554 NA NA NA NA NA NA NA NA ...
##  $ NONLIVINGAPARTMENTS_MODE    : num  0 0 NA NA NA NA NA NA NA NA ...
##  $ NONLIVINGAREA_MODE          : num  0 0 NA NA NA NA NA NA NA NA ...
##  $ APARTMENTS_MEDI             : num  0.025 0.0968 NA NA NA NA NA NA NA NA ...
##  $ BASEMENTAREA_MEDI           : num  0.0369 0.0529 NA NA NA NA NA NA NA NA ...
##  $ YEARS_BEGINEXPLUATATION_MEDI: num  0.972 0.985 NA NA NA ...
##  $ YEARS_BUILD_MEDI            : num  0.624 0.799 NA NA NA ...
##  $ COMMONAREA_MEDI             : num  0.0144 0.0608 NA NA NA NA NA NA NA NA ...
##  $ ELEVATORS_MEDI              : num  0 0.08 NA NA NA NA NA NA NA NA ...
##  $ ENTRANCES_MEDI              : num  0.069 0.0345 NA NA NA NA NA NA NA NA ...
##  $ FLOORSMAX_MEDI              : num  0.0833 0.2917 NA NA NA ...
##  $ FLOORSMIN_MEDI              : num  0.125 0.333 NA NA NA ...
##  $ LANDAREA_MEDI               : num  0.0375 0.0132 NA NA NA NA NA NA NA NA ...
##  $ LIVINGAPARTMENTS_MEDI       : num  0.0205 0.0787 NA NA NA NA NA NA NA NA ...
##  $ LIVINGAREA_MEDI             : num  0.0193 0.0558 NA NA NA NA NA NA NA NA ...
##  $ NONLIVINGAPARTMENTS_MEDI    : num  0 0.0039 NA NA NA NA NA NA NA NA ...
##  $ NONLIVINGAREA_MEDI          : num  0 0.01 NA NA NA NA NA NA NA NA ...
##  $ FONDKAPREMONT_MODE          : chr  "reg oper account" "reg oper account" "" "" ...
##  $ HOUSETYPE_MODE              : chr  "block of flats" "block of flats" "" "" ...
##  $ TOTALAREA_MODE              : num  0.0149 0.0714 NA NA NA NA NA NA NA NA ...
##  $ WALLSMATERIAL_MODE          : chr  "Stone, brick" "Block" "" "" ...
##  $ EMERGENCYSTATE_MODE         : chr  "No" "No" "" "" ...
##  $ OBS_30_CNT_SOCIAL_CIRCLE    : int  2 1 0 2 0 0 1 2 1 2 ...
##  $ DEF_30_CNT_SOCIAL_CIRCLE    : int  2 0 0 0 0 0 0 0 0 0 ...
##  $ OBS_60_CNT_SOCIAL_CIRCLE    : int  2 1 0 2 0 0 1 2 1 2 ...
##  $ DEF_60_CNT_SOCIAL_CIRCLE    : int  2 0 0 0 0 0 0 0 0 0 ...
##  $ DAYS_LAST_PHONE_CHANGE      : int  -1134 -828 -815 -617 -1106 -2536 -1562 -1070 0 -1673 ...
##  $ FLAG_DOCUMENT_2             : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ FLAG_DOCUMENT_3             : int  1 1 0 1 0 1 0 1 1 0 ...
##  $ FLAG_DOCUMENT_4             : int  0 0 0 0 0 0 0 0 0 0 ...
##   [list output truncated]
summary(OriginalDF) #Shows the summary of the dataset
##    SK_ID_CURR         TARGET        NAME_CONTRACT_TYPE CODE_GENDER       
##  Min.   :100002   Min.   :0.00000   Length:307511      Length:307511     
##  1st Qu.:189146   1st Qu.:0.00000   Class :character   Class :character  
##  Median :278202   Median :0.00000   Mode  :character   Mode  :character  
##  Mean   :278181   Mean   :0.08073                                        
##  3rd Qu.:367143   3rd Qu.:0.00000                                        
##  Max.   :456255   Max.   :1.00000                                        
##                                                                          
##  FLAG_OWN_CAR       FLAG_OWN_REALTY     CNT_CHILDREN     AMT_INCOME_TOTAL   
##  Length:307511      Length:307511      Min.   : 0.0000   Min.   :    25650  
##  Class :character   Class :character   1st Qu.: 0.0000   1st Qu.:   112500  
##  Mode  :character   Mode  :character   Median : 0.0000   Median :   147150  
##                                        Mean   : 0.4171   Mean   :   168798  
##                                        3rd Qu.: 1.0000   3rd Qu.:   202500  
##                                        Max.   :19.0000   Max.   :117000000  
##                                                                             
##    AMT_CREDIT       AMT_ANNUITY     AMT_GOODS_PRICE   NAME_TYPE_SUITE   
##  Min.   :  45000   Min.   :  1616   Min.   :  40500   Length:307511     
##  1st Qu.: 270000   1st Qu.: 16524   1st Qu.: 238500   Class :character  
##  Median : 513531   Median : 24903   Median : 450000   Mode  :character  
##  Mean   : 599026   Mean   : 27109   Mean   : 538396                     
##  3rd Qu.: 808650   3rd Qu.: 34596   3rd Qu.: 679500                     
##  Max.   :4050000   Max.   :258026   Max.   :4050000                     
##                    NA's   :12       NA's   :278                         
##  NAME_INCOME_TYPE   NAME_EDUCATION_TYPE NAME_FAMILY_STATUS NAME_HOUSING_TYPE 
##  Length:307511      Length:307511       Length:307511      Length:307511     
##  Class :character   Class :character    Class :character   Class :character  
##  Mode  :character   Mode  :character    Mode  :character   Mode  :character  
##                                                                              
##                                                                              
##                                                                              
##                                                                              
##  REGION_POPULATION_RELATIVE   DAYS_BIRTH     DAYS_EMPLOYED    DAYS_REGISTRATION
##  Min.   :0.00029            Min.   :-25229   Min.   :-17912   Min.   :-24672   
##  1st Qu.:0.01001            1st Qu.:-19682   1st Qu.: -2760   1st Qu.: -7480   
##  Median :0.01885            Median :-15750   Median : -1213   Median : -4504   
##  Mean   :0.02087            Mean   :-16037   Mean   : 63815   Mean   : -4986   
##  3rd Qu.:0.02866            3rd Qu.:-12413   3rd Qu.:  -289   3rd Qu.: -2010   
##  Max.   :0.07251            Max.   : -7489   Max.   :365243   Max.   :     0   
##                                                                                
##  DAYS_ID_PUBLISH  OWN_CAR_AGE       FLAG_MOBIL FLAG_EMP_PHONE  
##  Min.   :-7197   Min.   : 0.00    Min.   :0    Min.   :0.0000  
##  1st Qu.:-4299   1st Qu.: 5.00    1st Qu.:1    1st Qu.:1.0000  
##  Median :-3254   Median : 9.00    Median :1    Median :1.0000  
##  Mean   :-2994   Mean   :12.06    Mean   :1    Mean   :0.8199  
##  3rd Qu.:-1720   3rd Qu.:15.00    3rd Qu.:1    3rd Qu.:1.0000  
##  Max.   :    0   Max.   :91.00    Max.   :1    Max.   :1.0000  
##                  NA's   :202929                                
##  FLAG_WORK_PHONE  FLAG_CONT_MOBILE   FLAG_PHONE       FLAG_EMAIL     
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:0.0000   1st Qu.:1.0000   1st Qu.:0.0000   1st Qu.:0.00000  
##  Median :0.0000   Median :1.0000   Median :0.0000   Median :0.00000  
##  Mean   :0.1994   Mean   :0.9981   Mean   :0.2811   Mean   :0.05672  
##  3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:0.00000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.00000  
##                                                                      
##  OCCUPATION_TYPE    CNT_FAM_MEMBERS  REGION_RATING_CLIENT
##  Length:307511      Min.   : 1.000   Min.   :1.000       
##  Class :character   1st Qu.: 2.000   1st Qu.:2.000       
##  Mode  :character   Median : 2.000   Median :2.000       
##                     Mean   : 2.153   Mean   :2.052       
##                     3rd Qu.: 3.000   3rd Qu.:2.000       
##                     Max.   :20.000   Max.   :3.000       
##                     NA's   :2                            
##  REGION_RATING_CLIENT_W_CITY WEEKDAY_APPR_PROCESS_START HOUR_APPR_PROCESS_START
##  Min.   :1.000               Length:307511              Min.   : 0.00          
##  1st Qu.:2.000               Class :character           1st Qu.:10.00          
##  Median :2.000               Mode  :character           Median :12.00          
##  Mean   :2.032                                          Mean   :12.06          
##  3rd Qu.:2.000                                          3rd Qu.:14.00          
##  Max.   :3.000                                          Max.   :23.00          
##                                                                                
##  REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION
##  Min.   :0.00000            Min.   :0.00000           
##  1st Qu.:0.00000            1st Qu.:0.00000           
##  Median :0.00000            Median :0.00000           
##  Mean   :0.01514            Mean   :0.05077           
##  3rd Qu.:0.00000            3rd Qu.:0.00000           
##  Max.   :1.00000            Max.   :1.00000           
##                                                       
##  LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY
##  Min.   :0.00000             Min.   :0.00000        Min.   :0.0000        
##  1st Qu.:0.00000             1st Qu.:0.00000        1st Qu.:0.0000        
##  Median :0.00000             Median :0.00000        Median :0.0000        
##  Mean   :0.04066             Mean   :0.07817        Mean   :0.2305        
##  3rd Qu.:0.00000             3rd Qu.:0.00000        3rd Qu.:0.0000        
##  Max.   :1.00000             Max.   :1.00000        Max.   :1.0000        
##                                                                           
##  LIVE_CITY_NOT_WORK_CITY ORGANIZATION_TYPE   EXT_SOURCE_1     EXT_SOURCE_2   
##  Min.   :0.0000          Length:307511      Min.   :0.01     Min.   :0.0000  
##  1st Qu.:0.0000          Class :character   1st Qu.:0.33     1st Qu.:0.3925  
##  Median :0.0000          Mode  :character   Median :0.51     Median :0.5660  
##  Mean   :0.1796                             Mean   :0.50     Mean   :0.5144  
##  3rd Qu.:0.0000                             3rd Qu.:0.68     3rd Qu.:0.6636  
##  Max.   :1.0000                             Max.   :0.96     Max.   :0.8550  
##                                             NA's   :173378   NA's   :660     
##   EXT_SOURCE_3   APARTMENTS_AVG   BASEMENTAREA_AVG YEARS_BEGINEXPLUATATION_AVG
##  Min.   :0.00    Min.   :0.00     Min.   :0.00     Min.   :0.00               
##  1st Qu.:0.37    1st Qu.:0.06     1st Qu.:0.04     1st Qu.:0.98               
##  Median :0.54    Median :0.09     Median :0.08     Median :0.98               
##  Mean   :0.51    Mean   :0.12     Mean   :0.09     Mean   :0.98               
##  3rd Qu.:0.67    3rd Qu.:0.15     3rd Qu.:0.11     3rd Qu.:0.99               
##  Max.   :0.90    Max.   :1.00     Max.   :1.00     Max.   :1.00               
##  NA's   :60965   NA's   :156061   NA's   :179943   NA's   :150007             
##  YEARS_BUILD_AVG  COMMONAREA_AVG   ELEVATORS_AVG    ENTRANCES_AVG   
##  Min.   :0.00     Min.   :0.00     Min.   :0.00     Min.   :0.00    
##  1st Qu.:0.69     1st Qu.:0.01     1st Qu.:0.00     1st Qu.:0.07    
##  Median :0.76     Median :0.02     Median :0.00     Median :0.14    
##  Mean   :0.75     Mean   :0.04     Mean   :0.08     Mean   :0.15    
##  3rd Qu.:0.82     3rd Qu.:0.05     3rd Qu.:0.12     3rd Qu.:0.21    
##  Max.   :1.00     Max.   :1.00     Max.   :1.00     Max.   :1.00    
##  NA's   :204488   NA's   :214865   NA's   :163891   NA's   :154828  
##  FLOORSMAX_AVG    FLOORSMIN_AVG     LANDAREA_AVG    LIVINGAPARTMENTS_AVG
##  Min.   :0.00     Min.   :0.00     Min.   :0.00     Min.   :0.00        
##  1st Qu.:0.17     1st Qu.:0.08     1st Qu.:0.02     1st Qu.:0.05        
##  Median :0.17     Median :0.21     Median :0.05     Median :0.08        
##  Mean   :0.23     Mean   :0.23     Mean   :0.07     Mean   :0.10        
##  3rd Qu.:0.33     3rd Qu.:0.38     3rd Qu.:0.09     3rd Qu.:0.12        
##  Max.   :1.00     Max.   :1.00     Max.   :1.00     Max.   :1.00        
##  NA's   :153020   NA's   :208642   NA's   :182590   NA's   :210199      
##  LIVINGAREA_AVG   NONLIVINGAPARTMENTS_AVG NONLIVINGAREA_AVG APARTMENTS_MODE 
##  Min.   :0.00     Min.   :0.00            Min.   :0.00      Min.   :0.00    
##  1st Qu.:0.05     1st Qu.:0.00            1st Qu.:0.00      1st Qu.:0.05    
##  Median :0.07     Median :0.00            Median :0.00      Median :0.08    
##  Mean   :0.11     Mean   :0.01            Mean   :0.03      Mean   :0.11    
##  3rd Qu.:0.13     3rd Qu.:0.00            3rd Qu.:0.03      3rd Qu.:0.14    
##  Max.   :1.00     Max.   :1.00            Max.   :1.00      Max.   :1.00    
##  NA's   :154350   NA's   :213514          NA's   :169682    NA's   :156061  
##  BASEMENTAREA_MODE YEARS_BEGINEXPLUATATION_MODE YEARS_BUILD_MODE
##  Min.   :0.00      Min.   :0.00                 Min.   :0.00    
##  1st Qu.:0.04      1st Qu.:0.98                 1st Qu.:0.70    
##  Median :0.07      Median :0.98                 Median :0.76    
##  Mean   :0.09      Mean   :0.98                 Mean   :0.76    
##  3rd Qu.:0.11      3rd Qu.:0.99                 3rd Qu.:0.82    
##  Max.   :1.00      Max.   :1.00                 Max.   :1.00    
##  NA's   :179943    NA's   :150007               NA's   :204488  
##  COMMONAREA_MODE  ELEVATORS_MODE   ENTRANCES_MODE   FLOORSMAX_MODE  
##  Min.   :0.00     Min.   :0.00     Min.   :0.00     Min.   :0.00    
##  1st Qu.:0.01     1st Qu.:0.00     1st Qu.:0.07     1st Qu.:0.17    
##  Median :0.02     Median :0.00     Median :0.14     Median :0.17    
##  Mean   :0.04     Mean   :0.07     Mean   :0.15     Mean   :0.22    
##  3rd Qu.:0.05     3rd Qu.:0.12     3rd Qu.:0.21     3rd Qu.:0.33    
##  Max.   :1.00     Max.   :1.00     Max.   :1.00     Max.   :1.00    
##  NA's   :214865   NA's   :163891   NA's   :154828   NA's   :153020  
##  FLOORSMIN_MODE   LANDAREA_MODE    LIVINGAPARTMENTS_MODE LIVINGAREA_MODE 
##  Min.   :0.00     Min.   :0.00     Min.   :0.00          Min.   :0.00    
##  1st Qu.:0.08     1st Qu.:0.02     1st Qu.:0.05          1st Qu.:0.04    
##  Median :0.21     Median :0.05     Median :0.08          Median :0.07    
##  Mean   :0.23     Mean   :0.06     Mean   :0.11          Mean   :0.11    
##  3rd Qu.:0.38     3rd Qu.:0.08     3rd Qu.:0.13          3rd Qu.:0.13    
##  Max.   :1.00     Max.   :1.00     Max.   :1.00          Max.   :1.00    
##  NA's   :208642   NA's   :182590   NA's   :210199        NA's   :154350  
##  NONLIVINGAPARTMENTS_MODE NONLIVINGAREA_MODE APARTMENTS_MEDI  BASEMENTAREA_MEDI
##  Min.   :0.00             Min.   :0.00       Min.   :0.00     Min.   :0.00     
##  1st Qu.:0.00             1st Qu.:0.00       1st Qu.:0.06     1st Qu.:0.04     
##  Median :0.00             Median :0.00       Median :0.09     Median :0.08     
##  Mean   :0.01             Mean   :0.03       Mean   :0.12     Mean   :0.09     
##  3rd Qu.:0.00             3rd Qu.:0.02       3rd Qu.:0.15     3rd Qu.:0.11     
##  Max.   :1.00             Max.   :1.00       Max.   :1.00     Max.   :1.00     
##  NA's   :213514           NA's   :169682     NA's   :156061   NA's   :179943   
##  YEARS_BEGINEXPLUATATION_MEDI YEARS_BUILD_MEDI COMMONAREA_MEDI 
##  Min.   :0.00                 Min.   :0.00     Min.   :0.00    
##  1st Qu.:0.98                 1st Qu.:0.69     1st Qu.:0.01    
##  Median :0.98                 Median :0.76     Median :0.02    
##  Mean   :0.98                 Mean   :0.76     Mean   :0.04    
##  3rd Qu.:0.99                 3rd Qu.:0.83     3rd Qu.:0.05    
##  Max.   :1.00                 Max.   :1.00     Max.   :1.00    
##  NA's   :150007               NA's   :204488   NA's   :214865  
##  ELEVATORS_MEDI   ENTRANCES_MEDI   FLOORSMAX_MEDI   FLOORSMIN_MEDI  
##  Min.   :0.00     Min.   :0.00     Min.   :0.00     Min.   :0.00    
##  1st Qu.:0.00     1st Qu.:0.07     1st Qu.:0.17     1st Qu.:0.08    
##  Median :0.00     Median :0.14     Median :0.17     Median :0.21    
##  Mean   :0.08     Mean   :0.15     Mean   :0.23     Mean   :0.23    
##  3rd Qu.:0.12     3rd Qu.:0.21     3rd Qu.:0.33     3rd Qu.:0.38    
##  Max.   :1.00     Max.   :1.00     Max.   :1.00     Max.   :1.00    
##  NA's   :163891   NA's   :154828   NA's   :153020   NA's   :208642  
##  LANDAREA_MEDI    LIVINGAPARTMENTS_MEDI LIVINGAREA_MEDI 
##  Min.   :0.00     Min.   :0.00          Min.   :0.00    
##  1st Qu.:0.02     1st Qu.:0.05          1st Qu.:0.05    
##  Median :0.05     Median :0.08          Median :0.07    
##  Mean   :0.07     Mean   :0.10          Mean   :0.11    
##  3rd Qu.:0.09     3rd Qu.:0.12          3rd Qu.:0.13    
##  Max.   :1.00     Max.   :1.00          Max.   :1.00    
##  NA's   :182590   NA's   :210199        NA's   :154350  
##  NONLIVINGAPARTMENTS_MEDI NONLIVINGAREA_MEDI FONDKAPREMONT_MODE
##  Min.   :0.00             Min.   :0.00       Length:307511     
##  1st Qu.:0.00             1st Qu.:0.00       Class :character  
##  Median :0.00             Median :0.00       Mode  :character  
##  Mean   :0.01             Mean   :0.03                         
##  3rd Qu.:0.00             3rd Qu.:0.03                         
##  Max.   :1.00             Max.   :1.00                         
##  NA's   :213514           NA's   :169682                       
##  HOUSETYPE_MODE     TOTALAREA_MODE   WALLSMATERIAL_MODE EMERGENCYSTATE_MODE
##  Length:307511      Min.   :0.00     Length:307511      Length:307511      
##  Class :character   1st Qu.:0.04     Class :character   Class :character   
##  Mode  :character   Median :0.07     Mode  :character   Mode  :character   
##                     Mean   :0.10                                           
##                     3rd Qu.:0.13                                           
##                     Max.   :1.00                                           
##                     NA's   :148431                                         
##  OBS_30_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE
##  Min.   :  0.000          Min.   : 0.0000          Min.   :  0.000         
##  1st Qu.:  0.000          1st Qu.: 0.0000          1st Qu.:  0.000         
##  Median :  0.000          Median : 0.0000          Median :  0.000         
##  Mean   :  1.422          Mean   : 0.1434          Mean   :  1.405         
##  3rd Qu.:  2.000          3rd Qu.: 0.0000          3rd Qu.:  2.000         
##  Max.   :348.000          Max.   :34.0000          Max.   :344.000         
##  NA's   :1021             NA's   :1021             NA's   :1021            
##  DEF_60_CNT_SOCIAL_CIRCLE DAYS_LAST_PHONE_CHANGE FLAG_DOCUMENT_2   
##  Min.   : 0.0             Min.   :-4292.0        Min.   :0.00e+00  
##  1st Qu.: 0.0             1st Qu.:-1570.0        1st Qu.:0.00e+00  
##  Median : 0.0             Median : -757.0        Median :0.00e+00  
##  Mean   : 0.1             Mean   : -962.9        Mean   :4.23e-05  
##  3rd Qu.: 0.0             3rd Qu.: -274.0        3rd Qu.:0.00e+00  
##  Max.   :24.0             Max.   :    0.0        Max.   :1.00e+00  
##  NA's   :1021             NA's   :1                                
##  FLAG_DOCUMENT_3 FLAG_DOCUMENT_4    FLAG_DOCUMENT_5   FLAG_DOCUMENT_6  
##  Min.   :0.00    Min.   :0.00e+00   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:0.00    1st Qu.:0.00e+00   1st Qu.:0.00000   1st Qu.:0.00000  
##  Median :1.00    Median :0.00e+00   Median :0.00000   Median :0.00000  
##  Mean   :0.71    Mean   :8.13e-05   Mean   :0.01511   Mean   :0.08806  
##  3rd Qu.:1.00    3rd Qu.:0.00e+00   3rd Qu.:0.00000   3rd Qu.:0.00000  
##  Max.   :1.00    Max.   :1.00e+00   Max.   :1.00000   Max.   :1.00000  
##                                                                        
##  FLAG_DOCUMENT_7     FLAG_DOCUMENT_8   FLAG_DOCUMENT_9    FLAG_DOCUMENT_10  
##  Min.   :0.0000000   Min.   :0.00000   Min.   :0.000000   Min.   :0.00e+00  
##  1st Qu.:0.0000000   1st Qu.:0.00000   1st Qu.:0.000000   1st Qu.:0.00e+00  
##  Median :0.0000000   Median :0.00000   Median :0.000000   Median :0.00e+00  
##  Mean   :0.0001919   Mean   :0.08138   Mean   :0.003896   Mean   :2.28e-05  
##  3rd Qu.:0.0000000   3rd Qu.:0.00000   3rd Qu.:0.000000   3rd Qu.:0.00e+00  
##  Max.   :1.0000000   Max.   :1.00000   Max.   :1.000000   Max.   :1.00e+00  
##                                                                             
##  FLAG_DOCUMENT_11   FLAG_DOCUMENT_12  FLAG_DOCUMENT_13   FLAG_DOCUMENT_14  
##  Min.   :0.000000   Min.   :0.0e+00   Min.   :0.000000   Min.   :0.000000  
##  1st Qu.:0.000000   1st Qu.:0.0e+00   1st Qu.:0.000000   1st Qu.:0.000000  
##  Median :0.000000   Median :0.0e+00   Median :0.000000   Median :0.000000  
##  Mean   :0.003912   Mean   :6.5e-06   Mean   :0.003525   Mean   :0.002936  
##  3rd Qu.:0.000000   3rd Qu.:0.0e+00   3rd Qu.:0.000000   3rd Qu.:0.000000  
##  Max.   :1.000000   Max.   :1.0e+00   Max.   :1.000000   Max.   :1.000000  
##                                                                            
##  FLAG_DOCUMENT_15  FLAG_DOCUMENT_16   FLAG_DOCUMENT_17    FLAG_DOCUMENT_18 
##  Min.   :0.00000   Min.   :0.000000   Min.   :0.0000000   Min.   :0.00000  
##  1st Qu.:0.00000   1st Qu.:0.000000   1st Qu.:0.0000000   1st Qu.:0.00000  
##  Median :0.00000   Median :0.000000   Median :0.0000000   Median :0.00000  
##  Mean   :0.00121   Mean   :0.009928   Mean   :0.0002667   Mean   :0.00813  
##  3rd Qu.:0.00000   3rd Qu.:0.000000   3rd Qu.:0.0000000   3rd Qu.:0.00000  
##  Max.   :1.00000   Max.   :1.000000   Max.   :1.0000000   Max.   :1.00000  
##                                                                            
##  FLAG_DOCUMENT_19    FLAG_DOCUMENT_20    FLAG_DOCUMENT_21   
##  Min.   :0.0000000   Min.   :0.0000000   Min.   :0.0000000  
##  1st Qu.:0.0000000   1st Qu.:0.0000000   1st Qu.:0.0000000  
##  Median :0.0000000   Median :0.0000000   Median :0.0000000  
##  Mean   :0.0005951   Mean   :0.0005073   Mean   :0.0003349  
##  3rd Qu.:0.0000000   3rd Qu.:0.0000000   3rd Qu.:0.0000000  
##  Max.   :1.0000000   Max.   :1.0000000   Max.   :1.0000000  
##                                                             
##  AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY
##  Min.   :0.00               Min.   :0.00             
##  1st Qu.:0.00               1st Qu.:0.00             
##  Median :0.00               Median :0.00             
##  Mean   :0.01               Mean   :0.01             
##  3rd Qu.:0.00               3rd Qu.:0.00             
##  Max.   :4.00               Max.   :9.00             
##  NA's   :41519              NA's   :41519            
##  AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT
##  Min.   :0.00               Min.   : 0.00             Min.   :  0.00           
##  1st Qu.:0.00               1st Qu.: 0.00             1st Qu.:  0.00           
##  Median :0.00               Median : 0.00             Median :  0.00           
##  Mean   :0.03               Mean   : 0.27             Mean   :  0.27           
##  3rd Qu.:0.00               3rd Qu.: 0.00             3rd Qu.:  0.00           
##  Max.   :8.00               Max.   :27.00             Max.   :261.00           
##  NA's   :41519              NA's   :41519             NA's   :41519            
##  AMT_REQ_CREDIT_BUREAU_YEAR
##  Min.   : 0.0              
##  1st Qu.: 0.0              
##  Median : 1.0              
##  Mean   : 1.9              
##  3rd Qu.: 3.0              
##  Max.   :25.0              
##  NA's   :41519

Part 3: Data Pre-processing

3.1 Checking for missing values

# Total number of missing values in the dataset
cat("The total number of missing values in the dataset is" , sum(is.na(OriginalDF)))
## The total number of missing values in the dataset is 8388094

Total number of missing values in the dataset per column name

colSums(is.na(OriginalDF))
##                   SK_ID_CURR                       TARGET 
##                            0                            0 
##           NAME_CONTRACT_TYPE                  CODE_GENDER 
##                            0                            0 
##                 FLAG_OWN_CAR              FLAG_OWN_REALTY 
##                            0                            0 
##                 CNT_CHILDREN             AMT_INCOME_TOTAL 
##                            0                            0 
##                   AMT_CREDIT                  AMT_ANNUITY 
##                            0                           12 
##              AMT_GOODS_PRICE              NAME_TYPE_SUITE 
##                          278                            0 
##             NAME_INCOME_TYPE          NAME_EDUCATION_TYPE 
##                            0                            0 
##           NAME_FAMILY_STATUS            NAME_HOUSING_TYPE 
##                            0                            0 
##   REGION_POPULATION_RELATIVE                   DAYS_BIRTH 
##                            0                            0 
##                DAYS_EMPLOYED            DAYS_REGISTRATION 
##                            0                            0 
##              DAYS_ID_PUBLISH                  OWN_CAR_AGE 
##                            0                       202929 
##                   FLAG_MOBIL               FLAG_EMP_PHONE 
##                            0                            0 
##              FLAG_WORK_PHONE             FLAG_CONT_MOBILE 
##                            0                            0 
##                   FLAG_PHONE                   FLAG_EMAIL 
##                            0                            0 
##              OCCUPATION_TYPE              CNT_FAM_MEMBERS 
##                            0                            2 
##         REGION_RATING_CLIENT  REGION_RATING_CLIENT_W_CITY 
##                            0                            0 
##   WEEKDAY_APPR_PROCESS_START      HOUR_APPR_PROCESS_START 
##                            0                            0 
##   REG_REGION_NOT_LIVE_REGION   REG_REGION_NOT_WORK_REGION 
##                            0                            0 
##  LIVE_REGION_NOT_WORK_REGION       REG_CITY_NOT_LIVE_CITY 
##                            0                            0 
##       REG_CITY_NOT_WORK_CITY      LIVE_CITY_NOT_WORK_CITY 
##                            0                            0 
##            ORGANIZATION_TYPE                 EXT_SOURCE_1 
##                            0                       173378 
##                 EXT_SOURCE_2                 EXT_SOURCE_3 
##                          660                        60965 
##               APARTMENTS_AVG             BASEMENTAREA_AVG 
##                       156061                       179943 
##  YEARS_BEGINEXPLUATATION_AVG              YEARS_BUILD_AVG 
##                       150007                       204488 
##               COMMONAREA_AVG                ELEVATORS_AVG 
##                       214865                       163891 
##                ENTRANCES_AVG                FLOORSMAX_AVG 
##                       154828                       153020 
##                FLOORSMIN_AVG                 LANDAREA_AVG 
##                       208642                       182590 
##         LIVINGAPARTMENTS_AVG               LIVINGAREA_AVG 
##                       210199                       154350 
##      NONLIVINGAPARTMENTS_AVG            NONLIVINGAREA_AVG 
##                       213514                       169682 
##              APARTMENTS_MODE            BASEMENTAREA_MODE 
##                       156061                       179943 
## YEARS_BEGINEXPLUATATION_MODE             YEARS_BUILD_MODE 
##                       150007                       204488 
##              COMMONAREA_MODE               ELEVATORS_MODE 
##                       214865                       163891 
##               ENTRANCES_MODE               FLOORSMAX_MODE 
##                       154828                       153020 
##               FLOORSMIN_MODE                LANDAREA_MODE 
##                       208642                       182590 
##        LIVINGAPARTMENTS_MODE              LIVINGAREA_MODE 
##                       210199                       154350 
##     NONLIVINGAPARTMENTS_MODE           NONLIVINGAREA_MODE 
##                       213514                       169682 
##              APARTMENTS_MEDI            BASEMENTAREA_MEDI 
##                       156061                       179943 
## YEARS_BEGINEXPLUATATION_MEDI             YEARS_BUILD_MEDI 
##                       150007                       204488 
##              COMMONAREA_MEDI               ELEVATORS_MEDI 
##                       214865                       163891 
##               ENTRANCES_MEDI               FLOORSMAX_MEDI 
##                       154828                       153020 
##               FLOORSMIN_MEDI                LANDAREA_MEDI 
##                       208642                       182590 
##        LIVINGAPARTMENTS_MEDI              LIVINGAREA_MEDI 
##                       210199                       154350 
##     NONLIVINGAPARTMENTS_MEDI           NONLIVINGAREA_MEDI 
##                       213514                       169682 
##           FONDKAPREMONT_MODE               HOUSETYPE_MODE 
##                            0                            0 
##               TOTALAREA_MODE           WALLSMATERIAL_MODE 
##                       148431                            0 
##          EMERGENCYSTATE_MODE     OBS_30_CNT_SOCIAL_CIRCLE 
##                            0                         1021 
##     DEF_30_CNT_SOCIAL_CIRCLE     OBS_60_CNT_SOCIAL_CIRCLE 
##                         1021                         1021 
##     DEF_60_CNT_SOCIAL_CIRCLE       DAYS_LAST_PHONE_CHANGE 
##                         1021                            1 
##              FLAG_DOCUMENT_2              FLAG_DOCUMENT_3 
##                            0                            0 
##              FLAG_DOCUMENT_4              FLAG_DOCUMENT_5 
##                            0                            0 
##              FLAG_DOCUMENT_6              FLAG_DOCUMENT_7 
##                            0                            0 
##              FLAG_DOCUMENT_8              FLAG_DOCUMENT_9 
##                            0                            0 
##             FLAG_DOCUMENT_10             FLAG_DOCUMENT_11 
##                            0                            0 
##             FLAG_DOCUMENT_12             FLAG_DOCUMENT_13 
##                            0                            0 
##             FLAG_DOCUMENT_14             FLAG_DOCUMENT_15 
##                            0                            0 
##             FLAG_DOCUMENT_16             FLAG_DOCUMENT_17 
##                            0                            0 
##             FLAG_DOCUMENT_18             FLAG_DOCUMENT_19 
##                            0                            0 
##             FLAG_DOCUMENT_20             FLAG_DOCUMENT_21 
##                            0                            0 
##   AMT_REQ_CREDIT_BUREAU_HOUR    AMT_REQ_CREDIT_BUREAU_DAY 
##                        41519                        41519 
##   AMT_REQ_CREDIT_BUREAU_WEEK    AMT_REQ_CREDIT_BUREAU_MON 
##                        41519                        41519 
##    AMT_REQ_CREDIT_BUREAU_QRT   AMT_REQ_CREDIT_BUREAU_YEAR 
##                        41519                        41519
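
For readability, the per-column missing counts can also be sorted so that the worst columns appear first (a short base R sketch):

# Sketch: the 10 columns with the most missing values
head(sort(colSums(is.na(OriginalDF)), decreasing = TRUE), 10)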

We have about 300,000 instances, so we remove columns that contain more than 140,000 missing values. We also remove rows in which more than 50% of the values are missing.

# Identify columns with more than 140,000 missing values and drop them
miss <- c()
for (i in 1:ncol(OriginalDF)) {
  if (sum(is.na(OriginalDF[, i])) > 140000) miss <- append(miss, i)
}
OriginalDF <- OriginalDF[, -miss]

# Identify rows in which more than 50% of the values are missing and drop them
miss2 <- c()
for (i in 1:nrow(OriginalDF)) {
  if (sum(is.na(OriginalDF[i, ])) > 0.5 * ncol(OriginalDF)) miss2 <- append(miss2, i)
}
if (length(miss2) > 0) OriginalDF <- OriginalDF[-miss2, ]

The FLAG_DOCUMENT columns (indices 52 to 71) are not useful for our analysis, so we also drop these columns and check the number of rows and columns left in the dataset.

OriginalDF<- OriginalDF[,-c(52:71)]
#check how many columns left
dim(OriginalDF)
## [1] 307466     57

3.2 Missing Value Treatment

Missing values in categorical columns: replace missing values with ‘Unknown’

OriginalDF<-OriginalDF %>% replace_with_na(replace = list(EMERGENCYSTATE_MODE =""))
OriginalDF$EMERGENCYSTATE_MODE= fct_explicit_na(OriginalDF$EMERGENCYSTATE_MODE, "Unknown")


OriginalDF<-OriginalDF %>% replace_with_na(replace = list(OCCUPATION_TYPE =""))
OriginalDF$OCCUPATION_TYPE= fct_explicit_na(OriginalDF$OCCUPATION_TYPE, "Unknown")


OriginalDF<-OriginalDF %>% replace_with_na(replace = list(FONDKAPREMONT_MODE =""))
OriginalDF$FONDKAPREMONT_MODE= fct_explicit_na(OriginalDF$FONDKAPREMONT_MODE, "Unknown")


OriginalDF<-OriginalDF %>% replace_with_na(replace = list(HOUSETYPE_MODE =""))
OriginalDF$HOUSETYPE_MODE= fct_explicit_na(OriginalDF$HOUSETYPE_MODE, "Unknown")


OriginalDF<-OriginalDF %>% replace_with_na(replace = list(WALLSMATERIAL_MODE =""))
OriginalDF$WALLSMATERIAL_MODE= fct_explicit_na(OriginalDF$WALLSMATERIAL_MODE, "Unknown")
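
The five replacements above can also be written as one loop over the affected columns; a compact sketch of the same idea (empty strings treated as missing, then an explicit ‘Unknown’ factor level):

# Sketch: loop version of the categorical missing-value treatment
cat_cols <- c("EMERGENCYSTATE_MODE", "OCCUPATION_TYPE", "FONDKAPREMONT_MODE",
              "HOUSETYPE_MODE", "WALLSMATERIAL_MODE")
for (col in cat_cols) {
  OriginalDF[[col]][OriginalDF[[col]] == ""] <- NA
  OriginalDF[[col]] <- fct_explicit_na(OriginalDF[[col]], "Unknown")
}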

Missing values in numeric columns: replace missing values with the column mean

# Impute remaining missing values in numeric columns with the column mean
for (i in 1:ncol(OriginalDF)) {
  if (is.numeric(OriginalDF[[i]])) {
    OriginalDF[[i]][is.na(OriginalDF[[i]])] <- mean(OriginalDF[[i]], na.rm = TRUE)
  }
}

3.3 Data Correction

The variables DAYS_BIRTH, DAYS_EMPLOYED, DAYS_REGISTRATION, DAYS_ID_PUBLISH and DAYS_LAST_PHONE_CHANGE hold negative values (days counted backwards from the application date). Therefore, we transform the negative values into positive values.

OriginalDF$DAYS_BIRTH<-OriginalDF$DAYS_BIRTH*-1
OriginalDF$DAYS_EMPLOYED<-OriginalDF$DAYS_EMPLOYED*-1
OriginalDF$DAYS_REGISTRATION<-OriginalDF$DAYS_REGISTRATION*-1
OriginalDF$DAYS_ID_PUBLISH<-OriginalDF$DAYS_ID_PUBLISH*-1
OriginalDF$DAYS_LAST_PHONE_CHANGE<-OriginalDF$DAYS_LAST_PHONE_CHANGE*-1
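
Equivalently, the sign flip can be written in one step with dplyr::across; a sketch (dplyr::mutate is called explicitly because plyr is also attached):

# Sketch: multiply the day-count columns by -1 in a single mutate
day_cols <- c("DAYS_BIRTH", "DAYS_EMPLOYED", "DAYS_REGISTRATION",
              "DAYS_ID_PUBLISH", "DAYS_LAST_PHONE_CHANGE")
OriginalDF <- dplyr::mutate(OriginalDF, dplyr::across(dplyr::all_of(day_cols), ~ .x * -1))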

3.4 Binning continuous variables to categorical variables

We bin the AMT_CREDIT variable into AMT_CREDIT_RANGE, the AMT_INCOME_TOTAL variable into AMT_INCOME_RANGE and the DAYS_BIRTH variable into AGE_RANGE.

# Creating bins for credit amount
bins  = c(0, 350000, 700000, 1000000000)
slots = c('Low', 'Medium', 'High')
OriginalDF['AMT_CREDIT_RANGE'] = cut(OriginalDF$AMT_CREDIT, breaks = bins, labels = slots)

# Creating bins for income amount
bins  = c(0, 200000, 400000, 10000000000)
slots = c('Low', 'Medium', 'High')
OriginalDF['AMT_INCOME_RANGE'] = cut(OriginalDF$AMT_INCOME_TOTAL, breaks = bins, labels = slots)

# Creating bins for age (DAYS_BIRTH is in days; 7300 days is roughly 20 years)
bins  = c(0, 7300, 10950, 14600, 18250, 21900, 25500)
slots = c('0-20', '20-30', '30-40', '40-50', '50-60', '60-70')
OriginalDF['AGE_RANGE'] = cut(OriginalDF$DAYS_BIRTH, breaks = bins, labels = slots)
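
A quick sanity check of the new bins can follow; values outside the break points would show up as NA (a short sketch):

# Sketch: counts per bin for the newly created categorical columns
table(OriginalDF$AMT_CREDIT_RANGE, useNA = "ifany")
table(OriginalDF$AMT_INCOME_RANGE, useNA = "ifany")
table(OriginalDF$AGE_RANGE, useNA = "ifany")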

3.5 Removal of Duplicated Rows

After imputation, certain rows have become identical and hence need to be removed.

OriginalDF = OriginalDF %>% distinct()

3.6 Verify Cleanliness

Verify Deduplication

sum(duplicated(OriginalDF))
## [1] 0

Verify Missing Values

sum(is.na(OriginalDF))
## [1] 0

Cleaned dataset

dim(OriginalDF)
## [1] 307466     60

Check whether the data is balanced

prop.table(table(OriginalDF$TARGET))
## 
##          0          1 
## 0.91926262 0.08073738

The TARGET column has 8.07% 1’s, which means about 8% of customers have payment difficulties while 91.93% have no payment difficulties. The dataset is imbalanced, and we will need to take note of this during modelling.

Part 4: Exploratory Data Analysis (EDA)

4.1 Univariate Analysis

We select these variables (CODE_GENDER, NAME_INCOME_TYPE, NAME_EDUCATION_TYPE, NAME_HOUSING_TYPE, OCCUPATION_TYPE, ORGANIZATION_TYPE, AMT_CREDIT_RANGE, AMT_INCOME_RANGE and AGE_RANGE) to perform our univariate analysis.

(1) GENDER, AMOUNT OF CREDIT/LOAN, AMOUNT OF INCOME & AGE

genderplot<-ggplot(data=OriginalDF,aes(x=CODE_GENDER,fill=CODE_GENDER)) +
  geom_bar(stat="count") + ggtitle('Gender')+
  theme(plot.title = element_text (hjust = 0.5)) + theme(legend.position="none") + xlab('') + ylab ('')

AMT_CREDITRANGEplot<-ggplot(data=OriginalDF,aes(x=factor(AMT_CREDIT_RANGE, levels=c('Low','Medium','High')),fill=AMT_CREDIT_RANGE))+
  geom_bar(stat="count") + ggtitle('Credit Amount (Range)') + 
  theme(plot.title = element_text (hjust = 0.5)) + theme(legend.position="none") + xlab('') + ylab ('')

AMT_INCOME_RANGEplot<-ggplot(data=OriginalDF,aes(x=factor(AMT_INCOME_RANGE, levels=c('Low','Medium','High')),fill=AMT_INCOME_RANGE))+
  geom_bar(stat="count") + ggtitle('Income Amount (Range)') + 
  theme(plot.title = element_text (hjust = 0.5)) + theme(legend.position="none") + xlab('') + ylab ('')

AGE_RANGEplot<-ggplot(data=OriginalDF,aes(x=AGE_RANGE,fill=AGE_RANGE))+ 
  #scale_fill_manual(values = c("#F8766D","#A3A500","#00BF7D","#00B0F6","#E76BF3"))+
  geom_bar(stat="count") + ggtitle('Age Range') +
  theme(plot.title = element_text (hjust = 0.5)) + theme(legend.position="none") + xlab('') + ylab ('')

p1 <- ggarrange(genderplot, AMT_INCOME_RANGEplot,
                ncol = 1, nrow = 2)
p2 <- ggarrange(AGE_RANGEplot, AMT_CREDITRANGEplot, 
                ncol = 1, nrow = 2)
ggarrange(p1, p2, ncol = 2, nrow = 1)


From the barplots above, the largest group of customers making loan payments are female customers aged 30-40 with low income and a low credit amount when applying for the loan.

(2) INCOME TYPE, EDUCATION TYPE, FAMILY STATUS, HOUSING TYPE & OCCUPATION TYPE

incometypeplot<-ggplot(data=OriginalDF,aes(x=NAME_INCOME_TYPE,fill=NAME_INCOME_TYPE)) +
  geom_bar(stat="count") + ggtitle('Types of Income')+
  theme(plot.title = element_text (hjust = 0.5)) + theme(legend.position="none") + xlab('') + ylab ('')

edutypeplot<-ggplot(data=OriginalDF,aes(x=NAME_EDUCATION_TYPE,fill=NAME_EDUCATION_TYPE)) +
  geom_bar(stat="count") + ggtitle('Types of Education')+
  theme(plot.title = element_text (hjust = 0.5)) + theme(legend.position="none") + xlab('') + ylab ('')

marriagetypeplot<-ggplot(data=OriginalDF,aes(x=NAME_FAMILY_STATUS,fill=NAME_FAMILY_STATUS)) +
  geom_bar(stat="count") + ggtitle('Marital Status')+
  theme(plot.title = element_text (hjust = 0.5)) + theme(legend.position="none") + xlab('') + ylab ('')

housetypeplot<-ggplot(data=OriginalDF,aes(x=NAME_HOUSING_TYPE,fill=NAME_HOUSING_TYPE)) +
  geom_bar(stat="count") + ggtitle('Types of Housing')+
  theme(plot.title = element_text (hjust = 0.5)) + theme(legend.position="none") + xlab('') + ylab ('')

worktypeplot<-ggplot(data=OriginalDF,aes(x=OCCUPATION_TYPE,fill=OCCUPATION_TYPE)) +
  geom_bar(stat="count") + ggtitle('Types of Occupation')+
  theme(plot.title = element_text (hjust = 0.5)) + theme(legend.position="none") + xlab('') + ylab ('')

p3 <- ggarrange(incometypeplot,edutypeplot,
                ncol = 1, nrow = 2)
p4 <- ggarrange(marriagetypeplot,housetypeplot, 
                ncol = 1, nrow = 2)
ggarrange(p3, p4, ncol = 2, nrow = 1)

worktypeplot+coord_flip()


From the barplots, the largest group of customers making loan payments are married, working clients who own a house or an apartment and have secondary education. However, the occupation of many of these customers is labelled as Unknown in our analysis.

4.2 Bivariate analysis

(1) AGE & Gender vs TARGET

par(mfrow = c(2, 1)) 
OriginalDF1 <- with(OriginalDF, table(TARGET, AGE_RANGE))
barplot(OriginalDF1, beside = TRUE, legend = TRUE,args.legend=list(x="topright",
                                                                    inset=c(-0.05,0)),
        col =c("Red","Green"),
        xlab="AGE_RANGE",
        ylab="Values",
        main ="Target=0 : No Payment issue, Target =1 : Payment issue",
        )

OriginalDF2 <- with(OriginalDF, table(TARGET,CODE_GENDER))
barplot(OriginalDF2, beside = TRUE, legend = TRUE,col =c("Red","Green"),args.legend=list(x="topright",
                                                                    inset=c(-0.05,0)),
        xlab="CODE_GENDER",
        ylab="values",
        main ="Target=0 : No Payment issue, Target =1 : Payment issue",
)


We can observe that most female customers have no issues repaying the loan, and customers in the 30-40 age group are largely able to make payments on time.

(2) Amount of credit(loan), Income type, and Amount of Income vs TARGET

par(mfrow = c(3, 1)) 
OriginalDF3 <- with(OriginalDF, table(TARGET, AMT_CREDIT_RANGE))
barplot(OriginalDF3, beside = TRUE, legend = TRUE, args.legend=list(x="topright",
                                                                    inset=c(0,0)),
        col =c("Red","Green"),
        xlab="Amount of Credit(Loan)",
        ylab="Values",
        main ="Target=0 : No Payment issue,Target =1 : Payment issue",
)



OriginalDF4 <- with(OriginalDF, table(TARGET,NAME_INCOME_TYPE))
barplot(OriginalDF4, beside = TRUE, legend = TRUE, args.legend=list(x="topright",
                                                                    inset=c(0,0)),
        col =c("Red","Green"),
        xlab="NAME_INCOME_TYPE",
        ylab="values",
        #horiz=T,
        main ="Target=0 : No Payment issue, Target =1 : Payment issue",
)


OriginalDF5 <- with(OriginalDF, table(TARGET,AMT_INCOME_RANGE))
barplot(OriginalDF5, beside = TRUE, legend = TRUE, args.legend=list(x="topright",
                                                                    inset=c(0,0)),
        col =c("Red","Green"),
        xlab="AMT_INCOME_RANGE",
        ylab="values",
        #horiz=T,
        main ="Target=0 : No Payment issue, Target =1 : Payment issue",
)


Working customers with a low credit amount and low income are the most likely to make payments. Customers with medium and high credit amounts can also be considered when granting loans.

(3) Types of Education, Marital Status & Types of Housing vs TARGET

par(mfrow = c(3, 1)) 
OriginalDF6 <- with(OriginalDF, table(TARGET,NAME_EDUCATION_TYPE))
barplot(OriginalDF6, beside = TRUE, legend = TRUE,args.legend=list(x="topright",
                                                                    inset=c(0,0)),
        col =c("Red","Green"),
        xlab="Types of Education",
        ylab="values",
        #horiz=T,
        main ="Target=0 : No Payment issue, Target =1 : Payment issue",
)

OriginalDF7 <- with(OriginalDF, table(TARGET,NAME_FAMILY_STATUS))
barplot(OriginalDF7, beside = TRUE, legend = TRUE,args.legend=list(x="topright",
                                                                    inset=c(0,0)),
        col =c("Red","Green"),
        xlab="Marital Status",
        ylab="values",
        #horiz=T,
        main ="Target=0 : No Payment issue, Target =1 : Payment issue",
)


OriginalDF8 <- with(OriginalDF, table(TARGET,NAME_HOUSING_TYPE))
barplot(OriginalDF8, beside = TRUE, legend = TRUE,args.legend=list(x="topright",
                                                                    inset=c(0,0)),
        col =c("Red","Green"),
        xlab="Types of Housing",
        ylab="values",
        #horiz=T,
        main ="Target=0 : No Payment issue, Target =1 : Payment issue",
)


Married customers with secondary education who live in a house or an apartment are more likely to make payments compared to customers with an academic degree.

(4) AMT_CREDIT, AMT_ANNUITY & AMT_GOODS_PRICE vs TARGET

par(mfrow = c(3, 1)) 
p1 <- ggplot(data=OriginalDF, aes(x=AMT_CREDIT, group=TARGET, fill=TARGET)) +
  geom_density(adjust=1.5) +
  theme_ipsum() 

p2 <- ggplot(data=OriginalDF, aes(x=AMT_ANNUITY, group=TARGET, fill=TARGET)) +
  geom_density(adjust=1.5) +
  theme_ipsum()

p3 <- ggplot(data=OriginalDF, aes(x=AMT_GOODS_PRICE, group=TARGET, fill=TARGET)) +
  geom_density(adjust=1.5) 
  

grid.arrange(p1,p2,p3,ncol=1)


Most customers pay their credit amount and goods price on time; however, they tend to have difficulty paying the annuity on time.

4.3 Correlation matrix

We select the numerical variables and compute the correlation matrix using the Spearman coefficient. Next, we visualise the correlation matrix to check for highly correlated variables.

data = select_if(OriginalDF, is.numeric)
data.cor = cor(data, method = "spearman")
corrplot(data.cor)
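
As a quick supplement to the visual inspection, the strongly correlated pairs can also be listed programmatically. The snippet below is an optional sketch built on the data.cor matrix computed above; the 0.8 cut-off is an arbitrary value chosen for illustration.

# optional sketch: list variable pairs with |Spearman correlation| above 0.8
high_cor <- which(abs(data.cor) > 0.8 & upper.tri(data.cor), arr.ind = TRUE)
data.frame(var1 = rownames(data.cor)[high_cor[, 1]],
           var2 = colnames(data.cor)[high_cor[, 2]],
           rho  = round(data.cor[high_cor], 2))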

Based on the correlation matrix, we selected a few variables (AMT_INCOME_TOTAL, AMT_CREDIT, AMT_GOODS_PRICE, AMT_INCOME_RANGE, CODE_GENDER and NAME_EDUCATION_TYPE) for further exploratory data analysis.

(1) AMT_INCOME_TOTAL vs AMT_CREDIT, AMT_GOODS_PRICE vs AMT_CREDIT

attach(data)
par(mfrow = c(1, 2)) 
plot(AMT_INCOME_TOTAL, AMT_CREDIT, main="Amount of Income vs Amount of Credit",
     xlab="AMT_INCOME_TOTAL", ylab="AMT_CREDIT", pch=19)
plot(AMT_GOODS_PRICE, AMT_CREDIT, main="Goods Price vs Amount of Credit",
     xlab="AMT_GOODS_PRICE", ylab="AMT_CREDIT", pch=19)


Those who paid the loan on time are more likely to be given higher credit amounts than those who did not pay or paid late. Likewise, people with higher goods prices who made payments on time have higher credit amounts than those with higher goods prices who did not repay the loan.

(2) AMT_INCOME_RANGE vs CODE_GENDER

OriginalDF9 <- with(OriginalDF, table(AMT_INCOME_RANGE,CODE_GENDER))
barplot(OriginalDF9, beside = TRUE, legend = TRUE,args.legend=list(x="topright",
                                                                    inset=c(0,0)),
        col =c("Red","Green","Blue"),
        xlab="CODE_GENDER",
        ylab="values",
        #horiz=T,
        main ="The relationship between income range and gender",
)


We can see that females with low income do not have any payment issues.

(3) AMT_CREDIT vs NAME_EDUCATION_TYPE

ggplot(OriginalDF, aes(x=NAME_EDUCATION_TYPE, y=AMT_CREDIT, fill=NAME_FAMILY_STATUS)) + 
  geom_boxplot() +
  ggtitle("Amount of credit by education type and family status")


From the boxplot above:
(1) Some highly educated, married customers have higher credit amounts than those with lower secondary education.
(2) Those with higher education have higher credit amounts and are more likely to make payments on time.
(3) More outliers are seen in the higher-education group.
(4) People with secondary and secondary special education are less likely to make payments on time.

After checking the correlation matrix, we drop insignificant variables such as SK_ID_CURR and REGION_RATING_CLIENT_W_CITY from the dataset and assign the result to df.

df = select(OriginalDF, -1, -31)

Part 5 Machine Learning Preparation (Classification)

5.1 Training and test set

We will split the data into a training set and a test set in a 70:30 ratio.

set.seed(123)
split = sample.split(df$TARGET,SplitRatio = 0.70)
training_set = subset(df, split == TRUE)
test_set = subset(df, split == FALSE)
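
Since sample.split() from the caTools package stratifies on the TARGET labels, the 0/1 ratio should be roughly the same in both subsets. The quick check below is optional and only confirms that the split behaved as expected.

# optional check: the class proportions should be similar in both subsets
round(prop.table(table(training_set$TARGET)), 3)
round(prop.table(table(test_set$TARGET)), 3)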

5.2 Scaling the continuous variables

Next, we scale the continuous variables in both the training set and the test set.

continuous_column = c('AMT_INCOME_TOTAL',  'AMT_CREDIT',  'AMT_ANNUITY',  'AMT_GOODS_PRICE',  'REGION_POPULATION_RELATIVE',  'DAYS_REGISTRATION',  'CNT_FAM_MEMBERS',  'EXT_SOURCE_2', 'EXT_SOURCE_3',  'OBS_30_CNT_SOCIAL_CIRCLE',  'DEF_30_CNT_SOCIAL_CIRCLE',  'OBS_60_CNT_SOCIAL_CIRCLE',  'DEF_60_CNT_SOCIAL_CIRCLE', 'DAYS_LAST_PHONE_CHANGE','AMT_REQ_CREDIT_BUREAU_HOUR',  'AMT_REQ_CREDIT_BUREAU_DAY',  'AMT_REQ_CREDIT_BUREAU_WEEK',  'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT',  'AMT_REQ_CREDIT_BUREAU_YEAR')
training_set[continuous_column] = scale(training_set[continuous_column])
test_set[continuous_column] = scale(test_set[continuous_column])
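
Note that scale() standardises each set with its own mean and standard deviation. A common alternative, shown below only as a sketch and not applied in this project, is to reuse the training-set centre and scale on the test set so that both sets are standardised on exactly the same footing.

# sketch of an alternative (not run): scale the test set with training-set parameters
# scaled_train <- scale(training_set[continuous_column])
# train_center <- attr(scaled_train, "scaled:center")
# train_scale  <- attr(scaled_train, "scaled:scale")
# training_set[continuous_column] <- scaled_train
# test_set[continuous_column]     <- scale(test_set[continuous_column],
#                                          center = train_center, scale = train_scale)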

5.3 Data Balancing

Based on the EDA results, the data is imbalanced: the majority of observations have TARGET = 0 (no payment issue). We will balance the training data by oversampling the minority class.

# data are imbalance
table(training_set$TARGET)
## 
##      0      1 
## 197849  17377
# balance the data
bal_training_set=ovun.sample(TARGET~., data = training_set, method = "over", N = 197849*2)$data
table(bal_training_set$TARGET)
## 
##      0      1 
## 197849 197849

Now we have perfectly balanced training data and are ready for modelling.

5.4 Performance Evaluation Function for Classification

Next, we create a performance evaluation function that reports accuracy, ROC area, Kappa, precision, recall and F1-measure. As the dataset is imbalanced and it is equally important to predict both classes correctly, we use the macro-average for precision, recall and F1-measure.

get_cm <- function(test_set, y_pred){
  # confusion matrix: rows = actual TARGET, columns = predicted class
  as.matrix(table(Actual = test_set$TARGET, Predicted = y_pred))
}

simple_eval <- function (test_set, y_pred) {
  CM = get_cm(test_set, y_pred)
  n = sum(CM)# number of instance
  nc = nrow(CM)# number of class, should be 2 in our case
  rowsums = apply(CM, 1, sum) # number of instances per class
  colsums = apply(CM, 2, sum) # number of predictions per class
  p = rowsums / n # distribution of instances over the actual classes
  q = colsums / n # distribution of instances over the predicted classes
  diag = diag(CM) # correctly classified counts per class (diagonal of the confusion matrix)

  accuracy = sum(diag) / n # overall classification accuracy
  precision = diag / colsums #  fraction of correct predictions for a certain class
  recall = diag / rowsums # fraction of instances of a class that were correctly predicted
  f1 = 2 * precision * recall / (precision + recall) # harmonic mean (or a weighted average) of precision and recall
  perf_df = data.frame(accuracy,precision, recall, f1) 
  AUC = auc(test_set$TARGET, factor(y_pred, ordered = TRUE)) # Area under ROC curve
  expAccuracy = sum(p*q)
  kappa = (accuracy - expAccuracy) / (1 - expAccuracy)
  macroPrecision = mean(precision)
  macroRecall = mean(recall)
  macroF1 = mean(f1)
  
  list(Accuracy=accuracy, ROC_area=AUC, Kappa =kappa, Precision=macroPrecision, Recall=macroRecall, Fscore=macroF1, performance_df=perf_df, confusion_matrix=CM)
}
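
As a quick illustration of how this helper is used and how the macro-averaged figures are formed, it can be exercised on a tiny, made-up set of labels. The values below are purely hypothetical and are not taken from the project data.

# purely illustrative toy example of simple_eval() on made-up labels
toy_test <- data.frame(TARGET = c(0, 0, 0, 0, 1, 1, 1, 0))
toy_pred <- c(0, 0, 1, 0, 1, 0, 1, 0)
simple_eval(toy_test, toy_pred)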

Part 6 Build Machine Learning Model (Classification)

We will develop a few models to predict whether a person has the ability to service the loan. Since the output is binary (yes/no), classification algorithms will be used, namely Decision Tree, Naive Bayes and Logistic Regression.

6.1 Decision Tree (DT) Classification

# fit the decision tree classification
model = rpart(TARGET ~.,
               data = bal_training_set, method = "class")
# plot decision tree
rpart.plot(model, extra = 106)


The decision tree model mainly uses the normalised scores from external data sources (EXT_SOURCE_2 and EXT_SOURCE_3) to determine whether a person has the ability to service the loan.
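
To cross-check this reading of the tree, the variable importance stored in the fitted rpart object can be inspected. This is a small optional check on the model object fitted above.

# optional check: which variables the fitted tree considers most important
head(sort(model$variable.importance, decreasing = TRUE), 10)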

Evaluate Decision Tree (DT) Model

# make prediction
DT.y_pred = predict (model, test_set, type = 'class')

# Confusion Matrix
DT.eval <- simple_eval(test_set, DT.y_pred)
DT.eval$confusion_matrix
##       Predicted
## Actual     0     1
##      0 67441 17352
##      1  4225  3222
# Performance by output class
print(round(DT.eval$performance_df,2))
##   accuracy precision recall   f1
## 0     0.77      0.94   0.80 0.86
## 1     0.77      0.16   0.43 0.23
# Model Evaluation
cat('Accuracy: ', round(DT.eval$Accuracy,2),
    '\nMacro-precision: ', round(DT.eval$Precision,2),
    '\nMacro-recall: ', round(DT.eval$Recall,2),
    '\nMacro-F1: ', round(DT.eval$Fscore,2),
    '\nKappa: ', round(DT.eval$Kappa,2),
    '\nROC area: ', round(DT.eval$ROC_area,2)
    )
## Accuracy:  0.77 
## Macro-precision:  0.55 
## Macro-recall:  0.61 
## Macro-F1:  0.55 
## Kappa:  0.13 
## ROC area:  0.61

6.2 Naive Bayes (NB)

For our second model, we fit a Naive Bayes classifier to the balanced training set.

# Fitting Naive Bayes Model to training dataset
model_NB <- naiveBayes(TARGET ~ ., data = bal_training_set)
# Predicting on test data
NB.y_pred <- predict(model_NB, newdata = test_set)

# Confusion Matrix
NB.eval <- simple_eval(test_set, NB.y_pred)
NB.eval$confusion_matrix
##       Predicted
## Actual     0     1
##      0 54056 30737
##      1  3113  4334
# Performance by output class
print(round(NB.eval$performance_df,2))
##   accuracy precision recall   f1
## 0     0.63      0.95   0.64 0.76
## 1     0.63      0.12   0.58 0.20

Evaluate Naive Bayes (NB) Model

# Model Evaluation
cat('Accuracy: ', round(NB.eval$Accuracy,2),
    '\nMacro-precision: ', round(NB.eval$Precision,2),
    '\nMacro-recall: ', round(NB.eval$Recall,2),
    '\nMacro-F1: ', round(NB.eval$Fscore,2),
    '\nKappa: ', round(NB.eval$Kappa,2),
    '\nROC area: ', round(NB.eval$ROC_area,2)
    )
## Accuracy:  0.63 
## Macro-precision:  0.53 
## Macro-recall:  0.61 
## Macro-F1:  0.48 
## Kappa:  0.08 
## ROC area:  0.61

6.3 Logistic Regression (LR)

For our third model, we fit a Logistic Regression model to the balanced training set.

# Training model
model_glm <- glm(TARGET ~ ., 
                 data = bal_training_set, 
                 family = "binomial")
# Predict test data based on model
LR.y_pred <- predict(model_glm, 
                  newdata = test_set, type = "response")

# Convert predicted probabilities to classes using a fixed 0.5 cut-off
LR.y_pred <- ifelse(LR.y_pred >0.5, 1, 0)

# Confusion matrix
LR.eval <- simple_eval(test_set, LR.y_pred)
LR.eval$confusion_matrix
##       Predicted
## Actual     0     1
##      0 58338 26455
##      1  2434  5013
# Performance by output class
print(round(LR.eval$performance_df,2))
##   accuracy precision recall   f1
## 0     0.69      0.96   0.69 0.80
## 1     0.69      0.16   0.67 0.26

Evaluate Logistic Regression (LR) Model

# Model Evaluation
cat('Accuracy: ', round(LR.eval$Accuracy,2),
    '\nMacro-precision: ', round(LR.eval$Precision,2),
    '\nMacro-recall: ', round(LR.eval$Recall,2),
    '\nMacro-F1: ', round(LR.eval$Fscore,2),
    '\nKappa: ', round(LR.eval$Kappa,2),
    '\nROC area: ', round(LR.eval$ROC_area,2)
    )
## Accuracy:  0.69 
## Macro-precision:  0.56 
## Macro-recall:  0.68 
## Macro-F1:  0.53 
## Kappa:  0.15 
## ROC area:  0.68
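
A note on the class cut-off: the predicted probabilities above were converted to classes with a fixed 0.5 threshold. As an optional refinement that is not part of the pipeline above, a threshold could instead be chosen from the ROC curve, for example with pROC's coords() helper. The snippet below is only a sketch under that assumption.

# sketch only: choose a probability cut-off from the ROC curve (Youden's J)
LR.prob  <- predict(model_glm, newdata = test_set, type = "response")
roc_lr   <- roc(test_set$TARGET, LR.prob)
best_cut <- coords(roc_lr, "best", ret = "threshold")[["threshold"]]
# LR.y_pred_alt <- ifelse(LR.prob > best_cut, 1, 0)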

6.4 Comparison between DT, NB and LR models

compare <- data.frame(Method=c('Decision Tree', 'Naive Bayes', 'Logistic Regression'), Accuracy = NA, Precision = NA, Recall = NA, FScore = NA, 'ROC' = NA)
compare$Accuracy <- c(round(DT.eval$Accuracy,2),round(NB.eval$Accuracy,2),round(LR.eval$Accuracy,2))
compare$Precision <- c(round(DT.eval$Precision,2),round(NB.eval$Precision,2),round(LR.eval$Precision,2))
compare$Recall <- c(round(DT.eval$Recall,2),round(NB.eval$Recall,2),round(LR.eval$Recall,2))
compare$FScore <- c(round(DT.eval$Fscore,2),round(NB.eval$Fscore,2),round(LR.eval$Fscore,2))
compare$ROC <- c(round(DT.eval$ROC_area,2),round(NB.eval$ROC_area,2),round(LR.eval$ROC_area,2))
kable_styling(kable(compare),c("striped","bordered"), full_width = F)
Method               Accuracy  Precision  Recall  FScore  ROC
Decision Tree            0.77       0.55    0.61    0.55  0.61
Naive Bayes              0.63       0.53    0.61    0.48  0.61
Logistic Regression      0.69       0.56    0.68    0.53  0.68

Decision Tree (DT) has the highest accuracy; however, it is worth noting that this model works well in predicting non-defaulters but not defaulters. When we look at the ROC area, Logistic Regression performs better than DT because it considers more features when predicting defaulters, rather than relying mainly on the normalised external rating scores.

get_roc_curve <- function(test_set, DT.y_pred, NB.y_pred, LR.y_pred){
  par(pty = "s")
  roc(test_set$TARGET, factor(DT.y_pred, ordered = TRUE),plot = TRUE, print.auc = TRUE, col="#377eb8", lwd = 2, main="ROC Curve Comparison")
  plot.roc(test_set$TARGET, factor(NB.y_pred, ordered = TRUE), print.auc = TRUE, add=TRUE, print.auc.y = 0.45, col="#4daf4a", lwd = 2)
  plot.roc(test_set$TARGET, factor(LR.y_pred, ordered = TRUE), print.auc = TRUE, add=TRUE, print.auc.y = 0.40, col="#FF0000", lwd = 2)
  legend("bottomright", legend = c("Decision Tree", "Naive Bayes", "Logistic Regression"), col=c("#377eb8", "#4daf4a", "#FF0000"), lwd = 2)
}

get_roc_curve(test_set, DT.y_pred, NB.y_pred, LR.y_pred)

Considering the ROC area together with the other metrics, Logistic Regression is the best model for predicting whether a person has the ability to repay the loan.

Part 7 Machine Learning Preparation (Regression)

7.1 Select observations that are able to repay the loan

# select rows where TARGET = 0 (able to repay the loan)
df1 = subset(df, TARGET == 0)

# Drop unnecessary or duplicated columns
df1 = select (df1, -c('TARGET', 'AMT_CREDIT_RANGE'))

7.2 Training and test set

We will split the data into a training set and a test set in a 70:30 ratio.

set.seed(123)
split = sample.split(df1$AMT_CREDIT,SplitRatio = 0.70)
training_set = subset(df1, split == TRUE)
test_set = subset(df1, split == FALSE)

7.3 Scaling the continuous variables

Next, we scale the continuous variables in both the training set and the test set.

continuous_column = c('AMT_INCOME_TOTAL', 'AMT_ANNUITY',  'AMT_GOODS_PRICE',  'REGION_POPULATION_RELATIVE',  'DAYS_REGISTRATION',  'CNT_FAM_MEMBERS',  'EXT_SOURCE_2', 'EXT_SOURCE_3',  'OBS_30_CNT_SOCIAL_CIRCLE',  'DEF_30_CNT_SOCIAL_CIRCLE',  'OBS_60_CNT_SOCIAL_CIRCLE',  'DEF_60_CNT_SOCIAL_CIRCLE', 'DAYS_LAST_PHONE_CHANGE','AMT_REQ_CREDIT_BUREAU_HOUR',  'AMT_REQ_CREDIT_BUREAU_DAY',  'AMT_REQ_CREDIT_BUREAU_WEEK',  'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT',  'AMT_REQ_CREDIT_BUREAU_YEAR')
training_set[continuous_column] = scale(training_set[continuous_column])
test_set[continuous_column] = scale(test_set[continuous_column])

x_train = data.matrix(select(training_set, -c('AMT_CREDIT')))
y_train = training_set$AMT_CREDIT

x_test = data.matrix(select(test_set, -c('AMT_CREDIT')))
y_test = test_set$AMT_CREDIT

7.4 Performance Evaluation for Regression

For regression, we will use Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE) and R-squared to evaluate the performance of the regression models.

regr_evaluation <- function(testDataset, predictionResult){
  MSE  <- mse(testDataset$AMT_CREDIT, predictionResult)
  RMSE <- rmse(testDataset$AMT_CREDIT, predictionResult)
  MAE  <- mae(testDataset$AMT_CREDIT, predictionResult)

  error <- testDataset$AMT_CREDIT - predictionResult
  R2 <- 1 - sum(error^2) / sum((testDataset$AMT_CREDIT - mean(testDataset$AMT_CREDIT))^2)
  # note: this is a variance-based approximation (1 - MSE/var), not the textbook
  # adjusted R-squared that penalises the number of predictors
  AD_R2 <- 1 - (MSE / var(testDataset$AMT_CREDIT))
  list(MSE = MSE, RMSE = RMSE, MAE = MAE, R_square = R2, Adjusted_R_Square = AD_R2)
}


show_graph <- function(testDataset, predictionResult){
  plot(x = predictionResult, y = testDataset$AMT_CREDIT,
       xlab = 'Predicted Values',
       ylab = 'Actual Values',
       main = 'Predicted vs. Actual Values')
  abline(a = 0, b = 1)  # 45-degree reference line for perfect predictions
}
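
Before applying these helpers, the metric definitions can be sanity-checked on made-up numbers. The snippet below is purely illustrative and uses the same mse/rmse/mae helper functions called in regr_evaluation().

# illustrative only: verify the RMSE and MAE definitions on made-up numbers
actual    <- c(100000, 200000, 300000)
predicted <- c(110000, 190000, 320000)
sqrt(mean((actual - predicted)^2))   # RMSE by hand, ~14142.14
rmse(actual, predicted)              # should match
mean(abs(actual - predicted))        # MAE by hand, ~13333.33
mae(actual, predicted)               # should match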

Part 8 Build Machine Learning Model (Regression)

We will develop a few models to predict the loan amount that one can take and still be able to service. Since the output is a continuous variable, regression algorithms will be used, namely Linear Regression, Ridge Regression and Decision Tree Regression.

8.1 Linear Regression (LR)

# fit the linear regression
lr <- lm(AMT_CREDIT ~ ., data = training_set)

Evaluate Linear Regression (LR)

# make prediction
LR.y_pred = predict(lr, test_set)

# Model Evaluation
LR.eval <- regr_evaluation(test_set,LR.y_pred)
cat('MSE: ', round(LR.eval$MSE,2),
'\nRMSE: ', round(LR.eval$RMSE,2),
'\nMAE: ', round(LR.eval$MAE,2),
'\nR_square: ', round(LR.eval$R_square,2),
'\nAdjusted_R_Square: ', round(LR.eval$Adjusted_R_Square,2)
)
## MSE:  3851730025 
## RMSE:  62062.31 
## MAE:  45469.49 
## R_square:  0.98 
## Adjusted_R_Square:  0.98
show_graph(test_set,LR.y_pred)

8.2 Ridge Regression (RR)

# fit the ridge regression
lambdas <- 10^seq(2, -3, by = -.1)
ridge_reg = glmnet(x_train, y_train, alpha = 0, lambda  = lambdas)

cv_ridge <- cv.glmnet(x_train, y_train, alpha = 0, lambda = lambdas)
optimal_lambda <- cv_ridge$lambda.min
optimal_lambda 
## [1] 0.001
best_fit <- cv_ridge$glmnet.fit

best_ridge <- glmnet(x_train, y_train, alpha = 0, lambda = optimal_lambda)
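
Optionally, the effect of the penalty can be visualised: the coefficient paths shrink towards zero as lambda grows, and the cross-validation curve shows how the optimal lambda was chosen. These are standard glmnet diagnostic plots, included here only as an optional check.

# optional diagnostics for the ridge fit
plot(ridge_reg, xvar = "lambda", label = TRUE)  # coefficient shrinkage paths
plot(cv_ridge)                                  # CV error across the lambda grid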

Evaluate Ridge Regression (RR)

# make prediction
RR.y_pred = predict(best_ridge, s = optimal_lambda, newx = x_test)

# Model Evaluation
RR.eval <- regr_evaluation(test_set,RR.y_pred)
cat('MSE: ', round(RR.eval$MSE,2),
'\nRMSE: ', round(RR.eval$RMSE,2),
'\nMAE: ', round(RR.eval$MAE,2),
'\nR_square: ', round(RR.eval$R_square,2),
'\nAdjusted_R_Square: ', round(RR.eval$Adjusted_R_Square,2)
)
## MSE:  3886222698 
## RMSE:  62339.58 
## MAE:  45662.85 
## R_square:  0.98 
## Adjusted_R_Square:  0.98
show_graph(test_set,RR.y_pred)

8.3 Decision Tree Regression (DTR)

# fit the decision tree
DT_regression <- rpart(AMT_CREDIT ~ ., 
             method = "anova", data = training_set )

rpart.plot(DT_regression, uniform = TRUE,
     main = "Amount Credit Decision Tree using Regression")

Evaluate Decision Tree Regression (DTR)

# make prediction
DTR.y_pred = predict(DT_regression, test_set)

# Model Evaluation
DTR.eval <- regr_evaluation(test_set,DTR.y_pred)
cat('MSE: ', round(DTR.eval$MSE,2),
'\nRMSE: ', round(DTR.eval$RMSE,2),
'\nMAE: ', round(DTR.eval$MAE,2),
'\nR_square: ', round(DTR.eval$R_square,2),
'\nAdjusted_R_Square: ', round(DTR.eval$Adjusted_R_Square,2)
)
## MSE:  10256317365 
## RMSE:  101273.5 
## MAE:  75873.9 
## R_square:  0.94 
## Adjusted_R_Square:  0.94
show_graph(test_set,DTR.y_pred)

8.4 Comparison between LR, RR and DTR models

compare2 <- data.frame(Method=c('Linear Regression', 'Ridge Regression', 'Decision Tree Regression'), MSE=NA, RMSE = NA, MAE=NA, R_Square = NA, Adjusted_R_Square=NA)
compare2$MSE <- c(round(LR.eval$MSE,2),round(RR.eval$MSE,2),round(DTR.eval$MSE,2))
compare2$RMSE <- c(round(LR.eval$RMSE,2),round(RR.eval$RMSE,2),round(DTR.eval$RMSE,2))
compare2$MAE <- c(round(LR.eval$MAE,2),round(RR.eval$MAE,2),round(DTR.eval$MAE,2))
compare2$R_Square <- c(round(LR.eval$R_square,2),round(RR.eval$R_square,2),round(DTR.eval$R_square,2))
compare2$Adjusted_R_Square <- c(round(LR.eval$Adjusted_R_Square,2),round(RR.eval$Adjusted_R_Square,2),round(DTR.eval$Adjusted_R_Square,2))

kable_styling(kable(compare2),c("striped","bordered"), full_width = F)
Method                            MSE       RMSE       MAE  R_Square  Adjusted_R_Square
Linear Regression          3851730025   62062.31  45469.49      0.98               0.98
Ridge Regression           3886222698   62339.58  45662.85      0.98               0.98
Decision Tree Regression  10256317365  101273.48  75873.90      0.94               0.94

The target variable, loan amount (AMT_CREDIT), ranges from 45,000 to 4.05 million. Because the loan amounts are large, the RMSE is also large in absolute terms, yet the R-squared remains high. Based on the predicted-vs-actual plots, the predictions are close to the actual values for both Linear Regression and Ridge Regression. Linear Regression, which has the lowest error and the highest R-squared, is the best model for predicting the loan amount that one can take.

Conclusion

  1. Based on the correlation matrix and feature selection, 58 variables (including 3 binned variables) out of the original 122 are deemed significant features for predictive model construction and performance.
  2. Logistic Regression works best in predicting loan defaulters. However, the dataset would need to be enhanced for our classification purpose.
  3. Linear Regression works best in predicting the loan amount that one can take without having issues repaying in the future.

In future studies, researchers can explore more machine learning algorithms, especially ensemble learners and deep learning models.