Taking up loans are common these days as the needs of people are increasing (Anant, S. et all, 2022) Often, there are borrower unable to service the loan and went default. In this case, bank will make a loss. Also, for the defaulter, their credit score will be drastically reduced and impact the ability to receive loan in the future. Hence, in this project, we will use machine learning to predict if a loan application will turn into defaulters. This will help bank in minimising their defaulter risk. In addition to that, we will also provide the loan applicants the suggested loan amount to increase their chance of getting approval and reduce the risk of becoming defaulters in the future.
The dataset we chose can be found from the link below. It consists 307,511 rows x 122 columns. https://www.kaggle.com/code/shrutidandagi/eda-bank-loan-default/notebook
There are around 122 variables in the dataset. Only 57 variables (which later found to be meaningful) are selected to display in this section.
Attribute Information:
1. SK_ID_CURR:ID of loan in our sample
2. TARGET: Target variable (1 - client with payment difficulties 0- client with no payment difficulties)
3. NAME_CONTRACT_TYPE: Identification if loan is cash or revolving
4. CODE_GENDER: Gender of the client
5. FLAG_OWN_CAR: Flag if the client owns a car
6. FLAG_OWN_REALTY: Flag if client owns a house or flat
7. CNT_CHILDREN: Number of children the client has
8. AMT_INCOME_TOTAL: Income of the client
9. AMT_CREDIT: Credit amount of the loan
10. AMT_ANNUITY: Loan annuity
11. AMT_GOODS_PRICE: For consumer loans it is the price of the goods for which the loan is given
12. NAME_TYPE_SUITE: Who was accompanying client when he was applying for the loan
13. NAME_INCOME_TYPE: Clients income type (businessman, working, maternity leave,…)
14. NAME_EDUCATION_TYPE: Level of highest education the client achieved
15. NAME_FAMILY_STATUS:Family status of the client
16. NAME_HOUSING_TYPE: What is the housing situation of the client (renting, living with parents, …)
17. REGION_POPULATION_RELATIVE: Normalized population of region where client lives (higher number means the client lives in more populated region)
18. DAYS_BIRTH: Client’s age in days at the time of application
19. DAYS_EMPLOYED: How many days before the application the person started current employment
20. DAYS_REGISTRATION: How many days before the application did client change his registration
21. DAYS_ID_PUBLISH: How many days before the application did client change the identity document
22. FLAG_MOBIL: Did client provide mobile phone (1=YES, 0=NO)
23. FLAG_EMP_PHONE: Did client provide work phone (1=YES, 0=NO)
24. FLAG_WORK_PHONE: Did client provide home phone (1=YES, 0=NO)
25. FLAG_CONT_MOBILE: Was mobile phone reachable (1=YES, 0=NO)
26. FLAG_PHONE: Did client provide home phone (1=YES, 0=NO)
27. FLAG_EMAIL: Did client provide email (1=YES, 0=NO)
28. OCCUPATION_TYPE: What kind of occupation does the client have
29. CNT_FAM_MEMBERS: How many family members does client have
30. REGION_RATING_CLIENT: Our rating of the region where client lives (1,2,3)
31. REGION_RATING_CLIENT_W_CITY: REGION_RATING_CLIENT_W_CITY
32. WEEKDAY_APPR_PROCESS_START: On which day of the week did the client apply for the loan
33. HOUR_APPR_PROCESS_START: Approximately at what hour did the client apply for the loan
34. REG_REGION_NOT_LIVE_REGION: Flag if client’s permanent address does not match contact address (1=different, 0=same, at region level)
35. REG_REGION_NOT_WORK_REGION: Flag if client’s contact address does not match work address (1=different, 0=same, at region level)
36. LIVE_REGION_NOT_WORK_REGION: Flag if client’s contact address does not match work address (1=different, 0=same, at region level)
37. REG_CITY_NOT_LIVE_CITY: Flag if client’s permanent address does not match contact address (1=different, 0=same, at city level)
38. REG_CITY_NOT_WORK_CITY: Flag if client’s permanent address does not match work address (1=different, 0=same, at city level)
39. LIVE_CITY_NOT_WORK_CITY: Flag if client’s contact address does not match work address (1=different, 0=same, at city level)
40. ORGANIZATION_TYPE: Type of organization where client works
41. EXT_SOURCE_2: Normalized score from external data source
42. EXT_SOURCE_3: Normalized score from external data source
43. FONDKAPREMONT_MODE: Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor
44. HOUSETYPE_MODE: Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor
45. WALLSMATERIAL_MODE: Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor
46. EMERGENCYSTATE_MODE: Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor
47. OBS_30_CNT_SOCIAL_CIRCLE : How many observation of client’s social surroundings with observable 30 DPD (days past due) default
48. DEF_30_CNT_SOCIAL_CIRCLE: How many observation of client’s social surroundings defaulted on 30 DPD (days past due)
49. OBS_60_CNT_SOCIAL_CIRCLE: How many observation of client’s social surroundings with observable 60 DPD (days past due) default
50. DEF_60_CNT_SOCIAL_CIRCLE: How many observation of client’s social surroundings defaulted on 60 (days past due) DPD
51. DAYS_LAST_PHONE_CHANGE: How many days before application did client change phone
52. AMT_REQ_CREDIT_BUREAU_HOUR: Number of enquiries to Credit Bureau about the client one hour before application
53. AMT_REQ_CREDIT_BUREAU_DAY: Number of enquiries to Credit Bureau about the client one day before application (excluding one hour before application)
54. AMT_REQ_CREDIT_BUREAU_WEEK: Number of enquiries to Credit Bureau about the client one week before application (excluding one day before application)
55. AMT_REQ_CREDIT_BUREAU_MON: Number of enquiries to Credit Bureau about the client one month before application (excluding one week before application)
56. AMT_REQ_CREDIT_BUREAU_QRT: Number of enquiries to Credit Bureau about the client 3 month before application (excluding one month before application)
57. AMT_REQ_CREDIT_BUREAU_YEAR: Number of enquiries to Credit Bureau about the client one day year (excluding last 3 months before application)
library('ggpubr')
library(corrplot)
library(naniar)
library(forcats)
library('readxl')
library('dplyr')
library('ggplot2')
library(tidyverse)
library(knitr) # web widget
library(tidyverse) # data manipulation
library(data.table) # fast file reading
library(caret) # rocr analysis
library(ROCR) # rocr analysis
library(kableExtra) # nice table html formating
library(gridExtra) # arranging ggplot in grid
library(rpart) # decision tree
library(rpart.plot) # decision tree plotting
library(caTools) # split
library(hrbrthemes)
library(viridis)
library(e1071)
library(rpart)
library(rpart.plot)
library(ROSE)
library(pROC)
library(Metrics)
library(glmnet)
library(plyr)
library(readr)
library(repr)## [1] 307511 122
## 'data.frame': 307511 obs. of 122 variables:
## $ SK_ID_CURR : int 100002 100003 100004 100006 100007 100008 100009 100010 100011 100012 ...
## $ TARGET : int 1 0 0 0 0 0 0 0 0 0 ...
## $ NAME_CONTRACT_TYPE : chr "Cash loans" "Cash loans" "Revolving loans" "Cash loans" ...
## $ CODE_GENDER : chr "M" "F" "M" "F" ...
## $ FLAG_OWN_CAR : chr "N" "N" "Y" "N" ...
## $ FLAG_OWN_REALTY : chr "Y" "N" "Y" "Y" ...
## $ CNT_CHILDREN : int 0 0 0 0 0 0 1 0 0 0 ...
## $ AMT_INCOME_TOTAL : num 202500 270000 67500 135000 121500 ...
## $ AMT_CREDIT : num 406598 1293503 135000 312683 513000 ...
## $ AMT_ANNUITY : num 24701 35699 6750 29687 21866 ...
## $ AMT_GOODS_PRICE : num 351000 1129500 135000 297000 513000 ...
## $ NAME_TYPE_SUITE : chr "Unaccompanied" "Family" "Unaccompanied" "Unaccompanied" ...
## $ NAME_INCOME_TYPE : chr "Working" "State servant" "Working" "Working" ...
## $ NAME_EDUCATION_TYPE : chr "Secondary / secondary special" "Higher education" "Secondary / secondary special" "Secondary / secondary special" ...
## $ NAME_FAMILY_STATUS : chr "Single / not married" "Married" "Single / not married" "Civil marriage" ...
## $ NAME_HOUSING_TYPE : chr "House / apartment" "House / apartment" "House / apartment" "House / apartment" ...
## $ REGION_POPULATION_RELATIVE : num 0.0188 0.00354 0.01003 0.00802 0.02866 ...
## $ DAYS_BIRTH : int -9461 -16765 -19046 -19005 -19932 -16941 -13778 -18850 -20099 -14469 ...
## $ DAYS_EMPLOYED : int -637 -1188 -225 -3039 -3038 -1588 -3130 -449 365243 -2019 ...
## $ DAYS_REGISTRATION : num -3648 -1186 -4260 -9833 -4311 ...
## $ DAYS_ID_PUBLISH : int -2120 -291 -2531 -2437 -3458 -477 -619 -2379 -3514 -3992 ...
## $ OWN_CAR_AGE : int NA NA 26 NA NA NA 17 8 NA NA ...
## $ FLAG_MOBIL : int 1 1 1 1 1 1 1 1 1 1 ...
## $ FLAG_EMP_PHONE : int 1 1 1 1 1 1 1 1 0 1 ...
## $ FLAG_WORK_PHONE : int 0 0 1 0 0 1 0 1 0 0 ...
## $ FLAG_CONT_MOBILE : int 1 1 1 1 1 1 1 1 1 1 ...
## $ FLAG_PHONE : int 1 1 1 0 0 1 1 0 0 0 ...
## $ FLAG_EMAIL : int 0 0 0 0 0 0 0 0 0 0 ...
## $ OCCUPATION_TYPE : chr "Laborers" "Core staff" "Laborers" "Laborers" ...
## $ CNT_FAM_MEMBERS : int 1 2 1 2 1 2 3 2 2 1 ...
## $ REGION_RATING_CLIENT : int 2 1 2 2 2 2 2 3 2 2 ...
## $ REGION_RATING_CLIENT_W_CITY : int 2 1 2 2 2 2 2 3 2 2 ...
## $ WEEKDAY_APPR_PROCESS_START : chr "WEDNESDAY" "MONDAY" "MONDAY" "WEDNESDAY" ...
## $ HOUR_APPR_PROCESS_START : int 10 11 9 17 11 16 16 16 14 8 ...
## $ REG_REGION_NOT_LIVE_REGION : int 0 0 0 0 0 0 0 0 0 0 ...
## $ REG_REGION_NOT_WORK_REGION : int 0 0 0 0 0 0 0 0 0 0 ...
## $ LIVE_REGION_NOT_WORK_REGION : int 0 0 0 0 0 0 0 0 0 0 ...
## $ REG_CITY_NOT_LIVE_CITY : int 0 0 0 0 0 0 0 0 0 0 ...
## $ REG_CITY_NOT_WORK_CITY : int 0 0 0 0 1 0 0 1 0 0 ...
## $ LIVE_CITY_NOT_WORK_CITY : int 0 0 0 0 1 0 0 1 0 0 ...
## $ ORGANIZATION_TYPE : chr "Business Entity Type 3" "School" "Government" "Business Entity Type 3" ...
## $ EXT_SOURCE_1 : num 0.083 0.311 NA NA NA ...
## $ EXT_SOURCE_2 : num 0.263 0.622 0.556 0.65 0.323 ...
## $ EXT_SOURCE_3 : num 0.139 NA 0.73 NA NA ...
## $ APARTMENTS_AVG : num 0.0247 0.0959 NA NA NA NA NA NA NA NA ...
## $ BASEMENTAREA_AVG : num 0.0369 0.0529 NA NA NA NA NA NA NA NA ...
## $ YEARS_BEGINEXPLUATATION_AVG : num 0.972 0.985 NA NA NA ...
## $ YEARS_BUILD_AVG : num 0.619 0.796 NA NA NA ...
## $ COMMONAREA_AVG : num 0.0143 0.0605 NA NA NA NA NA NA NA NA ...
## $ ELEVATORS_AVG : num 0 0.08 NA NA NA NA NA NA NA NA ...
## $ ENTRANCES_AVG : num 0.069 0.0345 NA NA NA NA NA NA NA NA ...
## $ FLOORSMAX_AVG : num 0.0833 0.2917 NA NA NA ...
## $ FLOORSMIN_AVG : num 0.125 0.333 NA NA NA ...
## $ LANDAREA_AVG : num 0.0369 0.013 NA NA NA NA NA NA NA NA ...
## $ LIVINGAPARTMENTS_AVG : num 0.0202 0.0773 NA NA NA NA NA NA NA NA ...
## $ LIVINGAREA_AVG : num 0.019 0.0549 NA NA NA NA NA NA NA NA ...
## $ NONLIVINGAPARTMENTS_AVG : num 0 0.0039 NA NA NA NA NA NA NA NA ...
## $ NONLIVINGAREA_AVG : num 0 0.0098 NA NA NA NA NA NA NA NA ...
## $ APARTMENTS_MODE : num 0.0252 0.0924 NA NA NA NA NA NA NA NA ...
## $ BASEMENTAREA_MODE : num 0.0383 0.0538 NA NA NA NA NA NA NA NA ...
## $ YEARS_BEGINEXPLUATATION_MODE: num 0.972 0.985 NA NA NA ...
## $ YEARS_BUILD_MODE : num 0.634 0.804 NA NA NA ...
## $ COMMONAREA_MODE : num 0.0144 0.0497 NA NA NA NA NA NA NA NA ...
## $ ELEVATORS_MODE : num 0 0.0806 NA NA NA NA NA NA NA NA ...
## $ ENTRANCES_MODE : num 0.069 0.0345 NA NA NA NA NA NA NA NA ...
## $ FLOORSMAX_MODE : num 0.0833 0.2917 NA NA NA ...
## $ FLOORSMIN_MODE : num 0.125 0.333 NA NA NA ...
## $ LANDAREA_MODE : num 0.0377 0.0128 NA NA NA NA NA NA NA NA ...
## $ LIVINGAPARTMENTS_MODE : num 0.022 0.079 NA NA NA NA NA NA NA NA ...
## $ LIVINGAREA_MODE : num 0.0198 0.0554 NA NA NA NA NA NA NA NA ...
## $ NONLIVINGAPARTMENTS_MODE : num 0 0 NA NA NA NA NA NA NA NA ...
## $ NONLIVINGAREA_MODE : num 0 0 NA NA NA NA NA NA NA NA ...
## $ APARTMENTS_MEDI : num 0.025 0.0968 NA NA NA NA NA NA NA NA ...
## $ BASEMENTAREA_MEDI : num 0.0369 0.0529 NA NA NA NA NA NA NA NA ...
## $ YEARS_BEGINEXPLUATATION_MEDI: num 0.972 0.985 NA NA NA ...
## $ YEARS_BUILD_MEDI : num 0.624 0.799 NA NA NA ...
## $ COMMONAREA_MEDI : num 0.0144 0.0608 NA NA NA NA NA NA NA NA ...
## $ ELEVATORS_MEDI : num 0 0.08 NA NA NA NA NA NA NA NA ...
## $ ENTRANCES_MEDI : num 0.069 0.0345 NA NA NA NA NA NA NA NA ...
## $ FLOORSMAX_MEDI : num 0.0833 0.2917 NA NA NA ...
## $ FLOORSMIN_MEDI : num 0.125 0.333 NA NA NA ...
## $ LANDAREA_MEDI : num 0.0375 0.0132 NA NA NA NA NA NA NA NA ...
## $ LIVINGAPARTMENTS_MEDI : num 0.0205 0.0787 NA NA NA NA NA NA NA NA ...
## $ LIVINGAREA_MEDI : num 0.0193 0.0558 NA NA NA NA NA NA NA NA ...
## $ NONLIVINGAPARTMENTS_MEDI : num 0 0.0039 NA NA NA NA NA NA NA NA ...
## $ NONLIVINGAREA_MEDI : num 0 0.01 NA NA NA NA NA NA NA NA ...
## $ FONDKAPREMONT_MODE : chr "reg oper account" "reg oper account" "" "" ...
## $ HOUSETYPE_MODE : chr "block of flats" "block of flats" "" "" ...
## $ TOTALAREA_MODE : num 0.0149 0.0714 NA NA NA NA NA NA NA NA ...
## $ WALLSMATERIAL_MODE : chr "Stone, brick" "Block" "" "" ...
## $ EMERGENCYSTATE_MODE : chr "No" "No" "" "" ...
## $ OBS_30_CNT_SOCIAL_CIRCLE : int 2 1 0 2 0 0 1 2 1 2 ...
## $ DEF_30_CNT_SOCIAL_CIRCLE : int 2 0 0 0 0 0 0 0 0 0 ...
## $ OBS_60_CNT_SOCIAL_CIRCLE : int 2 1 0 2 0 0 1 2 1 2 ...
## $ DEF_60_CNT_SOCIAL_CIRCLE : int 2 0 0 0 0 0 0 0 0 0 ...
## $ DAYS_LAST_PHONE_CHANGE : int -1134 -828 -815 -617 -1106 -2536 -1562 -1070 0 -1673 ...
## $ FLAG_DOCUMENT_2 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ FLAG_DOCUMENT_3 : int 1 1 0 1 0 1 0 1 1 0 ...
## $ FLAG_DOCUMENT_4 : int 0 0 0 0 0 0 0 0 0 0 ...
## [list output truncated]
## SK_ID_CURR TARGET NAME_CONTRACT_TYPE CODE_GENDER
## Min. :100002 Min. :0.00000 Length:307511 Length:307511
## 1st Qu.:189146 1st Qu.:0.00000 Class :character Class :character
## Median :278202 Median :0.00000 Mode :character Mode :character
## Mean :278181 Mean :0.08073
## 3rd Qu.:367143 3rd Qu.:0.00000
## Max. :456255 Max. :1.00000
##
## FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL
## Length:307511 Length:307511 Min. : 0.0000 Min. : 25650
## Class :character Class :character 1st Qu.: 0.0000 1st Qu.: 112500
## Mode :character Mode :character Median : 0.0000 Median : 147150
## Mean : 0.4171 Mean : 168798
## 3rd Qu.: 1.0000 3rd Qu.: 202500
## Max. :19.0000 Max. :117000000
##
## AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE NAME_TYPE_SUITE
## Min. : 45000 Min. : 1616 Min. : 40500 Length:307511
## 1st Qu.: 270000 1st Qu.: 16524 1st Qu.: 238500 Class :character
## Median : 513531 Median : 24903 Median : 450000 Mode :character
## Mean : 599026 Mean : 27109 Mean : 538396
## 3rd Qu.: 808650 3rd Qu.: 34596 3rd Qu.: 679500
## Max. :4050000 Max. :258026 Max. :4050000
## NA's :12 NA's :278
## NAME_INCOME_TYPE NAME_EDUCATION_TYPE NAME_FAMILY_STATUS NAME_HOUSING_TYPE
## Length:307511 Length:307511 Length:307511 Length:307511
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## REGION_POPULATION_RELATIVE DAYS_BIRTH DAYS_EMPLOYED DAYS_REGISTRATION
## Min. :0.00029 Min. :-25229 Min. :-17912 Min. :-24672
## 1st Qu.:0.01001 1st Qu.:-19682 1st Qu.: -2760 1st Qu.: -7480
## Median :0.01885 Median :-15750 Median : -1213 Median : -4504
## Mean :0.02087 Mean :-16037 Mean : 63815 Mean : -4986
## 3rd Qu.:0.02866 3rd Qu.:-12413 3rd Qu.: -289 3rd Qu.: -2010
## Max. :0.07251 Max. : -7489 Max. :365243 Max. : 0
##
## DAYS_ID_PUBLISH OWN_CAR_AGE FLAG_MOBIL FLAG_EMP_PHONE
## Min. :-7197 Min. : 0.00 Min. :0 Min. :0.0000
## 1st Qu.:-4299 1st Qu.: 5.00 1st Qu.:1 1st Qu.:1.0000
## Median :-3254 Median : 9.00 Median :1 Median :1.0000
## Mean :-2994 Mean :12.06 Mean :1 Mean :0.8199
## 3rd Qu.:-1720 3rd Qu.:15.00 3rd Qu.:1 3rd Qu.:1.0000
## Max. : 0 Max. :91.00 Max. :1 Max. :1.0000
## NA's :202929
## FLAG_WORK_PHONE FLAG_CONT_MOBILE FLAG_PHONE FLAG_EMAIL
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.:1.0000 1st Qu.:0.0000 1st Qu.:0.00000
## Median :0.0000 Median :1.0000 Median :0.0000 Median :0.00000
## Mean :0.1994 Mean :0.9981 Mean :0.2811 Mean :0.05672
## 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:0.00000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.00000
##
## OCCUPATION_TYPE CNT_FAM_MEMBERS REGION_RATING_CLIENT
## Length:307511 Min. : 1.000 Min. :1.000
## Class :character 1st Qu.: 2.000 1st Qu.:2.000
## Mode :character Median : 2.000 Median :2.000
## Mean : 2.153 Mean :2.052
## 3rd Qu.: 3.000 3rd Qu.:2.000
## Max. :20.000 Max. :3.000
## NA's :2
## REGION_RATING_CLIENT_W_CITY WEEKDAY_APPR_PROCESS_START HOUR_APPR_PROCESS_START
## Min. :1.000 Length:307511 Min. : 0.00
## 1st Qu.:2.000 Class :character 1st Qu.:10.00
## Median :2.000 Mode :character Median :12.00
## Mean :2.032 Mean :12.06
## 3rd Qu.:2.000 3rd Qu.:14.00
## Max. :3.000 Max. :23.00
##
## REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION
## Min. :0.00000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.00000 Median :0.00000
## Mean :0.01514 Mean :0.05077
## 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.00000 Max. :1.00000
##
## LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY
## Min. :0.00000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.0000
## Median :0.00000 Median :0.00000 Median :0.0000
## Mean :0.04066 Mean :0.07817 Mean :0.2305
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.0000
## Max. :1.00000 Max. :1.00000 Max. :1.0000
##
## LIVE_CITY_NOT_WORK_CITY ORGANIZATION_TYPE EXT_SOURCE_1 EXT_SOURCE_2
## Min. :0.0000 Length:307511 Min. :0.01 Min. :0.0000
## 1st Qu.:0.0000 Class :character 1st Qu.:0.33 1st Qu.:0.3925
## Median :0.0000 Mode :character Median :0.51 Median :0.5660
## Mean :0.1796 Mean :0.50 Mean :0.5144
## 3rd Qu.:0.0000 3rd Qu.:0.68 3rd Qu.:0.6636
## Max. :1.0000 Max. :0.96 Max. :0.8550
## NA's :173378 NA's :660
## EXT_SOURCE_3 APARTMENTS_AVG BASEMENTAREA_AVG YEARS_BEGINEXPLUATATION_AVG
## Min. :0.00 Min. :0.00 Min. :0.00 Min. :0.00
## 1st Qu.:0.37 1st Qu.:0.06 1st Qu.:0.04 1st Qu.:0.98
## Median :0.54 Median :0.09 Median :0.08 Median :0.98
## Mean :0.51 Mean :0.12 Mean :0.09 Mean :0.98
## 3rd Qu.:0.67 3rd Qu.:0.15 3rd Qu.:0.11 3rd Qu.:0.99
## Max. :0.90 Max. :1.00 Max. :1.00 Max. :1.00
## NA's :60965 NA's :156061 NA's :179943 NA's :150007
## YEARS_BUILD_AVG COMMONAREA_AVG ELEVATORS_AVG ENTRANCES_AVG
## Min. :0.00 Min. :0.00 Min. :0.00 Min. :0.00
## 1st Qu.:0.69 1st Qu.:0.01 1st Qu.:0.00 1st Qu.:0.07
## Median :0.76 Median :0.02 Median :0.00 Median :0.14
## Mean :0.75 Mean :0.04 Mean :0.08 Mean :0.15
## 3rd Qu.:0.82 3rd Qu.:0.05 3rd Qu.:0.12 3rd Qu.:0.21
## Max. :1.00 Max. :1.00 Max. :1.00 Max. :1.00
## NA's :204488 NA's :214865 NA's :163891 NA's :154828
## FLOORSMAX_AVG FLOORSMIN_AVG LANDAREA_AVG LIVINGAPARTMENTS_AVG
## Min. :0.00 Min. :0.00 Min. :0.00 Min. :0.00
## 1st Qu.:0.17 1st Qu.:0.08 1st Qu.:0.02 1st Qu.:0.05
## Median :0.17 Median :0.21 Median :0.05 Median :0.08
## Mean :0.23 Mean :0.23 Mean :0.07 Mean :0.10
## 3rd Qu.:0.33 3rd Qu.:0.38 3rd Qu.:0.09 3rd Qu.:0.12
## Max. :1.00 Max. :1.00 Max. :1.00 Max. :1.00
## NA's :153020 NA's :208642 NA's :182590 NA's :210199
## LIVINGAREA_AVG NONLIVINGAPARTMENTS_AVG NONLIVINGAREA_AVG APARTMENTS_MODE
## Min. :0.00 Min. :0.00 Min. :0.00 Min. :0.00
## 1st Qu.:0.05 1st Qu.:0.00 1st Qu.:0.00 1st Qu.:0.05
## Median :0.07 Median :0.00 Median :0.00 Median :0.08
## Mean :0.11 Mean :0.01 Mean :0.03 Mean :0.11
## 3rd Qu.:0.13 3rd Qu.:0.00 3rd Qu.:0.03 3rd Qu.:0.14
## Max. :1.00 Max. :1.00 Max. :1.00 Max. :1.00
## NA's :154350 NA's :213514 NA's :169682 NA's :156061
## BASEMENTAREA_MODE YEARS_BEGINEXPLUATATION_MODE YEARS_BUILD_MODE
## Min. :0.00 Min. :0.00 Min. :0.00
## 1st Qu.:0.04 1st Qu.:0.98 1st Qu.:0.70
## Median :0.07 Median :0.98 Median :0.76
## Mean :0.09 Mean :0.98 Mean :0.76
## 3rd Qu.:0.11 3rd Qu.:0.99 3rd Qu.:0.82
## Max. :1.00 Max. :1.00 Max. :1.00
## NA's :179943 NA's :150007 NA's :204488
## COMMONAREA_MODE ELEVATORS_MODE ENTRANCES_MODE FLOORSMAX_MODE
## Min. :0.00 Min. :0.00 Min. :0.00 Min. :0.00
## 1st Qu.:0.01 1st Qu.:0.00 1st Qu.:0.07 1st Qu.:0.17
## Median :0.02 Median :0.00 Median :0.14 Median :0.17
## Mean :0.04 Mean :0.07 Mean :0.15 Mean :0.22
## 3rd Qu.:0.05 3rd Qu.:0.12 3rd Qu.:0.21 3rd Qu.:0.33
## Max. :1.00 Max. :1.00 Max. :1.00 Max. :1.00
## NA's :214865 NA's :163891 NA's :154828 NA's :153020
## FLOORSMIN_MODE LANDAREA_MODE LIVINGAPARTMENTS_MODE LIVINGAREA_MODE
## Min. :0.00 Min. :0.00 Min. :0.00 Min. :0.00
## 1st Qu.:0.08 1st Qu.:0.02 1st Qu.:0.05 1st Qu.:0.04
## Median :0.21 Median :0.05 Median :0.08 Median :0.07
## Mean :0.23 Mean :0.06 Mean :0.11 Mean :0.11
## 3rd Qu.:0.38 3rd Qu.:0.08 3rd Qu.:0.13 3rd Qu.:0.13
## Max. :1.00 Max. :1.00 Max. :1.00 Max. :1.00
## NA's :208642 NA's :182590 NA's :210199 NA's :154350
## NONLIVINGAPARTMENTS_MODE NONLIVINGAREA_MODE APARTMENTS_MEDI BASEMENTAREA_MEDI
## Min. :0.00 Min. :0.00 Min. :0.00 Min. :0.00
## 1st Qu.:0.00 1st Qu.:0.00 1st Qu.:0.06 1st Qu.:0.04
## Median :0.00 Median :0.00 Median :0.09 Median :0.08
## Mean :0.01 Mean :0.03 Mean :0.12 Mean :0.09
## 3rd Qu.:0.00 3rd Qu.:0.02 3rd Qu.:0.15 3rd Qu.:0.11
## Max. :1.00 Max. :1.00 Max. :1.00 Max. :1.00
## NA's :213514 NA's :169682 NA's :156061 NA's :179943
## YEARS_BEGINEXPLUATATION_MEDI YEARS_BUILD_MEDI COMMONAREA_MEDI
## Min. :0.00 Min. :0.00 Min. :0.00
## 1st Qu.:0.98 1st Qu.:0.69 1st Qu.:0.01
## Median :0.98 Median :0.76 Median :0.02
## Mean :0.98 Mean :0.76 Mean :0.04
## 3rd Qu.:0.99 3rd Qu.:0.83 3rd Qu.:0.05
## Max. :1.00 Max. :1.00 Max. :1.00
## NA's :150007 NA's :204488 NA's :214865
## ELEVATORS_MEDI ENTRANCES_MEDI FLOORSMAX_MEDI FLOORSMIN_MEDI
## Min. :0.00 Min. :0.00 Min. :0.00 Min. :0.00
## 1st Qu.:0.00 1st Qu.:0.07 1st Qu.:0.17 1st Qu.:0.08
## Median :0.00 Median :0.14 Median :0.17 Median :0.21
## Mean :0.08 Mean :0.15 Mean :0.23 Mean :0.23
## 3rd Qu.:0.12 3rd Qu.:0.21 3rd Qu.:0.33 3rd Qu.:0.38
## Max. :1.00 Max. :1.00 Max. :1.00 Max. :1.00
## NA's :163891 NA's :154828 NA's :153020 NA's :208642
## LANDAREA_MEDI LIVINGAPARTMENTS_MEDI LIVINGAREA_MEDI
## Min. :0.00 Min. :0.00 Min. :0.00
## 1st Qu.:0.02 1st Qu.:0.05 1st Qu.:0.05
## Median :0.05 Median :0.08 Median :0.07
## Mean :0.07 Mean :0.10 Mean :0.11
## 3rd Qu.:0.09 3rd Qu.:0.12 3rd Qu.:0.13
## Max. :1.00 Max. :1.00 Max. :1.00
## NA's :182590 NA's :210199 NA's :154350
## NONLIVINGAPARTMENTS_MEDI NONLIVINGAREA_MEDI FONDKAPREMONT_MODE
## Min. :0.00 Min. :0.00 Length:307511
## 1st Qu.:0.00 1st Qu.:0.00 Class :character
## Median :0.00 Median :0.00 Mode :character
## Mean :0.01 Mean :0.03
## 3rd Qu.:0.00 3rd Qu.:0.03
## Max. :1.00 Max. :1.00
## NA's :213514 NA's :169682
## HOUSETYPE_MODE TOTALAREA_MODE WALLSMATERIAL_MODE EMERGENCYSTATE_MODE
## Length:307511 Min. :0.00 Length:307511 Length:307511
## Class :character 1st Qu.:0.04 Class :character Class :character
## Mode :character Median :0.07 Mode :character Mode :character
## Mean :0.10
## 3rd Qu.:0.13
## Max. :1.00
## NA's :148431
## OBS_30_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE
## Min. : 0.000 Min. : 0.0000 Min. : 0.000
## 1st Qu.: 0.000 1st Qu.: 0.0000 1st Qu.: 0.000
## Median : 0.000 Median : 0.0000 Median : 0.000
## Mean : 1.422 Mean : 0.1434 Mean : 1.405
## 3rd Qu.: 2.000 3rd Qu.: 0.0000 3rd Qu.: 2.000
## Max. :348.000 Max. :34.0000 Max. :344.000
## NA's :1021 NA's :1021 NA's :1021
## DEF_60_CNT_SOCIAL_CIRCLE DAYS_LAST_PHONE_CHANGE FLAG_DOCUMENT_2
## Min. : 0.0 Min. :-4292.0 Min. :0.00e+00
## 1st Qu.: 0.0 1st Qu.:-1570.0 1st Qu.:0.00e+00
## Median : 0.0 Median : -757.0 Median :0.00e+00
## Mean : 0.1 Mean : -962.9 Mean :4.23e-05
## 3rd Qu.: 0.0 3rd Qu.: -274.0 3rd Qu.:0.00e+00
## Max. :24.0 Max. : 0.0 Max. :1.00e+00
## NA's :1021 NA's :1
## FLAG_DOCUMENT_3 FLAG_DOCUMENT_4 FLAG_DOCUMENT_5 FLAG_DOCUMENT_6
## Min. :0.00 Min. :0.00e+00 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.00 1st Qu.:0.00e+00 1st Qu.:0.00000 1st Qu.:0.00000
## Median :1.00 Median :0.00e+00 Median :0.00000 Median :0.00000
## Mean :0.71 Mean :8.13e-05 Mean :0.01511 Mean :0.08806
## 3rd Qu.:1.00 3rd Qu.:0.00e+00 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.00 Max. :1.00e+00 Max. :1.00000 Max. :1.00000
##
## FLAG_DOCUMENT_7 FLAG_DOCUMENT_8 FLAG_DOCUMENT_9 FLAG_DOCUMENT_10
## Min. :0.0000000 Min. :0.00000 Min. :0.000000 Min. :0.00e+00
## 1st Qu.:0.0000000 1st Qu.:0.00000 1st Qu.:0.000000 1st Qu.:0.00e+00
## Median :0.0000000 Median :0.00000 Median :0.000000 Median :0.00e+00
## Mean :0.0001919 Mean :0.08138 Mean :0.003896 Mean :2.28e-05
## 3rd Qu.:0.0000000 3rd Qu.:0.00000 3rd Qu.:0.000000 3rd Qu.:0.00e+00
## Max. :1.0000000 Max. :1.00000 Max. :1.000000 Max. :1.00e+00
##
## FLAG_DOCUMENT_11 FLAG_DOCUMENT_12 FLAG_DOCUMENT_13 FLAG_DOCUMENT_14
## Min. :0.000000 Min. :0.0e+00 Min. :0.000000 Min. :0.000000
## 1st Qu.:0.000000 1st Qu.:0.0e+00 1st Qu.:0.000000 1st Qu.:0.000000
## Median :0.000000 Median :0.0e+00 Median :0.000000 Median :0.000000
## Mean :0.003912 Mean :6.5e-06 Mean :0.003525 Mean :0.002936
## 3rd Qu.:0.000000 3rd Qu.:0.0e+00 3rd Qu.:0.000000 3rd Qu.:0.000000
## Max. :1.000000 Max. :1.0e+00 Max. :1.000000 Max. :1.000000
##
## FLAG_DOCUMENT_15 FLAG_DOCUMENT_16 FLAG_DOCUMENT_17 FLAG_DOCUMENT_18
## Min. :0.00000 Min. :0.000000 Min. :0.0000000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.000000 1st Qu.:0.0000000 1st Qu.:0.00000
## Median :0.00000 Median :0.000000 Median :0.0000000 Median :0.00000
## Mean :0.00121 Mean :0.009928 Mean :0.0002667 Mean :0.00813
## 3rd Qu.:0.00000 3rd Qu.:0.000000 3rd Qu.:0.0000000 3rd Qu.:0.00000
## Max. :1.00000 Max. :1.000000 Max. :1.0000000 Max. :1.00000
##
## FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21
## Min. :0.0000000 Min. :0.0000000 Min. :0.0000000
## 1st Qu.:0.0000000 1st Qu.:0.0000000 1st Qu.:0.0000000
## Median :0.0000000 Median :0.0000000 Median :0.0000000
## Mean :0.0005951 Mean :0.0005073 Mean :0.0003349
## 3rd Qu.:0.0000000 3rd Qu.:0.0000000 3rd Qu.:0.0000000
## Max. :1.0000000 Max. :1.0000000 Max. :1.0000000
##
## AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY
## Min. :0.00 Min. :0.00
## 1st Qu.:0.00 1st Qu.:0.00
## Median :0.00 Median :0.00
## Mean :0.01 Mean :0.01
## 3rd Qu.:0.00 3rd Qu.:0.00
## Max. :4.00 Max. :9.00
## NA's :41519 NA's :41519
## AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT
## Min. :0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.:0.00 1st Qu.: 0.00 1st Qu.: 0.00
## Median :0.00 Median : 0.00 Median : 0.00
## Mean :0.03 Mean : 0.27 Mean : 0.27
## 3rd Qu.:0.00 3rd Qu.: 0.00 3rd Qu.: 0.00
## Max. :8.00 Max. :27.00 Max. :261.00
## NA's :41519 NA's :41519 NA's :41519
## AMT_REQ_CREDIT_BUREAU_YEAR
## Min. : 0.0
## 1st Qu.: 0.0
## Median : 1.0
## Mean : 1.9
## 3rd Qu.: 3.0
## Max. :25.0
## NA's :41519
# Total number of missing values in the dataset
cat("The total number of missing values in the dataset is" , sum(is.na(OriginalDF)))## The total number of missing values in the dataset is 8388094
Total number of missing values in the dataset per column name
## SK_ID_CURR TARGET
## 0 0
## NAME_CONTRACT_TYPE CODE_GENDER
## 0 0
## FLAG_OWN_CAR FLAG_OWN_REALTY
## 0 0
## CNT_CHILDREN AMT_INCOME_TOTAL
## 0 0
## AMT_CREDIT AMT_ANNUITY
## 0 12
## AMT_GOODS_PRICE NAME_TYPE_SUITE
## 278 0
## NAME_INCOME_TYPE NAME_EDUCATION_TYPE
## 0 0
## NAME_FAMILY_STATUS NAME_HOUSING_TYPE
## 0 0
## REGION_POPULATION_RELATIVE DAYS_BIRTH
## 0 0
## DAYS_EMPLOYED DAYS_REGISTRATION
## 0 0
## DAYS_ID_PUBLISH OWN_CAR_AGE
## 0 202929
## FLAG_MOBIL FLAG_EMP_PHONE
## 0 0
## FLAG_WORK_PHONE FLAG_CONT_MOBILE
## 0 0
## FLAG_PHONE FLAG_EMAIL
## 0 0
## OCCUPATION_TYPE CNT_FAM_MEMBERS
## 0 2
## REGION_RATING_CLIENT REGION_RATING_CLIENT_W_CITY
## 0 0
## WEEKDAY_APPR_PROCESS_START HOUR_APPR_PROCESS_START
## 0 0
## REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION
## 0 0
## LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY
## 0 0
## REG_CITY_NOT_WORK_CITY LIVE_CITY_NOT_WORK_CITY
## 0 0
## ORGANIZATION_TYPE EXT_SOURCE_1
## 0 173378
## EXT_SOURCE_2 EXT_SOURCE_3
## 660 60965
## APARTMENTS_AVG BASEMENTAREA_AVG
## 156061 179943
## YEARS_BEGINEXPLUATATION_AVG YEARS_BUILD_AVG
## 150007 204488
## COMMONAREA_AVG ELEVATORS_AVG
## 214865 163891
## ENTRANCES_AVG FLOORSMAX_AVG
## 154828 153020
## FLOORSMIN_AVG LANDAREA_AVG
## 208642 182590
## LIVINGAPARTMENTS_AVG LIVINGAREA_AVG
## 210199 154350
## NONLIVINGAPARTMENTS_AVG NONLIVINGAREA_AVG
## 213514 169682
## APARTMENTS_MODE BASEMENTAREA_MODE
## 156061 179943
## YEARS_BEGINEXPLUATATION_MODE YEARS_BUILD_MODE
## 150007 204488
## COMMONAREA_MODE ELEVATORS_MODE
## 214865 163891
## ENTRANCES_MODE FLOORSMAX_MODE
## 154828 153020
## FLOORSMIN_MODE LANDAREA_MODE
## 208642 182590
## LIVINGAPARTMENTS_MODE LIVINGAREA_MODE
## 210199 154350
## NONLIVINGAPARTMENTS_MODE NONLIVINGAREA_MODE
## 213514 169682
## APARTMENTS_MEDI BASEMENTAREA_MEDI
## 156061 179943
## YEARS_BEGINEXPLUATATION_MEDI YEARS_BUILD_MEDI
## 150007 204488
## COMMONAREA_MEDI ELEVATORS_MEDI
## 214865 163891
## ENTRANCES_MEDI FLOORSMAX_MEDI
## 154828 153020
## FLOORSMIN_MEDI LANDAREA_MEDI
## 208642 182590
## LIVINGAPARTMENTS_MEDI LIVINGAREA_MEDI
## 210199 154350
## NONLIVINGAPARTMENTS_MEDI NONLIVINGAREA_MEDI
## 213514 169682
## FONDKAPREMONT_MODE HOUSETYPE_MODE
## 0 0
## TOTALAREA_MODE WALLSMATERIAL_MODE
## 148431 0
## EMERGENCYSTATE_MODE OBS_30_CNT_SOCIAL_CIRCLE
## 0 1021
## DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE
## 1021 1021
## DEF_60_CNT_SOCIAL_CIRCLE DAYS_LAST_PHONE_CHANGE
## 1021 1
## FLAG_DOCUMENT_2 FLAG_DOCUMENT_3
## 0 0
## FLAG_DOCUMENT_4 FLAG_DOCUMENT_5
## 0 0
## FLAG_DOCUMENT_6 FLAG_DOCUMENT_7
## 0 0
## FLAG_DOCUMENT_8 FLAG_DOCUMENT_9
## 0 0
## FLAG_DOCUMENT_10 FLAG_DOCUMENT_11
## 0 0
## FLAG_DOCUMENT_12 FLAG_DOCUMENT_13
## 0 0
## FLAG_DOCUMENT_14 FLAG_DOCUMENT_15
## 0 0
## FLAG_DOCUMENT_16 FLAG_DOCUMENT_17
## 0 0
## FLAG_DOCUMENT_18 FLAG_DOCUMENT_19
## 0 0
## FLAG_DOCUMENT_20 FLAG_DOCUMENT_21
## 0 0
## AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY
## 41519 41519
## AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON
## 41519 41519
## AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR
## 41519 41519
We have about 300,000 instances and we will remove columns which consists of more than 140,000 missing values. We will also remove rows that consist more than 50% missing value
miss <- c()
for(i in 1:ncol(OriginalDF)) {
if(length(which(is.na(OriginalDF[,i]))) > 140000) miss <- append(miss,i)
}
OriginalDF<- OriginalDF[,-miss]miss2 <- c()
for(i in 1:nrow(OriginalDF)) {
if(length(which(is.na(OriginalDF[i,]))) > 0.5*ncol(OriginalDF)) miss2 <- append(miss2,i)
}
OriginalDF<- OriginalDF[-miss,]Column flag-documents (index 52 to 71) are not useful to our analysis, therefore, we also drop these columns and check the number of rows & columns left in the dataset
## [1] 307466 57
Missing value in categorical columns - Replace missing with ‘Unknown’
OriginalDF<-OriginalDF %>% replace_with_na(replace = list(EMERGENCYSTATE_MODE =""))
OriginalDF$EMERGENCYSTATE_MODE= fct_explicit_na(OriginalDF$EMERGENCYSTATE_MODE, "Unknown")
OriginalDF<-OriginalDF %>% replace_with_na(replace = list(OCCUPATION_TYPE =""))
OriginalDF$OCCUPATION_TYPE= fct_explicit_na(OriginalDF$OCCUPATION_TYPE, "Unknown")
OriginalDF<-OriginalDF %>% replace_with_na(replace = list(FONDKAPREMONT_MODE =""))
OriginalDF$FONDKAPREMONT_MODE= fct_explicit_na(OriginalDF$FONDKAPREMONT_MODE, "Unknown")
OriginalDF<-OriginalDF %>% replace_with_na(replace = list(HOUSETYPE_MODE =""))
OriginalDF$HOUSETYPE_MODE= fct_explicit_na(OriginalDF$HOUSETYPE_MODE, "Unknown")
OriginalDF<-OriginalDF %>% replace_with_na(replace = list(WALLSMATERIAL_MODE =""))
OriginalDF$WALLSMATERIAL_MODE= fct_explicit_na(OriginalDF$WALLSMATERIAL_MODE, "Unknown")Missing value in numeric columns - Replace missing with ‘mean’
The variables for DAYS_BIRTH, DAYS_EMPLOYED, DAYS_REGISTRATION, DAYS_ID_PUBLISH, DAYS_ID_PUBLISH, DAYS_LAST_PHONE_CHANGE are in negative values. Therefore, we will transform the negative values to positive values.
We bin AMT_CREDIT variable to AMT_CREDIT_RANGE, AMT_INCOME_TOTAL variable to AMT_INCOME_RANGE and DAYS_BIRTH variable to AGE_RANGE.
bins = c(0,350000,700000,1000000000)
slots = c('Low','Medium','High')
###Creating bins for Credit amount
OriginalDF['AMT_CREDIT_RANGE']=cut(OriginalDF$AMT_CREDIT,breaks=c(0,350000,700000,1000000000),labels=slots)
bins = c(0,200000,400000,10000000000)
slots = c('Low','Medium','High')
# Creating bins for income amount
OriginalDF['AMT_INCOME_RANGE']=cut(OriginalDF$AMT_INCOME_TOTAL,breaks=c(0,200000,400000,10000000000),labels=slots)
bins = c(0,7300,10950,14600,18250,21900,25500)
slots = c('0-20','20-30','30-40','40-50','50-60','60-70')
OriginalDF['AGE_RANGE']=cut(OriginalDF$DAYS_BIRTH,breaks=c(0,7300,10950,14600,18250,21900,25500),labels=slots)After imputation, certain rows became identical hence need to be deduplicated.
##
## 0 1
## 0.91926262 0.08073738
The TARGET column has 8.07% of 1’s which means 8% customers have payment difficulties and 91.92% are having no payment difficulties. The dataset is inbalanced and we will need to take note of this during the modelling.
We will select these variables(CODE_GENDER, NAME_INCOME_TYPE, NAME_EDUCATION_TYPE, NAME_HOUSING_TYPE, OCCUPATION_TYPE, ORGANIZATION_TYPE, AMT_CREDITRANGE, AMT_INCOME_RANGE and AGE_RANGE) to perform our univariate analysis. ### (1) GENDER, AMOUNT OF CREDIT/LOAN, AMOUNT OF INCOME,& AGE
genderplot<-ggplot(data=OriginalDF,aes(x=CODE_GENDER,fill=CODE_GENDER)) +
geom_bar(stat="count") + ggtitle('Gender')+
theme(plot.title = element_text (hjust = 0.5)) + theme(legend.position="none") + xlab('') + ylab ('')
AMT_CREDITRANGEplot<-ggplot(data=OriginalDF,aes(x=factor(AMT_CREDIT_RANGE, level=c('Low','Medium','High')),fill=AMT_CREDIT_RANGE))+
geom_bar(stat="count") + ggtitle('Credit Amount (Range)') +
theme(plot.title = element_text (hjust = 0.5)) + theme(legend.position="none") + xlab('') + ylab ('')
AMT_INCOME_RANGEplot<-ggplot(data=OriginalDF,aes(x=factor(AMT_INCOME_RANGE, level=c('Low','Medium','High')),fill=AMT_INCOME_RANGE))+
geom_bar(stat="count") + ggtitle('Income Amount (Range)') +
theme(plot.title = element_text (hjust = 0.5)) + theme(legend.position="none") + xlab('') + ylab ('')
AGE_RANGEplot<-ggplot(data=OriginalDF,aes(x=AGE_RANGE,fill=AGE_RANGE))+
#scale_fill_manual(values = c("#F8766D","#A3A500","#00BF7D","#00B0F6","#E76BF3"))+
geom_bar(stat="count") + ggtitle('Age Range') +
theme(plot.title = element_text (hjust = 0.5)) + theme(legend.position="none") + xlab('') + ylab ('')
p1 <- ggarrange(genderplot, AMT_INCOME_RANGEplot,
ncol = 1, nrow = 2)
p2 <- ggarrange(AGE_RANGEplot, AMT_CREDITRANGEplot,
ncol = 1, nrow = 2)
ggarrange(p1, p2, ncol = 2, nrow = 1)
From the above barplots, it is shown that the highest number of customers in making loan payemnts are female customers, ranging from age 30-40 years old, having low income andlow credit amount while applying loan.
incometypeplot<-ggplot(data=OriginalDF,aes(x=NAME_INCOME_TYPE,fill=NAME_INCOME_TYPE)) +
geom_bar(stat="count") + ggtitle('Types of Income')+
theme(plot.title = element_text (hjust = 0.5)) + theme(legend.position="none") + xlab('') + ylab ('')
edutypeplot<-ggplot(data=OriginalDF,aes(x=NAME_EDUCATION_TYPE,fill=NAME_EDUCATION_TYPE)) +
geom_bar(stat="count") + ggtitle('Types of Education')+
theme(plot.title = element_text (hjust = 0.5)) + theme(legend.position="none") + xlab('') + ylab ('')
marriagetypeplot<-ggplot(data=OriginalDF,aes(x=NAME_FAMILY_STATUS,fill=NAME_FAMILY_STATUS)) +
geom_bar(stat="count") + ggtitle('Marital Status')+
theme(plot.title = element_text (hjust = 0.5)) + theme(legend.position="none") + xlab('') + ylab ('')
housetypeplot<-ggplot(data=OriginalDF,aes(x=NAME_HOUSING_TYPE,fill=NAME_HOUSING_TYPE)) +
geom_bar(stat="count") + ggtitle('Types of Housing')+
theme(plot.title = element_text (hjust = 0.5)) + theme(legend.position="none") + xlab('') + ylab ('')
worktypeplot<-ggplot(data=OriginalDF,aes(x=OCCUPATION_TYPE,fill=OCCUPATION_TYPE)) +
geom_bar(stat="count") + ggtitle('Types of Occupation')+
theme(plot.title = element_text (hjust = 0.5)) + theme(legend.position="none") + xlab('') + ylab ('')
p3 <- ggarrange(incometypeplot,edutypeplot,
ncol = 1, nrow = 2)
p4 <- ggarrange(marriagetypeplot,housetypeplot,
ncol = 1, nrow = 2)
ggarrange(p3, p4, ncol = 2, nrow = 1)
From the barplots, it is shown that the highest number of customers in making the loan payments are married employees, owning a house or an apartment with a secondary education. However, the occupation for these customers are labeled as unknown in our analysis.
par(mfrow = c(2, 1))
OriginalDF1 <- with(OriginalDF, table(TARGET, AGE_RANGE))
barplot(OriginalDF1, beside = TRUE, legend = TRUE,args.legend=list(x="topright",
inset=c(-0.05,0)),
col =c("Red","Green"),
xlab="AGE_RANGE",
ylab="Values",
main ="Target=0 : No Payment issue, Target =1 : Payment issue",
)
OriginalDF2 <- with(OriginalDF, table(TARGET,CODE_GENDER))
barplot(OriginalDF2, beside = TRUE, legend = TRUE,col =c("Red","Green"),args.legend=list(x="topright",
inset=c(-0.05,0)),
xlab="CODE_GENDER",
ylab="values",
main ="Target=0 : No Payment issue, Target =1 : Payment issue",
)
We can observe that most of the female customers have no issues in paying the loan and customers belonging to age group 30-40 are able to make payment on time.
par(mfrow = c(3, 1))
OriginalDF3 <- with(OriginalDF, table(TARGET, AMT_CREDIT_RANGE))
barplot(OriginalDF3, beside = TRUE, legend = TRUE, args.legend=list(x="topright",
inset=c(0,0)),
col =c("Red","Green"),
xlab="Amount of Credit(Loan)",
ylab="Values",
main ="Target=0 : No Payment issue,Target =1 : Payment issue",
)
OriginalDF4 <- with(OriginalDF, table(TARGET,NAME_INCOME_TYPE))
barplot(OriginalDF4, beside = TRUE, legend = TRUE, args.legend=list(x="topright",
inset=c(0,0)),
col =c("Red","Green"),
xlab="NAME_INCOME_TYPE",
ylab="values",
#horiz=T,
main ="Target=0 : No Payment issue, Target =1 : Payment issue",
)
OriginalDF5 <- with(OriginalDF, table(TARGET,AMT_INCOME_RANGE))
barplot(OriginalDF5, beside = TRUE, legend = TRUE, args.legend=list(x="topright",
inset=c(0,0)),
col =c("Red","Green"),
xlab="AMT_INCOME_RANGE",
ylab="values",
#horiz=T,
main ="Target=0 : No Payment issue, Target =1 : Payment issue",
)
Customers (Employees) with less credit and low income are most likely to make payment. Customers having medium and high credit can also be considered while lending the loan.
par(mfrow = c(3, 1))
OriginalDF6 <- with(OriginalDF, table(TARGET,NAME_EDUCATION_TYPE))
barplot(OriginalDF6, beside = TRUE, legend = TRUE,args.legend=list(x="topright",
inset=c(0,0)),
col =c("Red","Green"),
xlab="Types of Education",
ylab="values",
#horiz=T,
main ="Target=0 : No Payment issue, Target =1 : Payment issue",
)
OriginalDF7 <- with(OriginalDF, table(TARGET,NAME_FAMILY_STATUS))
barplot(OriginalDF7, beside = TRUE, legend = TRUE,args.legend=list(x="topright",
inset=c(0,0)),
col =c("Red","Green"),
xlab="Marital Status",
ylab="values",
#horiz=T,
main ="Target=0 : No Payment issue, Target =1 : Payment issue",
)
OriginalDF8 <- with(OriginalDF, table(TARGET,NAME_HOUSING_TYPE))
barplot(OriginalDF8, beside = TRUE, legend = TRUE,args.legend=list(x="topright",
inset=c(0,0)),
col =c("Red","Green"),
xlab="Types of Housing",
ylab="values",
#horiz=T,
main ="Target=0 : No Payment issue, Target =1 : Payment issue",
)
Married Customers with secondary education and staying in a house or a apartment are most likely to make payments when compared to customers with academic degree.
par(mfrow = c(3, 1))
p1 <- ggplot(data=OriginalDF, aes(x=AMT_CREDIT, group=TARGET, fill=TARGET)) +
geom_density(adjust=1.5) +
theme_ipsum()
p2 <- ggplot(data=OriginalDF, aes(x=AMT_ANNUITY, group=TARGET, fill=TARGET)) +
geom_density(adjust=1.5) +
theme_ipsum()
p3 <- ggplot(data=OriginalDF, aes(x=AMT_GOODS_PRICE, group=TARGET, fill=TARGET))+
geom_density(adjust=1.5)
grid.arrange(p1,p2,p3,ncol=1)
Most customers have paid their amount of credits and have good price on time. However, they are unable to pay the annuity fees on time.
We select the numerical data and convert it to matrix. After that, we use Spearman coefficent to calculate the correlation coefficient. Next ,we visualise the correlation matrix to check on the highly correlated variables.
data=select_if(OriginalDF,is.numeric)
data.cor=cor(data)
data.cor = cor(data, method = c("spearman"))
corrplot(data.cor)Based on the correlation matrix, we have selected few variables AMT_INCOME_TOTAL,AMT_CREDIT, AMT_GOODS_PRICE,AMT_INCOME_RANGE,CODE_GENDER, NAME_EDUCATION_TYPE FOR Exploratory data analysis.
attach(data)
par(mfrow = c(1, 2))
plot(AMT_INCOME_TOTAL, AMT_CREDIT, main="Amount of Income vs Amount of Credit(Range)",
xlab="AMT_INCOME_TOTAL ", ylab="AMT_CREDIT ", pch=19)
attach(data)
plot(AMT_GOODS_PRICE, AMT_CREDIT, main="Good Price vs Amount of Credit(Range)",
xlab="AMT_GOODS_PRICE ", ylab="AMT_CREDIT ", pch=19)
Those who have paid the loan amount on/within time are more likely to get higher credits than those who didnt pay/did late payments.People who have higher goods price and have made payments on time have higher credits than those with higher goods price but didnt pay loan.
OriginalDF9 <- with(OriginalDF, table(AMT_INCOME_RANGE,CODE_GENDER))
barplot(OriginalDF9, beside = TRUE, legend = TRUE,args.legend=list(x="topright",
inset=c(0,0)),
col =c("Red","Green","Blue"),
xlab="CODE_GENDER",
ylab="values",
#horiz=T,
main ="The relationship between income range and gender",
)
We can see that most of the customers are females with low income.
ggplot(OriginalDF, aes(x=NAME_EDUCATION_TYPE, y=AMT_CREDIT, fill=NAME_FAMILY_STATUS,
main ="The relationship between income range and gender")) +
geom_boxplot()
From the boxplot above:
(1) Some of highly educated, married person are having credits higher than those who have done lower secondary education.
(2) More number of outliers are seen in higher education.
After checking with the correlation matrix, we decide to drop insignificant variables such as REGION_RATING_CLIENT_W_CITY, SK_ID_CURR from the dataset and assign it to df.
We will split into training set and test set in a ratio of 70:30
Next, we scale the continuous variables for training set and test set.
continuous_column = c('AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE', 'REGION_POPULATION_RELATIVE', 'DAYS_REGISTRATION', 'CNT_FAM_MEMBERS', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'OBS_30_CNT_SOCIAL_CIRCLE', 'DEF_30_CNT_SOCIAL_CIRCLE', 'OBS_60_CNT_SOCIAL_CIRCLE', 'DEF_60_CNT_SOCIAL_CIRCLE', 'DAYS_LAST_PHONE_CHANGE','AMT_REQ_CREDIT_BUREAU_HOUR', 'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_WEEK', 'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_REQ_CREDIT_BUREAU_YEAR')
training_set[continuous_column] = scale(training_set[continuous_column])
test_set[continuous_column] = scale(test_set[continuous_column])Based on the results that we have from EDA, the data is unbalanced where the majority of the observation were 0: no payment issue. We will balance the data using oversampling method.
##
## 0 1
## 197849 17377
# balance the data
bal_training_set=ovun.sample(TARGET~., data = training_set, method = "over", N = 197849*2)$data
table(bal_training_set$TARGET)##
## 0 1
## 197849 197849
Now, we have the perfect balanced data for training and ready for modelling.
Next, we create a performance evaluation function to check on the accuracy, ROC area, Kappa, precision, recall and F1-measure. As the dataset is unbalanced, and it is equally important to predict if one has the ability to repay the loan, we will use macro-average for precision, recall and f1-measure.
get_cm <- function(test_set, y_pred){
cm <- as.matrix(table(Actual = test_set$TARGET, Predicted = y_pred))
}
simple_eval <- function (test_set, y_pred) {
CM = get_cm(test_set, y_pred)
n = sum(CM)# number of instance
nc = nrow(CM)# number of class, should be 2 in our case
rowsums = apply(CM, 1, sum) # number of instances per class
colsums = apply(CM, 2, sum) # number of predictions per class
p = rowsums / n # distribution of instances over the actual classes
q = colsums / n # distribution of instances over the predicted classes
diag = diag(CM) # get TP and TN using identity matrix
accuracy = sum(diag) / n # overall classification accuracy
precision = diag / colsums # fraction of correct predictions for a certain class
recall = diag / rowsums # fraction of instances of a class that were correctly predicted
f1 = 2 * precision * recall / (precision + recall) # harmonic mean (or a weighted average) of precision and recall
perf_df = data.frame(precision, recall, f1)
AUC = auc(test_set$TARGET, factor(y_pred, ordered = TRUE)) # Area under ROC curve
expAccuracy = sum(p*q)
kappa = (accuracy - expAccuracy) / (1 - expAccuracy)
macroPrecision = mean(precision)
macroRecall = mean(recall)
macroF1 = mean(f1)
list(Accuracy=accuracy, ROC_area=AUC, Kappa =kappa, Precision=macroPrecision, Recall=macroRecall, Fscore=macroF1, performance_df=perf_df, confusion_matrix=CM)
}We will develop a few model to predict if a person has the ability to service the loan. Since the output is binary (yes/no), classification algorithm will be used, namely Decision Tree, Naive Bayes and Logistic Regression.
# fit the decision tree classification
model = rpart(TARGET ~.,
data = bal_training_set, method = "class")
# plot decision tree
rpart.plot(model, extra = 106)
Decision tree model suggests uses the normalized score from external data source (EXT_SOURCE_2 & EXT_SOURCE_3) to determine if a person has the ability to service the loan.
# make prediction
DT.y_pred = predict (model, test_set, type = 'class')
# Confusion Matrix
DT.eval <- simple_eval(test_set, DT.y_pred)
DT.eval$confusion_matrix## Predicted
## Actual 0 1
## 0 67441 17352
## 1 4225 3222
## precision recall f1
## 0 0.94 0.80 0.86
## 1 0.16 0.43 0.23
# Model Evaluation
cat('Accuracy: ', round(DT.eval$Accuracy,2),
'\nMacro-precision: ', round(DT.eval$Precision,2),
'\nMacro-recall: ', round(DT.eval$Recall,2),
'\nMacro-F1: ', round(DT.eval$Fscore,2),
'\nKappa: ', round(DT.eval$Kappa,2),
'\nROC area: ', round(DT.eval$ROC_area,2)
)## Accuracy: 0.77
## Macro-precision: 0.55
## Macro-recall: 0.61
## Macro-F1: 0.55
## Kappa: 0.13
## ROC area: 0.61
For our second model, we will use Naive Bayes by fitting into the training dataset
# Fitting Naive Bayes Model to training dataset
model_NB <- naiveBayes(TARGET ~ ., data = bal_training_set)# Predicting on test data'
NB.y_pred <- predict(model_NB, newdata = test_set)
# Confusion Matrix
NB.eval <- simple_eval(test_set, NB.y_pred)
NB.eval$confusion_matrix## Predicted
## Actual 0 1
## 0 54056 30737
## 1 3113 4334
## precision recall f1
## 0 0.95 0.64 0.76
## 1 0.12 0.58 0.20
# Model Evaluation
cat('Accuracy: ', round(NB.eval$Accuracy,2),
'\nMacro-precision: ', round(NB.eval$Precision,2),
'\nMacro-recall: ', round(NB.eval$Recall,2),
'\nMacro-F1: ', round(NB.eval$Fscore,2),
'\nKappa: ', round(NB.eval$Kappa,2),
'\nROC area: ', round(NB.eval$ROC_area,2)
)## Accuracy: 0.63
## Macro-precision: 0.53
## Macro-recall: 0.61
## Macro-F1: 0.48
## Kappa: 0.08
## ROC area: 0.61
For our third model, we will use Logistic Regression by fitting into the training dataset.
# Predict test data based on model
LR.y_pred <- predict(model_glm,
newdata = test_set, type = "response")
# Changing probabilities
LR.y_pred <- ifelse(LR.y_pred >0.5, 1, 0)
# Confusion matrix
LR.eval <- simple_eval(test_set, LR.y_pred)
LR.eval$confusion_matrix## Predicted
## Actual 0 1
## 0 58338 26455
## 1 2434 5013
## precision recall f1
## 0 0.96 0.69 0.80
## 1 0.16 0.67 0.26
# Model Evaluation
cat('Accuracy: ', round(LR.eval$Accuracy,2),
'\nMacro-precision: ', round(LR.eval$Precision,2),
'\nMacro-recall: ', round(LR.eval$Recall,2),
'\nMacro-F1: ', round(LR.eval$Fscore,2),
'\nKappa: ', round(LR.eval$Kappa,2),
'\nROC area: ', round(LR.eval$ROC_area,2)
)## Accuracy: 0.69
## Macro-precision: 0.56
## Macro-recall: 0.68
## Macro-F1: 0.53
## Kappa: 0.15
## ROC area: 0.68
compare <- data.frame(Method=c('Decision Tree', 'Naive Bayes', 'Logistic Regression'), Accuracy = NA, Precision = NA, Recall = NA, FScore = NA, 'ROC' = NA)
compare$Accuracy <- c(round(DT.eval$Accuracy,2),round(NB.eval$Accuracy,2),round(LR.eval$Accuracy,2))
compare$Precision <- c(round(DT.eval$Precision,2),round(NB.eval$Precision,2),round(LR.eval$Precision,2))
compare$Recall <- c(round(DT.eval$Recall,2),round(NB.eval$Recall,2),round(LR.eval$Recall,2))
compare$FScore <- c(round(DT.eval$Fscore,2),round(NB.eval$Fscore,2),round(LR.eval$Fscore,2))
compare$ROC <- c(round(DT.eval$ROC_area,2),round(NB.eval$ROC_area,2),round(LR.eval$ROC_area,2))
kable_styling(kable(compare),c("striped","bordered"), full_width = F)| Method | Accuracy | Precision | Recall | FScore | ROC |
|---|---|---|---|---|---|
| Decision Tree | 0.77 | 0.55 | 0.61 | 0.55 | 0.61 |
| Naive Bayes | 0.63 | 0.53 | 0.61 | 0.48 | 0.61 |
| Logistic Regression | 0.69 | 0.56 | 0.68 | 0.53 | 0.68 |
Decision Tree (DT) has the highest accuracy; however, it is worth take note that this model works well in predicting non-defaulters, but not defaulters. When we look at the ROC, logistic regression performs better than DT as more features is considered in predicting defaulters, rather than just using normalised external rating score.
get_roc_curve <- function(test_set, DT.y_pred, NB.y_pred, LR.y_pred){
par(pty = "s")
roc(test_set$TARGET, factor(DT.y_pred, ordered = TRUE),plot = TRUE, print.auc = TRUE, col="#377eb8", lwd = 2, main="ROC Curve Comparison")
plot.roc(test_set$TARGET, factor(NB.y_pred, ordered = TRUE), print.auc = TRUE, add=TRUE, print.auc.y = 0.45, col="#4daf4a", lwd = 2)
plot.roc(test_set$TARGET, factor(LR.y_pred, ordered = TRUE), print.auc = TRUE, add=TRUE, print.auc.y = 0.40, col="#FF0000", lwd = 2)
legend("bottomright", legend = c("Decision Tree", "Naive Bayes", "Logistic Regression"), col=c("#377eb8", "#4daf4a", "#FF0000"), lwd = 2)
}
get_roc_curve(test_set, DT.y_pred, NB.y_pred, LR.y_pred)By considering ROC and other metrics, we find logistic regression is the best model in predicting if a person has the ability to repay the loan.
We will split into training set and test set in a ratio of 70:30.
Next, we scale the continuous variables for training set and test set.
continuous_column = c('AMT_INCOME_TOTAL', 'AMT_ANNUITY', 'AMT_GOODS_PRICE', 'REGION_POPULATION_RELATIVE', 'DAYS_REGISTRATION', 'CNT_FAM_MEMBERS', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'OBS_30_CNT_SOCIAL_CIRCLE', 'DEF_30_CNT_SOCIAL_CIRCLE', 'OBS_60_CNT_SOCIAL_CIRCLE', 'DEF_60_CNT_SOCIAL_CIRCLE', 'DAYS_LAST_PHONE_CHANGE','AMT_REQ_CREDIT_BUREAU_HOUR', 'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_WEEK', 'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_REQ_CREDIT_BUREAU_YEAR')
training_set[continuous_column] = scale(training_set[continuous_column])
test_set[continuous_column] = scale(test_set[continuous_column])
x_train = data.matrix(select (training_set, -c('AMT_CREDIT'))) #select (training_set, -c('AMT_CREDIT'))
y_train = training_set$AMT_CREDIT
x_test = data.matrix(select (test_set, -c('AMT_CREDIT'))) #select (test_set, -c('AMT_CREDIT'))
y_test = test_set$AMT_CREDITFor regression, we will use Mean Square Error (MSE), Root Mean Square Error(RMSE), Mean Average Error(MAE) and R square to evaluate the performance of regression models.
regr_evaluation <- function(testDataset, predictionResult){
MSE <- mse(testDataset$AMT_CREDIT, predictionResult)
RMSE <- rmse(testDataset$AMT_CREDIT, predictionResult)
MAE <- mae(testDataset$AMT_CREDIT, predictionResult)
error <- testDataset$AMT_CREDIT - predictionResult
R2 <- 1-sum(error^2)/sum((testDataset$AMT_CREDIT- mean(testDataset$AMT_CREDIT))^2)
AD_R2 <- 1-(MSE/var(testDataset$AMT_CREDIT))
list(MSE = MSE, RMSE = RMSE, MAE = MAE, R_square = R2, Adjusted_R_Square =AD_R2)
}
show_graph <- function(testDataset, predictionResult){
plot(x=predictionResult, y= testDataset$AMT_CREDIT,
xlab='Predicted Values',
ylab='Actual Values',
main='Predicted vs. Actual Values')
abline(a=0, b=1)
}We will develop a few models to predict the loan amount that one can take and able to service. Regression algorithm will be used since the output is continuous variables, namely Linear Regression, Ridge Regression and Decision Tree Regression.
# make prediction
LR.y_pred = predict(lr, test_set)
# Model Evaluation
LR.eval <- regr_evaluation(test_set,LR.y_pred)
cat('MSE: ', round(LR.eval$MSE,2),
'\nRMSE: ', round(LR.eval$RMSE,2),
'\nMAE: ', round(LR.eval$MAE,2),
'\nR_square: ', round(LR.eval$R_square,2),
'\nAdjusted_R_Square: ', round(LR.eval$Adjusted_R_Square,2)
)## MSE: 3851730025
## RMSE: 62062.31
## MAE: 45469.49
## R_square: 0.98
## Adjusted_R_Square: 0.98
# fit the rdige regression
lr <- lm(AMT_CREDIT ~ ., data = training_set)
lambdas <- 10^seq(2, -3, by = -.1)
ridge_reg = glmnet(x_train, y_train, alpha = 0, lambda = lambdas)
cv_ridge <- cv.glmnet(x_train, y_train, alpha = 0, lambda = lambdas)
optimal_lambda <- cv_ridge$lambda.min
optimal_lambda ## [1] 0.001
best_fit <- cv_ridge$glmnet.fit
best_ridge <- glmnet(x_train, y_train, alpha = 0, lambda = optimal_lambda)# make prediction
RR.y_pred = predict(best_ridge, s = optimal_lambda, newx = x_test)
# Model Evaluation
RR.eval <- regr_evaluation(test_set,RR.y_pred)
cat('MSE: ', round(RR.eval$MSE,2),
'\nRMSE: ', round(RR.eval$RMSE,2),
'\nMAE: ', round(RR.eval$MAE,2),
'\nR_square: ', round(RR.eval$R_square,2),
'\nAdjusted_R_Square: ', round(RR.eval$Adjusted_R_Square,2)
)## MSE: 3886222698
## RMSE: 62339.58
## MAE: 45662.85
## R_square: 0.98
## Adjusted_R_Square: 0.98
# fit the decision tree
DT_regression <- rpart(AMT_CREDIT ~ .,
method = "anova", data = training_set )
rpart.plot(DT_regression, uniform = TRUE,
main = "Amount Credit Decision Tree using Regression")# make prediction
DTR.y_pred = predict(DT_regression, test_set, method = "anova")
# Model Evaluation
DTR.eval <- regr_evaluation(test_set,DTR.y_pred)
cat('MSE: ', round(DTR.eval$MSE,2),
'\nRMSE: ', round(DTR.eval$RMSE,2),
'\nMAE: ', round(DTR.eval$MAE,2),
'\nR_square: ', round(DTR.eval$R_square,2),
'\nAdjusted_R_Square: ', round(DTR.eval$Adjusted_R_Square,2)
)## MSE: 10256317365
## RMSE: 101273.5
## MAE: 75873.9
## R_square: 0.94
## Adjusted_R_Square: 0.94
compare2 <- data.frame(Method=c('Linear Regression', 'Ridge Regression', 'Decision Tree Regression'), MSE=NA, RMSE = NA, MAE=NA, R_Square = NA, Adjusted_R_Square=NA)
compare2$MSE <- c(round(LR.eval$MSE,2),round(RR.eval$MSE,2),round(DTR.eval$MSE,2))
compare2$RMSE <- c(round(LR.eval$RMSE,2),round(RR.eval$RMSE,2),round(DTR.eval$RMSE,2))
compare2$MAE <- c(round(LR.eval$MAE,2),round(RR.eval$MAE,2),round(DTR.eval$MAE,2))
compare2$R_Square <- c(round(LR.eval$R_square,2),round(RR.eval$R_square,2),round(DTR.eval$R_square,2))
compare2$Adjusted_R_Square <- c(round(LR.eval$Adjusted_R_Square,2),round(RR.eval$Adjusted_R_Square,2),round(DTR.eval$Adjusted_R_Square,2))
kable_styling(kable(compare2),c("striped","bordered"), full_width = F)| Method | MSE | RMSE | MAE | R_Square | Adjusted_R_Square |
|---|---|---|---|---|---|
| Linear Regression | 3851730025 | 62062.31 | 45469.49 | 0.98 | 0.98 |
| Ridge Regression | 3886222698 | 62339.58 | 45662.85 | 0.98 | 0.98 |
| Decision Tree Regression | 10256317365 | 101273.48 | 75873.90 | 0.94 | 0.94 |
The target variable loan amount (AMT_CREDIT) range from 45000 to 4.05 million. The loan amount is big and hence explain why the RMSE is big but we still achieve high R square. Based on the graph, the predicted value is quite close to the actual value for both linear regression and ridge regression. We can conclude that both Linear regression and Ridge Regression are suitable in predicting the loan amount.
In the future studies, the researchers can explore more machine learning algorithms, especially the ensemble learners and deep learning models.