| TOPIC: Credit Risk Analysis |
| NAME : Zeba Khan |
| EMAIL: zeba.khan.stats@gmail.com |
| COLLEGE / COMPANY: Lady Shri Ram College for Women, University of Delhi |
INTRODUCTION:
Credit Risk assessment is a crucial issue faced by Banks nowadays which helps them to evaluate if a loan applicant can be a defaulter at a later stage so that they can go ahead and grant the loan or not. This helps the banks to minimize the possible losses and can increase the volume of credits.
When a bank receives a loan application, based on the applicant’s profile the bank has to make a decision regarding whether to go ahead with the loan approval or not. Two types of risks are associated with the bank’s decision -
OBJECTIVE OF ANALYSIS: Minimization of risk and maximization of profit on behalf of the bank.
To minimize loss from the bank’s perspective, the bank needs a decision rule regarding who to give approval of the loan and who not to. An applicant’s demographic and socio-economic profiles are considered by loan managers before a decision is taken regarding his/her loan application.
A predictive model developed on this data is expected to provide a bank manager guidance for making a decision whether to approve a loan to a prospective applicant based on his/her profiles.
DATA DESCRIPTION: The dataset contains 1000 entries where each entry represents a person who takes credit by a bank. Each person is classified as a good or a bad credit risk according to the set of attributes(Here, 10 variables which are listed below).
The categorical variables in the dataset are as follows: 1. Sex (male, female) 2. Job (0-unskilled and non-resident, 1-unskilled and resident, 2-skilled, 3-highly skilled) 3. Housing (own, rent or free) 4. Savings accounts (little, moderate, quite rich and rich) 5. Purpose (car, furniture/equipment, radio/TV, domestic appliances, repairs, education, business, vacation/others) 6. Risk (value target-Good or Bad Risk)
Numeric variables in the dataset: 1. Age 2. Checking Account (numeric, in DM-Deutsch Mark) 3. Credit Amount (Numeric, in DM) 4. Duration (numeric, in month)
MODEL ANALYSIS:
#Reading the dataset into R
riskdat <- read.csv(paste("german_credit_data (2).csv", sep=""))
View(riskdat)
#Loading the libraries
library(Hmisc)
## Warning: package 'Hmisc' was built under R version 3.4.3
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 3.4.3
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
##
## format.pval, units
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.4.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:Hmisc':
##
## src, summarize
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
#Checking the Data
head(riskdat)
## X Age Sex Job Housing Saving.accounts Checking.account Credit.amount
## 1 0 67 male 2 own <NA> little 1169
## 2 1 22 female 2 own little moderate 5951
## 3 2 49 male 1 own little <NA> 2096
## 4 3 45 male 2 free little little 7882
## 5 4 53 male 2 free little little 4870
## 6 5 35 male 1 free <NA> <NA> 9055
## Duration Purpose Risk
## 1 6 radio/TV good
## 2 48 radio/TV bad
## 3 12 education good
## 4 42 furniture/equipment good
## 5 24 car bad
## 6 36 education good
tail(riskdat)
## X Age Sex Job Housing Saving.accounts Checking.account
## 995 994 50 male 2 own <NA> <NA>
## 996 995 31 female 1 own little <NA>
## 997 996 40 male 3 own little little
## 998 997 38 male 2 own little <NA>
## 999 998 23 male 2 free little little
## 1000 999 27 male 2 own moderate moderate
## Credit.amount Duration Purpose Risk
## 995 2390 12 car good
## 996 1736 12 furniture/equipment good
## 997 3857 30 car good
## 998 804 12 radio/TV good
## 999 1845 45 radio/TV bad
## 1000 4576 45 car good
str(riskdat)
## 'data.frame': 1000 obs. of 11 variables:
## $ X : int 0 1 2 3 4 5 6 7 8 9 ...
## $ Age : int 67 22 49 45 53 35 53 35 61 28 ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 2 2 2 2 2 2 2 2 ...
## $ Job : int 2 2 1 2 2 1 2 3 1 3 ...
## $ Housing : Factor w/ 3 levels "free","own","rent": 2 2 2 1 1 1 2 3 2 2 ...
## $ Saving.accounts : Factor w/ 4 levels "little","moderate",..: NA 1 1 1 1 NA 3 1 4 1 ...
## $ Checking.account: Factor w/ 3 levels "little","moderate",..: 1 2 NA 1 1 NA NA 2 NA 2 ...
## $ Credit.amount : int 1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
## $ Duration : int 6 48 12 42 24 36 24 36 12 30 ...
## $ Purpose : Factor w/ 8 levels "business","car",..: 6 6 4 5 2 4 5 2 6 2 ...
## $ Risk : Factor w/ 2 levels "bad","good": 2 1 2 2 1 2 2 2 2 1 ...
summary(riskdat)
## X Age Sex Job Housing
## Min. : 0.0 Min. :19.00 female:310 Min. :0.000 free:108
## 1st Qu.:249.8 1st Qu.:27.00 male :690 1st Qu.:2.000 own :713
## Median :499.5 Median :33.00 Median :2.000 rent:179
## Mean :499.5 Mean :35.55 Mean :1.904
## 3rd Qu.:749.2 3rd Qu.:42.00 3rd Qu.:2.000
## Max. :999.0 Max. :75.00 Max. :3.000
##
## Saving.accounts Checking.account Credit.amount Duration
## little :603 little :274 Min. : 250 Min. : 4.0
## moderate :103 moderate:269 1st Qu.: 1366 1st Qu.:12.0
## quite rich: 63 rich : 63 Median : 2320 Median :18.0
## rich : 48 NA's :394 Mean : 3271 Mean :20.9
## NA's :183 3rd Qu.: 3972 3rd Qu.:24.0
## Max. :18424 Max. :72.0
##
## Purpose Risk
## car :337 bad :300
## radio/TV :280 good:700
## furniture/equipment:181
## business : 97
## education : 59
## repairs : 22
## (Other) : 24
describe(riskdat)
## riskdat
##
## 11 Variables 1000 Observations
## ---------------------------------------------------------------------------
## X
## n missing distinct Info Mean Gmd .05 .10
## 1000 0 1000 1 499.5 333.7 49.95 99.90
## .25 .50 .75 .90 .95
## 249.75 499.50 749.25 899.10 949.05
##
## lowest : 0 1 2 3 4, highest: 995 996 997 998 999
## ---------------------------------------------------------------------------
## Age
## n missing distinct Info Mean Gmd .05 .10
## 1000 0 53 0.999 35.55 12.41 22 23
## .25 .50 .75 .90 .95
## 27 33 42 52 60
##
## lowest : 19 20 21 22 23, highest: 67 68 70 74 75
## ---------------------------------------------------------------------------
## Sex
## n missing distinct
## 1000 0 2
##
## Value female male
## Frequency 310 690
## Proportion 0.31 0.69
## ---------------------------------------------------------------------------
## Job
## n missing distinct Info Mean Gmd
## 1000 0 4 0.739 1.904 0.6413
##
## Value 0 1 2 3
## Frequency 22 200 630 148
## Proportion 0.022 0.200 0.630 0.148
## ---------------------------------------------------------------------------
## Housing
## n missing distinct
## 1000 0 3
##
## Value free own rent
## Frequency 108 713 179
## Proportion 0.108 0.713 0.179
## ---------------------------------------------------------------------------
## Saving.accounts
## n missing distinct
## 817 183 4
##
## Value little moderate quite rich rich
## Frequency 603 103 63 48
## Proportion 0.738 0.126 0.077 0.059
## ---------------------------------------------------------------------------
## Checking.account
## n missing distinct
## 606 394 3
##
## Value little moderate rich
## Frequency 274 269 63
## Proportion 0.452 0.444 0.104
## ---------------------------------------------------------------------------
## Credit.amount
## n missing distinct Info Mean Gmd .05 .10
## 1000 0 921 1 3271 2773 709 932
## .25 .50 .75 .90 .95
## 1366 2320 3972 7179 9163
##
## lowest : 250 276 338 339 343, highest: 15653 15672 15857 15945 18424
## ---------------------------------------------------------------------------
## Duration
## n missing distinct Info Mean Gmd .05 .10
## 1000 0 33 0.985 20.9 12.98 6 9
## .25 .50 .75 .90 .95
## 12 18 24 36 48
##
## lowest : 4 5 6 7 8, highest: 47 48 54 60 72
## ---------------------------------------------------------------------------
## Purpose
## n missing distinct
## 1000 0 8
##
## business (97, 0.097), car (337, 0.337), domestic appliances (12, 0.012),
## education (59, 0.059), furniture/equipment (181, 0.181), radio/TV (280,
## 0.280), repairs (22, 0.022), vacation/others (12, 0.012)
## ---------------------------------------------------------------------------
## Risk
## n missing distinct
## 1000 0 2
##
## Value bad good
## Frequency 300 700
## Proportion 0.3 0.7
## ---------------------------------------------------------------------------
#Making some changes to data
colnames(riskdat)[1] <- "index"
#Some Explorations
hist(riskdat$Credit.amount, main = "Histogram of Credit Amount", xlab = "Credit Amount", ylab = "Frequency", col = "green", border = "black" )
ggplot(riskdat, aes(Sex) ) + geom_bar(aes(fill = as.factor(riskdat$Sex))) +
scale_fill_discrete(name="Sex",
labels=c( "Female","Male")) +
theme(axis.text.x=element_blank(),
axis.ticks.x=element_blank()) +
labs(x= "Sex",y= "Frequency" , title = "Plot of Sex")
ggplot(riskdat, aes(Job) ) + geom_bar(aes(fill = as.factor(riskdat$Job))) +
scale_fill_discrete(name="Job Type",
labels=c( "Unskilled and Non-Resident","Unskilled and Resident", "Skilled", "Highly Skilled")) +
theme(axis.text.x=element_blank(),
axis.ticks.x=element_blank()) +
labs(x= "Level of Job",y= "Frequency" , title = "Plot of Job")
ggplot(riskdat, aes(Housing) ) + geom_bar(aes(fill = as.factor(riskdat$Housing))) +
scale_fill_discrete(name="Housing",
labels=c( "Free","Own", "Rent")) +
theme(axis.text.x=element_blank(),
axis.ticks.x=element_blank()) +
labs(x= "Housing",y= "Frequency" , title = "Plot of Housing")
ggplot(riskdat, aes(Saving.accounts) ) + geom_bar(aes(fill = as.factor(riskdat$Saving.accounts))) +
scale_fill_discrete(name="Saving Accounts",
labels=c( "Little","Moderate", "Quite Rich", "Rich", "NA")) +
labs(x= "Saving Accounts",y= "Frequency" , title = "Plot of Saving Accounts")
ggplot(riskdat, aes(Checking.account) ) + geom_bar(aes(fill = as.factor(riskdat$Checking.account))) +
scale_fill_discrete(name="Checking Account",
labels=c( "Little","Moderate", "Rich")) +
theme(axis.text.x=element_blank(),
axis.ticks.x=element_blank()) +
labs(x= "Checking Account",y= "Frequency" , title = "Plot of Checking Account")
ggplot(riskdat, aes(Duration)) + geom_histogram(binwidth=4, colour="black", fill="green") +
labs(x= "Duration in Months",y= "Frequency" , title = "Plot of Duration")
ggplot(riskdat, aes(Purpose) ) + geom_bar(aes(fill = as.factor(riskdat$Purpose))) +
scale_fill_discrete(name="Purpose of Loan",labels=c( "Business","Car", "Domestic Appliances","Education","Furniture/Equipment","Radio/TV","Repairs","Vacation/Others")) +
theme(axis.text.x=element_blank(),
axis.ticks.x=element_blank()) +
labs(x= "Purpose of Loan",y= "Frequency" , title = "Plot of Loan Purpose")
#Some important inferences
riskdat$age_gp<-c(0)
riskdat$age_gp<-findInterval(riskdat$Age,c(18,25,35,60,120))
boxplot(riskdat$Credit.amount~riskdat$age_gp,horizontal=TRUE,ylab="Age Group",xlab="credit amount",las=1,main="Credit amount v/s Age group",col=c("red","green","pink","yellow"))
histogram(Credit.amount~Housing|Risk,data=riskdat)
#It can be seen that there is high correlation between good risk and people who own houses.
boxplot(Credit.amount~Housing+Risk,data=riskdat,horizontal=TRUE,xlab="Distribution of Housing by credit amount")
#it can be seen that the highest values come from category "free".
boxplot(Credit.amount~Sex+Risk,data=riskdat,horizontal=TRUE,xlab="Distribution of credit amount by sex")
histogram(Credit.amount~Sex|Risk,data=riskdat)
ggplot(riskdat, aes(Risk) ) + geom_bar(aes(fill = as.factor(riskdat$Risk))) +
scale_fill_discrete(name="Risk",
labels=c( "good", "bad")) +
theme(axis.text.x=element_blank(),
axis.ticks.x=element_blank()) + labs(x= "Risk",y= "Frequency" , title = "Plot of Risk")
histogram(Saving.accounts~Job|Risk,data=riskdat)
mytab1<-xtabs(~Risk+Job,data=riskdat)
mytab1
## Job
## Risk 0 1 2 3
## bad 7 56 186 51
## good 15 144 444 97
histogram(Credit.amount~Purpose|Risk,data=riskdat)
boxplot(Credit.amount~Purpose+Risk,data=riskdat,horizontal=TRUE,xlab="Distribution of Purpose by credit amount")
mytab<-xtabs(~Risk+Purpose,data=riskdat)
mytab
## Purpose
## Risk business car domestic appliances education furniture/equipment
## bad 34 106 4 23 58
## good 63 231 8 36 123
## Purpose
## Risk radio/TV repairs vacation/others
## bad 62 8 5
## good 218 14 7
histogram(Risk~Purpose,data=riskdat)
boxplot(Credit.amount~Purpose+Risk,data=riskdat,horizontal=TRUE,xlab="Distribution of Housing by credit amount",col=c("red","yellow"))
boxplot(Duration~Risk+Credit.amount,data=riskdat,horizontal=TRUE,xlab="Distribution of Housing by credit amount")
#It can be clearly seen that the highest duration have the highest amount The highest density is between [12~18~24] months.
#Some important contingency tables
mytable_ch<-with(riskdat,table(Checking.account))
mytable_ch
## Checking.account
## little moderate rich
## 274 269 63
mytable_js<-xtabs(~Job+Sex,data=riskdat)
mytable_js
## Sex
## Job female male
## 0 12 10
## 1 64 136
## 2 197 433
## 3 37 111
mytable_cs<-xtabs(~Checking.account+Sex,data=riskdat)
mytable_cs
## Sex
## Checking.account female male
## little 88 186
## moderate 86 183
## rich 20 43
mytable_sp<-xtabs(~Sex+Purpose,data=riskdat)
mytable_sp
## Purpose
## Sex business car domestic appliances education furniture/equipment
## female 19 94 6 24 74
## male 78 243 6 35 107
## Purpose
## Sex radio/TV repairs vacation/others
## female 85 5 3
## male 195 17 9
#Some important changes to be made in the data
riskdat$Sex_up<-c(0)
riskdat$Sex_up[which(riskdat$Sex=="male")]<-c(1)
riskdat$Sex_up[which(riskdat$Sex=="female")]<-c(2)
riskdat$housing_up<-c(0)
riskdat$housing_up[which(riskdat$Housing=="own")]<-c(1)
riskdat$housing_up[which(riskdat$Housing=="free")]<-c(2)
riskdat$housing_up[which(riskdat$Housing=="rent")]<-c(3)
riskdat$Sav_up<-c(0)
riskdat$Sav_up[which(riskdat$Saving.accounts=="little")]<-c(1)
riskdat$Sav_up[which(riskdat$Saving.accounts=="moderate")]<-c(2)
riskdat$Sav_up[which(riskdat$Saving.accounts=="quite rich")]<-c(3)
riskdat$Sav_up[which(riskdat$Saving.accounts=="rich")]<-c(4)
riskdat$rs_up<-c(0)
riskdat$rs_up[which(riskdat$Risk=="good")]<-c(1)
riskdat$rs_up[which(riskdat$Risk=="bad")]<-c(0)
riskdat$ch_ac<-c(0)
riskdat$ch_ac[which(riskdat$Checking.account=="little")]<-c(1)
riskdat$ch_ac[which(riskdat$Checking.account=="moderate")]<-c(2)
riskdat$ch_ac[which(riskdat$Checking.account=="rich")]<-c(3)
riskdat$pu_up<-c(0)
riskdat$pu_up[which(riskdat$Purpose=="radio/TV")]<-c(1)
riskdat$pu_up[which(riskdat$Purpose=="education")]<-c(2)
riskdat$pu_up[which(riskdat$Purpose=="furniture/equipment")]<-c(3)
riskdat$pu_up[which(riskdat$Purpose=="car")]<-c(4)
riskdat$pu_up[which(riskdat$Purpose=="business")]<-c(5)
riskdat$pu_up[which(riskdat$Purpose=="domestic appliances")]<-c(6)
riskdat$pu_up[which(riskdat$Purpose=="repairs")]<-c(7)
riskdat$pu_up[which(riskdat$Purpose=="vacation/others")]<-c(8)
#Correlation in the data
cor(riskdat[,c(1,4,8,9,12:18)])
## index Job Credit.amount Duration
## index 1.000000000 -0.02734538 0.01348793 0.03078762
## Job -0.027345376 1.00000000 0.28538533 0.21090973
## Credit.amount 0.013487929 0.28538533 1.00000000 0.62498420
## Duration 0.030787617 0.21090973 0.62498420 1.00000000
## age_gp -0.005784457 0.04244108 0.03875637 -0.03041478
## Sex_up 0.001692754 -0.07029834 -0.09348244 -0.08143219
## housing_up 0.020210925 0.01520106 0.05611874 0.01195019
## Sav_up -0.041050835 -0.04080295 -0.10753805 -0.06452558
## rs_up -0.034606444 -0.03273500 -0.15473864 -0.21492667
## ch_ac -0.048268278 -0.05425460 0.02456123 0.03504999
## pu_up 0.009952104 0.02462906 0.21451307 0.10552447
## age_gp Sex_up housing_up Sav_up
## index -0.005784457 0.001692754 0.020210925 -0.041050835
## Job 0.042441082 -0.070298338 0.015201060 -0.040802954
## Credit.amount 0.038756369 -0.093482437 0.056118745 -0.107538048
## Duration -0.030414781 -0.081432194 0.011950187 -0.064525576
## age_gp 1.000000000 -0.231082397 -0.176412073 -0.031660984
## Sex_up -0.231082397 1.000000000 0.179136893 0.029309993
## housing_up -0.176412073 0.179136893 1.000000000 0.003267732
## Sav_up -0.031660984 0.029309993 0.003267732 1.000000000
## rs_up 0.127149930 -0.075492697 -0.123815236 0.033871266
## ch_ac -0.067709213 0.021903226 0.032924525 -0.005614445
## pu_up 0.078191468 -0.051435596 0.041964365 0.024616085
## rs_up ch_ac pu_up
## index -0.03460644 -0.048268278 0.009952104
## Job -0.03273500 -0.054254600 0.024629056
## Credit.amount -0.15473864 0.024561230 0.214513073
## Duration -0.21492667 0.035049995 0.105524472
## age_gp 0.12714993 -0.067709213 0.078191468
## Sex_up -0.07549270 0.021903226 -0.051435596
## housing_up -0.12381524 0.032924525 0.041964365
## Sav_up 0.03387127 -0.005614445 0.024616085
## rs_up 1.00000000 -0.197787636 -0.090000522
## ch_ac -0.19778764 1.000000000 0.032660814
## pu_up -0.09000052 0.032660814 1.000000000
library(corrplot)
## Warning: package 'corrplot' was built under R version 3.4.3
## corrplot 0.84 loaded
corrplot(corr=cor(riskdat[,c(1,4,8,9,12:18)]),method="ellipse")
#Extracting missing values
miss= apply(X= riskdat, MARGIN = 2, FUN = function(k) which(is.na(k) | is.nan(k) | is.infinite(k)))
#removing missing values
misspos = sort(unique(unlist(miss, use.names=FALSE)))
riskdat1 = riskdat[-misspos,]
nrow(riskdat1)
## [1] 522
Since,the response variable(Risk-good or Bad) is binary, it is preferred to use Logistic regression instead of linear.
#Model
mylogit=Risk~Sex+Saving.accounts+Checking.account+Duration+Purpose+Job+Housing+Age
fit=glm(mylogit,data=riskdat,family = binomial(link= "logit"))
summary(fit)
##
## Call:
## glm(formula = mylogit, family = binomial(link = "logit"), data = riskdat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1430 -1.0862 0.6152 0.9840 2.1206
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.484205 0.736256 0.658 0.51076
## Sexmale 0.372072 0.211820 1.757 0.07899 .
## Saving.accountsmoderate 0.124210 0.304999 0.407 0.68383
## Saving.accountsquite rich 0.678331 0.524194 1.294 0.19565
## Saving.accountsrich 1.329302 0.583156 2.279 0.02264 *
## Checking.accountmoderate 0.190906 0.212711 0.897 0.36946
## Checking.accountrich 1.023122 0.372824 2.744 0.00607 **
## Duration -0.052239 0.008832 -5.914 3.33e-09 ***
## Purposecar -0.361756 0.362628 -0.998 0.31848
## Purposedomestic appliances -0.731609 0.940269 -0.778 0.43652
## Purposeeducation -1.096391 0.545624 -2.009 0.04449 *
## Purposefurniture/equipment -0.079054 0.388066 -0.204 0.83858
## Purposeradio/TV -0.092045 0.371760 -0.248 0.80445
## Purposerepairs -0.394206 0.661345 -0.596 0.55113
## Purposevacation/others 0.361401 0.804903 0.449 0.65343
## Job 0.038334 0.150073 0.255 0.79839
## Housingown 0.266776 0.328882 0.811 0.41727
## Housingrent -0.067311 0.377737 -0.178 0.85857
## Age 0.009913 0.008993 1.102 0.27032
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 716.73 on 521 degrees of freedom
## Residual deviance: 635.89 on 503 degrees of freedom
## (478 observations deleted due to missingness)
## AIC: 673.89
##
## Number of Fisher Scoring iterations: 4
#Confidence interval for the coefficients Beta_hat
confint(fit)
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) -0.956949473 1.93557959
## Sexmale -0.042649511 0.78879500
## Saving.accountsmoderate -0.471210200 0.72832311
## Saving.accountsquite rich -0.301818358 1.78376692
## Saving.accountsrich 0.276915869 2.61822035
## Checking.accountmoderate -0.225891929 0.60899225
## Checking.accountrich 0.317566967 1.78925680
## Duration -0.070008462 -0.03532318
## Purposecar -1.081512595 0.34481712
## Purposedomestic appliances -2.637183615 1.17398278
## Purposeeducation -2.194859171 -0.04503009
## Purposefurniture/equipment -0.845728261 0.67993005
## Purposeradio/TV -0.828805165 0.63312887
## Purposerepairs -1.689540214 0.93469817
## Purposevacation/others -1.213209932 2.00378701
## Job -0.256416505 0.33307641
## Housingown -0.378407847 0.91488487
## Housingrent -0.808577282 0.67573711
## Age -0.007595798 0.02773828
#Performing Wald test to test the significance of the coefficients
library(aod)
## Warning: package 'aod' was built under R version 3.4.3
##
## Attaching package: 'aod'
## The following object is masked from 'package:survival':
##
## rats
wald.test(b = coef(fit), Sigma = vcov(fit), Terms = 1)
## Wald test:
## ----------
##
## Chi-squared test:
## X2 = 0.43, df = 1, P(> X2) = 0.51
wald.test(b = coef(fit), Sigma = vcov(fit), Terms = 2)
## Wald test:
## ----------
##
## Chi-squared test:
## X2 = 3.1, df = 1, P(> X2) = 0.079
wald.test(b = coef(fit), Sigma = vcov(fit), Terms = 3)
## Wald test:
## ----------
##
## Chi-squared test:
## X2 = 0.17, df = 1, P(> X2) = 0.68
wald.test(b = coef(fit), Sigma = vcov(fit), Terms = 4)
## Wald test:
## ----------
##
## Chi-squared test:
## X2 = 1.7, df = 1, P(> X2) = 0.2
wald.test(b = coef(fit), Sigma = vcov(fit), Terms = 5)
## Wald test:
## ----------
##
## Chi-squared test:
## X2 = 5.2, df = 1, P(> X2) = 0.023
wald.test(b = coef(fit), Sigma = vcov(fit), Terms = 6)
## Wald test:
## ----------
##
## Chi-squared test:
## X2 = 0.81, df = 1, P(> X2) = 0.37
wald.test(b = coef(fit), Sigma = vcov(fit), Terms = 7)
## Wald test:
## ----------
##
## Chi-squared test:
## X2 = 7.5, df = 1, P(> X2) = 0.0061
wald.test(b = coef(fit), Sigma = vcov(fit), Terms = 8)
## Wald test:
## ----------
##
## Chi-squared test:
## X2 = 35.0, df = 1, P(> X2) = 3.3e-09
#Pseudo R-Squared(MacFadden Test)
# McFadden test =(1 log likelihood of fitted model/log likelihood of null model)
library(pscl)
## Warning: package 'pscl' was built under R version 3.4.3
## Classes and Methods for R developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University
## Simon Jackman
## hurdle and zeroinfl functions by Achim Zeileis
pR2(fit)
## llh llhNull G2 McFadden r2ML
## -317.9466369 -610.8643021 585.8353304 0.4795135 0.6744665
## r2CU
## 0.7463239
Since the value of McFadden is 0.4795 (>0.3), the model is a good fit to the data.
#Confusion Matrix
pred=predict(fit,type="response")
summary(pred)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.06354 0.42872 0.58344 0.55747 0.69331 0.95555
table(riskdat1$Risk,round(pred))
##
## 0 1
## bad 125 106
## good 61 230
The table above shows that the model predicted 125 bad and 230 good Risk correctly. and 106 bad and 61 good Risk incorrectly.
#Calculating the accuracy of the model
y=ifelse(riskdat1$Risk=="bad",0,1)
fitted.results <- ifelse(pred > 0.5,1,0)
fitted.results
## 2 4 5 8 10 11 12 13 14 15 16 19 22 23 24
## 0 0 0 0 0 0 0 1 1 0 0 0 1 1 1
## 26 28 29 30 31 32 33 35 36 38 39 40 42 43 44
## 1 1 1 0 1 1 1 1 0 1 1 1 1 1 0
## 45 48 52 55 59 60 61 63 64 68 73 74 76 77 78
## 0 1 0 0 1 0 1 0 0 1 1 0 1 0 1
## 80 84 85 87 88 89 90 92 95 96 98 99 102 104 106
## 0 1 1 1 0 1 1 1 1 0 1 0 0 1 1
## 108 110 111 112 113 115 119 120 121 124 126 127 128 129 130
## 1 1 1 0 0 1 0 1 0 1 1 1 1 1 1
## 132 138 140 141 142 143 144 146 147 149 153 154 155 156 157
## 0 1 1 1 0 1 1 0 1 0 1 1 1 0 1
## 158 159 164 167 168 170 171 173 174 175 177 178 180 182 185
## 0 1 1 1 1 1 1 1 1 0 1 1 1 0 1
## 187 188 189 190 192 193 195 196 198 200 202 204 206 208 209
## 1 1 1 1 0 1 0 0 1 1 1 0 0 1 1
## 213 214 217 218 219 221 227 228 230 231 234 236 238 240 243
## 1 1 1 1 0 1 1 1 0 1 1 1 1 0 0
## 249 251 252 253 258 261 262 263 266 269 274 275 285 286 287
## 1 1 1 0 1 1 1 1 1 1 0 0 1 0 0
## 288 289 290 292 293 294 296 300 302 304 308 309 310 313 314
## 0 1 1 0 0 0 0 1 0 1 1 1 1 1 1
## 316 317 320 321 322 323 324 326 329 330 331 333 335 336 337
## 0 1 1 0 1 0 1 1 1 1 1 0 0 1 1
## 339 340 341 342 343 344 345 347 348 350 352 354 356 360 363
## 1 1 0 0 1 1 1 1 1 1 1 1 0 0 1
## 365 368 369 370 375 376 379 382 384 388 389 392 393 394 396
## 1 0 0 1 0 0 0 0 1 1 1 1 0 1 0
## 397 398 399 406 408 410 411 417 423 426 430 432 433 435 439
## 1 0 1 1 1 1 0 1 1 0 0 1 1 1 0
## 440 442 443 445 447 448 450 455 457 458 459 461 462 463 466
## 1 1 1 0 0 1 1 0 1 1 1 0 1 1 1
## 467 471 472 473 475 476 478 479 480 481 482 483 486 492 495
## 0 1 0 1 1 0 1 1 1 1 1 1 1 0 1
## 497 499 500 501 502 503 504 505 507 508 511 513 514 516 517
## 0 1 1 0 0 1 1 0 1 1 1 1 1 1 1
## 519 522 523 525 526 529 530 531 532 536 538 539 540 541 544
## 1 1 0 1 0 0 1 0 0 1 1 0 1 1 1
## 546 549 553 554 555 556 557 559 560 562 563 566 567 570 571
## 0 1 0 1 1 1 1 1 1 0 1 0 1 0 0
## 574 575 577 579 581 582 584 586 587 588 589 590 591 594 596
## 1 1 1 0 1 1 0 0 1 1 0 1 1 0 1
## 597 598 601 602 603 605 606 608 611 612 613 614 618 619 621
## 0 1 1 1 0 1 1 0 0 1 1 0 1 0 1
## 624 625 627 628 631 632 635 640 641 642 645 647 649 650 651
## 1 1 1 1 0 1 0 0 0 1 1 0 1 0 0
## 652 653 654 656 657 659 660 661 664 665 667 669 670 678 679
## 1 1 0 1 1 0 1 1 1 1 1 1 1 0 1
## 685 688 690 691 692 693 697 700 702 703 704 705 707 708 709
## 0 0 1 1 1 1 1 1 0 1 1 0 0 1 1
## 710 712 714 715 720 721 722 723 724 728 729 730 731 732 733
## 1 0 1 0 1 1 1 1 1 0 0 1 1 0 1
## 737 738 740 741 742 744 746 747 748 751 752 753 757 760 762
## 0 1 0 1 1 1 1 0 1 1 0 1 1 1 0
## 763 766 767 769 772 775 778 780 781 783 784 786 789 790 791
## 1 1 0 1 0 1 1 1 0 0 1 1 0 0 1
## 794 802 803 806 807 809 810 811 812 813 814 815 816 819 820
## 1 1 0 0 1 0 0 1 1 0 0 0 0 1 1
## 822 823 824 826 827 832 833 835 836 839 841 849 850 851 854
## 1 0 1 1 0 0 0 1 1 1 0 1 1 0 0
## 859 863 867 870 872 873 875 876 877 879 885 886 888 891 893
## 0 0 1 1 1 1 1 1 1 1 1 0 0 1 1
## 894 897 900 901 906 912 915 916 918 919 920 923 924 925 926
## 0 0 1 1 1 0 0 0 1 1 0 1 1 1 1
## 927 928 930 931 932 935 936 937 938 939 945 946 947 951 952
## 0 0 1 1 1 1 1 1 1 0 1 0 1 1 0
## 953 955 956 958 959 960 962 965 967 970 971 973 974 976 977
## 1 1 1 1 0 1 1 1 1 1 1 0 0 1 1
## 980 981 983 984 986 987 989 990 994 997 999 1000
## 1 1 1 0 0 1 0 1 0 0 0 0
misClasificError <- mean(fitted.results != y)
1-misClasificError
## [1] 0.6800766
The fitted model is 68.00% accurate.
#Roc Curve
library(ROCR)
## Warning: package 'ROCR' was built under R version 3.4.3
## Loading required package: gplots
## Warning: package 'gplots' was built under R version 3.4.3
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
pred=predict(fit,type="response")
pr <- prediction(pred, riskdat1$Risk)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf)
#Area Under the Curve (AUC)
library(ROCR)
auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]
auc
## [1] 0.7202883
#Tjur Statistic
tjur= mean(pred[which(y==1)])-mean(pred[which(y==0)])
tjur
## [1] 0.1461676
The value of TJUR statistic reveals that the model is a moderate fit to the data.
DISCUSSION:
The logistic regression fits well to the credit risk data since the response variable is binary. The prdictor varaibles are: 1. Age 2. Sex 3. Saving.accounts 4. Checking.account 5. Duration 6. Purpose 7. Job 8. Housing
The mentioned variables have been used to predict the response ( whether giving away a credit card a good risk or a bad one).
INFERENCES:
Following conclusions have been noted:
Some of the entries aren’t complete. There are a lot of missing values in the data. Therefore a new dataset “riskdat1” has been created to study the prediction.
Most of the data is categorical which is one of the main reason for using Logistic regression instead of Linear Regression.
The distribution of various variables have been shown above to have a glipmse of how the distribution look like.
It is observed that the people who own houses are highly correlated with the good risk.
Distribution of “Housing by Credit amount” reveals that the highest values come from Housing category “free”.
A lot of explorations of the data have been done using one-way, two-way contingency tables and a few histograms.
Highest Credit amounts have been taken for longer duration. The highest density is between [121824] months which makes sense.
Correlation matrix and corrplot have been constructed to study the correlation in the data where the darkest colour shows high correlation betweem the variables involved.
The final fitted model is a moderate fit to the data. This can be infered from various tests like Wald test, Accuracy,McFadden test and TJUR statistic. The model is found to be 68.00% accurate.
REFERENCES: