TOPIC: Credit Risk Analysis
NAME : Zeba Khan
EMAIL: zeba.khan.stats@gmail.com
COLLEGE / COMPANY: Lady Shri Ram College for Women, University of Delhi

INTRODUCTION:

Credit Risk assessment is a crucial issue faced by Banks nowadays which helps them to evaluate if a loan applicant can be a defaulter at a later stage so that they can go ahead and grant the loan or not. This helps the banks to minimize the possible losses and can increase the volume of credits.

When a bank receives a loan application, based on the applicant’s profile the bank has to make a decision regarding whether to go ahead with the loan approval or not. Two types of risks are associated with the bank’s decision -

  1. If the applicant is a good credit risk, i.e. is likely to repay the loan, then not approving the loan to the person results in a loss of business to the bank
  2. If the applicant is a bad credit risk, i.e. is not likely to repay the loan, then approving the loan to the person results in a financial loss to the bank.

OBJECTIVE OF ANALYSIS: Minimization of risk and maximization of profit on behalf of the bank.

To minimize loss from the bank’s perspective, the bank needs a decision rule regarding who to give approval of the loan and who not to. An applicant’s demographic and socio-economic profiles are considered by loan managers before a decision is taken regarding his/her loan application.

A predictive model developed on this data is expected to provide a bank manager guidance for making a decision whether to approve a loan to a prospective applicant based on his/her profiles.

DATA DESCRIPTION: The dataset contains 1000 entries where each entry represents a person who takes credit by a bank. Each person is classified as a good or a bad credit risk according to the set of attributes(Here, 10 variables which are listed below).

The categorical variables in the dataset are as follows: 1. Sex (male, female) 2. Job (0-unskilled and non-resident, 1-unskilled and resident, 2-skilled, 3-highly skilled) 3. Housing (own, rent or free) 4. Savings accounts (little, moderate, quite rich and rich) 5. Purpose (car, furniture/equipment, radio/TV, domestic appliances, repairs, education, business, vacation/others) 6. Risk (value target-Good or Bad Risk)

Numeric variables in the dataset: 1. Age 2. Checking Account (numeric, in DM-Deutsch Mark) 3. Credit Amount (Numeric, in DM) 4. Duration (numeric, in month)

MODEL ANALYSIS:

#Reading the dataset into R
riskdat <- read.csv(paste("german_credit_data (2).csv", sep=""))
View(riskdat)

#Loading the libraries
library(Hmisc)
## Warning: package 'Hmisc' was built under R version 3.4.3
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 3.4.3
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
## 
##     format.pval, units
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.4.3
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:Hmisc':
## 
##     src, summarize
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
#Checking the Data
head(riskdat)
##   X Age    Sex Job Housing Saving.accounts Checking.account Credit.amount
## 1 0  67   male   2     own            <NA>           little          1169
## 2 1  22 female   2     own          little         moderate          5951
## 3 2  49   male   1     own          little             <NA>          2096
## 4 3  45   male   2    free          little           little          7882
## 5 4  53   male   2    free          little           little          4870
## 6 5  35   male   1    free            <NA>             <NA>          9055
##   Duration             Purpose Risk
## 1        6            radio/TV good
## 2       48            radio/TV  bad
## 3       12           education good
## 4       42 furniture/equipment good
## 5       24                 car  bad
## 6       36           education good
tail(riskdat)
##        X Age    Sex Job Housing Saving.accounts Checking.account
## 995  994  50   male   2     own            <NA>             <NA>
## 996  995  31 female   1     own          little             <NA>
## 997  996  40   male   3     own          little           little
## 998  997  38   male   2     own          little             <NA>
## 999  998  23   male   2    free          little           little
## 1000 999  27   male   2     own        moderate         moderate
##      Credit.amount Duration             Purpose Risk
## 995           2390       12                 car good
## 996           1736       12 furniture/equipment good
## 997           3857       30                 car good
## 998            804       12            radio/TV good
## 999           1845       45            radio/TV  bad
## 1000          4576       45                 car good
str(riskdat)
## 'data.frame':    1000 obs. of  11 variables:
##  $ X               : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ Age             : int  67 22 49 45 53 35 53 35 61 28 ...
##  $ Sex             : Factor w/ 2 levels "female","male": 2 1 2 2 2 2 2 2 2 2 ...
##  $ Job             : int  2 2 1 2 2 1 2 3 1 3 ...
##  $ Housing         : Factor w/ 3 levels "free","own","rent": 2 2 2 1 1 1 2 3 2 2 ...
##  $ Saving.accounts : Factor w/ 4 levels "little","moderate",..: NA 1 1 1 1 NA 3 1 4 1 ...
##  $ Checking.account: Factor w/ 3 levels "little","moderate",..: 1 2 NA 1 1 NA NA 2 NA 2 ...
##  $ Credit.amount   : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
##  $ Duration        : int  6 48 12 42 24 36 24 36 12 30 ...
##  $ Purpose         : Factor w/ 8 levels "business","car",..: 6 6 4 5 2 4 5 2 6 2 ...
##  $ Risk            : Factor w/ 2 levels "bad","good": 2 1 2 2 1 2 2 2 2 1 ...
summary(riskdat)
##        X              Age            Sex           Job        Housing   
##  Min.   :  0.0   Min.   :19.00   female:310   Min.   :0.000   free:108  
##  1st Qu.:249.8   1st Qu.:27.00   male  :690   1st Qu.:2.000   own :713  
##  Median :499.5   Median :33.00                Median :2.000   rent:179  
##  Mean   :499.5   Mean   :35.55                Mean   :1.904             
##  3rd Qu.:749.2   3rd Qu.:42.00                3rd Qu.:2.000             
##  Max.   :999.0   Max.   :75.00                Max.   :3.000             
##                                                                         
##    Saving.accounts Checking.account Credit.amount      Duration   
##  little    :603    little  :274     Min.   :  250   Min.   : 4.0  
##  moderate  :103    moderate:269     1st Qu.: 1366   1st Qu.:12.0  
##  quite rich: 63    rich    : 63     Median : 2320   Median :18.0  
##  rich      : 48    NA's    :394     Mean   : 3271   Mean   :20.9  
##  NA's      :183                     3rd Qu.: 3972   3rd Qu.:24.0  
##                                     Max.   :18424   Max.   :72.0  
##                                                                   
##                 Purpose      Risk    
##  car                :337   bad :300  
##  radio/TV           :280   good:700  
##  furniture/equipment:181             
##  business           : 97             
##  education          : 59             
##  repairs            : 22             
##  (Other)            : 24
describe(riskdat)
## riskdat 
## 
##  11  Variables      1000  Observations
## ---------------------------------------------------------------------------
## X 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1000        0     1000        1    499.5    333.7    49.95    99.90 
##      .25      .50      .75      .90      .95 
##   249.75   499.50   749.25   899.10   949.05 
## 
## lowest :   0   1   2   3   4, highest: 995 996 997 998 999
## ---------------------------------------------------------------------------
## Age 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1000        0       53    0.999    35.55    12.41       22       23 
##      .25      .50      .75      .90      .95 
##       27       33       42       52       60 
## 
## lowest : 19 20 21 22 23, highest: 67 68 70 74 75
## ---------------------------------------------------------------------------
## Sex 
##        n  missing distinct 
##     1000        0        2 
##                         
## Value      female   male
## Frequency     310    690
## Proportion   0.31   0.69
## ---------------------------------------------------------------------------
## Job 
##        n  missing distinct     Info     Mean      Gmd 
##     1000        0        4    0.739    1.904   0.6413 
##                                   
## Value          0     1     2     3
## Frequency     22   200   630   148
## Proportion 0.022 0.200 0.630 0.148
## ---------------------------------------------------------------------------
## Housing 
##        n  missing distinct 
##     1000        0        3 
##                             
## Value       free   own  rent
## Frequency    108   713   179
## Proportion 0.108 0.713 0.179
## ---------------------------------------------------------------------------
## Saving.accounts 
##        n  missing distinct 
##      817      183        4 
##                                                       
## Value          little   moderate quite rich       rich
## Frequency         603        103         63         48
## Proportion      0.738      0.126      0.077      0.059
## ---------------------------------------------------------------------------
## Checking.account 
##        n  missing distinct 
##      606      394        3 
##                                      
## Value        little moderate     rich
## Frequency       274      269       63
## Proportion    0.452    0.444    0.104
## ---------------------------------------------------------------------------
## Credit.amount 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1000        0      921        1     3271     2773      709      932 
##      .25      .50      .75      .90      .95 
##     1366     2320     3972     7179     9163 
## 
## lowest :   250   276   338   339   343, highest: 15653 15672 15857 15945 18424
## ---------------------------------------------------------------------------
## Duration 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1000        0       33    0.985     20.9    12.98        6        9 
##      .25      .50      .75      .90      .95 
##       12       18       24       36       48 
## 
## lowest :  4  5  6  7  8, highest: 47 48 54 60 72
## ---------------------------------------------------------------------------
## Purpose 
##        n  missing distinct 
##     1000        0        8 
## 
## business (97, 0.097), car (337, 0.337), domestic appliances (12, 0.012),
## education (59, 0.059), furniture/equipment (181, 0.181), radio/TV (280,
## 0.280), repairs (22, 0.022), vacation/others (12, 0.012)
## ---------------------------------------------------------------------------
## Risk 
##        n  missing distinct 
##     1000        0        2 
##                     
## Value       bad good
## Frequency   300  700
## Proportion  0.3  0.7
## ---------------------------------------------------------------------------
#Making some changes to data
colnames(riskdat)[1] <- "index"
#Some Explorations

hist(riskdat$Credit.amount, main = "Histogram of Credit Amount", xlab = "Credit Amount", ylab = "Frequency", col = "green", border = "black" )

ggplot(riskdat, aes(Sex) ) + geom_bar(aes(fill = as.factor(riskdat$Sex))) + 
  scale_fill_discrete(name="Sex",
                      labels=c( "Female","Male")) + 
  theme(axis.text.x=element_blank(),
        axis.ticks.x=element_blank()) +
  labs(x= "Sex",y= "Frequency" , title = "Plot of Sex")

ggplot(riskdat, aes(Job) ) + geom_bar(aes(fill = as.factor(riskdat$Job))) + 
  scale_fill_discrete(name="Job Type",
                      labels=c( "Unskilled and Non-Resident","Unskilled and Resident", "Skilled", "Highly Skilled")) + 
  theme(axis.text.x=element_blank(),
        axis.ticks.x=element_blank()) +
  labs(x= "Level of Job",y= "Frequency" , title = "Plot of Job")

ggplot(riskdat, aes(Housing) ) + geom_bar(aes(fill = as.factor(riskdat$Housing))) + 
  scale_fill_discrete(name="Housing",
                      labels=c( "Free","Own", "Rent")) + 
  theme(axis.text.x=element_blank(),
        axis.ticks.x=element_blank()) +
  labs(x= "Housing",y= "Frequency" , title = "Plot of Housing")

ggplot(riskdat, aes(Saving.accounts) ) + geom_bar(aes(fill = as.factor(riskdat$Saving.accounts)))  + 
  scale_fill_discrete(name="Saving Accounts",
                      labels=c( "Little","Moderate", "Quite Rich", "Rich", "NA"))  +
  labs(x= "Saving Accounts",y= "Frequency" , title = "Plot of Saving Accounts")

ggplot(riskdat, aes(Checking.account) ) + geom_bar(aes(fill = as.factor(riskdat$Checking.account))) + 
  scale_fill_discrete(name="Checking Account",
                      labels=c( "Little","Moderate", "Rich")) + 
  theme(axis.text.x=element_blank(),
        axis.ticks.x=element_blank()) +
  labs(x= "Checking Account",y= "Frequency" , title = "Plot of Checking Account")

ggplot(riskdat, aes(Duration)) + geom_histogram(binwidth=4, colour="black", fill="green") +
  labs(x= "Duration in Months",y= "Frequency" , title = "Plot of Duration")

ggplot(riskdat, aes(Purpose) ) + geom_bar(aes(fill = as.factor(riskdat$Purpose))) + 
  scale_fill_discrete(name="Purpose of Loan",labels=c( "Business","Car", "Domestic Appliances","Education","Furniture/Equipment","Radio/TV","Repairs","Vacation/Others")) + 
  theme(axis.text.x=element_blank(),
        axis.ticks.x=element_blank()) +
  labs(x= "Purpose of Loan",y= "Frequency" , title = "Plot of Loan Purpose")

#Some important inferences
riskdat$age_gp<-c(0)
riskdat$age_gp<-findInterval(riskdat$Age,c(18,25,35,60,120))
boxplot(riskdat$Credit.amount~riskdat$age_gp,horizontal=TRUE,ylab="Age Group",xlab="credit amount",las=1,main="Credit amount v/s Age group",col=c("red","green","pink","yellow"))

histogram(Credit.amount~Housing|Risk,data=riskdat)

#It can be seen that there is high correlation between good risk and people who own houses.

boxplot(Credit.amount~Housing+Risk,data=riskdat,horizontal=TRUE,xlab="Distribution of Housing by credit amount")

#it can be seen that the highest values come from category "free".

boxplot(Credit.amount~Sex+Risk,data=riskdat,horizontal=TRUE,xlab="Distribution of credit amount by sex")

histogram(Credit.amount~Sex|Risk,data=riskdat)

ggplot(riskdat, aes(Risk) ) + geom_bar(aes(fill = as.factor(riskdat$Risk))) + 
  scale_fill_discrete(name="Risk",
                      labels=c( "good", "bad")) + 
  theme(axis.text.x=element_blank(),
        axis.ticks.x=element_blank()) + labs(x= "Risk",y= "Frequency" , title = "Plot of Risk")

histogram(Saving.accounts~Job|Risk,data=riskdat)

mytab1<-xtabs(~Risk+Job,data=riskdat)
mytab1
##       Job
## Risk     0   1   2   3
##   bad    7  56 186  51
##   good  15 144 444  97
histogram(Credit.amount~Purpose|Risk,data=riskdat)

boxplot(Credit.amount~Purpose+Risk,data=riskdat,horizontal=TRUE,xlab="Distribution of Purpose by credit amount")

mytab<-xtabs(~Risk+Purpose,data=riskdat)
mytab
##       Purpose
## Risk   business car domestic appliances education furniture/equipment
##   bad        34 106                   4        23                  58
##   good       63 231                   8        36                 123
##       Purpose
## Risk   radio/TV repairs vacation/others
##   bad        62       8               5
##   good      218      14               7
histogram(Risk~Purpose,data=riskdat)

boxplot(Credit.amount~Purpose+Risk,data=riskdat,horizontal=TRUE,xlab="Distribution of Housing by credit amount",col=c("red","yellow"))

boxplot(Duration~Risk+Credit.amount,data=riskdat,horizontal=TRUE,xlab="Distribution of Housing by credit amount")

#It can be clearly seen that the highest duration have the highest amount The highest density is between [12~18~24] months.
#Some important contingency tables
mytable_ch<-with(riskdat,table(Checking.account))
mytable_ch
## Checking.account
##   little moderate     rich 
##      274      269       63
mytable_js<-xtabs(~Job+Sex,data=riskdat)
mytable_js
##    Sex
## Job female male
##   0     12   10
##   1     64  136
##   2    197  433
##   3     37  111
mytable_cs<-xtabs(~Checking.account+Sex,data=riskdat)
mytable_cs
##                 Sex
## Checking.account female male
##         little       88  186
##         moderate     86  183
##         rich         20   43
mytable_sp<-xtabs(~Sex+Purpose,data=riskdat)
mytable_sp
##         Purpose
## Sex      business car domestic appliances education furniture/equipment
##   female       19  94                   6        24                  74
##   male         78 243                   6        35                 107
##         Purpose
## Sex      radio/TV repairs vacation/others
##   female       85       5               3
##   male        195      17               9
#Some important changes to be made in the data
riskdat$Sex_up<-c(0)
riskdat$Sex_up[which(riskdat$Sex=="male")]<-c(1)
riskdat$Sex_up[which(riskdat$Sex=="female")]<-c(2)

riskdat$housing_up<-c(0)
riskdat$housing_up[which(riskdat$Housing=="own")]<-c(1)
riskdat$housing_up[which(riskdat$Housing=="free")]<-c(2)
riskdat$housing_up[which(riskdat$Housing=="rent")]<-c(3)

riskdat$Sav_up<-c(0)
riskdat$Sav_up[which(riskdat$Saving.accounts=="little")]<-c(1)
riskdat$Sav_up[which(riskdat$Saving.accounts=="moderate")]<-c(2)
riskdat$Sav_up[which(riskdat$Saving.accounts=="quite rich")]<-c(3)
riskdat$Sav_up[which(riskdat$Saving.accounts=="rich")]<-c(4)

riskdat$rs_up<-c(0)
riskdat$rs_up[which(riskdat$Risk=="good")]<-c(1)
riskdat$rs_up[which(riskdat$Risk=="bad")]<-c(0)

riskdat$ch_ac<-c(0)
riskdat$ch_ac[which(riskdat$Checking.account=="little")]<-c(1)
riskdat$ch_ac[which(riskdat$Checking.account=="moderate")]<-c(2)
riskdat$ch_ac[which(riskdat$Checking.account=="rich")]<-c(3)

riskdat$pu_up<-c(0)
riskdat$pu_up[which(riskdat$Purpose=="radio/TV")]<-c(1)
riskdat$pu_up[which(riskdat$Purpose=="education")]<-c(2)
riskdat$pu_up[which(riskdat$Purpose=="furniture/equipment")]<-c(3)
riskdat$pu_up[which(riskdat$Purpose=="car")]<-c(4)
riskdat$pu_up[which(riskdat$Purpose=="business")]<-c(5)
riskdat$pu_up[which(riskdat$Purpose=="domestic appliances")]<-c(6)
riskdat$pu_up[which(riskdat$Purpose=="repairs")]<-c(7)
riskdat$pu_up[which(riskdat$Purpose=="vacation/others")]<-c(8)
#Correlation in the data
cor(riskdat[,c(1,4,8,9,12:18)])
##                      index         Job Credit.amount    Duration
## index          1.000000000 -0.02734538    0.01348793  0.03078762
## Job           -0.027345376  1.00000000    0.28538533  0.21090973
## Credit.amount  0.013487929  0.28538533    1.00000000  0.62498420
## Duration       0.030787617  0.21090973    0.62498420  1.00000000
## age_gp        -0.005784457  0.04244108    0.03875637 -0.03041478
## Sex_up         0.001692754 -0.07029834   -0.09348244 -0.08143219
## housing_up     0.020210925  0.01520106    0.05611874  0.01195019
## Sav_up        -0.041050835 -0.04080295   -0.10753805 -0.06452558
## rs_up         -0.034606444 -0.03273500   -0.15473864 -0.21492667
## ch_ac         -0.048268278 -0.05425460    0.02456123  0.03504999
## pu_up          0.009952104  0.02462906    0.21451307  0.10552447
##                     age_gp       Sex_up   housing_up       Sav_up
## index         -0.005784457  0.001692754  0.020210925 -0.041050835
## Job            0.042441082 -0.070298338  0.015201060 -0.040802954
## Credit.amount  0.038756369 -0.093482437  0.056118745 -0.107538048
## Duration      -0.030414781 -0.081432194  0.011950187 -0.064525576
## age_gp         1.000000000 -0.231082397 -0.176412073 -0.031660984
## Sex_up        -0.231082397  1.000000000  0.179136893  0.029309993
## housing_up    -0.176412073  0.179136893  1.000000000  0.003267732
## Sav_up        -0.031660984  0.029309993  0.003267732  1.000000000
## rs_up          0.127149930 -0.075492697 -0.123815236  0.033871266
## ch_ac         -0.067709213  0.021903226  0.032924525 -0.005614445
## pu_up          0.078191468 -0.051435596  0.041964365  0.024616085
##                     rs_up        ch_ac        pu_up
## index         -0.03460644 -0.048268278  0.009952104
## Job           -0.03273500 -0.054254600  0.024629056
## Credit.amount -0.15473864  0.024561230  0.214513073
## Duration      -0.21492667  0.035049995  0.105524472
## age_gp         0.12714993 -0.067709213  0.078191468
## Sex_up        -0.07549270  0.021903226 -0.051435596
## housing_up    -0.12381524  0.032924525  0.041964365
## Sav_up         0.03387127 -0.005614445  0.024616085
## rs_up          1.00000000 -0.197787636 -0.090000522
## ch_ac         -0.19778764  1.000000000  0.032660814
## pu_up         -0.09000052  0.032660814  1.000000000
library(corrplot)
## Warning: package 'corrplot' was built under R version 3.4.3
## corrplot 0.84 loaded
corrplot(corr=cor(riskdat[,c(1,4,8,9,12:18)]),method="ellipse")

#Extracting missing values
miss= apply(X= riskdat, MARGIN = 2, FUN = function(k) which(is.na(k) | is.nan(k) | is.infinite(k)))

#removing missing values
misspos  = sort(unique(unlist(miss, use.names=FALSE)))
riskdat1 = riskdat[-misspos,]
nrow(riskdat1)
## [1] 522

Since,the response variable(Risk-good or Bad) is binary, it is preferred to use Logistic regression instead of linear.

#Model
mylogit=Risk~Sex+Saving.accounts+Checking.account+Duration+Purpose+Job+Housing+Age
fit=glm(mylogit,data=riskdat,family = binomial(link= "logit"))
summary(fit)
## 
## Call:
## glm(formula = mylogit, family = binomial(link = "logit"), data = riskdat)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1430  -1.0862   0.6152   0.9840   2.1206  
## 
## Coefficients:
##                             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                 0.484205   0.736256   0.658  0.51076    
## Sexmale                     0.372072   0.211820   1.757  0.07899 .  
## Saving.accountsmoderate     0.124210   0.304999   0.407  0.68383    
## Saving.accountsquite rich   0.678331   0.524194   1.294  0.19565    
## Saving.accountsrich         1.329302   0.583156   2.279  0.02264 *  
## Checking.accountmoderate    0.190906   0.212711   0.897  0.36946    
## Checking.accountrich        1.023122   0.372824   2.744  0.00607 ** 
## Duration                   -0.052239   0.008832  -5.914 3.33e-09 ***
## Purposecar                 -0.361756   0.362628  -0.998  0.31848    
## Purposedomestic appliances -0.731609   0.940269  -0.778  0.43652    
## Purposeeducation           -1.096391   0.545624  -2.009  0.04449 *  
## Purposefurniture/equipment -0.079054   0.388066  -0.204  0.83858    
## Purposeradio/TV            -0.092045   0.371760  -0.248  0.80445    
## Purposerepairs             -0.394206   0.661345  -0.596  0.55113    
## Purposevacation/others      0.361401   0.804903   0.449  0.65343    
## Job                         0.038334   0.150073   0.255  0.79839    
## Housingown                  0.266776   0.328882   0.811  0.41727    
## Housingrent                -0.067311   0.377737  -0.178  0.85857    
## Age                         0.009913   0.008993   1.102  0.27032    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 716.73  on 521  degrees of freedom
## Residual deviance: 635.89  on 503  degrees of freedom
##   (478 observations deleted due to missingness)
## AIC: 673.89
## 
## Number of Fisher Scoring iterations: 4
#Confidence interval for the coefficients Beta_hat
confint(fit)
## Waiting for profiling to be done...
##                                   2.5 %      97.5 %
## (Intercept)                -0.956949473  1.93557959
## Sexmale                    -0.042649511  0.78879500
## Saving.accountsmoderate    -0.471210200  0.72832311
## Saving.accountsquite rich  -0.301818358  1.78376692
## Saving.accountsrich         0.276915869  2.61822035
## Checking.accountmoderate   -0.225891929  0.60899225
## Checking.accountrich        0.317566967  1.78925680
## Duration                   -0.070008462 -0.03532318
## Purposecar                 -1.081512595  0.34481712
## Purposedomestic appliances -2.637183615  1.17398278
## Purposeeducation           -2.194859171 -0.04503009
## Purposefurniture/equipment -0.845728261  0.67993005
## Purposeradio/TV            -0.828805165  0.63312887
## Purposerepairs             -1.689540214  0.93469817
## Purposevacation/others     -1.213209932  2.00378701
## Job                        -0.256416505  0.33307641
## Housingown                 -0.378407847  0.91488487
## Housingrent                -0.808577282  0.67573711
## Age                        -0.007595798  0.02773828
#Performing Wald test to test the significance of the coefficients
library(aod)
## Warning: package 'aod' was built under R version 3.4.3
## 
## Attaching package: 'aod'
## The following object is masked from 'package:survival':
## 
##     rats
wald.test(b = coef(fit), Sigma = vcov(fit), Terms = 1)
## Wald test:
## ----------
## 
## Chi-squared test:
## X2 = 0.43, df = 1, P(> X2) = 0.51
wald.test(b = coef(fit), Sigma = vcov(fit), Terms = 2)
## Wald test:
## ----------
## 
## Chi-squared test:
## X2 = 3.1, df = 1, P(> X2) = 0.079
wald.test(b = coef(fit), Sigma = vcov(fit), Terms = 3)
## Wald test:
## ----------
## 
## Chi-squared test:
## X2 = 0.17, df = 1, P(> X2) = 0.68
wald.test(b = coef(fit), Sigma = vcov(fit), Terms = 4)
## Wald test:
## ----------
## 
## Chi-squared test:
## X2 = 1.7, df = 1, P(> X2) = 0.2
wald.test(b = coef(fit), Sigma = vcov(fit), Terms = 5)
## Wald test:
## ----------
## 
## Chi-squared test:
## X2 = 5.2, df = 1, P(> X2) = 0.023
wald.test(b = coef(fit), Sigma = vcov(fit), Terms = 6)
## Wald test:
## ----------
## 
## Chi-squared test:
## X2 = 0.81, df = 1, P(> X2) = 0.37
wald.test(b = coef(fit), Sigma = vcov(fit), Terms = 7)
## Wald test:
## ----------
## 
## Chi-squared test:
## X2 = 7.5, df = 1, P(> X2) = 0.0061
wald.test(b = coef(fit), Sigma = vcov(fit), Terms = 8)
## Wald test:
## ----------
## 
## Chi-squared test:
## X2 = 35.0, df = 1, P(> X2) = 3.3e-09
#Pseudo R-Squared(MacFadden Test)
# McFadden test =(1 log likelihood of fitted model/log likelihood of null model)
library(pscl)
## Warning: package 'pscl' was built under R version 3.4.3
## Classes and Methods for R developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University
## Simon Jackman
## hurdle and zeroinfl functions by Achim Zeileis
pR2(fit)
##          llh      llhNull           G2     McFadden         r2ML 
## -317.9466369 -610.8643021  585.8353304    0.4795135    0.6744665 
##         r2CU 
##    0.7463239

Since the value of McFadden is 0.4795 (>0.3), the model is a good fit to the data.

#Confusion Matrix
pred=predict(fit,type="response")
summary(pred)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.06354 0.42872 0.58344 0.55747 0.69331 0.95555
table(riskdat1$Risk,round(pred))
##       
##          0   1
##   bad  125 106
##   good  61 230

The table above shows that the model predicted 125 bad and 230 good Risk correctly. and 106 bad and 61 good Risk incorrectly.

#Calculating the accuracy of the model
y=ifelse(riskdat1$Risk=="bad",0,1)
fitted.results <- ifelse(pred > 0.5,1,0)
fitted.results
##    2    4    5    8   10   11   12   13   14   15   16   19   22   23   24 
##    0    0    0    0    0    0    0    1    1    0    0    0    1    1    1 
##   26   28   29   30   31   32   33   35   36   38   39   40   42   43   44 
##    1    1    1    0    1    1    1    1    0    1    1    1    1    1    0 
##   45   48   52   55   59   60   61   63   64   68   73   74   76   77   78 
##    0    1    0    0    1    0    1    0    0    1    1    0    1    0    1 
##   80   84   85   87   88   89   90   92   95   96   98   99  102  104  106 
##    0    1    1    1    0    1    1    1    1    0    1    0    0    1    1 
##  108  110  111  112  113  115  119  120  121  124  126  127  128  129  130 
##    1    1    1    0    0    1    0    1    0    1    1    1    1    1    1 
##  132  138  140  141  142  143  144  146  147  149  153  154  155  156  157 
##    0    1    1    1    0    1    1    0    1    0    1    1    1    0    1 
##  158  159  164  167  168  170  171  173  174  175  177  178  180  182  185 
##    0    1    1    1    1    1    1    1    1    0    1    1    1    0    1 
##  187  188  189  190  192  193  195  196  198  200  202  204  206  208  209 
##    1    1    1    1    0    1    0    0    1    1    1    0    0    1    1 
##  213  214  217  218  219  221  227  228  230  231  234  236  238  240  243 
##    1    1    1    1    0    1    1    1    0    1    1    1    1    0    0 
##  249  251  252  253  258  261  262  263  266  269  274  275  285  286  287 
##    1    1    1    0    1    1    1    1    1    1    0    0    1    0    0 
##  288  289  290  292  293  294  296  300  302  304  308  309  310  313  314 
##    0    1    1    0    0    0    0    1    0    1    1    1    1    1    1 
##  316  317  320  321  322  323  324  326  329  330  331  333  335  336  337 
##    0    1    1    0    1    0    1    1    1    1    1    0    0    1    1 
##  339  340  341  342  343  344  345  347  348  350  352  354  356  360  363 
##    1    1    0    0    1    1    1    1    1    1    1    1    0    0    1 
##  365  368  369  370  375  376  379  382  384  388  389  392  393  394  396 
##    1    0    0    1    0    0    0    0    1    1    1    1    0    1    0 
##  397  398  399  406  408  410  411  417  423  426  430  432  433  435  439 
##    1    0    1    1    1    1    0    1    1    0    0    1    1    1    0 
##  440  442  443  445  447  448  450  455  457  458  459  461  462  463  466 
##    1    1    1    0    0    1    1    0    1    1    1    0    1    1    1 
##  467  471  472  473  475  476  478  479  480  481  482  483  486  492  495 
##    0    1    0    1    1    0    1    1    1    1    1    1    1    0    1 
##  497  499  500  501  502  503  504  505  507  508  511  513  514  516  517 
##    0    1    1    0    0    1    1    0    1    1    1    1    1    1    1 
##  519  522  523  525  526  529  530  531  532  536  538  539  540  541  544 
##    1    1    0    1    0    0    1    0    0    1    1    0    1    1    1 
##  546  549  553  554  555  556  557  559  560  562  563  566  567  570  571 
##    0    1    0    1    1    1    1    1    1    0    1    0    1    0    0 
##  574  575  577  579  581  582  584  586  587  588  589  590  591  594  596 
##    1    1    1    0    1    1    0    0    1    1    0    1    1    0    1 
##  597  598  601  602  603  605  606  608  611  612  613  614  618  619  621 
##    0    1    1    1    0    1    1    0    0    1    1    0    1    0    1 
##  624  625  627  628  631  632  635  640  641  642  645  647  649  650  651 
##    1    1    1    1    0    1    0    0    0    1    1    0    1    0    0 
##  652  653  654  656  657  659  660  661  664  665  667  669  670  678  679 
##    1    1    0    1    1    0    1    1    1    1    1    1    1    0    1 
##  685  688  690  691  692  693  697  700  702  703  704  705  707  708  709 
##    0    0    1    1    1    1    1    1    0    1    1    0    0    1    1 
##  710  712  714  715  720  721  722  723  724  728  729  730  731  732  733 
##    1    0    1    0    1    1    1    1    1    0    0    1    1    0    1 
##  737  738  740  741  742  744  746  747  748  751  752  753  757  760  762 
##    0    1    0    1    1    1    1    0    1    1    0    1    1    1    0 
##  763  766  767  769  772  775  778  780  781  783  784  786  789  790  791 
##    1    1    0    1    0    1    1    1    0    0    1    1    0    0    1 
##  794  802  803  806  807  809  810  811  812  813  814  815  816  819  820 
##    1    1    0    0    1    0    0    1    1    0    0    0    0    1    1 
##  822  823  824  826  827  832  833  835  836  839  841  849  850  851  854 
##    1    0    1    1    0    0    0    1    1    1    0    1    1    0    0 
##  859  863  867  870  872  873  875  876  877  879  885  886  888  891  893 
##    0    0    1    1    1    1    1    1    1    1    1    0    0    1    1 
##  894  897  900  901  906  912  915  916  918  919  920  923  924  925  926 
##    0    0    1    1    1    0    0    0    1    1    0    1    1    1    1 
##  927  928  930  931  932  935  936  937  938  939  945  946  947  951  952 
##    0    0    1    1    1    1    1    1    1    0    1    0    1    1    0 
##  953  955  956  958  959  960  962  965  967  970  971  973  974  976  977 
##    1    1    1    1    0    1    1    1    1    1    1    0    0    1    1 
##  980  981  983  984  986  987  989  990  994  997  999 1000 
##    1    1    1    0    0    1    0    1    0    0    0    0
misClasificError <- mean(fitted.results != y)
1-misClasificError
## [1] 0.6800766

The fitted model is 68.00% accurate.

#Roc Curve
library(ROCR)
## Warning: package 'ROCR' was built under R version 3.4.3
## Loading required package: gplots
## Warning: package 'gplots' was built under R version 3.4.3
## 
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
## 
##     lowess
pred=predict(fit,type="response")
pr <- prediction(pred, riskdat1$Risk)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf)

#Area Under the Curve (AUC)
library(ROCR)
auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]
auc
## [1] 0.7202883
#Tjur Statistic
tjur= mean(pred[which(y==1)])-mean(pred[which(y==0)])
tjur
## [1] 0.1461676

The value of TJUR statistic reveals that the model is a moderate fit to the data.

DISCUSSION:

The logistic regression fits well to the credit risk data since the response variable is binary. The prdictor varaibles are: 1. Age 2. Sex 3. Saving.accounts 4. Checking.account 5. Duration 6. Purpose 7. Job 8. Housing

The mentioned variables have been used to predict the response ( whether giving away a credit card a good risk or a bad one).

INFERENCES:

Following conclusions have been noted:

  1. Some of the entries aren’t complete. There are a lot of missing values in the data. Therefore a new dataset “riskdat1” has been created to study the prediction.

  2. Most of the data is categorical which is one of the main reason for using Logistic regression instead of Linear Regression.

  3. The distribution of various variables have been shown above to have a glipmse of how the distribution look like.

  4. It is observed that the people who own houses are highly correlated with the good risk.

  5. Distribution of “Housing by Credit amount” reveals that the highest values come from Housing category “free”.

  6. A lot of explorations of the data have been done using one-way, two-way contingency tables and a few histograms.

  7. Highest Credit amounts have been taken for longer duration. The highest density is between [121824] months which makes sense.

  8. Correlation matrix and corrplot have been constructed to study the correlation in the data where the darkest colour shows high correlation betweem the variables involved.

  9. The final fitted model is a moderate fit to the data. This can be infered from various tests like Wald test, Accuracy,McFadden test and TJUR statistic. The model is found to be 68.00% accurate.

REFERENCES:

  1. https://www.theanalysisfactor.com/sensitivity-and-specificity/

  2. https://www.analyticsvidhya.com/blog/2015/11/beginners-guide-on-logistic-regression-in-r/

  3. https://www.kaggle.com/kabure/german-credit-risk-financial-eda

  4. https://en.wikipedia.org/wiki/Logistic_regression

  5. https://www.r-bloggers.com/how-to-perform-a-logistic-regression-in-r/