GERMAN CREDIT RISK

Credit risk

Credit Risk is the probable risk of loss resulting from a borrower’s failure to repay a loan or meet contractual obligations. If a company offers credit to its client,then there is a risk that its clients may not pay their invoices.

Types of Credit Risk

Good Risk: An investment that one believes is likely to be profitable. The term most often refers to a loan made to a creditworthy person or company. Good risks are considered exceptionally likely to be repaid.

Bad Risk: A loan that is unlikely to be repaid because of bad credit history, insufficient income, or some other reason. A bad risk increases the risk to the lender and the likelihood of default on the part of the borrower.

Objective:

Based on the attributes, classify a person as good or bad credit risk.

Dataset Description:

The German Credit data set (available at ftp.ics.uci.edu/pub/machine-learning-databases/statlog/) contains observations on 30 variables for 1000 past applicants for credit. Each applicant was rated as “good credit” (700 cases) or “bad credit” (300 cases). The dataset contains 1000 entries with 20 independent variables (7 numerical, 13 categorical) and 1 target variable prepared by Prof. Hofmann. In this dataset, each entry represents a person who takes a credit by a bank. Each person is classified as good or bad credit risks according to the set of attributes.The attributes are:

Attribute 1: (qualitative) Status of existing checking account A11 : … < 0 DM A12 : 0 <= … < 200 DM A13 : … >= 200 DM / salary assignments for at least 1 year A14 : no checking account

Attribute 2: (numerical) Duration in month

Attribute 3: (qualitative) Credit history A30 : no credits taken/all credits paid back duly A31 : all credits at this bank paid back duly A32 : existing credits paid back duly till now A33 : delay in paying off in the past A34 : critical account/other credits existing (not at this bank)

Attribute 4: (qualitative) Purpose A40 : car (new) A41 : car (used) A42 : furniture/equipment A43 : radio/television A44 : domestic appliances A45 : repairs A46 : education A47 : vacation A48 : retraining A49 : business A410 : others

Attribute 5: (numerical) Credit amount

Attibute 6: (qualitative) Savings account/bonds A61 : … < 100 DM A62 : 100 <= … < 500 DM A63 : 500 <= … < 1000 DM A64 : .. >= 1000 DM A65 : unknown/ no savings account

Attribute 7: (qualitative) Present employment since A71 : unemployed A72 : … < 1 year A73 : 1 <= … < 4 years A74 : 4 <= … < 7 years A75 : .. >= 7 years

Attribute 8: (numerical) Installment rate in percentage of disposable income Attribute 9: (qualitative) Personal status and sex A91 : male : divorced/separated A92 : female : divorced/separated/married A93 : male : single A94 : male : married/widowed A95 : female : single

Attribute 10: (qualitative) Other debtors / guarantors A101 : none A102 : co-applicant A103 : guarantor

Attribute 11: (numerical) Present residence since

Attribute 12: (qualitative) Property A121 : real estate A122 : if not A121 : building society savings agreement/life insurance A123 : if not A121/A122 : car or other, not in attribute 6 A124 : unknown / no property

Attribute 13: (numerical) Age in years

Attribute 14: (qualitative) Other installment plans A141 : bank A142 : stores A143 : none

Attribute 15: (qualitative) Housing A151 : rent A152 : own A153 : for free

Attribute 16: (numerical) Number of existing credits at this bank

Attribute 17: (qualitative) Job A171 : unemployed/ unskilled - non-resident A172 : unskilled - resident A173 : skilled employee / official A174 : management/ self-employed/highly qualified employee/ officer

Attribute 18: (numerical) Number of people being liable to provide maintenance for

Attribute 19: (qualitative) Telephone A191 : none A192 : yes, registered under the customers name

Attribute 20: (qualitative) foreign worker A201 : yes A202 : no

Target Variable

Cost Matrix

1 = Good Risk

2 = Bad

Libraries Used

library(readr)

# The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.

#  The easiest way to get ggplot2 is to install the whole tidyverse:
# install.packages("tidyverse")
library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.0.2

## -- Attaching packages --------------------------------------------------------------------------------- tidyverse 1.3.0 --

## v ggplot2 3.3.1     v dplyr   1.0.0
## v tibble  3.0.1     v stringr 1.4.0
## v tidyr   1.1.0     v forcats 0.5.0
## v purrr   0.3.4

## Warning: package 'stringr' was built under R version 4.0.2

## -- Conflicts ------------------------------------------------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

#ggrepel: Automatically Position Non-Overlapping Text Labels with 'ggplot2'
#Provides text and label geoms for 'ggplot2' that help to avoid overlapping text labels. Labels repel away from each other and away from the data points.
library(ggrepel)

## Warning: package 'ggrepel' was built under R version 4.0.2

# Create interactive web graphics from 'ggplot2' graphs
library(plotly)

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

library(highcharter)

## Warning: package 'highcharter' was built under R version 4.0.2

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

## Highcharts (www.highcharts.com) is a Highsoft software product which is

## not free for commercial and Governmental use

library(dplyr)
library(ggthemes)

## Warning: package 'ggthemes' was built under R version 4.0.2

library(ggrepel)
library(stringr)
library(magrittr)

## Warning: package 'magrittr' was built under R version 4.0.2

## 
## Attaching package: 'magrittr'

## The following object is masked from 'package:purrr':
## 
##     set_names

## The following object is masked from 'package:tidyr':
## 
##     extract

library(psych)

## Warning: package 'psych' was built under R version 4.0.2

## 
## Attaching package: 'psych'

## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha

library(rstatix)

## Warning: package 'rstatix' was built under R version 4.0.2

## 
## Attaching package: 'rstatix'

## The following object is masked from 'package:stats':
## 
##     filter

library(knitr)
library(dplyr)
library(tidyr)
library(reshape2)

## 
## Attaching package: 'reshape2'

## The following object is masked from 'package:tidyr':
## 
##     smiths

library(RColorBrewer)
library(GGally)

## Warning: package 'GGally' was built under R version 4.0.2

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

library(ggplot2)
library(caret)

## Warning: package 'caret' was built under R version 4.0.2

## Loading required package: lattice

## 
## Attaching package: 'caret'

## The following object is masked from 'package:purrr':
## 
##     lift

library(glmnet)

## Warning: package 'glmnet' was built under R version 4.0.2

## Loading required package: Matrix

## 
## Attaching package: 'Matrix'

## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack

## Loaded glmnet 4.0-2

library(boot)

## Warning: package 'boot' was built under R version 4.0.2

## 
## Attaching package: 'boot'

## The following object is masked from 'package:lattice':
## 
##     melanoma

## The following object is masked from 'package:psych':
## 
##     logit

library(verification)

## Warning: package 'verification' was built under R version 4.0.2

## Loading required package: fields

## Warning: package 'fields' was built under R version 4.0.2

## Loading required package: spam

## Warning: package 'spam' was built under R version 4.0.2

## Loading required package: dotCall64

## Warning: package 'dotCall64' was built under R version 4.0.2

## Loading required package: grid

## Spam version 2.5-1 (2019-12-12) is loaded.
## Type 'help( Spam)' or 'demo( spam)' for a short introduction 
## and overview of this package.
## Help for individual functions is also obtained by adding the
## suffix '.spam' to the function name, e.g. 'help( chol.spam)'.

## 
## Attaching package: 'spam'

## The following object is masked from 'package:Matrix':
## 
##     det

## The following objects are masked from 'package:base':
## 
##     backsolve, forwardsolve

## Loading required package: maps

## Warning: package 'maps' was built under R version 4.0.2

## 
## Attaching package: 'maps'

## The following object is masked from 'package:purrr':
## 
##     map

## See https://github.com/NCAR/Fields for
##  an extensive vignette, other supplements and source code

## 
## Attaching package: 'fields'

## The following object is masked from 'package:psych':
## 
##     describe

## Loading required package: CircStats

## Warning: package 'CircStats' was built under R version 4.0.2

## Loading required package: MASS

## 
## Attaching package: 'MASS'

## The following object is masked from 'package:rstatix':
## 
##     select

## The following object is masked from 'package:plotly':
## 
##     select

## The following object is masked from 'package:dplyr':
## 
##     select

## 
## Attaching package: 'CircStats'

## The following object is masked from 'package:psych':
## 
##     r.test

## Loading required package: dtw

## Warning: package 'dtw' was built under R version 4.0.2

## Loading required package: proxy

## Warning: package 'proxy' was built under R version 4.0.2

## 
## Attaching package: 'proxy'

## The following object is masked from 'package:spam':
## 
##     as.matrix

## The following object is masked from 'package:Matrix':
## 
##     as.matrix

## The following objects are masked from 'package:stats':
## 
##     as.dist, dist

## The following object is masked from 'package:base':
## 
##     as.matrix

## Loaded dtw v1.21-3. See ?dtw for help, citation("dtw") for use in publication.

## Registered S3 method overwritten by 'verification':
##   method    from
##   lines.roc pROC

library(ggcorrplot)

## Warning: package 'ggcorrplot' was built under R version 4.0.2

## 
## Attaching package: 'ggcorrplot'

## The following object is masked from 'package:rstatix':
## 
##     cor_pmat

Load the data

german_credit <- read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data")

colnames(german_credit) <- c("chk_acct", "duration", "credit_his", "purpose", 
                            "amount", "saving_acct", "present_emp", "installment_rate", "sex", "other_debtor", 
                            "present_resid", "property", "age", "other_install", "housing", "n_credits", 
                            "job", "n_people", "telephone", "foreign", "response")

german_credit$response <- german_credit$response - 1
german_credit$response <- as.factor(german_credit$response)

glimpse(german_credit)

## Rows: 1,000
## Columns: 21
## $ chk_acct         <chr> "A11", "A12", "A14", "A11", "A11", "A14", "A14", "...
## $ duration         <int> 6, 48, 12, 42, 24, 36, 24, 36, 12, 30, 12, 48, 12,...
## $ credit_his       <chr> "A34", "A32", "A34", "A32", "A33", "A32", "A32", "...
## $ purpose          <chr> "A43", "A43", "A46", "A42", "A40", "A46", "A42", "...
## $ amount           <int> 1169, 5951, 2096, 7882, 4870, 9055, 2835, 6948, 30...
## $ saving_acct      <chr> "A65", "A61", "A61", "A61", "A61", "A65", "A63", "...
## $ present_emp      <chr> "A75", "A73", "A74", "A74", "A73", "A73", "A75", "...
## $ installment_rate <int> 4, 2, 2, 2, 3, 2, 3, 2, 2, 4, 3, 3, 1, 4, 2, 4, 4,...
## $ sex              <chr> "A93", "A92", "A93", "A93", "A93", "A93", "A93", "...
## $ other_debtor     <chr> "A101", "A101", "A101", "A103", "A101", "A101", "A...
## $ present_resid    <int> 4, 2, 3, 4, 4, 4, 4, 2, 4, 2, 1, 4, 1, 4, 4, 2, 4,...
## $ property         <chr> "A121", "A121", "A121", "A122", "A124", "A124", "A...
## $ age              <int> 67, 22, 49, 45, 53, 35, 53, 35, 61, 28, 25, 24, 22...
## $ other_install    <chr> "A143", "A143", "A143", "A143", "A143", "A143", "A...
## $ housing          <chr> "A152", "A152", "A152", "A153", "A153", "A153", "A...
## $ n_credits        <int> 2, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 2,...
## $ job              <chr> "A173", "A173", "A172", "A173", "A173", "A172", "A...
## $ n_people         <int> 1, 1, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ telephone        <chr> "A192", "A191", "A191", "A191", "A191", "A192", "A...
## $ foreign          <chr> "A201", "A201", "A201", "A201", "A201", "A201", "A...
## $ response         <fct> 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0,...

Look at the dimensions and the structure of the data.

dim(german_credit)

## [1] 1000   21

str(german_credit)

## 'data.frame':    1000 obs. of  21 variables:
##  $ chk_acct        : chr  "A11" "A12" "A14" "A11" ...
##  $ duration        : int  6 48 12 42 24 36 24 36 12 30 ...
##  $ credit_his      : chr  "A34" "A32" "A34" "A32" ...
##  $ purpose         : chr  "A43" "A43" "A46" "A42" ...
##  $ amount          : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
##  $ saving_acct     : chr  "A65" "A61" "A61" "A61" ...
##  $ present_emp     : chr  "A75" "A73" "A74" "A74" ...
##  $ installment_rate: int  4 2 2 2 3 2 3 2 2 4 ...
##  $ sex             : chr  "A93" "A92" "A93" "A93" ...
##  $ other_debtor    : chr  "A101" "A101" "A101" "A103" ...
##  $ present_resid   : int  4 2 3 4 4 4 4 2 4 2 ...
##  $ property        : chr  "A121" "A121" "A121" "A122" ...
##  $ age             : int  67 22 49 45 53 35 53 35 61 28 ...
##  $ other_install   : chr  "A143" "A143" "A143" "A143" ...
##  $ housing         : chr  "A152" "A152" "A152" "A153" ...
##  $ n_credits       : int  2 1 1 1 2 1 1 1 1 2 ...
##  $ job             : chr  "A173" "A173" "A172" "A173" ...
##  $ n_people        : int  1 1 2 2 2 2 1 1 1 1 ...
##  $ telephone       : chr  "A192" "A191" "A191" "A191" ...
##  $ foreign         : chr  "A201" "A201" "A201" "A201" ...
##  $ response        : Factor w/ 2 levels "0","1": 1 2 1 1 2 1 1 1 1 2 ...

Exploratory data analysis for Continuous Variables

BOXPLOT FOR AGE

ggplot(german_credit,aes(x=as.factor(response),y=age))+ 
  geom_boxplot(fill="blue") + 
  labs(title='Age by risk classification',x='Risk Classification(1 for bad risk, 0 for good risk)',y='Age(in years)')

Median age for people classified as bad is lower. Hence, younger people have more chance of being a bad customer and pose higher credit risk.

Installment Rate variable

ggplot(german_credit, aes(factor(installment_rate), ..count..)) + 
  geom_bar(aes(fill = response), position = "dodge") + xlab("Installment Rates") +
  scale_colour_brewer( palette="Set1")

The installment_rate variable has a great deal of difference between the good and bad records, we see that bad records have almost the double median value than good ones.

Age variable

ggplot(melt(german_credit[,c(13,21)]), aes(x = variable, y = value, fill = response)) + 
geom_boxplot()+ xlab("response") + ylab("age")

## Using response as id variables

From the age variable, we see that the median value for bad records is lesser than that of good records, it might be premature to say young people tend to have bad credit records, but we can safely assume it tends to be riskier.

Working with HTML Widgets and Highcharter

# load required packages
library(highcharter)
hc <- german_credit %>%
  hchart('column', hcaes(x = amount, y = duration))

## Warning: `parse_quosure()` is deprecated as of rlang 0.2.0.
## Please use `parse_quo()` instead.
## This warning is displayed once per session.

## Warning: `group_by_()` is deprecated as of dplyr 0.7.0.
## Please use `group_by()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.

## Warning: `select_()` is deprecated as of dplyr 0.7.0.
## Please use `select()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.

## Warning: `as_data_frame()` is deprecated as of tibble 2.0.0.
## Please use `as_tibble()` instead.
## The signature and semantics have changed, see `?as_tibble`.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.

hc

ggplot(german_credit, aes(credit_his, ..count..)) + 
  geom_bar(aes(fill = response), position = "dodge")

For credit_his, we observe that proportion of the response variable varies significantly, for categories A30, A31 we see the number of bad credit records are greater.

Correlation Between Numerical variable

factors.credit <- c(1,3,4,6,7,9,10,12,14,15,17,19,20)
german_credit[,factors.credit] <- lapply(german_credit[,factors.credit], factor)
num.credit <- c(2,5,8,11,13,16,18,21)
german_credit[num.credit] <- sapply(german_credit[num.credit], as.numeric)
cor.credit <- cor(german_credit[num.credit])
ggcorrplot(cor.credit,hc.order = TRUE,lab=TRUE) + ggtitle("Correlation between numeric variables") + theme(axis.text.x = element_text(angle = 90, vjust=0.5))

Variables amount and duration have a positive correlation with response, while age impacts it negatively. Hence higher amount and duration imply the value of response will be be higher i.e. closer to 1. Hence, it increases the chance of high credit risk i.e. bad customer.

FINAL PROJECT

SALIFOU SYLLA

JULY 10, 2020