DATA 606 Data Project Proposal

Load packages

Data Preparation

Two datasets are provided. the original dataset, in the form provided by Prof. Hofmann, contains categorical/symbolic attributes and is in the file “german.data”.

For algorithms that need numerical attributes, Strathclyde University produced the file “german.data-numeric”

I decided to use the orginal data file(the original data file). There are 20 columns and the column descriptions are provided in german.doc. I created a vector for all headers listed in the german.doc.

Below R code snippet reads data from the german.data data file, creates a dataframe with prepared header values.

germanCreditDF <- read.delim("C:/Users/Charls/Documents/CunyMSDS/606-Statistical Analsyis/final-project/data/german.data.txt", sep = " ", header = F)

column_headers <-  c("checkingAcc", "Duration", "Credit_Hist", "Purpose" ,"Credit_Amt" , "SavingsAcc" , "Employment_Stat", "Installment_rate", "personal_stat" , "deptor_stat" ,"residence_in_years" ,"Property" , "age","other_instalment_plans" , "Housing" , "no_of_credits" ,"Job_type" , "no_liable" , "Telephone" , 
"foreign_worker", "Customer_class" )

colnames(germanCreditDF) <- column_headers

The structure of the dataframe is as below. Along with 20 feature metrics, we have a response variable called Customer_class ( Value = 1 means Good, Value = 2 means Bad Credit risk).

str(germanCreditDF)

## 'data.frame':    1000 obs. of  21 variables:
##  $ checkingAcc           : Factor w/ 4 levels "A11","A12","A13",..: 1 2 4 1 1 4 4 2 4 2 ...
##  $ Duration              : int  6 48 12 42 24 36 24 36 12 30 ...
##  $ Credit_Hist           : Factor w/ 5 levels "A30","A31","A32",..: 5 3 5 3 4 3 3 3 3 5 ...
##  $ Purpose               : Factor w/ 10 levels "A40","A41","A410",..: 5 5 8 4 1 8 4 2 5 1 ...
##  $ Credit_Amt            : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
##  $ SavingsAcc            : Factor w/ 5 levels "A61","A62","A63",..: 5 1 1 1 1 5 3 1 4 1 ...
##  $ Employment_Stat       : Factor w/ 5 levels "A71","A72","A73",..: 5 3 4 4 3 3 5 3 4 1 ...
##  $ Installment_rate      : int  4 2 2 2 3 2 3 2 2 4 ...
##  $ personal_stat         : Factor w/ 4 levels "A91","A92","A93",..: 3 2 3 3 3 3 3 3 1 4 ...
##  $ deptor_stat           : Factor w/ 3 levels "A101","A102",..: 1 1 1 3 1 1 1 1 1 1 ...
##  $ residence_in_years    : int  4 2 3 4 4 4 4 2 4 2 ...
##  $ Property              : Factor w/ 4 levels "A121","A122",..: 1 1 1 2 4 4 2 3 1 3 ...
##  $ age                   : int  67 22 49 45 53 35 53 35 61 28 ...
##  $ other_instalment_plans: Factor w/ 3 levels "A141","A142",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ Housing               : Factor w/ 3 levels "A151","A152",..: 2 2 2 3 3 3 2 1 2 2 ...
##  $ no_of_credits         : int  2 1 1 1 2 1 1 1 1 2 ...
##  $ Job_type              : Factor w/ 4 levels "A171","A172",..: 3 3 2 3 3 2 3 4 2 4 ...
##  $ no_liable             : int  1 1 2 2 2 2 1 1 1 1 ...
##  $ Telephone             : Factor w/ 2 levels "A191","A192": 2 1 1 1 1 2 1 2 1 1 ...
##  $ foreign_worker        : Factor w/ 2 levels "A201","A202": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Customer_class        : int  1 2 1 1 2 1 1 1 1 2 ...

Research question

Find the signifance of few quantitive and qualitative variables( Duration , Credit_Amt , age , foreign_worker , Employment_Stat , Purpose and credit_hist) on the response variable. Using backword elimination process, find out the most significant variable which is highly correlated to the response variable.
From the inference analysis, we noted that there is an anomaly on credit_hist. Critical credits having a higher approval rate and considered to be a good customer. Find what is the reason behind this observation ?
Try different machine learning models to predict the customer class based on most significant variables identified above and guage their accuracy using confusion matrix and other means.

Cases

The cases are the observations on different customers who appied for a credit in a german bank and bank’s decision weather the credit would be approved or not. The observations are the details regarding like their credit history, employment, credit amount, purpose and bank details.

There are 1000 cases in the dataset for this research activity.

Data collection

The data is available in UCI Machine Learning Repository : http://archive.ics.uci.edu/ml/datasets.html and is downloaded to perform the research study on the regression.

Type of study

This is an observational study, since we are doing analysis on the dataset collected as observation

Data Source

The data is available in UCI Machine Learning Repository : http://archive.ics.uci.edu/ml/datasets/Statlog+%28German+Credit+Data%29

Dua, D. and Karra Taniskidou, E. (2017). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

Dependent Variable

Customer_class is the response variable. This is qualititive variable whether a bank is considering his credit application or not. If the bank approves, the customer is classifies as good, otherwise not.

Independent Variable

There are 20 independent Variables including the quantitative and qualitative variables.

checkingAcc: (qualitative)
- Status of existing checking account
  - A11 : … < 0 DM
  - A12 : 0 <= … < 200 DM
  - A13 : … >= 200 DM /salary assignments for at least 1 year
  - A14 : no checking account
Duration: (numerical)
- Duration in month
Credit_Hist: (qualitative)
- Credit history
  - A30 : no credits taken/ all credits paid back duly
  - A31 : all credits at this bank paid back duly
  - A32 : existing credits paid back duly till now
  - A33 : delay in paying off in the past
  - A34 : critical account/other credits existing (not at this bank)
Purpose : (qualitative)
- Purpose
  - A40 : car (new)
  - A41 : car (used)
  - A42 : furniture/equipment
  - A43 : radio/television
  - A44 : domestic appliances
  - A45 : repairs
  - A46 : education
  - A47 : (vacation - does not exist?)
  - A48 : retraining
  - A49 : business
  - A410 : others
Credit_Amt: (numerical) + Credit amount
SavingsAcc: (qualitative)
- Savings account/bonds
  - A61 : … < 100 DM
  - A62 : 100 <= … < 500 DM
  - A63 : 500 <= … < 1000 DM
  - A64 : .. >= 1000 DM
  - A65 : unknown/ no savings account
Employment_Stat: (qualitative)
- Present employment since
  - A71 : unemployed
  - A72 : … < 1 year
  - A73 : 1 <= … < 4 years
  - A74 : 4 <= … < 7 years
  - A75 : .. >= 7 years
Installment_rate: (numerical)
- Installment rate in percentage of disposable income
personal_stat: (qualitative)
- Personal status and sex
  - A91 : male : divorced/separated
  - A92 : female : divorced/separated/married
  - A93 : male : single
  - A94 : male : married/widowed
  - A95 : female : single
deptor_stat: (qualitative)
- Other debtors / guarantors
  - A101 : none
  - A102 : co-applicant -A103 : guarantor
residence_in_years: (numerical)
- Present residence since
Property: (qualitative)
- Property
  - A121 : real estate
  - A122 : if not A121 : building society savings agreement/life insurance
  - A123 : if not A121/A122 : car or other, not in attribute 6
  - A124 : unknown / no property
age: (numerical)
- Age in years
other_instalment_plans: (qualitative)
- Other installment plans
  - A141 : bank
  - A142 : stores
  - A143 : none
Housing: (qualitative) + Housing - A151 : rent - A152 : own - A153 : for free
no_of_credits: (numerical)
- Number of existing credits at this bank
Job_type: (qualitative)
- Job
  - A171 : unemployed/ unskilled - non-resident
  - A172 : unskilled - resident
  - A173 : skilled employee / official
  - A174 : management/ self-employed/highly qualified employee/ officer
no_liable: (numerical)
- Number of people being liable to provide maintenance for
Telephone: (qualitative)
- Telephone
  - A191 : none
  - A192 : yes, registered under the customers name
foreign_worker : (qualitative)
- foreign worker
  - A201 : yes
  - A202 : no

Relevant summary statistics

Provide summary statistics for each the variables. Also include appropriate visualizations related to your research question (e.g. scatter plot, boxplots, etc). This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.

Lets look from some basic inference by looking at the distribution and the box plot graph. Based on this analysis, we will derive our research questions.

Firstly we will look for distribution of some of the quantitive variables.

Duration

Below is the histogram and the box plot of the ‘Duration’ variable. The histogram shows it is slightly right skewed, the box plot shows that the median for good customer class is less than the bad customer class. However there are presence of outliers for good customer class which also means that there can be some other factors which co-relates to the response variable.

With respect to the spread, the good customer class is more condensed than the bad customer class. It doesnt show a much of significance towards the response variable from the box plots. Let’s confirm if the Duration is statistically significant to the response variable using the hypothesis test.

hist(germanCreditDF$Duration)
germanCreditDF$class <- sapply(germanCreditDF$Customer_class, function(x){
  switch(as.character(x), "1" = "Good", "2" = "Bad")
})
library(ggplot2)

ggplot(germanCreditDF, aes(x=class, y=Duration)) + 
  geom_boxplot(color="red", fill="orange", alpha=0.2)

Let’s define the null and alternate hypotheis in terms of median since the distribution is skewed.

h0 -> good customer’s median of Duration is same as the bad customer.

ha -> goodcustomer’s median of Duration is different than the bad customer.

Conclusion: since the p-val is less than .05, the null hypothesis can be rejected. In short, using hypothesis test, we see that the Duration is statistically significant to the response variable.

source('http://www.openintro.org/stat/slides/inference.R')
inference(y = germanCreditDF$Duration, x = germanCreditDF$class, est = "median", type = "ht", null = 0, 
          alternative = "twosided", method = "simulation")

## Response variable: numerical, Explanatory variable: categorical
## Difference between two medians
## Summary statistics:
## n_Bad = 300, median_Bad = 24, n_Good = 700, median_Good = 18,

## Observed difference between medians (Bad-Good) = 6
## H0: median_Bad - median_Good = 0 
## HA: median_Bad - median_Good != 0

## p-value =  0

Lets move on to some other Quantitive variables.

Credit_Amt

As per the box plot, we dont see much of significance. The median is more or less same, The good customer is more condensed than the bad customer class. Noted that the there are more outliers for good customer class.

Let’s confirm if the Credit_Amt is statistically significant to the response variable using the hypothesis test.

# Quantitive variable comparison(using Boxplots)

hist(germanCreditDF$Credit_Amt)

# Set a unique color with fill, colour, and alpha
ggplot(germanCreditDF, aes(x=class, y=Credit_Amt)) + 
  geom_boxplot(color="red", fill="orange", alpha=0.2)

Let’s define the null and alternate hypotheis in terms of median since the distribution is skewed.

h0 -> good customer’s median of Credit_Amt is same as the bad customer.

ha -> good customer’s median of Credit_Amt is different than the bad customer.

Conclusion: since the p-val is less than .05, the null hypothesis can be rejected. In short, using hypothesis test, we see that the Credit_Amt is statistically significant to the response variable.

inference(y = germanCreditDF$Credit_Amt, x = germanCreditDF$class, est = "median", type = "ht", null = 0, 
          alternative = "twosided", method = "simulation")

## Response variable: numerical, Explanatory variable: categorical
## Difference between two medians
## Summary statistics:
## n_Bad = 300, median_Bad = 2574.5, n_Good = 700, median_Good = 2244,

## Observed difference between medians (Bad-Good) = 330.5
## H0: median_Bad - median_Good = 0 
## HA: median_Bad - median_Good != 0

## p-value =  0.0486

no_of_credits

The box plot for both class looks same. Hence I’m assuming that there is no significance of no_of_credits towards the response variable.

hist(germanCreditDF$no_of_credits)

ggplot(germanCreditDF, aes(x=class, y=no_of_credits)) + 
  geom_boxplot(color="red", fill="orange", alpha=0.2)

According to the hypothesis test, p-val is greater than the the .05, so we can conclude that there is no significance of no_of_credits towards the response variable.

inference(y = germanCreditDF$no_of_credits, x = germanCreditDF$class, est = "mean", type = "ht", null = 0, 
          alternative = "twosided", method = "theoretical")

## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_Bad = 300, mean_Bad = 1.3667, sd_Bad = 0.5597
## n_Good = 700, mean_Good = 1.4243, sd_Good = 0.5847

## Observed difference between means (Bad-Good) = -0.0576
## H0: mu_Bad - mu_Good = 0 
## HA: mu_Bad - mu_Good != 0 
## Standard error = 0.039 
## Test statistic: Z =  -1.472 
## p-value =  0.141

age

Based on boxplot, we dont see much of a significance of age too on both group since median is more or less same. And spread is overlapping. But according to the hypotheisis test, we can reject the null hypothesis and there is a significance of age to the response variable.

hist(germanCreditDF$age)

ggplot(germanCreditDF, aes(x=class, y=age)) + 
  geom_boxplot(color="red", fill="orange", alpha=0.2)

inference(y = germanCreditDF$age, x = germanCreditDF$class, est = "median", type = "ht", null = 0, 
          alternative = "twosided", method = "simulation")

## Response variable: numerical, Explanatory variable: categorical
## Difference between two medians
## Summary statistics:
## n_Bad = 300, median_Bad = 31, n_Good = 700, median_Good = 34,

## Observed difference between medians (Bad-Good) = -3
## H0: median_Bad - median_Good = 0 
## HA: median_Bad - median_Good != 0

## p-value =  0.0254

We are running out of all quantitive variables. Lets look at some of the qualitative variables by looking at the contigency tables, contitional probabilties(column wise) and few stacked histograms.

foreign_worker

The conditinal probabilty says that the although the number of non-foreign applicants is less, the probabily of approving their application is more compaired to the foreign applicants. However noted that there are more number of foreign applicants.

# Qualitive variable comparison(using congidency table)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

# Contigency table
foreignDF <- germanCreditDF %>% select(class, foreign_worker)
table(foreignDF)

##       foreign_worker
## class  A201 A202
##   Bad   296    4
##   Good  667   33

# Contitional probabily table

prop.table(table(foreignDF), 2)

##       foreign_worker
## class       A201      A202
##   Bad  0.3073728 0.1081081
##   Good 0.6926272 0.8918919

# side barplot of contigency table

barplot(table(foreignDF), beside =  T, legend = T, main = "foreign v/s customer class", col=c("coral", "aquamarine3"), xlab = "foreign_worker - A201 : yes - A202 : no")

Employment_Stat

Unemployment and less no of years shows lower approval rate. So there looks to be some relation exist.

# Qualitive variable comparison(using congidency table)
library(dplyr)
# Contigency table
employmentDF <- germanCreditDF %>% select(class, Employment_Stat)
table(employmentDF)

##       Employment_Stat
## class  A71 A72 A73 A74 A75
##   Bad   23  70 104  39  64
##   Good  39 102 235 135 189

# Contitional probabily table

prop.table(table(employmentDF), 2)

##       Employment_Stat
## class        A71       A72       A73       A74       A75
##   Bad  0.3709677 0.4069767 0.3067847 0.2241379 0.2529644
##   Good 0.6290323 0.5930233 0.6932153 0.7758621 0.7470356

# side barplot of contigency table

barplot(table(employmentDF), beside =  T, legend = T, main = "Employment v/s customer class", col=c("coral", "aquamarine3"))

Purpose

Contitional probabilty says that the used cars application as higher approval rate. and education credit application has more or less same chances to get approved or rejected. So there looks to be some relation exist.

Here is label defination for each Purpose.

Purpose : (qualitative)
- Purpose
  - A40 : car (new)
  - A41 : car (used)
  - A42 : furniture/equipment
  - A43 : radio/television
  - A44 : domestic appliances
  - A45 : repairs
  - A46 : education
  - A47 : (vacation - does not exist?)
  - A48 : retraining
  - A49 : business
  - A410 : others

# Qualitive variable comparison(using congidency table)
library(dplyr)
# Contigency table
PurposeDF <- germanCreditDF %>% select(class, Purpose)
table(PurposeDF)

##       Purpose
## class  A40 A41 A410 A42 A43 A44 A45 A46 A48 A49
##   Bad   89  17    5  58  62   4   8  22   1  34
##   Good 145  86    7 123 218   8  14  28   8  63

# Contitional probabily table

prop.table(table(PurposeDF), 2)

##       Purpose
## class        A40       A41      A410       A42       A43       A44
##   Bad  0.3803419 0.1650485 0.4166667 0.3204420 0.2214286 0.3333333
##   Good 0.6196581 0.8349515 0.5833333 0.6795580 0.7785714 0.6666667
##       Purpose
## class        A45       A46       A48       A49
##   Bad  0.3636364 0.4400000 0.1111111 0.3505155
##   Good 0.6363636 0.5600000 0.8888889 0.6494845

# side barplot of contigency table

barplot(table(PurposeDF), beside =  T, legend = T, main = "Purpose v/s customer class", col=c("coral", "aquamarine3"))

Credit_Hist

‘A34 : critical account/other credits existing’ is having higher approval rate which looks strange. Either there is a data anomaly or some other relation is overriding the significance of this variable.

‘A33 : delay in paying off in the past’ shows higher approval rate.

# Qualitive variable comparison(using congidency table)
# Contigency table
Credit_HistDF <- germanCreditDF %>% select(class, Credit_Hist)
table(Credit_HistDF)

##       Credit_Hist
## class  A30 A31 A32 A33 A34
##   Bad   25  28 169  28  50
##   Good  15  21 361  60 243

# Contitional probabily table

prop.table(table(Credit_HistDF), 2)

##       Credit_Hist
## class        A30       A31       A32       A33       A34
##   Bad  0.6250000 0.5714286 0.3188679 0.3181818 0.1706485
##   Good 0.3750000 0.4285714 0.6811321 0.6818182 0.8293515

# side barplot of contigency table

barplot(table(Credit_HistDF), beside =  T, legend = T, main = "Credit_History v/s customer class", col=c("coral", "aquamarine3"))