Two datasets are provided. the original dataset, in the form provided by Prof. Hofmann, contains categorical/symbolic attributes and is in the file “german.data”.
For algorithms that need numerical attributes, Strathclyde University produced the file “german.data-numeric”
I decided to use the orginal data file(the original data file). There are 20 columns and the column descriptions are provided in german.doc. I created a vector for all headers listed in the german.doc.
Below R code snippet reads data from the german.data data file, creates a dataframe with prepared header values.
germanCreditDF <- read.delim("C:/Users/Charls/Documents/CunyMSDS/606-Statistical Analsyis/final-project/data/german.data.txt", sep = " ", header = F)
column_headers <- c("checkingAcc", "Duration", "Credit_Hist", "Purpose" ,"Credit_Amt" , "SavingsAcc" , "Employment_Stat", "Installment_rate", "personal_stat" , "deptor_stat" ,"residence_in_years" ,"Property" , "age","other_instalment_plans" , "Housing" , "no_of_credits" ,"Job_type" , "no_liable" , "Telephone" ,
"foreign_worker", "Customer_class" )
colnames(germanCreditDF) <- column_headers
The structure of the dataframe is as below. Along with 20 feature metrics, we have a response variable called Customer_class ( Value = 1 means Good, Value = 2 means Bad Credit risk).
str(germanCreditDF)
## 'data.frame': 1000 obs. of 21 variables:
## $ checkingAcc : Factor w/ 4 levels "A11","A12","A13",..: 1 2 4 1 1 4 4 2 4 2 ...
## $ Duration : int 6 48 12 42 24 36 24 36 12 30 ...
## $ Credit_Hist : Factor w/ 5 levels "A30","A31","A32",..: 5 3 5 3 4 3 3 3 3 5 ...
## $ Purpose : Factor w/ 10 levels "A40","A41","A410",..: 5 5 8 4 1 8 4 2 5 1 ...
## $ Credit_Amt : int 1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
## $ SavingsAcc : Factor w/ 5 levels "A61","A62","A63",..: 5 1 1 1 1 5 3 1 4 1 ...
## $ Employment_Stat : Factor w/ 5 levels "A71","A72","A73",..: 5 3 4 4 3 3 5 3 4 1 ...
## $ Installment_rate : int 4 2 2 2 3 2 3 2 2 4 ...
## $ personal_stat : Factor w/ 4 levels "A91","A92","A93",..: 3 2 3 3 3 3 3 3 1 4 ...
## $ deptor_stat : Factor w/ 3 levels "A101","A102",..: 1 1 1 3 1 1 1 1 1 1 ...
## $ residence_in_years : int 4 2 3 4 4 4 4 2 4 2 ...
## $ Property : Factor w/ 4 levels "A121","A122",..: 1 1 1 2 4 4 2 3 1 3 ...
## $ age : int 67 22 49 45 53 35 53 35 61 28 ...
## $ other_instalment_plans: Factor w/ 3 levels "A141","A142",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ Housing : Factor w/ 3 levels "A151","A152",..: 2 2 2 3 3 3 2 1 2 2 ...
## $ no_of_credits : int 2 1 1 1 2 1 1 1 1 2 ...
## $ Job_type : Factor w/ 4 levels "A171","A172",..: 3 3 2 3 3 2 3 4 2 4 ...
## $ no_liable : int 1 1 2 2 2 2 1 1 1 1 ...
## $ Telephone : Factor w/ 2 levels "A191","A192": 2 1 1 1 1 2 1 2 1 1 ...
## $ foreign_worker : Factor w/ 2 levels "A201","A202": 1 1 1 1 1 1 1 1 1 1 ...
## $ Customer_class : int 1 2 1 1 2 1 1 1 1 2 ...
Find the signifance of few quantitive and qualitative variables( Duration , Credit_Amt , age , foreign_worker , Employment_Stat , Purpose and credit_hist) on the response variable. Using backword elimination process, find out the most significant variable which is highly correlated to the response variable.
From the inference analysis, we noted that there is an anomaly on credit_hist. Critical credits having a higher approval rate and considered to be a good customer. Find what is the reason behind this observation ?
Try different machine learning models to predict the customer class based on most significant variables identified above and guage their accuracy using confusion matrix and other means.
The cases are the observations on different customers who appied for a credit in a german bank and bank’s decision weather the credit would be approved or not. The observations are the details regarding like their credit history, employment, credit amount, purpose and bank details.
There are 1000 cases in the dataset for this research activity.
The data is available in UCI Machine Learning Repository : http://archive.ics.uci.edu/ml/datasets.html and is downloaded to perform the research study on the regression.
This is an observational study, since we are doing analysis on the dataset collected as observation
The data is available in UCI Machine Learning Repository : http://archive.ics.uci.edu/ml/datasets/Statlog+%28German+Credit+Data%29
Dua, D. and Karra Taniskidou, E. (2017). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
Customer_class is the response variable. This is qualititive variable whether a bank is considering his credit application or not. If the bank approves, the customer is classifies as good, otherwise not.
There are 20 independent Variables including the quantitative and qualitative variables.
Credit_Amt: (numerical) + Credit amount
Housing: (qualitative) + Housing - A151 : rent - A152 : own - A153 : for free
Provide summary statistics for each the variables. Also include appropriate visualizations related to your research question (e.g. scatter plot, boxplots, etc). This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.
Lets look from some basic inference by looking at the distribution and the box plot graph. Based on this analysis, we will derive our research questions.
Firstly we will look for distribution of some of the quantitive variables.
Duration
Below is the histogram and the box plot of the ‘Duration’ variable. The histogram shows it is slightly right skewed, the box plot shows that the median for good customer class is less than the bad customer class. However there are presence of outliers for good customer class which also means that there can be some other factors which co-relates to the response variable.
With respect to the spread, the good customer class is more condensed than the bad customer class. It doesnt show a much of significance towards the response variable from the box plots. Let’s confirm if the Duration is statistically significant to the response variable using the hypothesis test.
hist(germanCreditDF$Duration)
germanCreditDF$class <- sapply(germanCreditDF$Customer_class, function(x){
switch(as.character(x), "1" = "Good", "2" = "Bad")
})
library(ggplot2)
ggplot(germanCreditDF, aes(x=class, y=Duration)) +
geom_boxplot(color="red", fill="orange", alpha=0.2)
Let’s define the null and alternate hypotheis in terms of median since the distribution is skewed.
h0 -> good customer’s median of Duration is same as the bad customer.
ha -> goodcustomer’s median of Duration is different than the bad customer.
Conclusion: since the p-val is less than .05, the null hypothesis can be rejected. In short, using hypothesis test, we see that the Duration is statistically significant to the response variable.
source('http://www.openintro.org/stat/slides/inference.R')
inference(y = germanCreditDF$Duration, x = germanCreditDF$class, est = "median", type = "ht", null = 0,
alternative = "twosided", method = "simulation")
## Response variable: numerical, Explanatory variable: categorical
## Difference between two medians
## Summary statistics:
## n_Bad = 300, median_Bad = 24, n_Good = 700, median_Good = 18,
## Observed difference between medians (Bad-Good) = 6
## H0: median_Bad - median_Good = 0
## HA: median_Bad - median_Good != 0
## p-value = 0
Lets move on to some other Quantitive variables.
Credit_Amt
As per the box plot, we dont see much of significance. The median is more or less same, The good customer is more condensed than the bad customer class. Noted that the there are more outliers for good customer class.
Let’s confirm if the Credit_Amt is statistically significant to the response variable using the hypothesis test.
# Quantitive variable comparison(using Boxplots)
hist(germanCreditDF$Credit_Amt)
# Set a unique color with fill, colour, and alpha
ggplot(germanCreditDF, aes(x=class, y=Credit_Amt)) +
geom_boxplot(color="red", fill="orange", alpha=0.2)
Let’s define the null and alternate hypotheis in terms of median since the distribution is skewed.
h0 -> good customer’s median of Credit_Amt is same as the bad customer.
ha -> good customer’s median of Credit_Amt is different than the bad customer.
Conclusion: since the p-val is less than .05, the null hypothesis can be rejected. In short, using hypothesis test, we see that the Credit_Amt is statistically significant to the response variable.
inference(y = germanCreditDF$Credit_Amt, x = germanCreditDF$class, est = "median", type = "ht", null = 0,
alternative = "twosided", method = "simulation")
## Response variable: numerical, Explanatory variable: categorical
## Difference between two medians
## Summary statistics:
## n_Bad = 300, median_Bad = 2574.5, n_Good = 700, median_Good = 2244,
## Observed difference between medians (Bad-Good) = 330.5
## H0: median_Bad - median_Good = 0
## HA: median_Bad - median_Good != 0
## p-value = 0.0486
no_of_credits
The box plot for both class looks same. Hence I’m assuming that there is no significance of no_of_credits towards the response variable.
hist(germanCreditDF$no_of_credits)
ggplot(germanCreditDF, aes(x=class, y=no_of_credits)) +
geom_boxplot(color="red", fill="orange", alpha=0.2)
According to the hypothesis test, p-val is greater than the the .05, so we can conclude that there is no significance of no_of_credits towards the response variable.
inference(y = germanCreditDF$no_of_credits, x = germanCreditDF$class, est = "mean", type = "ht", null = 0,
alternative = "twosided", method = "theoretical")
## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_Bad = 300, mean_Bad = 1.3667, sd_Bad = 0.5597
## n_Good = 700, mean_Good = 1.4243, sd_Good = 0.5847
## Observed difference between means (Bad-Good) = -0.0576
## H0: mu_Bad - mu_Good = 0
## HA: mu_Bad - mu_Good != 0
## Standard error = 0.039
## Test statistic: Z = -1.472
## p-value = 0.141
age
Based on boxplot, we dont see much of a significance of age too on both group since median is more or less same. And spread is overlapping. But according to the hypotheisis test, we can reject the null hypothesis and there is a significance of age to the response variable.
hist(germanCreditDF$age)
ggplot(germanCreditDF, aes(x=class, y=age)) +
geom_boxplot(color="red", fill="orange", alpha=0.2)
inference(y = germanCreditDF$age, x = germanCreditDF$class, est = "median", type = "ht", null = 0,
alternative = "twosided", method = "simulation")
## Response variable: numerical, Explanatory variable: categorical
## Difference between two medians
## Summary statistics:
## n_Bad = 300, median_Bad = 31, n_Good = 700, median_Good = 34,
## Observed difference between medians (Bad-Good) = -3
## H0: median_Bad - median_Good = 0
## HA: median_Bad - median_Good != 0
## p-value = 0.0254
We are running out of all quantitive variables. Lets look at some of the qualitative variables by looking at the contigency tables, contitional probabilties(column wise) and few stacked histograms.
foreign_worker
The conditinal probabilty says that the although the number of non-foreign applicants is less, the probabily of approving their application is more compaired to the foreign applicants. However noted that there are more number of foreign applicants.
# Qualitive variable comparison(using congidency table)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Contigency table
foreignDF <- germanCreditDF %>% select(class, foreign_worker)
table(foreignDF)
## foreign_worker
## class A201 A202
## Bad 296 4
## Good 667 33
# Contitional probabily table
prop.table(table(foreignDF), 2)
## foreign_worker
## class A201 A202
## Bad 0.3073728 0.1081081
## Good 0.6926272 0.8918919
# side barplot of contigency table
barplot(table(foreignDF), beside = T, legend = T, main = "foreign v/s customer class", col=c("coral", "aquamarine3"), xlab = "foreign_worker - A201 : yes - A202 : no")
Employment_Stat
Unemployment and less no of years shows lower approval rate. So there looks to be some relation exist.
# Qualitive variable comparison(using congidency table)
library(dplyr)
# Contigency table
employmentDF <- germanCreditDF %>% select(class, Employment_Stat)
table(employmentDF)
## Employment_Stat
## class A71 A72 A73 A74 A75
## Bad 23 70 104 39 64
## Good 39 102 235 135 189
# Contitional probabily table
prop.table(table(employmentDF), 2)
## Employment_Stat
## class A71 A72 A73 A74 A75
## Bad 0.3709677 0.4069767 0.3067847 0.2241379 0.2529644
## Good 0.6290323 0.5930233 0.6932153 0.7758621 0.7470356
# side barplot of contigency table
barplot(table(employmentDF), beside = T, legend = T, main = "Employment v/s customer class", col=c("coral", "aquamarine3"))
Purpose
Contitional probabilty says that the used cars application as higher approval rate. and education credit application has more or less same chances to get approved or rejected. So there looks to be some relation exist.
Here is label defination for each Purpose.
# Qualitive variable comparison(using congidency table)
library(dplyr)
# Contigency table
PurposeDF <- germanCreditDF %>% select(class, Purpose)
table(PurposeDF)
## Purpose
## class A40 A41 A410 A42 A43 A44 A45 A46 A48 A49
## Bad 89 17 5 58 62 4 8 22 1 34
## Good 145 86 7 123 218 8 14 28 8 63
# Contitional probabily table
prop.table(table(PurposeDF), 2)
## Purpose
## class A40 A41 A410 A42 A43 A44
## Bad 0.3803419 0.1650485 0.4166667 0.3204420 0.2214286 0.3333333
## Good 0.6196581 0.8349515 0.5833333 0.6795580 0.7785714 0.6666667
## Purpose
## class A45 A46 A48 A49
## Bad 0.3636364 0.4400000 0.1111111 0.3505155
## Good 0.6363636 0.5600000 0.8888889 0.6494845
# side barplot of contigency table
barplot(table(PurposeDF), beside = T, legend = T, main = "Purpose v/s customer class", col=c("coral", "aquamarine3"))
Credit_Hist
‘A34 : critical account/other credits existing’ is having higher approval rate which looks strange. Either there is a data anomaly or some other relation is overriding the significance of this variable.
‘A33 : delay in paying off in the past’ shows higher approval rate.
# Qualitive variable comparison(using congidency table)
# Contigency table
Credit_HistDF <- germanCreditDF %>% select(class, Credit_Hist)
table(Credit_HistDF)
## Credit_Hist
## class A30 A31 A32 A33 A34
## Bad 25 28 169 28 50
## Good 15 21 361 60 243
# Contitional probabily table
prop.table(table(Credit_HistDF), 2)
## Credit_Hist
## class A30 A31 A32 A33 A34
## Bad 0.6250000 0.5714286 0.3188679 0.3181818 0.1706485
## Good 0.3750000 0.4285714 0.6811321 0.6818182 0.8293515
# side barplot of contigency table
barplot(table(Credit_HistDF), beside = T, legend = T, main = "Credit_History v/s customer class", col=c("coral", "aquamarine3"))