Two datasets are provided. The original dataset, in the form provided by Prof. Hofmann, contains categorical/symbolic attributes and is in the file “german.data”.
For algorithms that need numerical attributes, Strathclyde University produced the file “german.data-numeric”.
I decided to use the original data file (german.data). It has 20 attribute columns plus the class column; the column descriptions are provided in german.doc, and I created a vector of header names based on that documentation.
The R code snippet below reads the german.data file and creates a data frame with the prepared header values.
germanCreditDF <- read.delim("C:/Users/Charls/Documents/CunyMSDS/606-Statistical Analsyis/final-project/data/german.data.txt", sep = " ", header = F)
column_headers <- c("checkingAcc", "Duration", "Credit_Hist", "Purpose" ,"Credit_Amt" , "SavingsAcc" , "Employment_Stat", "Installment_rate", "personal_stat" , "deptor_stat" ,"residence_in_years" ,"Property" , "age","other_instalment_plans" , "Housing" , "no_of_credits" ,"Job_type" , "no_liable" , "Telephone" ,
"foreign_worker", "Customer_class" )
colnames(germanCreditDF) <- column_headers
The structure of the data frame is shown below. Along with the 20 feature columns, there is a response variable called Customer_class (value 1 = good credit risk, value 2 = bad credit risk).
str(germanCreditDF)
## 'data.frame': 1000 obs. of 21 variables:
## $ checkingAcc : Factor w/ 4 levels "A11","A12","A13",..: 1 2 4 1 1 4 4 2 4 2 ...
## $ Duration : int 6 48 12 42 24 36 24 36 12 30 ...
## $ Credit_Hist : Factor w/ 5 levels "A30","A31","A32",..: 5 3 5 3 4 3 3 3 3 5 ...
## $ Purpose : Factor w/ 10 levels "A40","A41","A410",..: 5 5 8 4 1 8 4 2 5 1 ...
## $ Credit_Amt : int 1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
## $ SavingsAcc : Factor w/ 5 levels "A61","A62","A63",..: 5 1 1 1 1 5 3 1 4 1 ...
## $ Employment_Stat : Factor w/ 5 levels "A71","A72","A73",..: 5 3 4 4 3 3 5 3 4 1 ...
## $ Installment_rate : int 4 2 2 2 3 2 3 2 2 4 ...
## $ personal_stat : Factor w/ 4 levels "A91","A92","A93",..: 3 2 3 3 3 3 3 3 1 4 ...
## $ deptor_stat : Factor w/ 3 levels "A101","A102",..: 1 1 1 3 1 1 1 1 1 1 ...
## $ residence_in_years : int 4 2 3 4 4 4 4 2 4 2 ...
## $ Property : Factor w/ 4 levels "A121","A122",..: 1 1 1 2 4 4 2 3 1 3 ...
## $ age : int 67 22 49 45 53 35 53 35 61 28 ...
## $ other_instalment_plans: Factor w/ 3 levels "A141","A142",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ Housing : Factor w/ 3 levels "A151","A152",..: 2 2 2 3 3 3 2 1 2 2 ...
## $ no_of_credits : int 2 1 1 1 2 1 1 1 1 2 ...
## $ Job_type : Factor w/ 4 levels "A171","A172",..: 3 3 2 3 3 2 3 4 2 4 ...
## $ no_liable : int 1 1 2 2 2 2 1 1 1 1 ...
## $ Telephone : Factor w/ 2 levels "A191","A192": 2 1 1 1 1 2 1 2 1 1 ...
## $ foreign_worker : Factor w/ 2 levels "A201","A202": 1 1 1 1 1 1 1 1 1 1 ...
## $ Customer_class : int 1 2 1 1 2 1 1 1 1 2 ...
Find the significance of a few quantitative and qualitative variables (Duration, Credit_Amt, age, foreign_worker, Employment_Stat, Purpose and Credit_Hist) on the response variable. Using a backward elimination process, find the variables most strongly associated with the response variable.
From the inference analysis, we noted an anomaly in Credit_Hist: applicants with critical credit histories have a higher approval rate and are considered good customers. What is the reason behind this observation?
Try different machine learning models to predict the customer class from the most significant variables identified above, and gauge their accuracy using a confusion matrix and other means.
The cases are observations on customers who applied for credit at a German bank, together with the bank’s decision on whether the applicant is a good or bad credit risk. The observations include details such as credit history, employment, credit amount, purpose and bank account details.
There are 1000 cases in the dataset for this research activity.
The data is available in the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/datasets.html) and was downloaded to perform this regression study.
This is an observational study, since we are analysing data collected as observations rather than from a designed experiment.
The dataset page in the UCI Machine Learning Repository is: http://archive.ics.uci.edu/ml/datasets/Statlog+%28German+Credit+Data%29
Dua, D. and Karra Taniskidou, E. (2017). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
Customer_class is the response variable. It is a qualitative variable describing the bank’s assessment of the credit application: if the bank approves, the customer is classified as good, otherwise as bad.
There are 20 independent variables, a mix of quantitative and qualitative variables.
Credit_Amt: (numerical)
+ Credit amount
Housing: (qualitative)
+ Housing
- A151 : rent
- A152 : own
- A153 : for free
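All of the qualitative attributes are stored as A-codes, so it can help to attach readable labels while exploring them. Below is a minimal sketch for Housing, assuming the code meanings listed above from german.doc; the column name Housing_label is only for illustration.
# Sketch: relabel the coded Housing factor with readable labels
# (assumes the A151/A152/A153 meanings from german.doc; Housing_label is illustrative)
germanCreditDF$Housing_label <- factor(germanCreditDF$Housing,
                                       levels = c("A151", "A152", "A153"),
                                       labels = c("rent", "own", "for free"))
table(germanCreditDF$Housing_label)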
Next we provide summary statistics and appropriate visualizations (histograms, box plots, bar plots) related to the research questions.
Let’s start with some basic inference by looking at distributions and box plots. Based on this analysis, we will derive our research questions.
First we will look at the distributions of some of the quantitative variables.
Duration
Below are the histogram and the box plot of the ‘Duration’ variable. The histogram shows the variable is slightly right skewed, and the box plot shows that the median for the good customer class is lower than for the bad customer class. There are, however, outliers in the good customer class, which suggests that other factors may also be correlated with the response variable.
With respect to spread, the good customer class is more condensed than the bad customer class. The box plots alone do not show a strong relationship with the response variable, so let’s confirm whether Duration is statistically significant using a hypothesis test.
hist(germanCreditDF$Duration)
# Derive a readable class label ("Good"/"Bad") from the coded Customer_class
germanCreditDF$class <- sapply(germanCreditDF$Customer_class, function(x){
  switch(as.character(x), "1" = "Good", "2" = "Bad")
})
# Derive a 0/1 response (1 = good risk, 0 = bad risk) for logistic regression
germanCreditDF$response_class <- sapply(germanCreditDF$Customer_class, function(x){
  switch(as.character(x), "1" = 1, "2" = 0)
})
library(ggplot2)
ggplot(germanCreditDF, aes(x=class, y=Duration)) +
geom_boxplot(color="red", fill="orange", alpha=0.2)
Let’s define the null and alternative hypotheses in terms of the median, since the distribution is skewed.
H0: the median Duration for good customers is the same as for bad customers.
HA: the median Duration for good customers is different from that for bad customers.
Conclusion: since the p-value is less than .05, the null hypothesis can be rejected. In short, the hypothesis test indicates that Duration is statistically significantly associated with the response variable.
source('http://www.openintro.org/stat/slides/inference.R')
inference(y = germanCreditDF$Duration, x = germanCreditDF$class, est = "median", type = "ht", null = 0,
alternative = "twosided", method = "simulation")
## Response variable: numerical, Explanatory variable: categorical
## Difference between two medians
## Summary statistics:
## n_Bad = 300, median_Bad = 24, n_Good = 700, median_Good = 18,
## Observed difference between medians (Bad-Good) = 6
## H0: median_Bad - median_Good = 0
## HA: median_Bad - median_Good != 0
## p-value = 4e-04
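As a cross-check (not part of the inference() analysis above), a nonparametric base-R test can be run on the same comparison. This is only a sketch: the Wilcoxon rank-sum test assesses a shift in the Duration distribution between the two classes rather than the simulated difference in medians, but it should point in the same direction.
# Sketch of a base-R cross-check: Wilcoxon rank-sum test of Duration by class
wilcox.test(Duration ~ class, data = germanCreditDF)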
Let’s move on to the other quantitative variables.
Credit_Amt
From the box plot we do not see much of a relationship: the medians are roughly the same, the good customer class is more condensed than the bad customer class, and there are more outliers in the good customer class.
Let’s confirm whether Credit_Amt is statistically significant to the response variable using a hypothesis test.
# Quantitative variable comparison (using box plots)
hist(germanCreditDF$Credit_Amt)
# Set a unique color with fill, colour, and alpha
ggplot(germanCreditDF, aes(x=class, y=Credit_Amt)) +
geom_boxplot(color="red", fill="orange", alpha=0.2)
Let’s define the null and alternative hypotheses in terms of the median, since the distribution is skewed.
H0: the median Credit_Amt for good customers is the same as for bad customers.
HA: the median Credit_Amt for good customers is different from that for bad customers.
Conclusion: the p-value is just below .05, so the null hypothesis can be rejected, though only marginally. Using the hypothesis test, we conclude that Credit_Amt is statistically significant to the response variable.
inference(y = germanCreditDF$Credit_Amt, x = germanCreditDF$class, est = "median", type = "ht", null = 0,
alternative = "twosided", method = "simulation")
## Response variable: numerical, Explanatory variable: categorical
## Difference between two medians
## Summary statistics:
## n_Bad = 300, median_Bad = 2574.5, n_Good = 700, median_Good = 2244,
## Observed difference between medians (Bad-Good) = 330.5
## H0: median_Bad - median_Good = 0
## HA: median_Bad - median_Good != 0
## p-value = 0.0496
no_of_credits
The box plots for the two classes look the same, so I expect no_of_credits to have no significance towards the response variable.
hist(germanCreditDF$no_of_credits)
ggplot(germanCreditDF, aes(x=class, y=no_of_credits)) +
geom_boxplot(color="red", fill="orange", alpha=0.2)
According to the hypothesis test, the p-value is greater than .05, so we conclude that no_of_credits is not significantly associated with the response variable.
inference(y = germanCreditDF$no_of_credits, x = germanCreditDF$class, est = "mean", type = "ht", null = 0,
alternative = "twosided", method = "theoretical")
## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_Bad = 300, mean_Bad = 1.3667, sd_Bad = 0.5597
## n_Good = 700, mean_Good = 1.4243, sd_Good = 0.5847
## Observed difference between means (Bad-Good) = -0.0576
## H0: mu_Bad - mu_Good = 0
## HA: mu_Bad - mu_Good != 0
## Standard error = 0.039
## Test statistic: Z = -1.472
## p-value = 0.141
age
Based on the box plot, we do not see much of a relationship for age either: the medians are roughly the same and the spreads overlap. According to the hypothesis test, however, we can reject the null hypothesis, so age is significantly associated with the response variable.
hist(germanCreditDF$age)
ggplot(germanCreditDF, aes(x=class, y=age)) +
geom_boxplot(color="red", fill="orange", alpha=0.2)
inference(y = germanCreditDF$age, x = germanCreditDF$class, est = "median", type = "ht", null = 0,
alternative = "twosided", method = "simulation")
## Response variable: numerical, Explanatory variable: categorical
## Difference between two medians
## Summary statistics:
## n_Bad = 300, median_Bad = 31, n_Good = 700, median_Good = 34,
## Observed difference between medians (Bad-Good) = -3
## H0: median_Bad - median_Good = 0
## HA: median_Bad - median_Good != 0
## p-value = 0.0222
That covers the quantitative variables. Let’s look at some of the qualitative variables using contingency tables, conditional probabilities (column-wise) and a few side-by-side bar plots.
foreign_worker
The conditional probabilities show that although there are fewer non-foreign applicants (A202), their probability of approval is higher than that of foreign workers (A201). Note, however, that foreign workers make up the large majority of applicants.
# Qualitative variable comparison (using contingency table)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Contingency table
foreignDF <- germanCreditDF %>% select(class, foreign_worker)
table(foreignDF)
## foreign_worker
## class A201 A202
## Bad 296 4
## Good 667 33
# Conditional probability table
prop.table(table(foreignDF), 2)
## foreign_worker
## class A201 A202
## Bad 0.3073728 0.1081081
## Good 0.6926272 0.8918919
# Side-by-side barplot of contingency table
barplot(table(foreignDF), beside = T, legend = T, main = "foreign v/s customer class", col=c("coral", "aquamarine3"), xlab = "foreign_worker - A201 : yes - A202 : no")
Employment_Stat
Unemployment and fewer years of employment show a lower approval rate, so there appears to be some relationship.
# Qualitative variable comparison (using contingency table)
library(dplyr)
# Contingency table
employmentDF <- germanCreditDF %>% select(class, Employment_Stat)
table(employmentDF)
## Employment_Stat
## class A71 A72 A73 A74 A75
## Bad 23 70 104 39 64
## Good 39 102 235 135 189
# Conditional probability table
prop.table(table(employmentDF), 2)
## Employment_Stat
## class A71 A72 A73 A74 A75
## Bad 0.3709677 0.4069767 0.3067847 0.2241379 0.2529644
## Good 0.6290323 0.5930233 0.6932153 0.7758621 0.7470356
# Side-by-side barplot of contingency table
barplot(table(employmentDF), beside = T, legend = T, main = "Employment v/s customer class", col=c("coral", "aquamarine3"))
Purpose
The conditional probabilities show that used-car applications (A41) have a higher approval rate, while education loans (A46) have roughly even chances of being classified good or bad. So there appears to be some relationship.
Here are the label definitions for each Purpose code, taken from german.doc:
- A40 : car (new)
- A41 : car (used)
- A410 : others
- A42 : furniture/equipment
- A43 : radio/television
- A44 : domestic appliances
- A45 : repairs
- A46 : education
- A48 : retraining
- A49 : business
# Qualitative variable comparison (using contingency table)
library(dplyr)
# Contingency table
PurposeDF <- germanCreditDF %>% select(class, Purpose)
table(PurposeDF)
## Purpose
## class A40 A41 A410 A42 A43 A44 A45 A46 A48 A49
## Bad 89 17 5 58 62 4 8 22 1 34
## Good 145 86 7 123 218 8 14 28 8 63
# Conditional probability table
prop.table(table(PurposeDF), 2)
## Purpose
## class A40 A41 A410 A42 A43 A44
## Bad 0.3803419 0.1650485 0.4166667 0.3204420 0.2214286 0.3333333
## Good 0.6196581 0.8349515 0.5833333 0.6795580 0.7785714 0.6666667
## Purpose
## class A45 A46 A48 A49
## Bad 0.3636364 0.4400000 0.1111111 0.3505155
## Good 0.6363636 0.5600000 0.8888889 0.6494845
# Side-by-side barplot of contingency table
barplot(table(PurposeDF), beside = T, legend = T, main = "Purpose v/s customer class", col=c("coral", "aquamarine3"))
Credit_Hist
‘A34 : critical account/other credits existing’ has a higher approval rate, which looks strange. Either there is a data anomaly or some other relationship is overriding the significance of this variable.
‘A33 : delay in paying off in the past’ also shows a higher approval rate.
# Qualitative variable comparison (using contingency table)
# Contingency table
Credit_HistDF <- germanCreditDF %>% select(class, Credit_Hist)
table(Credit_HistDF)
## Credit_Hist
## class A30 A31 A32 A33 A34
## Bad 25 28 169 28 50
## Good 15 21 361 60 243
# Conditional probability table
prop.table(table(Credit_HistDF), 2)
## Credit_Hist
## class A30 A31 A32 A33 A34
## Bad 0.6250000 0.5714286 0.3188679 0.3181818 0.1706485
## Good 0.3750000 0.4285714 0.6811321 0.6818182 0.8293515
# Side-by-side barplot of contingency table
barplot(table(Credit_HistDF), beside = T, legend = T, main = "Credit_History v/s customer class", col=c("coral", "aquamarine3"))
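As a more formal check of the association suggested by the conditional probabilities (a sketch added alongside the original analysis), a chi-squared test of independence can be applied to the Credit_Hist contingency table.
# Sketch: chi-squared test of independence between Credit_Hist and customer class
chisq.test(table(germanCreditDF$Credit_Hist, germanCreditDF$class))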
The analysis below is performed to answer the research questions posed above.
Let’s try the manual approach first. The steps below perform backward elimination by hand: fit the full logistic model, then drop groups of insignificant variables and refit. This is often called step-wise regression.
full_regressor <- glm(formula = response_class ~ . - Customer_class - class , data = germanCreditDF, family = 'binomial' )
summary(full_regressor)
##
## Call:
## glm(formula = response_class ~ . - Customer_class - class, family = "binomial",
## data = germanCreditDF)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6116 -0.7095 0.3752 0.6994 2.3410
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.005e-01 1.084e+00 -0.369 0.711869
## checkingAccA12 3.749e-01 2.179e-01 1.720 0.085400 .
## checkingAccA13 9.657e-01 3.692e-01 2.616 0.008905 **
## checkingAccA14 1.712e+00 2.322e-01 7.373 1.66e-13 ***
## Duration -2.786e-02 9.296e-03 -2.997 0.002724 **
## Credit_HistA31 -1.434e-01 5.489e-01 -0.261 0.793921
## Credit_HistA32 5.861e-01 4.305e-01 1.362 0.173348
## Credit_HistA33 8.532e-01 4.717e-01 1.809 0.070470 .
## Credit_HistA34 1.436e+00 4.399e-01 3.264 0.001099 **
## PurposeA41 1.666e+00 3.743e-01 4.452 8.51e-06 ***
## PurposeA410 1.489e+00 7.764e-01 1.918 0.055163 .
## PurposeA42 7.916e-01 2.610e-01 3.033 0.002421 **
## PurposeA43 8.916e-01 2.471e-01 3.609 0.000308 ***
## PurposeA44 5.228e-01 7.623e-01 0.686 0.492831
## PurposeA45 2.164e-01 5.500e-01 0.393 0.694000
## PurposeA46 -3.628e-02 3.965e-01 -0.092 0.927082
## PurposeA48 2.059e+00 1.212e+00 1.699 0.089297 .
## PurposeA49 7.401e-01 3.339e-01 2.216 0.026668 *
## Credit_Amt -1.283e-04 4.444e-05 -2.887 0.003894 **
## SavingsAccA62 3.577e-01 2.861e-01 1.250 0.211130
## SavingsAccA63 3.761e-01 4.011e-01 0.938 0.348476
## SavingsAccA64 1.339e+00 5.249e-01 2.551 0.010729 *
## SavingsAccA65 9.467e-01 2.625e-01 3.607 0.000310 ***
## Employment_StatA72 6.691e-02 4.270e-01 0.157 0.875475
## Employment_StatA73 1.828e-01 4.105e-01 0.445 0.656049
## Employment_StatA74 8.310e-01 4.455e-01 1.866 0.062110 .
## Employment_StatA75 2.766e-01 4.134e-01 0.669 0.503410
## Installment_rate -3.301e-01 8.828e-02 -3.739 0.000185 ***
## personal_statA92 2.755e-01 3.865e-01 0.713 0.476040
## personal_statA93 8.161e-01 3.799e-01 2.148 0.031718 *
## personal_statA94 3.671e-01 4.537e-01 0.809 0.418448
## deptor_statA102 -4.360e-01 4.101e-01 -1.063 0.287700
## deptor_statA103 9.786e-01 4.243e-01 2.307 0.021072 *
## residence_in_years -4.776e-03 8.641e-02 -0.055 0.955920
## PropertyA122 -2.814e-01 2.534e-01 -1.111 0.266630
## PropertyA123 -1.945e-01 2.360e-01 -0.824 0.409743
## PropertyA124 -7.304e-01 4.245e-01 -1.721 0.085308 .
## age 1.454e-02 9.222e-03 1.576 0.114982
## other_instalment_plansA142 1.232e-01 4.119e-01 0.299 0.764878
## other_instalment_plansA143 6.463e-01 2.391e-01 2.703 0.006871 **
## HousingA152 4.436e-01 2.347e-01 1.890 0.058715 .
## HousingA153 6.839e-01 4.770e-01 1.434 0.151657
## no_of_credits -2.721e-01 1.895e-01 -1.436 0.151109
## Job_typeA172 -5.361e-01 6.796e-01 -0.789 0.430160
## Job_typeA173 -5.547e-01 6.549e-01 -0.847 0.397015
## Job_typeA174 -4.795e-01 6.623e-01 -0.724 0.469086
## no_liable -2.647e-01 2.492e-01 -1.062 0.288249
## TelephoneA192 3.000e-01 2.013e-01 1.491 0.136060
## foreign_workerA202 1.392e+00 6.258e-01 2.225 0.026095 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1221.73 on 999 degrees of freedom
## Residual deviance: 895.82 on 951 degrees of freedom
## AIC: 993.82
##
## Number of Fisher Scoring iterations: 5
Removing the variables ‘residence_in_years’, ‘Job_type’, ‘no_liable’ and ‘no_of_credits’, which have high p-values in the full model.
partial_regrssor1 <- glm(formula = response_class ~ . - Customer_class - class - residence_in_years - Job_type - no_liable - no_of_credits , data = germanCreditDF, family = 'binomial' )
summary(partial_regrssor1)
##
## Call:
## glm(formula = response_class ~ . - Customer_class - class - residence_in_years -
## Job_type - no_liable - no_of_credits, family = "binomial",
## data = germanCreditDF)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7065 -0.7404 0.3780 0.6972 2.2990
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.587e+00 8.544e-01 -1.858 0.063200 .
## checkingAccA12 4.159e-01 2.165e-01 1.921 0.054670 .
## checkingAccA13 1.046e+00 3.657e-01 2.860 0.004232 **
## checkingAccA14 1.709e+00 2.310e-01 7.399 1.38e-13 ***
## Duration -2.709e-02 9.141e-03 -2.964 0.003040 **
## Credit_HistA31 1.911e-02 5.312e-01 0.036 0.971297
## Credit_HistA32 7.369e-01 4.126e-01 1.786 0.074086 .
## Credit_HistA33 8.387e-01 4.678e-01 1.793 0.073013 .
## Credit_HistA34 1.394e+00 4.349e-01 3.205 0.001352 **
## PurposeA41 1.643e+00 3.670e-01 4.479 7.52e-06 ***
## PurposeA410 1.536e+00 7.639e-01 2.011 0.044308 *
## PurposeA42 8.093e-01 2.585e-01 3.131 0.001741 **
## PurposeA43 9.059e-01 2.462e-01 3.679 0.000234 ***
## PurposeA44 5.908e-01 7.655e-01 0.772 0.440254
## PurposeA45 1.831e-01 5.507e-01 0.332 0.739581
## PurposeA46 -3.402e-02 3.975e-01 -0.086 0.931798
## PurposeA48 2.067e+00 1.227e+00 1.684 0.092111 .
## PurposeA49 7.305e-01 3.318e-01 2.202 0.027698 *
## Credit_Amt -1.251e-04 4.312e-05 -2.900 0.003726 **
## SavingsAccA62 3.327e-01 2.822e-01 1.179 0.238487
## SavingsAccA63 4.438e-01 3.969e-01 1.118 0.263571
## SavingsAccA64 1.276e+00 5.150e-01 2.478 0.013205 *
## SavingsAccA65 9.688e-01 2.610e-01 3.712 0.000206 ***
## Employment_StatA72 -8.459e-02 3.817e-01 -0.222 0.824597
## Employment_StatA73 1.666e-02 3.560e-01 0.047 0.962681
## Employment_StatA74 6.440e-01 3.968e-01 1.623 0.104628
## Employment_StatA75 8.154e-02 3.665e-01 0.222 0.823930
## Installment_rate -3.182e-01 8.644e-02 -3.681 0.000232 ***
## personal_statA92 2.795e-01 3.829e-01 0.730 0.465424
## personal_statA93 7.497e-01 3.737e-01 2.006 0.044817 *
## personal_statA94 3.648e-01 4.518e-01 0.807 0.419494
## deptor_statA102 -4.431e-01 4.104e-01 -1.080 0.280245
## deptor_statA103 9.548e-01 4.248e-01 2.247 0.024619 *
## PropertyA122 -2.834e-01 2.512e-01 -1.128 0.259278
## PropertyA123 -2.056e-01 2.325e-01 -0.884 0.376633
## PropertyA124 -6.624e-01 4.210e-01 -1.573 0.115607
## age 1.357e-02 9.044e-03 1.500 0.133501
## other_instalment_plansA142 1.096e-01 4.092e-01 0.268 0.788800
## other_instalment_plansA143 6.741e-01 2.375e-01 2.838 0.004538 **
## HousingA152 4.513e-01 2.258e-01 1.999 0.045578 *
## HousingA153 6.118e-01 4.735e-01 1.292 0.196285
## TelephoneA192 2.911e-01 1.876e-01 1.552 0.120631
## foreign_workerA202 1.437e+00 6.301e-01 2.281 0.022544 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1221.73 on 999 degrees of freedom
## Residual deviance: 899.93 on 957 degrees of freedom
## AIC: 985.93
##
## Number of Fisher Scoring iterations: 5
Let’s remove Property, since its levels all have high p-values.
partial_regrssor2 <- glm(formula = response_class ~ . - Customer_class - class - residence_in_years - Job_type - no_liable - no_of_credits - Property , data = germanCreditDF, family = 'binomial' )
summary(partial_regrssor2)
##
## Call:
## glm(formula = response_class ~ . - Customer_class - class - residence_in_years -
## Job_type - no_liable - no_of_credits - Property, family = "binomial",
## data = germanCreditDF)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7586 -0.7379 0.3888 0.6961 2.3536
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.771e+00 8.388e-01 -2.111 0.034785 *
## checkingAccA12 4.225e-01 2.147e-01 1.968 0.049098 *
## checkingAccA13 1.056e+00 3.648e-01 2.895 0.003789 **
## checkingAccA14 1.730e+00 2.303e-01 7.513 5.78e-14 ***
## Duration -2.805e-02 9.035e-03 -3.105 0.001901 **
## Credit_HistA31 2.768e-02 5.295e-01 0.052 0.958316
## Credit_HistA32 7.686e-01 4.128e-01 1.862 0.062627 .
## Credit_HistA33 8.564e-01 4.678e-01 1.831 0.067126 .
## Credit_HistA34 1.421e+00 4.353e-01 3.264 0.001100 **
## PurposeA41 1.635e+00 3.668e-01 4.459 8.25e-06 ***
## PurposeA410 1.571e+00 7.686e-01 2.043 0.041005 *
## PurposeA42 7.800e-01 2.557e-01 3.050 0.002289 **
## PurposeA43 9.196e-01 2.453e-01 3.749 0.000178 ***
## PurposeA44 5.821e-01 7.558e-01 0.770 0.441197
## PurposeA45 1.791e-01 5.435e-01 0.330 0.741679
## PurposeA46 -1.182e-01 3.934e-01 -0.300 0.763815
## PurposeA48 2.038e+00 1.221e+00 1.669 0.095161 .
## PurposeA49 7.620e-01 3.301e-01 2.308 0.020983 *
## Credit_Amt -1.316e-04 4.259e-05 -3.091 0.001998 **
## SavingsAccA62 2.715e-01 2.784e-01 0.975 0.329510
## SavingsAccA63 4.178e-01 3.982e-01 1.049 0.294010
## SavingsAccA64 1.265e+00 5.117e-01 2.473 0.013410 *
## SavingsAccA65 9.504e-01 2.594e-01 3.663 0.000249 ***
## Employment_StatA72 -8.348e-02 3.815e-01 -0.219 0.826780
## Employment_StatA73 6.315e-02 3.555e-01 0.178 0.859030
## Employment_StatA74 6.694e-01 3.971e-01 1.686 0.091831 .
## Employment_StatA75 7.968e-02 3.670e-01 0.217 0.828114
## Installment_rate -3.262e-01 8.609e-02 -3.789 0.000151 ***
## personal_statA92 2.732e-01 3.802e-01 0.719 0.472368
## personal_statA93 7.373e-01 3.705e-01 1.990 0.046624 *
## personal_statA94 3.980e-01 4.486e-01 0.887 0.374903
## deptor_statA102 -4.870e-01 4.060e-01 -1.199 0.230392
## deptor_statA103 9.804e-01 4.167e-01 2.353 0.018637 *
## age 1.384e-02 9.000e-03 1.537 0.124203
## other_instalment_plansA142 1.556e-01 4.093e-01 0.380 0.703768
## other_instalment_plansA143 7.013e-01 2.360e-01 2.972 0.002963 **
## HousingA152 4.514e-01 2.241e-01 2.014 0.044004 *
## HousingA153 2.117e-01 3.437e-01 0.616 0.537789
## TelephoneA192 2.568e-01 1.861e-01 1.380 0.167706
## foreign_workerA202 1.414e+00 6.236e-01 2.268 0.023351 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1221.7 on 999 degrees of freedom
## Residual deviance: 902.8 on 960 degrees of freedom
## AIC: 982.8
##
## Number of Fisher Scoring iterations: 5
Let’s remove the Employment_Stat group, which has high p-values, and also remove Telephone.
partial_regrssor3 <- glm(formula = response_class ~ . - Customer_class - class - residence_in_years - Job_type - no_liable - no_of_credits - Property - Employment_Stat - Telephone, data = germanCreditDF, family = 'binomial' )
summary(partial_regrssor3)
##
## Call:
## glm(formula = response_class ~ . - Customer_class - class - residence_in_years -
## Job_type - no_liable - no_of_credits - Property - Employment_Stat -
## Telephone, family = "binomial", data = germanCreditDF)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7881 -0.7392 0.3867 0.7027 2.2832
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.787e+00 7.349e-01 -2.431 0.015042 *
## checkingAccA12 3.996e-01 2.115e-01 1.890 0.058802 .
## checkingAccA13 1.025e+00 3.620e-01 2.830 0.004652 **
## checkingAccA14 1.734e+00 2.279e-01 7.610 2.73e-14 ***
## Duration -2.668e-02 8.884e-03 -3.004 0.002667 **
## Credit_HistA31 1.288e-01 5.241e-01 0.246 0.805860
## Credit_HistA32 8.429e-01 4.083e-01 2.064 0.038976 *
## Credit_HistA33 9.351e-01 4.640e-01 2.015 0.043872 *
## Credit_HistA34 1.508e+00 4.306e-01 3.502 0.000462 ***
## PurposeA41 1.623e+00 3.645e-01 4.452 8.52e-06 ***
## PurposeA410 1.534e+00 7.461e-01 2.056 0.039829 *
## PurposeA42 7.352e-01 2.528e-01 2.908 0.003636 **
## PurposeA43 9.093e-01 2.433e-01 3.737 0.000186 ***
## PurposeA44 5.143e-01 7.295e-01 0.705 0.480770
## PurposeA45 9.557e-02 5.367e-01 0.178 0.858657
## PurposeA46 -1.255e-01 3.923e-01 -0.320 0.748916
## PurposeA48 2.115e+00 1.209e+00 1.749 0.080306 .
## PurposeA49 8.128e-01 3.252e-01 2.499 0.012457 *
## Credit_Amt -1.135e-04 4.068e-05 -2.789 0.005286 **
## SavingsAccA62 3.263e-01 2.760e-01 1.182 0.237080
## SavingsAccA63 4.332e-01 3.948e-01 1.097 0.272550
## SavingsAccA64 1.300e+00 5.075e-01 2.562 0.010394 *
## SavingsAccA65 9.589e-01 2.559e-01 3.747 0.000179 ***
## Installment_rate -3.191e-01 8.503e-02 -3.752 0.000175 ***
## personal_statA92 2.932e-01 3.735e-01 0.785 0.432492
## personal_statA93 8.263e-01 3.638e-01 2.271 0.023132 *
## personal_statA94 4.395e-01 4.412e-01 0.996 0.319167
## deptor_statA102 -5.075e-01 4.008e-01 -1.266 0.205402
## deptor_statA103 1.011e+00 4.172e-01 2.424 0.015333 *
## age 1.478e-02 8.315e-03 1.778 0.075416 .
## other_instalment_plansA142 8.431e-02 4.037e-01 0.209 0.834567
## other_instalment_plansA143 6.987e-01 2.351e-01 2.972 0.002955 **
## HousingA152 4.384e-01 2.210e-01 1.983 0.047313 *
## HousingA153 1.574e-01 3.397e-01 0.463 0.643044
## foreign_workerA202 1.326e+00 6.192e-01 2.142 0.032232 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1221.73 on 999 degrees of freedom
## Residual deviance: 912.82 on 965 degrees of freedom
## AIC: 982.82
##
## Number of Fisher Scoring iterations: 5
At this stage we have removed residence_in_years, Job_type, no_liable, no_of_credits, Property, Employment_Stat and Telephone, because they showed no significance towards the response variable. I am not sure whether some of the remaining variables with borderline p-values should also be removed.
The current explanatory variables after the manual elimination are: checkingAcc, Duration, Credit_Hist, Purpose, Credit_Amt, SavingsAcc, Installment_rate, personal_stat, deptor_stat, age, other_instalment_plans, Housing and foreign_worker.
Let’s carry out an automated step-wise backward elimination to see whether it agrees with our manual elimination.
step(full_regressor ,data= germanCreditDF , direction = "backward" ,test = "F")
## Start: AIC=993.82
## response_class ~ (checkingAcc + Duration + Credit_Hist + Purpose +
## Credit_Amt + SavingsAcc + Employment_Stat + Installment_rate +
## personal_stat + deptor_stat + residence_in_years + Property +
## age + other_instalment_plans + Housing + no_of_credits +
## Job_type + no_liable + Telephone + foreign_worker + Customer_class +
## class) - Customer_class - class
## Warning in drop1.glm(fit, scope$drop, scale = scale, trace = trace, k =
## k, : F test assumes 'quasibinomial' family
## Df Deviance AIC F value Pr(>F)
## - Job_type 3 896.56 988.56 0.2614 0.8532561
## - Property 3 899.08 991.08 1.1546 0.3260711
## - residence_in_years 1 895.82 991.82 0.0032 0.9545966
## - no_liable 1 896.94 992.94 1.1870 0.2762026
## <none> 895.82 993.82
## - no_of_credits 1 897.89 993.89 2.1948 0.1388046
## - Employment_Stat 4 904.03 994.03 2.1807 0.0692702 .
## - Housing 2 900.05 994.05 2.2453 0.1064598
## - Telephone 1 898.06 994.06 2.3783 0.1233632
## - age 1 898.34 994.34 2.6811 0.1018775
## - personal_stat 3 905.15 997.15 3.3009 0.0198254 *
## - deptor_stat 2 903.24 997.24 3.9394 0.0197790 *
## - foreign_worker 1 901.88 997.88 6.4369 0.0113364 *
## - other_instalment_plans 2 903.98 997.98 4.3351 0.0133603 *
## - Credit_Amt 1 904.28 1000.28 8.9799 0.0028005 **
## - Duration 1 904.87 1000.87 9.6054 0.0019971 **
## - SavingsAcc 4 915.63 1005.63 5.2571 0.0003436 ***
## - Installment_rate 1 910.27 1006.27 15.3396 9.623e-05 ***
## - Credit_Hist 4 917.62 1007.62 5.7873 0.0001328 ***
## - Purpose 9 931.12 1011.12 4.1640 2.752e-05 ***
## - checkingAcc 3 962.05 1054.05 23.4381 1.219e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=988.56
## response_class ~ checkingAcc + Duration + Credit_Hist + Purpose +
## Credit_Amt + SavingsAcc + Employment_Stat + Installment_rate +
## personal_stat + deptor_stat + residence_in_years + Property +
## age + other_instalment_plans + Housing + no_of_credits +
## no_liable + Telephone + foreign_worker
## Warning in drop1.glm(fit, scope$drop, scale = scale, trace = trace, k =
## k, : F test assumes 'quasibinomial' family
## Df Deviance AIC F value Pr(>F)
## - Property 3 899.79 985.79 1.1477 0.3288039
## - residence_in_years 1 896.57 986.57 0.0157 0.9002261
## - no_liable 1 897.67 987.67 1.1809 0.2774567
## - Employment_Stat 4 904.32 988.32 2.0656 0.0833203 .
## - no_of_credits 1 898.47 988.47 2.0405 0.1534852
## <none> 896.56 988.56
## - Housing 2 900.60 988.60 2.1501 0.1170347
## - Telephone 1 899.13 989.13 2.7431 0.0980011 .
## - age 1 899.19 989.19 2.8049 0.0943051 .
## - personal_stat 3 905.83 991.83 3.2905 0.0201057 *
## - deptor_stat 2 903.87 991.87 3.8932 0.0207048 *
## - foreign_worker 1 902.67 992.67 6.5056 0.0109085 *
## - other_instalment_plans 2 904.95 992.95 4.4668 0.0117249 *
## - Credit_Amt 1 905.31 995.31 9.3156 0.0023351 **
## - Duration 1 905.85 995.85 9.8940 0.0017096 **
## - SavingsAcc 4 917.02 1001.02 5.4434 0.0002461 ***
## - Installment_rate 1 911.45 1001.45 15.8462 7.392e-05 ***
## - Credit_Hist 4 918.12 1002.12 5.7370 0.0001453 ***
## - Purpose 9 931.82 1005.82 4.1694 2.698e-05 ***
## - checkingAcc 3 962.35 1048.35 23.3355 1.399e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=985.79
## response_class ~ checkingAcc + Duration + Credit_Hist + Purpose +
## Credit_Amt + SavingsAcc + Employment_Stat + Installment_rate +
## personal_stat + deptor_stat + residence_in_years + age +
## other_instalment_plans + Housing + no_of_credits + no_liable +
## Telephone + foreign_worker
## Warning in drop1.glm(fit, scope$drop, scale = scale, trace = trace, k =
## k, : F test assumes 'quasibinomial' family
## Df Deviance AIC F value Pr(>F)
## - residence_in_years 1 899.81 983.81 0.0215 0.8833737
## - no_liable 1 900.79 984.79 1.0564 0.3042884
## - Housing 2 903.47 985.47 1.9552 0.1420991
## - no_of_credits 1 901.49 985.49 1.8016 0.1798405
## <none> 899.79 985.79
## - Telephone 1 901.81 985.81 2.1412 0.1437136
## - Employment_Stat 4 907.85 985.85 2.1433 0.0735681 .
## - age 1 902.52 986.52 2.9017 0.0888118 .
## - personal_stat 3 908.67 988.67 3.1485 0.0243723 *
## - foreign_worker 1 905.83 989.83 6.4227 0.0114253 *
## - deptor_stat 2 908.05 990.05 4.3927 0.0126182 *
## - other_instalment_plans 2 908.87 990.87 4.8266 0.0082095 **
## - Credit_Amt 1 909.80 993.80 10.6403 0.0011456 **
## - Duration 1 909.99 993.99 10.8456 0.0010266 **
## - SavingsAcc 4 919.78 997.78 5.3148 0.0003098 ***
## - Installment_rate 1 915.56 999.56 16.7707 4.574e-05 ***
## - Credit_Hist 4 921.66 999.66 5.8145 0.0001264 ***
## - Purpose 9 936.35 1004.35 4.3202 1.571e-05 ***
## - checkingAcc 3 967.78 1047.78 24.1019 4.851e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=983.81
## response_class ~ checkingAcc + Duration + Credit_Hist + Purpose +
## Credit_Amt + SavingsAcc + Employment_Stat + Installment_rate +
## personal_stat + deptor_stat + age + other_instalment_plans +
## Housing + no_of_credits + no_liable + Telephone + foreign_worker
## Warning in drop1.glm(fit, scope$drop, scale = scale, trace = trace, k =
## k, : F test assumes 'quasibinomial' family
## Df Deviance AIC F value Pr(>F)
## - no_liable 1 900.81 982.81 1.0624 0.3029225
## - no_of_credits 1 901.53 983.53 1.8334 0.1760473
## - Telephone 1 901.81 983.81 2.1258 0.1451634
## <none> 899.81 983.81
## - Employment_Stat 4 907.86 983.86 2.1413 0.0738084 .
## - Housing 2 903.95 983.95 2.2024 0.1110975
## - age 1 902.53 984.53 2.8889 0.0895151 .
## - personal_stat 3 908.69 986.69 3.1515 0.0242734 *
## - foreign_worker 1 905.88 987.88 6.4574 0.0112061 *
## - deptor_stat 2 908.08 988.08 4.3996 0.0125313 *
## - other_instalment_plans 2 908.87 988.87 4.8214 0.0082517 **
## - Credit_Amt 1 909.80 991.80 10.6299 0.0011520 **
## - Duration 1 910.05 992.05 10.8974 0.0009986 ***
## - SavingsAcc 4 919.78 995.78 5.3152 0.0003095 ***
## - Installment_rate 1 915.59 997.59 16.8027 4.499e-05 ***
## - Credit_Hist 4 921.66 997.66 5.8151 0.0001263 ***
## - Purpose 9 936.35 1002.35 4.3223 1.559e-05 ***
## - checkingAcc 3 968.09 1046.09 24.2297 4.064e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=982.81
## response_class ~ checkingAcc + Duration + Credit_Hist + Purpose +
## Credit_Amt + SavingsAcc + Employment_Stat + Installment_rate +
## personal_stat + deptor_stat + age + other_instalment_plans +
## Housing + no_of_credits + Telephone + foreign_worker
## Warning in drop1.glm(fit, scope$drop, scale = scale, trace = trace, k =
## k, : F test assumes 'quasibinomial' family
## Df Deviance AIC F value Pr(>F)
## - no_of_credits 1 902.80 982.80 2.1199 0.1457260
## <none> 900.81 982.81
## - Employment_Stat 4 908.82 982.82 2.1315 0.0749790 .
## - Telephone 1 902.90 982.90 2.2252 0.1361081
## - Housing 2 905.01 983.01 2.2329 0.1077705
## - age 1 903.38 983.38 2.7341 0.0985521 .
## - personal_stat 3 908.75 984.75 2.8190 0.0380116 *
## - foreign_worker 1 906.79 986.79 6.3613 0.0118248 *
## - deptor_stat 2 908.83 986.83 4.2672 0.0142886 *
## - other_instalment_plans 2 909.90 987.90 4.8373 0.0081230 **
## - Credit_Amt 1 910.50 990.50 10.3129 0.0013650 **
## - Duration 1 910.95 990.95 10.7970 0.0010535 **
## - SavingsAcc 4 920.53 994.53 5.2488 0.0003485 ***
## - Installment_rate 1 915.95 995.95 16.1170 6.419e-05 ***
## - Credit_Hist 4 923.28 997.28 5.9800 9.383e-05 ***
## - Purpose 9 937.61 1001.61 4.3533 1.394e-05 ***
## - checkingAcc 3 969.35 1045.35 24.3232 3.569e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=982.8
## response_class ~ checkingAcc + Duration + Credit_Hist + Purpose +
## Credit_Amt + SavingsAcc + Employment_Stat + Installment_rate +
## personal_stat + deptor_stat + age + other_instalment_plans +
## Housing + Telephone + foreign_worker
## Warning in drop1.glm(fit, scope$drop, scale = scale, trace = trace, k =
## k, : F test assumes 'quasibinomial' family
## Df Deviance AIC F value Pr(>F)
## - Employment_Stat 4 910.50 982.50 2.0460 0.0859611 .
## - Telephone 1 904.72 982.72 2.0372 0.1538132
## <none> 902.80 982.80
## - Housing 2 907.17 983.17 2.3205 0.0987741 .
## - age 1 905.21 983.21 2.5570 0.1101378
## - personal_stat 3 910.47 984.47 2.7195 0.0434326 *
## - deptor_stat 2 910.88 986.88 4.2938 0.0139160 *
## - foreign_worker 1 909.14 987.14 6.7384 0.0095802 **
## - other_instalment_plans 2 912.57 988.57 5.1924 0.0057159 **
## - Duration 1 912.50 990.50 10.3179 0.0013613 **
## - Credit_Amt 1 912.54 990.54 10.3523 0.0013364 **
## - SavingsAcc 4 922.72 994.72 5.2944 0.0003212 ***
## - Credit_Hist 4 923.43 995.43 5.4840 0.0002287 ***
## - Installment_rate 1 917.65 995.65 15.7916 7.601e-05 ***
## - Purpose 9 940.20 1002.20 4.4184 1.103e-05 ***
## - checkingAcc 3 971.04 1045.04 24.1890 4.286e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=982.5
## response_class ~ checkingAcc + Duration + Credit_Hist + Purpose +
## Credit_Amt + SavingsAcc + Installment_rate + personal_stat +
## deptor_stat + age + other_instalment_plans + Housing + Telephone +
## foreign_worker
## Warning in drop1.glm(fit, scope$drop, scale = scale, trace = trace, k =
## k, : F test assumes 'quasibinomial' family
## Df Deviance AIC F value Pr(>F)
## <none> 910.50 982.50
## - Telephone 1 912.82 982.82 2.4537 0.1175738
## - age 1 912.96 982.96 2.6118 0.1064003
## - Housing 2 915.01 983.01 2.3885 0.0923074 .
## - foreign_worker 1 916.63 986.63 6.4913 0.0109944 *
## - personal_stat 3 920.98 986.98 3.6987 0.0115203 *
## - deptor_stat 2 919.41 987.41 4.7202 0.0091208 **
## - other_instalment_plans 2 920.76 988.76 5.4348 0.0044969 **
## - Duration 1 918.79 988.79 8.7794 0.0031213 **
## - Credit_Amt 1 920.07 990.07 10.1320 0.0015038 **
## - SavingsAcc 4 931.53 995.53 5.5665 0.0001972 ***
## - Installment_rate 1 925.92 995.92 16.3257 5.758e-05 ***
## - Credit_Hist 4 932.30 996.30 5.7705 0.0001367 ***
## - Purpose 9 947.78 1001.78 4.3854 1.240e-05 ***
## - checkingAcc 3 979.75 1045.75 24.4394 3.022e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Call: glm(formula = response_class ~ checkingAcc + Duration + Credit_Hist +
## Purpose + Credit_Amt + SavingsAcc + Installment_rate + personal_stat +
## deptor_stat + age + other_instalment_plans + Housing + Telephone +
## foreign_worker, family = "binomial", data = germanCreditDF)
##
## Coefficients:
## (Intercept) checkingAccA12
## -1.7495411 0.3900152
## checkingAccA13 checkingAccA14
## 1.0240813 1.7177165
## Duration Credit_HistA31
## -0.0256787 0.1187724
## Credit_HistA32 Credit_HistA33
## 0.8303101 0.9097304
## Credit_HistA34 PurposeA41
## 1.4917085 1.6072585
## PurposeA410 PurposeA42
## 1.4349203 0.7404978
## PurposeA43 PurposeA44
## 0.9194787 0.5250945
## PurposeA45 PurposeA46
## 0.1424475 -0.1435655
## PurposeA48 PurposeA49
## 2.1643060 0.7826591
## Credit_Amt SavingsAccA62
## -0.0001294 0.3282182
## SavingsAccA63 SavingsAccA64
## 0.4303584 1.2894345
## SavingsAccA65 Installment_rate
## 0.9628458 -0.3299308
## personal_statA92 personal_statA93
## 0.2872096 0.8227885
## personal_statA94 deptor_statA102
## 0.4169133 -0.4874391
## deptor_statA103 age
## 1.0404263 0.0130933
## other_instalment_plansA142 other_instalment_plansA143
## 0.0786395 0.6994941
## HousingA152 HousingA153
## 0.4415029 0.1496754
## TelephoneA192 foreign_workerA202
## 0.2794111 1.3824572
##
## Degrees of Freedom: 999 Total (i.e. Null); 964 Residual
## Null Deviance: 1222
## Residual Deviance: 910.5 AIC: 982.5
Above are the step-wise backward elimination results.
The p-values for Telephone, age and Housing are slightly greater than .05 (our significance level). We had already removed Telephone in the manual elimination process; I am still hesitant to remove age and Housing since they are so close to the significance level.
For linear regression we could have used R-squared and adjusted R-squared to decide whether to drop these variables, but the glm summary does not report them. Instead we will predict on a held-out test set and check the confusion matrix to see the effectiveness of the model.
A pseudo R-squared could also be used to evaluate the model.
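As a minimal sketch of that idea, McFadden’s pseudo R-squared can be computed directly from the deviances stored in the fitted glm objects (the helper name pseudo_r2 is just for illustration).
# Sketch: McFadden's pseudo R-squared = 1 - residual deviance / null deviance
pseudo_r2 <- function(model) 1 - model$deviance / model$null.deviance
pseudo_r2(full_regressor)     # full logistic model
pseudo_r2(partial_regrssor3)  # model after manual elimination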
Splitting the dataset into training and test sets.
library(caTools)
set.seed(123)
split <- sample.split(germanCreditDF$response_class, SplitRatio = 0.75)
training_set <- subset(germanCreditDF , split == TRUE)
test_set <- subset(germanCreditDF , split == FALSE)
nrow(training_set)
## [1] 750
nrow(test_set)
## [1] 250
Fitting the logistic model with the above variables.
regressor1 <- glm(formula = response_class ~ age + Housing + foreign_worker + personal_stat + deptor_stat + other_instalment_plans + Duration + Credit_Amt + SavingsAcc + Installment_rate + Credit_Hist + Purpose + checkingAcc, data = training_set, family = 'binomial' )
summary(regressor1)
##
## Call:
## glm(formula = response_class ~ age + Housing + foreign_worker +
## personal_stat + deptor_stat + other_instalment_plans + Duration +
## Credit_Amt + SavingsAcc + Installment_rate + Credit_Hist +
## Purpose + checkingAcc, family = "binomial", data = training_set)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8284 -0.7033 0.3578 0.6915 2.2344
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.153e+00 8.651e-01 -1.333 0.182552
## age 3.510e-03 9.680e-03 0.363 0.716941
## HousingA152 3.037e-01 2.639e-01 1.151 0.249854
## HousingA153 4.003e-02 3.885e-01 0.103 0.917939
## foreign_workerA202 1.048e+00 7.298e-01 1.435 0.151172
## personal_statA92 2.172e-01 4.351e-01 0.499 0.617613
## personal_statA93 8.310e-01 4.237e-01 1.961 0.049830 *
## personal_statA94 4.380e-01 5.096e-01 0.859 0.390134
## deptor_statA102 -7.718e-01 5.054e-01 -1.527 0.126742
## deptor_statA103 9.139e-01 4.824e-01 1.894 0.058159 .
## other_instalment_plansA142 1.124e-01 4.610e-01 0.244 0.807327
## other_instalment_plansA143 8.385e-01 2.754e-01 3.044 0.002334 **
## Duration -2.673e-02 1.064e-02 -2.511 0.012026 *
## Credit_Amt -1.200e-04 4.963e-05 -2.418 0.015609 *
## SavingsAccA62 6.557e-01 3.419e-01 1.918 0.055153 .
## SavingsAccA63 3.705e-01 4.241e-01 0.874 0.382244
## SavingsAccA64 1.217e+00 5.861e-01 2.076 0.037913 *
## SavingsAccA65 1.216e+00 3.298e-01 3.686 0.000228 ***
## Installment_rate -3.062e-01 9.795e-02 -3.126 0.001770 **
## Credit_HistA31 1.251e-01 6.423e-01 0.195 0.845596
## Credit_HistA32 7.091e-01 4.883e-01 1.452 0.146442
## Credit_HistA33 6.684e-01 5.488e-01 1.218 0.223260
## Credit_HistA34 1.441e+00 5.089e-01 2.832 0.004627 **
## PurposeA41 1.525e+00 4.282e-01 3.562 0.000368 ***
## PurposeA410 2.073e+00 9.561e-01 2.168 0.030133 *
## PurposeA42 7.407e-01 2.996e-01 2.472 0.013439 *
## PurposeA43 4.748e-01 2.859e-01 1.661 0.096769 .
## PurposeA44 1.104e+00 9.298e-01 1.187 0.235066
## PurposeA45 -3.480e-01 6.520e-01 -0.534 0.593559
## PurposeA46 -2.153e-01 4.650e-01 -0.463 0.643421
## PurposeA48 1.516e+01 4.584e+02 0.033 0.973607
## PurposeA49 3.257e-01 3.877e-01 0.840 0.400902
## checkingAccA12 3.062e-01 2.509e-01 1.220 0.222335
## checkingAccA13 8.143e-01 4.033e-01 2.019 0.043493 *
## checkingAccA14 1.877e+00 2.696e-01 6.963 3.33e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 916.30 on 749 degrees of freedom
## Residual deviance: 672.95 on 715 degrees of freedom
## AIC: 742.95
##
## Number of Fisher Scoring iterations: 14
prob_pred <- predict(regressor1, type = 'response' , newdata = subset(test_set, select=c(age, Housing,foreign_worker, personal_stat , deptor_stat , other_instalment_plans , Duration ,Credit_Amt , SavingsAcc, Installment_rate , Credit_Hist , Purpose , checkingAcc)))
y_pred <- ifelse(prob_pred > 0.5 , 1 , 0)
cm <- table(test_set[, 'response_class'], y_pred)
cm
## y_pred
## 0 1
## 0 36 39
## 1 24 151
accuracy <- (36 + 151)/nrow(test_set)
accuracy
## [1] 0.748
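Rather than hard-coding the counts, the same metric (plus sensitivity and specificity) can be derived directly from the confusion matrix; this is a small sketch, with rows as the actual class and columns as the predicted class.
# Sketch: derive metrics from cm instead of hard-coding the counts
acc  <- sum(diag(cm)) / sum(cm)       # overall accuracy
sens <- cm["1", "1"] / sum(cm["1", ]) # true positive rate for good customers
spec <- cm["0", "0"] / sum(cm["0", ]) # true negative rate for bad customers
c(accuracy = acc, sensitivity = sens, specificity = spec)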
Now fitting the model after removing ‘age’ and ‘Housing’.
regressor2 <- glm(formula = response_class ~ foreign_worker + personal_stat + deptor_stat + other_instalment_plans + Duration + Credit_Amt + SavingsAcc + Installment_rate + Credit_Hist + Purpose + checkingAcc, data = training_set, family = 'binomial' )
summary(regressor2)
##
## Call:
## glm(formula = response_class ~ foreign_worker + personal_stat +
## deptor_stat + other_instalment_plans + Duration + Credit_Amt +
## SavingsAcc + Installment_rate + Credit_Hist + Purpose + checkingAcc,
## family = "binomial", data = training_set)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8276 -0.7225 0.3596 0.6941 2.1908
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.964e-01 7.635e-01 -1.043 0.296862
## foreign_workerA202 1.068e+00 7.299e-01 1.464 0.143326
## personal_statA92 1.478e-01 4.294e-01 0.344 0.730722
## personal_statA93 8.082e-01 4.193e-01 1.927 0.053921 .
## personal_statA94 3.941e-01 5.035e-01 0.783 0.433779
## deptor_statA102 -8.233e-01 5.004e-01 -1.645 0.099904 .
## deptor_statA103 9.296e-01 4.796e-01 1.938 0.052594 .
## other_instalment_plansA142 1.644e-01 4.594e-01 0.358 0.720502
## other_instalment_plansA143 8.258e-01 2.743e-01 3.011 0.002605 **
## Duration -2.763e-02 1.054e-02 -2.623 0.008723 **
## Credit_Amt -1.205e-04 4.946e-05 -2.437 0.014822 *
## SavingsAccA62 6.312e-01 3.402e-01 1.855 0.063548 .
## SavingsAccA63 3.657e-01 4.239e-01 0.863 0.388387
## SavingsAccA64 1.228e+00 5.858e-01 2.096 0.036067 *
## SavingsAccA65 1.227e+00 3.284e-01 3.737 0.000186 ***
## Installment_rate -3.020e-01 9.726e-02 -3.105 0.001906 **
## Credit_HistA31 1.327e-01 6.376e-01 0.208 0.835076
## Credit_HistA32 7.121e-01 4.843e-01 1.470 0.141507
## Credit_HistA33 6.913e-01 5.452e-01 1.268 0.204761
## Credit_HistA34 1.483e+00 5.048e-01 2.938 0.003303 **
## PurposeA41 1.463e+00 4.218e-01 3.468 0.000524 ***
## PurposeA410 2.102e+00 9.667e-01 2.174 0.029685 *
## PurposeA42 7.311e-01 2.966e-01 2.465 0.013695 *
## PurposeA43 5.038e-01 2.824e-01 1.784 0.074451 .
## PurposeA44 1.220e+00 9.327e-01 1.308 0.190705
## PurposeA45 -3.277e-01 6.559e-01 -0.500 0.617385
## PurposeA46 -2.954e-01 4.615e-01 -0.640 0.522070
## PurposeA48 1.529e+01 4.571e+02 0.033 0.973322
## PurposeA49 3.581e-01 3.844e-01 0.932 0.351560
## checkingAccA12 3.220e-01 2.494e-01 1.291 0.196635
## checkingAccA13 8.330e-01 4.021e-01 2.072 0.038279 *
## checkingAccA14 1.914e+00 2.683e-01 7.137 9.57e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 916.3 on 749 degrees of freedom
## Residual deviance: 674.8 on 718 degrees of freedom
## AIC: 738.8
##
## Number of Fisher Scoring iterations: 14
prob_pred <- predict(regressor2, type = 'response' , newdata = subset(test_set, select=c(foreign_worker, personal_stat , deptor_stat , other_instalment_plans , Duration ,Credit_Amt , SavingsAcc, Installment_rate , Credit_Hist , Purpose , checkingAcc)))
y_pred <- ifelse(prob_pred > 0.5 , 1 , 0)
cm <- table(test_set[, 'response_class'], y_pred)
cm
## y_pred
## 0 1
## 0 36 39
## 1 24 151
accuracy <- (36 + 151)/nrow(test_set)
accuracy
## [1] 0.748
The accuracy did not change, so age and Housing are not adding value to the model: the accuracy neither goes down nor up.
Conclusion: we eliminated a number of uninformative independent variables and found the variables most significantly associated with the response variable.
final model formula = [response_class ~ foreign_worker + personal_stat + deptor_stat + other_instalment_plans + Duration + Credit_Amt + SavingsAcc + Installment_rate + Credit_Hist + Purpose + checkingAcc]
For reference, the description of the Credit_Hist values is kept below.
Credit_Hist: (qualitative)
+ Credit history
- A30 : no credits taken/ all credits paid back duly
- A31 : all credits at this bank paid back duly
- A32 : existing credits paid back duly till now
- A33 : delay in paying off in the past
- A34 : critical account/other credits existing (not at this bank)
From research question 1, we obtained a few explanatory variables that are significant towards the response variable; Credit_Hist is one of them. However, we need to analyse why customers with a critical credit history (the A34 group) have a higher approval rate. The A32 group also has a higher approval rate.
I plotted side-by-side box plots and scatter plots of several categorical and numerical variables against Credit_Hist. Among them, I see a pattern for ‘Duration’, a numerical variable.
Below is the box plot of Duration versus Credit_Hist. A32 and A34 have the same median and both are condensed, while the other groups have somewhat higher medians.
boxplot(germanCreditDF$Duration ~ germanCreditDF$Credit_Hist)
Let’s take the A34 data and plot Duration side by side by class. The median Duration for good customers is lower than for bad customers. This suggests that customers with a critical credit history tend to apply for shorter durations, and those applications get approved. Note also the outliers among good customers.
A34_df <- germanCreditDF %>% filter(germanCreditDF$Credit_Hist == 'A34')
boxplot(A34_df$Duration ~ A34_df$class)
Let’s try the A32 group, which also has a higher approval rate. We see the same behaviour here too.
A32_df <- germanCreditDF %>% filter(germanCreditDF$Credit_Hist == 'A32')
boxplot(A32_df$Duration ~ A32_df$class)
Let’s do a hypothesis test to confirm this. Within the A34 group, the null hypothesis is that the median Duration is the same for good and bad customers, and the alternative is that it differs.
inference(y = A34_df$Duration, x = A34_df$class, est = "median", type = "ht", null = 0,
alternative = "twosided", method = "simulation")
## Response variable: numerical, Explanatory variable: categorical
## Difference between two medians
## Summary statistics:
## n_Bad = 50, median_Bad = 22.5, n_Good = 243, median_Good = 15,
## Observed difference between medians (Bad-Good) = 7.5
## H0: median_Bad - median_Good = 0
## HA: median_Bad - median_Good != 0
## p-value = 0.0106
Since the p-value is less than .05 (the significance level), we reject the null hypothesis.
The duration of the credit matters: customers with a critical credit history tend to apply for shorter durations and those applications get approved, whereas longer-duration applications are mostly classified as bad risks.
In question 1 we determined the significance of the explanatory variables using glm with a binomial family, i.e. logistic regression. Now let’s try some classification models and see how well we can predict the response variable from these explanatory variables.
Since these models expect numeric inputs, we use the numeric version of the data (german.data-numeric.txt).
# install.packages("e1071")
library('e1071')
germanCreditDF_numeric <- read.table("C:/Users/Charls/Documents/CunyMSDS/606-Statistical Analsyis/final-project/data/german.data-numeric.txt", header = F)
head(germanCreditDF_numeric)
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20
## 1 1 6 4 12 5 5 3 4 1 67 3 2 1 2 1 0 0 1 0 0
## 2 2 48 2 60 1 3 2 2 1 22 3 1 1 1 1 0 0 1 0 0
## 3 4 12 4 21 1 4 3 3 1 49 3 1 2 1 1 0 0 1 0 0
## 4 1 42 2 79 1 4 3 4 2 45 3 1 2 1 1 0 0 0 0 0
## 5 1 24 3 49 1 3 3 4 4 53 3 2 2 1 1 1 0 1 0 0
## 6 4 36 2 91 5 3 3 4 4 35 3 1 2 2 1 0 0 1 0 0
## V21 V22 V23 V24 V25
## 1 1 0 0 1 1
## 2 1 0 0 1 2
## 3 1 0 1 0 1
## 4 0 0 0 1 1
## 5 0 0 0 1 2
## 6 0 0 1 0 1
colnames(germanCreditDF_numeric) <- column_headers
germanCreditDF_numeric$response_class <- sapply(germanCreditDF$Customer_class, function(x){
switch(as.character(x), "1" = 1, "2" = 0)
})
nrow(germanCreditDF_numeric)
## [1] 1000
library(caTools)
set.seed(123)
split <- sample.split(germanCreditDF_numeric$response_class, SplitRatio = 0.75)
training_set_num <- subset(germanCreditDF_numeric , split == TRUE)
test_set_num <- subset(germanCreditDF_numeric , split == FALSE)
naive_bayes_regressor <- naiveBayes(x = training_set_num[, c("foreign_worker" , "personal_stat" , "deptor_stat" , "other_instalment_plans" , "Duration" , "Credit_Amt" , "SavingsAcc" , "Installment_rate" , "Credit_Hist" , "Purpose" , "checkingAcc")], y = as.factor(training_set_num$response_class))
y_pred <- predict(naive_bayes_regressor, newdata = test_set_num[, c("foreign_worker" , "personal_stat" , "deptor_stat" , "other_instalment_plans" , "Duration" , "Credit_Amt" , "SavingsAcc" , "Installment_rate" , "Credit_Hist" , "Purpose" , "checkingAcc")])
# Making the Confusion Matrix
cm = table(test_set_num$response_class, y_pred)
cm
## y_pred
## 0 1
## 0 38 37
## 1 24 151
accuracy <- (38 + 151)/nrow(test_set_num)
accuracy
## [1] 0.756
library(class)
y_pred = knn(train = training_set_num[, c("foreign_worker" , "personal_stat" , "deptor_stat" , "other_instalment_plans" , "Duration" , "Credit_Amt" , "SavingsAcc" , "Installment_rate" , "Credit_Hist" , "Purpose" , "checkingAcc")],
test = test_set_num[, c("foreign_worker" , "personal_stat" , "deptor_stat" , "other_instalment_plans" , "Duration" , "Credit_Amt" , "SavingsAcc" , "Installment_rate" , "Credit_Hist" , "Purpose" , "checkingAcc")],
cl = as.factor(training_set_num$response_class),
k = 5,
prob = TRUE)
cm = table(test_set_num$response_class, y_pred)
cm
## y_pred
## 0 1
## 0 25 50
## 1 23 152
accuracy <- (25 + 152)/nrow(test_set_num)
accuracy
## [1] 0.708
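One likely reason k-NN lags the other models is that it is distance based and the predictors were not scaled, so a large-valued column such as Credit_Amt dominates the distance. Below is a minimal sketch of standardizing the predictors before running knn; it is not part of the results reported here and the accuracy would need to be re-checked.
# Sketch: standardize predictors before k-NN so no single column dominates the distance
knn_cols <- c("foreign_worker", "personal_stat", "deptor_stat", "other_instalment_plans",
              "Duration", "Credit_Amt", "SavingsAcc", "Installment_rate",
              "Credit_Hist", "Purpose", "checkingAcc")
train_sc <- scale(training_set_num[, knn_cols])
test_sc  <- scale(test_set_num[, knn_cols],
                  center = attr(train_sc, "scaled:center"),
                  scale  = attr(train_sc, "scaled:scale"))
y_pred_scaled <- knn(train = train_sc, test = test_sc,
                     cl = as.factor(training_set_num$response_class), k = 5)
table(test_set_num$response_class, y_pred_scaled)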
# install.packages("rpart")
library(rpart)
classifer_dt <- rpart(data = training_set_num, formula = as.factor(response_class) ~ foreign_worker + personal_stat + deptor_stat + other_instalment_plans + Duration + Credit_Amt + SavingsAcc + Installment_rate + Credit_Hist + Purpose + checkingAcc)
y_pred <- predict(classifer_dt, newdata = test_set_num[, c("foreign_worker" , "personal_stat" , "deptor_stat" , "other_instalment_plans" , "Duration" , "Credit_Amt" , "SavingsAcc" , "Installment_rate" , "Credit_Hist" , "Purpose" , "checkingAcc")], type = 'class')
cm = table(test_set_num$response_class, y_pred)
cm
## y_pred
## 0 1
## 0 29 46
## 1 20 155
accuracy <- ( 29 + 155)/nrow(test_set_num)
accuracy
## [1] 0.736
# install.packages("randomForest")
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
classifer_rf <- randomForest(x = training_set_num[, c("foreign_worker" , "personal_stat" , "deptor_stat" , "other_instalment_plans" , "Duration" , "Credit_Amt" , "SavingsAcc" , "Installment_rate" , "Credit_Hist" , "Purpose" , "checkingAcc")] ,
y = as.factor(training_set_num$response_class), ntree = 15)
y_pred <- predict(classifer_rf, newdata = test_set_num[, c("foreign_worker" , "personal_stat" , "deptor_stat" , "other_instalment_plans" , "Duration" , "Credit_Amt" , "SavingsAcc" , "Installment_rate" , "Credit_Hist" , "Purpose" , "checkingAcc")], type = 'class')
# Making the Confusion Matrix
cm <- table(test_set_num$response_class, y_pred)
cm
## y_pred
## 0 1
## 0 36 39
## 1 16 159
accuracy <- ( 32 + 158)/nrow(test_set_num)
accuracy
## [1] 0.76
The table below shows the classification models we tried and their accuracy. Among them, Naive Bayes and Random Forest give the best accuracy.
c("Naive Bayes", "Knn", "Decision Tree", "Random Forest" )
## [1] "Naive Bayes" "Knn" "Decision Tree" "Random Forest"
c(75.6 ,70.8,73.6,76 )
## [1] 75.6 70.8 73.6 76.0
acc.DF <- data.frame(Model=c("Naive Bayes", "Knn", "Decision Tree", "Random Forest" ) , Accuracy = c(75.6 ,70.8,73.6,76 ))
library(knitr)
library(kableExtra)
acc.DF %>%
kable() %>%
kable_styling()
Model | Accuracy |
---|---|
Naive Bayes | 75.6 |
Knn | 70.8 |
Decision Tree | 73.6 |
Random Forest | 76.0 |