Two datasets are provided. The original dataset, in the form provided by Prof. Hofmann, contains categorical/symbolic attributes and is in the file “german.data”.
For algorithms that need numerical attributes, Strathclyde University produced the file “german.data-numeric”.
I decided to use the original data file (german.data). It has 20 attribute columns plus the class column; the column descriptions are provided in german.doc, and I created a vector of header names based on that documentation.
The R code snippet below reads the german.data file and creates a data frame with the prepared header values.
germanCreditDF <- read.delim("C:/Users/Charls/Documents/CunyMSDS/606-Statistical Analsyis/final-project/data/german.data.txt", sep = " ", header = F)
column_headers <- c("checkingAcc", "Duration", "Credit_Hist", "Purpose" ,"Credit_Amt" , "SavingsAcc" , "Employment_Stat", "Installment_rate", "personal_stat" , "deptor_stat" ,"residence_in_years" ,"Property" , "age","other_instalment_plans" , "Housing" , "no_of_credits" ,"Job_type" , "no_liable" , "Telephone" ,
"foreign_worker", "Customer_class" )
colnames(germanCreditDF) <- column_headers
The structure of the data frame is shown below. Along with the 20 feature columns, there is a response variable called Customer_class (value 1 = good credit risk, value 2 = bad credit risk).
str(germanCreditDF)
## 'data.frame': 1000 obs. of 21 variables:
## $ checkingAcc : Factor w/ 4 levels "A11","A12","A13",..: 1 2 4 1 1 4 4 2 4 2 ...
## $ Duration : int 6 48 12 42 24 36 24 36 12 30 ...
## $ Credit_Hist : Factor w/ 5 levels "A30","A31","A32",..: 5 3 5 3 4 3 3 3 3 5 ...
## $ Purpose : Factor w/ 10 levels "A40","A41","A410",..: 5 5 8 4 1 8 4 2 5 1 ...
## $ Credit_Amt : int 1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
## $ SavingsAcc : Factor w/ 5 levels "A61","A62","A63",..: 5 1 1 1 1 5 3 1 4 1 ...
## $ Employment_Stat : Factor w/ 5 levels "A71","A72","A73",..: 5 3 4 4 3 3 5 3 4 1 ...
## $ Installment_rate : int 4 2 2 2 3 2 3 2 2 4 ...
## $ personal_stat : Factor w/ 4 levels "A91","A92","A93",..: 3 2 3 3 3 3 3 3 1 4 ...
## $ deptor_stat : Factor w/ 3 levels "A101","A102",..: 1 1 1 3 1 1 1 1 1 1 ...
## $ residence_in_years : int 4 2 3 4 4 4 4 2 4 2 ...
## $ Property : Factor w/ 4 levels "A121","A122",..: 1 1 1 2 4 4 2 3 1 3 ...
## $ age : int 67 22 49 45 53 35 53 35 61 28 ...
## $ other_instalment_plans: Factor w/ 3 levels "A141","A142",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ Housing : Factor w/ 3 levels "A151","A152",..: 2 2 2 3 3 3 2 1 2 2 ...
## $ no_of_credits : int 2 1 1 1 2 1 1 1 1 2 ...
## $ Job_type : Factor w/ 4 levels "A171","A172",..: 3 3 2 3 3 2 3 4 2 4 ...
## $ no_liable : int 1 1 2 2 2 2 1 1 1 1 ...
## $ Telephone : Factor w/ 2 levels "A191","A192": 2 1 1 1 1 2 1 2 1 1 ...
## $ foreign_worker : Factor w/ 2 levels "A201","A202": 1 1 1 1 1 1 1 1 1 1 ...
## $ Customer_class : int 1 2 1 1 2 1 1 1 1 2 ...
Find the significance of a few quantitative and qualitative variables (Duration, Credit_Amt, age, foreign_worker, Employment_Stat, Purpose and Credit_Hist) on the response variable. Using a backward elimination process, find the variables most strongly associated with the response variable.
From the inference analysis, we noted an anomaly in Credit_Hist: applicants with critical credit histories have a higher approval rate and are considered good customers. What is the reason behind this observation?
Try different machine learning models to predict the customer class from the most significant variables identified above, and gauge their accuracy using a confusion matrix and other means.
The cases are observations on customers who applied for credit at a German bank, together with the bank’s decision on whether the applicant is a good or bad credit risk. The observations include details such as credit history, employment, credit amount, purpose and bank account details.
There are 1000 cases in the dataset for this research activity.
The data is available in the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/datasets.html) and was downloaded to perform this regression study.
This is an observational study, since we are analysing data collected as observations rather than from a designed experiment.
The dataset page in the UCI Machine Learning Repository is: http://archive.ics.uci.edu/ml/datasets/Statlog+%28German+Credit+Data%29
Dua, D. and Karra Taniskidou, E. (2017). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
Customer_class is the response variable. It is a qualitative variable describing the bank’s assessment of the credit application: if the bank approves, the customer is classified as good, otherwise as bad.
There are 20 independent variables, a mix of quantitative and qualitative variables.
Credit_Amt: (numerical)
+ Credit amount
Housing: (qualitative)
+ Housing
- A151 : rent
- A152 : own
- A153 : for free
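All of the qualitative attributes are stored as A-codes, so it can help to attach readable labels while exploring them. Below is a minimal sketch for Housing, assuming the code meanings listed above from german.doc; the column name Housing_label is only for illustration.
# Sketch: relabel the coded Housing factor with readable labels
# (assumes the A151/A152/A153 meanings from german.doc; Housing_label is illustrative)
germanCreditDF$Housing_label <- factor(germanCreditDF$Housing,
                                       levels = c("A151", "A152", "A153"),
                                       labels = c("rent", "own", "for free"))
table(germanCreditDF$Housing_label)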
Next we provide summary statistics and appropriate visualizations (histograms, box plots, bar plots) related to the research questions.
Let’s start with some basic inference by looking at distributions and box plots. Based on this analysis, we will derive our research questions.
First we will look at the distributions of some of the quantitative variables.
Duration
Below are the histogram and the box plot of the ‘Duration’ variable. The histogram shows the variable is slightly right skewed, and the box plot shows that the median for the good customer class is lower than for the bad customer class. There are, however, outliers in the good customer class, which suggests that other factors may also be correlated with the response variable.
With respect to spread, the good customer class is more condensed than the bad customer class. The box plots alone do not show a strong relationship with the response variable, so let’s confirm whether Duration is statistically significant using a hypothesis test.
hist(germanCreditDF$Duration)
# Derive a readable class label ("Good"/"Bad") from the coded Customer_class
germanCreditDF$class <- sapply(germanCreditDF$Customer_class, function(x){
  switch(as.character(x), "1" = "Good", "2" = "Bad")
})
# Derive a 0/1 response (1 = good risk, 0 = bad risk) for logistic regression
germanCreditDF$response_class <- sapply(germanCreditDF$Customer_class, function(x){
  switch(as.character(x), "1" = 1, "2" = 0)
})
library(ggplot2)
ggplot(germanCreditDF, aes(x=class, y=Duration)) +
geom_boxplot(color="red", fill="orange", alpha=0.2)
Let’s define the null and alternative hypotheses in terms of the median, since the distribution is skewed.
H0: the median Duration for good customers is the same as for bad customers.
HA: the median Duration for good customers is different from that for bad customers.
Conclusion: since the p-value is less than .05, the null hypothesis can be rejected. In short, the hypothesis test indicates that Duration is statistically significantly associated with the response variable.
source('http://www.openintro.org/stat/slides/inference.R')
inference(y = germanCreditDF$Duration, x = germanCreditDF$class, est = "median", type = "ht", null = 0,
alternative = "twosided", method = "simulation")
## Response variable: numerical, Explanatory variable: categorical
## Difference between two medians
## Summary statistics:
## n_Bad = 300, median_Bad = 24, n_Good = 700, median_Good = 18,
## Observed difference between medians (Bad-Good) = 6
## H0: median_Bad - median_Good = 0
## HA: median_Bad - median_Good != 0
## p-value = 4e-04
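As a cross-check (not part of the inference() analysis above), a nonparametric base-R test can be run on the same comparison. This is only a sketch: the Wilcoxon rank-sum test assesses a shift in the Duration distribution between the two classes rather than the simulated difference in medians, but it should point in the same direction.
# Sketch of a base-R cross-check: Wilcoxon rank-sum test of Duration by class
wilcox.test(Duration ~ class, data = germanCreditDF)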
Let’s move on to the other quantitative variables.
Credit_Amt
From the box plot we do not see much of a relationship: the medians are roughly the same, the good customer class is more condensed than the bad customer class, and there are more outliers in the good customer class.
Let’s confirm whether Credit_Amt is statistically significant to the response variable using a hypothesis test.
# Quantitative variable comparison (using box plots)
hist(germanCreditDF$Credit_Amt)
# Set a unique color with fill, colour, and alpha
ggplot(germanCreditDF, aes(x=class, y=Credit_Amt)) +
geom_boxplot(color="red", fill="orange", alpha=0.2)
Let’s define the null and alternative hypotheses in terms of the median, since the distribution is skewed.
H0: the median Credit_Amt for good customers is the same as for bad customers.
HA: the median Credit_Amt for good customers is different from that for bad customers.
Conclusion: the p-value is just below .05, so the null hypothesis can be rejected, though only marginally. Using the hypothesis test, we conclude that Credit_Amt is statistically significant to the response variable.
inference(y = germanCreditDF$Credit_Amt, x = germanCreditDF$class, est = "median", type = "ht", null = 0,
alternative = "twosided", method = "simulation")
## Response variable: numerical, Explanatory variable: categorical
## Difference between two medians
## Summary statistics:
## n_Bad = 300, median_Bad = 2574.5, n_Good = 700, median_Good = 2244,
## Observed difference between medians (Bad-Good) = 330.5
## H0: median_Bad - median_Good = 0
## HA: median_Bad - median_Good != 0
## p-value = 0.0496
no_of_credits
The box plots for the two classes look the same, so I expect no_of_credits to have no significance towards the response variable.
hist(germanCreditDF$no_of_credits)
ggplot(germanCreditDF, aes(x=class, y=no_of_credits)) +
geom_boxplot(color="red", fill="orange", alpha=0.2)
According to the hypothesis test, the p-value is greater than .05, so we conclude that no_of_credits is not significantly associated with the response variable.
inference(y = germanCreditDF$no_of_credits, x = germanCreditDF$class, est = "mean", type = "ht", null = 0,
alternative = "twosided", method = "theoretical")
## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_Bad = 300, mean_Bad = 1.3667, sd_Bad = 0.5597
## n_Good = 700, mean_Good = 1.4243, sd_Good = 0.5847
## Observed difference between means (Bad-Good) = -0.0576
## H0: mu_Bad - mu_Good = 0
## HA: mu_Bad - mu_Good != 0
## Standard error = 0.039
## Test statistic: Z = -1.472
## p-value = 0.141
age
Based on the box plot, we do not see much of a relationship for age either: the medians are roughly the same and the spreads overlap. According to the hypothesis test, however, we can reject the null hypothesis, so age is significantly associated with the response variable.
hist(germanCreditDF$age)
ggplot(germanCreditDF, aes(x=class, y=age)) +
geom_boxplot(color="red", fill="orange", alpha=0.2)
inference(y = germanCreditDF$age, x = germanCreditDF$class, est = "median", type = "ht", null = 0,
alternative = "twosided", method = "simulation")
## Response variable: numerical, Explanatory variable: categorical
## Difference between two medians
## Summary statistics:
## n_Bad = 300, median_Bad = 31, n_Good = 700, median_Good = 34,
## Observed difference between medians (Bad-Good) = -3
## H0: median_Bad - median_Good = 0
## HA: median_Bad - median_Good != 0
## p-value = 0.0222
That covers the quantitative variables. Let’s look at some of the qualitative variables using contingency tables, conditional probabilities (column-wise) and a few side-by-side bar plots.
foreign_worker
The conditional probabilities show that although there are fewer non-foreign applicants (A202), their probability of approval is higher than that of foreign workers (A201). Note, however, that foreign workers make up the large majority of applicants.
# Qualitative variable comparison (using contingency table)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Contingency table
foreignDF <- germanCreditDF %>% select(class, foreign_worker)
table(foreignDF)
## foreign_worker
## class A201 A202
## Bad 296 4
## Good 667 33
# Conditional probability table
prop.table(table(foreignDF), 2)
## foreign_worker
## class A201 A202
## Bad 0.3073728 0.1081081
## Good 0.6926272 0.8918919
# Side-by-side barplot of contingency table
barplot(table(foreignDF), beside = T, legend = T, main = "foreign v/s customer class", col=c("coral", "aquamarine3"), xlab = "foreign_worker - A201 : yes - A202 : no")
Employment_Stat
Unemployment and fewer years of employment show a lower approval rate, so there appears to be some relationship.
# Qualitative variable comparison (using contingency table)
library(dplyr)
# Contingency table
employmentDF <- germanCreditDF %>% select(class, Employment_Stat)
table(employmentDF)
## Employment_Stat
## class A71 A72 A73 A74 A75
## Bad 23 70 104 39 64
## Good 39 102 235 135 189
# Conditional probability table
prop.table(table(employmentDF), 2)
## Employment_Stat
## class A71 A72 A73 A74 A75
## Bad 0.3709677 0.4069767 0.3067847 0.2241379 0.2529644
## Good 0.6290323 0.5930233 0.6932153 0.7758621 0.7470356
# Side-by-side barplot of contingency table
barplot(table(employmentDF), beside = T, legend = T, main = "Employment v/s customer class", col=c("coral", "aquamarine3"))
Purpose
The conditional probabilities show that used-car applications (A41) have a higher approval rate, while education loans (A46) have roughly even chances of being classified good or bad. So there appears to be some relationship.
Here are the label definitions for each Purpose code, taken from german.doc:
- A40 : car (new)
- A41 : car (used)
- A410 : others
- A42 : furniture/equipment
- A43 : radio/television
- A44 : domestic appliances
- A45 : repairs
- A46 : education
- A48 : retraining
- A49 : business
# Qualitative variable comparison (using contingency table)
library(dplyr)
# Contingency table
PurposeDF <- germanCreditDF %>% select(class, Purpose)
table(PurposeDF)
## Purpose
## class A40 A41 A410 A42 A43 A44 A45 A46 A48 A49
## Bad 89 17 5 58 62 4 8 22 1 34
## Good 145 86 7 123 218 8 14 28 8 63
# Conditional probability table
prop.table(table(PurposeDF), 2)
## Purpose
## class A40 A41 A410 A42 A43 A44
## Bad 0.3803419 0.1650485 0.4166667 0.3204420 0.2214286 0.3333333
## Good 0.6196581 0.8349515 0.5833333 0.6795580 0.7785714 0.6666667
## Purpose
## class A45 A46 A48 A49
## Bad 0.3636364 0.4400000 0.1111111 0.3505155
## Good 0.6363636 0.5600000 0.8888889 0.6494845
# Side-by-side barplot of contingency table
barplot(table(PurposeDF), beside = T, legend = T, main = "Purpose v/s customer class", col=c("coral", "aquamarine3"))
Credit_Hist
‘A34 : critical account/other credits existing’ has a higher approval rate, which looks strange. Either there is a data anomaly or some other relationship is overriding the significance of this variable.
‘A33 : delay in paying off in the past’ also shows a higher approval rate.
# Qualitative variable comparison (using contingency table)
# Contingency table
Credit_HistDF <- germanCreditDF %>% select(class, Credit_Hist)
table(Credit_HistDF)
## Credit_Hist
## class A30 A31 A32 A33 A34
## Bad 25 28 169 28 50
## Good 15 21 361 60 243
# Conditional probability table
prop.table(table(Credit_HistDF), 2)
## Credit_Hist
## class A30 A31 A32 A33 A34
## Bad 0.6250000 0.5714286 0.3188679 0.3181818 0.1706485
## Good 0.3750000 0.4285714 0.6811321 0.6818182 0.8293515
# Side-by-side barplot of contingency table
barplot(table(Credit_HistDF), beside = T, legend = T, main = "Credit_History v/s customer class", col=c("coral", "aquamarine3"))
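As a more formal check of the association suggested by the conditional probabilities (a sketch added alongside the original analysis), a chi-squared test of independence can be applied to the Credit_Hist contingency table.
# Sketch: chi-squared test of independence between Credit_Hist and customer class
chisq.test(table(germanCreditDF$Credit_Hist, germanCreditDF$class))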
The analysis below is performed to answer the research questions posed above.
Let’s try the manual approach first. The steps below perform backward elimination by hand: fit the full logistic model, then drop groups of insignificant variables and refit. This is often called step-wise regression.
full_regressor <- glm(formula = response_class ~ . - Customer_class - class , data = germanCreditDF, family = 'binomial' )
summary(full_regressor)
##
## Call:
## glm(formula = response_class ~ . - Customer_class - class, family = "binomial",
## data = germanCreditDF)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6116 -0.7095 0.3752 0.6994 2.3410
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.005e-01 1.084e+00 -0.369 0.711869
## checkingAccA12 3.749e-01 2.179e-01 1.720 0.085400 .
## checkingAccA13 9.657e-01 3.692e-01 2.616 0.008905 **
## checkingAccA14 1.712e+00 2.322e-01 7.373 1.66e-13 ***
## Duration -2.786e-02 9.296e-03 -2.997 0.002724 **
## Credit_HistA31 -1.434e-01 5.489e-01 -0.261 0.793921
## Credit_HistA32 5.861e-01 4.305e-01 1.362 0.173348
## Credit_HistA33 8.532e-01 4.717e-01 1.809 0.070470 .
## Credit_HistA34 1.436e+00 4.399e-01 3.264 0.001099 **
## PurposeA41 1.666e+00 3.743e-01 4.452 8.51e-06 ***
## PurposeA410 1.489e+00 7.764e-01 1.918 0.055163 .
## PurposeA42 7.916e-01 2.610e-01 3.033 0.002421 **
## PurposeA43 8.916e-01 2.471e-01 3.609 0.000308 ***
## PurposeA44 5.228e-01 7.623e-01 0.686 0.492831
## PurposeA45 2.164e-01 5.500e-01 0.393 0.694000
## PurposeA46 -3.628e-02 3.965e-01 -0.092 0.927082
## PurposeA48 2.059e+00 1.212e+00 1.699 0.089297 .
## PurposeA49 7.401e-01 3.339e-01 2.216 0.026668 *
## Credit_Amt -1.283e-04 4.444e-05 -2.887 0.003894 **
## SavingsAccA62 3.577e-01 2.861e-01 1.250 0.211130
## SavingsAccA63 3.761e-01 4.011e-01 0.938 0.348476
## SavingsAccA64 1.339e+00 5.249e-01 2.551 0.010729 *
## SavingsAccA65 9.467e-01 2.625e-01 3.607 0.000310 ***
## Employment_StatA72 6.691e-02 4.270e-01 0.157 0.875475
## Employment_StatA73 1.828e-01 4.105e-01 0.445 0.656049
## Employment_StatA74 8.310e-01 4.455e-01 1.866 0.062110 .
## Employment_StatA75 2.766e-01 4.134e-01 0.669 0.503410
## Installment_rate -3.301e-01 8.828e-02 -3.739 0.000185 ***
## personal_statA92 2.755e-01 3.865e-01 0.713 0.476040
## personal_statA93 8.161e-01 3.799e-01 2.148 0.031718 *
## personal_statA94 3.671e-01 4.537e-01 0.809 0.418448
## deptor_statA102 -4.360e-01 4.101e-01 -1.063 0.287700
## deptor_statA103 9.786e-01 4.243e-01 2.307 0.021072 *
## residence_in_years -4.776e-03 8.641e-02 -0.055 0.955920
## PropertyA122 -2.814e-01 2.534e-01 -1.111 0.266630
## PropertyA123 -1.945e-01 2.360e-01 -0.824 0.409743
## PropertyA124 -7.304e-01 4.245e-01 -1.721 0.085308 .
## age 1.454e-02 9.222e-03 1.576 0.114982
## other_instalment_plansA142 1.232e-01 4.119e-01 0.299 0.764878
## other_instalment_plansA143 6.463e-01 2.391e-01 2.703 0.006871 **
## HousingA152 4.436e-01 2.347e-01 1.890 0.058715 .
## HousingA153 6.839e-01 4.770e-01 1.434 0.151657
## no_of_credits -2.721e-01 1.895e-01 -1.436 0.151109
## Job_typeA172 -5.361e-01 6.796e-01 -0.789 0.430160
## Job_typeA173 -5.547e-01 6.549e-01 -0.847 0.397015
## Job_typeA174 -4.795e-01 6.623e-01 -0.724 0.469086
## no_liable -2.647e-01 2.492e-01 -1.062 0.288249
## TelephoneA192 3.000e-01 2.013e-01 1.491 0.136060
## foreign_workerA202 1.392e+00 6.258e-01 2.225 0.026095 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1221.73 on 999 degrees of freedom
## Residual deviance: 895.82 on 951 degrees of freedom
## AIC: 993.82
##
## Number of Fisher Scoring iterations: 5
Removing the variables ‘residence_in_years’, ‘Job_type’, ‘no_liable’ and ‘no_of_credits’, which have high p-values in the full model.
partial_regrssor1 <- glm(formula = response_class ~ . - Customer_class - class - residence_in_years - Job_type - no_liable - no_of_credits , data = germanCreditDF, family = 'binomial' )
summary(partial_regrssor1)
##
## Call:
## glm(formula = response_class ~ . - Customer_class - class - residence_in_years -
## Job_type - no_liable - no_of_credits, family = "binomial",
## data = germanCreditDF)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7065 -0.7404 0.3780 0.6972 2.2990
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.587e+00 8.544e-01 -1.858 0.063200 .
## checkingAccA12 4.159e-01 2.165e-01 1.921 0.054670 .
## checkingAccA13 1.046e+00 3.657e-01 2.860 0.004232 **
## checkingAccA14 1.709e+00 2.310e-01 7.399 1.38e-13 ***
## Duration -2.709e-02 9.141e-03 -2.964 0.003040 **
## Credit_HistA31 1.911e-02 5.312e-01 0.036 0.971297
## Credit_HistA32 7.369e-01 4.126e-01 1.786 0.074086 .
## Credit_HistA33 8.387e-01 4.678e-01 1.793 0.073013 .
## Credit_HistA34 1.394e+00 4.349e-01 3.205 0.001352 **
## PurposeA41 1.643e+00 3.670e-01 4.479 7.52e-06 ***
## PurposeA410 1.536e+00 7.639e-01 2.011 0.044308 *
## PurposeA42 8.093e-01 2.585e-01 3.131 0.001741 **
## PurposeA43 9.059e-01 2.462e-01 3.679 0.000234 ***
## PurposeA44 5.908e-01 7.655e-01 0.772 0.440254
## PurposeA45 1.831e-01 5.507e-01 0.332 0.739581
## PurposeA46 -3.402e-02 3.975e-01 -0.086 0.931798
## PurposeA48 2.067e+00 1.227e+00 1.684 0.092111 .
## PurposeA49 7.305e-01 3.318e-01 2.202 0.027698 *
## Credit_Amt -1.251e-04 4.312e-05 -2.900 0.003726 **
## SavingsAccA62 3.327e-01 2.822e-01 1.179 0.238487
## SavingsAccA63 4.438e-01 3.969e-01 1.118 0.263571
## SavingsAccA64 1.276e+00 5.150e-01 2.478 0.013205 *
## SavingsAccA65 9.688e-01 2.610e-01 3.712 0.000206 ***
## Employment_StatA72 -8.459e-02 3.817e-01 -0.222 0.824597
## Employment_StatA73 1.666e-02 3.560e-01 0.047 0.962681
## Employment_StatA74 6.440e-01 3.968e-01 1.623 0.104628
## Employment_StatA75 8.154e-02 3.665e-01 0.222 0.823930
## Installment_rate -3.182e-01 8.644e-02 -3.681 0.000232 ***
## personal_statA92 2.795e-01 3.829e-01 0.730 0.465424
## personal_statA93 7.497e-01 3.737e-01 2.006 0.044817 *
## personal_statA94 3.648e-01 4.518e-01 0.807 0.419494
## deptor_statA102 -4.431e-01 4.104e-01 -1.080 0.280245
## deptor_statA103 9.548e-01 4.248e-01 2.247 0.024619 *
## PropertyA122 -2.834e-01 2.512e-01 -1.128 0.259278
## PropertyA123 -2.056e-01 2.325e-01 -0.884 0.376633
## PropertyA124 -6.624e-01 4.210e-01 -1.573 0.115607
## age 1.357e-02 9.044e-03 1.500 0.133501
## other_instalment_plansA142 1.096e-01 4.092e-01 0.268 0.788800
## other_instalment_plansA143 6.741e-01 2.375e-01 2.838 0.004538 **
## HousingA152 4.513e-01 2.258e-01 1.999 0.045578 *
## HousingA153 6.118e-01 4.735e-01 1.292 0.196285
## TelephoneA192 2.911e-01 1.876e-01 1.552 0.120631
## foreign_workerA202 1.437e+00 6.301e-01 2.281 0.022544 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1221.73 on 999 degrees of freedom
## Residual deviance: 899.93 on 957 degrees of freedom
## AIC: 985.93
##
## Number of Fisher Scoring iterations: 5
Let’s remove Property, since its levels all have high p-values.
partial_regrssor2 <- glm(formula = response_class ~ . - Customer_class - class - residence_in_years - Job_type - no_liable - no_of_credits - Property , data = germanCreditDF, family = 'binomial' )
summary(partial_regrssor2)
##
## Call:
## glm(formula = response_class ~ . - Customer_class - class - residence_in_years -
## Job_type - no_liable - no_of_credits - Property, family = "binomial",
## data = germanCreditDF)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7586 -0.7379 0.3888 0.6961 2.3536
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.771e+00 8.388e-01 -2.111 0.034785 *
## checkingAccA12 4.225e-01 2.147e-01 1.968 0.049098 *
## checkingAccA13 1.056e+00 3.648e-01 2.895 0.003789 **
## checkingAccA14 1.730e+00 2.303e-01 7.513 5.78e-14 ***
## Duration -2.805e-02 9.035e-03 -3.105 0.001901 **
## Credit_HistA31 2.768e-02 5.295e-01 0.052 0.958316
## Credit_HistA32 7.686e-01 4.128e-01 1.862 0.062627 .
## Credit_HistA33 8.564e-01 4.678e-01 1.831 0.067126 .
## Credit_HistA34 1.421e+00 4.353e-01 3.264 0.001100 **
## PurposeA41 1.635e+00 3.668e-01 4.459 8.25e-06 ***
## PurposeA410 1.571e+00 7.686e-01 2.043 0.041005 *
## PurposeA42 7.800e-01 2.557e-01 3.050 0.002289 **
## PurposeA43 9.196e-01 2.453e-01 3.749 0.000178 ***
## PurposeA44 5.821e-01 7.558e-01 0.770 0.441197
## PurposeA45 1.791e-01 5.435e-01 0.330 0.741679
## PurposeA46 -1.182e-01 3.934e-01 -0.300 0.763815
## PurposeA48 2.038e+00 1.221e+00 1.669 0.095161 .
## PurposeA49 7.620e-01 3.301e-01 2.308 0.020983 *
## Credit_Amt -1.316e-04 4.259e-05 -3.091 0.001998 **
## SavingsAccA62 2.715e-01 2.784e-01 0.975 0.329510
## SavingsAccA63 4.178e-01 3.982e-01 1.049 0.294010
## SavingsAccA64 1.265e+00 5.117e-01 2.473 0.013410 *
## SavingsAccA65 9.504e-01 2.594e-01 3.663 0.000249 ***
## Employment_StatA72 -8.348e-02 3.815e-01 -0.219 0.826780
## Employment_StatA73 6.315e-02 3.555e-01 0.178 0.859030
## Employment_StatA74 6.694e-01 3.971e-01 1.686 0.091831 .
## Employment_StatA75 7.968e-02 3.670e-01 0.217 0.828114
## Installment_rate -3.262e-01 8.609e-02 -3.789 0.000151 ***
## personal_statA92 2.732e-01 3.802e-01 0.719 0.472368
## personal_statA93 7.373e-01 3.705e-01 1.990 0.046624 *
## personal_statA94 3.980e-01 4.486e-01 0.887 0.374903
## deptor_statA102 -4.870e-01 4.060e-01 -1.199 0.230392
## deptor_statA103 9.804e-01 4.167e-01 2.353 0.018637 *
## age 1.384e-02 9.000e-03 1.537 0.124203
## other_instalment_plansA142 1.556e-01 4.093e-01 0.380 0.703768
## other_instalment_plansA143 7.013e-01 2.360e-01 2.972 0.002963 **
## HousingA152 4.514e-01 2.241e-01 2.014 0.044004 *
## HousingA153 2.117e-01 3.437e-01 0.616 0.537789
## TelephoneA192 2.568e-01 1.861e-01 1.380 0.167706
## foreign_workerA202 1.414e+00 6.236e-01 2.268 0.023351 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1221.7 on 999 degrees of freedom
## Residual deviance: 902.8 on 960 degrees of freedom
## AIC: 982.8
##
## Number of Fisher Scoring iterations: 5
Let’s remove the Employment_Stat group, which has high p-values, and also remove Telephone.
partial_regrssor3 <- glm(formula = response_class ~ . - Customer_class - class - residence_in_years - Job_type - no_liable - no_of_credits - Property - Employment_Stat - Telephone, data = germanCreditDF, family = 'binomial' )
summary(partial_regrssor3)
##
## Call:
## glm(formula = response_class ~ . - Customer_class - class - residence_in_years -
## Job_type - no_liable - no_of_credits - Property - Employment_Stat -
## Telephone, family = "binomial", data = germanCreditDF)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7881 -0.7392 0.3867 0.7027 2.2832
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.787e+00 7.349e-01 -2.431 0.015042 *
## checkingAccA12 3.996e-01 2.115e-01 1.890 0.058802 .
## checkingAccA13 1.025e+00 3.620e-01 2.830 0.004652 **
## checkingAccA14 1.734e+00 2.279e-01 7.610 2.73e-14 ***
## Duration -2.668e-02 8.884e-03 -3.004 0.002667 **
## Credit_HistA31 1.288e-01 5.241e-01 0.246 0.805860
## Credit_HistA32 8.429e-01 4.083e-01 2.064 0.038976 *
## Credit_HistA33 9.351e-01 4.640e-01 2.015 0.043872 *
## Credit_HistA34 1.508e+00 4.306e-01 3.502 0.000462 ***
## PurposeA41 1.623e+00 3.645e-01 4.452 8.52e-06 ***
## PurposeA410 1.534e+00 7.461e-01 2.056 0.039829 *
## PurposeA42 7.352e-01 2.528e-01 2.908 0.003636 **
## PurposeA43 9.093e-01 2.433e-01 3.737 0.000186 ***
## PurposeA44 5.143e-01 7.295e-01 0.705 0.480770
## PurposeA45 9.557e-02 5.367e-01 0.178 0.858657
## PurposeA46 -1.255e-01 3.923e-01 -0.320 0.748916
## PurposeA48 2.115e+00 1.209e+00 1.749 0.080306 .
## PurposeA49 8.128e-01 3.252e-01 2.499 0.012457 *
## Credit_Amt -1.135e-04 4.068e-05 -2.789 0.005286 **
## SavingsAccA62 3.263e-01 2.760e-01 1.182 0.237080
## SavingsAccA63 4.332e-01 3.948e-01 1.097 0.272550
## SavingsAccA64 1.300e+00 5.075e-01 2.562 0.010394 *
## SavingsAccA65 9.589e-01 2.559e-01 3.747 0.000179 ***
## Installment_rate -3.191e-01 8.503e-02 -3.752 0.000175 ***
## personal_statA92 2.932e-01 3.735e-01 0.785 0.432492
## personal_statA93 8.263e-01 3.638e-01 2.271 0.023132 *
## personal_statA94 4.395e-01 4.412e-01 0.996 0.319167
## deptor_statA102 -5.075e-01 4.008e-01 -1.266 0.205402
## deptor_statA103 1.011e+00 4.172e-01 2.424 0.015333 *
## age 1.478e-02 8.315e-03 1.778 0.075416 .
## other_instalment_plansA142 8.431e-02 4.037e-01 0.209 0.834567
## other_instalment_plansA143 6.987e-01 2.351e-01 2.972 0.002955 **
## HousingA152 4.384e-01 2.210e-01 1.983 0.047313 *
## HousingA153 1.574e-01 3.397e-01 0.463 0.643044
## foreign_workerA202 1.326e+00 6.192e-01 2.142 0.032232 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1221.73 on 999 degrees of freedom
## Residual deviance: 912.82 on 965 degrees of freedom
## AIC: 982.82
##
## Number of Fisher Scoring iterations: 5
At this stage we have removed residence_in_years, Job_type, no_liable, no_of_credits, Property, Employment_Stat and Telephone, because they showed no significance towards the response variable. I am not sure whether some of the remaining variables with borderline p-values should also be removed.
The current explanatory variables after the manual elimination are: checkingAcc, Duration, Credit_Hist, Purpose, Credit_Amt, SavingsAcc, Installment_rate, personal_stat, deptor_stat, age, other_instalment_plans, Housing and foreign_worker.
Let’s carry out an automated step-wise backward elimination to see whether it agrees with our manual elimination.
step(full_regressor ,data= germanCreditDF , direction = "backward" ,test = "F")
## Start: AIC=993.82
## response_class ~ (checkingAcc + Duration + Credit_Hist + Purpose +
## Credit_Amt + SavingsAcc + Employment_Stat + Installment_rate +
## personal_stat + deptor_stat + residence_in_years + Property +
## age + other_instalment_plans + Housing + no_of_credits +
## Job_type + no_liable + Telephone + foreign_worker + Customer_class +
## class) - Customer_class - class
## Warning in drop1.glm(fit, scope$drop, scale = scale, trace = trace, k =
## k, : F test assumes 'quasibinomial' family
## Df Deviance AIC F value Pr(>F)
## - Job_type 3 896.56 988.56 0.2614 0.8532561
## - Property 3 899.08 991.08 1.1546 0.3260711
## - residence_in_years 1 895.82 991.82 0.0032 0.9545966
## - no_liable 1 896.94 992.94 1.1870 0.2762026
## <none> 895.82 993.82
## - no_of_credits 1 897.89 993.89 2.1948 0.1388046
## - Employment_Stat 4 904.03 994.03 2.1807 0.0692702 .
## - Housing 2 900.05 994.05 2.2453 0.1064598
## - Telephone 1 898.06 994.06 2.3783 0.1233632
## - age 1 898.34 994.34 2.6811 0.1018775
## - personal_stat 3 905.15 997.15 3.3009 0.0198254 *
## - deptor_stat 2 903.24 997.24 3.9394 0.0197790 *
## - foreign_worker 1 901.88 997.88 6.4369 0.0113364 *
## - other_instalment_plans 2 903.98 997.98 4.3351 0.0133603 *
## - Credit_Amt 1 904.28 1000.28 8.9799 0.0028005 **
## - Duration 1 904.87 1000.87 9.6054 0.0019971 **
## - SavingsAcc 4 915.63 1005.63 5.2571 0.0003436 ***
## - Installment_rate 1 910.27 1006.27 15.3396 9.623e-05 ***
## - Credit_Hist 4 917.62 1007.62 5.7873 0.0001328 ***
## - Purpose 9 931.12 1011.12 4.1640 2.752e-05 ***
## - checkingAcc 3 962.05 1054.05 23.4381 1.219e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=988.56
## response_class ~ checkingAcc + Duration + Credit_Hist + Purpose +
## Credit_Amt + SavingsAcc + Employment_Stat + Installment_rate +
## personal_stat + deptor_stat + residence_in_years + Property +
## age + other_instalment_plans + Housing + no_of_credits +
## no_liable + Telephone + foreign_worker
## Warning in drop1.glm(fit, scope$drop, scale = scale, trace = trace, k =
## k, : F test assumes 'quasibinomial' family
## Df Deviance AIC F value Pr(>F)
## - Property 3 899.79 985.79 1.1477 0.3288039
## - residence_in_years 1 896.57 986.57 0.0157 0.9002261
## - no_liable 1 897.67 987.67 1.1809 0.2774567
## - Employment_Stat 4 904.32 988.32 2.0656 0.0833203 .
## - no_of_credits 1 898.47 988.47 2.0405 0.1534852
## <none> 896.56 988.56
## - Housing 2 900.60 988.60 2.1501 0.1170347
## - Telephone 1 899.13 989.13 2.7431 0.0980011 .
## - age 1 899.19 989.19 2.8049 0.0943051 .
## - personal_stat 3 905.83 991.83 3.2905 0.0201057 *
## - deptor_stat 2 903.87 991.87 3.8932 0.0207048 *
## - foreign_worker 1 902.67 992.67 6.5056 0.0109085 *
## - other_instalment_plans 2 904.95 992.95 4.4668 0.0117249 *
## - Credit_Amt 1 905.31 995.31 9.3156 0.0023351 **
## - Duration 1 905.85 995.85 9.8940 0.0017096 **
## - SavingsAcc 4 917.02 1001.02 5.4434 0.0002461 ***
## - Installment_rate 1 911.45 1001.45 15.8462 7.392e-05 ***
## - Credit_Hist 4 918.12 1002.12 5.7370 0.0001453 ***
## - Purpose 9 931.82 1005.82 4.1694 2.698e-05 ***
## - checkingAcc 3 962.35 1048.35 23.3355 1.399e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=985.79
## response_class ~ checkingAcc + Duration + Credit_Hist + Purpose +
## Credit_Amt + SavingsAcc + Employment_Stat + Installment_rate +
## personal_stat + deptor_stat + residence_in_years + age +
## other_instalment_plans + Housing + no_of_credits + no_liable +
## Telephone + foreign_worker
## Warning in drop1.glm(fit, scope$drop, scale = scale, trace = trace, k =
## k, : F test assumes 'quasibinomial' family
## Df Deviance AIC F value Pr(>F)
## - residence_in_years 1 899.81 983.81 0.0215 0.8833737
## - no_liable 1 900.79 984.79 1.0564 0.3042884
## - Housing 2 903.47 985.47 1.9552 0.1420991
## - no_of_credits 1 901.49 985.49 1.8016 0.1798405
## <none> 899.79 985.79
## - Telephone 1 901.81 985.81 2.1412 0.1437136
## - Employment_Stat 4 907.85 985.85 2.1433 0.0735681 .
## - age 1 902.52 986.52 2.9017 0.0888118 .
## - personal_stat 3 908.67 988.67 3.1485 0.0243723 *
## - foreign_worker 1 905.83 989.83 6.4227 0.0114253 *
## - deptor_stat 2 908.05 990.05 4.3927 0.0126182 *
## - other_instalment_plans 2 908.87 990.87 4.8266 0.0082095 **
## - Credit_Amt 1 909.80 993.80 10.6403 0.0011456 **
## - Duration 1 909.99 993.99 10.8456 0.0010266 **
## - SavingsAcc 4 919.78 997.78 5.3148 0.0003098 ***
## - Installment_rate 1 915.56 999.56 16.7707 4.574e-05 ***
## - Credit_Hist 4 921.66 999.66 5.8145 0.0001264 ***
## - Purpose 9 936.35 1004.35 4.3202 1.571e-05 ***
## - checkingAcc 3 967.78 1047.78 24.1019 4.851e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=983.81
## response_class ~ checkingAcc + Duration + Credit_Hist + Purpose +
## Credit_Amt + SavingsAcc + Employment_Stat + Installment_rate +
## personal_stat + deptor_stat + age + other_instalment_plans +
## Housing + no_of_credits + no_liable + Telephone + foreign_worker
## Warning in drop1.glm(fit, scope$drop, scale = scale, trace = trace, k =
## k, : F test assumes 'quasibinomial' family
## Df Deviance AIC F value Pr(>F)
## - no_liable 1 900.81 982.81 1.0624 0.3029225
## - no_of_credits 1 901.53 983.53 1.8334 0.1760473
## - Telephone 1 901.81 983.81 2.1258 0.1451634
## <none> 899.81 983.81
## - Employment_Stat 4 907.86 983.86 2.1413 0.0738084 .
## - Housing 2 903.95 983.95 2.2024 0.1110975
## - age 1 902.53 984.53 2.8889 0.0895151 .
## - personal_stat 3 908.69 986.69 3.1515 0.0242734 *
## - foreign_worker 1 905.88 987.88 6.4574 0.0112061 *
## - deptor_stat 2 908.08 988.08 4.3996 0.0125313 *
## - other_instalment_plans 2 908.87 988.87 4.8214 0.0082517 **
## - Credit_Amt 1 909.80 991.80 10.6299 0.0011520 **
## - Duration 1 910.05 992.05 10.8974 0.0009986 ***
## - SavingsAcc 4 919.78 995.78 5.3152 0.0003095 ***
## - Installment_rate 1 915.59 997.59 16.8027 4.499e-05 ***
## - Credit_Hist 4 921.66 997.66 5.8151 0.0001263 ***
## - Purpose 9 936.35 1002.35 4.3223 1.559e-05 ***
## - checkingAcc 3 968.09 1046.09 24.2297 4.064e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=982.81
## response_class ~ checkingAcc + Duration + Credit_Hist + Purpose +
## Credit_Amt + SavingsAcc + Employment_Stat + Installment_rate +
## personal_stat + deptor_stat + age + other_instalment_plans +
## Housing + no_of_credits + Telephone + foreign_worker
## Warning in drop1.glm(fit, scope$drop, scale = scale, trace = trace, k =
## k, : F test assumes 'quasibinomial' family
## Df Deviance AIC F value Pr(>F)
## - no_of_credits 1 902.80 982.80 2.1199 0.1457260
## <none> 900.81 982.81
## - Employment_Stat 4 908.82 982.82 2.1315 0.0749790 .
## - Telephone 1 902.90 982.90 2.2252 0.1361081
## - Housing 2 905.01 983.01 2.2329 0.1077705
## - age 1 903.38 983.38 2.7341 0.0985521 .
## - personal_stat 3 908.75 984.75 2.8190 0.0380116 *
## - foreign_worker 1 906.79 986.79 6.3613 0.0118248 *
## - deptor_stat 2 908.83 986.83 4.2672 0.0142886 *
## - other_instalment_plans 2 909.90 987.90 4.8373 0.0081230 **
## - Credit_Amt 1 910.50 990.50 10.3129 0.0013650 **
## - Duration 1 910.95 990.95 10.7970 0.0010535 **
## - SavingsAcc 4 920.53 994.53 5.2488 0.0003485 ***
## - Installment_rate 1 915.95 995.95 16.1170 6.419e-05 ***
## - Credit_Hist 4 923.28 997.28 5.9800 9.383e-05 ***
## - Purpose 9 937.61 1001.61 4.3533 1.394e-05 ***
## - checkingAcc 3 969.35 1045.35 24.3232 3.569e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=982.8
## response_class ~ checkingAcc + Duration + Credit_Hist + Purpose +
## Credit_Amt + SavingsAcc + Employment_Stat + Installment_rate +
## personal_stat + deptor_stat + age + other_instalment_plans +
## Housing + Telephone + foreign_worker
## Warning in drop1.glm(fit, scope$drop, scale = scale, trace = trace, k =
## k, : F test assumes 'quasibinomial' family
## Df Deviance AIC F value Pr(>F)
## - Employment_Stat 4 910.50 982.50 2.0460 0.0859611 .
## - Telephone 1 904.72 982.72 2.0372 0.1538132
## <none> 902.80 982.80
## - Housing 2 907.17 983.17 2.3205 0.0987741 .
## - age 1 905.21 983.21 2.5570 0.1101378
## - personal_stat 3 910.47 984.47 2.7195 0.0434326 *
## - deptor_stat 2 910.88 986.88 4.2938 0.0139160 *
## - foreign_worker 1 909.14 987.14 6.7384 0.0095802 **
## - other_instalment_plans 2 912.57 988.57 5.1924 0.0057159 **
## - Duration 1 912.50 990.50 10.3179 0.0013613 **
## - Credit_Amt 1 912.54 990.54 10.3523 0.0013364 **
## - SavingsAcc 4 922.72 994.72 5.2944 0.0003212 ***
## - Credit_Hist 4 923.43 995.43 5.4840 0.0002287 ***
## - Installment_rate 1 917.65 995.65 15.7916 7.601e-05 ***
## - Purpose 9 940.20 1002.20 4.4184 1.103e-05 ***
## - checkingAcc 3 971.04 1045.04 24.1890 4.286e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=982.5
## response_class ~ checkingAcc + Duration + Credit_Hist + Purpose +
## Credit_Amt + SavingsAcc + Installment_rate + personal_stat +
## deptor_stat + age + other_instalment_plans + Housing + Telephone +
## foreign_worker
## Warning in drop1.glm(fit, scope$drop, scale = scale, trace = trace, k =
## k, : F test assumes 'quasibinomial' family
## Df Deviance AIC F value Pr(>F)
## <none> 910.50 982.50
## - Telephone 1 912.82 982.82 2.4537 0.1175738
## - age 1 912.96 982.96 2.6118 0.1064003
## - Housing 2 915.01 983.01 2.3885 0.0923074 .
## - foreign_worker 1 916.63 986.63 6.4913 0.0109944 *
## - personal_stat 3 920.98 986.98 3.6987 0.0115203 *
## - deptor_stat 2 919.41 987.41 4.7202 0.0091208 **
## - other_instalment_plans 2 920.76 988.76 5.4348 0.0044969 **
## - Duration 1 918.79 988.79 8.7794 0.0031213 **
## - Credit_Amt 1 920.07 990.07 10.1320 0.0015038 **
## - SavingsAcc 4 931.53 995.53 5.5665 0.0001972 ***
## - Installment_rate 1 925.92 995.92 16.3257 5.758e-05 ***
## - Credit_Hist 4 932.30 996.30 5.7705 0.0001367 ***
## - Purpose 9 947.78 1001.78 4.3854 1.240e-05 ***
## - checkingAcc 3 979.75 1045.75 24.4394 3.022e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Call: glm(formula = response_class ~ checkingAcc + Duration + Credit_Hist +
## Purpose + Credit_Amt + SavingsAcc + Installment_rate + personal_stat +
## deptor_stat + age + other_instalment_plans + Housing + Telephone +
## foreign_worker, family = "binomial", data = germanCreditDF)
##
## Coefficients:
## (Intercept) checkingAccA12
## -1.7495411 0.3900152
## checkingAccA13 checkingAccA14
## 1.0240813 1.7177165
## Duration Credit_HistA31
## -0.0256787 0.1187724
## Credit_HistA32 Credit_HistA33
## 0.8303101 0.9097304
## Credit_HistA34 PurposeA41
## 1.4917085 1.6072585
## PurposeA410 PurposeA42
## 1.4349203 0.7404978
## PurposeA43 PurposeA44
## 0.9194787 0.5250945
## PurposeA45 PurposeA46
## 0.1424475 -0.1435655
## PurposeA48 PurposeA49
## 2.1643060 0.7826591
## Credit_Amt SavingsAccA62
## -0.0001294 0.3282182
## SavingsAccA63 SavingsAccA64
## 0.4303584 1.2894345
## SavingsAccA65 Installment_rate
## 0.9628458 -0.3299308
## personal_statA92 personal_statA93
## 0.2872096 0.8227885
## personal_statA94 deptor_statA102
## 0.4169133 -0.4874391
## deptor_statA103 age
## 1.0404263 0.0130933
## other_instalment_plansA142 other_instalment_plansA143
## 0.0786395 0.6994941
## HousingA152 HousingA153
## 0.4415029 0.1496754
## TelephoneA192 foreign_workerA202
## 0.2794111 1.3824572
##
## Degrees of Freedom: 999 Total (i.e. Null); 964 Residual
## Null Deviance: 1222
## Residual Deviance: 910.5 AIC: 982.5
Above are the step-wise backward elimination results.
The p-values for Telephone, age and Housing are slightly greater than .05 (our significance level). We had already removed Telephone in the manual elimination process; I am still hesitant to remove age and Housing since they are so close to the significance level.
For linear regression we could have used R-squared and adjusted R-squared to decide whether to drop these variables, but the glm summary does not report them. Instead we will predict on a held-out test set and check the confusion matrix to see the effectiveness of the model.
A pseudo R-squared could also be used to evaluate the model.
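As a minimal sketch of that idea, McFadden’s pseudo R-squared can be computed directly from the deviances stored in the fitted glm objects (the helper name pseudo_r2 is just for illustration).
# Sketch: McFadden's pseudo R-squared = 1 - residual deviance / null deviance
pseudo_r2 <- function(model) 1 - model$deviance / model$null.deviance
pseudo_r2(full_regressor)     # full logistic model
pseudo_r2(partial_regrssor3)  # model after manual elimination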
Splitting the dataset into training and test sets.
library(caTools)
set.seed(123)
split <- sample.split(germanCreditDF$response_class, SplitRatio = 0.75)
training_set <- subset(germanCreditDF , split == TRUE)
test_set <- subset(germanCreditDF , split == FALSE)
nrow(training_set)
## [1] 750
nrow(test_set)
## [1] 250
Fitting the logistic model with the above variables.
regressor1 <- glm(formula = response_class ~ age + Housing + foreign_worker + personal_stat + deptor_stat + other_instalment_plans + Duration + Credit_Amt + SavingsAcc + Installment_rate + Credit_Hist + Purpose + checkingAcc, data = training_set, family = 'binomial' )
summary(regressor1)
##
## Call:
## glm(formula = response_class ~ age + Housing + foreign_worker +
## personal_stat + deptor_stat + other_instalment_plans + Duration +
## Credit_Amt + SavingsAcc + Installment_rate + Credit_Hist +
## Purpose + checkingAcc, family = "binomial", data = training_set)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8284 -0.7033 0.3578 0.6915 2.2344
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.153e+00 8.651e-01 -1.333 0.182552
## age 3.510e-03 9.680e-03 0.363 0.716941
## HousingA152 3.037e-01 2.639e-01 1.151 0.249854
## HousingA153 4.003e-02 3.885e-01 0.103 0.917939
## foreign_workerA202 1.048e+00 7.298e-01 1.435 0.151172
## personal_statA92 2.172e-01 4.351e-01 0.499 0.617613
## personal_statA93 8.310e-01 4.237e-01 1.961 0.049830 *
## personal_statA94 4.380e-01 5.096e-01 0.859 0.390134
## deptor_statA102 -7.718e-01 5.054e-01 -1.527 0.126742
## deptor_statA103 9.139e-01 4.824e-01 1.894 0.058159 .
## other_instalment_plansA142 1.124e-01 4.610e-01 0.244 0.807327
## other_instalment_plansA143 8.385e-01 2.754e-01 3.044 0.002334 **
## Duration -2.673e-02 1.064e-02 -2.511 0.012026 *
## Credit_Amt -1.200e-04 4.963e-05 -2.418 0.015609 *
## SavingsAccA62 6.557e-01 3.419e-01 1.918 0.055153 .
## SavingsAccA63 3.705e-01 4.241e-01 0.874 0.382244
## SavingsAccA64 1.217e+00 5.861e-01 2.076 0.037913 *
## SavingsAccA65 1.216e+00 3.298e-01 3.686 0.000228 ***
## Installment_rate -3.062e-01 9.795e-02 -3.126 0.001770 **
## Credit_HistA31 1.251e-01 6.423e-01 0.195 0.845596
## Credit_HistA32 7.091e-01 4.883e-01 1.452 0.146442
## Credit_HistA33 6.684e-01 5.488e-01 1.218 0.223260
## Credit_HistA34 1.441e+00 5.089e-01 2.832 0.004627 **
## PurposeA41 1.525e+00 4.282e-01 3.562 0.000368 ***
## PurposeA410 2.073e+00 9.561e-01 2.168 0.030133 *
## PurposeA42 7.407e-01 2.996e-01 2.472 0.013439 *
## PurposeA43 4.748e-01 2.859e-01 1.661 0.096769 .
## PurposeA44 1.104e+00 9.298e-01 1.187 0.235066
## PurposeA45 -3.480e-01 6.520e-01 -0.534 0.593559
## PurposeA46 -2.153e-01 4.650e-01 -0.463 0.643421
## PurposeA48 1.516e+01 4.584e+02 0.033 0.973607
## PurposeA49 3.257e-01 3.877e-01 0.840 0.400902
## checkingAccA12 3.062e-01 2.509e-01 1.220 0.222335
## checkingAccA13 8.143e-01 4.033e-01 2.019 0.043493 *
## checkingAccA14 1.877e+00 2.696e-01 6.963 3.33e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 916.30 on 749 degrees of freedom
## Residual deviance: 672.95 on 715 degrees of freedom
## AIC: 742.95
##
## Number of Fisher Scoring iterations: 14
prob_pred <- predict(regressor1, type = 'response' , newdata = subset(test_set, select=c(age, Housing,foreign_worker, personal_stat , deptor_stat , other_instalment_plans , Duration ,Credit_Amt , SavingsAcc, Installment_rate , Credit_Hist , Purpose , checkingAcc)))
y_pred <- ifelse(prob_pred > 0.5 , 1 , 0)
cm <- table(test_set[, 'response_class'], y_pred)
cm
## y_pred
## 0 1
## 0 36 39
## 1 24 151
accuracy <- (36 + 151)/nrow(test_set)
accuracy
## [1] 0.748
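Rather than hard-coding the counts, the same metric (plus sensitivity and specificity) can be derived directly from the confusion matrix; this is a small sketch, with rows as the actual class and columns as the predicted class.
# Sketch: derive metrics from cm instead of hard-coding the counts
acc  <- sum(diag(cm)) / sum(cm)       # overall accuracy
sens <- cm["1", "1"] / sum(cm["1", ]) # true positive rate for good customers
spec <- cm["0", "0"] / sum(cm["0", ]) # true negative rate for bad customers
c(accuracy = acc, sensitivity = sens, specificity = spec)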
Now fitting the model after removing ‘age’ and ‘Housing’.
regressor2 <- glm(formula = response_class ~ foreign_worker + personal_stat + deptor_stat + other_instalment_plans + Duration + Credit_Amt + SavingsAcc + Installment_rate + Credit_Hist + Purpose + checkingAcc, data = training_set, family = 'binomial' )
summary(regressor2)
##
## Call:
## glm(formula = response_class ~ foreign_worker + personal_stat +
## deptor_stat + other_instalment_plans + Duration + Credit_Amt +
## SavingsAcc + Installment_rate + Credit_Hist + Purpose + checkingAcc,
## family = "binomial", data = training_set)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8276 -0.7225 0.3596 0.6941 2.1908
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.964e-01 7.635e-01 -1.043 0.296862
## foreign_workerA202 1.068e+00 7.299e-01 1.464 0.143326
## personal_statA92 1.478e-01 4.294e-01 0.344 0.730722
## personal_statA93 8.082e-01 4.193e-01 1.927 0.053921 .
## personal_statA94 3.941e-01 5.035e-01 0.783 0.433779
## deptor_statA102 -8.233e-01 5.004e-01 -1.645 0.099904 .
## deptor_statA103 9.296e-01 4.796e-01 1.938 0.052594 .
## other_instalment_plansA142 1.644e-01 4.594e-01 0.358 0.720502
## other_instalment_plansA143 8.258e-01 2.743e-01 3.011 0.002605 **
## Duration -2.763e-02 1.054e-02 -2.623 0.008723 **
## Credit_Amt -1.205e-04 4.946e-05 -2.437 0.014822 *
## SavingsAccA62 6.312e-01 3.402e-01 1.855 0.063548 .
## SavingsAccA63 3.657e-01 4.239e-01 0.863 0.388387
## SavingsAccA64 1.228e+00 5.858e-01 2.096 0.036067 *
## SavingsAccA65 1.227e+00 3.284e-01 3.737 0.000186 ***
## Installment_rate -3.020e-01 9.726e-02 -3.105 0.001906 **
## Credit_HistA31 1.327e-01 6.376e-01 0.208 0.835076
## Credit_HistA32 7.121e-01 4.843e-01 1.470 0.141507
## Credit_HistA33 6.913e-01 5.452e-01 1.268 0.204761
## Credit_HistA34 1.483e+00 5.048e-01 2.938 0.003303 **
## PurposeA41 1.463e+00 4.218e-01 3.468 0.000524 ***
## PurposeA410 2.102e+00 9.667e-01 2.174 0.029685 *
## PurposeA42 7.311e-01 2.966e-01 2.465 0.013695 *
## PurposeA43 5.038e-01 2.824e-01 1.784 0.074451 .
## PurposeA44 1.220e+00 9.327e-01 1.308 0.190705
## PurposeA45 -3.277e-01 6.559e-01 -0.500 0.617385
## PurposeA46 -2.954e-01 4.615e-01 -0.640 0.522070
## PurposeA48 1.529e+01 4.571e+02 0.033 0.973322
## PurposeA49 3.581e-01 3.844e-01 0.932 0.351560
## checkingAccA12 3.220e-01 2.494e-01 1.291 0.196635
## checkingAccA13 8.330e-01 4.021e-01 2.072 0.038279 *
## checkingAccA14 1.914e+00 2.683e-01 7.137 9.57e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 916.3 on 749 degrees of freedom
## Residual deviance: 674.8 on 718 degrees of freedom
## AIC: 738.8
##
## Number of Fisher Scoring iterations: 14
prob_pred <- predict(regressor2, type = 'response' , newdata = subset(test_set, select=c(foreign_worker, personal_stat , deptor_stat , other_instalment_plans , Duration ,Credit_Amt , SavingsAcc, Installment_rate , Credit_Hist , Purpose , checkingAcc)))
y_pred <- ifelse(prob_pred > 0.5 , 1 , 0)
cm <- table(test_set[, 'response_class'], y_pred)
cm
## y_pred
## 0 1
## 0 36 39
## 1 24 151
accuracy <- (36 + 151)/nrow(test_set)
accuracy
## [1] 0.748
The accuracy did not change, so age and Housing are not adding value to the model: the accuracy neither goes down nor up.
Conclusion: we eliminated a number of uninformative independent variables and found the variables most significantly associated with the response variable.
final model formula = [response_class ~ foreign_worker + personal_stat + deptor_stat + other_instalment_plans + Duration + Credit_Amt + SavingsAcc + Installment_rate + Credit_Hist + Purpose + checkingAcc]
For reference, the description of the Credit_Hist values is kept below.
Credit_Hist: (qualitative)
+ Credit history
- A30 : no credits taken/ all credits paid back duly
- A31 : all credits at this bank paid back duly
- A32 : existing credits paid back duly till now
- A33 : delay in paying off in the past
- A34 : critical account/other credits existing (not at this bank)
From research question 1, we obtained a few explanatory variables that are significant towards the response variable; Credit_Hist is one of them. However, we need to analyse why customers with a critical credit history (the A34 group) have a higher approval rate. The A32 group also has a higher approval rate.
I plotted side-by-side box plots and scatter plots of several categorical and numerical variables against Credit_Hist. Among them, I see a pattern for ‘Duration’, a numerical variable.
Below is the box plot of Duration versus Credit_Hist. A32 and A34 have the same median and both are condensed, while the other groups have somewhat higher medians.
boxplot(germanCreditDF$Duration ~ germanCreditDF$Credit_Hist)
Let’s take the A34 data and plot Duration side by side by class. The median Duration for good customers is lower than for bad customers. This suggests that customers with a critical credit history tend to apply for shorter durations, and those applications get approved. Note also the outliers among good customers.
A34_df <- germanCreditDF %>% filter(germanCreditDF$Credit_Hist == 'A34')
boxplot(A34_df$Duration ~ A34_df$class)
Let’s try the A32 group, which also has a higher approval rate. We see the same behaviour here too.
A32_df <- germanCreditDF %>% filter(germanCreditDF$Credit_Hist == 'A32')
boxplot(A32_df$Duration ~ A32_df$class)
Let’s do a hypothesis test to confirm this. Within the A34 group, the null hypothesis is that the median Duration is the same for good and bad customers, and the alternative is that it differs.
inference(y = A34_df$Duration, x = A34_df$class, est = "median", type = "ht", null = 0,
alternative = "twosided", method = "simulation")
## Response variable: numerical, Explanatory variable: categorical
## Difference between two medians
## Summary statistics:
## n_Bad = 50, median_Bad = 22.5, n_Good = 243, median_Good = 15,
## Observed difference between medians (Bad-Good) = 7.5
## H0: median_Bad - median_Good = 0
## HA: median_Bad - median_Good != 0
## p-value = 0.0106
Since the p-value is less than .05 (the significance level), we reject the null hypothesis.
The duration of the credit matters: customers with a critical credit history tend to apply for shorter durations and those applications get approved, whereas longer-duration applications are mostly classified as bad risks.
In question 1 we determined the significance of the explanatory variables using glm with a binomial family, i.e. logistic regression. Now let’s try some classification models and see how well we can predict the response variable from these explanatory variables.
Since these models expect numeric inputs, we use the numeric version of the data (german.data-numeric.txt).
# install.packages("e1071")
library('e1071')
germanCreditDF_numeric <- read.table("C:/Users/Charls/Documents/CunyMSDS/606-Statistical Analsyis/final-project/data/german.data-numeric.txt", header = F)
head(germanCreditDF_numeric)
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20
## 1 1 6 4 12 5 5 3 4 1 67 3 2 1 2 1 0 0 1 0 0
## 2 2 48 2 60 1 3 2 2 1 22 3 1 1 1 1 0 0 1 0 0
## 3 4 12 4 21 1 4 3 3 1 49 3 1 2 1 1 0 0 1 0 0
## 4 1 42 2 79 1 4 3 4 2 45 3 1 2 1 1 0 0 0 0 0
## 5 1 24 3 49 1 3 3 4 4 53 3 2 2 1 1 1 0 1 0 0
## 6 4 36 2 91 5 3 3 4 4 35 3 1 2 2 1 0 0 1 0 0
## V21 V22 V23 V24 V25
## 1 1 0 0 1 1
## 2 1 0 0 1 2
## 3 1 0 1 0 1
## 4 0 0 0 1 1
## 5 0 0 0 1 2
## 6 0 0 1 0 1
colnames(germanCreditDF_numeric) <- column_headers
germanCreditDF_numeric$response_class <- sapply(germanCreditDF$Customer_class, function(x){
switch(as.character(x), "1" = 1, "2" = 0)
})
nrow(germanCreditDF_numeric)
## [1] 1000
library(caTools)
set.seed(123)
split <- sample.split(germanCreditDF_numeric$response_class, SplitRatio = 0.75)
training_set_num <- subset(germanCreditDF_numeric , split == TRUE)
test_set_num <- subset(germanCreditDF_numeric , split == FALSE)
naive_bayes_regressor <- naiveBayes(x = training_set_num[, c("foreign_worker" , "personal_stat" , "deptor_stat" , "other_instalment_plans" , "Duration" , "Credit_Amt" , "SavingsAcc" , "Installment_rate" , "Credit_Hist" , "Purpose" , "checkingAcc")], y = as.factor(training_set_num$response_class))
y_pred <- predict(naive_bayes_regressor, newdata = test_set_num[, c("foreign_worker" , "personal_stat" , "deptor_stat" , "other_instalment_plans" , "Duration" , "Credit_Amt" , "SavingsAcc" , "Installment_rate" , "Credit_Hist" , "Purpose" , "checkingAcc")])
# Making the Confusion Matrix
cm = table(test_set_num$response_class, y_pred)
cm
## y_pred
## 0 1
## 0 38 37
## 1 24 151
accuracy <- (38 + 151)/nrow(test_set_num)
accuracy
## [1] 0.756
library(class)
y_pred = knn(train = training_set_num[, c("foreign_worker" , "personal_stat" , "deptor_stat" , "other_instalment_plans" , "Duration" , "Credit_Amt" , "SavingsAcc" , "Installment_rate" , "Credit_Hist" , "Purpose" , "checkingAcc")],
test = test_set_num[, c("foreign_worker" , "personal_stat" , "deptor_stat" , "other_instalment_plans" , "Duration" , "Credit_Amt" , "SavingsAcc" , "Installment_rate" , "Credit_Hist" , "Purpose" , "checkingAcc")],
cl = as.factor(training_set_num$response_class),
k = 5,
prob = TRUE)
cm = table(test_set_num$response_class, y_pred)
cm
## y_pred
## 0 1
## 0 25 50
## 1 23 152
accuracy <- (25 + 152)/nrow(test_set_num)
accuracy
## [1] 0.708
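One likely reason k-NN lags the other models is that it is distance based and the predictors were not scaled, so a large-valued column such as Credit_Amt dominates the distance. Below is a minimal sketch of standardizing the predictors before running knn; it is not part of the results reported here and the accuracy would need to be re-checked.
# Sketch: standardize predictors before k-NN so no single column dominates the distance
knn_cols <- c("foreign_worker", "personal_stat", "deptor_stat", "other_instalment_plans",
              "Duration", "Credit_Amt", "SavingsAcc", "Installment_rate",
              "Credit_Hist", "Purpose", "checkingAcc")
train_sc <- scale(training_set_num[, knn_cols])
test_sc  <- scale(test_set_num[, knn_cols],
                  center = attr(train_sc, "scaled:center"),
                  scale  = attr(train_sc, "scaled:scale"))
y_pred_scaled <- knn(train = train_sc, test = test_sc,
                     cl = as.factor(training_set_num$response_class), k = 5)
table(test_set_num$response_class, y_pred_scaled)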
# install.packages("rpart")
library(rpart)
classifer_dt <- rpart(data = training_set_num, formula = as.factor(response_class) ~ foreign_worker + personal_stat + deptor_stat + other_instalment_plans + Duration + Credit_Amt + SavingsAcc + Installment_rate + Credit_Hist + Purpose + checkingAcc)
y_pred <- predict(classifer_dt, newdata = test_set_num[, c("foreign_worker" , "personal_stat" , "deptor_stat" , "other_instalment_plans" , "Duration" , "Credit_Amt" , "SavingsAcc" , "Installment_rate" , "Credit_Hist" , "Purpose" , "checkingAcc")], type = 'class')
cm = table(test_set_num$response_class, y_pred)
cm
## y_pred
## 0 1
## 0 29 46
## 1 20 155
accuracy <- ( 29 + 155)/nrow(test_set_num)
accuracy
## [1] 0.736
# install.packages("randomForest")
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
classifer_rf <- randomForest(x = training_set_num[, c("foreign_worker" , "personal_stat" , "deptor_stat" , "other_instalment_plans" , "Duration" , "Credit_Amt" , "SavingsAcc" , "Installment_rate" , "Credit_Hist" , "Purpose" , "checkingAcc")] ,
y = as.factor(training_set_num$response_class), ntree = 15)
y_pred <- predict(classifer_rf, newdata = test_set_num[, c("foreign_worker" , "personal_stat" , "deptor_stat" , "other_instalment_plans" , "Duration" , "Credit_Amt" , "SavingsAcc" , "Installment_rate" , "Credit_Hist" , "Purpose" , "checkingAcc")], type = 'class')
# Making the Confusion Matrix
cm <- table(test_set_num$response_class, y_pred)
cm
## y_pred
## 0 1
## 0 36 39
## 1 16 159
accuracy <- ( 32 + 158)/nrow(test_set_num)
accuracy
## [1] 0.76
The table below shows the classification models we tried and their accuracy. Among them, Naive Bayes and Random Forest give the best accuracy.
c("Naive Bayes", "Knn", "Decision Tree", "Random Forest" )
## [1] "Naive Bayes" "Knn" "Decision Tree" "Random Forest"
c(75.6 ,70.8,73.6,76 )
## [1] 75.6 70.8 73.6 76.0
acc.DF <- data.frame(Model=c("Naive Bayes", "Knn", "Decision Tree", "Random Forest" ) , Accuracy = c(75.6 ,70.8,73.6,76 ))
library(knitr)
library(kableExtra)
acc.DF %>%
kable() %>%
kable_styling()
Model | Accuracy |
---|---|
Naive Bayes | 75.6 |
Knn | 70.8 |
Decision Tree | 73.6 |
Random Forest | 76.0 |