Refer to http://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data))
for variable description. THe response variable is Class
and all others are predictors.
Only run the following code once to install the package
caret. The German credit scoring data in
provided in that package.
install.packages('caret')
library(caret) #this package contains the german data with its numeric format
## Loading required package: ggplot2
## Loading required package: lattice
data(GermanCredit)
GermanCredit$Class <- GermanCredit$Class == "Good" # use this code to convert `Class` into True or False (equivalent to 1 or 0)
str(GermanCredit)
## 'data.frame': 1000 obs. of 62 variables:
## $ Duration : int 6 48 12 42 24 36 24 36 12 30 ...
## $ Amount : int 1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
## $ InstallmentRatePercentage : int 4 2 2 2 3 2 3 2 2 4 ...
## $ ResidenceDuration : int 4 2 3 4 4 4 4 2 4 2 ...
## $ Age : int 67 22 49 45 53 35 53 35 61 28 ...
## $ NumberExistingCredits : int 2 1 1 1 2 1 1 1 1 2 ...
## $ NumberPeopleMaintenance : int 1 1 2 2 2 2 1 1 1 1 ...
## $ Telephone : num 0 1 1 1 1 0 1 0 1 1 ...
## $ ForeignWorker : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Class : logi TRUE FALSE TRUE TRUE FALSE TRUE ...
## $ CheckingAccountStatus.lt.0 : num 1 0 0 1 1 0 0 0 0 0 ...
## $ CheckingAccountStatus.0.to.200 : num 0 1 0 0 0 0 0 1 0 1 ...
## $ CheckingAccountStatus.gt.200 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CheckingAccountStatus.none : num 0 0 1 0 0 1 1 0 1 0 ...
## $ CreditHistory.NoCredit.AllPaid : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CreditHistory.ThisBank.AllPaid : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CreditHistory.PaidDuly : num 0 1 0 1 0 1 1 1 1 0 ...
## $ CreditHistory.Delay : num 0 0 0 0 1 0 0 0 0 0 ...
## $ CreditHistory.Critical : num 1 0 1 0 0 0 0 0 0 1 ...
## $ Purpose.NewCar : num 0 0 0 0 1 0 0 0 0 1 ...
## $ Purpose.UsedCar : num 0 0 0 0 0 0 0 1 0 0 ...
## $ Purpose.Furniture.Equipment : num 0 0 0 1 0 0 1 0 0 0 ...
## $ Purpose.Radio.Television : num 1 1 0 0 0 0 0 0 1 0 ...
## $ Purpose.DomesticAppliance : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Repairs : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Education : num 0 0 1 0 0 1 0 0 0 0 ...
## $ Purpose.Vacation : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Retraining : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Business : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Other : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SavingsAccountBonds.lt.100 : num 0 1 1 1 1 0 0 1 0 1 ...
## $ SavingsAccountBonds.100.to.500 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SavingsAccountBonds.500.to.1000 : num 0 0 0 0 0 0 1 0 0 0 ...
## $ SavingsAccountBonds.gt.1000 : num 0 0 0 0 0 0 0 0 1 0 ...
## $ SavingsAccountBonds.Unknown : num 1 0 0 0 0 1 0 0 0 0 ...
## $ EmploymentDuration.lt.1 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ EmploymentDuration.1.to.4 : num 0 1 0 0 1 1 0 1 0 0 ...
## $ EmploymentDuration.4.to.7 : num 0 0 1 1 0 0 0 0 1 0 ...
## $ EmploymentDuration.gt.7 : num 1 0 0 0 0 0 1 0 0 0 ...
## $ EmploymentDuration.Unemployed : num 0 0 0 0 0 0 0 0 0 1 ...
## $ Personal.Male.Divorced.Seperated : num 0 0 0 0 0 0 0 0 1 0 ...
## $ Personal.Female.NotSingle : num 0 1 0 0 0 0 0 0 0 0 ...
## $ Personal.Male.Single : num 1 0 1 1 1 1 1 1 0 0 ...
## $ Personal.Male.Married.Widowed : num 0 0 0 0 0 0 0 0 0 1 ...
## $ Personal.Female.Single : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherDebtorsGuarantors.None : num 1 1 1 0 1 1 1 1 1 1 ...
## $ OtherDebtorsGuarantors.CoApplicant : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherDebtorsGuarantors.Guarantor : num 0 0 0 1 0 0 0 0 0 0 ...
## $ Property.RealEstate : num 1 1 1 0 0 0 0 0 1 0 ...
## $ Property.Insurance : num 0 0 0 1 0 0 1 0 0 0 ...
## $ Property.CarOther : num 0 0 0 0 0 0 0 1 0 1 ...
## $ Property.Unknown : num 0 0 0 0 1 1 0 0 0 0 ...
## $ OtherInstallmentPlans.Bank : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherInstallmentPlans.Stores : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherInstallmentPlans.None : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Housing.Rent : num 0 0 0 0 0 0 0 1 0 0 ...
## $ Housing.Own : num 1 1 1 0 0 0 1 0 1 1 ...
## $ Housing.ForFree : num 0 0 0 1 1 1 0 0 0 0 ...
## $ Job.UnemployedUnskilled : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Job.UnskilledResident : num 0 0 1 0 0 1 0 0 1 0 ...
## $ Job.SkilledEmployee : num 1 1 0 1 1 0 1 0 0 0 ...
## $ Job.Management.SelfEmp.HighlyQualified: num 0 0 0 0 0 0 0 1 0 1 ...
Your observation: There are 62 columns and 1000 rows. In the dataset
information page, it states that several variables are categorical
(status of exisiting checking account,
credit history, purpose, etc) so I assumed
that we might have to re-categorize them as.factor but it
appears that they have already been separated into their different
levels so no re-categorization is necessary.
#This is an optional code that drop variables that provide no information in the data
GermanCredit = GermanCredit[,-c(14,19,27,30,35,40,44,45,48,52,55,58,62)]
colnames(GermanCredit)
## [1] "Duration" "Amount"
## [3] "InstallmentRatePercentage" "ResidenceDuration"
## [5] "Age" "NumberExistingCredits"
## [7] "NumberPeopleMaintenance" "Telephone"
## [9] "ForeignWorker" "Class"
## [11] "CheckingAccountStatus.lt.0" "CheckingAccountStatus.0.to.200"
## [13] "CheckingAccountStatus.gt.200" "CreditHistory.NoCredit.AllPaid"
## [15] "CreditHistory.ThisBank.AllPaid" "CreditHistory.PaidDuly"
## [17] "CreditHistory.Delay" "Purpose.NewCar"
## [19] "Purpose.UsedCar" "Purpose.Furniture.Equipment"
## [21] "Purpose.Radio.Television" "Purpose.DomesticAppliance"
## [23] "Purpose.Repairs" "Purpose.Education"
## [25] "Purpose.Retraining" "Purpose.Business"
## [27] "SavingsAccountBonds.lt.100" "SavingsAccountBonds.100.to.500"
## [29] "SavingsAccountBonds.500.to.1000" "SavingsAccountBonds.gt.1000"
## [31] "EmploymentDuration.lt.1" "EmploymentDuration.1.to.4"
## [33] "EmploymentDuration.4.to.7" "EmploymentDuration.gt.7"
## [35] "Personal.Male.Divorced.Seperated" "Personal.Female.NotSingle"
## [37] "Personal.Male.Single" "OtherDebtorsGuarantors.None"
## [39] "OtherDebtorsGuarantors.CoApplicant" "Property.RealEstate"
## [41] "Property.Insurance" "Property.CarOther"
## [43] "OtherInstallmentPlans.Bank" "OtherInstallmentPlans.Stores"
## [45] "Housing.Rent" "Housing.Own"
## [47] "Job.UnemployedUnskilled" "Job.UnskilledResident"
## [49] "Job.SkilledEmployee"
summary(GermanCredit)
## Duration Amount InstallmentRatePercentage ResidenceDuration
## Min. : 4.0 Min. : 250 Min. :1.000 Min. :1.000
## 1st Qu.:12.0 1st Qu.: 1366 1st Qu.:2.000 1st Qu.:2.000
## Median :18.0 Median : 2320 Median :3.000 Median :3.000
## Mean :20.9 Mean : 3271 Mean :2.973 Mean :2.845
## 3rd Qu.:24.0 3rd Qu.: 3972 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :72.0 Max. :18424 Max. :4.000 Max. :4.000
## Age NumberExistingCredits NumberPeopleMaintenance Telephone
## Min. :19.00 Min. :1.000 Min. :1.000 Min. :0.000
## 1st Qu.:27.00 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:0.000
## Median :33.00 Median :1.000 Median :1.000 Median :1.000
## Mean :35.55 Mean :1.407 Mean :1.155 Mean :0.596
## 3rd Qu.:42.00 3rd Qu.:2.000 3rd Qu.:1.000 3rd Qu.:1.000
## Max. :75.00 Max. :4.000 Max. :2.000 Max. :1.000
## ForeignWorker Class CheckingAccountStatus.lt.0
## Min. :0.000 Mode :logical Min. :0.000
## 1st Qu.:1.000 FALSE:300 1st Qu.:0.000
## Median :1.000 TRUE :700 Median :0.000
## Mean :0.963 Mean :0.274
## 3rd Qu.:1.000 3rd Qu.:1.000
## Max. :1.000 Max. :1.000
## CheckingAccountStatus.0.to.200 CheckingAccountStatus.gt.200
## Min. :0.000 Min. :0.000
## 1st Qu.:0.000 1st Qu.:0.000
## Median :0.000 Median :0.000
## Mean :0.269 Mean :0.063
## 3rd Qu.:1.000 3rd Qu.:0.000
## Max. :1.000 Max. :1.000
## CreditHistory.NoCredit.AllPaid CreditHistory.ThisBank.AllPaid
## Min. :0.00 Min. :0.000
## 1st Qu.:0.00 1st Qu.:0.000
## Median :0.00 Median :0.000
## Mean :0.04 Mean :0.049
## 3rd Qu.:0.00 3rd Qu.:0.000
## Max. :1.00 Max. :1.000
## CreditHistory.PaidDuly CreditHistory.Delay Purpose.NewCar Purpose.UsedCar
## Min. :0.00 Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:0.00 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:0.000
## Median :1.00 Median :0.000 Median :0.000 Median :0.000
## Mean :0.53 Mean :0.088 Mean :0.234 Mean :0.103
## 3rd Qu.:1.00 3rd Qu.:0.000 3rd Qu.:0.000 3rd Qu.:0.000
## Max. :1.00 Max. :1.000 Max. :1.000 Max. :1.000
## Purpose.Furniture.Equipment Purpose.Radio.Television Purpose.DomesticAppliance
## Min. :0.000 Min. :0.00 Min. :0.000
## 1st Qu.:0.000 1st Qu.:0.00 1st Qu.:0.000
## Median :0.000 Median :0.00 Median :0.000
## Mean :0.181 Mean :0.28 Mean :0.012
## 3rd Qu.:0.000 3rd Qu.:1.00 3rd Qu.:0.000
## Max. :1.000 Max. :1.00 Max. :1.000
## Purpose.Repairs Purpose.Education Purpose.Retraining Purpose.Business
## Min. :0.000 Min. :0.00 Min. :0.000 Min. :0.000
## 1st Qu.:0.000 1st Qu.:0.00 1st Qu.:0.000 1st Qu.:0.000
## Median :0.000 Median :0.00 Median :0.000 Median :0.000
## Mean :0.022 Mean :0.05 Mean :0.009 Mean :0.097
## 3rd Qu.:0.000 3rd Qu.:0.00 3rd Qu.:0.000 3rd Qu.:0.000
## Max. :1.000 Max. :1.00 Max. :1.000 Max. :1.000
## SavingsAccountBonds.lt.100 SavingsAccountBonds.100.to.500
## Min. :0.000 Min. :0.000
## 1st Qu.:0.000 1st Qu.:0.000
## Median :1.000 Median :0.000
## Mean :0.603 Mean :0.103
## 3rd Qu.:1.000 3rd Qu.:0.000
## Max. :1.000 Max. :1.000
## SavingsAccountBonds.500.to.1000 SavingsAccountBonds.gt.1000
## Min. :0.000 Min. :0.000
## 1st Qu.:0.000 1st Qu.:0.000
## Median :0.000 Median :0.000
## Mean :0.063 Mean :0.048
## 3rd Qu.:0.000 3rd Qu.:0.000
## Max. :1.000 Max. :1.000
## EmploymentDuration.lt.1 EmploymentDuration.1.to.4 EmploymentDuration.4.to.7
## Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:0.000
## Median :0.000 Median :0.000 Median :0.000
## Mean :0.172 Mean :0.339 Mean :0.174
## 3rd Qu.:0.000 3rd Qu.:1.000 3rd Qu.:0.000
## Max. :1.000 Max. :1.000 Max. :1.000
## EmploymentDuration.gt.7 Personal.Male.Divorced.Seperated
## Min. :0.000 Min. :0.00
## 1st Qu.:0.000 1st Qu.:0.00
## Median :0.000 Median :0.00
## Mean :0.253 Mean :0.05
## 3rd Qu.:1.000 3rd Qu.:0.00
## Max. :1.000 Max. :1.00
## Personal.Female.NotSingle Personal.Male.Single OtherDebtorsGuarantors.None
## Min. :0.00 Min. :0.000 Min. :0.000
## 1st Qu.:0.00 1st Qu.:0.000 1st Qu.:1.000
## Median :0.00 Median :1.000 Median :1.000
## Mean :0.31 Mean :0.548 Mean :0.907
## 3rd Qu.:1.00 3rd Qu.:1.000 3rd Qu.:1.000
## Max. :1.00 Max. :1.000 Max. :1.000
## OtherDebtorsGuarantors.CoApplicant Property.RealEstate Property.Insurance
## Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:0.000
## Median :0.000 Median :0.000 Median :0.000
## Mean :0.041 Mean :0.282 Mean :0.232
## 3rd Qu.:0.000 3rd Qu.:1.000 3rd Qu.:0.000
## Max. :1.000 Max. :1.000 Max. :1.000
## Property.CarOther OtherInstallmentPlans.Bank OtherInstallmentPlans.Stores
## Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:0.000
## Median :0.000 Median :0.000 Median :0.000
## Mean :0.332 Mean :0.139 Mean :0.047
## 3rd Qu.:1.000 3rd Qu.:0.000 3rd Qu.:0.000
## Max. :1.000 Max. :1.000 Max. :1.000
## Housing.Rent Housing.Own Job.UnemployedUnskilled Job.UnskilledResident
## Min. :0.000 Min. :0.000 Min. :0.000 Min. :0.0
## 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:0.0
## Median :0.000 Median :1.000 Median :0.000 Median :0.0
## Mean :0.179 Mean :0.713 Mean :0.022 Mean :0.2
## 3rd Qu.:0.000 3rd Qu.:1.000 3rd Qu.:0.000 3rd Qu.:0.0
## Max. :1.000 Max. :1.000 Max. :1.000 Max. :1.0
## Job.SkilledEmployee
## Min. :0.00
## 1st Qu.:0.00
## Median :1.00
## Mean :0.63
## 3rd Qu.:1.00
## Max. :1.00
Your observation: This dataset has 49 columns total. 48 columns
contain numeric data and one column class contains binary
data that categorizes a customer as good or
bad.
2023 for
reproducibility.set.seed(2023)
index <- sample(1:nrow(GermanCredit),nrow(GermanCredit)*0.80)
credit.train = GermanCredit[index,]
credit.test = GermanCredit[-index,]
Your observation: This dataset was split into training data (80% of the observations, 800 rows) and testing data (20% of the observations, 200 rows). Each dataset still contains 49 columns.
credit.glm0 <- glm(Class~., family="binomial", data=credit.train)
Your observation: We fit a logistic regression model on the training
dataset. Because we’re using a logistic regression instead of a linear
regression, we add family='binomial to our code.
summary(credit.glm0)
##
## Call:
## glm(formula = Class ~ ., family = "binomial", data = credit.train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 7.948e+00 1.620e+00 4.908 9.22e-07 ***
## Duration -2.465e-02 1.027e-02 -2.401 0.01636 *
## Amount -1.206e-04 4.943e-05 -2.440 0.01467 *
## InstallmentRatePercentage -2.766e-01 9.796e-02 -2.823 0.00476 **
## ResidenceDuration 4.616e-02 9.831e-02 0.469 0.63872
## Age 1.982e-02 1.046e-02 1.896 0.05802 .
## NumberExistingCredits -2.741e-01 2.145e-01 -1.278 0.20142
## NumberPeopleMaintenance -1.388e-01 2.898e-01 -0.479 0.63190
## Telephone -2.586e-01 2.242e-01 -1.153 0.24877
## ForeignWorker -1.789e+00 8.309e-01 -2.153 0.03132 *
## CheckingAccountStatus.lt.0 -1.944e+00 2.646e-01 -7.347 2.02e-13 ***
## CheckingAccountStatus.0.to.200 -1.278e+00 2.551e-01 -5.009 5.46e-07 ***
## CheckingAccountStatus.gt.200 -5.367e-01 4.445e-01 -1.208 0.22724
## CreditHistory.NoCredit.AllPaid -1.284e+00 4.801e-01 -2.674 0.00750 **
## CreditHistory.ThisBank.AllPaid -1.436e+00 4.997e-01 -2.873 0.00407 **
## CreditHistory.PaidDuly -7.179e-01 2.865e-01 -2.506 0.01221 *
## CreditHistory.Delay -5.630e-01 3.726e-01 -1.511 0.13081
## Purpose.NewCar -1.917e+00 8.668e-01 -2.212 0.02697 *
## Purpose.UsedCar -2.727e-01 8.931e-01 -0.305 0.76006
## Purpose.Furniture.Equipment -1.069e+00 8.737e-01 -1.223 0.22118
## Purpose.Radio.Television -1.054e+00 8.812e-01 -1.196 0.23171
## Purpose.DomesticAppliance -1.109e+00 1.220e+00 -0.909 0.36321
## Purpose.Repairs -1.992e+00 1.035e+00 -1.924 0.05433 .
## Purpose.Education -1.896e+00 9.500e-01 -1.996 0.04595 *
## Purpose.Retraining -1.045e+00 1.507e+00 -0.694 0.48796
## Purpose.Business -1.240e+00 8.975e-01 -1.381 0.16721
## SavingsAccountBonds.lt.100 -9.516e-01 2.975e-01 -3.199 0.00138 **
## SavingsAccountBonds.100.to.500 -7.571e-01 3.877e-01 -1.953 0.05083 .
## SavingsAccountBonds.500.to.1000 -3.102e-01 5.274e-01 -0.588 0.55639
## SavingsAccountBonds.gt.1000 -2.349e-01 5.947e-01 -0.395 0.69284
## EmploymentDuration.lt.1 2.255e-01 4.925e-01 0.458 0.64711
## EmploymentDuration.1.to.4 2.978e-01 4.682e-01 0.636 0.52473
## EmploymentDuration.4.to.7 8.561e-01 5.057e-01 1.693 0.09045 .
## EmploymentDuration.gt.7 3.178e-01 4.724e-01 0.673 0.50108
## Personal.Male.Divorced.Seperated -5.419e-01 4.982e-01 -1.088 0.27668
## Personal.Female.NotSingle -2.182e-01 3.492e-01 -0.625 0.53197
## Personal.Male.Single 2.917e-01 3.523e-01 0.828 0.40770
## OtherDebtorsGuarantors.None -7.453e-01 4.707e-01 -1.583 0.11339
## OtherDebtorsGuarantors.CoApplicant -1.243e+00 6.380e-01 -1.948 0.05138 .
## Property.RealEstate 8.035e-01 4.647e-01 1.729 0.08381 .
## Property.Insurance 6.041e-01 4.511e-01 1.339 0.18050
## Property.CarOther 4.111e-01 4.378e-01 0.939 0.34776
## OtherInstallmentPlans.Bank -5.736e-01 2.706e-01 -2.120 0.03401 *
## OtherInstallmentPlans.Stores -4.597e-01 4.649e-01 -0.989 0.32276
## Housing.Rent -5.839e-01 5.256e-01 -1.111 0.26656
## Housing.Own -7.262e-02 4.909e-01 -0.148 0.88240
## Job.UnemployedUnskilled 9.950e-01 8.532e-01 1.166 0.24352
## Job.UnskilledResident 1.006e-01 3.978e-01 0.253 0.80027
## Job.SkilledEmployee 1.195e-02 3.242e-01 0.037 0.97060
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 980.75 on 799 degrees of freedom
## Residual deviance: 717.58 on 751 degrees of freedom
## AIC: 815.58
##
## Number of Fisher Scoring iterations: 5
Your observation: Out of all our variables, only 14 are significant at the 0.05 level.
pred.glm0.train <- predict(credit.glm0,type="response")
hist(pred.glm0.train)
Your observation: This histogram shows a left-skewed distribution, meaning that probability increases as you move toward 1.
# define a cost function with input "obs" being observed response
# and "pi" being predicted probability, and "pcut" being the threshold.
costfunc = function(obs, pred.p, pcut){
weight_FN = 1 # define the weight for "true=1 but pred=0" (False Neg)
weight_FP = 1 # define the weight for "true=0 but pred=1" (False Pos)
# increase weight for more important category - FN or FP
c1 = (obs==1)&(pred.p<pcut) # count for "true=1 but pred=0" (False Neg)
c0 = (obs==0)&(pred.p>=pcut) # count for "true=0 but pred=1" (False Pos)
cost = mean(weight_FN*c1 + weight_FP*c0) # misclassification with weight
return(cost) # you have to return to a value when you write R functions
} # end of the function
# define a sequence from 0.01 to 1 by 0.01
p.seq = seq(0.01, 1, 0.01)
mean_error = rep(0, length(p.seq))
for(i in 1:length(p.seq)){
mean_error[i] = costfunc(obs = credit.train$Class, pred.p = pred.glm0.train, pcut = p.seq[i])
} # end of the loop
# draw a plot with X axis being all pcut and Y axis being associated cost
plot(p.seq, mean_error)
MR<- mean(credit.train$Class!=credit.train)
print(MR)
## [1] 0.6246939
Your observation: In this model, the optimal probability cut-off point is around 0.42, and the misclassification rate is about 62%.
# get binary prediction
class.train<- (pred.glm0.train> 0.42)*1
# get confusion matrix
table(credit.train$Class, class.train, dnn = c("True", "Predicted"))
## Predicted
## True 0 1
## FALSE 112 130
## TRUE 37 521
# (equal-weighted) misclassification rate
MR_train<- mean(credit.train$Class!=class.train)
MR_train
## [1] 0.20875
Your observation: When we set our optimal probability cut-off point at 0.42, our misclassification rate falls to 29%.
#install.packages('ROCR')
library(ROCR)
pred <- prediction(pred.glm0.train, credit.train$Class)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)
#Get the AUC
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.835829
Your observation: In our training set, the AUC is 0.84. If an AUC of 0.50 is random chance and an AUC of 1.0 is perfect prediction, we know that an AUC of 0.84 is pretty good!
pred.glm0.test <- predict(credit.glm0, newdata = credit.test, type="response")
hist(pred.glm0.test)
# get binary prediction
class.test <- (pred.glm0.test> 0.42)*1
# get confusion matrix
table(credit.test$Class, class.test, dnn = c("True", "Predicted"))
## Predicted
## True 0 1
## FALSE 25 33
## TRUE 14 128
# (equal-weighted) misclassification rate
MR_test<- mean(credit.test$Class!=class.test)
MR_test
## [1] 0.235
Your observation: Our histogram is still left-skewed, but the y-axis frequency has fallen from 200 in the training set to 70 in the test set. Our misclassification rate is down to 23.5%.
pred_test <- prediction(pred.glm0.test, credit.test$Class)
perf_test <- performance(pred, "tpr", "fpr")
plot(perf_test, colorize=TRUE)
#Get the AUC
unlist(slot(performance(pred_test, "auc"), "y.values"))
## [1] 0.8091306
Your observation: With our test set, our AUC is 0.81 indicating that it doesn’t perform quite as well as our training set.
Now, let’s assume “It is worse to class a customer as good when they are bad (weight = 5), than it is to class a customer as bad when they are good (weight = 1).” Please figure out which weight should be 5 and which weight should be ### 1. Then define your cost function accordingly!
# define a cost function with input "obs" being observed response
# and "pi" being predicted probability, and "pcut" being the threshold.
costfunc = function(obs, pred.p, pcut){
weight_FN = 1 # define the weight for "true=1 but pred=0" (False Neg)
weight_FP = 5 # define the weight for "true=0 but pred=1" (False Pos)
# increase weight for more important category - FN or FP
c1 = (obs==1)&(pred.p<pcut) # count for "true=1 but pred=0" (False Neg)
c0 = (obs==0)&(pred.p>=pcut) # count for "true=0 but pred=1" (False Pos)
cost = mean(weight_FN*c1 + weight_FP*c0) # misclassification with weight
return(cost) # you have to return to a value when you write R functions
} # end of the function
# define a sequence from 0.01 to 1 by 0.01
p.seq = seq(0.01, 1, 0.01)
mean_error = rep(0, length(p.seq))
for(i in 1:length(p.seq)){
mean_error[i] = costfunc(obs = credit.train$Class, pred.p = pred.glm0.train, pcut = p.seq[i])
} # end of the loop
# draw a plot with X axis being all pcut and Y axis being associated cost
plot(p.seq, mean_error)
Your observation: When we adjust our weights to give more importance on False Positives, our optimal cut-off point changes to somewhere around 0.79.
# get binary prediction
class.train_weight5 <- (pred.glm0.train> 0.79)*1
# get confusion matrix
table(credit.train$Class, class.train, dnn = c("True", "Predicted"))
## Predicted
## True 0 1
## FALSE 112 130
## TRUE 37 521
# (equal-weighted) misclassification rate
MR_train_weight5 <- mean(credit.train$Class!=class.train_weight5)
MR_train
## [1] 0.20875
Your observation: With the adjusted weight, our new misclassification rate increases to 29%, but we decreased our False Positives from 130 in our training set with equal weight down to 34 FP in after adjusting the weights.
# get binary prediction
class.test_weight5 <- (pred.glm0.test> 0.79)*1
# get confusion matrix
table(credit.test$Class, class.test_weight5, dnn = c("True", "Predicted"))
## Predicted
## True 0 1
## FALSE 49 9
## TRUE 51 91
# (equal-weighted) misclassification rate
MR_test <- mean(credit.test$Class!=class.test_weight5)
MR_test
## [1] 0.3
Your observation: Our testing dataset shows even few False Positives, now down to 9 observations. Our misclassifcation rate has stayed fairly steady, only increasing slightly from 29% to 30%.
Summarize your findings, including the optimal probability cut-off, MR and AUC (if calculated) for both in-sample and out-of-sample data. Discuss what you observed and make some suggestions on how can we improve the model.
Summary: Our in-sample MR is 0.29 and the AUC is 0.835. Our out-of-sample MR is 0.235 and the AUC is 0.809. When we adjust our weights, our in-smaple MR stays at 0.29 and the out-of-sample MR is 0.30.
If our goal is to give more importance to reducing False Positives, our best model is when we adjust the weights and run it on our testing dataset. To imporve the model, we could look at fine tuning the cut-off points (for example, run it again with a cut off of 0.795, 0.78, or 0.80 to see if we can get better results.) We could also try removing the non-signifcant variables to make the model more parsimonious while still providing accurate probability prediction.