Refer to http://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data))
for variable description. THe response variable is Class
and all others are predictors.
Only run the following code once to install the package
caret. The German credit scoring data in
provided in that package.
install.packages('caret')
library(caret)
library(caret) #this package contains the german data with its numeric format
## Loading required package: ggplot2
## Loading required package: lattice
data(GermanCredit)
GermanCredit$Class <- GermanCredit$Class == "Good"
str(GermanCredit)
## 'data.frame': 1000 obs. of 62 variables:
## $ Duration : int 6 48 12 42 24 36 24 36 12 30 ...
## $ Amount : int 1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
## $ InstallmentRatePercentage : int 4 2 2 2 3 2 3 2 2 4 ...
## $ ResidenceDuration : int 4 2 3 4 4 4 4 2 4 2 ...
## $ Age : int 67 22 49 45 53 35 53 35 61 28 ...
## $ NumberExistingCredits : int 2 1 1 1 2 1 1 1 1 2 ...
## $ NumberPeopleMaintenance : int 1 1 2 2 2 2 1 1 1 1 ...
## $ Telephone : num 0 1 1 1 1 0 1 0 1 1 ...
## $ ForeignWorker : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Class : logi TRUE FALSE TRUE TRUE FALSE TRUE ...
## $ CheckingAccountStatus.lt.0 : num 1 0 0 1 1 0 0 0 0 0 ...
## $ CheckingAccountStatus.0.to.200 : num 0 1 0 0 0 0 0 1 0 1 ...
## $ CheckingAccountStatus.gt.200 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CheckingAccountStatus.none : num 0 0 1 0 0 1 1 0 1 0 ...
## $ CreditHistory.NoCredit.AllPaid : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CreditHistory.ThisBank.AllPaid : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CreditHistory.PaidDuly : num 0 1 0 1 0 1 1 1 1 0 ...
## $ CreditHistory.Delay : num 0 0 0 0 1 0 0 0 0 0 ...
## $ CreditHistory.Critical : num 1 0 1 0 0 0 0 0 0 1 ...
## $ Purpose.NewCar : num 0 0 0 0 1 0 0 0 0 1 ...
## $ Purpose.UsedCar : num 0 0 0 0 0 0 0 1 0 0 ...
## $ Purpose.Furniture.Equipment : num 0 0 0 1 0 0 1 0 0 0 ...
## $ Purpose.Radio.Television : num 1 1 0 0 0 0 0 0 1 0 ...
## $ Purpose.DomesticAppliance : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Repairs : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Education : num 0 0 1 0 0 1 0 0 0 0 ...
## $ Purpose.Vacation : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Retraining : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Business : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Other : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SavingsAccountBonds.lt.100 : num 0 1 1 1 1 0 0 1 0 1 ...
## $ SavingsAccountBonds.100.to.500 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SavingsAccountBonds.500.to.1000 : num 0 0 0 0 0 0 1 0 0 0 ...
## $ SavingsAccountBonds.gt.1000 : num 0 0 0 0 0 0 0 0 1 0 ...
## $ SavingsAccountBonds.Unknown : num 1 0 0 0 0 1 0 0 0 0 ...
## $ EmploymentDuration.lt.1 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ EmploymentDuration.1.to.4 : num 0 1 0 0 1 1 0 1 0 0 ...
## $ EmploymentDuration.4.to.7 : num 0 0 1 1 0 0 0 0 1 0 ...
## $ EmploymentDuration.gt.7 : num 1 0 0 0 0 0 1 0 0 0 ...
## $ EmploymentDuration.Unemployed : num 0 0 0 0 0 0 0 0 0 1 ...
## $ Personal.Male.Divorced.Seperated : num 0 0 0 0 0 0 0 0 1 0 ...
## $ Personal.Female.NotSingle : num 0 1 0 0 0 0 0 0 0 0 ...
## $ Personal.Male.Single : num 1 0 1 1 1 1 1 1 0 0 ...
## $ Personal.Male.Married.Widowed : num 0 0 0 0 0 0 0 0 0 1 ...
## $ Personal.Female.Single : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherDebtorsGuarantors.None : num 1 1 1 0 1 1 1 1 1 1 ...
## $ OtherDebtorsGuarantors.CoApplicant : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherDebtorsGuarantors.Guarantor : num 0 0 0 1 0 0 0 0 0 0 ...
## $ Property.RealEstate : num 1 1 1 0 0 0 0 0 1 0 ...
## $ Property.Insurance : num 0 0 0 1 0 0 1 0 0 0 ...
## $ Property.CarOther : num 0 0 0 0 0 0 0 1 0 1 ...
## $ Property.Unknown : num 0 0 0 0 1 1 0 0 0 0 ...
## $ OtherInstallmentPlans.Bank : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherInstallmentPlans.Stores : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherInstallmentPlans.None : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Housing.Rent : num 0 0 0 0 0 0 0 1 0 0 ...
## $ Housing.Own : num 1 1 1 0 0 0 1 0 1 1 ...
## $ Housing.ForFree : num 0 0 0 1 1 1 0 0 0 0 ...
## $ Job.UnemployedUnskilled : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Job.UnskilledResident : num 0 0 1 0 0 1 0 0 1 0 ...
## $ Job.SkilledEmployee : num 1 1 0 1 1 0 1 0 0 0 ...
## $ Job.Management.SelfEmp.HighlyQualified: num 0 0 0 0 0 0 0 1 0 1 ...
Your observation: Upon looking at the structure of the dataset, I can see that there’s multiple variables that are non-binary. These include ‘duration’, ‘amount’, ‘installmentratepercentage’, and ‘residenceduration’ to name a few. As far as variables that are binary, I can see that there’s a lot of them.
#summary statistics
summary(GermanCredit)
#correlation matrix
corr_matrix <- cor(GermanCredit)
## Warning in cor(GermanCredit): the standard deviation is zero
#head
head(GermanCredit)
Your observation: When looking at the summary statistics of the dataset, I can see that many of the variables have a value of 0 for the minimum, 1st quartile, median, and 3rd quartile. Furthermore, ‘amount’ appears to have the highest minimum, 1st quartile, median, mean, 3rd quartile, and maximum. When trying to run a correlation matrix, R gives me a warning message and says that the standard deviation is zero. This could potentially mean a correlation matrix isn’t appropriate for the dataset.
2023 for
reproducibility.#Set the seed for reproducibility
set.seed(2023)
#Randomly sample row indices for the training set
train_indices <- sample(1:NROW(GermanCredit),NROW(GermanCredit)*0.70)
#Create the training set
train_data <- GermanCredit[train_indices,]
#Create the testing set
test_data <- GermanCredit[-train_indices,]
Your observation: When splitting the dataset, in the testing dataset the observations went down to 300 from 1000, and in the training dataset, the observations went down to 700 from 1000.
model = glm(Class ~ ., data = train_data, family = "binomial")
print(model)
##
## Call: glm(formula = Class ~ ., family = "binomial", data = train_data)
##
## Coefficients:
## (Intercept) Duration
## 8.8191208 -0.0174850
## Amount InstallmentRatePercentage
## -0.0001529 -0.2743598
## ResidenceDuration Age
## 0.0586728 0.0147234
## NumberExistingCredits NumberPeopleMaintenance
## -0.2727979 -0.2757525
## Telephone ForeignWorker
## -0.4679334 -2.0028783
## CheckingAccountStatus.lt.0 CheckingAccountStatus.0.to.200
## -2.0813009 -1.2921389
## CheckingAccountStatus.gt.200 CheckingAccountStatus.none
## -0.6405828 NA
## CreditHistory.NoCredit.AllPaid CreditHistory.ThisBank.AllPaid
## -1.3543231 -1.4207491
## CreditHistory.PaidDuly CreditHistory.Delay
## -0.7456627 -0.6014926
## CreditHistory.Critical Purpose.NewCar
## NA -1.5170231
## Purpose.UsedCar Purpose.Furniture.Equipment
## -0.1206474 -0.8100598
## Purpose.Radio.Television Purpose.DomesticAppliance
## -0.7686043 -0.4085537
## Purpose.Repairs Purpose.Education
## -1.8202999 -1.3311594
## Purpose.Vacation Purpose.Retraining
## NA -0.6650247
## Purpose.Business Purpose.Other
## -0.9463089 NA
## SavingsAccountBonds.lt.100 SavingsAccountBonds.100.to.500
## -0.9640859 -0.6702615
## SavingsAccountBonds.500.to.1000 SavingsAccountBonds.gt.1000
## -0.0311834 -0.3865652
## SavingsAccountBonds.Unknown EmploymentDuration.lt.1
## NA -0.1931434
## EmploymentDuration.1.to.4 EmploymentDuration.4.to.7
## -0.1587566 0.3891026
## EmploymentDuration.gt.7 EmploymentDuration.Unemployed
## -0.0427383 NA
## Personal.Male.Divorced.Seperated Personal.Female.NotSingle
## -0.7094751 -0.3539920
## Personal.Male.Single Personal.Male.Married.Widowed
## 0.1836423 NA
## Personal.Female.Single OtherDebtorsGuarantors.None
## NA -0.7390022
## OtherDebtorsGuarantors.CoApplicant OtherDebtorsGuarantors.Guarantor
## -1.2549232 NA
## Property.RealEstate Property.Insurance
## 0.9372156 0.6193592
## Property.CarOther Property.Unknown
## 0.5877418 NA
## OtherInstallmentPlans.Bank OtherInstallmentPlans.Stores
## -0.5171535 -0.3154909
## OtherInstallmentPlans.None Housing.Rent
## NA -0.8147496
## Housing.Own Housing.ForFree
## -0.1218148 NA
## Job.UnemployedUnskilled Job.UnskilledResident
## 0.7152327 0.0667202
## Job.SkilledEmployee Job.Management.SelfEmp.HighlyQualified
## -0.0661251 NA
##
## Degrees of Freedom: 699 Total (i.e. Null); 651 Residual
## Null Deviance: 856.9
## Residual Deviance: 616.2 AIC: 714.2
Your observation: I notice that there’s multiple missing values. In addition, because the AIC is 714.2, which is super high, this means that this model is not a good fit since a lower AIC value indicates a better-fitting model.
summary(model)
##
## Call:
## glm(formula = Class ~ ., family = "binomial", data = train_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5779 -0.7282 0.3565 0.6825 2.1820
##
## Coefficients: (13 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 8.8191208 1.7624784 5.004 5.62e-07
## Duration -0.0174850 0.0108921 -1.605 0.10843
## Amount -0.0001529 0.0000530 -2.885 0.00391
## InstallmentRatePercentage -0.2743599 0.1074850 -2.553 0.01069
## ResidenceDuration 0.0586728 0.1078146 0.544 0.58630
## Age 0.0147234 0.0112304 1.311 0.18985
## NumberExistingCredits -0.2727979 0.2399140 -1.137 0.25551
## NumberPeopleMaintenance -0.2757525 0.3274757 -0.842 0.39976
## Telephone -0.4679334 0.2439420 -1.918 0.05508
## ForeignWorker -2.0028783 0.8576485 -2.335 0.01953
## CheckingAccountStatus.lt.0 -2.0813009 0.2881259 -7.224 5.06e-13
## CheckingAccountStatus.0.to.200 -1.2921389 0.2771600 -4.662 3.13e-06
## CheckingAccountStatus.gt.200 -0.6405828 0.4830144 -1.326 0.18477
## CheckingAccountStatus.none NA NA NA NA
## CreditHistory.NoCredit.AllPaid -1.3543231 0.5125284 -2.642 0.00823
## CreditHistory.ThisBank.AllPaid -1.4207491 0.5467234 -2.599 0.00936
## CreditHistory.PaidDuly -0.7456626 0.3106988 -2.400 0.01640
## CreditHistory.Delay -0.6014926 0.4293773 -1.401 0.16126
## CreditHistory.Critical NA NA NA NA
## Purpose.NewCar -1.5170231 0.9359195 -1.621 0.10504
## Purpose.UsedCar -0.1206474 0.9621073 -0.125 0.90021
## Purpose.Furniture.Equipment -0.8100598 0.9420714 -0.860 0.38986
## Purpose.Radio.Television -0.7686043 0.9505841 -0.809 0.41877
## Purpose.DomesticAppliance -0.4085537 1.3479085 -0.303 0.76181
## Purpose.Repairs -1.8202999 1.1005639 -1.654 0.09813
## Purpose.Education -1.3311594 1.0289082 -1.294 0.19575
## Purpose.Vacation NA NA NA NA
## Purpose.Retraining -0.6650247 1.5831072 -0.420 0.67443
## Purpose.Business -0.9463089 0.9635263 -0.982 0.32604
## Purpose.Other NA NA NA NA
## SavingsAccountBonds.lt.100 -0.9640859 0.3172524 -3.039 0.00237
## SavingsAccountBonds.100.to.500 -0.6702615 0.4129080 -1.623 0.10453
## SavingsAccountBonds.500.to.1000 -0.0311834 0.6154087 -0.051 0.95959
## SavingsAccountBonds.gt.1000 -0.3865652 0.6310337 -0.613 0.54015
## SavingsAccountBonds.Unknown NA NA NA NA
## EmploymentDuration.lt.1 -0.1931434 0.5565708 -0.347 0.72857
## EmploymentDuration.1.to.4 -0.1587566 0.5372056 -0.296 0.76759
## EmploymentDuration.4.to.7 0.3891026 0.5671570 0.686 0.49268
## EmploymentDuration.gt.7 -0.0427382 0.5449721 -0.078 0.93749
## EmploymentDuration.Unemployed NA NA NA NA
## Personal.Male.Divorced.Seperated -0.7094751 0.5352264 -1.326 0.18499
## Personal.Female.NotSingle -0.3539920 0.3752277 -0.943 0.34547
## Personal.Male.Single 0.1836423 0.3789886 0.485 0.62799
## Personal.Male.Married.Widowed NA NA NA NA
## Personal.Female.Single NA NA NA NA
## OtherDebtorsGuarantors.None -0.7390022 0.4889566 -1.511 0.13069
## OtherDebtorsGuarantors.CoApplicant -1.2549232 0.7077516 -1.773 0.07621
## OtherDebtorsGuarantors.Guarantor NA NA NA NA
## Property.RealEstate 0.9372156 0.4929071 1.901 0.05725
## Property.Insurance 0.6193592 0.4725768 1.311 0.18999
## Property.CarOther 0.5877418 0.4617889 1.273 0.20311
## Property.Unknown NA NA NA NA
## OtherInstallmentPlans.Bank -0.5171535 0.2875924 -1.798 0.07214
## OtherInstallmentPlans.Stores -0.3154909 0.5294274 -0.596 0.55124
## OtherInstallmentPlans.None NA NA NA NA
## Housing.Rent -0.8147496 0.5572708 -1.462 0.14373
## Housing.Own -0.1218148 0.5223979 -0.233 0.81562
## Housing.ForFree NA NA NA NA
## Job.UnemployedUnskilled 0.7152327 0.8860470 0.807 0.41954
## Job.UnskilledResident 0.0667202 0.4369190 0.153 0.87863
## Job.SkilledEmployee -0.0661251 0.3611982 -0.183 0.85474
## Job.Management.SelfEmp.HighlyQualified NA NA NA NA
##
## (Intercept) ***
## Duration
## Amount **
## InstallmentRatePercentage *
## ResidenceDuration
## Age
## NumberExistingCredits
## NumberPeopleMaintenance
## Telephone .
## ForeignWorker *
## CheckingAccountStatus.lt.0 ***
## CheckingAccountStatus.0.to.200 ***
## CheckingAccountStatus.gt.200
## CheckingAccountStatus.none
## CreditHistory.NoCredit.AllPaid **
## CreditHistory.ThisBank.AllPaid **
## CreditHistory.PaidDuly *
## CreditHistory.Delay
## CreditHistory.Critical
## Purpose.NewCar
## Purpose.UsedCar
## Purpose.Furniture.Equipment
## Purpose.Radio.Television
## Purpose.DomesticAppliance
## Purpose.Repairs .
## Purpose.Education
## Purpose.Vacation
## Purpose.Retraining
## Purpose.Business
## Purpose.Other
## SavingsAccountBonds.lt.100 **
## SavingsAccountBonds.100.to.500
## SavingsAccountBonds.500.to.1000
## SavingsAccountBonds.gt.1000
## SavingsAccountBonds.Unknown
## EmploymentDuration.lt.1
## EmploymentDuration.1.to.4
## EmploymentDuration.4.to.7
## EmploymentDuration.gt.7
## EmploymentDuration.Unemployed
## Personal.Male.Divorced.Seperated
## Personal.Female.NotSingle
## Personal.Male.Single
## Personal.Male.Married.Widowed
## Personal.Female.Single
## OtherDebtorsGuarantors.None
## OtherDebtorsGuarantors.CoApplicant .
## OtherDebtorsGuarantors.Guarantor
## Property.RealEstate .
## Property.Insurance
## Property.CarOther
## Property.Unknown
## OtherInstallmentPlans.Bank .
## OtherInstallmentPlans.Stores
## OtherInstallmentPlans.None
## Housing.Rent
## Housing.Own
## Housing.ForFree
## Job.UnemployedUnskilled
## Job.UnskilledResident
## Job.SkilledEmployee
## Job.Management.SelfEmp.HighlyQualified
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 856.90 on 699 degrees of freedom
## Residual deviance: 616.15 on 651 degrees of freedom
## AIC: 714.15
##
## Number of Fisher Scoring iterations: 5
Your observation: According to the model, there are more coefficients that have non-significant p-values than coefficients with significant p-values.
To evaluate the prediction performance on all the observed data, we can similarly have
pred_prob = predict(model, train_data, type = "response")
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
pred_value = 1*(pred_prob>0.5)
cbind(train_data, pred_prob, pred_value)
Now, we plot everything in one figure.
plot(train_data$Duration,train_data$Class)
points(train_data$Duration,train_data$Class,pch=20)
points(train_data$Duration,pred_prob,type="h",lty=3)
abline(h=0.5,col="grey")
points(train_data$Duration,pred_value,pch=1,cex=1.3)
#Confusion matrix
actual_value=train_data$Class
pred_value = 1*(pred_prob>0.5)
confusion_matrix=table(actual_value,pred_value)
confusion_matrix
## pred_value
## actual_value 0 1
## FALSE 116 95
## TRUE 53 436
#misclassification error rate
misclassification_error_rate=1-sum(diag(confusion_matrix))/sum(confusion_matrix)
misclassification_error_rate
## [1] 0.2114286
#define a cost function with input "obs" being observed response
#and "pi" being predicted probability, and "pcut" being the threshold.
costfunc = function(obs, pred.p, pcut){
weight1 = 1 #define the weight for "true=1 but pred=0" (FN)
weight0 = 1 #define the weight for "true=0 but pred=1" (FP)
c1 = (obs==1)&(pred.p<pcut) #count for "true=1 but pred=0" (FN)
c0 = (obs==0)&(pred.p>=pcut) #count for "true=0 but pred=1" (FP)
cost = mean(weight1*c1 + weight0*c0) # misclassification with weight
return(cost) #you have to return to a value when you write R functions
} #end of the function
#define a sequence from 0.01 to 1 by 0.01
p.seq = seq(0.01, 1, 0.01)
mean_error = rep(0, length(p.seq))
for(i in 1:length(p.seq)){
mean_error[i] = costfunc(obs = train_data$Class, pred.p = pred_prob, pcut = p.seq[i])
} #end of the loop
#draw a plot with X axis being all pcut and Y axis being associated cost
plot(p.seq, mean_error)
#when p-cut is 0.5, we get the previous mean error:
mean_error[which(p.seq==0.5)]
## [1] 0.2114286
#What's the best p-cut and lowest mean error we can get?
optimal.pcut = p.seq[which(mean_error==min(mean_error))]
optimal.pcut
## [1] 0.44
min(mean_error)
## [1] 0.1971429
Your observation: Before using the misclassification rate, the mean error was 0.21. But after using the misclassifcation rate, the mean error is now 0.44.
pred_value = 1*(pred_prob>0.44)
confusion_matrix=table(actual_value,pred_value)
confusion_matrix
## pred_value
## actual_value 0 1
## FALSE 106 105
## TRUE 33 456
misclassification_error_rate=1-sum(diag(confusion_matrix))/sum(confusion_matrix)
misclassification_error_rate
## [1] 0.1971429
Your observation: In the confusion matrix, there’s almost an equal number of true negatives and false positives. But there are much more number of true positives than false negatives. When looking at the MR, the value is 0.1971, which means that roughly 19.71% of the instances are being misclassified by the model. I would consider this to be a relatively high MR.
library(ROCR)
pred <- prediction(pred_prob, train_data$Class)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)
#Get the AUC
auc_train <- unlist(slot(performance(pred, "auc"), "y.values"))
auc_train
## [1] 0.8424583
Your observation: When looking at the ROC curve, I can see that the curve approaches more towards a higher false positive rate and true positive rate. Since the AUC value is 0.84, and an AUC model of 1 represents a perfect model, I would consider the AUC in this model to be good at maintaining good performance across different threshold settings.
#logistic regression model
model2 = glm(Class ~ ., data = test_data, family = "binomial")
summary(model2)
#Prediction performance
pred_prob2 = predict(model2, test_data, type = "response")
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
pred_value2 = 1*(pred_prob2>0.5)
cbind(test_data, pred_prob2, pred_value2)
#Confusion matrix
actual_value2=test_data$Class
pred_value2 = 1*(pred_prob2>0.44)
confusion_matrix2=table(actual_value2,pred_value2)
confusion_matrix2
#MR
misclassification_error_rate2=1-sum(diag(confusion_matrix2))/sum(confusion_matrix2)
misclassification_error_rate2
Your observation: According to the confusion matrix, there are more number of true negatives than false positives And there are more number of true positives than false negatives. According to the MR, the MR is 0.1466, which means that approximately 14.67% of the instances are being misclassified by the model. I would consider this to be a relatively high MR, but it is notably lower than the training set’s MR.
library(ROCR)
pred2 <- prediction(pred_prob2, test_data$Class)
perf2 <- performance(pred2, "tpr", "fpr")
plot(perf2, colorize=TRUE)
#Get the AUC
auc_test <- unlist(slot(performance(pred2, "auc"), "y.values"))
auc_test
## [1] 0.901965
Your observation: Like in the training dataset, the curve approaches more towards a higher false positive rate and true positive rate. When looking at the AUC, the AUC is now at 0.90. This means that in the testing dataset, the AUC in this model is closer to representing a perfect model.
Now, let’s assume “It is worse to class a customer as good when they are bad (weight = 5), than it is to class a customer as bad when they are good (weight = 1).” Please figure out which weight should be 5 and which weight should be ### 1. Then define your cost function accordingly!
#training probability cut-off point
actual_value=train_data$Class
pred_value = 1*(pred_prob>0.5)
confusion_matrix=table(actual_value,pred_value)
confusion_matrix
## pred_value
## actual_value 0 1
## FALSE 116 95
## TRUE 53 436
#misclassification error rate
misclassification_error_rate=1-sum(diag(confusion_matrix))/sum(confusion_matrix)
misclassification_error_rate
## [1] 0.2114286
#define a cost function with input "obs" being observed response
#and "pi" being predicted probability, and "pcut" being the threshold.
costfunc = function(obs, pred.p, pcut){
weight1again = 5 #define the weight for "true=1 but pred=0" (FN)
weight0again = 1 #define the weight for "true=0 but pred=1" (FP)
c1again = (obs==1)&(pred.p<pcut) #count for "true=1 but pred=0" (FN)
c0again = (obs==0)&(pred.p>=pcut) #count for "true=0 but pred=1" (FP)
cost2 = mean(weight1again*c1again + weight0again*c0again) # misclassification with weight
return(cost2) #you have to return to a value when you write R functions
} #end of the function
#define a sequence from 0.01 to 1 by 0.01
p.seq2 = seq(0.01, 1, 0.01)
mean_error2 = rep(0, length(p.seq2))
for(i in 1:length(p.seq2)){
mean_error2[i] = costfunc(obs = test_data$Class, pred.p = pred_prob2, pcut = p.seq2[i])
} #end of the loop
#draw a plot with X axis being all pcut and Y axis being associated cost
plot(p.seq2, mean_error2)
#when p-cut is 0.5, we get the previous mean error:
mean_error2[which(p.seq2==0.5)]
## [1] 0.3033333
#What's the best p-cut and lowest mean error we can get?
optimal.pcut2 = p.seq2[which(mean_error2==min(mean_error2))]
optimal.pcut2
## [1] 0.16 0.17
min(mean_error2)
## [1] 0.2166667
Your observation: When obtaining the optimal probability cut-off point with the new weights, I was given two of them: 0.16 and 0.17.
#cut-off point at 0.16
#confusion matrix
pred_value = 1*(pred_prob>0.16)
confusion_matrix=table(actual_value,pred_value)
confusion_matrix
## pred_value
## actual_value 0 1
## FALSE 27 184
## TRUE 4 485
#MR
misclassification_error_rate=1-sum(diag(confusion_matrix))/sum(confusion_matrix)
misclassification_error_rate
## [1] 0.2685714
#cut-off point at 0.17
#confusion matrix
pred_value = 1*(pred_prob>0.17)
confusion_matrix=table(actual_value,pred_value)
confusion_matrix
## pred_value
## actual_value 0 1
## FALSE 29 182
## TRUE 6 483
#MR
misclassification_error_rate=1-sum(diag(confusion_matrix))/sum(confusion_matrix)
misclassification_error_rate
## [1] 0.2685714
Your observation: When the cut-off point is at 0.16, there are more number of false positives than true negatives. And there are significantly more number of true positives than false negatives. When looking at the MR, the value is 0.2685, which means approximately 26.86% of instances are being misclassified by the model. On the other hand, when the cut-off point is at 0.17, there more number of false positives than true negatives, and significantly more number of true positives than false negatives. As far as for the MR, the value is also 0.2685.
#cut-off point at 0.16
pred_value2 = 1*(pred_prob2>0.16)
confusion_matrix2=table(actual_value2,pred_value2)
confusion_matrix2
## pred_value2
## actual_value2 0 1
## FALSE 29 60
## TRUE 1 210
misclassification_error_rate2=1-sum(diag(confusion_matrix2))/sum(confusion_matrix2)
misclassification_error_rate2
## [1] 0.2033333
#cut-off point at 0.17
pred_value2 = 1*(pred_prob2>0.17)
confusion_matrix2=table(actual_value2,pred_value2)
confusion_matrix2
## pred_value2
## actual_value2 0 1
## FALSE 29 60
## TRUE 1 210
misclassification_error_rate2=1-sum(diag(confusion_matrix2))/sum(confusion_matrix2)
misclassification_error_rate2
## [1] 0.2033333
Your observation: When the cut-off point is at 0.16, there are more number of false positives than true negatives, and significantly more number of true positives than false negatives. The MR is also once again at 20.33. When the cut-off point is at 0.17, the output is interestingly identical for the confusion matrix and MR.
Summarize your findings, including the optimal probability cut-off, MR and AUC (if calculated) for both in-sample and out-of-sample data. Discuss what you observed and make some suggestions on how can we improve the model.
For the MR, the value was 0.19 for the in-sample data, while it went down to 0.14 for the out-of-sample data. For the AUC, the value was 0.84 for the in-sample data data, while it went up to 0.90 in the out-of-sample data. When changing the weights to 5 and 1, the overall output for both the training and testing sets changed. For instance, with the optimal probability cut-off, the value was originally 0.44, but it changed to 0.16 and 0.17. When looking at the MR for the training set, the value was 0.26, while it was 0.20 in the testing set. Lastly, as far as how to improve the model, I would decide between which data is better to use: in-sample or out-of-sample data. When comparing the two datasets, it appears that the testing dataset gave an overall better output since it had a lower missclassification rate and a higher AUC.