Starter code for German credit scoring

Refer to http://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data) for the variable descriptions. The response variable is Class and all others are predictors.

Only run the following code once, to install the package caret. The German credit scoring data is provided in that package.

install.packages('caret')
library(caret)

Task 1: Data Preparation

1. Load the caret package and the GermanCredit dataset.

library(caret) #this package contains the German credit data in its numeric (dummy-coded) format
## Loading required package: ggplot2
## Loading required package: lattice
data(GermanCredit)
GermanCredit$Class <- GermanCredit$Class == "Good"  #recode the response: TRUE = Good, FALSE = Bad
str(GermanCredit)
## 'data.frame':    1000 obs. of  62 variables:
##  $ Duration                              : int  6 48 12 42 24 36 24 36 12 30 ...
##  $ Amount                                : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
##  $ InstallmentRatePercentage             : int  4 2 2 2 3 2 3 2 2 4 ...
##  $ ResidenceDuration                     : int  4 2 3 4 4 4 4 2 4 2 ...
##  $ Age                                   : int  67 22 49 45 53 35 53 35 61 28 ...
##  $ NumberExistingCredits                 : int  2 1 1 1 2 1 1 1 1 2 ...
##  $ NumberPeopleMaintenance               : int  1 1 2 2 2 2 1 1 1 1 ...
##  $ Telephone                             : num  0 1 1 1 1 0 1 0 1 1 ...
##  $ ForeignWorker                         : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Class                                 : logi  TRUE FALSE TRUE TRUE FALSE TRUE ...
##  $ CheckingAccountStatus.lt.0            : num  1 0 0 1 1 0 0 0 0 0 ...
##  $ CheckingAccountStatus.0.to.200        : num  0 1 0 0 0 0 0 1 0 1 ...
##  $ CheckingAccountStatus.gt.200          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CheckingAccountStatus.none            : num  0 0 1 0 0 1 1 0 1 0 ...
##  $ CreditHistory.NoCredit.AllPaid        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CreditHistory.ThisBank.AllPaid        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CreditHistory.PaidDuly                : num  0 1 0 1 0 1 1 1 1 0 ...
##  $ CreditHistory.Delay                   : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ CreditHistory.Critical                : num  1 0 1 0 0 0 0 0 0 1 ...
##  $ Purpose.NewCar                        : num  0 0 0 0 1 0 0 0 0 1 ...
##  $ Purpose.UsedCar                       : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ Purpose.Furniture.Equipment           : num  0 0 0 1 0 0 1 0 0 0 ...
##  $ Purpose.Radio.Television              : num  1 1 0 0 0 0 0 0 1 0 ...
##  $ Purpose.DomesticAppliance             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Repairs                       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Education                     : num  0 0 1 0 0 1 0 0 0 0 ...
##  $ Purpose.Vacation                      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Retraining                    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Business                      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Other                         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SavingsAccountBonds.lt.100            : num  0 1 1 1 1 0 0 1 0 1 ...
##  $ SavingsAccountBonds.100.to.500        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SavingsAccountBonds.500.to.1000       : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ SavingsAccountBonds.gt.1000           : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ SavingsAccountBonds.Unknown           : num  1 0 0 0 0 1 0 0 0 0 ...
##  $ EmploymentDuration.lt.1               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ EmploymentDuration.1.to.4             : num  0 1 0 0 1 1 0 1 0 0 ...
##  $ EmploymentDuration.4.to.7             : num  0 0 1 1 0 0 0 0 1 0 ...
##  $ EmploymentDuration.gt.7               : num  1 0 0 0 0 0 1 0 0 0 ...
##  $ EmploymentDuration.Unemployed         : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ Personal.Male.Divorced.Seperated      : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ Personal.Female.NotSingle             : num  0 1 0 0 0 0 0 0 0 0 ...
##  $ Personal.Male.Single                  : num  1 0 1 1 1 1 1 1 0 0 ...
##  $ Personal.Male.Married.Widowed         : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ Personal.Female.Single                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherDebtorsGuarantors.None           : num  1 1 1 0 1 1 1 1 1 1 ...
##  $ OtherDebtorsGuarantors.CoApplicant    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherDebtorsGuarantors.Guarantor      : num  0 0 0 1 0 0 0 0 0 0 ...
##  $ Property.RealEstate                   : num  1 1 1 0 0 0 0 0 1 0 ...
##  $ Property.Insurance                    : num  0 0 0 1 0 0 1 0 0 0 ...
##  $ Property.CarOther                     : num  0 0 0 0 0 0 0 1 0 1 ...
##  $ Property.Unknown                      : num  0 0 0 0 1 1 0 0 0 0 ...
##  $ OtherInstallmentPlans.Bank            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherInstallmentPlans.Stores          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherInstallmentPlans.None            : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Housing.Rent                          : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ Housing.Own                           : num  1 1 1 0 0 0 1 0 1 1 ...
##  $ Housing.ForFree                       : num  0 0 0 1 1 1 0 0 0 0 ...
##  $ Job.UnemployedUnskilled               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Job.UnskilledResident                 : num  0 0 1 0 0 1 0 0 1 0 ...
##  $ Job.SkilledEmployee                   : num  1 1 0 1 1 0 1 0 0 0 ...
##  $ Job.Management.SelfEmp.HighlyQualified: num  0 0 0 0 0 0 0 1 0 1 ...

Your observation: Looking at the structure of the dataset, most of the 62 variables are binary dummy indicators derived from the original categorical attributes (CheckingAccountStatus, CreditHistory, Purpose, and so on). Only a handful are genuinely numeric, such as Duration, Amount, InstallmentRatePercentage, ResidenceDuration, and Age.
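
A quick way to confirm this split between binary and non-binary columns (a sketch; it only assumes the dataset is loaded as above, and it counts any two-valued column, e.g. a 1/2 count, as “binary”):

#count columns taking at most two distinct values (Class included)
binary_cols <- sapply(GermanCredit, function(x) length(unique(x)) <= 2)
sum(binary_cols)                   #number of binary columns
names(GermanCredit)[!binary_cols]  #the few non-binary columns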

2. Explore the dataset to understand its structure.

#summary statistics
summary(GermanCredit)
#correlation matrix
corr_matrix <- cor(GermanCredit)
## Warning in cor(GermanCredit): the standard deviation is zero
#head
head(GermanCredit)

Your observation: In the summary statistics, many of the dummy variables show 0 for the minimum, 1st quartile, and median, which simply reflects how sparse the indicator columns are; Amount has by far the largest scale of any variable. The warning from cor() (“the standard deviation is zero”) does not mean a correlation matrix is inappropriate for this dataset. It means at least one column is constant in this sample, and the correlation of anything with a constant column is undefined, so those entries come out as NA.
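
One way to track down the offending columns and compute correlations on the rest (a sketch; caret’s nearZeroVar would be another option):

#find the constant (zero-variance) columns that triggered the warning
constant_cols <- names(GermanCredit)[sapply(GermanCredit, function(x) length(unique(x)) == 1)]
constant_cols
#correlations are well defined once those columns are excluded
corr_matrix <- cor(GermanCredit[, !(names(GermanCredit) %in% constant_cols)])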

3. Split the dataset into training and test set. Please use the random seed as 2023 for reproducibility.

#Set the seed for reproducibility
set.seed(2023)
#Randomly sample row indices for the training set
train_indices <- sample(1:NROW(GermanCredit),NROW(GermanCredit)*0.70)
#Create the training set
train_data <- GermanCredit[train_indices,]
#Create the testing set
test_data <- GermanCredit[-train_indices,]

Your observation: The 70/30 split worked as intended: of the 1000 observations, 700 went to the training set and the remaining 300 to the test set.
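
An optional sanity check (assuming the split above) is that the Good/Bad proportions are similar in the two sets:

#compare the class balance of the two splits
prop.table(table(train_data$Class))
prop.table(table(test_data$Class))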

Task 2: Model Fitting

1. Fit a logistic regression model using the training set. Please use all variables, but make sure the variable types are right.

#fit a logistic regression on the training set using all predictors
model = glm(Class ~ ., data = train_data, family = "binomial")
print(model)
## 
## Call:  glm(formula = Class ~ ., family = "binomial", data = train_data)
## 
## Coefficients:
##                            (Intercept)                                Duration  
##                              8.8191208                              -0.0174850  
##                                 Amount               InstallmentRatePercentage  
##                             -0.0001529                              -0.2743598  
##                      ResidenceDuration                                     Age  
##                              0.0586728                               0.0147234  
##                  NumberExistingCredits                 NumberPeopleMaintenance  
##                             -0.2727979                              -0.2757525  
##                              Telephone                           ForeignWorker  
##                             -0.4679334                              -2.0028783  
##             CheckingAccountStatus.lt.0          CheckingAccountStatus.0.to.200  
##                             -2.0813009                              -1.2921389  
##           CheckingAccountStatus.gt.200              CheckingAccountStatus.none  
##                             -0.6405828                                      NA  
##         CreditHistory.NoCredit.AllPaid          CreditHistory.ThisBank.AllPaid  
##                             -1.3543231                              -1.4207491  
##                 CreditHistory.PaidDuly                     CreditHistory.Delay  
##                             -0.7456627                              -0.6014926  
##                 CreditHistory.Critical                          Purpose.NewCar  
##                                     NA                              -1.5170231  
##                        Purpose.UsedCar             Purpose.Furniture.Equipment  
##                             -0.1206474                              -0.8100598  
##               Purpose.Radio.Television               Purpose.DomesticAppliance  
##                             -0.7686043                              -0.4085537  
##                        Purpose.Repairs                       Purpose.Education  
##                             -1.8202999                              -1.3311594  
##                       Purpose.Vacation                      Purpose.Retraining  
##                                     NA                              -0.6650247  
##                       Purpose.Business                           Purpose.Other  
##                             -0.9463089                                      NA  
##             SavingsAccountBonds.lt.100          SavingsAccountBonds.100.to.500  
##                             -0.9640859                              -0.6702615  
##        SavingsAccountBonds.500.to.1000             SavingsAccountBonds.gt.1000  
##                             -0.0311834                              -0.3865652  
##            SavingsAccountBonds.Unknown                 EmploymentDuration.lt.1  
##                                     NA                              -0.1931434  
##              EmploymentDuration.1.to.4               EmploymentDuration.4.to.7  
##                             -0.1587566                               0.3891026  
##                EmploymentDuration.gt.7           EmploymentDuration.Unemployed  
##                             -0.0427383                                      NA  
##       Personal.Male.Divorced.Seperated               Personal.Female.NotSingle  
##                             -0.7094751                              -0.3539920  
##                   Personal.Male.Single           Personal.Male.Married.Widowed  
##                              0.1836423                                      NA  
##                 Personal.Female.Single             OtherDebtorsGuarantors.None  
##                                     NA                              -0.7390022  
##     OtherDebtorsGuarantors.CoApplicant        OtherDebtorsGuarantors.Guarantor  
##                             -1.2549232                                      NA  
##                    Property.RealEstate                      Property.Insurance  
##                              0.9372156                               0.6193592  
##                      Property.CarOther                        Property.Unknown  
##                              0.5877418                                      NA  
##             OtherInstallmentPlans.Bank            OtherInstallmentPlans.Stores  
##                             -0.5171535                              -0.3154909  
##             OtherInstallmentPlans.None                            Housing.Rent  
##                                     NA                              -0.8147496  
##                            Housing.Own                         Housing.ForFree  
##                             -0.1218148                                      NA  
##                Job.UnemployedUnskilled                   Job.UnskilledResident  
##                              0.7152327                               0.0667202  
##                    Job.SkilledEmployee  Job.Management.SelfEmp.HighlyQualified  
##                             -0.0661251                                      NA  
## 
## Degrees of Freedom: 699 Total (i.e. Null);  651 Residual
## Null Deviance:       856.9 
## Residual Deviance: 616.2     AIC: 714.2

Your observation: Several coefficients are reported as NA. These are not missing data: within each original categorical attribute the dummy columns sum to one, so the last dummy in each group is perfectly collinear with the intercept and the other dummies, and glm() drops it; the dropped level effectively becomes the reference category. Also, the AIC of 714.2 is not “high” in any absolute sense. AIC is only meaningful for comparing candidate models on the same data, where the lower value indicates the better-fitting model.
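
The collinearity is easy to verify directly. For example (a sketch using the checking-account block; any other dummy block behaves the same way):

#each block of dummies sums to 1 in every row, so the block is
#linearly dependent with the intercept and glm() must drop one level
checking <- grep("^CheckingAccountStatus", names(train_data))
all(rowSums(train_data[, checking]) == 1)   #TRUE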

2. Summarize the model and interpret the coefficients.

summary(model)
## 
## Call:
## glm(formula = Class ~ ., family = "binomial", data = train_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5779  -0.7282   0.3565   0.6825   2.1820  
## 
## Coefficients: (13 not defined because of singularities)
##                                          Estimate Std. Error z value Pr(>|z|)
## (Intercept)                             8.8191208  1.7624784   5.004 5.62e-07
## Duration                               -0.0174850  0.0108921  -1.605  0.10843
## Amount                                 -0.0001529  0.0000530  -2.885  0.00391
## InstallmentRatePercentage              -0.2743599  0.1074850  -2.553  0.01069
## ResidenceDuration                       0.0586728  0.1078146   0.544  0.58630
## Age                                     0.0147234  0.0112304   1.311  0.18985
## NumberExistingCredits                  -0.2727979  0.2399140  -1.137  0.25551
## NumberPeopleMaintenance                -0.2757525  0.3274757  -0.842  0.39976
## Telephone                              -0.4679334  0.2439420  -1.918  0.05508
## ForeignWorker                          -2.0028783  0.8576485  -2.335  0.01953
## CheckingAccountStatus.lt.0             -2.0813009  0.2881259  -7.224 5.06e-13
## CheckingAccountStatus.0.to.200         -1.2921389  0.2771600  -4.662 3.13e-06
## CheckingAccountStatus.gt.200           -0.6405828  0.4830144  -1.326  0.18477
## CheckingAccountStatus.none                     NA         NA      NA       NA
## CreditHistory.NoCredit.AllPaid         -1.3543231  0.5125284  -2.642  0.00823
## CreditHistory.ThisBank.AllPaid         -1.4207491  0.5467234  -2.599  0.00936
## CreditHistory.PaidDuly                 -0.7456626  0.3106988  -2.400  0.01640
## CreditHistory.Delay                    -0.6014926  0.4293773  -1.401  0.16126
## CreditHistory.Critical                         NA         NA      NA       NA
## Purpose.NewCar                         -1.5170231  0.9359195  -1.621  0.10504
## Purpose.UsedCar                        -0.1206474  0.9621073  -0.125  0.90021
## Purpose.Furniture.Equipment            -0.8100598  0.9420714  -0.860  0.38986
## Purpose.Radio.Television               -0.7686043  0.9505841  -0.809  0.41877
## Purpose.DomesticAppliance              -0.4085537  1.3479085  -0.303  0.76181
## Purpose.Repairs                        -1.8202999  1.1005639  -1.654  0.09813
## Purpose.Education                      -1.3311594  1.0289082  -1.294  0.19575
## Purpose.Vacation                               NA         NA      NA       NA
## Purpose.Retraining                     -0.6650247  1.5831072  -0.420  0.67443
## Purpose.Business                       -0.9463089  0.9635263  -0.982  0.32604
## Purpose.Other                                  NA         NA      NA       NA
## SavingsAccountBonds.lt.100             -0.9640859  0.3172524  -3.039  0.00237
## SavingsAccountBonds.100.to.500         -0.6702615  0.4129080  -1.623  0.10453
## SavingsAccountBonds.500.to.1000        -0.0311834  0.6154087  -0.051  0.95959
## SavingsAccountBonds.gt.1000            -0.3865652  0.6310337  -0.613  0.54015
## SavingsAccountBonds.Unknown                    NA         NA      NA       NA
## EmploymentDuration.lt.1                -0.1931434  0.5565708  -0.347  0.72857
## EmploymentDuration.1.to.4              -0.1587566  0.5372056  -0.296  0.76759
## EmploymentDuration.4.to.7               0.3891026  0.5671570   0.686  0.49268
## EmploymentDuration.gt.7                -0.0427382  0.5449721  -0.078  0.93749
## EmploymentDuration.Unemployed                  NA         NA      NA       NA
## Personal.Male.Divorced.Seperated       -0.7094751  0.5352264  -1.326  0.18499
## Personal.Female.NotSingle              -0.3539920  0.3752277  -0.943  0.34547
## Personal.Male.Single                    0.1836423  0.3789886   0.485  0.62799
## Personal.Male.Married.Widowed                  NA         NA      NA       NA
## Personal.Female.Single                         NA         NA      NA       NA
## OtherDebtorsGuarantors.None            -0.7390022  0.4889566  -1.511  0.13069
## OtherDebtorsGuarantors.CoApplicant     -1.2549232  0.7077516  -1.773  0.07621
## OtherDebtorsGuarantors.Guarantor               NA         NA      NA       NA
## Property.RealEstate                     0.9372156  0.4929071   1.901  0.05725
## Property.Insurance                      0.6193592  0.4725768   1.311  0.18999
## Property.CarOther                       0.5877418  0.4617889   1.273  0.20311
## Property.Unknown                               NA         NA      NA       NA
## OtherInstallmentPlans.Bank             -0.5171535  0.2875924  -1.798  0.07214
## OtherInstallmentPlans.Stores           -0.3154909  0.5294274  -0.596  0.55124
## OtherInstallmentPlans.None                     NA         NA      NA       NA
## Housing.Rent                           -0.8147496  0.5572708  -1.462  0.14373
## Housing.Own                            -0.1218148  0.5223979  -0.233  0.81562
## Housing.ForFree                                NA         NA      NA       NA
## Job.UnemployedUnskilled                 0.7152327  0.8860470   0.807  0.41954
## Job.UnskilledResident                   0.0667202  0.4369190   0.153  0.87863
## Job.SkilledEmployee                    -0.0661251  0.3611982  -0.183  0.85474
## Job.Management.SelfEmp.HighlyQualified         NA         NA      NA       NA
##                                           
## (Intercept)                            ***
## Duration                                  
## Amount                                 ** 
## InstallmentRatePercentage              *  
## ResidenceDuration                         
## Age                                       
## NumberExistingCredits                     
## NumberPeopleMaintenance                   
## Telephone                              .  
## ForeignWorker                          *  
## CheckingAccountStatus.lt.0             ***
## CheckingAccountStatus.0.to.200         ***
## CheckingAccountStatus.gt.200              
## CheckingAccountStatus.none                
## CreditHistory.NoCredit.AllPaid         ** 
## CreditHistory.ThisBank.AllPaid         ** 
## CreditHistory.PaidDuly                 *  
## CreditHistory.Delay                       
## CreditHistory.Critical                    
## Purpose.NewCar                            
## Purpose.UsedCar                           
## Purpose.Furniture.Equipment               
## Purpose.Radio.Television                  
## Purpose.DomesticAppliance                 
## Purpose.Repairs                        .  
## Purpose.Education                         
## Purpose.Vacation                          
## Purpose.Retraining                        
## Purpose.Business                          
## Purpose.Other                             
## SavingsAccountBonds.lt.100             ** 
## SavingsAccountBonds.100.to.500            
## SavingsAccountBonds.500.to.1000           
## SavingsAccountBonds.gt.1000               
## SavingsAccountBonds.Unknown               
## EmploymentDuration.lt.1                   
## EmploymentDuration.1.to.4                 
## EmploymentDuration.4.to.7                 
## EmploymentDuration.gt.7                   
## EmploymentDuration.Unemployed             
## Personal.Male.Divorced.Seperated          
## Personal.Female.NotSingle                 
## Personal.Male.Single                      
## Personal.Male.Married.Widowed             
## Personal.Female.Single                    
## OtherDebtorsGuarantors.None               
## OtherDebtorsGuarantors.CoApplicant     .  
## OtherDebtorsGuarantors.Guarantor          
## Property.RealEstate                    .  
## Property.Insurance                        
## Property.CarOther                         
## Property.Unknown                          
## OtherInstallmentPlans.Bank             .  
## OtherInstallmentPlans.Stores              
## OtherInstallmentPlans.None                
## Housing.Rent                              
## Housing.Own                               
## Housing.ForFree                           
## Job.UnemployedUnskilled                   
## Job.UnskilledResident                     
## Job.SkilledEmployee                       
## Job.Management.SelfEmp.HighlyQualified    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 856.90  on 699  degrees of freedom
## Residual deviance: 616.15  on 651  degrees of freedom
## AIC: 714.15
## 
## Number of Fisher Scoring iterations: 5

Your observation: Thirteen coefficients are undefined because of the singularities discussed above, and more of the remaining coefficients have non-significant p-values than significant ones; the clearest effects are the checking-account status, credit-history, savings, and Amount terms. The estimates are on the log-odds scale: for instance, the -2.08 coefficient on CheckingAccountStatus.lt.0 means that, holding everything else fixed, a checking balance below 0 DM lowers the log-odds of being a Good credit by 2.08 relative to the dropped reference level (CheckingAccountStatus.none).
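
Exponentiating a coefficient turns it into an odds ratio, which is often easier to interpret (assuming the model object fitted above):

#odds ratio: exp(-2.08) is about 0.12, i.e. the odds of being a Good
#credit are roughly 88% lower than for the reference checking status
exp(coef(model))["CheckingAccountStatus.lt.0"]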

Task 3: Optimal Probability Cut-off, with weight0 = 1 and weight1 = 1.

1. Use the training set to predict probabilities.

To evaluate the prediction performance on the training data, we first obtain the predicted probabilities and the corresponding 0/1 predictions at the default 0.5 cut-off:

pred_prob = predict(model, train_data, type = "response")
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
pred_value = 1*(pred_prob > 0.5)   #classify at the default 0.5 cut-off
head(cbind(train_data, pred_prob, pred_value))   #preview a few rows rather than printing all 700

Now, we plot everything in one figure.

plot(train_data$Duration, train_data$Class)                 #observed outcomes vs. Duration
points(train_data$Duration, train_data$Class, pch=20)
points(train_data$Duration, pred_prob, type="h", lty=3)     #predicted probabilities as vertical spikes
abline(h=0.5, col="grey")                                   #default cut-off
points(train_data$Duration, pred_value, pch=1, cex=1.3)     #resulting 0/1 predictions

2. Find the optimal probability cut-off point using the MR (misclassification rate) or equivalently the equal-weight cost.

#Confusion matrix
actual_value=train_data$Class

pred_value = 1*(pred_prob>0.5)

confusion_matrix=table(actual_value,pred_value)
confusion_matrix
##             pred_value
## actual_value   0   1
##        FALSE 116  95
##        TRUE   53 436
#misclassification error rate
misclassification_error_rate=1-sum(diag(confusion_matrix))/sum(confusion_matrix)
misclassification_error_rate
## [1] 0.2114286
#define a cost function with input "obs" being the observed response,
#"pred.p" being the predicted probability, and "pcut" being the threshold
costfunc = function(obs, pred.p, pcut){
    weight1 = 1   #weight for "true=1 but pred=0" (false negative)
    weight0 = 1   #weight for "true=0 but pred=1" (false positive)
    c1 = (obs==1)&(pred.p<pcut)    #flags the false negatives
    c0 = (obs==0)&(pred.p>=pcut)   #flags the false positives
    cost = mean(weight1*c1 + weight0*c0)  #weighted misclassification rate
    return(cost)  #R functions must return a value; here, the average cost
} #end of the function

#define a sequence from 0.01 to 1 by 0.01
p.seq = seq(0.01, 1, 0.01) 

mean_error = rep(0, length(p.seq))  
for(i in 1:length(p.seq)){ 
    mean_error[i] = costfunc(obs = train_data$Class, pred.p = pred_prob, pcut = p.seq[i])  
} #end of the loop

#plot each cut-off against its associated cost
plot(p.seq, mean_error, xlab = "probability cut-off", ylab = "misclassification rate")

#when p-cut is 0.5, we get the previous mean error:
mean_error[which(p.seq==0.5)]
## [1] 0.2114286
#What's the best p-cut and lowest mean error we can get?
optimal.pcut = p.seq[which(mean_error==min(mean_error))]
optimal.pcut
## [1] 0.44
min(mean_error)
## [1] 0.1971429

Your observation: At the default 0.5 cut-off, the misclassification rate is 0.2114. Searching over the grid of cut-offs, the minimum misclassification rate of 0.1971 is attained at a cut-off of 0.44. So 0.44 is the optimal cut-off point itself, not an error rate; using it lowers the MR from 0.2114 to 0.1971.
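
To make the optimum easier to see, the cost curve can be redrawn with the minimizer marked (optional; it assumes the objects from the chunk above):

#redraw the cost curve and mark the optimal cut-off
plot(p.seq, mean_error, xlab = "probability cut-off", ylab = "misclassification rate")
abline(v = optimal.pcut, lty = 2)
points(optimal.pcut, min(mean_error), pch = 19)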

Task 4: Model Evaluation

1. Using the optimal probability cut-off point obtained in 3.2, generate the confusion matrix and obtain the MR for the training set.

pred_value = 1*(pred_prob > optimal.pcut)   #optimal.pcut is 0.44, from Task 3.2
confusion_matrix=table(actual_value,pred_value)
confusion_matrix
##             pred_value
## actual_value   0   1
##        FALSE 106 105
##        TRUE   33 456
misclassification_error_rate=1-sum(diag(confusion_matrix))/sum(confusion_matrix)
misclassification_error_rate
## [1] 0.1971429

Your observation: At the 0.44 cut-off there are almost equal numbers of true negatives (106) and false positives (105), while true positives (456) far outnumber false negatives (33); lowering the cut-off from 0.5 traded some specificity for sensitivity. The MR is 0.1971, meaning roughly 19.7% of the training observations are misclassified, which I would still consider fairly high.
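
The asymmetry is clearer when expressed as sensitivity and specificity, which can be read off the confusion matrix (a sketch; it assumes the table above, with rows = actual and columns = predicted):

#sensitivity: share of actual Good (TRUE) predicted 1; specificity: share of actual Bad (FALSE) predicted 0
sensitivity <- confusion_matrix["TRUE", "1"] / sum(confusion_matrix["TRUE", ])
specificity <- confusion_matrix["FALSE", "0"] / sum(confusion_matrix["FALSE", ])
c(sensitivity = sensitivity, specificity = specificity)   #about 0.93 and 0.50 here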

2. Using the optimal probability cut-off point obtained in 3.2, generate the ROC curve and calculate the AUC for the training set.

library(ROCR)
pred <- prediction(pred_prob, train_data$Class)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)

#Get the AUC
auc_train <- unlist(slot(performance(pred, "auc"), "y.values"))
auc_train
## [1] 0.8424583

Your observation: The ROC curve rises steeply toward the top-left corner, reaching high true positive rates at relatively low false positive rates. Since the AUC is 0.84, where 1 represents a perfect classifier and 0.5 random guessing, the model maintains good discrimination across different threshold settings.
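
Since ROCR stores the cut-off behind each (FPR, TPR) pair, the point corresponding to the 0.44 cut-off can be marked on the curve (a sketch; it assumes the perf object from the chunk above):

#find the curve point whose cut-off is closest to the optimum and mark it
cutoffs <- unlist(perf@alpha.values)
i <- which.min(abs(cutoffs - optimal.pcut))
points(unlist(perf@x.values)[i], unlist(perf@y.values)[i], pch = 19)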

3. Using the same cut-off point, generate confusion matrix and obtain MR for the test set.

#Prediction performance on the test set, using the model fitted on the training set
#(refitting a new model on the test set would defeat the purpose of a hold-out evaluation)
pred_prob2 = predict(model, test_data, type = "response")
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
pred_value2 = 1*(pred_prob2 > optimal.pcut)   #same 0.44 cut-off as in Task 3.2
head(cbind(test_data, pred_prob2, pred_value2))   #preview a few rows

#Confusion matrix
actual_value2 = test_data$Class
confusion_matrix2=table(actual_value2,pred_value2)
confusion_matrix2

#MR
misclassification_error_rate2=1-sum(diag(confusion_matrix2))/sum(confusion_matrix2)
misclassification_error_rate2

Your observation: In the test-set confusion matrix there are more true negatives than false positives, and more true positives than false negatives. The MR is 0.1466, meaning approximately 14.7% of the test observations are misclassified. That is still a fairly high MR, but notably lower than the training set’s.

4. Using the same cut-off point, generate the ROC curve and calculate the AUC for the test set.

library(ROCR)
pred2 <- prediction(pred_prob2, test_data$Class)
perf2 <- performance(pred2, "tpr", "fpr")
plot(perf2, colorize=TRUE)

#Get the AUC
auc_test <- unlist(slot(performance(pred2, "auc"), "y.values"))
auc_test
## [1] 0.901965

Your observation: As with the training set, the ROC curve bends toward the top-left corner. The AUC is now 0.90, even closer to the perfect value of 1, which suggests the model’s ranking of applicants holds up on the test data.

Task 5: Using different weights

Now, let’s assume “It is worse to class a customer as good when they are bad (weight = 5), than it is to class a customer as bad when they are good (weight = 1).” Please figure out which weight should be 5 and which should be 1, then define your cost function accordingly! Since Class is coded as TRUE (1) = Good, classing a bad customer as good means true = 0 but predicted = 1, a false positive, so weight0 = 5; classing a good customer as bad is a false negative, so weight1 = 1.

1. Obtain optimal probability cut-off point again, with the new weights.

#for reference: the unweighted confusion matrix on the training set at the 0.5 cut-off

actual_value=train_data$Class

pred_value = 1*(pred_prob>0.5)

confusion_matrix=table(actual_value,pred_value)
confusion_matrix
##             pred_value
## actual_value   0   1
##        FALSE 116  95
##        TRUE   53 436
#misclassification error rate
misclassification_error_rate=1-sum(diag(confusion_matrix))/sum(confusion_matrix)
misclassification_error_rate
## [1] 0.2114286
#redefine the cost function with the new weights: "obs" is the observed response,
#"pred.p" the predicted probability, and "pcut" the threshold
costfunc = function(obs, pred.p, pcut){
    weight1again = 1   #weight for "true=1 but pred=0" (good customer classed as bad)
    weight0again = 5   #weight for "true=0 but pred=1" (bad customer classed as good, the costlier error)
    c1again = (obs==1)&(pred.p<pcut)    #flags the false negatives
    c0again = (obs==0)&(pred.p>=pcut)   #flags the false positives
    cost2 = mean(weight1again*c1again + weight0again*c0again)  #weighted misclassification cost
    return(cost2)  #return the average weighted cost
} #end of the function

#define a sequence from 0.01 to 1 by 0.01
p.seq2 = seq(0.01, 1, 0.01) 

mean_error2 = rep(0, length(p.seq2))  
for(i in 1:length(p.seq2)){ 
    mean_error2[i] = costfunc(obs = train_data$Class, pred.p = pred_prob, pcut = p.seq2[i])  #tune the cut-off on the training set
} #end of the loop

#plot each cut-off against its weighted cost
plot(p.seq2, mean_error2, xlab = "probability cut-off", ylab = "weighted cost")

#the weighted cost at the default 0.5 cut-off, for comparison:
mean_error2[which(p.seq2==0.5)]
## [1] 0.3033333
#What's the best p-cut and lowest mean error we can get?
optimal.pcut2 = p.seq2[which(mean_error2==min(mean_error2))]
optimal.pcut2
## [1] 0.16 0.17
min(mean_error2)
## [1] 0.2166667

Your observation: With the new weights, the grid search returns two optimal cut-off points, 0.16 and 0.17, which tie at the minimum weighted cost of 0.2167.

2. Obtain the confusion matrix and MR for the training set.

#cut-off point at 0.16

#confusion matrix
pred_value = 1*(pred_prob>0.16)
confusion_matrix=table(actual_value,pred_value)
confusion_matrix
##             pred_value
## actual_value   0   1
##        FALSE  27 184
##        TRUE    4 485
#MR
misclassification_error_rate=1-sum(diag(confusion_matrix))/sum(confusion_matrix)
misclassification_error_rate
## [1] 0.2685714
#cut-off point at 0.17

#confusion matrix
pred_value = 1*(pred_prob>0.17)
confusion_matrix=table(actual_value,pred_value)
confusion_matrix
##             pred_value
## actual_value   0   1
##        FALSE  29 182
##        TRUE    6 483
#MR
misclassification_error_rate=1-sum(diag(confusion_matrix))/sum(confusion_matrix)
misclassification_error_rate
## [1] 0.2685714

Your observation: At the 0.16 cut-off there are far more false positives (184) than true negatives (27), and far more true positives (485) than false negatives (4); such a low cut-off makes the model predict “Good” very liberally. The MR is 0.2686, meaning approximately 26.9% of the training observations are misclassified. The 0.17 cut-off gives a slightly different table (29 true negatives, 182 false positives, 6 false negatives, 483 true positives) but the same total of 188 errors, and hence exactly the same MR of 0.2686.

3. Obtain the confusion matrix and MR for the test set.

#cut-off point at 0.16

pred_value2 = 1*(pred_prob2>0.16)
confusion_matrix2=table(actual_value2,pred_value2)
confusion_matrix2
##              pred_value2
## actual_value2   0   1
##         FALSE  29  60
##         TRUE    1 210
misclassification_error_rate2=1-sum(diag(confusion_matrix2))/sum(confusion_matrix2)
misclassification_error_rate2
## [1] 0.2033333
#cut-off point at 0.17

pred_value2 = 1*(pred_prob2>0.17)
confusion_matrix2=table(actual_value2,pred_value2)
confusion_matrix2
##              pred_value2
## actual_value2   0   1
##         FALSE  29  60
##         TRUE    1 210
misclassification_error_rate2=1-sum(diag(confusion_matrix2))/sum(confusion_matrix2)
misclassification_error_rate2
## [1] 0.2033333

Your observation: At the 0.16 cut-off there are again more false positives (60) than true negatives (29), and far more true positives (210) than false negatives (1). The MR is 0.2033, i.e. about 20.3% of the test observations are misclassified. The 0.17 cut-off produces an identical confusion matrix and MR, which simply means no predicted probability in the test set falls between 0.16 and 0.17, so nudging the threshold within that gap reclassifies nothing.
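
That explanation is easy to check directly (a one-line sketch, assuming pred_prob2 from above):

#count test-set probabilities between the two cut-offs; 0 explains the identical tables
sum(pred_prob2 > 0.16 & pred_prob2 <= 0.17)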

Task 6: Report

Summarize your findings, including the optimal probability cut-off, MR, and AUC (if calculated) for both in-sample and out-of-sample data. Discuss what you observed and make some suggestions on how we can improve the model.

With equal weights, the optimal probability cut-off was 0.44; the MR was 0.197 in-sample and fell to 0.147 out-of-sample, while the AUC was 0.84 in-sample and rose to 0.90 out-of-sample. Changing the weights to 5 and 1 changed the picture for both sets: in my run the optimal cut-off moved from 0.44 to 0.16 and 0.17, and under those cut-offs the MR was 0.269 on the training set and 0.203 on the test set, the price of weighting one error type more heavily. It is somewhat unusual for a model to score better out of sample than in sample; with a single 70/30 split this can happen by chance, so the gap should not be over-interpreted. To improve the model, rather than choosing between in-sample and out-of-sample data, I would first remove the redundant dummy columns that caused the NA coefficients, and then try variable selection (for example, keeping the clearly significant predictors such as checking-account status, credit history, savings, and Amount) or regularization, comparing candidates by AIC or by cross-validated cost.
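
One concrete starting point (a sketch only; it uses caret’s findLinearCombos and assumes the training data from Task 1, that the function keeps the leading intercept column, and that the flagged columns match this sample):

#find and drop the linearly dependent dummy columns before refitting
library(caret)
X <- cbind(Intercept = 1, as.matrix(train_data[, setdiff(names(train_data), "Class")]))
drop_cols <- colnames(X)[findLinearCombos(X)$remove]   #columns flagged as redundant given the intercept
model_reduced <- glm(Class ~ ., family = "binomial",
                     data = train_data[, !(names(train_data) %in% drop_cols)])
summary(model_reduced)   #essentially the same fit, ideally with no NA coefficients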