Starter code for German credit scoring

Refer to http://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data) for the variable descriptions. The response variable is Class and all others are predictors.

Run the following code only once, to install the package caret. The German credit scoring data is provided in that package.

if (!require("caret", quietly = TRUE)) {
    install.packages("caret")
}

Task 1: Data Preparation

1. Load the caret package and the GermanCredit dataset.

library(caret) # this package contains the German credit data in its dummy-coded numeric format
data(GermanCredit)
GermanCredit$Class <- GermanCredit$Class == "Good" # convert Class to TRUE/FALSE (equivalent to 1/0), with TRUE meaning a good credit
str(GermanCredit)
## 'data.frame':    1000 obs. of  62 variables:
##  $ Duration                              : int  6 48 12 42 24 36 24 36 12 30 ...
##  $ Amount                                : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
##  $ InstallmentRatePercentage             : int  4 2 2 2 3 2 3 2 2 4 ...
##  $ ResidenceDuration                     : int  4 2 3 4 4 4 4 2 4 2 ...
##  $ Age                                   : int  67 22 49 45 53 35 53 35 61 28 ...
##  $ NumberExistingCredits                 : int  2 1 1 1 2 1 1 1 1 2 ...
##  $ NumberPeopleMaintenance               : int  1 1 2 2 2 2 1 1 1 1 ...
##  $ Telephone                             : num  0 1 1 1 1 0 1 0 1 1 ...
##  $ ForeignWorker                         : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Class                                 : logi  TRUE FALSE TRUE TRUE FALSE TRUE ...
##  $ CheckingAccountStatus.lt.0            : num  1 0 0 1 1 0 0 0 0 0 ...
##  $ CheckingAccountStatus.0.to.200        : num  0 1 0 0 0 0 0 1 0 1 ...
##  $ CheckingAccountStatus.gt.200          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CheckingAccountStatus.none            : num  0 0 1 0 0 1 1 0 1 0 ...
##  $ CreditHistory.NoCredit.AllPaid        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CreditHistory.ThisBank.AllPaid        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CreditHistory.PaidDuly                : num  0 1 0 1 0 1 1 1 1 0 ...
##  $ CreditHistory.Delay                   : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ CreditHistory.Critical                : num  1 0 1 0 0 0 0 0 0 1 ...
##  $ Purpose.NewCar                        : num  0 0 0 0 1 0 0 0 0 1 ...
##  $ Purpose.UsedCar                       : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ Purpose.Furniture.Equipment           : num  0 0 0 1 0 0 1 0 0 0 ...
##  $ Purpose.Radio.Television              : num  1 1 0 0 0 0 0 0 1 0 ...
##  $ Purpose.DomesticAppliance             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Repairs                       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Education                     : num  0 0 1 0 0 1 0 0 0 0 ...
##  $ Purpose.Vacation                      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Retraining                    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Business                      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Other                         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SavingsAccountBonds.lt.100            : num  0 1 1 1 1 0 0 1 0 1 ...
##  $ SavingsAccountBonds.100.to.500        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SavingsAccountBonds.500.to.1000       : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ SavingsAccountBonds.gt.1000           : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ SavingsAccountBonds.Unknown           : num  1 0 0 0 0 1 0 0 0 0 ...
##  $ EmploymentDuration.lt.1               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ EmploymentDuration.1.to.4             : num  0 1 0 0 1 1 0 1 0 0 ...
##  $ EmploymentDuration.4.to.7             : num  0 0 1 1 0 0 0 0 1 0 ...
##  $ EmploymentDuration.gt.7               : num  1 0 0 0 0 0 1 0 0 0 ...
##  $ EmploymentDuration.Unemployed         : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ Personal.Male.Divorced.Seperated      : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ Personal.Female.NotSingle             : num  0 1 0 0 0 0 0 0 0 0 ...
##  $ Personal.Male.Single                  : num  1 0 1 1 1 1 1 1 0 0 ...
##  $ Personal.Male.Married.Widowed         : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ Personal.Female.Single                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherDebtorsGuarantors.None           : num  1 1 1 0 1 1 1 1 1 1 ...
##  $ OtherDebtorsGuarantors.CoApplicant    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherDebtorsGuarantors.Guarantor      : num  0 0 0 1 0 0 0 0 0 0 ...
##  $ Property.RealEstate                   : num  1 1 1 0 0 0 0 0 1 0 ...
##  $ Property.Insurance                    : num  0 0 0 1 0 0 1 0 0 0 ...
##  $ Property.CarOther                     : num  0 0 0 0 0 0 0 1 0 1 ...
##  $ Property.Unknown                      : num  0 0 0 0 1 1 0 0 0 0 ...
##  $ OtherInstallmentPlans.Bank            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherInstallmentPlans.Stores          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherInstallmentPlans.None            : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Housing.Rent                          : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ Housing.Own                           : num  1 1 1 0 0 0 1 0 1 1 ...
##  $ Housing.ForFree                       : num  0 0 0 1 1 1 0 0 0 0 ...
##  $ Job.UnemployedUnskilled               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Job.UnskilledResident                 : num  0 0 1 0 0 1 0 0 1 0 ...
##  $ Job.SkilledEmployee                   : num  1 1 0 1 1 0 1 0 0 0 ...
##  $ Job.Management.SelfEmp.HighlyQualified: num  0 0 0 0 0 0 0 1 0 1 ...

Your observation:

# Optional: drop dummy variables that provide no additional information (constant or redundant columns)
GermanCredit = GermanCredit[,-c(14,19,27,30,35,40,44,45,48,52,55,58,62)] # don't run this line twice!! Think about why.
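Why it must not be run twice: the indices refer to positions in the original 62-column data frame, so a second run would remove a different, still informative, set of columns. A re-runnable alternative is to drop the same columns by name (names read off the two str() listings in this document):

drop_cols <- c("CheckingAccountStatus.none", "CreditHistory.Critical",
               "Purpose.Vacation", "Purpose.Other", "SavingsAccountBonds.Unknown",
               "EmploymentDuration.Unemployed", "Personal.Male.Married.Widowed",
               "Personal.Female.Single", "OtherDebtorsGuarantors.Guarantor",
               "Property.Unknown", "OtherInstallmentPlans.None",
               "Housing.ForFree", "Job.Management.SelfEmp.HighlyQualified")
GermanCredit <- GermanCredit[, setdiff(names(GermanCredit), drop_cols)] # safe to run repeatedly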

2. Explore the dataset to understand its structure.

(1) How many observations and variables are there? (2 pts)
# your code here:
str(GermanCredit)
## 'data.frame':    1000 obs. of  49 variables:
##  $ Duration                          : int  6 48 12 42 24 36 24 36 12 30 ...
##  $ Amount                            : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
##  $ InstallmentRatePercentage         : int  4 2 2 2 3 2 3 2 2 4 ...
##  $ ResidenceDuration                 : int  4 2 3 4 4 4 4 2 4 2 ...
##  $ Age                               : int  67 22 49 45 53 35 53 35 61 28 ...
##  $ NumberExistingCredits             : int  2 1 1 1 2 1 1 1 1 2 ...
##  $ NumberPeopleMaintenance           : int  1 1 2 2 2 2 1 1 1 1 ...
##  $ Telephone                         : num  0 1 1 1 1 0 1 0 1 1 ...
##  $ ForeignWorker                     : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Class                             : logi  TRUE FALSE TRUE TRUE FALSE TRUE ...
##  $ CheckingAccountStatus.lt.0        : num  1 0 0 1 1 0 0 0 0 0 ...
##  $ CheckingAccountStatus.0.to.200    : num  0 1 0 0 0 0 0 1 0 1 ...
##  $ CheckingAccountStatus.gt.200      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CreditHistory.NoCredit.AllPaid    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CreditHistory.ThisBank.AllPaid    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CreditHistory.PaidDuly            : num  0 1 0 1 0 1 1 1 1 0 ...
##  $ CreditHistory.Delay               : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ Purpose.NewCar                    : num  0 0 0 0 1 0 0 0 0 1 ...
##  $ Purpose.UsedCar                   : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ Purpose.Furniture.Equipment       : num  0 0 0 1 0 0 1 0 0 0 ...
##  $ Purpose.Radio.Television          : num  1 1 0 0 0 0 0 0 1 0 ...
##  $ Purpose.DomesticAppliance         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Repairs                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Education                 : num  0 0 1 0 0 1 0 0 0 0 ...
##  $ Purpose.Retraining                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Business                  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SavingsAccountBonds.lt.100        : num  0 1 1 1 1 0 0 1 0 1 ...
##  $ SavingsAccountBonds.100.to.500    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SavingsAccountBonds.500.to.1000   : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ SavingsAccountBonds.gt.1000       : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ EmploymentDuration.lt.1           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ EmploymentDuration.1.to.4         : num  0 1 0 0 1 1 0 1 0 0 ...
##  $ EmploymentDuration.4.to.7         : num  0 0 1 1 0 0 0 0 1 0 ...
##  $ EmploymentDuration.gt.7           : num  1 0 0 0 0 0 1 0 0 0 ...
##  $ Personal.Male.Divorced.Seperated  : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ Personal.Female.NotSingle         : num  0 1 0 0 0 0 0 0 0 0 ...
##  $ Personal.Male.Single              : num  1 0 1 1 1 1 1 1 0 0 ...
##  $ OtherDebtorsGuarantors.None       : num  1 1 1 0 1 1 1 1 1 1 ...
##  $ OtherDebtorsGuarantors.CoApplicant: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Property.RealEstate               : num  1 1 1 0 0 0 0 0 1 0 ...
##  $ Property.Insurance                : num  0 0 0 1 0 0 1 0 0 0 ...
##  $ Property.CarOther                 : num  0 0 0 0 0 0 0 1 0 1 ...
##  $ OtherInstallmentPlans.Bank        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherInstallmentPlans.Stores      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Housing.Rent                      : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ Housing.Own                       : num  1 1 1 0 0 0 1 0 1 1 ...
##  $ Job.UnemployedUnskilled           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Job.UnskilledResident             : num  0 0 1 0 0 1 0 0 1 0 ...
##  $ Job.SkilledEmployee               : num  1 1 0 1 1 0 1 0 0 0 ...

Your observation: There are 1000 observations and 49 variables (62 before the optional column drop).

(2) Please make a frequency table of the variable Class (use the table() function). How many observations are classified as “good” and how many as “bad”? (2 pts)
# your code here:
table(GermanCredit$Class) # frequency table of the response; TRUE = good, FALSE = bad
## 
## FALSE  TRUE 
##   300   700

Your observation:

The frequency table shows 700 observations classified as good (TRUE) and 300 classified as bad (FALSE), so roughly 70% of the applicants are good credits.

(3) Please make a barplot of the response variable Class. Please add a title and axis labels. (2 pts)
# your code here:
barplot(table(GermanCredit$Class),
        names.arg = c("Bad (FALSE)", "Good (TRUE)"),
        main = "Distribution of Credit Class",
        xlab = "Class", ylab = "Number of observations")
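An equivalent plot can be drawn with ggplot2 (assuming ggplot2 is installed; this is an optional alternative, not part of the graded output):

library(ggplot2)
ggplot(GermanCredit, aes(x = Class)) +
  geom_bar(fill = "steelblue") +
  labs(title = "Distribution of Credit Class",
       x = "Class (TRUE = good, FALSE = bad)",
       y = "Number of observations")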

3. Split the dataset into a training set and a test set. A random seed of 2025 is set for reproducibility. Please comment on the split proportion you chose for the training and testing data. (2 pts)

set.seed(2025) # set random seed for reproducibility.
# your code here:
train_ind <- sample(1:nrow(GermanCredit), 0.8 * nrow(GermanCredit))
German_train <- GermanCredit[train_ind, ]
German_test <- GermanCredit[-train_ind, ]

dim(German_train)
## [1] 800  49
dim(German_test)
## [1] 200  49

Your comment:

An 80/20 split was used: German_train holds 800 of the 1000 observations and German_test the remaining 200. Both sets retain all 49 variables.
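As an aside, caret's createDataPartition() can draw a stratified sample so that the Good/Bad proportion is preserved in both sets. A minimal sketch with the same 80/20 proportion (the results in the rest of this document use the simple random split above):

set.seed(2025)
strat_ind <- createDataPartition(factor(GermanCredit$Class), p = 0.8, list = FALSE) # stratified on Class
German_train_strat <- GermanCredit[strat_ind, ]
German_test_strat <- GermanCredit[-strat_ind, ]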

Task 2: Model Fitting

1. Fit a logistic regression model using the training set. Please use all variables, but make sure the variable types are right. (2 pts)

# your code here:
colnames(GermanCredit)
##  [1] "Duration"                           "Amount"                            
##  [3] "InstallmentRatePercentage"          "ResidenceDuration"                 
##  [5] "Age"                                "NumberExistingCredits"             
##  [7] "NumberPeopleMaintenance"            "Telephone"                         
##  [9] "ForeignWorker"                      "Class"                             
## [11] "CheckingAccountStatus.lt.0"         "CheckingAccountStatus.0.to.200"    
## [13] "CheckingAccountStatus.gt.200"       "CreditHistory.NoCredit.AllPaid"    
## [15] "CreditHistory.ThisBank.AllPaid"     "CreditHistory.PaidDuly"            
## [17] "CreditHistory.Delay"                "Purpose.NewCar"                    
## [19] "Purpose.UsedCar"                    "Purpose.Furniture.Equipment"       
## [21] "Purpose.Radio.Television"           "Purpose.DomesticAppliance"         
## [23] "Purpose.Repairs"                    "Purpose.Education"                 
## [25] "Purpose.Retraining"                 "Purpose.Business"                  
## [27] "SavingsAccountBonds.lt.100"         "SavingsAccountBonds.100.to.500"    
## [29] "SavingsAccountBonds.500.to.1000"    "SavingsAccountBonds.gt.1000"       
## [31] "EmploymentDuration.lt.1"            "EmploymentDuration.1.to.4"         
## [33] "EmploymentDuration.4.to.7"          "EmploymentDuration.gt.7"           
## [35] "Personal.Male.Divorced.Seperated"   "Personal.Female.NotSingle"         
## [37] "Personal.Male.Single"               "OtherDebtorsGuarantors.None"       
## [39] "OtherDebtorsGuarantors.CoApplicant" "Property.RealEstate"               
## [41] "Property.Insurance"                 "Property.CarOther"                 
## [43] "OtherInstallmentPlans.Bank"         "OtherInstallmentPlans.Stores"      
## [45] "Housing.Rent"                       "Housing.Own"                       
## [47] "Job.UnemployedUnskilled"            "Job.UnskilledResident"             
## [49] "Job.SkilledEmployee"
model <- glm(Class ~ Duration + Amount + InstallmentRatePercentage + ResidenceDuration + Age + NumberExistingCredits + NumberPeopleMaintenance,
             data = German_train, family = "binomial") # logistic regression on the seven integer-valued predictors
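The task asks for a model with all variables. A minimal sketch of that full fit is shown below (model_full is a hypothetical name; the summary and evaluation results in the rest of this document come from the seven-variable model above and would change under the full fit):

model_full <- glm(Class ~ ., data = German_train, family = "binomial") # all 48 predictors
summary(model_full)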

2. Summarize the model and interpret the coefficients. What is the estimated coefficient for the variable InstallmentRatePercentage? Is it significant, and why? (2 pts)

# your code here:
summary(model)
## 
## Call:
## glm(formula = Class ~ Duration + Amount + InstallmentRatePercentage + 
##     ResidenceDuration + Age + NumberExistingCredits + NumberPeopleMaintenance, 
##     family = "binomial", data = German_train)
## 
## Coefficients:
##                             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                1.880e+00  4.824e-01   3.897 9.73e-05 ***
## Duration                  -2.651e-02  8.394e-03  -3.158  0.00159 ** 
## Amount                    -5.927e-05  3.658e-05  -1.620  0.10516    
## InstallmentRatePercentage -1.550e-01  7.882e-02  -1.967  0.04918 *  
## ResidenceDuration         -9.040e-02  7.359e-02  -1.228  0.21931    
## Age                        1.820e-02  7.720e-03   2.358  0.01837 *  
## NumberExistingCredits      7.234e-02  1.432e-01   0.505  0.61349    
## NumberPeopleMaintenance   -2.504e-01  2.190e-01  -1.143  0.25285    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 987.34  on 799  degrees of freedom
## Residual deviance: 942.10  on 792  degrees of freedom
## AIC: 958.1
## 
## Number of Fisher Scoring iterations: 4
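To pull just that coefficient row out of the summary programmatically (a small convenience, assuming the model object fitted above):

summary(model)$coefficients["InstallmentRatePercentage", ] # estimate, std. error, z value, p-value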

Your comment:

The estimated coefficient for InstallmentRatePercentage is -0.155 (std. error 0.0788, z = -1.967, p-value = 0.0492). Because the p-value is below 0.05, the coefficient is statistically significant at the 5% level: holding the other predictors fixed, a higher installment rate (as a percentage of disposable income) is associated with lower log-odds of being a good credit.

3. Please interpret this number in detail (please calculate the corresponding odds ratio, and interpret it). (2 pts)

# you might need some code for calculation:
exp(coef(model)["InstallmentRatePercentage"]) # odds ratio; roughly exp(-0.155) = 0.856

Your comment:

The odds ratio is exp(-0.155) ≈ 0.856. Holding the other predictors constant, each one-unit increase in InstallmentRatePercentage multiplies the odds of being a good credit by about 0.86, i.e., roughly a 14% decrease in the odds of a good rating.

Task 3: Model Evaluation (Part I)

1. Use the training set to obtain predicted probabilities. (2 pts)

# your code here:
library(ROCR)
pred_Ger <- predict(model, newdata = German_train, type = "response")
summary(pred_Ger)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2624  0.6333  0.7169  0.6925  0.7722  0.8998

2. Using the probability cut-off of 0.5, generate the confusion matrix and obtain the MR (misclassification rate) for the training set. (3 pts)

# your code here:
pcut <- 0.5

pred_value <- 1 * (pred_Ger > pcut)
table(pred_value) # distribution of predicted classes at the 0.5 cut-off
## pred_value
##   0   1 
##  56 744
act_value <- German_train$Class

table(act_value)
## act_value
## FALSE  TRUE 
##   246   554
confusion_mat <- table(actual = act_value, predict = pred_value)
confusion_mat
##        predict
## actual    0   1
##   FALSE  36 210
##   TRUE   20 534
MR <- (sum(confusion_mat) - sum(diag(confusion_mat))) / sum(confusion_mat) # misclassified / total
MR
## [1] 0.2875

Your comment:

With the 0.5 cut-off, the training misclassification rate is (210 + 20)/800 = 0.2875. Most of the errors are bad credits predicted as good (210 of the 230 misclassifications).

3. Find the optimal probability cut-off point using the MR. Please draw a plot of MR vs. cut-off probability, and comment on the optimal cut-off probability. (3 pts)

# your code here:
cutoffs <- seq(0.05, 0.95, by = 0.01) # grid of candidate cut-off probabilities
MR_vec <- sapply(cutoffs, function(p) {
  pred_class <- as.numeric(pred_Ger > p)
  mean(pred_class != German_train$Class) # training MR at this cut-off
})
plot(cutoffs, MR_vec, type = "l",
     xlab = "Cut-off probability", ylab = "Misclassification rate",
     main = "Training MR vs. cut-off probability")
opt_cut <- cutoffs[which.min(MR_vec)] # cut-off with the lowest training MR
opt_cut

Your comment:

The plot shows how the training MR changes with the cut-off probability; the optimal cut-off is the value of opt_cut at which the MR curve reaches its minimum. It should be read off the plot rather than assumed to be 0.5, and with unequal misclassification costs a cost-weighted criterion would typically shift it further.

4. Please generate the ROC curve and calculate the AUC for the training set. Please comment on this AUC. (2 pts)

# your code here:
#install.packages("pROC")
library(pROC)
## Warning: package 'pROC' was built under R version 4.5.2
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
pred_Ger <- predict(model, newdata = German_train, type = "response")
auc(act_value, pred_Ger)
## Setting levels: control = FALSE, case = TRUE
## Setting direction: controls < cases
## Area under the curve: 0.6406
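The task also asks for the ROC curve itself; a minimal sketch with pROC, using the same act_value and pred_Ger objects as above (the plot is not reproduced here):

roc_train <- roc(act_value, pred_Ger) # ROC object for the training set
plot(roc_train, main = "ROC Curve - Training Set")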

Your comment:

The training AUC is 0.6406. This is only modestly better than a random classifier (AUC = 0.5), which is unsurprising given that the model uses just seven of the 48 available predictors.

Task 4: Model Evaluation (Part II)

1. Use the testing set to obtain predicted probabilities. (2 pts)

# your code here:
pred_German <- predict(model, newdata = German_test, type = "response")
summary(pred_German)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3609  0.6649  0.7269  0.7090  0.7780  0.9214

2. Using the probability cut-off of 0.5, generate the confusion matrix and obtain the MR (misclassification rate) for the test set. (2 pts)

# your code here:
pcut <- 0.5

pred_value <- 1 * (pred_German > pcut)
table(pred_value) # distribution of predicted classes at the 0.5 cut-off
## pred_value
##   0   1 
##   9 191
act_value <- German_test$Class

table(act_value)
## act_value
## FALSE  TRUE 
##    54   146
confusion_mat <- table(actual = act_value, predict = pred_value)
confusion_mat
##        predict
## actual    0   1
##   FALSE   4  50
##   TRUE    5 141
MR <- (sum(confusion_mat) - sum(diag(confusion_mat))) / sum(confusion_mat) # misclassified / total
MR
## [1] 0.275

Your comment:

With the 0.5 cut-off, the test misclassification rate is (50 + 5)/200 = 0.275, slightly lower than the training MR of 0.2875, so the model does not appear to be overfitting.

3. Please generate the ROC curve and calculate the AUC for the test set. Please comment on this AUC. (2 pts)

# your code here:
pred_German <- predict(model, newdata = German_test, type = "response")
auc(act_value, pred_German)
## Setting levels: control = FALSE, case = TRUE
## Setting direction: controls < cases
## Area under the curve: 0.6877

Your comment:

The test AUC is 0.6877, a little higher than the training AUC of 0.6406. The model therefore generalizes to unseen data, but its discriminatory power remains modest; including the remaining predictors would likely improve it.