Starter code for German credit scoring

Refer to http://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data) for variable descriptions. The response variable is Class; all other variables are predictors.

Run the following code only once to install the caret package. The German credit scoring data is provided in that package.

Task 1: Data Preparation

1. Load the caret package and the GermanCredit dataset. (10pts)

library(caret) #this package contains the german data with its numeric format

## Warning: package 'caret' was built under R version 4.4.3

## Loading required package: ggplot2

## Warning: package 'ggplot2' was built under R version 4.4.3

## Loading required package: lattice

data(GermanCredit)
GermanCredit$Class <- as.numeric(GermanCredit$Class == "Good") # convert `Class` to 1 (Good) or 0 (Bad)
GermanCredit$Class <- as.factor(GermanCredit$Class) # make sure `Class` is a factor, as SVM requires a factor response; now 1 is good and 0 is bad
str(GermanCredit)

## 'data.frame':    1000 obs. of  62 variables:
##  $ Duration                              : int  6 48 12 42 24 36 24 36 12 30 ...
##  $ Amount                                : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
##  $ InstallmentRatePercentage             : int  4 2 2 2 3 2 3 2 2 4 ...
##  $ ResidenceDuration                     : int  4 2 3 4 4 4 4 2 4 2 ...
##  $ Age                                   : int  67 22 49 45 53 35 53 35 61 28 ...
##  $ NumberExistingCredits                 : int  2 1 1 1 2 1 1 1 1 2 ...
##  $ NumberPeopleMaintenance               : int  1 1 2 2 2 2 1 1 1 1 ...
##  $ Telephone                             : num  0 1 1 1 1 0 1 0 1 1 ...
##  $ ForeignWorker                         : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Class                                 : Factor w/ 2 levels "0","1": 2 1 2 2 1 2 2 2 2 1 ...
##  $ CheckingAccountStatus.lt.0            : num  1 0 0 1 1 0 0 0 0 0 ...
##  $ CheckingAccountStatus.0.to.200        : num  0 1 0 0 0 0 0 1 0 1 ...
##  $ CheckingAccountStatus.gt.200          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CheckingAccountStatus.none            : num  0 0 1 0 0 1 1 0 1 0 ...
##  $ CreditHistory.NoCredit.AllPaid        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CreditHistory.ThisBank.AllPaid        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CreditHistory.PaidDuly                : num  0 1 0 1 0 1 1 1 1 0 ...
##  $ CreditHistory.Delay                   : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ CreditHistory.Critical                : num  1 0 1 0 0 0 0 0 0 1 ...
##  $ Purpose.NewCar                        : num  0 0 0 0 1 0 0 0 0 1 ...
##  $ Purpose.UsedCar                       : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ Purpose.Furniture.Equipment           : num  0 0 0 1 0 0 1 0 0 0 ...
##  $ Purpose.Radio.Television              : num  1 1 0 0 0 0 0 0 1 0 ...
##  $ Purpose.DomesticAppliance             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Repairs                       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Education                     : num  0 0 1 0 0 1 0 0 0 0 ...
##  $ Purpose.Vacation                      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Retraining                    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Business                      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Other                         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SavingsAccountBonds.lt.100            : num  0 1 1 1 1 0 0 1 0 1 ...
##  $ SavingsAccountBonds.100.to.500        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SavingsAccountBonds.500.to.1000       : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ SavingsAccountBonds.gt.1000           : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ SavingsAccountBonds.Unknown           : num  1 0 0 0 0 1 0 0 0 0 ...
##  $ EmploymentDuration.lt.1               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ EmploymentDuration.1.to.4             : num  0 1 0 0 1 1 0 1 0 0 ...
##  $ EmploymentDuration.4.to.7             : num  0 0 1 1 0 0 0 0 1 0 ...
##  $ EmploymentDuration.gt.7               : num  1 0 0 0 0 0 1 0 0 0 ...
##  $ EmploymentDuration.Unemployed         : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ Personal.Male.Divorced.Seperated      : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ Personal.Female.NotSingle             : num  0 1 0 0 0 0 0 0 0 0 ...
##  $ Personal.Male.Single                  : num  1 0 1 1 1 1 1 1 0 0 ...
##  $ Personal.Male.Married.Widowed         : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ Personal.Female.Single                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherDebtorsGuarantors.None           : num  1 1 1 0 1 1 1 1 1 1 ...
##  $ OtherDebtorsGuarantors.CoApplicant    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherDebtorsGuarantors.Guarantor      : num  0 0 0 1 0 0 0 0 0 0 ...
##  $ Property.RealEstate                   : num  1 1 1 0 0 0 0 0 1 0 ...
##  $ Property.Insurance                    : num  0 0 0 1 0 0 1 0 0 0 ...
##  $ Property.CarOther                     : num  0 0 0 0 0 0 0 1 0 1 ...
##  $ Property.Unknown                      : num  0 0 0 0 1 1 0 0 0 0 ...
##  $ OtherInstallmentPlans.Bank            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherInstallmentPlans.Stores          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherInstallmentPlans.None            : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Housing.Rent                          : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ Housing.Own                           : num  1 1 1 0 0 0 1 0 1 1 ...
##  $ Housing.ForFree                       : num  0 0 0 1 1 1 0 0 0 0 ...
##  $ Job.UnemployedUnskilled               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Job.UnskilledResident                 : num  0 0 1 0 0 1 0 0 1 0 ...
##  $ Job.SkilledEmployee                   : num  1 1 0 1 1 0 1 0 0 0 ...
##  $ Job.Management.SelfEmp.HighlyQualified: num  0 0 0 0 0 0 0 1 0 1 ...

# This code drops variables that provide no information in the data
# Just run it
GermanCredit = GermanCredit[,-c(14,19,27,30,35,40,44,45,48,52,55,58,62)]
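Most of the dropped indices appear to correspond to one 0/1 column per categorical group (for example CheckingAccountStatus.none, CreditHistory.Critical, and Housing.ForFree). The following is a minimal sketch on made-up data, not on GermanCredit itself, of why such a column adds no information once the other levels of the same group are present. The `housing` vector and its values are hypothetical.

```r
# Toy data (hypothetical, not taken from GermanCredit): one categorical variable
housing <- factor(c("Rent", "Own", "ForFree", "Own", "Rent"))

# Full dummy coding: one 0/1 column per level (no intercept)
dummies <- model.matrix(~ housing - 1)
colnames(dummies) <- levels(housing)

# Every row sums to 1, so any one column is determined by the others
row_sums <- rowSums(dummies)

# Rebuild the "ForFree" column from the remaining two columns
for_free_rebuilt <- 1 - dummies[, "Own"] - dummies[, "Rent"]

print(row_sums)
print(all(for_free_rebuilt == dummies[, "ForFree"]))
```

Dropping columns by name rather than by position would probably be more robust if the column order ever changed, but the index vector above is kept as given.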

2. Explore the dataset to understand its structure. It’s okay to use the same code from the last homework. (5pts)

str(GermanCredit)

## 'data.frame':    1000 obs. of  49 variables:
##  $ Duration                          : int  6 48 12 42 24 36 24 36 12 30 ...
##  $ Amount                            : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
##  $ InstallmentRatePercentage         : int  4 2 2 2 3 2 3 2 2 4 ...
##  $ ResidenceDuration                 : int  4 2 3 4 4 4 4 2 4 2 ...
##  $ Age                               : int  67 22 49 45 53 35 53 35 61 28 ...
##  $ NumberExistingCredits             : int  2 1 1 1 2 1 1 1 1 2 ...
##  $ NumberPeopleMaintenance           : int  1 1 2 2 2 2 1 1 1 1 ...
##  $ Telephone                         : num  0 1 1 1 1 0 1 0 1 1 ...
##  $ ForeignWorker                     : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Class                             : Factor w/ 2 levels "0","1": 2 1 2 2 1 2 2 2 2 1 ...
##  $ CheckingAccountStatus.lt.0        : num  1 0 0 1 1 0 0 0 0 0 ...
##  $ CheckingAccountStatus.0.to.200    : num  0 1 0 0 0 0 0 1 0 1 ...
##  $ CheckingAccountStatus.gt.200      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CreditHistory.NoCredit.AllPaid    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CreditHistory.ThisBank.AllPaid    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CreditHistory.PaidDuly            : num  0 1 0 1 0 1 1 1 1 0 ...
##  $ CreditHistory.Delay               : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ Purpose.NewCar                    : num  0 0 0 0 1 0 0 0 0 1 ...
##  $ Purpose.UsedCar                   : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ Purpose.Furniture.Equipment       : num  0 0 0 1 0 0 1 0 0 0 ...
##  $ Purpose.Radio.Television          : num  1 1 0 0 0 0 0 0 1 0 ...
##  $ Purpose.DomesticAppliance         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Repairs                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Education                 : num  0 0 1 0 0 1 0 0 0 0 ...
##  $ Purpose.Retraining                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Business                  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SavingsAccountBonds.lt.100        : num  0 1 1 1 1 0 0 1 0 1 ...
##  $ SavingsAccountBonds.100.to.500    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SavingsAccountBonds.500.to.1000   : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ SavingsAccountBonds.gt.1000       : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ EmploymentDuration.lt.1           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ EmploymentDuration.1.to.4         : num  0 1 0 0 1 1 0 1 0 0 ...
##  $ EmploymentDuration.4.to.7         : num  0 0 1 1 0 0 0 0 1 0 ...
##  $ EmploymentDuration.gt.7           : num  1 0 0 0 0 0 1 0 0 0 ...
##  $ Personal.Male.Divorced.Seperated  : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ Personal.Female.NotSingle         : num  0 1 0 0 0 0 0 0 0 0 ...
##  $ Personal.Male.Single              : num  1 0 1 1 1 1 1 1 0 0 ...
##  $ OtherDebtorsGuarantors.None       : num  1 1 1 0 1 1 1 1 1 1 ...
##  $ OtherDebtorsGuarantors.CoApplicant: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Property.RealEstate               : num  1 1 1 0 0 0 0 0 1 0 ...
##  $ Property.Insurance                : num  0 0 0 1 0 0 1 0 0 0 ...
##  $ Property.CarOther                 : num  0 0 0 0 0 0 0 1 0 1 ...
##  $ OtherInstallmentPlans.Bank        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherInstallmentPlans.Stores      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Housing.Rent                      : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ Housing.Own                       : num  1 1 1 0 0 0 1 0 1 1 ...
##  $ Job.UnemployedUnskilled           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Job.UnskilledResident             : num  0 0 1 0 0 1 0 0 1 0 ...
##  $ Job.SkilledEmployee               : num  1 1 0 1 1 0 1 0 0 0 ...

summary(GermanCredit)

##     Duration        Amount      InstallmentRatePercentage ResidenceDuration
##  Min.   : 4.0   Min.   :  250   Min.   :1.000             Min.   :1.000    
##  1st Qu.:12.0   1st Qu.: 1366   1st Qu.:2.000             1st Qu.:2.000    
##  Median :18.0   Median : 2320   Median :3.000             Median :3.000    
##  Mean   :20.9   Mean   : 3271   Mean   :2.973             Mean   :2.845    
##  3rd Qu.:24.0   3rd Qu.: 3972   3rd Qu.:4.000             3rd Qu.:4.000    
##  Max.   :72.0   Max.   :18424   Max.   :4.000             Max.   :4.000    
##       Age        NumberExistingCredits NumberPeopleMaintenance   Telephone    
##  Min.   :19.00   Min.   :1.000         Min.   :1.000           Min.   :0.000  
##  1st Qu.:27.00   1st Qu.:1.000         1st Qu.:1.000           1st Qu.:0.000  
##  Median :33.00   Median :1.000         Median :1.000           Median :1.000  
##  Mean   :35.55   Mean   :1.407         Mean   :1.155           Mean   :0.596  
##  3rd Qu.:42.00   3rd Qu.:2.000         3rd Qu.:1.000           3rd Qu.:1.000  
##  Max.   :75.00   Max.   :4.000         Max.   :2.000           Max.   :1.000  
##  ForeignWorker   Class   CheckingAccountStatus.lt.0
##  Min.   :0.000   0:300   Min.   :0.000             
##  1st Qu.:1.000   1:700   1st Qu.:0.000             
##  Median :1.000           Median :0.000             
##  Mean   :0.963           Mean   :0.274             
##  3rd Qu.:1.000           3rd Qu.:1.000             
##  Max.   :1.000           Max.   :1.000             
##  CheckingAccountStatus.0.to.200 CheckingAccountStatus.gt.200
##  Min.   :0.000                  Min.   :0.000               
##  1st Qu.:0.000                  1st Qu.:0.000               
##  Median :0.000                  Median :0.000               
##  Mean   :0.269                  Mean   :0.063               
##  3rd Qu.:1.000                  3rd Qu.:0.000               
##  Max.   :1.000                  Max.   :1.000               
##  CreditHistory.NoCredit.AllPaid CreditHistory.ThisBank.AllPaid
##  Min.   :0.00                   Min.   :0.000                 
##  1st Qu.:0.00                   1st Qu.:0.000                 
##  Median :0.00                   Median :0.000                 
##  Mean   :0.04                   Mean   :0.049                 
##  3rd Qu.:0.00                   3rd Qu.:0.000                 
##  Max.   :1.00                   Max.   :1.000                 
##  CreditHistory.PaidDuly CreditHistory.Delay Purpose.NewCar  Purpose.UsedCar
##  Min.   :0.00           Min.   :0.000       Min.   :0.000   Min.   :0.000  
##  1st Qu.:0.00           1st Qu.:0.000       1st Qu.:0.000   1st Qu.:0.000  
##  Median :1.00           Median :0.000       Median :0.000   Median :0.000  
##  Mean   :0.53           Mean   :0.088       Mean   :0.234   Mean   :0.103  
##  3rd Qu.:1.00           3rd Qu.:0.000       3rd Qu.:0.000   3rd Qu.:0.000  
##  Max.   :1.00           Max.   :1.000       Max.   :1.000   Max.   :1.000  
##  Purpose.Furniture.Equipment Purpose.Radio.Television Purpose.DomesticAppliance
##  Min.   :0.000               Min.   :0.00             Min.   :0.000            
##  1st Qu.:0.000               1st Qu.:0.00             1st Qu.:0.000            
##  Median :0.000               Median :0.00             Median :0.000            
##  Mean   :0.181               Mean   :0.28             Mean   :0.012            
##  3rd Qu.:0.000               3rd Qu.:1.00             3rd Qu.:0.000            
##  Max.   :1.000               Max.   :1.00             Max.   :1.000            
##  Purpose.Repairs Purpose.Education Purpose.Retraining Purpose.Business
##  Min.   :0.000   Min.   :0.00      Min.   :0.000      Min.   :0.000   
##  1st Qu.:0.000   1st Qu.:0.00      1st Qu.:0.000      1st Qu.:0.000   
##  Median :0.000   Median :0.00      Median :0.000      Median :0.000   
##  Mean   :0.022   Mean   :0.05      Mean   :0.009      Mean   :0.097   
##  3rd Qu.:0.000   3rd Qu.:0.00      3rd Qu.:0.000      3rd Qu.:0.000   
##  Max.   :1.000   Max.   :1.00      Max.   :1.000      Max.   :1.000   
##  SavingsAccountBonds.lt.100 SavingsAccountBonds.100.to.500
##  Min.   :0.000              Min.   :0.000                 
##  1st Qu.:0.000              1st Qu.:0.000                 
##  Median :1.000              Median :0.000                 
##  Mean   :0.603              Mean   :0.103                 
##  3rd Qu.:1.000              3rd Qu.:0.000                 
##  Max.   :1.000              Max.   :1.000                 
##  SavingsAccountBonds.500.to.1000 SavingsAccountBonds.gt.1000
##  Min.   :0.000                   Min.   :0.000              
##  1st Qu.:0.000                   1st Qu.:0.000              
##  Median :0.000                   Median :0.000              
##  Mean   :0.063                   Mean   :0.048              
##  3rd Qu.:0.000                   3rd Qu.:0.000              
##  Max.   :1.000                   Max.   :1.000              
##  EmploymentDuration.lt.1 EmploymentDuration.1.to.4 EmploymentDuration.4.to.7
##  Min.   :0.000           Min.   :0.000             Min.   :0.000            
##  1st Qu.:0.000           1st Qu.:0.000             1st Qu.:0.000            
##  Median :0.000           Median :0.000             Median :0.000            
##  Mean   :0.172           Mean   :0.339             Mean   :0.174            
##  3rd Qu.:0.000           3rd Qu.:1.000             3rd Qu.:0.000            
##  Max.   :1.000           Max.   :1.000             Max.   :1.000            
##  EmploymentDuration.gt.7 Personal.Male.Divorced.Seperated
##  Min.   :0.000           Min.   :0.00                    
##  1st Qu.:0.000           1st Qu.:0.00                    
##  Median :0.000           Median :0.00                    
##  Mean   :0.253           Mean   :0.05                    
##  3rd Qu.:1.000           3rd Qu.:0.00                    
##  Max.   :1.000           Max.   :1.00                    
##  Personal.Female.NotSingle Personal.Male.Single OtherDebtorsGuarantors.None
##  Min.   :0.00              Min.   :0.000        Min.   :0.000              
##  1st Qu.:0.00              1st Qu.:0.000        1st Qu.:1.000              
##  Median :0.00              Median :1.000        Median :1.000              
##  Mean   :0.31              Mean   :0.548        Mean   :0.907              
##  3rd Qu.:1.00              3rd Qu.:1.000        3rd Qu.:1.000              
##  Max.   :1.00              Max.   :1.000        Max.   :1.000              
##  OtherDebtorsGuarantors.CoApplicant Property.RealEstate Property.Insurance
##  Min.   :0.000                      Min.   :0.000       Min.   :0.000     
##  1st Qu.:0.000                      1st Qu.:0.000       1st Qu.:0.000     
##  Median :0.000                      Median :0.000       Median :0.000     
##  Mean   :0.041                      Mean   :0.282       Mean   :0.232     
##  3rd Qu.:0.000                      3rd Qu.:1.000       3rd Qu.:0.000     
##  Max.   :1.000                      Max.   :1.000       Max.   :1.000     
##  Property.CarOther OtherInstallmentPlans.Bank OtherInstallmentPlans.Stores
##  Min.   :0.000     Min.   :0.000              Min.   :0.000               
##  1st Qu.:0.000     1st Qu.:0.000              1st Qu.:0.000               
##  Median :0.000     Median :0.000              Median :0.000               
##  Mean   :0.332     Mean   :0.139              Mean   :0.047               
##  3rd Qu.:1.000     3rd Qu.:0.000              3rd Qu.:0.000               
##  Max.   :1.000     Max.   :1.000              Max.   :1.000               
##   Housing.Rent    Housing.Own    Job.UnemployedUnskilled Job.UnskilledResident
##  Min.   :0.000   Min.   :0.000   Min.   :0.000           Min.   :0.0          
##  1st Qu.:0.000   1st Qu.:0.000   1st Qu.:0.000           1st Qu.:0.0          
##  Median :0.000   Median :1.000   Median :0.000           Median :0.0          
##  Mean   :0.179   Mean   :0.713   Mean   :0.022           Mean   :0.2          
##  3rd Qu.:0.000   3rd Qu.:1.000   3rd Qu.:0.000           3rd Qu.:0.0          
##  Max.   :1.000   Max.   :1.000   Max.   :1.000           Max.   :1.0          
##  Job.SkilledEmployee
##  Min.   :0.00       
##  1st Qu.:0.00       
##  Median :1.00       
##  Mean   :0.63       
##  3rd Qu.:1.00       
##  Max.   :1.00

head(GermanCredit)

##   Duration Amount InstallmentRatePercentage ResidenceDuration Age
## 1        6   1169                         4                 4  67
## 2       48   5951                         2                 2  22
## 3       12   2096                         2                 3  49
## 4       42   7882                         2                 4  45
## 5       24   4870                         3                 4  53
## 6       36   9055                         2                 4  35
##   NumberExistingCredits NumberPeopleMaintenance Telephone ForeignWorker Class
## 1                     2                       1         0             1     1
## 2                     1                       1         1             1     0
## 3                     1                       2         1             1     1
## 4                     1                       2         1             1     1
## 5                     2                       2         1             1     0
## 6                     1                       2         0             1     1
##   CheckingAccountStatus.lt.0 CheckingAccountStatus.0.to.200
## 1                          1                              0
## 2                          0                              1
## 3                          0                              0
## 4                          1                              0
## 5                          1                              0
## 6                          0                              0
##   CheckingAccountStatus.gt.200 CreditHistory.NoCredit.AllPaid
## 1                            0                              0
## 2                            0                              0
## 3                            0                              0
## 4                            0                              0
## 5                            0                              0
## 6                            0                              0
##   CreditHistory.ThisBank.AllPaid CreditHistory.PaidDuly CreditHistory.Delay
## 1                              0                      0                   0
## 2                              0                      1                   0
## 3                              0                      0                   0
## 4                              0                      1                   0
## 5                              0                      0                   1
## 6                              0                      1                   0
##   Purpose.NewCar Purpose.UsedCar Purpose.Furniture.Equipment
## 1              0               0                           0
## 2              0               0                           0
## 3              0               0                           0
## 4              0               0                           1
## 5              1               0                           0
## 6              0               0                           0
##   Purpose.Radio.Television Purpose.DomesticAppliance Purpose.Repairs
## 1                        1                         0               0
## 2                        1                         0               0
## 3                        0                         0               0
## 4                        0                         0               0
## 5                        0                         0               0
## 6                        0                         0               0
##   Purpose.Education Purpose.Retraining Purpose.Business
## 1                 0                  0                0
## 2                 0                  0                0
## 3                 1                  0                0
## 4                 0                  0                0
## 5                 0                  0                0
## 6                 1                  0                0
##   SavingsAccountBonds.lt.100 SavingsAccountBonds.100.to.500
## 1                          0                              0
## 2                          1                              0
## 3                          1                              0
## 4                          1                              0
## 5                          1                              0
## 6                          0                              0
##   SavingsAccountBonds.500.to.1000 SavingsAccountBonds.gt.1000
## 1                               0                           0
## 2                               0                           0
## 3                               0                           0
## 4                               0                           0
## 5                               0                           0
## 6                               0                           0
##   EmploymentDuration.lt.1 EmploymentDuration.1.to.4 EmploymentDuration.4.to.7
## 1                       0                         0                         0
## 2                       0                         1                         0
## 3                       0                         0                         1
## 4                       0                         0                         1
## 5                       0                         1                         0
## 6                       0                         1                         0
##   EmploymentDuration.gt.7 Personal.Male.Divorced.Seperated
## 1                       1                                0
## 2                       0                                0
## 3                       0                                0
## 4                       0                                0
## 5                       0                                0
## 6                       0                                0
##   Personal.Female.NotSingle Personal.Male.Single OtherDebtorsGuarantors.None
## 1                         0                    1                           1
## 2                         1                    0                           1
## 3                         0                    1                           1
## 4                         0                    1                           0
## 5                         0                    1                           1
## 6                         0                    1                           1
##   OtherDebtorsGuarantors.CoApplicant Property.RealEstate Property.Insurance
## 1                                  0                   1                  0
## 2                                  0                   1                  0
## 3                                  0                   1                  0
## 4                                  0                   0                  1
## 5                                  0                   0                  0
## 6                                  0                   0                  0
##   Property.CarOther OtherInstallmentPlans.Bank OtherInstallmentPlans.Stores
## 1                 0                          0                            0
## 2                 0                          0                            0
## 3                 0                          0                            0
## 4                 0                          0                            0
## 5                 0                          0                            0
## 6                 0                          0                            0
##   Housing.Rent Housing.Own Job.UnemployedUnskilled Job.UnskilledResident
## 1            0           1                       0                     0
## 2            0           1                       0                     0
## 3            0           1                       0                     1
## 4            0           0                       0                     0
## 5            0           0                       0                     0
## 6            0           0                       0                     1
##   Job.SkilledEmployee
## 1                   1
## 2                   1
## 3                   0
## 4                   1
## 5                   1
## 6                   0

Your observation: The GermanCredit dataset has 1000 observations and, after dropping the uninformative columns, 49 variables describing customers’ financial and personal information. Seven variables are integer-valued (such as Duration, Amount, and Age); the rest, apart from Class, are 0/1 indicator variables that encode categorical attributes such as checking account status and employment duration. The target variable is Class, which shows whether a customer is a good (1) or bad (0) credit risk. The summary shows 700 good cases and 300 bad ones, so the data is moderately imbalanced. Overall, the dataset looks clean and usable for building a classification model.

3. Split the dataset into training and test set with 80-20 split. Please use the random seed as `2024` for reproducibility. (5pts)

set.seed(2024)
train_index <- createDataPartition(GermanCredit$Class, p = 0.8, list = FALSE)
train_data <- GermanCredit[train_index, ]
test_data <- GermanCredit[-train_index, ]
dim(train_data)

## [1] 800  49

dim(test_data)

## [1] 200  49

Your observation: The dataset was split into training and test sets using an 80-20 split with a random seed of 2024. The training set contains 800 observations and the test set contains 200. Both sets have the same 49 variables, so the structure of the data is consistent for modeling.
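createDataPartition samples within each level of the outcome, so the 30/70 class balance should carry over to both parts. The sketch below imitates that idea in base R on a stand-in response vector with the same class counts. It is an illustration of stratified sampling only, not caret's actual implementation, and it will not select the same rows as the call above.

```r
# Hypothetical stand-in for Class: 300 "0" and 700 "1", as reported by summary()
set.seed(2024)
y <- factor(rep(c("0", "1"), times = c(300, 700)))

# Sample 80% of the row indices within each class (stratified sampling)
idx_by_class <- split(seq_along(y), y)
train_idx <- unlist(lapply(idx_by_class, function(idx) {
  sample(idx, size = round(0.8 * length(idx)))
}))

train_y <- y[train_idx]
test_y  <- y[-train_idx]

# Class counts in each part
print(table(train_y))
print(table(test_y))
```

With these counts the training part holds 240 observations of class 0 and 560 of class 1, which matches the column totals of the training confusion matrices reported later in this document.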

Task 2: SVM without weighted class cost (30pts)

1. Fit an SVM model using the training set with a linear kernel. Please use all variables, but make sure the variable types are right. On an older laptop, this could take some time! (10pts)

library(e1071)

## Warning: package 'e1071' was built under R version 4.4.3

## 
## Attaching package: 'e1071'

## The following object is masked from 'package:ggplot2':
## 
##     element

train_data$Class <- as.factor(train_data$Class)
svm_model <- svm(Class ~ ., 
                 data = train_data, 
                 kernel = "linear")
svm_model

## 
## Call:
## svm(formula = Class ~ ., data = train_data, kernel = "linear")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  1 
## 
## Number of Support Vectors:  403

Your observation: The linear SVM model ran successfully on the training set. It used all predictors and ended up with 403 support vectors, which are the observations that help define the separating boundary between the two classes.

2. Use the training set to get predicted classes. (5pts)

train_pred <- predict(svm_model, newdata = train_data)
head(train_pred)

## 1 2 3 4 5 6 
## 1 0 1 1 0 1 
## Levels: 0 1

Your observation: The SVM model was used to generate predicted class labels for the training dataset. The predictions are given as binary values (0 and 1), representing the two credit classes.

3. Obtain confusion matrix and MR on training set. (5pts)

conf_matrix <- table(Predicted = train_pred, Actual = train_data$Class)
conf_matrix

##          Actual
## Predicted   0   1
##         0 148  68
##         1  92 492

MR <- 1 - sum(diag(conf_matrix)) / sum(conf_matrix)
MR

## [1] 0.2

Your observation: The confusion matrix shows that the SVM model correctly classified 148 observations of class 0 and 492 of class 1. It misclassified 92 bad customers as good and 68 good customers as bad. The misclassification rate is (92 + 68) / 800 = 0.20, meaning that 20% of the training observations were incorrectly classified, for an accuracy of 80%.
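The overall misclassification rate hides how the errors are split between the two classes. As a small check, the sketch below re-enters the training confusion matrix printed above and computes the overall rate together with the error rate within each actual class. The object names are illustrative.

```r
# Training confusion matrix re-entered from the output above
# (filled column by column: Actual 0 first, then Actual 1)
conf_train <- matrix(c(148, 92, 68, 492), nrow = 2,
                     dimnames = list(Predicted = c("0", "1"),
                                     Actual = c("0", "1")))

# Overall misclassification rate: share of off-diagonal counts
mr_train <- 1 - sum(diag(conf_train)) / sum(conf_train)

# Error rate within each actual class
err_class0 <- conf_train["1", "0"] / sum(conf_train[, "0"])  # bad predicted as good
err_class1 <- conf_train["0", "1"] / sum(conf_train[, "1"])  # good predicted as bad

print(round(c(MR = mr_train, class0 = err_class0, class1 = err_class1), 4))
```

The per-class rates show that the unweighted model is wrong on roughly 38% of the bad customers but only about 12% of the good ones.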

4. Use the testing set to get predicted classes. (5pts)

test_pred <- predict(svm_model, newdata = test_data)
head(test_pred)

## 17 21 24 28 33 46 
##  1  1  1  1  1  1 
## Levels: 0 1

Your observation: The SVM model was used to generate predicted class labels for the test dataset. The predictions are given as binary values (0 and 1), representing the two credit classes. These predictions will be used to evaluate the model’s performance on unseen data.

5. Obtain confusion matrix and MR on testing set. (5pts)

conf_matrix_test <- table(Predicted = test_pred, Actual = test_data$Class)
conf_matrix_test

##          Actual
## Predicted   0   1
##         0  31  24
##         1  29 116

MR_test <- 1 - sum(diag(conf_matrix_test)) / sum(conf_matrix_test)
MR_test

## [1] 0.265

Your observation: On the test set, the model correctly classified 31 observations of class 0 and 116 of class 1, while misclassifying 29 bad customers as good and 24 good customers as bad. The misclassification rate is 53 / 200 = 0.265, meaning 26.5% of the test observations were incorrectly classified. This is higher than the 20% training error, which is expected for data the model has not seen.

Task 3: SVM with weighted class cost, and probabilities enabled (35pts, 5pts each)

1. Fit an SVM model using the training set with a weight of 2 on “1” and a weight of 1 on “0”. Please use all variables, but make sure the variable types are right. Also, enable probability fitting with `probability = TRUE`.

train_data$Class <- as.factor(train_data$Class)
svm_weighted <- svm(
  Class ~ .,
  data = train_data,
  kernel = "linear",
  class.weights = c("0" = 1, "1" = 2),
  probability = TRUE
)
svm_weighted

## 
## Call:
## svm(formula = Class ~ ., data = train_data, kernel = "linear", class.weights = c(`0` = 1, 
##     `1` = 2), probability = TRUE)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  1 
## 
## Number of Support Vectors:  378

Your observation: A weighted SVM model with a linear kernel was fit using the training data. A higher weight was assigned to class 1 to emphasize its importance during classification. Probability estimation was also enabled. The model used 378 support vectors, indicating a slightly different decision boundary compared to the unweighted model.

2. Use the training set to get predicted probabilities and classes.

train_pred_w <- predict(svm_weighted, newdata = train_data)

train_prob_w <- attr(predict(svm_weighted, newdata = train_data, probability = TRUE), "probabilities")

head(train_pred_w)

## 1 2 3 4 5 6 
## 1 1 1 1 0 1 
## Levels: 0 1

head(train_prob_w)

##           1          0
## 1 0.8407723 0.15922768
## 2 0.4059899 0.59401014
## 3 0.9206866 0.07931335
## 4 0.7775040 0.22249604
## 5 0.2748831 0.72511691
## 6 0.7380708 0.26192923

Your observation: The weighted SVM model was used to generate both predicted class labels and class probabilities for the training dataset. The class predictions show whether each observation is classified as 0 or 1, while the probability output gives an estimated probability for each class (note that the column order is "1" then "0"). Because class 1 carries the higher weight, the decision boundary is shifted toward predicting class 1, and five of the first six observations are labeled 1.
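
One detail in the output above is worth checking. The class labels were obtained from `predict()` without `probability = TRUE`, so they follow the SVM decision function, while the probabilities come from a separate calibration step fitted on top of it. The two need not agree. The sketch below re-enters the six class 1 probabilities and labels printed above (the vector names are illustrative, not part of the original code) and compares the labels with a 0.5 cutoff on the probabilities.

```r
# Class 1 probabilities and predicted labels for the first six training rows,
# copied from the printed output above (vector names are illustrative).
prob_1 <- c(0.8407723, 0.4059899, 0.9206866, 0.7775040, 0.2748831, 0.7380708)
pred_class <- c(1, 1, 1, 1, 0, 1)

# Labels implied by a 0.5 cutoff on the class 1 probability.
cutoff_class <- as.numeric(prob_1 > 0.5)

# Rows where the two labelings disagree.
disagree <- which(cutoff_class != pred_class)
print(data.frame(prob_1, pred_class, cutoff_class))
print(disagree)
```

Row 2 has a class 1 probability of about 0.41 but is still labeled 1, so the confusion matrices (built from labels) and the ROC curves (built from probabilities) are not driven by the same cutoff.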

3. Obtain confusion matrix and MR on training set (use predicted classes).

conf_matrix_w <- table(Predicted = train_pred_w, Actual = train_data$Class)
conf_matrix_w

##          Actual
## Predicted   0   1
##         0  64   8
##         1 176 552

MR_w <- 1 - sum(diag(conf_matrix_w)) / sum(conf_matrix_w)
MR_w

## [1] 0.23

Your observation: The confusion matrix shows that the weighted SVM model classifies class 1 very accurately: 552 of the 560 actual class 1 observations are predicted correctly. This comes at the cost of more errors in class 0, where 176 of the 240 actual class 0 observations are predicted as 1. The misclassification rate is 0.23, which is slightly higher than the unweighted model. Overall accuracy decreased, but the model became better at identifying class 1 because of the higher weight assigned to it.
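
The trade-off between the two classes is easier to see as per-class rates than as a single MR. The sketch below re-enters the training confusion matrix printed above (the object names `cm`, `per_class_rate` and `mr` are illustrative) and computes the share of each actual class that was predicted correctly.

```r
# Training confusion matrix of the weighted SVM, re-entered from the output
# above (rows = predicted, columns = actual; matrix() fills column by column).
cm <- matrix(c(64, 176, 8, 552), nrow = 2,
             dimnames = list(Predicted = c("0", "1"), Actual = c("0", "1")))

# Share of each actual class that was predicted correctly.
per_class_rate <- diag(cm) / colSums(cm)

# Overall misclassification rate, computed the same way as MR_w.
mr <- 1 - sum(diag(cm)) / sum(cm)

print(round(per_class_rate, 4))
print(mr)
```

About 98.6% of actual class 1 observations are classified correctly, against about 26.7% of actual class 0 observations, which the overall rate of 0.23 hides.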

4. Obtain ROC and AUC on training set (use predicted probabilities).

library(pROC)

## Warning: package 'pROC' was built under R version 4.4.3

## Type 'citation("pROC")' for a citation.

## 
## Attaching package: 'pROC'

## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var

train_prob_1 <- train_prob_w[, "1"]


roc_obj <- roc(train_data$Class, train_prob_1)

## Setting levels: control = 0, case = 1

## Setting direction: controls < cases

plot(roc_obj, col = "blue", main = "ROC Curve - Training Set")

auc(roc_obj)

## Area under the curve: 0.825

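Your observation: The training AUC of 0.825 means that a randomly chosen class 1 observation receives a higher predicted class 1 probability than a randomly chosen class 0 observation about 82.5% of the time. This indicates good discrimination, although a figure computed on the training data tends to be optimistic.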
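
The AUC can also be computed directly as a rank statistic, which makes its meaning concrete. The sketch below is a minimal base R version using the Mann-Whitney formula; the function name `auc_rank` and the toy data are made up for illustration and are not part of the original analysis.

```r
# AUC via the rank (Mann-Whitney) formula: the probability that a randomly
# chosen case (class 1) gets a higher score than a randomly chosen control
# (class 0), with ties counted as one half.
auc_rank <- function(labels, scores) {
  r <- rank(scores)              # average ranks handle tied scores
  n1 <- sum(labels == 1)         # number of cases
  n0 <- sum(labels == 0)         # number of controls
  (sum(r[labels == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Toy data, made up for illustration only.
toy_labels <- c(0, 0, 1, 1, 0, 1)
toy_scores <- c(0.10, 0.40, 0.35, 0.80, 0.20, 0.90)
print(auc_rank(toy_labels, toy_scores))
```

In the toy data, 8 of the 9 case-control pairs are ordered correctly, so the result is 8/9. Applied to the training data as `auc_rank(as.numeric(as.character(train_data$Class)), train_prob_1)`, it should closely match the value reported by `auc(roc_obj)`.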

5. Use the testing set to get predicted probabilities and classes.

test_pred_w <- predict(svm_weighted, newdata = test_data)


test_prob_w <- attr(predict(svm_weighted, newdata = test_data, probability = TRUE), "probabilities")


head(test_pred_w)

## 17 21 24 28 33 46 
##  1  1  1  1  1  1 
## Levels: 0 1

head(test_prob_w)

##            1         0
## 17 0.8571835 0.1428165
## 21 0.8560695 0.1439305
## 24 0.8688477 0.1311523
## 28 0.8136918 0.1863082
## 33 0.7015288 0.2984712
## 46 0.7625487 0.2374513

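Your observation: All of the first six test observations are predicted as class 1, with class 1 probabilities between about 0.70 and 0.87. This is consistent with the weighting scheme applied during training, which pushes the model toward predicting class 1.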

6. Obtain confusion matrix and MR on testing set (use predicted classes).

conf_matrix_test_w <- table(Predicted = test_pred_w, Actual = test_data$Class)
conf_matrix_test_w

##          Actual
## Predicted   0   1
##         0  13   9
##         1  47 131

MR_test_w <- 1 - sum(diag(conf_matrix_test_w)) / sum(conf_matrix_test_w)
MR_test_w

## [1] 0.28

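Your observation: The weighted model improves detection of class 1, correctly identifying 131 of 140 class 1 test observations compared with 116 for the unweighted model. It does so at the expense of class 0, where only 13 of 60 are identified correctly compared with 31, and the overall misclassification rate rises from 0.265 to 0.28.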
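
Plain MR counts both kinds of error equally, whereas the weighted fit treated errors on class 1 as twice as costly. The sketch below compares the two test confusion matrices under a cost scheme that mirrors those weights. The 2:1 cost ratio and the helper `weighted_cost` are assumptions made for illustration; the SVM fit itself does not report such a cost.

```r
# Test-set confusion matrices re-entered from the outputs above
# (rows = predicted, columns = actual; matrix() fills column by column).
dn <- list(Predicted = c("0", "1"), Actual = c("0", "1"))
cm_unweighted <- matrix(c(31, 29, 24, 116), nrow = 2, dimnames = dn)
cm_weighted   <- matrix(c(13, 47, 9, 131), nrow = 2, dimnames = dn)

# Assumed cost scheme mirroring the class weights: misclassifying an actual
# "1" costs 2, misclassifying an actual "0" costs 1 (illustrative choice).
weighted_cost <- function(cm, cost_miss_1 = 2, cost_miss_0 = 1) {
  cost_miss_1 * cm["0", "1"] + cost_miss_0 * cm["1", "0"]
}

print(weighted_cost(cm_unweighted))
print(weighted_cost(cm_weighted))
```

Under this assumed cost scheme the weighted model scores 65 against 77 for the unweighted model, even though its plain misclassification rate is higher.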

7. Obtain ROC and AUC on testing set (use predicted probabilities).

test_prob_1 <- test_prob_w[, "1"]


roc_test <- roc(test_data$Class, test_prob_1)

## Setting levels: control = 0, case = 1

## Setting direction: controls < cases

plot(roc_test, col = "red", main = "ROC Curve - Test Set")

auc(roc_test)

## Area under the curve: 0.7115

Your observation: The AUC on the test set is 0.7115, which shows the model has a reasonable ability to rank class 1 observations above class 0 observations. The drop from 0.825 on the training set indicates some loss of performance on unseen data.

Task 4: Conclusion (15pts)

1. Summarize your findings and discuss what you observed from the above analysis. (5pts)

From the analysis, the SVM models were able to classify the credit data reasonably well. The unweighted SVM had the lower misclassification rate (0.265 on the test set), while the weighted SVM improved the model's ability to correctly identify class 1 by assigning it a higher weight. This came at the cost of a slightly higher overall error (0.28 on the test set) and many more class 0 observations predicted as 1. The AUC of the weighted model fell from 0.825 on the training data to 0.7115 on the test data, indicating some loss in performance on unseen data. Which model is preferable depends on how costly each type of error is.

2. Please recall the results from last homework. How do you compare SVM to logistic regression? No coding is required for this question, just discuss. (10pts)

Compared to logistic regression from the previous homework, SVM focuses on finding a maximum-margin boundary between the classes rather than estimating probabilities directly. Logistic regression is easier to interpret because the coefficients show the effect of each variable on the log-odds, while SVM is more of a “black box” model, particularly with a nonlinear kernel. In terms of performance, SVM can sometimes achieve better classification results, especially when a nonlinear kernel is needed because the classes are not linearly separable. However, logistic regression is generally faster and more straightforward, while SVM may require more tuning and computational time. Overall, SVM may provide better predictive performance, but logistic regression is easier to understand and interpret.
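
The difference in what the two methods optimize can be sketched through their loss functions, written in terms of the margin m = y * f(x), with y coded as -1 or +1 and f(x) the model's score. This is a generic textbook illustration, not output from either homework; the function names are chosen for this example.

```r
# Loss as a function of the margin m = y * f(x), with y in {-1, +1}.
hinge_loss    <- function(m) pmax(0, 1 - m)     # linear SVM
logistic_loss <- function(m) log1p(exp(-m))     # logistic regression

margins <- c(-2, -1, 0, 1, 2)
print(data.frame(margin   = margins,
                 hinge    = hinge_loss(margins),
                 logistic = round(logistic_loss(margins), 4)))
```

The hinge loss is exactly zero once the margin reaches 1, so points far from the boundary do not influence the fit and only the support vectors matter. The logistic loss never reaches zero, so every observation contributes, which is what lets logistic regression produce probabilities directly.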

Homework5

Tianhai Zu

10/22/2023

3. (Optional) Change the kernel to others such as `radial`, and see if you get a better result.
