Starter code for German credit scoring

Refer to http://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data) for the variable descriptions. The response variable is Class and all other variables are predictors.

Run the following code only once to install the caret package. The German credit scoring data is provided in that package.

install.packages('caret')

Task 1: Data Preparation

1. Load the caret package and the GermanCredit dataset. (10pts)

library(caret) # this package contains the German credit data in numeric format

## Warning: package 'caret' was built under R version 4.3.3

## Loading required package: ggplot2

## Loading required package: lattice

data(GermanCredit)
GermanCredit$Class <- as.numeric(GermanCredit$Class == "Good") # recode `Class` as 1 (Good) or 0 (Bad)
GermanCredit$Class <- as.factor(GermanCredit$Class) # svm() requires a factor response for classification; now 1 is good and 0 is bad
str(GermanCredit)

## 'data.frame':    1000 obs. of  62 variables:
##  $ Duration                              : int  6 48 12 42 24 36 24 36 12 30 ...
##  $ Amount                                : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
##  $ InstallmentRatePercentage             : int  4 2 2 2 3 2 3 2 2 4 ...
##  $ ResidenceDuration                     : int  4 2 3 4 4 4 4 2 4 2 ...
##  $ Age                                   : int  67 22 49 45 53 35 53 35 61 28 ...
##  $ NumberExistingCredits                 : int  2 1 1 1 2 1 1 1 1 2 ...
##  $ NumberPeopleMaintenance               : int  1 1 2 2 2 2 1 1 1 1 ...
##  $ Telephone                             : num  0 1 1 1 1 0 1 0 1 1 ...
##  $ ForeignWorker                         : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Class                                 : Factor w/ 2 levels "0","1": 2 1 2 2 1 2 2 2 2 1 ...
##  $ CheckingAccountStatus.lt.0            : num  1 0 0 1 1 0 0 0 0 0 ...
##  $ CheckingAccountStatus.0.to.200        : num  0 1 0 0 0 0 0 1 0 1 ...
##  $ CheckingAccountStatus.gt.200          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CheckingAccountStatus.none            : num  0 0 1 0 0 1 1 0 1 0 ...
##  $ CreditHistory.NoCredit.AllPaid        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CreditHistory.ThisBank.AllPaid        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CreditHistory.PaidDuly                : num  0 1 0 1 0 1 1 1 1 0 ...
##  $ CreditHistory.Delay                   : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ CreditHistory.Critical                : num  1 0 1 0 0 0 0 0 0 1 ...
##  $ Purpose.NewCar                        : num  0 0 0 0 1 0 0 0 0 1 ...
##  $ Purpose.UsedCar                       : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ Purpose.Furniture.Equipment           : num  0 0 0 1 0 0 1 0 0 0 ...
##  $ Purpose.Radio.Television              : num  1 1 0 0 0 0 0 0 1 0 ...
##  $ Purpose.DomesticAppliance             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Repairs                       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Education                     : num  0 0 1 0 0 1 0 0 0 0 ...
##  $ Purpose.Vacation                      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Retraining                    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Business                      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Other                         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SavingsAccountBonds.lt.100            : num  0 1 1 1 1 0 0 1 0 1 ...
##  $ SavingsAccountBonds.100.to.500        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SavingsAccountBonds.500.to.1000       : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ SavingsAccountBonds.gt.1000           : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ SavingsAccountBonds.Unknown           : num  1 0 0 0 0 1 0 0 0 0 ...
##  $ EmploymentDuration.lt.1               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ EmploymentDuration.1.to.4             : num  0 1 0 0 1 1 0 1 0 0 ...
##  $ EmploymentDuration.4.to.7             : num  0 0 1 1 0 0 0 0 1 0 ...
##  $ EmploymentDuration.gt.7               : num  1 0 0 0 0 0 1 0 0 0 ...
##  $ EmploymentDuration.Unemployed         : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ Personal.Male.Divorced.Seperated      : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ Personal.Female.NotSingle             : num  0 1 0 0 0 0 0 0 0 0 ...
##  $ Personal.Male.Single                  : num  1 0 1 1 1 1 1 1 0 0 ...
##  $ Personal.Male.Married.Widowed         : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ Personal.Female.Single                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherDebtorsGuarantors.None           : num  1 1 1 0 1 1 1 1 1 1 ...
##  $ OtherDebtorsGuarantors.CoApplicant    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherDebtorsGuarantors.Guarantor      : num  0 0 0 1 0 0 0 0 0 0 ...
##  $ Property.RealEstate                   : num  1 1 1 0 0 0 0 0 1 0 ...
##  $ Property.Insurance                    : num  0 0 0 1 0 0 1 0 0 0 ...
##  $ Property.CarOther                     : num  0 0 0 0 0 0 0 1 0 1 ...
##  $ Property.Unknown                      : num  0 0 0 0 1 1 0 0 0 0 ...
##  $ OtherInstallmentPlans.Bank            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherInstallmentPlans.Stores          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherInstallmentPlans.None            : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Housing.Rent                          : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ Housing.Own                           : num  1 1 1 0 0 0 1 0 1 1 ...
##  $ Housing.ForFree                       : num  0 0 0 1 1 1 0 0 0 0 ...
##  $ Job.UnemployedUnskilled               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Job.UnskilledResident                 : num  0 0 1 0 0 1 0 0 1 0 ...
##  $ Job.SkilledEmployee                   : num  1 1 0 1 1 0 1 0 0 0 ...
##  $ Job.Management.SelfEmp.HighlyQualified: num  0 0 0 0 0 0 0 1 0 1 ...

# Drop dummy columns that provide no additional information (e.g. one redundant level per categorical variable)
GermanCredit = GermanCredit[,-c(14,19,27,30,35,40,44,45,48,52,55,58,62)]
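
Positional indices such as 14 or 62 silently point at different columns if the column order ever changes. A minimal sketch of the same kind of drop done by name, on a toy data frame whose column names are borrowed from the `str()` output above; the toy values and the `drop_cols` name are assumptions for illustration only:

```r
# Toy data frame standing in for GermanCredit (values are made up).
toy <- data.frame(
  Duration                   = c(6, 48, 12),
  CheckingAccountStatus.lt.0 = c(1, 0, 0),
  CheckingAccountStatus.none = c(0, 0, 1),
  Purpose.Vacation           = c(0, 0, 0)
)

# Columns to remove, referenced by name instead of position.
drop_cols <- c("CheckingAccountStatus.none", "Purpose.Vacation")

# Keep every column whose name is not in drop_cols.
toy_reduced <- toy[, setdiff(names(toy), drop_cols), drop = FALSE]
names(toy_reduced)
```

Dropping by name also documents which variables were removed, which the index vector alone does not.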

2. Explore the dataset to understand its structure. It’s okay to use same code from last homework. (5pts)

colSums(is.na(GermanCredit)) # the codebook reports no missing values; double-checking

##                           Duration                             Amount 
##                                  0                                  0 
##          InstallmentRatePercentage                  ResidenceDuration 
##                                  0                                  0 
##                                Age              NumberExistingCredits 
##                                  0                                  0 
##            NumberPeopleMaintenance                          Telephone 
##                                  0                                  0 
##                      ForeignWorker                              Class 
##                                  0                                  0 
##         CheckingAccountStatus.lt.0     CheckingAccountStatus.0.to.200 
##                                  0                                  0 
##       CheckingAccountStatus.gt.200     CreditHistory.NoCredit.AllPaid 
##                                  0                                  0 
##     CreditHistory.ThisBank.AllPaid             CreditHistory.PaidDuly 
##                                  0                                  0 
##                CreditHistory.Delay                     Purpose.NewCar 
##                                  0                                  0 
##                    Purpose.UsedCar        Purpose.Furniture.Equipment 
##                                  0                                  0 
##           Purpose.Radio.Television          Purpose.DomesticAppliance 
##                                  0                                  0 
##                    Purpose.Repairs                  Purpose.Education 
##                                  0                                  0 
##                 Purpose.Retraining                   Purpose.Business 
##                                  0                                  0 
##         SavingsAccountBonds.lt.100     SavingsAccountBonds.100.to.500 
##                                  0                                  0 
##    SavingsAccountBonds.500.to.1000        SavingsAccountBonds.gt.1000 
##                                  0                                  0 
##            EmploymentDuration.lt.1          EmploymentDuration.1.to.4 
##                                  0                                  0 
##          EmploymentDuration.4.to.7            EmploymentDuration.gt.7 
##                                  0                                  0 
##   Personal.Male.Divorced.Seperated          Personal.Female.NotSingle 
##                                  0                                  0 
##               Personal.Male.Single        OtherDebtorsGuarantors.None 
##                                  0                                  0 
## OtherDebtorsGuarantors.CoApplicant                Property.RealEstate 
##                                  0                                  0 
##                 Property.Insurance                  Property.CarOther 
##                                  0                                  0 
##         OtherInstallmentPlans.Bank       OtherInstallmentPlans.Stores 
##                                  0                                  0 
##                       Housing.Rent                        Housing.Own 
##                                  0                                  0 
##            Job.UnemployedUnskilled              Job.UnskilledResident 
##                                  0                                  0 
##                Job.SkilledEmployee 
##                                  0

str(GermanCredit) # structure of the data set

## 'data.frame':    1000 obs. of  49 variables:
##  $ Duration                          : int  6 48 12 42 24 36 24 36 12 30 ...
##  $ Amount                            : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
##  $ InstallmentRatePercentage         : int  4 2 2 2 3 2 3 2 2 4 ...
##  $ ResidenceDuration                 : int  4 2 3 4 4 4 4 2 4 2 ...
##  $ Age                               : int  67 22 49 45 53 35 53 35 61 28 ...
##  $ NumberExistingCredits             : int  2 1 1 1 2 1 1 1 1 2 ...
##  $ NumberPeopleMaintenance           : int  1 1 2 2 2 2 1 1 1 1 ...
##  $ Telephone                         : num  0 1 1 1 1 0 1 0 1 1 ...
##  $ ForeignWorker                     : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Class                             : Factor w/ 2 levels "0","1": 2 1 2 2 1 2 2 2 2 1 ...
##  $ CheckingAccountStatus.lt.0        : num  1 0 0 1 1 0 0 0 0 0 ...
##  $ CheckingAccountStatus.0.to.200    : num  0 1 0 0 0 0 0 1 0 1 ...
##  $ CheckingAccountStatus.gt.200      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CreditHistory.NoCredit.AllPaid    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CreditHistory.ThisBank.AllPaid    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CreditHistory.PaidDuly            : num  0 1 0 1 0 1 1 1 1 0 ...
##  $ CreditHistory.Delay               : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ Purpose.NewCar                    : num  0 0 0 0 1 0 0 0 0 1 ...
##  $ Purpose.UsedCar                   : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ Purpose.Furniture.Equipment       : num  0 0 0 1 0 0 1 0 0 0 ...
##  $ Purpose.Radio.Television          : num  1 1 0 0 0 0 0 0 1 0 ...
##  $ Purpose.DomesticAppliance         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Repairs                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Education                 : num  0 0 1 0 0 1 0 0 0 0 ...
##  $ Purpose.Retraining                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Business                  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SavingsAccountBonds.lt.100        : num  0 1 1 1 1 0 0 1 0 1 ...
##  $ SavingsAccountBonds.100.to.500    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SavingsAccountBonds.500.to.1000   : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ SavingsAccountBonds.gt.1000       : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ EmploymentDuration.lt.1           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ EmploymentDuration.1.to.4         : num  0 1 0 0 1 1 0 1 0 0 ...
##  $ EmploymentDuration.4.to.7         : num  0 0 1 1 0 0 0 0 1 0 ...
##  $ EmploymentDuration.gt.7           : num  1 0 0 0 0 0 1 0 0 0 ...
##  $ Personal.Male.Divorced.Seperated  : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ Personal.Female.NotSingle         : num  0 1 0 0 0 0 0 0 0 0 ...
##  $ Personal.Male.Single              : num  1 0 1 1 1 1 1 1 0 0 ...
##  $ OtherDebtorsGuarantors.None       : num  1 1 1 0 1 1 1 1 1 1 ...
##  $ OtherDebtorsGuarantors.CoApplicant: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Property.RealEstate               : num  1 1 1 0 0 0 0 0 1 0 ...
##  $ Property.Insurance                : num  0 0 0 1 0 0 1 0 0 0 ...
##  $ Property.CarOther                 : num  0 0 0 0 0 0 0 1 0 1 ...
##  $ OtherInstallmentPlans.Bank        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherInstallmentPlans.Stores      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Housing.Rent                      : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ Housing.Own                       : num  1 1 1 0 0 0 1 0 1 1 ...
##  $ Job.UnemployedUnskilled           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Job.UnskilledResident             : num  0 0 1 0 0 1 0 0 1 0 ...
##  $ Job.SkilledEmployee               : num  1 1 0 1 1 0 1 0 0 0 ...

summary(GermanCredit) # summary statistics

##     Duration        Amount      InstallmentRatePercentage ResidenceDuration
##  Min.   : 4.0   Min.   :  250   Min.   :1.000             Min.   :1.000    
##  1st Qu.:12.0   1st Qu.: 1366   1st Qu.:2.000             1st Qu.:2.000    
##  Median :18.0   Median : 2320   Median :3.000             Median :3.000    
##  Mean   :20.9   Mean   : 3271   Mean   :2.973             Mean   :2.845    
##  3rd Qu.:24.0   3rd Qu.: 3972   3rd Qu.:4.000             3rd Qu.:4.000    
##  Max.   :72.0   Max.   :18424   Max.   :4.000             Max.   :4.000    
##       Age        NumberExistingCredits NumberPeopleMaintenance   Telephone    
##  Min.   :19.00   Min.   :1.000         Min.   :1.000           Min.   :0.000  
##  1st Qu.:27.00   1st Qu.:1.000         1st Qu.:1.000           1st Qu.:0.000  
##  Median :33.00   Median :1.000         Median :1.000           Median :1.000  
##  Mean   :35.55   Mean   :1.407         Mean   :1.155           Mean   :0.596  
##  3rd Qu.:42.00   3rd Qu.:2.000         3rd Qu.:1.000           3rd Qu.:1.000  
##  Max.   :75.00   Max.   :4.000         Max.   :2.000           Max.   :1.000  
##  ForeignWorker   Class   CheckingAccountStatus.lt.0
##  Min.   :0.000   0:300   Min.   :0.000             
##  1st Qu.:1.000   1:700   1st Qu.:0.000             
##  Median :1.000           Median :0.000             
##  Mean   :0.963           Mean   :0.274             
##  3rd Qu.:1.000           3rd Qu.:1.000             
##  Max.   :1.000           Max.   :1.000             
##  CheckingAccountStatus.0.to.200 CheckingAccountStatus.gt.200
##  Min.   :0.000                  Min.   :0.000               
##  1st Qu.:0.000                  1st Qu.:0.000               
##  Median :0.000                  Median :0.000               
##  Mean   :0.269                  Mean   :0.063               
##  3rd Qu.:1.000                  3rd Qu.:0.000               
##  Max.   :1.000                  Max.   :1.000               
##  CreditHistory.NoCredit.AllPaid CreditHistory.ThisBank.AllPaid
##  Min.   :0.00                   Min.   :0.000                 
##  1st Qu.:0.00                   1st Qu.:0.000                 
##  Median :0.00                   Median :0.000                 
##  Mean   :0.04                   Mean   :0.049                 
##  3rd Qu.:0.00                   3rd Qu.:0.000                 
##  Max.   :1.00                   Max.   :1.000                 
##  CreditHistory.PaidDuly CreditHistory.Delay Purpose.NewCar  Purpose.UsedCar
##  Min.   :0.00           Min.   :0.000       Min.   :0.000   Min.   :0.000  
##  1st Qu.:0.00           1st Qu.:0.000       1st Qu.:0.000   1st Qu.:0.000  
##  Median :1.00           Median :0.000       Median :0.000   Median :0.000  
##  Mean   :0.53           Mean   :0.088       Mean   :0.234   Mean   :0.103  
##  3rd Qu.:1.00           3rd Qu.:0.000       3rd Qu.:0.000   3rd Qu.:0.000  
##  Max.   :1.00           Max.   :1.000       Max.   :1.000   Max.   :1.000  
##  Purpose.Furniture.Equipment Purpose.Radio.Television Purpose.DomesticAppliance
##  Min.   :0.000               Min.   :0.00             Min.   :0.000            
##  1st Qu.:0.000               1st Qu.:0.00             1st Qu.:0.000            
##  Median :0.000               Median :0.00             Median :0.000            
##  Mean   :0.181               Mean   :0.28             Mean   :0.012            
##  3rd Qu.:0.000               3rd Qu.:1.00             3rd Qu.:0.000            
##  Max.   :1.000               Max.   :1.00             Max.   :1.000            
##  Purpose.Repairs Purpose.Education Purpose.Retraining Purpose.Business
##  Min.   :0.000   Min.   :0.00      Min.   :0.000      Min.   :0.000   
##  1st Qu.:0.000   1st Qu.:0.00      1st Qu.:0.000      1st Qu.:0.000   
##  Median :0.000   Median :0.00      Median :0.000      Median :0.000   
##  Mean   :0.022   Mean   :0.05      Mean   :0.009      Mean   :0.097   
##  3rd Qu.:0.000   3rd Qu.:0.00      3rd Qu.:0.000      3rd Qu.:0.000   
##  Max.   :1.000   Max.   :1.00      Max.   :1.000      Max.   :1.000   
##  SavingsAccountBonds.lt.100 SavingsAccountBonds.100.to.500
##  Min.   :0.000              Min.   :0.000                 
##  1st Qu.:0.000              1st Qu.:0.000                 
##  Median :1.000              Median :0.000                 
##  Mean   :0.603              Mean   :0.103                 
##  3rd Qu.:1.000              3rd Qu.:0.000                 
##  Max.   :1.000              Max.   :1.000                 
##  SavingsAccountBonds.500.to.1000 SavingsAccountBonds.gt.1000
##  Min.   :0.000                   Min.   :0.000              
##  1st Qu.:0.000                   1st Qu.:0.000              
##  Median :0.000                   Median :0.000              
##  Mean   :0.063                   Mean   :0.048              
##  3rd Qu.:0.000                   3rd Qu.:0.000              
##  Max.   :1.000                   Max.   :1.000              
##  EmploymentDuration.lt.1 EmploymentDuration.1.to.4 EmploymentDuration.4.to.7
##  Min.   :0.000           Min.   :0.000             Min.   :0.000            
##  1st Qu.:0.000           1st Qu.:0.000             1st Qu.:0.000            
##  Median :0.000           Median :0.000             Median :0.000            
##  Mean   :0.172           Mean   :0.339             Mean   :0.174            
##  3rd Qu.:0.000           3rd Qu.:1.000             3rd Qu.:0.000            
##  Max.   :1.000           Max.   :1.000             Max.   :1.000            
##  EmploymentDuration.gt.7 Personal.Male.Divorced.Seperated
##  Min.   :0.000           Min.   :0.00                    
##  1st Qu.:0.000           1st Qu.:0.00                    
##  Median :0.000           Median :0.00                    
##  Mean   :0.253           Mean   :0.05                    
##  3rd Qu.:1.000           3rd Qu.:0.00                    
##  Max.   :1.000           Max.   :1.00                    
##  Personal.Female.NotSingle Personal.Male.Single OtherDebtorsGuarantors.None
##  Min.   :0.00              Min.   :0.000        Min.   :0.000              
##  1st Qu.:0.00              1st Qu.:0.000        1st Qu.:1.000              
##  Median :0.00              Median :1.000        Median :1.000              
##  Mean   :0.31              Mean   :0.548        Mean   :0.907              
##  3rd Qu.:1.00              3rd Qu.:1.000        3rd Qu.:1.000              
##  Max.   :1.00              Max.   :1.000        Max.   :1.000              
##  OtherDebtorsGuarantors.CoApplicant Property.RealEstate Property.Insurance
##  Min.   :0.000                      Min.   :0.000       Min.   :0.000     
##  1st Qu.:0.000                      1st Qu.:0.000       1st Qu.:0.000     
##  Median :0.000                      Median :0.000       Median :0.000     
##  Mean   :0.041                      Mean   :0.282       Mean   :0.232     
##  3rd Qu.:0.000                      3rd Qu.:1.000       3rd Qu.:0.000     
##  Max.   :1.000                      Max.   :1.000       Max.   :1.000     
##  Property.CarOther OtherInstallmentPlans.Bank OtherInstallmentPlans.Stores
##  Min.   :0.000     Min.   :0.000              Min.   :0.000               
##  1st Qu.:0.000     1st Qu.:0.000              1st Qu.:0.000               
##  Median :0.000     Median :0.000              Median :0.000               
##  Mean   :0.332     Mean   :0.139              Mean   :0.047               
##  3rd Qu.:1.000     3rd Qu.:0.000              3rd Qu.:0.000               
##  Max.   :1.000     Max.   :1.000              Max.   :1.000               
##   Housing.Rent    Housing.Own    Job.UnemployedUnskilled Job.UnskilledResident
##  Min.   :0.000   Min.   :0.000   Min.   :0.000           Min.   :0.0          
##  1st Qu.:0.000   1st Qu.:0.000   1st Qu.:0.000           1st Qu.:0.0          
##  Median :0.000   Median :1.000   Median :0.000           Median :0.0          
##  Mean   :0.179   Mean   :0.713   Mean   :0.022           Mean   :0.2          
##  3rd Qu.:0.000   3rd Qu.:1.000   3rd Qu.:0.000           3rd Qu.:0.0          
##  Max.   :1.000   Max.   :1.000   Max.   :1.000           Max.   :1.0          
##  Job.SkilledEmployee
##  Min.   :0.00       
##  1st Qu.:0.00       
##  Median :1.00       
##  Mean   :0.63       
##  3rd Qu.:1.00       
##  Max.   :1.00

barplot(table(GermanCredit$Class), # our focus is on `Class`; plot the count of each class
        ylab = "Frequency",
        xlab = "Class")

Your observation: There are no missing values (NAs). Looking at the summary statistics, no values stand out as abnormal. The bar plot of the response variable Class shows an imbalanced distribution: 700 observations are class 1 (good credit) and 300 are class 0 (bad credit).
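
The class imbalance also gives a useful baseline for the misclassification rates reported later. A small sketch, assuming only the class counts shown in the `summary()` output (300 and 700); the variable names are illustrative:

```r
# Class counts as reported by summary(GermanCredit): 300 bad (0), 700 good (1).
class_labels <- factor(rep(c("0", "1"), times = c(300, 700)))

class_counts <- table(class_labels)
class_props  <- prop.table(class_counts)  # share of each class

# Always predicting the majority class gives this misclassification rate.
baseline_mr <- 1 - max(class_props)
baseline_mr
```

A classifier is only informative if its misclassification rate is clearly below this majority-class baseline of 30%.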

3. Split the dataset into training and test set with 80-20 split. Please use the random seed as `2024` for reproducibility. (5pts)

set.seed(2024)
index <- sample(1:nrow(GermanCredit),nrow(GermanCredit)*0.80) #using a 80 20 split
german_train = GermanCredit[index,]
german_test = GermanCredit[-index,]

Your observation: 800 observations in training, and 200 in testing.
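
The split sizes can be checked without the data itself. The sketch below reproduces only the index logic on row numbers 1 to 1000; it is not a replacement for the code above, and `n` and `test_rows` are names introduced here for illustration:

```r
set.seed(2024)
n <- 1000  # number of rows in GermanCredit

# seq_len(n) is safe even when n is 0; floor() makes the sample size
# an explicit integer.
index <- sample(seq_len(n), size = floor(0.80 * n))
test_rows <- setdiff(seq_len(n), index)

length(index)      # training rows
length(test_rows)  # test rows
```

Because `sample()` draws without replacement by default, no row can appear in both sets.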

Task 2: SVM without weighted class cost (30pts)

1. Fit a SVM model using the training set with linear kernel. Please use all variables, but make sure the variable types are right. If running on old laptop, could take some time! (10pts)

# Load the e1071 package
library(e1071)

## Warning: package 'e1071' was built under R version 4.3.3

#library ggplot2
library(ggplot2)


# Create SVM model with linear kernel
svm_model <- svm(Class ~ ., data = german_train, kernel = 'linear') #linear SVM

# Summary of the trained model
summary(svm_model)

## 
## Call:
## svm(formula = Class ~ ., data = german_train, kernel = "linear")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  1 
## 
## Number of Support Vectors:  391
## 
##  ( 197 194 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  0 1

Your observation: Using the training set, I fit svm_model with Class as the response and all other variables as predictors. The summary shows 391 support vectors, almost half of the 800 training observations, split nearly evenly between the classes (197 for class 0 and 194 for class 1). So many support vectors suggest that the two classes overlap and are not cleanly separated by a linear boundary, so some misclassification is to be expected.

2. Use the training set to get predicted classes. (5pts)

# Make predictions on the train data
pred_class_train <- predict(svm_model, german_train)
head(pred_class_train)

## 578 549 557 700 255 913 
##   1   0   0   1   1   1 
## Levels: 0 1

Your observation: The predict function returns the predicted class for each training observation; the first six are shown.

3. Obtain confusion matrix and MR on training set. (5pts)

# Confusion matrix to evaluate the model on train data
Cmatrix_train = table(true = german_train$Class,
                      pred = pred_class_train)
Cmatrix_train

##     pred
## true   0   1
##    0 132  97
##    1  59 512

#Misclassification Rate
MR_train<- 1 - sum(diag(Cmatrix_train))/sum(Cmatrix_train)
MR_train

## [1] 0.195

Your observation: Treating class 1 (good credit) as the positive class, the largest error cell is the top right of the confusion matrix, the false positives (FP): 97 applicants were predicted to be good credit risks but are actually bad, compared with 59 false negatives (FN), who were predicted bad but are actually good. For a lender, a false positive is the more costly error. The training MR is 19.5%, meaning 156 of the 800 training observations are misclassified.
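
The misclassification rate hides how the errors are distributed across the two classes. A minimal sketch that recomputes MR together with the false positive and false negative rates from the training confusion matrix printed above; the matrix is typed in by hand, and treating "1" as the positive class is an assumption of this example:

```r
# Training confusion matrix copied from the output above
# (rows = true class, columns = predicted class).
cm <- matrix(c(132, 59, 97, 512), nrow = 2,
             dimnames = list(true = c("0", "1"), pred = c("0", "1")))

# Treat "1" (good credit) as the positive class.
fp <- cm["0", "1"]  # bad applicants predicted good
fn <- cm["1", "0"]  # good applicants predicted bad

mr  <- (fp + fn) / sum(cm)  # misclassification rate
fpr <- fp / sum(cm["0", ])  # share of true bad predicted good
fnr <- fn / sum(cm["1", ])  # share of true good predicted bad

round(c(MR = mr, FPR = fpr, FNR = fnr), 3)
```

On these counts roughly 42% of the truly bad applicants are predicted good, while only about 10% of the truly good applicants are predicted bad.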

4. Use the testing set to get predicted classes. (5pts)

pred_class_test <- predict(svm_model, german_test)
head(pred_class_test)

## 10 13 17 20 41 44 
##  0  1  1  1  1  1 
## Levels: 0 1

Your observation: Same work, now using the testing data.

5. Obtain confusion matrix and MR on testing set. (5pts)

# Confusion matrix to evaluate the model on testing data
Cmatrix_test = table(true = german_test$Class,
                      pred = pred_class_test)
Cmatrix_test

##     pred
## true   0   1
##    0  36  35
##    1  20 109

#Misclassification Rate
MR_test<- 1 - sum(diag(Cmatrix_test))/sum(Cmatrix_test)
MR_test

## [1] 0.275

Your observation: The pattern is similar on the testing set, but the error rate is higher. False positives (35) still outnumber false negatives (20). The testing MR is 27.5%, up from 19.5% on the training set, which is expected because the model was fit to the training data.

Task 3: SVM with weighted class cost, and probabilities enabled (35pts ,each 5pts)

1. Fit a SVM model using the training set with weight of 2 on “1” and weight of 1 on “0”. Please use all variables, but make sure the variable types are right. Also, enable probability fitting with `probability = TRUE`.

german.svm_asymmetric = svm(Class ~ .,
                            data = german_train, 
                            kernel = 'linear',
                            probability = TRUE,
                            class.weights = c("0" = 2, "1" = 1)) # weights differ from logistic regression, where the higher weight goes on the outcome you want more of

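Your observation: I refit the SVM with probability = TRUE so that predicted probabilities are available for the ROC curve and AUC, and with unequal class weights. As coded, the higher weight (2) is on class 0 and the lower weight (1) is on class 1, which is the reverse of the weights stated in the task. Misclassifying a true class 0 observation is therefore penalized more heavily; the intent is to lower FP.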

2. Use the training set to get predicted probabilities and classes.

#get predictions for training
pred_prob_train = predict(german.svm_asymmetric,
                          newdata = german_train,
                          probability = TRUE)

Your observation: predict() with probability = TRUE returns the predicted classes and attaches the class probabilities as an attribute, which is needed later for the ROC curve and AUC.
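
The object returned here holds two things at once, which is easy to lose track of. The sketch below mocks that structure by hand (a factor with a "probabilities" matrix attribute) rather than calling `svm()`, so the numbers and the names `mock_pred`, `prob_good` and `pred_class` are assumptions made purely for illustration:

```r
# Mock of what predict(..., probability = TRUE) returns: a factor of
# predicted classes carrying a "probabilities" matrix attribute.
# The numbers are made up for illustration.
mock_pred <- factor(c("1", "0", "1"), levels = c("0", "1"))
attr(mock_pred, "probabilities") <- matrix(
  c(0.8, 0.3, 0.6,   # column "1"
    0.2, 0.7, 0.4),  # column "0"
  ncol = 2, dimnames = list(NULL, c("1", "0"))
)

# Select the column by name so the result does not depend on column order.
prob_good <- attr(mock_pred, "probabilities")[, "1"]

# Rebuild a plain factor of classes without the attribute.
pred_class <- factor(as.character(mock_pred), levels = c("0", "1"))
```

Keeping the classes and the probabilities in separately named objects avoids having to overwrite the prediction object later.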

3. Obtain confusion matrix and MR on training set (use predicted classes).

Cmatrix_train2 <- table( true = german_train$Class, pred = pred_prob_train)
Cmatrix_train2

##     pred
## true   0   1
##    0 113 116
##    1  41 530

MR_train2 <- 1 - sum(diag(Cmatrix_train2))/sum(Cmatrix_train2)
MR_train2

## [1] 0.19625

Your observation: This is the training confusion matrix for the weighted model. Compared with the unweighted model, false positives rose from 97 to 116 and false negatives fell from 59 to 41, so the weighting did not lower FP as intended; it shifted predictions toward class 1. The training MR is essentially unchanged (19.6% versus 19.5%).
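
When the two error types are not equally costly, a cost-weighted error count can be more informative than the MR. A sketch comparing the two training confusion matrices above under a hypothetical 2:1 cost ratio (one FP costs twice as much as one FN); the ratio is an assumption chosen to mirror the class weights in the code, not a figure from the assignment:

```r
# FP and FN counts copied from the two training confusion matrices above.
errors <- data.frame(
  model = c("unweighted", "weighted"),
  fp    = c(97, 116),  # true 0 predicted 1
  fn    = c(59, 41)    # true 1 predicted 0
)

# Hypothetical cost ratio: one FP costs twice as much as one FN.
cost_fp <- 2
cost_fn <- 1

errors$total_cost <- cost_fp * errors$fp + cost_fn * errors$fn
errors
```

Under this assumed cost ratio the total is 253 for the unweighted model and 273 for the weighted one.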

4. Obtain ROC and AUC on training set (use predicted probabilities).

# run only once: this overwrites pred_prob_train with the probability vector
pred_prob_train = attr(pred_prob_train, "probabilities")[, "1"] # select the column for class 1 by name, since the first column may be class 0

#load the ROCR package to get the ROC
library(ROCR)

## Warning: package 'ROCR' was built under R version 4.3.3

pred <- prediction(pred_prob_train, german_train$Class)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)

#to get the AUC
unlist(slot(performance(pred, "auc"), "y.values"))

## [1] 0.8498153

Your observation: Using the ROCR package, I plotted the ROC curve for the training set. The curve is jagged but rises well above the diagonal, which indicates that the predicted probabilities separate the two classes well. The AUC summarizes this: at 0.85 the model is a good classifier on the training data, though below the 0.90 level usually described as excellent.
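
The AUC can be read as the probability that a randomly chosen class 1 observation receives a higher score than a randomly chosen class 0 observation. A minimal base R sketch of that rank-based calculation on made-up toy scores; `auc_rank` is a helper defined here for illustration and is not part of ROCR:

```r
# Toy labels and scores (made up); 1 is the positive class.
labels <- c(0, 0, 1, 1)
scores <- c(0.1, 0.4, 0.35, 0.8)

auc_rank <- function(labels, scores) {
  r     <- rank(scores)  # average ranks handle ties
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  # Mann-Whitney U statistic of the positives, scaled to [0, 1].
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

auc_rank(labels, scores)
```

In the toy data three of the four positive-negative pairs are ordered correctly, which gives an AUC of 0.75.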

5. Use the testing set to get predicted probabilities and classes.

#obtain testing pred_prob
pred_prob_test = predict(german.svm_asymmetric,
                         newdata = german_test,
                         probability = TRUE)

Your observation: The same step, now using the testing set to get the predicted probabilities and classes.

6. Obtain confusion matrix and MR on testing set. (use predicted classes).

Cmatrix_test2 <- table( true = german_test$Class, pred = pred_prob_test)
Cmatrix_test2

##     pred
## true   0   1
##    0  25  46
##    1  10 119

MR_test2 <- 1 - sum(diag(Cmatrix_test2))/sum(Cmatrix_test2)
MR_test2

## [1] 0.28

Your observation: On the testing set the results are worse, which is expected because the model was fit to the training data. However, the MR rose from 19.6% on the training set to 28% on the testing set, which is a concerning gap. False positives (46) again far outnumber false negatives (10).

7. Obtain ROC and AUC on testing set. (use predicted probabilities).

# required before the ROC step; run only once because it overwrites pred_prob_test
pred_prob_test = attr(pred_prob_test, "probabilities")[, "1"]

#ROC
library(ROCR)
pred <- prediction(pred_prob_test, german_test$Class)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)

#Get the AUC
unlist(slot(performance(pred, "auc"), "y.values"))

## [1] 0.7421116

Your observation: The testing AUC of 0.74 is lower than the training AUC of 0.85, so the model drops from a “good” to a “fair” classifier on unseen data.

Task 4: Conclusion (15pts)

1. Summarize your findings and discuss what you observed from the above analysis. (5pts)

Using the German credit data with Class as the response (0 = bad credit risk, 1 = good credit risk), I split the data 80-20 into training and testing sets and fit SVM models to classify Class from all other variables. The linear SVM has 391 support vectors, 197 from class “0” and 194 from class “1”. That is almost half of the training observations, which suggests the classes overlap considerably.

After classifying the training and testing sets with equal weights, most of the errors are false positives (FP): we predict that people are good credit risks when they are actually bad, which is worse for lenders than a false negative (FN).

Next I fit the SVM with unequal class weights, 2 on “0” and 1 on “1”, with the aim of lowering FP at the expense of more FN. The confusion matrices show the opposite: on both the training and testing sets FP increased (97 to 116 and 35 to 46) while FN decreased (59 to 41 and 20 to 10), and the MR barely changed. Because probabilities were enabled, I could also obtain the ROC curve and AUC for each set. The training AUC is 0.85 (a good classifier), but the testing AUC is only 0.74 (a fair classifier).
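
The gap between training and testing performance can be summarized in one place. A short sketch that uses only the MR and AUC values printed in the outputs above; the data frame layout and column names are illustrative choices:

```r
# Misclassification rates copied from the outputs above.
results <- data.frame(
  model    = c("unweighted", "weighted"),
  mr_train = c(0.195, 0.19625),
  mr_test  = c(0.275, 0.28)
)
results$mr_gap <- results$mr_test - results$mr_train

# AUC was only computed for the weighted model.
auc_gap <- 0.8498153 - 0.7421116

results
auc_gap
```

Both models lose about 8 percentage points of accuracy on the testing set, and the weighted model's AUC falls by roughly 0.11.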

2. Please recall the results from last homework, how do you compare SVM to logistic regression? No coding is required for this question, just discuss. (10pts)

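In logistic regression, when assigning weights we put the higher weight on the outcome we want to predict better. In SVM, class.weights scales the misclassification penalty for each class, so a higher weight on a class makes errors on that class more costly and should reduce them. In this analysis, however, the weight of 2 on “0” came with more FP rather than fewer, so the weighted SVM did not deliver the reduction in FP that I was aiming for.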

Homework5

Chance Jones

11/12/2024

3. (Optional) Change the kernel to others such as `radial`, and see if you got a better result.
