Starter code for German credit scoring

Refer to http://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data)) for variable description. The response variable is Class and all others are predictors.

Only run the following code once to install the package caret. The German credit scoring data in provided in that package.

install.packages('caret')

Task1: Data Preparation

1. Load the caret package and the GermanCredit dataset. (10pts)

library(caret) #this package contains the german data with its numeric format

## Loading required package: ggplot2

## Loading required package: lattice

data(GermanCredit)
GermanCredit$Class <-  as.numeric(GermanCredit$Class == "Good") # use this code to convert `Class` into True or False (equivalent to 1 or 0)
GermanCredit$Class <- as.factor(GermanCredit$Class) #make sure `Class` is a factor as SVM require a factor response，now 1 is good and 0 is bad.
str(GermanCredit)

## 'data.frame':    1000 obs. of  62 variables:
##  $ Duration                              : int  6 48 12 42 24 36 24 36 12 30 ...
##  $ Amount                                : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
##  $ InstallmentRatePercentage             : int  4 2 2 2 3 2 3 2 2 4 ...
##  $ ResidenceDuration                     : int  4 2 3 4 4 4 4 2 4 2 ...
##  $ Age                                   : int  67 22 49 45 53 35 53 35 61 28 ...
##  $ NumberExistingCredits                 : int  2 1 1 1 2 1 1 1 1 2 ...
##  $ NumberPeopleMaintenance               : int  1 1 2 2 2 2 1 1 1 1 ...
##  $ Telephone                             : num  0 1 1 1 1 0 1 0 1 1 ...
##  $ ForeignWorker                         : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Class                                 : Factor w/ 2 levels "0","1": 2 1 2 2 1 2 2 2 2 1 ...
##  $ CheckingAccountStatus.lt.0            : num  1 0 0 1 1 0 0 0 0 0 ...
##  $ CheckingAccountStatus.0.to.200        : num  0 1 0 0 0 0 0 1 0 1 ...
##  $ CheckingAccountStatus.gt.200          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CheckingAccountStatus.none            : num  0 0 1 0 0 1 1 0 1 0 ...
##  $ CreditHistory.NoCredit.AllPaid        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CreditHistory.ThisBank.AllPaid        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CreditHistory.PaidDuly                : num  0 1 0 1 0 1 1 1 1 0 ...
##  $ CreditHistory.Delay                   : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ CreditHistory.Critical                : num  1 0 1 0 0 0 0 0 0 1 ...
##  $ Purpose.NewCar                        : num  0 0 0 0 1 0 0 0 0 1 ...
##  $ Purpose.UsedCar                       : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ Purpose.Furniture.Equipment           : num  0 0 0 1 0 0 1 0 0 0 ...
##  $ Purpose.Radio.Television              : num  1 1 0 0 0 0 0 0 1 0 ...
##  $ Purpose.DomesticAppliance             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Repairs                       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Education                     : num  0 0 1 0 0 1 0 0 0 0 ...
##  $ Purpose.Vacation                      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Retraining                    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Business                      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Other                         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SavingsAccountBonds.lt.100            : num  0 1 1 1 1 0 0 1 0 1 ...
##  $ SavingsAccountBonds.100.to.500        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SavingsAccountBonds.500.to.1000       : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ SavingsAccountBonds.gt.1000           : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ SavingsAccountBonds.Unknown           : num  1 0 0 0 0 1 0 0 0 0 ...
##  $ EmploymentDuration.lt.1               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ EmploymentDuration.1.to.4             : num  0 1 0 0 1 1 0 1 0 0 ...
##  $ EmploymentDuration.4.to.7             : num  0 0 1 1 0 0 0 0 1 0 ...
##  $ EmploymentDuration.gt.7               : num  1 0 0 0 0 0 1 0 0 0 ...
##  $ EmploymentDuration.Unemployed         : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ Personal.Male.Divorced.Seperated      : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ Personal.Female.NotSingle             : num  0 1 0 0 0 0 0 0 0 0 ...
##  $ Personal.Male.Single                  : num  1 0 1 1 1 1 1 1 0 0 ...
##  $ Personal.Male.Married.Widowed         : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ Personal.Female.Single                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherDebtorsGuarantors.None           : num  1 1 1 0 1 1 1 1 1 1 ...
##  $ OtherDebtorsGuarantors.CoApplicant    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherDebtorsGuarantors.Guarantor      : num  0 0 0 1 0 0 0 0 0 0 ...
##  $ Property.RealEstate                   : num  1 1 1 0 0 0 0 0 1 0 ...
##  $ Property.Insurance                    : num  0 0 0 1 0 0 1 0 0 0 ...
##  $ Property.CarOther                     : num  0 0 0 0 0 0 0 1 0 1 ...
##  $ Property.Unknown                      : num  0 0 0 0 1 1 0 0 0 0 ...
##  $ OtherInstallmentPlans.Bank            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherInstallmentPlans.Stores          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherInstallmentPlans.None            : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Housing.Rent                          : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ Housing.Own                           : num  1 1 1 0 0 0 1 0 1 1 ...
##  $ Housing.ForFree                       : num  0 0 0 1 1 1 0 0 0 0 ...
##  $ Job.UnemployedUnskilled               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Job.UnskilledResident                 : num  0 0 1 0 0 1 0 0 1 0 ...
##  $ Job.SkilledEmployee                   : num  1 1 0 1 1 0 1 0 0 0 ...
##  $ Job.Management.SelfEmp.HighlyQualified: num  0 0 0 0 0 0 0 1 0 1 ...

# This is the code that drop variables that provide no information in the data
# Just run it
GermanCredit = GermanCredit[,-c(14,19,27,30,35,40,44,45,48,52,55,58,62)]

2. Explore the dataset to understand its structure. It’s okay to use same code from last homework. (5pts)

# view structure of dataset
str(GermanCredit)

## 'data.frame':    1000 obs. of  49 variables:
##  $ Duration                          : int  6 48 12 42 24 36 24 36 12 30 ...
##  $ Amount                            : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
##  $ InstallmentRatePercentage         : int  4 2 2 2 3 2 3 2 2 4 ...
##  $ ResidenceDuration                 : int  4 2 3 4 4 4 4 2 4 2 ...
##  $ Age                               : int  67 22 49 45 53 35 53 35 61 28 ...
##  $ NumberExistingCredits             : int  2 1 1 1 2 1 1 1 1 2 ...
##  $ NumberPeopleMaintenance           : int  1 1 2 2 2 2 1 1 1 1 ...
##  $ Telephone                         : num  0 1 1 1 1 0 1 0 1 1 ...
##  $ ForeignWorker                     : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Class                             : Factor w/ 2 levels "0","1": 2 1 2 2 1 2 2 2 2 1 ...
##  $ CheckingAccountStatus.lt.0        : num  1 0 0 1 1 0 0 0 0 0 ...
##  $ CheckingAccountStatus.0.to.200    : num  0 1 0 0 0 0 0 1 0 1 ...
##  $ CheckingAccountStatus.gt.200      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CreditHistory.NoCredit.AllPaid    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CreditHistory.ThisBank.AllPaid    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CreditHistory.PaidDuly            : num  0 1 0 1 0 1 1 1 1 0 ...
##  $ CreditHistory.Delay               : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ Purpose.NewCar                    : num  0 0 0 0 1 0 0 0 0 1 ...
##  $ Purpose.UsedCar                   : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ Purpose.Furniture.Equipment       : num  0 0 0 1 0 0 1 0 0 0 ...
##  $ Purpose.Radio.Television          : num  1 1 0 0 0 0 0 0 1 0 ...
##  $ Purpose.DomesticAppliance         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Repairs                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Education                 : num  0 0 1 0 0 1 0 0 0 0 ...
##  $ Purpose.Retraining                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Business                  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SavingsAccountBonds.lt.100        : num  0 1 1 1 1 0 0 1 0 1 ...
##  $ SavingsAccountBonds.100.to.500    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SavingsAccountBonds.500.to.1000   : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ SavingsAccountBonds.gt.1000       : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ EmploymentDuration.lt.1           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ EmploymentDuration.1.to.4         : num  0 1 0 0 1 1 0 1 0 0 ...
##  $ EmploymentDuration.4.to.7         : num  0 0 1 1 0 0 0 0 1 0 ...
##  $ EmploymentDuration.gt.7           : num  1 0 0 0 0 0 1 0 0 0 ...
##  $ Personal.Male.Divorced.Seperated  : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ Personal.Female.NotSingle         : num  0 1 0 0 0 0 0 0 0 0 ...
##  $ Personal.Male.Single              : num  1 0 1 1 1 1 1 1 0 0 ...
##  $ OtherDebtorsGuarantors.None       : num  1 1 1 0 1 1 1 1 1 1 ...
##  $ OtherDebtorsGuarantors.CoApplicant: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Property.RealEstate               : num  1 1 1 0 0 0 0 0 1 0 ...
##  $ Property.Insurance                : num  0 0 0 1 0 0 1 0 0 0 ...
##  $ Property.CarOther                 : num  0 0 0 0 0 0 0 1 0 1 ...
##  $ OtherInstallmentPlans.Bank        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherInstallmentPlans.Stores      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Housing.Rent                      : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ Housing.Own                       : num  1 1 1 0 0 0 1 0 1 1 ...
##  $ Job.UnemployedUnskilled           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Job.UnskilledResident             : num  0 0 1 0 0 1 0 0 1 0 ...
##  $ Job.SkilledEmployee               : num  1 1 0 1 1 0 1 0 0 0 ...

# summary statistics
summary(GermanCredit)

##     Duration        Amount      InstallmentRatePercentage ResidenceDuration
##  Min.   : 4.0   Min.   :  250   Min.   :1.000             Min.   :1.000    
##  1st Qu.:12.0   1st Qu.: 1366   1st Qu.:2.000             1st Qu.:2.000    
##  Median :18.0   Median : 2320   Median :3.000             Median :3.000    
##  Mean   :20.9   Mean   : 3271   Mean   :2.973             Mean   :2.845    
##  3rd Qu.:24.0   3rd Qu.: 3972   3rd Qu.:4.000             3rd Qu.:4.000    
##  Max.   :72.0   Max.   :18424   Max.   :4.000             Max.   :4.000    
##       Age        NumberExistingCredits NumberPeopleMaintenance   Telephone    
##  Min.   :19.00   Min.   :1.000         Min.   :1.000           Min.   :0.000  
##  1st Qu.:27.00   1st Qu.:1.000         1st Qu.:1.000           1st Qu.:0.000  
##  Median :33.00   Median :1.000         Median :1.000           Median :1.000  
##  Mean   :35.55   Mean   :1.407         Mean   :1.155           Mean   :0.596  
##  3rd Qu.:42.00   3rd Qu.:2.000         3rd Qu.:1.000           3rd Qu.:1.000  
##  Max.   :75.00   Max.   :4.000         Max.   :2.000           Max.   :1.000  
##  ForeignWorker   Class   CheckingAccountStatus.lt.0
##  Min.   :0.000   0:300   Min.   :0.000             
##  1st Qu.:1.000   1:700   1st Qu.:0.000             
##  Median :1.000           Median :0.000             
##  Mean   :0.963           Mean   :0.274             
##  3rd Qu.:1.000           3rd Qu.:1.000             
##  Max.   :1.000           Max.   :1.000             
##  CheckingAccountStatus.0.to.200 CheckingAccountStatus.gt.200
##  Min.   :0.000                  Min.   :0.000               
##  1st Qu.:0.000                  1st Qu.:0.000               
##  Median :0.000                  Median :0.000               
##  Mean   :0.269                  Mean   :0.063               
##  3rd Qu.:1.000                  3rd Qu.:0.000               
##  Max.   :1.000                  Max.   :1.000               
##  CreditHistory.NoCredit.AllPaid CreditHistory.ThisBank.AllPaid
##  Min.   :0.00                   Min.   :0.000                 
##  1st Qu.:0.00                   1st Qu.:0.000                 
##  Median :0.00                   Median :0.000                 
##  Mean   :0.04                   Mean   :0.049                 
##  3rd Qu.:0.00                   3rd Qu.:0.000                 
##  Max.   :1.00                   Max.   :1.000                 
##  CreditHistory.PaidDuly CreditHistory.Delay Purpose.NewCar  Purpose.UsedCar
##  Min.   :0.00           Min.   :0.000       Min.   :0.000   Min.   :0.000  
##  1st Qu.:0.00           1st Qu.:0.000       1st Qu.:0.000   1st Qu.:0.000  
##  Median :1.00           Median :0.000       Median :0.000   Median :0.000  
##  Mean   :0.53           Mean   :0.088       Mean   :0.234   Mean   :0.103  
##  3rd Qu.:1.00           3rd Qu.:0.000       3rd Qu.:0.000   3rd Qu.:0.000  
##  Max.   :1.00           Max.   :1.000       Max.   :1.000   Max.   :1.000  
##  Purpose.Furniture.Equipment Purpose.Radio.Television Purpose.DomesticAppliance
##  Min.   :0.000               Min.   :0.00             Min.   :0.000            
##  1st Qu.:0.000               1st Qu.:0.00             1st Qu.:0.000            
##  Median :0.000               Median :0.00             Median :0.000            
##  Mean   :0.181               Mean   :0.28             Mean   :0.012            
##  3rd Qu.:0.000               3rd Qu.:1.00             3rd Qu.:0.000            
##  Max.   :1.000               Max.   :1.00             Max.   :1.000            
##  Purpose.Repairs Purpose.Education Purpose.Retraining Purpose.Business
##  Min.   :0.000   Min.   :0.00      Min.   :0.000      Min.   :0.000   
##  1st Qu.:0.000   1st Qu.:0.00      1st Qu.:0.000      1st Qu.:0.000   
##  Median :0.000   Median :0.00      Median :0.000      Median :0.000   
##  Mean   :0.022   Mean   :0.05      Mean   :0.009      Mean   :0.097   
##  3rd Qu.:0.000   3rd Qu.:0.00      3rd Qu.:0.000      3rd Qu.:0.000   
##  Max.   :1.000   Max.   :1.00      Max.   :1.000      Max.   :1.000   
##  SavingsAccountBonds.lt.100 SavingsAccountBonds.100.to.500
##  Min.   :0.000              Min.   :0.000                 
##  1st Qu.:0.000              1st Qu.:0.000                 
##  Median :1.000              Median :0.000                 
##  Mean   :0.603              Mean   :0.103                 
##  3rd Qu.:1.000              3rd Qu.:0.000                 
##  Max.   :1.000              Max.   :1.000                 
##  SavingsAccountBonds.500.to.1000 SavingsAccountBonds.gt.1000
##  Min.   :0.000                   Min.   :0.000              
##  1st Qu.:0.000                   1st Qu.:0.000              
##  Median :0.000                   Median :0.000              
##  Mean   :0.063                   Mean   :0.048              
##  3rd Qu.:0.000                   3rd Qu.:0.000              
##  Max.   :1.000                   Max.   :1.000              
##  EmploymentDuration.lt.1 EmploymentDuration.1.to.4 EmploymentDuration.4.to.7
##  Min.   :0.000           Min.   :0.000             Min.   :0.000            
##  1st Qu.:0.000           1st Qu.:0.000             1st Qu.:0.000            
##  Median :0.000           Median :0.000             Median :0.000            
##  Mean   :0.172           Mean   :0.339             Mean   :0.174            
##  3rd Qu.:0.000           3rd Qu.:1.000             3rd Qu.:0.000            
##  Max.   :1.000           Max.   :1.000             Max.   :1.000            
##  EmploymentDuration.gt.7 Personal.Male.Divorced.Seperated
##  Min.   :0.000           Min.   :0.00                    
##  1st Qu.:0.000           1st Qu.:0.00                    
##  Median :0.000           Median :0.00                    
##  Mean   :0.253           Mean   :0.05                    
##  3rd Qu.:1.000           3rd Qu.:0.00                    
##  Max.   :1.000           Max.   :1.00                    
##  Personal.Female.NotSingle Personal.Male.Single OtherDebtorsGuarantors.None
##  Min.   :0.00              Min.   :0.000        Min.   :0.000              
##  1st Qu.:0.00              1st Qu.:0.000        1st Qu.:1.000              
##  Median :0.00              Median :1.000        Median :1.000              
##  Mean   :0.31              Mean   :0.548        Mean   :0.907              
##  3rd Qu.:1.00              3rd Qu.:1.000        3rd Qu.:1.000              
##  Max.   :1.00              Max.   :1.000        Max.   :1.000              
##  OtherDebtorsGuarantors.CoApplicant Property.RealEstate Property.Insurance
##  Min.   :0.000                      Min.   :0.000       Min.   :0.000     
##  1st Qu.:0.000                      1st Qu.:0.000       1st Qu.:0.000     
##  Median :0.000                      Median :0.000       Median :0.000     
##  Mean   :0.041                      Mean   :0.282       Mean   :0.232     
##  3rd Qu.:0.000                      3rd Qu.:1.000       3rd Qu.:0.000     
##  Max.   :1.000                      Max.   :1.000       Max.   :1.000     
##  Property.CarOther OtherInstallmentPlans.Bank OtherInstallmentPlans.Stores
##  Min.   :0.000     Min.   :0.000              Min.   :0.000               
##  1st Qu.:0.000     1st Qu.:0.000              1st Qu.:0.000               
##  Median :0.000     Median :0.000              Median :0.000               
##  Mean   :0.332     Mean   :0.139              Mean   :0.047               
##  3rd Qu.:1.000     3rd Qu.:0.000              3rd Qu.:0.000               
##  Max.   :1.000     Max.   :1.000              Max.   :1.000               
##   Housing.Rent    Housing.Own    Job.UnemployedUnskilled Job.UnskilledResident
##  Min.   :0.000   Min.   :0.000   Min.   :0.000           Min.   :0.0          
##  1st Qu.:0.000   1st Qu.:0.000   1st Qu.:0.000           1st Qu.:0.0          
##  Median :0.000   Median :1.000   Median :0.000           Median :0.0          
##  Mean   :0.179   Mean   :0.713   Mean   :0.022           Mean   :0.2          
##  3rd Qu.:0.000   3rd Qu.:1.000   3rd Qu.:0.000           3rd Qu.:0.0          
##  Max.   :1.000   Max.   :1.000   Max.   :1.000           Max.   :1.0          
##  Job.SkilledEmployee
##  Min.   :0.00       
##  1st Qu.:0.00       
##  Median :1.00       
##  Mean   :0.63       
##  3rd Qu.:1.00       
##  Max.   :1.00

# check dimensions
dim(GermanCredit)

## [1] 1000   49

# check class distribution
table(GermanCredit$Class)

## 
##   0   1 
## 300 700

Your observation: dataset contains 1000 observations and 49 variables. response variable Class is binary: levels 0 and 1

3. Split the dataset into training and test set with 80-20 split. Please use the random seed as `2024` for reproducibility. (5pts)

# set seed for reproducibility
set.seed(2024)

# create index for training set (80%)
train_index <- sample(1:nrow(GermanCredit), 0.8 * nrow(GermanCredit))

# split dataset
train_data <- GermanCredit[train_index, ]
test_data <- GermanCredit[-train_index, ]

# check dimensions
dim(train_data)

## [1] 800  49

dim(test_data)

## [1] 200  49

# check class distribution
table(train_data$Class)

## 
##   0   1 
## 229 571

table(test_data$Class)

## 
##   0   1 
##  71 129

Your observation: more observations in class 1 than class 0 showing an imbalance.

Task 2: SVM without weighted class cost (30pts)

1. Fit a SVM model using the training set with linear kernel. Please use all variables, but make sure the variable types are right. If running on old laptop, could take some time! (10pts)

library(e1071)

## 
## Attaching package: 'e1071'

## The following object is masked from 'package:ggplot2':
## 
##     element

# make sure response variable is a factor
train_data$Class <- as.factor(train_data$Class)
test_data$Class <- as.factor(test_data$Class)

# fit SVM model with linear kernel
svm_model <- svm(Class ~ ., data = train_data, kernel = "linear")

# view model
svm_model

## 
## Call:
## svm(formula = Class ~ ., data = train_data, kernel = "linear")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  1 
## 
## Number of Support Vectors:  391

Your observation: model identified 391 support vectors.

2. Use the training set to get prediected classes. (5pts)

# predicted classes for training set
train_pred <- predict(svm_model, train_data)

# view predictions
head(train_pred)

## 578 549 557 700 255 913 
##   1   0   0   1   1   1 
## Levels: 0 1

Your observation: model predicted majority class 1 observations which reflects the class imbalance

3. Obtain confusion matrix and MR on training set. (5pts)

#predicted classes on training set
train_pred <- predict(svm_model, newdata = train_data)

#confusion matrix
table(train_data$Class, train_pred)

##    train_pred
##       0   1
##   0 132  97
##   1  59 512

#MR
mean(train_pred != train_data$Class)

## [1] 0.195

Your observation: the missclassification rate of .195 means the model incorrectly classified 19.5% of the observations

4. Use the testing set to get prediected classes. (5pts)

#predicted classes on testing set
test_pred <- predict(svm_model, newdata = test_data)

#view 
table(test_pred)

## test_pred
##   0   1 
##  56 144

Your observation: the model predicted mostly class 1 similar to the training data because class 0 is the majority class

5. Obtain confusion matrix and MR on testing set. (5pts)

# confusion matrix
table(test_data$Class, test_pred)

##    test_pred
##       0   1
##   0  36  35
##   1  20 109

# misclassification rate (MR)
mean(test_pred != test_data$Class)

## [1] 0.275

Your observation: the model incorrectly classified 27.5% of the observations, higher than on the training set showing the model performs slightly worse on unseen data.

Task 3: SVM with weighted class cost, and probabilities enabled (35pts ,each 5pts)

1. Fit a SVM model using the training set with weight of 2 on “1” and weight of 1 on “0”. Please use all variables, but make sure the variable types are right. Also, enable probability fitting with `probability = TRUE`.

library(e1071)

#response variable is a factor
train_data$Class <- as.factor(train_data$Class)
test_data$Class <- as.factor(test_data$Class)

#weighted SVM model with probabilities enabled
svm_weighted <- svm(Class ~ .,
                    data = train_data,
                    kernel = "linear",
                    class.weights = c("0" = 1, "1" = 2),
                    probability = TRUE)

#view 
svm_weighted

## 
## Call:
## svm(formula = Class ~ ., data = train_data, kernel = "linear", class.weights = c(`0` = 1, 
##     `1` = 2), probability = TRUE)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  1 
## 
## Number of Support Vectors:  359

Your observation: model identified 359 support vectors, a slight decrease, indicating the decision boundary changed after applying weights.

2. Use the training set to get prediected probabilities and classes.

# predicted classes (training set)
train_pred_weighted <- predict(svm_weighted, train_data)

# predicted probabilities (training set)
train_prob_weighted <- attr(
  predict(svm_weighted, train_data, probability = TRUE),
  "probabilities"
)

# view first few results
head(train_pred_weighted)

## 578 549 557 700 255 913 
##   1   0   1   1   1   1 
## Levels: 0 1

head(train_prob_weighted)

##             1          0
## 578 0.8955952 0.10440480
## 549 0.2956665 0.70433348
## 557 0.4643401 0.53565991
## 700 0.6952495 0.30475053
## 255 0.9410486 0.05895141
## 913 0.7893343 0.21066567

Your observation: the probability values indicate the model’s confidence in which class to assign observations to. observations show the model is generally more confident when predicting class 1

3. Obtain confusion matrix and MR on training set (use predicted classes).

# confusion matrix (training set)
table(train_data$Class, train_pred_weighted)

##    train_pred_weighted
##       0   1
##   0  51 178
##   1   9 562

# misclassification rate (training set)
mean(train_pred_weighted != train_data$Class)

## [1] 0.23375

Your observation: the misclassification rate of 0.23375 is higher than the unweighted model, indicating that adding more weight to class 1 increased the training error

4. Obtain ROC and AUC on training set (use predicted probabilities).

library(pROC)

## Type 'citation("pROC")' for a citation.

## 
## Attaching package: 'pROC'

## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var

# predicted probabilities for class 1
train_prob_weighted <- attr(predict(svm_weighted, train_data, probability = TRUE), "probabilities")[, "1"]

# ROC and AUC
ROC_train <- roc(train_data$Class, train_prob_weighted)

## Setting levels: control = 0, case = 1

## Setting direction: controls < cases

plot(ROC_train)

auc(ROC_train)

## Area under the curve: 0.8336

Your observation: AUC value of about 0.8336 indicates the model does a decent job of distinguishing between class 1 and 0 on the training set.

5. Use the testing set to get prediected probabilities and classes.

#predicted classes (testing set)
test_pred_weighted <- predict(svm_weighted, test_data)

#predicted probabilities (testing set)
test_prob_weighted <- attr(
  predict(svm_weighted, test_data, probability = TRUE),
  "probabilities"
)

# view first few results
head(test_pred_weighted)

## 10 13 17 20 41 44 
##  1  1  1  1  1  1 
## Levels: 0 1

head(test_prob_weighted)

##            1         0
## 10 0.5358289 0.4641711
## 13 0.8314207 0.1685793
## 17 0.8768769 0.1231231
## 20 0.8561473 0.1438527
## 41 0.7491336 0.2508664
## 44 0.7215988 0.2784012

Your observation: this shows the model’s confidence in assigning observations to each class, and the probability for 1 is still significantly higher like in the training set

6. Obtain confusion matrix and MR on testing set. (use predicted classes).

# confusion matrix (testing set)
table(test_data$Class, test_pred_weighted)

##    test_pred_weighted
##       0   1
##   0  20  51
##   1   7 122

# misclassification rate (MR)
mean(test_pred_weighted != test_data$Class)

## [1] 0.29

Your observation: misclassification rate of about 0.29 is higher than the training set, showing the model performs slightly worse on unseen data.

7. Obtain ROC and AUC on testing set. (use predicted probabilities).

library(pROC)

# get probability of class 1
test_prob_class1 <- test_prob_weighted[, "1"]

# create ROC curve
ROC_test <- roc(test_data$Class, test_prob_class1)

## Setting levels: control = 0, case = 1

## Setting direction: controls < cases

# plot ROC curve
plot(ROC_test)

# get AUC value
auc(ROC_test)

## Area under the curve: 0.7122

Your observation: AUC value of 0.7122 is slightly smaller, but similar to the training set result, showing the model performs a little worse, but stll fairly well on new data.

Task 4: Conclusion (15pts)

1. Summarize your findings and discuss what you observed from the above analysis. (5pts)

Adding class weights changed the SVM model so it placed more emphasis on Class 1. The weighted model showed good separation on the training set, but its testing misclassification rate was higher than its training error, and the testing AUC was lower than the training AUC. This shows the model still performed reasonably well, but it worked better on the training data than on new data. ### 2. Please recall the results from last homework, how do you compare SVM to logistic regression? No coding is required for this question, just discuss. (10pts) logistic regression performed slightly better overall. It had lower misclassification rates and higher AUC values on both the training and testing sets. While the weighted SVM helped address the class imbalance by improving prediction of Class 1, logistic regression provided stronger overall classification performance for this dataset. ### 3. (Optional) Change th kernel to others such as radial, and see if you got a better result.

Homework6

Sydney Thedin

04/12/2026

Starter code for German credit scoring

Task1: Data Preparation

1. Load the caret package and the GermanCredit dataset. (10pts)

2. Explore the dataset to understand its structure. It’s okay to use same code from last homework. (5pts)

3. Split the dataset into training and test set with 80-20 split. Please use the random seed as `2024` for reproducibility. (5pts)

Task 2: SVM without weighted class cost (30pts)

1. Fit a SVM model using the training set with linear kernel. Please use all variables, but make sure the variable types are right. If running on old laptop, could take some time! (10pts)

2. Use the training set to get prediected classes. (5pts)

3. Obtain confusion matrix and MR on training set. (5pts)

4. Use the testing set to get prediected classes. (5pts)

5. Obtain confusion matrix and MR on testing set. (5pts)

Task 3: SVM with weighted class cost, and probabilities enabled (35pts ,each 5pts)

1. Fit a SVM model using the training set with weight of 2 on “1” and weight of 1 on “0”. Please use all variables, but make sure the variable types are right. Also, enable probability fitting with `probability = TRUE`.

2. Use the training set to get prediected probabilities and classes.

3. Obtain confusion matrix and MR on training set (use predicted classes).

4. Obtain ROC and AUC on training set (use predicted probabilities).

5. Use the testing set to get prediected probabilities and classes.

6. Obtain confusion matrix and MR on testing set. (use predicted classes).

7. Obtain ROC and AUC on testing set. (use predicted probabilities).

Task 4: Conclusion (15pts)

1. Summarize your findings and discuss what you observed from the above analysis. (5pts)

Homework6

Sydney Thedin

04/12/2026

Starter code for German credit scoring

Task1: Data Preparation

1. Load the caret package and the GermanCredit dataset. (10pts)

2. Explore the dataset to understand its structure. It’s okay to use same code from last homework. (5pts)

3. Split the dataset into training and test set with 80-20 split. Please use the random seed as 2024 for reproducibility. (5pts)

Task 2: SVM without weighted class cost (30pts)

1. Fit a SVM model using the training set with linear kernel. Please use all variables, but make sure the variable types are right. If running on old laptop, could take some time! (10pts)

2. Use the training set to get prediected classes. (5pts)

3. Obtain confusion matrix and MR on training set. (5pts)

4. Use the testing set to get prediected classes. (5pts)

5. Obtain confusion matrix and MR on testing set. (5pts)

Task 3: SVM with weighted class cost, and probabilities enabled (35pts ,each 5pts)

1. Fit a SVM model using the training set with weight of 2 on “1” and weight of 1 on “0”. Please use all variables, but make sure the variable types are right. Also, enable probability fitting with probability = TRUE.

2. Use the training set to get prediected probabilities and classes.

3. Obtain confusion matrix and MR on training set (use predicted classes).

4. Obtain ROC and AUC on training set (use predicted probabilities).

5. Use the testing set to get prediected probabilities and classes.

6. Obtain confusion matrix and MR on testing set. (use predicted classes).

7. Obtain ROC and AUC on testing set. (use predicted probabilities).

Task 4: Conclusion (15pts)

1. Summarize your findings and discuss what you observed from the above analysis. (5pts)

3. Split the dataset into training and test set with 80-20 split. Please use the random seed as `2024` for reproducibility. (5pts)

1. Fit a SVM model using the training set with weight of 2 on “1” and weight of 1 on “0”. Please use all variables, but make sure the variable types are right. Also, enable probability fitting with `probability = TRUE`.