---
title: "Homework6"
author: "Errol Moore"
date: "10/22/2023"
output: html_document
---

Starter code for German credit scoring

Refer to http://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data) for the variable descriptions. The response variable is Class and all others are predictors.

Only run the following code once to install the caret package. The German credit scoring data is provided in that package.

install.packages('caret')
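
If you would rather not remember to comment this line out after the first run, a guarded install is a common alternative (a minimal sketch):

# Install caret only if it is not already available (safe to leave in the script)
if (!requireNamespace("caret", quietly = TRUE)) {
  install.packages("caret")
}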

Task 1: Data Preparation

1. Load the caret package and the GermanCredit dataset. (10pts)

library(caret) # this package contains the German credit data in numeric (dummy-coded) format
## Loading required package: ggplot2
## Loading required package: lattice
data(GermanCredit)
GermanCredit$Class <-  as.numeric(GermanCredit$Class == "Good") # convert `Class` to 1 (Good) / 0 (Bad)
GermanCredit$Class <- as.factor(GermanCredit$Class) # make sure `Class` is a factor, since svm() requires a factor response; now 1 is Good and 0 is Bad
str(GermanCredit)
## 'data.frame':    1000 obs. of  62 variables:
##  $ Duration                              : int  6 48 12 42 24 36 24 36 12 30 ...
##  $ Amount                                : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
##  $ InstallmentRatePercentage             : int  4 2 2 2 3 2 3 2 2 4 ...
##  $ ResidenceDuration                     : int  4 2 3 4 4 4 4 2 4 2 ...
##  $ Age                                   : int  67 22 49 45 53 35 53 35 61 28 ...
##  $ NumberExistingCredits                 : int  2 1 1 1 2 1 1 1 1 2 ...
##  $ NumberPeopleMaintenance               : int  1 1 2 2 2 2 1 1 1 1 ...
##  $ Telephone                             : num  0 1 1 1 1 0 1 0 1 1 ...
##  $ ForeignWorker                         : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Class                                 : Factor w/ 2 levels "0","1": 2 1 2 2 1 2 2 2 2 1 ...
##  $ CheckingAccountStatus.lt.0            : num  1 0 0 1 1 0 0 0 0 0 ...
##  $ CheckingAccountStatus.0.to.200        : num  0 1 0 0 0 0 0 1 0 1 ...
##  $ CheckingAccountStatus.gt.200          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CheckingAccountStatus.none            : num  0 0 1 0 0 1 1 0 1 0 ...
##  $ CreditHistory.NoCredit.AllPaid        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CreditHistory.ThisBank.AllPaid        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CreditHistory.PaidDuly                : num  0 1 0 1 0 1 1 1 1 0 ...
##  $ CreditHistory.Delay                   : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ CreditHistory.Critical                : num  1 0 1 0 0 0 0 0 0 1 ...
##  $ Purpose.NewCar                        : num  0 0 0 0 1 0 0 0 0 1 ...
##  $ Purpose.UsedCar                       : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ Purpose.Furniture.Equipment           : num  0 0 0 1 0 0 1 0 0 0 ...
##  $ Purpose.Radio.Television              : num  1 1 0 0 0 0 0 0 1 0 ...
##  $ Purpose.DomesticAppliance             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Repairs                       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Education                     : num  0 0 1 0 0 1 0 0 0 0 ...
##  $ Purpose.Vacation                      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Retraining                    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Business                      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Other                         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SavingsAccountBonds.lt.100            : num  0 1 1 1 1 0 0 1 0 1 ...
##  $ SavingsAccountBonds.100.to.500        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SavingsAccountBonds.500.to.1000       : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ SavingsAccountBonds.gt.1000           : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ SavingsAccountBonds.Unknown           : num  1 0 0 0 0 1 0 0 0 0 ...
##  $ EmploymentDuration.lt.1               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ EmploymentDuration.1.to.4             : num  0 1 0 0 1 1 0 1 0 0 ...
##  $ EmploymentDuration.4.to.7             : num  0 0 1 1 0 0 0 0 1 0 ...
##  $ EmploymentDuration.gt.7               : num  1 0 0 0 0 0 1 0 0 0 ...
##  $ EmploymentDuration.Unemployed         : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ Personal.Male.Divorced.Seperated      : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ Personal.Female.NotSingle             : num  0 1 0 0 0 0 0 0 0 0 ...
##  $ Personal.Male.Single                  : num  1 0 1 1 1 1 1 1 0 0 ...
##  $ Personal.Male.Married.Widowed         : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ Personal.Female.Single                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherDebtorsGuarantors.None           : num  1 1 1 0 1 1 1 1 1 1 ...
##  $ OtherDebtorsGuarantors.CoApplicant    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherDebtorsGuarantors.Guarantor      : num  0 0 0 1 0 0 0 0 0 0 ...
##  $ Property.RealEstate                   : num  1 1 1 0 0 0 0 0 1 0 ...
##  $ Property.Insurance                    : num  0 0 0 1 0 0 1 0 0 0 ...
##  $ Property.CarOther                     : num  0 0 0 0 0 0 0 1 0 1 ...
##  $ Property.Unknown                      : num  0 0 0 0 1 1 0 0 0 0 ...
##  $ OtherInstallmentPlans.Bank            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherInstallmentPlans.Stores          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherInstallmentPlans.None            : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Housing.Rent                          : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ Housing.Own                           : num  1 1 1 0 0 0 1 0 1 1 ...
##  $ Housing.ForFree                       : num  0 0 0 1 1 1 0 0 0 0 ...
##  $ Job.UnemployedUnskilled               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Job.UnskilledResident                 : num  0 0 1 0 0 1 0 0 1 0 ...
##  $ Job.SkilledEmployee                   : num  1 1 0 1 1 0 1 0 0 0 ...
##  $ Job.Management.SelfEmp.HighlyQualified: num  0 0 0 0 0 0 0 1 0 1 ...
# Drop the variables that provide no information in the data
GermanCredit = GermanCredit[,-c(14,19,27,30,35,40,44,45,48,52,55,58,62)]
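
The column indices above come from the starter code. As an optional sanity check (a sketch, assuming the drop above has already run), caret's findLinearCombos() can confirm that no linearly dependent, and therefore uninformative, dummy columns remain among the numeric predictors:

# Optional sanity check: any remaining linearly dependent dummy columns?
num_preds <- GermanCredit[, sapply(GermanCredit, is.numeric)]
findLinearCombos(as.matrix(num_preds))$remove  # NULL/empty means nothing else to drop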

2. Explore the dataset to understand its structure. It’s okay to use the same code from the last homework. (5pts)

?GermanCredit
View(GermanCredit)
colnames(GermanCredit)
##  [1] "Duration"                           "Amount"                            
##  [3] "InstallmentRatePercentage"          "ResidenceDuration"                 
##  [5] "Age"                                "NumberExistingCredits"             
##  [7] "NumberPeopleMaintenance"            "Telephone"                         
##  [9] "ForeignWorker"                      "Class"                             
## [11] "CheckingAccountStatus.lt.0"         "CheckingAccountStatus.0.to.200"    
## [13] "CheckingAccountStatus.gt.200"       "CreditHistory.NoCredit.AllPaid"    
## [15] "CreditHistory.ThisBank.AllPaid"     "CreditHistory.PaidDuly"            
## [17] "CreditHistory.Delay"                "Purpose.NewCar"                    
## [19] "Purpose.UsedCar"                    "Purpose.Furniture.Equipment"       
## [21] "Purpose.Radio.Television"           "Purpose.DomesticAppliance"         
## [23] "Purpose.Repairs"                    "Purpose.Education"                 
## [25] "Purpose.Retraining"                 "Purpose.Business"                  
## [27] "SavingsAccountBonds.lt.100"         "SavingsAccountBonds.100.to.500"    
## [29] "SavingsAccountBonds.500.to.1000"    "SavingsAccountBonds.gt.1000"       
## [31] "EmploymentDuration.lt.1"            "EmploymentDuration.1.to.4"         
## [33] "EmploymentDuration.4.to.7"          "EmploymentDuration.gt.7"           
## [35] "Personal.Male.Divorced.Seperated"   "Personal.Female.NotSingle"         
## [37] "Personal.Male.Single"               "OtherDebtorsGuarantors.None"       
## [39] "OtherDebtorsGuarantors.CoApplicant" "Property.RealEstate"               
## [41] "Property.Insurance"                 "Property.CarOther"                 
## [43] "OtherInstallmentPlans.Bank"         "OtherInstallmentPlans.Stores"      
## [45] "Housing.Rent"                       "Housing.Own"                       
## [47] "Job.UnemployedUnskilled"            "Job.UnskilledResident"             
## [49] "Job.SkilledEmployee"

Your observation: The GermanCredit dataset contains 1,000 records and 62 variables of financial and demographic information, intended for credit risk analysis (49 variables remain after dropping the uninformative columns above). It includes continuous variables such as Duration, Amount, and Age, as well as binary and categorical indicators such as Telephone, ForeignWorker, and the purpose-specific columns (Purpose.*). The Class variable is the binary classification target, indicating credit risk status.

3. Split the dataset into training and test set with 80-20 split. Please use the random seed as 2024 for reproducibility. (5pts)

# Prepare training and test data sets
set.seed(2024)
ind <- sample(2, nrow(GermanCredit), replace=TRUE, prob=c(0.8, 0.2))
German_training_data <- GermanCredit[ind==1,]
German_test_data <- GermanCredit[ind==2,]

Your observation: The code splits the GermanCredit dataset into training and test sets. set.seed(2024) ensures that the random sampling is reproducible when the code is rerun. German_training_data contains roughly 80% of the observations for training and German_test_data the remaining ~20% for testing; a quick check of the split is sketched below.
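
A small sketch of that check, reusing the ind vector from above:

# Realized split proportions and class balance in each set
prop.table(table(ind))
table(German_training_data$Class)
table(German_test_data$Class)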

Task 2: SVM without weighted class cost (30pts)

1. Fit an SVM model using the training set with a linear kernel. Please use all variables, but make sure the variable types are right. If you are running this on an old laptop, it could take some time! (10pts)

library(e1071)
#Could take some time!
credit.train.svm = svm(as.factor(Class) ~ .,
                 data = German_training_data, kernel = 'linear')
# Identical linear-kernel SVM, refit under a second name for the summary below
Training_svm_model <- svm(Class ~ ., data = German_training_data, kernel = 'linear')
# Summary of the training model
summary(Training_svm_model)
## 
## Call:
## svm(formula = Class ~ ., data = German_training_data, kernel = "linear")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  1 
## 
## Number of Support Vectors:  421
## 
##  ( 204 217 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  0 1

Your observation: The output shows that an SVM model with a linear kernel was trained on German_training_data for binary classification between Classes 0 and 1, using the default cost of 1 and 421 support vectors (204 for Class 0 and 217 for Class 1).

2. Use the training set to get predicted classes. (5pts)

# Make predictions on the train data
predictions_train <- predict(credit.train.svm, German_training_data)

Your observation: Obtained the predicted classes for German_training_data and stored them as predictions_train.

3. Obtain confusion matrix and MR on training set. (5pts)

# Confusion matrix for Training
Trainmatrix_train = table(true = German_training_data$Class,
                      pred = predictions_train)
Trainmatrix_train
##     pred
## true   0   1
##    0 128 115
##    1  51 504
# MR German Training 
MR <- 1 - sum(diag(Trainmatrix_train))/sum(Trainmatrix_train)
print(paste0("MR:",MR))
## [1] "MR:0.208020050125313"

Your observation: On the training set the model achieved an accuracy of approximately 79.2%, with a misclassification rate of 20.8%. It correctly identified 128 instances of Class 0 and 504 instances of Class 1, but misclassified 115 Class 0 instances as Class 1 (false positives) and 51 Class 1 instances as Class 0 (false negatives). While the model performs well overall, particularly on Class 1, it could be improved to reduce the false positives on Class 0; a class-wise breakdown is sketched below.
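
A minimal sketch of that class-wise breakdown, treating "1" (Good) as the positive class:

# Sensitivity and specificity from the training confusion matrix ("1" = positive class)
sens_train <- Trainmatrix_train["1", "1"] / sum(Trainmatrix_train["1", ])  # ~0.91
spec_train <- Trainmatrix_train["0", "0"] / sum(Trainmatrix_train["0", ])  # ~0.53
c(sensitivity = sens_train, specificity = spec_train)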

4. Use the testing set to get predicted classes. (5pts)

predictions_test <- predict(credit.train.svm, German_test_data)

Your observation: Obtained the predicted classes for German_test_data and stored them as predictions_test.

5. Obtain confusion matrix and MR on testing set. (5pts)

# Confusion matrix for Testing
Testmatrix_test = table(true = German_test_data$Class,
                      pred = predictions_test)
Testmatrix_test
##     pred
## true   0   1
##    0  25  32
##    1  19 126
# MR German Testing
MR <- 1 - sum(diag(Testmatrix_test))/sum(Testmatrix_test)
print(paste0("MR:",MR))
## [1] "MR:0.252475247524752"

Your observation: The confusion matrix Testmatrix_test indicates that the model correctly predicted 25 instances of Class 0 and 126 instances of Class 1. However, it misclassified 32 instances of Class 0 as Class 1 (false positives) and 19 instances of Class 1 as Class 0 (false negatives). The overall misclassification rate is 25.25%, meaning approximately 25.25% of predictions were incorrect, resulting in an accuracy of about 74.75%. This suggests that while the model has reasonable performance, it has room for improvement, particularly in reducing false positives for Class 0 and enhancing overall prediction accuracy.

Task 3: SVM with weighted class cost, and probabilities enabled (35pts, each 5pts)

1. Fit an SVM model using the training set with a weight of 2 on “1” and a weight of 1 on “0”. Please use all variables, but make sure the variable types are right. Also, enable probability fitting with probability = TRUE.

credit.svm_German = svm(as.factor(Class) ~ .,
                      data = German_training_data, kernel = 'linear',
                      probability = TRUE,
                      class.weights = c("0" = 1, "1" = 2))

Your observation: The code trains an SVM model credit.svm_German using a linear kernel on German_training_data, with Class as the target variable. The as.factor(Class) ensures that the target variable is treated as a categorical factor. The probability = TRUE parameter enables the model to provide probability estimates for predictions. The class.weights argument assigns a higher weight of 2 to Class 1 and a weight of 1 to Class 0, emphasizing the importance of correctly predicting Class 1 to handle class imbalance or prioritize certain outcomes.
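
If one wanted to tune rather than fix these settings, e1071's tune() can grid-search the cost parameter by cross-validation while keeping the required class weights; a hedged sketch (the cost grid is illustrative, and this step is not part of the assignment):

# Optional sketch: cross-validated grid search over cost, class weights held fixed
set.seed(2024)
cv_tune <- tune(svm, as.factor(Class) ~ ., data = German_training_data,
                kernel = "linear",
                ranges = list(cost = c(0.1, 1, 10)),   # illustrative grid
                class.weights = c("0" = 1, "1" = 2))
summary(cv_tune)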

2. Use the training set to get predicted probabilities and classes.

# Refit the weighted model with probabilities enabled (same settings as credit.svm_German)
credit.German.svm_prob = svm(as.factor(Class) ~ .,
                      data = German_training_data, kernel = 'linear',
                      probability = TRUE,
                      class.weights = c("0" = 1, "1" = 2))
pred_credit_train <- predict(credit.German.svm_prob, German_training_data, probability = TRUE)

pred_prob_train = attr(pred_credit_train, "probabilities")[, "1"] 

Your observation: The code refits the weighted SVM as credit.German.svm_prob with a linear kernel on German_training_data, keeping the class weights from step 1 and probability = TRUE so probability estimates are available. Predictions are then made on the training data with predict(), and the predicted probabilities for Class 1 are extracted from the probabilities attribute and stored in pred_prob_train. This allows further analysis or threshold-based decision-making using the probability scores for Class 1.
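
With the probabilities in hand, predicted classes do not have to come from the model's default decision rule; a small sketch of applying a custom threshold to pred_prob_train (the 0.6 value is purely illustrative):

# Classify at a custom probability threshold instead of the default decision rule
threshold <- 0.6  # illustrative value, not required by the assignment
pred_class_custom <- ifelse(pred_prob_train > threshold, "1", "0")
table(true = German_training_data$Class, pred = pred_class_custom)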

3. Obtain confusion matrix and MR on training set (use predicted classes).

# Confusion matrix for Training
New_matrix_train = table(true = German_training_data$Class,
                      pred = pred_credit_train)
New_matrix_train
##     pred
## true   0   1
##    0 112 131
##    1  37 518
# MR German Training 
MR <- 1 - sum(diag(New_matrix_train))/sum(New_matrix_train)
print(paste0("MR:",MR))
## [1] "MR:0.210526315789474"

Your observation: The confusion matrix New_matrix_train shows that the weighted model correctly predicted 112 instances of Class 0 and 518 instances of Class 1, while misclassifying 131 Class 0 instances as Class 1 and 37 Class 1 instances as Class 0. The misclassification rate is about 21.1%, for an overall accuracy of roughly 78.9%. Compared with the unweighted model from Task 2, the higher weight on Class 1 cuts its false negatives from 51 to 37, at the cost of more Class 0 instances being misclassified (131 vs. 115).

4. Obtain ROC and AUC on training set (use predicted probabilities).

library(ROCR)
pred <- prediction(pred_prob_train, German_training_data$Class)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)


unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.8257368

Your observation: An AUC of 0.8257 on the training set (from pred_prob_train) indicates that the model performs well, correctly distinguishing between the two classes about 82.6% of the time, though there is still room for improvement.
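
The same ROCR objects can be used to locate the probability cutoff that maximizes TPR minus FPR on the training curve (a sketch, assuming pred and perf from the chunk above are still in scope):

# Cutoff that maximizes TPR - FPR (Youden-style) on the training ROC curve
tpr <- unlist(slot(perf, "y.values"))
fpr <- unlist(slot(perf, "x.values"))
cutoffs <- unlist(slot(perf, "alpha.values"))
best <- which.max(tpr - fpr)
c(cutoff = cutoffs[best], tpr = tpr[best], fpr = fpr[best])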

5. Use the testing set to get predicted probabilities and classes.

pred_credit_test <- predict(credit.German.svm_prob, German_test_data, probability = TRUE)

pred_prob_test = attr(pred_credit_test, "probabilities")[, "1"] 

Your observation: The code makes predictions on the German_test_data using the trained SVM model credit.German.svm_prob and stores these predictions in pred_credit_test. It then extracts the predicted probabilities for Class 1 from the probabilities attribute and stores them in pred_prob_test. This allows for analysis of the model’s performance on the test data, including evaluating the predicted likelihood of each instance belonging to Class 1.

6. Obtain confusion matrix and MR on testing set. (use predicted classes).

# Confusion matrix for Testing
New_matrix_test = table(true = German_test_data$Class,
                      pred = pred_credit_test)
New_matrix_test
##     pred
## true   0   1
##    0  19  38
##    1  13 132
# MR German Testing
MR <- 1 - sum(diag(New_matrix_test))/sum(New_matrix_test)
print(paste0("MR:",MR))
## [1] "MR:0.252475247524752"

Your observation: The confusion matrix New_matrix_test shows that the model correctly predicted 19 instances of Class 0 and 132 instances of Class 1, while misclassifying 38 Class 0 instances as Class 1 and 13 Class 1 instances as Class 0. The misclassification rate is 25.25%, for an overall accuracy of about 74.75%. As on the training set, the weighting improves recall on Class 1 but pushes more Class 0 instances into the positive class, so reducing those errors remains the main area for improvement.

7. Obtain ROC and AUC on testing set. (use predicted probabilities).

# Looks familiar, right?
library(ROCR)
pred <- prediction(pred_prob_test, German_test_data$Class)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)


unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.8125832

Your observation: An AUC of 0.8126 on the test set (from pred_prob_test) indicates that the model performs well, correctly distinguishing between the positive and negative classes about 81.3% of the time, suggesting it has good predictive power, though there is still some room for improvement compared to a perfect model (AUC = 1).
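
For a visual comparison of generalization, the training and testing ROC curves can be overlaid on the same axes (a sketch that rebuilds both performance objects from the stored probabilities):

# Overlay training and test ROC curves for a direct comparison
perf_train <- performance(prediction(pred_prob_train, German_training_data$Class), "tpr", "fpr")
perf_test  <- performance(prediction(pred_prob_test,  German_test_data$Class), "tpr", "fpr")
plot(perf_train, col = "blue")
plot(perf_test, col = "red", add = TRUE)
legend("bottomright", legend = c("Training", "Testing"), col = c("blue", "red"), lty = 1)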

Task 4: Conclusion (15pts)

1. Summarize your findings and discuss what you observed from the above analysis. (5pts)

An SVM model with a linear kernel was trained on the German_training_data for binary classification, using 421 support vectors (204 for Class 0 and 217 for Class 1). The model achieved an accuracy of 79.2% on the training data, with a misclassification rate of 20.8%, accurately predicting most of Class 1 but misclassifying 115 Class 0 instances as Class 1 (false positives) and 51 Class 1 instances as Class 0 (false negatives). On the test data, the model achieved 74.75% accuracy and a misclassification rate of 25.25%, with 32 Class 0 instances misclassified as Class 1 and 19 Class 1 instances misclassified as Class 0.

The model uses class weights to prioritize Class 1 due to class imbalance, and probability estimates were produced for further analysis. The AUC on the training data is 0.8257, meaning the model correctly distinguishes between classes about 82.6% of the time, while the AUC on the test data is 0.8126, or about 81.3%. While the model performs reasonably well, there is room for improvement, particularly in reducing false positives for Class 0 and improving overall accuracy. The model performed slightly better on the training data than on the test data, which is expected; the gap stays modest, probably because the linear kernel and weight constraints limit overfitting.

2. Please recall the results from last homework, how do you compare SVM to logistic regression? No coding is required for this question, just discuss. (10pts)

The comparison between the SVM model with a linear kernel and logistic regression shows that the SVM performs well, achieving an accuracy of 79.2% on the training set and 74.75% on the test set, with AUC scores of roughly 0.83 (training) and 0.81 (testing). The SVM demonstrates strong class discrimination but has room for improvement, particularly in reducing false positives. Logistic regression, while simpler and more interpretable, performs well on roughly linearly separable data but may struggle with more complex relationships. SVMs handle high-dimensional feature spaces well and, with proper tuning of cost and class weights, can better accommodate class imbalance, whereas logistic regression offers coefficients that are easier to interpret. Overall, the SVM has a slight edge in raw performance here, but both models have their advantages depending on data complexity and the need for interpretability.

3. (Optional) Change the kernel to others such as radial, and see if you get a better result.
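
A minimal sketch for this optional step, refitting with a radial kernel (default gamma and cost) under the same class weights and checking the test misclassification rate; results are not reported here:

# Optional: radial-kernel SVM with the same weighting, evaluated on the test set
credit.svm_radial <- svm(as.factor(Class) ~ ., data = German_training_data,
                         kernel = "radial", probability = TRUE,
                         class.weights = c("0" = 1, "1" = 2))
pred_radial_test <- predict(credit.svm_radial, German_test_data)
radial_matrix <- table(true = German_test_data$Class, pred = pred_radial_test)
radial_matrix
1 - sum(diag(radial_matrix)) / sum(radial_matrix)  # test MR for the radial kernel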