---
title: "Homework6"
author: "Errol Moore"
date: "10/22/2023"
output: html_document
---
Refer to http://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data)
for variable descriptions. The response variable is Class
and all others are predictors.
Only run the following code once to install the package
caret. The German credit scoring data is
provided in that package.
install.packages('caret')
library(caret) # this package contains the German credit data in numeric format
## Loading required package: ggplot2
## Loading required package: lattice
data(GermanCredit)
GermanCredit$Class <- as.numeric(GermanCredit$Class == "Good") # convert `Class` to 1 (Good) or 0 (Bad)
GermanCredit$Class <- as.factor(GermanCredit$Class) # make sure `Class` is a factor, since svm() needs a factor response for classification; now 1 is good and 0 is bad
str(GermanCredit)
## 'data.frame': 1000 obs. of 62 variables:
## $ Duration : int 6 48 12 42 24 36 24 36 12 30 ...
## $ Amount : int 1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
## $ InstallmentRatePercentage : int 4 2 2 2 3 2 3 2 2 4 ...
## $ ResidenceDuration : int 4 2 3 4 4 4 4 2 4 2 ...
## $ Age : int 67 22 49 45 53 35 53 35 61 28 ...
## $ NumberExistingCredits : int 2 1 1 1 2 1 1 1 1 2 ...
## $ NumberPeopleMaintenance : int 1 1 2 2 2 2 1 1 1 1 ...
## $ Telephone : num 0 1 1 1 1 0 1 0 1 1 ...
## $ ForeignWorker : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Class : Factor w/ 2 levels "0","1": 2 1 2 2 1 2 2 2 2 1 ...
## $ CheckingAccountStatus.lt.0 : num 1 0 0 1 1 0 0 0 0 0 ...
## $ CheckingAccountStatus.0.to.200 : num 0 1 0 0 0 0 0 1 0 1 ...
## $ CheckingAccountStatus.gt.200 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CheckingAccountStatus.none : num 0 0 1 0 0 1 1 0 1 0 ...
## $ CreditHistory.NoCredit.AllPaid : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CreditHistory.ThisBank.AllPaid : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CreditHistory.PaidDuly : num 0 1 0 1 0 1 1 1 1 0 ...
## $ CreditHistory.Delay : num 0 0 0 0 1 0 0 0 0 0 ...
## $ CreditHistory.Critical : num 1 0 1 0 0 0 0 0 0 1 ...
## $ Purpose.NewCar : num 0 0 0 0 1 0 0 0 0 1 ...
## $ Purpose.UsedCar : num 0 0 0 0 0 0 0 1 0 0 ...
## $ Purpose.Furniture.Equipment : num 0 0 0 1 0 0 1 0 0 0 ...
## $ Purpose.Radio.Television : num 1 1 0 0 0 0 0 0 1 0 ...
## $ Purpose.DomesticAppliance : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Repairs : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Education : num 0 0 1 0 0 1 0 0 0 0 ...
## $ Purpose.Vacation : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Retraining : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Business : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Other : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SavingsAccountBonds.lt.100 : num 0 1 1 1 1 0 0 1 0 1 ...
## $ SavingsAccountBonds.100.to.500 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SavingsAccountBonds.500.to.1000 : num 0 0 0 0 0 0 1 0 0 0 ...
## $ SavingsAccountBonds.gt.1000 : num 0 0 0 0 0 0 0 0 1 0 ...
## $ SavingsAccountBonds.Unknown : num 1 0 0 0 0 1 0 0 0 0 ...
## $ EmploymentDuration.lt.1 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ EmploymentDuration.1.to.4 : num 0 1 0 0 1 1 0 1 0 0 ...
## $ EmploymentDuration.4.to.7 : num 0 0 1 1 0 0 0 0 1 0 ...
## $ EmploymentDuration.gt.7 : num 1 0 0 0 0 0 1 0 0 0 ...
## $ EmploymentDuration.Unemployed : num 0 0 0 0 0 0 0 0 0 1 ...
## $ Personal.Male.Divorced.Seperated : num 0 0 0 0 0 0 0 0 1 0 ...
## $ Personal.Female.NotSingle : num 0 1 0 0 0 0 0 0 0 0 ...
## $ Personal.Male.Single : num 1 0 1 1 1 1 1 1 0 0 ...
## $ Personal.Male.Married.Widowed : num 0 0 0 0 0 0 0 0 0 1 ...
## $ Personal.Female.Single : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherDebtorsGuarantors.None : num 1 1 1 0 1 1 1 1 1 1 ...
## $ OtherDebtorsGuarantors.CoApplicant : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherDebtorsGuarantors.Guarantor : num 0 0 0 1 0 0 0 0 0 0 ...
## $ Property.RealEstate : num 1 1 1 0 0 0 0 0 1 0 ...
## $ Property.Insurance : num 0 0 0 1 0 0 1 0 0 0 ...
## $ Property.CarOther : num 0 0 0 0 0 0 0 1 0 1 ...
## $ Property.Unknown : num 0 0 0 0 1 1 0 0 0 0 ...
## $ OtherInstallmentPlans.Bank : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherInstallmentPlans.Stores : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherInstallmentPlans.None : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Housing.Rent : num 0 0 0 0 0 0 0 1 0 0 ...
## $ Housing.Own : num 1 1 1 0 0 0 1 0 1 1 ...
## $ Housing.ForFree : num 0 0 0 1 1 1 0 0 0 0 ...
## $ Job.UnemployedUnskilled : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Job.UnskilledResident : num 0 0 1 0 0 1 0 0 1 0 ...
## $ Job.SkilledEmployee : num 1 1 0 1 1 0 1 0 0 0 ...
## $ Job.Management.SelfEmp.HighlyQualified: num 0 0 0 0 0 0 0 1 0 1 ...
# This code drops variables that provide no information in the data
GermanCredit = GermanCredit[,-c(14,19,27,30,35,40,44,45,48,52,55,58,62)]
?GermanCredit
View(GermanCredit)
colnames(GermanCredit)
## [1] "Duration" "Amount"
## [3] "InstallmentRatePercentage" "ResidenceDuration"
## [5] "Age" "NumberExistingCredits"
## [7] "NumberPeopleMaintenance" "Telephone"
## [9] "ForeignWorker" "Class"
## [11] "CheckingAccountStatus.lt.0" "CheckingAccountStatus.0.to.200"
## [13] "CheckingAccountStatus.gt.200" "CreditHistory.NoCredit.AllPaid"
## [15] "CreditHistory.ThisBank.AllPaid" "CreditHistory.PaidDuly"
## [17] "CreditHistory.Delay" "Purpose.NewCar"
## [19] "Purpose.UsedCar" "Purpose.Furniture.Equipment"
## [21] "Purpose.Radio.Television" "Purpose.DomesticAppliance"
## [23] "Purpose.Repairs" "Purpose.Education"
## [25] "Purpose.Retraining" "Purpose.Business"
## [27] "SavingsAccountBonds.lt.100" "SavingsAccountBonds.100.to.500"
## [29] "SavingsAccountBonds.500.to.1000" "SavingsAccountBonds.gt.1000"
## [31] "EmploymentDuration.lt.1" "EmploymentDuration.1.to.4"
## [33] "EmploymentDuration.4.to.7" "EmploymentDuration.gt.7"
## [35] "Personal.Male.Divorced.Seperated" "Personal.Female.NotSingle"
## [37] "Personal.Male.Single" "OtherDebtorsGuarantors.None"
## [39] "OtherDebtorsGuarantors.CoApplicant" "Property.RealEstate"
## [41] "Property.Insurance" "Property.CarOther"
## [43] "OtherInstallmentPlans.Bank" "OtherInstallmentPlans.Stores"
## [45] "Housing.Rent" "Housing.Own"
## [47] "Job.UnemployedUnskilled" "Job.UnskilledResident"
## [49] "Job.SkilledEmployee"
Your observation: The GermanCredit dataset contains
1,000 records and 62 variables of financial and
demographic information used for credit risk analysis; 49 variables remain
after the uninformative columns are dropped. It includes
continuous variables like Duration, Amount,
and Age, as well as binary indicators such
as Telephone and ForeignWorker and dummy-coded
categorical groups (for example, the Purpose.* columns). The Class variable
is the binary classification target: 1 for good credit and 0 for bad credit.
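The column drop above selects columns by position, which silently breaks if the column order ever changes. As a minimal sketch, assuming a small made-up data frame (`toy` and its column names are hypothetical, not part of GermanCredit), uninformative columns can be found and dropped by name instead:

```r
# Toy data frame standing in for a dummy-coded data set (hypothetical values).
toy <- data.frame(
  Amount       = c(1169, 5951, 2096, 7882),
  Purpose.Car  = c(1, 0, 0, 1),
  Purpose.Trip = c(0, 0, 0, 0),  # constant column: carries no information
  Housing.Rent = c(0, 1, 0, 0),
  Housing.Own  = c(1, 0, 1, 1)   # equals 1 - Housing.Rent: redundant dummy
)

# A column with a single distinct value cannot help any classifier.
is_constant <- vapply(toy, function(col) length(unique(col)) == 1, logical(1))
constant_cols <- names(toy)[is_constant]

# Drop by name, so the selection survives changes in column order.
drop_cols <- c(constant_cols, "Housing.Own")
toy_reduced <- toy[, !(names(toy) %in% drop_cols), drop = FALSE]

print(constant_cols)
print(names(toy_reduced))
```

The same idea would apply to GermanCredit by listing the names of the columns to remove, though which columns count as uninformative is a modeling decision.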
2024 for
reproducibility. (5pts)
# Prepare training and test data sets
set.seed(2024)
ind <- sample(2, nrow(GermanCredit), replace=TRUE, prob=c(0.8, 0.2))
German_training_data <- GermanCredit[ind==1,]
German_test_data <- GermanCredit[ind==2,]
Your observation: The code splits the
GermanCredit dataset into training and
test sets. set.seed(2024) ensures that the random sampling results are
the same on every run. Each row is assigned to group 1 with probability
0.8 or to group 2 with probability 0.2, so German_training_data holds
~80% of the original data (798 rows here) and
German_test_data holds ~20% (202 rows).
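Because each row is assigned independently, the split above is only approximately 80/20. The sketch below contrasts that approach with an exact 80/20 split drawn without replacement; the variable names `train_rows` and `test_rows` are illustrative and do not appear in the original code.

```r
set.seed(2024)
n <- 1000  # same number of rows as GermanCredit

# Approach used above: each row is assigned independently, so group sizes vary.
ind <- sample(2, n, replace = TRUE, prob = c(0.8, 0.2))
print(table(ind))

# Alternative: draw exactly 80% of the row indices without replacement.
train_rows <- sample(n, size = round(0.8 * n))
test_rows  <- setdiff(seq_len(n), train_rows)
print(length(train_rows))
print(length(test_rows))
```

Either approach is reproducible with a fixed seed; the exact split just makes the set sizes predictable.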
library(e1071)
# Could take some time!
credit.train.svm = svm(as.factor(Class) ~ .,
                       data = German_training_data, kernel = 'linear')
# SVM model with linear kernel
Training_svm_model <- svm(Class ~ ., data = German_training_data, kernel = 'linear')
# Summary of the training model
summary(Training_svm_model)
##
## Call:
## svm(formula = Class ~ ., data = German_training_data, kernel = "linear")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 1
##
## Number of Support Vectors: 421
##
## ( 204 217 )
##
##
## Number of Classes: 2
##
## Levels:
## 0 1
Your observation: The output indicates that an SVM model with a
linear kernel was trained on
German_training_data for binary classification. The model
uses 421 support vectors (204 for Class 0 and 217 for Class
1). The model is trained to classify between two classes (0 and
1).
# Make predictions on the train data
predictions_train <- predict(credit.train.svm, German_training_data)
Your observation: Obtained the predicted values for
German_training_data, given the name
predictions_train.
# Confusion matrix for Training
Trainmatrix_train = table(true = German_training_data$Class,
pred = predictions_train)
Trainmatrix_train
## pred
## true 0 1
## 0 128 115
## 1 51 504
# MR German Training
MR <- 1 - sum(diag(Trainmatrix_train))/sum(Trainmatrix_train)
print(paste0("MR:",MR))
## [1] "MR:0.208020050125313"
Your observation: The confusion matrix Trainmatrix_train
shows an accuracy of approximately 79.2% and a
misclassification rate of 20.8% on the training data. The model correctly identified
128 instances of Class 0 and 504 instances of Class 1,
but misclassified 115 Class 0 instances as Class 1 (false
positives) and 51 Class 1 instances as Class 0 (false
negatives). The model performs well on
Class 1 (504 of 555 correct), but it labels almost half of the Class 0
cases (115 of 243) as good credit, so reducing false
positives is the main area for improvement.
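The rates quoted above can be recomputed directly from the four counts in Trainmatrix_train. The following sketch rebuilds the matrix by hand and treats Class 1 (good credit) as the positive class; the object names `cm`, `sensitivity`, and `specificity` are illustrative.

```r
# Counts copied from Trainmatrix_train (rows = true class, columns = predicted).
cm <- matrix(c(128, 51, 115, 504), nrow = 2,
             dimnames = list(true = c("0", "1"), pred = c("0", "1")))

accuracy    <- sum(diag(cm)) / sum(cm)        # share of correct predictions
mr          <- 1 - accuracy                   # misclassification rate
sensitivity <- cm["1", "1"] / sum(cm["1", ])  # Class 1 cases predicted as 1
specificity <- cm["0", "0"] / sum(cm["0", ])  # Class 0 cases predicted as 0

print(round(c(accuracy = accuracy, MR = mr,
              sensitivity = sensitivity, specificity = specificity), 4))
```

Reporting sensitivity and specificity alongside the misclassification rate makes the imbalance between the two error types visible.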
predictions_test <- predict(credit.train.svm, German_test_data)
Your observation: Obtained the predicted values for German_test_data, given the name predictions_test.
# Confusion matrix for Test
Testmatrix_test = table(true = German_test_data$Class,
pred = predictions_test)
Testmatrix_test
## pred
## true 0 1
## 0 25 32
## 1 19 126
# MR German Test
MR <- 1 - sum(diag(Testmatrix_test))/sum(Testmatrix_test)
print(paste0("MR:",MR))
## [1] "MR:0.252475247524752"
Your observation: The confusion matrix Testmatrix_test
shows that the model correctly predicted 25 instances of Class
0 and 126 instances of Class 1. However, it
misclassified 32 instances of Class 0 as Class 1 (false
positives) and 19 instances of Class 1 as Class 0
(false negatives). The overall misclassification rate is
25.25%, which corresponds to an accuracy of about 74.75%. The
model performs reasonably, but more than half of the Class 0 test cases
(32 of 57) are predicted as Class 1, so there is room
for improvement, particularly in reducing false positives.
probability = TRUE.
credit.svm_German = svm(as.factor(Class) ~ .,
data = German_training_data, kernel = 'linear',
probability = TRUE,
class.weights = c("0" = 1, "1" = 2))
Your observation: The code trains an SVM model
credit.svm_German using a linear kernel on
German_training_data, with Class as the target
variable. The as.factor(Class) ensures that the target
variable is treated as a categorical factor. The
probability = TRUE parameter enables the model to provide
probability estimates for predictions. The class.weights
argument assigns a weight of 2 to Class 1 and a weight of 1 to
Class 0, so errors on Class 1 are penalized twice as heavily. Because
Class 1 is already the majority class in the training data (555 of 798
rows), this weighting prioritizes correctly predicting good credit rather
than correcting the class imbalance.
#refit the model with probabilities enabled
credit.German.svm_prob = svm(as.factor(Class) ~ .,
data = German_training_data, kernel = 'linear',
probability = TRUE)
pred_credit_train <- predict(credit.German.svm_prob, German_training_data, probability = TRUE)
pred_prob_train = attr(pred_credit_train, "probabilities")[, "1"]
Your observation: The code trains an SVM model
credit.German.svm_prob using a linear kernel
on German_training_data, with Class as the
target variable and probability = TRUE to enable
probability estimates. Predictions are then made on the training data
using the predict() function, and the predicted
probabilities for Class 1 are extracted from the probabilities attribute
and stored in pred_prob_train. This allows for further
analysis or threshold-based decision-making using the probability scores
for Class 1.
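To illustrate the threshold-based decision-making mentioned above, here is a minimal sketch using made-up probabilities rather than the model's output; `prob_good`, `truth`, and `classify` are hypothetical names introduced only for this example.

```r
# Hypothetical probabilities of Class 1 and true labels (not from the model).
prob_good <- c(0.91, 0.75, 0.62, 0.48, 0.35, 0.20)
truth     <- factor(c(1, 1, 0, 1, 0, 0), levels = c(0, 1))

# Predict Class 1 when the probability reaches the chosen threshold.
classify <- function(prob, threshold) {
  factor(as.integer(prob >= threshold), levels = c(0, 1))
}

# A stricter threshold trades false positives for false negatives.
for (threshold in c(0.5, 0.7)) {
  pred <- classify(prob_good, threshold)
  cat("threshold =", threshold, "\n")
  print(table(true = truth, pred = pred))
}
```

With real data, the same function could be applied to pred_prob_train to see how the confusion matrix changes as the threshold moves.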
# Confusion matrix for Training
New_matrix_train = table(true = German_training_data$Class,
pred = pred_credit_train)
New_matrix_train
## pred
## true 0 1
## 0 112 131
## 1 37 518
# MR German Training
MR <- 1 - sum(diag(New_matrix_train))/sum(New_matrix_train)
print(paste0("MR:",MR))
## [1] "MR:0.210526315789474"
Your observation: The confusion matrix New_matrix_train
shows that the model correctly predicted 112 instances of Class
0 and 518 instances of Class 1. However, it
misclassified 131 instances of Class 0 as Class 1 (false
positives) and 37 instances of Class 1 as Class 0 (false
negatives). The overall misclassification rate is 21.05%,
which corresponds to an overall accuracy of about
78.95%. The model performs well
in predicting Class 1, but more than half of the Class 0 cases (131 of
243) are predicted as Class 1, so there is room for improvement in reducing false
positives.
library(ROCR)
pred <- prediction(pred_prob_train, German_training_data$Class)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.8257368
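Your observation: An AUC of 0.8257368 computed from
pred_prob_train indicates that the model separates the classes well on the
training data: a randomly chosen Class 1 case receives a higher predicted
probability than a randomly chosen Class 0 case about 82.6% of the time,
though there is still room for improvement.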
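The AUC can be read as the probability that a randomly chosen positive case is scored above a randomly chosen negative case. The sketch below checks this on made-up scores in base R, without ROCR; `auc_rank` and `auc_pairs` are hypothetical helper names, and the data are not the model's output.

```r
# Hypothetical scores and labels (1 = positive class); not the model's output.
scores <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.3)
labels <- c(1,   1,   0,   1,   0,   0)

# AUC from the rank-sum (Mann-Whitney) identity.
auc_rank <- function(scores, labels) {
  r <- rank(scores)  # average ranks handle ties
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Direct definition: share of (positive, negative) pairs ranked correctly,
# counting ties as half.
auc_pairs <- function(scores, labels) {
  pos <- scores[labels == 1]
  neg <- scores[labels == 0]
  mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
}

print(auc_rank(scores, labels))
print(auc_pairs(scores, labels))
```

Both functions return the same value here (8 of the 9 positive-negative pairs are ordered correctly), which is the pairwise interpretation of the AUC.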
pred_credit_test <- predict(credit.German.svm_prob, German_test_data, probability = TRUE)
pred_prob_test = attr(pred_credit_test, "probabilities")[, "1"]
Your observation: The code makes predictions on the
German_test_data using the trained SVM model
credit.German.svm_prob and stores these predictions in
pred_credit_test. It then extracts the predicted
probabilities for Class 1 from the probabilities attribute and stores
them in pred_prob_test. This allows for analysis of the
model’s performance on the test data, including evaluating the predicted
likelihood of each instance belonging to Class 1.
# Confusion matrix for Test
New_matrix_test = table(true = German_test_data$Class,
pred = pred_credit_test)
New_matrix_test
## pred
## true 0 1
## 0 19 38
## 1 13 132
# MR German Test
MR <- 1 - sum(diag(New_matrix_test))/sum(New_matrix_test)
print(paste0("MR:",MR))
## [1] "MR:0.252475247524752"
Your observation: The confusion matrix New_matrix_test
shows that the model correctly predicted 19 instances of Class
0 and 132 instances of Class 1. However, it
misclassified 38 instances of Class 0 as Class 1 (false
positives) and 13 instances of Class 1 as Class 0 (false
negatives). The misclassification rate is 25.25%,
which corresponds to an overall accuracy of about 74.75%. The
model has reasonable performance, but two thirds of the Class 0 test
cases (38 of 57) are predicted as Class 1, so there is room for
improvement in reducing false positives.
# Looks familiar, right?
library(ROCR)
pred <- prediction(pred_prob_test, German_test_data$Class)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.8125832
Your observation: An AUC of 0.8125832 computed from
pred_prob_test indicates that the model separates the classes well on the
test data: a randomly chosen Class 1 case receives a higher predicted
probability than a randomly chosen Class 0 case
about 81.3% of the time. This suggests good predictive power,
though there is still some room for improvement compared to a perfect
model (AUC = 1).
An SVM model with a linear kernel was trained on the
German_training_data for binary classification, using
421 support vectors (204 for Class 0 and 217 for Class 1). The
model achieved an accuracy of 79.2% on the
training data, with a misclassification rate of
20.8%, accurately predicting most of Class 1 but misclassifying
115 Class 0 instances as Class 1 (false positives) and 51 Class
1 instances as Class 0 (false negatives). On the
test data, the model achieved 74.75% accuracy and
a misclassification rate of 25.25%, with 32 Class 0 instances
misclassified as Class 1 and 19 Class 1 instances misclassified as
Class 0.
A class-weighted variant (credit.svm_German) gives Class 1 twice the
weight of Class 0, and probability estimates from credit.German.svm_prob
were used for the ROC analysis.
The AUC for the training data is 0.8257,
meaning a random Class 1 case is ranked above a random Class 0 case about
82.6% of the time, while the AUC for the test data
is 0.8126 (about 81.3%). While the model performs reasonably
well, there is room for improvement, particularly in reducing false
positives and improving overall accuracy. The
model performed slightly better on the training data than on the
test data, which is expected because it was fit to the
training data.
The comparison between the SVM model with a
linear kernel and logistic regression shows
that the SVM performs well, achieving an accuracy of 79.2% on
the training set and 74.75% on the
test set, with AUC scores of 0.8257 and
0.8126, respectively. The SVM model demonstrates
strong class discrimination but has room for improvement, particularly
in reducing false positives. Logistic regression, while
simpler and more interpretable, tends to perform well in linearly
separable data but may struggle with more complex relationships.
SVMs excel in high-dimensional spaces and can better handle
class imbalances with proper tuning, though
logistic regression offers easier interpretability.
Overall, SVM has a slight edge in performance, but both
models have their advantages depending on the data complexity and need
for interpretability.
radial, and see if you got a better result.