# Enter your name here: Joshua Gaze
# 1. I did this homework by myself, with help from the book and the professor.
Supervised learning means that there is a criterion one is trying to predict. The typical strategy is to divide data into a training set and a test set (for example, two-thirds training and one-third test), train the model on the training set, and then see how well the model does on the test set.
Support vector machines (SVM) are a highly flexible and powerful method of doing supervised machine learning.
Another approach is to use partition trees (rpart).
In this homework, we will use another banking dataset to train an SVM model, as well as an rpart model, to classify potential borrowers into 2 groups of credit risk – reliable borrowers and borrowers posing a risk. You can learn more about the variables in the dataset here:
https://archive.ics.uci.edu/ml/datasets/Statlog+%28German+Credit+Data%29
These kinds of classification algorithms are used in many aspects of our lives, from credit card approvals to stock market predictions and even some medical diagnoses.
The dataset itself can be read from:
https://intro-datascience.s3.us-east-2.amazonaws.com/GermanCredit.csv
You will also need to install.packages() and library() several other packages, such as kernlab and caret.
library(kernlab)
library(caret)
## Warning: package 'caret' was built under R version 4.1.2
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:kernlab':
##
## alpha
## Loading required package: lattice
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v tibble 3.1.5 v dplyr 1.0.7
## v tidyr 1.1.4 v stringr 1.4.0
## v readr 2.0.2 v forcats 0.5.1
## v purrr 0.3.4
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x ggplot2::alpha() masks kernlab::alpha()
## x purrr::cross() masks kernlab::cross()
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## x purrr::lift() masks caret::lift()
library(rpart)
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 4.1.2
library(e1071)
## Warning: package 'e1071' was built under R version 4.1.2
credit <- read.csv("https://intro-datascience.s3.us-east-2.amazonaws.com/GermanCredit.csv")
cred <- data.frame(duration=credit$duration,
amount=credit$amount,
installment_rate=credit$installment_rate,
present_residence=credit$present_residence,
age=credit$age,
credit_history=credit$number_credits,
people_liable=credit$people_liable,
credit_risk=as.factor(credit$credit_risk))
# The factor variable in the cred dataframe is the 'credit_risk' variable
# duration: loan duration in months
# amount: credit amount
# installment_rate: installment rate as a percentage of disposable income
# present_residence: numeric code for how long the borrower has lived at the present residence
# age: age in years
# credit_history: number of existing credits at this bank (taken from the number_credits column)
# people_liable: number of people the borrower is liable to provide maintenance for
# credit_risk: the outcome to classify, stored as a factor. In this file the levels are 0 and 1 (not 1 and 2); about 30% of cases are 0 and 70% are 1, which suggests 0 = Bad and 1 = Good.
# Makes the sampling predictable
set.seed(123)
# Randomly sample elements to go into the training dataset
train_list <- createDataPartition(y = cred$credit_risk, p=2/3, list = FALSE)
training <- cred[train_list,]
testing <- cred[-train_list,]
dim(training)
## [1] 667 8
dim(testing)
## [1] 333 8
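The split above is stratified on credit_risk, so both sets keep roughly the same share of each class. The idea can be sketched in base R on made-up labels with the same 300/700 balance. `stratified_split` is a hypothetical helper written for illustration, not caret's actual implementation:

```r
# Hypothetical helper: sample a fraction p of the row indices within each class.
stratified_split <- function(y, p) {
  idx_by_class <- split(seq_along(y), y)
  picked <- lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = round(p * length(idx)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

set.seed(123)
# Made-up outcome: 300 cases of class 0 and 700 cases of class 1
y <- factor(rep(c(0, 1), times = c(300, 700)))
train_idx <- stratified_split(y, p = 2/3)

length(train_idx)                 # number of training rows
prop.table(table(y))              # class shares overall
prop.table(table(y[train_idx]))   # class shares in the training part
prop.table(table(y[-train_idx]))  # class shares in the held-out part
```

Sampling within each class keeps the held-out class shares close to the overall 30/70 split, which a purely random split only does on average.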
# Train the SVM model
svm_fit <- train(credit_risk ~ ., data=training, method = "svmRadial", preProc = c("center","scale"))
B. Output the model
Hint: explore finalModel in the model that would be created in F.
svm_fit
## Support Vector Machines with Radial Basis Function Kernel
##
## 667 samples
## 7 predictor
## 2 classes: '0', '1'
##
## Pre-processing: centered (7), scaled (7)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 667, 667, 667, 667, 667, 667, ...
## Resampling results across tuning parameters:
##
## C Accuracy Kappa
## 0.25 0.6987006 0.01570214
## 0.50 0.6971104 0.04523360
## 1.00 0.6931175 0.06907547
##
## Tuning parameter 'sigma' was held constant at a value of 0.15342
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.15342 and C = 0.25.
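The output above shows caret's defaults: three values of C, sigma held constant, and 25 bootstrap resamples. As a sketch of how the search could be widened, the block below passes an explicit grid and 5-fold cross-validation to train(). It runs on simulated toy data, not the credit data, and the names `toy`, `ctrl`, and `grid` as well as the grid values are illustrative assumptions:

```r
library(caret)    # train(), trainControl()
library(kernlab)  # backend used by method = "svmRadial"

set.seed(123)
# Toy two-class data: the class depends on distance from the origin
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- factor(ifelse(toy$x1^2 + toy$x2^2 > 1.5, "outer", "inner"))

# 5-fold cross-validation instead of the default bootstrap
ctrl <- trainControl(method = "cv", number = 5)

# Try every combination of these sigma and C values (9 candidates)
grid <- expand.grid(sigma = c(0.1, 0.5, 1), C = c(0.25, 1, 4))

toy_fit <- train(y ~ ., data = toy, method = "svmRadial",
                 preProc = c("center", "scale"),
                 trControl = ctrl, tuneGrid = grid)

toy_fit$bestTune    # the sigma/C pair with the best resampled accuracy
toy_fit$finalModel  # the underlying kernlab model refit with that pair
```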
svm_pred <- predict(svm_fit, newdata=testing)
head(svm_pred)
## [1] 1 1 1 1 1 1
## Levels: 0 1
conf_matrix <- table(svm_pred, testing$credit_risk)
conf_matrix
##
## svm_pred 0 1
## 0 0 0
## 1 100 233
prop.table(conf_matrix)
##
## svm_pred 0 1
## 0 0.0000000 0.0000000
## 1 0.3003003 0.6996997
error_rate <- (sum(conf_matrix) -
sum(diag(conf_matrix))) /
sum(conf_matrix)
error_rate
## [1] 0.3003003
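# With an error_rate of 0.3003003, our SVM model has an accuracy of about 70%. Note that the model predicted class 1 for every test case, so this accuracy is simply the share of class 1 in the test set.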
confusion <- confusionMatrix(svm_pred, testing$credit_risk)
confusion
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 0 0
## 1 100 233
##
## Accuracy : 0.6997
## 95% CI : (0.6473, 0.7485)
## No Information Rate : 0.6997
## P-Value [Acc > NIR] : 0.527
##
## Kappa : 0
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.0000
## Specificity : 1.0000
## Pos Pred Value : NaN
## Neg Pred Value : 0.6997
## Prevalence : 0.3003
## Detection Rate : 0.0000
## Detection Prevalence : 0.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : 0
##
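To see why the accuracy above equals the No Information Rate and Kappa is 0, the same statistics can be recomputed by hand from the four counts in the confusion matrix. This is a base-R sketch; the variable names are illustrative:

```r
# Counts copied from the SVM confusion matrix above
# (rows = predicted class, columns = actual class)
svm_counts <- matrix(c(0, 100, 0, 233), nrow = 2,
                     dimnames = list(predicted = c("0", "1"),
                                     actual = c("0", "1")))

n <- sum(svm_counts)
accuracy <- sum(diag(svm_counts)) / n

# Accuracy of always predicting the most common actual class
no_info_rate <- max(colSums(svm_counts)) / n

# Cohen's kappa: agreement beyond what the row and column totals alone would give
expected <- sum(rowSums(svm_counts) * colSums(svm_counts)) / n^2
cohen_kappa <- (accuracy - expected) / (1 - expected)

round(c(accuracy = accuracy, no_info_rate = no_info_rate,
        kappa = cohen_kappa), 4)
```

Because every prediction is class 1, the model does exactly as well as the majority-class baseline, so the agreement beyond chance is zero.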
# 1) A "test" dataset is valuable because a model should be evaluated on data it has never seen before. Accuracy measured on the training data is optimistic, since the model may have fit quirks of that particular sample (overfitting); held-out data gives a more honest estimate of how it will perform on new cases.
# 2) One ethical concern is that a significant predictor of the outcome of interest may be an individual's race. If a model like this were implemented by the federal government, it may be in direct violation of the Equal Protection Clause of the 14th Amendment of the United States Constitution, as well as Title VII of the Civil Rights Act of 1964. The concerns are not necessarily only legal: the classification may also give an unfair advantage to some underlying population that wasn't considered in the model.
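The first answer above can be illustrated with a deliberately extreme sketch: a 1-nearest-neighbor classifier, which effectively memorizes its training data. The data are simulated and `nn1_predict` is a hypothetical helper; none of this comes from the credit dataset or the models in this homework.

```r
# Hypothetical helper: give each new point the label of its closest training point
nn1_predict <- function(train_x, train_y, new_x) {
  nearest <- apply(new_x, 1, function(p) {
    which.min(colSums((t(train_x) - p)^2))  # squared Euclidean distances
  })
  train_y[nearest]
}

set.seed(123)
n <- 300
x <- matrix(rnorm(2 * n), ncol = 2)

# True rule: class follows the sign of the first column, with about 20% of labels flipped
y <- ifelse(x[, 1] > 0, 1, 0)
flip <- runif(n) < 0.2
y[flip] <- 1 - y[flip]

train_idx <- sample.int(n, size = 200)
train_x <- x[train_idx, ]
train_y <- y[train_idx]

train_acc <- mean(nn1_predict(train_x, train_y, train_x) == train_y)
test_acc  <- mean(nn1_predict(train_x, train_y, x[-train_idx, ]) == y[-train_idx])
c(train = train_acc, test = test_acc)
```

Each training point is its own nearest neighbor, so training accuracy is perfect by construction, while the held-out accuracy is lower because the memorized label noise does not carry over to new points.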
A. Build a model with rpart
Note: you might need to install the e1071 package
# Note: method = "treebag" fits bagged classification trees through caret, not a single rpart tree
tree_fit <- train(credit_risk ~ ., data = training, method = "treebag", preProc = c("center","scale"))
# varImp(tree_fit)
B. Visualize the results using rpart.plot()
cart_tree <- rpart(credit_risk ~ ., data = training, method = "class")
prp(cart_tree, faclen = 0, cex = 0.8, extra = 1)
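The plotted tree (cart_tree) is a single rpart tree that is fit separately from tree_fit. As a minimal sketch of how predictions and a confusion table can be taken directly from a single rpart tree, the block below uses the small kyphosis example data bundled with rpart rather than the credit data; the object names are illustrative:

```r
library(rpart)  # also provides the `kyphosis` example dataset (81 rows)

set.seed(123)
# Hold out about one-third of the rows for testing
test_rows <- sample.int(nrow(kyphosis), size = 27)

single_tree <- rpart(Kyphosis ~ Age + Number + Start,
                     data = kyphosis[-test_rows, ], method = "class")

# type = "class" returns predicted labels rather than class probabilities
single_pred <- predict(single_tree, newdata = kyphosis[test_rows, ],
                       type = "class")

table(predicted = single_pred, actual = kyphosis$Kyphosis[test_rows])
```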
C. Use the predict() function to predict the testData, and then generate a confusion matrix to explore the results
tree_pred <- predict(tree_fit, testing)
tree_confusion <- confusionMatrix(tree_pred, testing$credit_risk)
tree_confusion
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 35 28
## 1 65 205
##
## Accuracy : 0.7207
## 95% CI : (0.6692, 0.7683)
## No Information Rate : 0.6997
## P-Value [Acc > NIR] : 0.2195105
##
## Kappa : 0.257
##
## Mcnemar's Test P-Value : 0.0001892
##
## Sensitivity : 0.3500
## Specificity : 0.8798
## Pos Pred Value : 0.5556
## Neg Pred Value : 0.7593
## Prevalence : 0.3003
## Detection Rate : 0.1051
## Detection Prevalence : 0.1892
## Balanced Accuracy : 0.6149
##
## 'Positive' Class : 0
##
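The per-class statistics above follow directly from the four counts in the tree's confusion matrix. A base-R sketch of the arithmetic, treating class 0 as the "positive" class as caret does here (variable names are illustrative):

```r
# Counts copied from the tree confusion matrix above
# (rows = predicted class, columns = actual class)
tree_counts <- matrix(c(35, 65, 28, 205), nrow = 2,
                      dimnames = list(predicted = c("0", "1"),
                                      actual = c("0", "1")))

# Sensitivity: share of actual class-0 cases that were predicted as 0
sensitivity <- tree_counts["0", "0"] / sum(tree_counts[, "0"])
# Specificity: share of actual class-1 cases that were predicted as 1
specificity <- tree_counts["1", "1"] / sum(tree_counts[, "1"])
# Positive predictive value: share of class-0 predictions that were correct
pos_pred_value <- tree_counts["0", "0"] / sum(tree_counts["0", ])
# Balanced accuracy: the mean of sensitivity and specificity
balanced_accuracy <- (sensitivity + specificity) / 2

round(c(sensitivity = sensitivity, specificity = specificity,
        pos_pred_value = pos_pred_value,
        balanced_accuracy = balanced_accuracy), 4)
```

The sensitivity of 0.35 means the model identifies only 35 of the 100 class-0 borrowers in the test set, even though overall accuracy is about 72%.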
D. Review the attributes being used for this credit decision. Are there any that might not be appropriate with respect to fairness? If so, which attribute, and how would you address this fairness situation? Answer in a comment block below.
varImp(tree_fit)
## treebag variable importance
##
## Overall
## amount 100.00
## age 77.19
## duration 52.55
## present_residence 27.94
## installment_rate 25.57
## credit_history 11.33
## people_liable 0.00
# The age attribute might not be appropriate with respect to fairness. Age is a protected characteristic under US fair-lending law (the Equal Credit Opportunity Act), and older individuals are typically favored from a bank's credit perspective, so a model that leans on age can disadvantage younger applicants. Here age is the second most important variable in the model. I'd address this by not allowing age as a decisioning factor in the model that is ultimately implemented.
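One way to keep age out of a model is to remove it in the formula. The sketch below shows the `. - age` syntax with a plain logistic regression on made-up data. The data frame `toy` only borrows a few column names from `cred`, and glm() stands in for the models used above; the same formula syntax should also be accepted by train() and rpart(), though that is not run here.

```r
set.seed(123)
n <- 200
# Made-up data with a few of the same column names as `cred`
toy <- data.frame(amount = rlnorm(n, meanlog = 8),
                  duration = sample(6:60, n, replace = TRUE),
                  age = sample(19:75, n, replace = TRUE))
toy$credit_risk <- factor(rbinom(n, 1, 0.7))

# `. - age` means "every other column except age"
fit_no_age <- glm(credit_risk ~ . - age, data = toy, family = binomial)

# age does not appear among the fitted terms
names(coef(fit_no_age))
```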