# Enter your name here: Gil Raitses
# 1. I did this homework by myself, with help from the book and the professor.
Supervised learning means that there is a criterion one is
trying to predict. The typical strategy is to divide
data into a training set and a test
set (for example, two-thirds training and
one-third test), train the model on the training set,
and then see how well the model does on the test set.
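The split described above can be sketched in base R. This is only an illustration: it uses R's built-in iris data as a stand-in and a plain random sample, not the stratified caret partition used later in this homework.

```r
# Random two-thirds / one-third holdout split on R's built-in iris data
# (iris is only a stand-in; it is not the credit dataset used in this homework).
set.seed(123)                                     # make the split reproducible
n <- nrow(iris)
train_rows <- sample(n, size = round(2 / 3 * n))  # row numbers used for training
train_set <- iris[train_rows, ]
test_set  <- iris[-train_rows, ]                  # every row that was not sampled
c(train = nrow(train_set), test = nrow(test_set))
```

The model would then be fit on train_set only, and test_set would be held back until evaluation.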
Support vector machines (SVM) are a highly flexible and powerful method of doing supervised machine learning.
Another approach is to use partition trees (rpart).
In this homework, we will use another banking dataset to train an SVM
model, as well as an rpart model, to classify potential
borrowers into 2 groups of credit risk – reliable
borrowers and borrowers posing a risk. You can
learn more about the variables in the dataset here:
https://archive.ics.uci.edu/ml/datasets/Statlog+%28German+Credit+Data%29
This kind of classification algorithm is used in many aspects of our
lives – from credit card approvals to stock market predictions, and even
some medical diagnoses.
https://intro-datascience.s3.us-east-2.amazonaws.com/GermanCredit.csv
You will also need to install.packages( ) and library( ) several other libraries, such as kernlab and caret.
#install.packages(c("kernlab", "caret", "rpart"))
library(kernlab)
library(caret)
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:kernlab':
##
## alpha
## Loading required package: lattice
library(rpart)
credit <- read.csv("https://intro-datascience.s3.us-east-2.amazonaws.com/GermanCredit.csv")
cred <- data.frame(duration=credit$duration,
amount=credit$amount,
installment_rate=credit$installment_rate,
present_residence=credit$present_residence,
age=credit$age,
number_credits=credit$number_credits,
people_liable=credit$people_liable,
credit_risk=as.factor(credit$credit_risk))
cred <- data.frame(
duration = credit$duration, # Duration of the credit in months. (Numeric)
amount = credit$amount, # Credit amount. (Numeric)
installment_rate = credit$installment_rate, # Installment rate as a percentage of disposable income. (Ordinal in meaning; read.csv() loads it as a plain column, not an ordered factor)
present_residence = credit$present_residence, # Length of time (in years) the borrower has lived at their current residence. (Numeric)
age = credit$age, # Age of the borrower in years. (Numeric)
number_credits = credit$number_credits, # Number of credits at this bank. (Numeric)
people_liable = credit$people_liable, # Number of people liable to provide maintenance for the borrower. (Numeric)
credit_risk = as.factor(credit$credit_risk) # Credit risk of the borrower. This is the target variable, a factor with the two levels 0 and 1. (Factor)
)
# 'installment_rate' represents ordered percentage ranges of disposable income, but it is used here as loaded by read.csv(), not converted to an ordered factor.
# The other predictors are numeric, and 'credit_risk' is the target factor.
# Set a seed for reproducibility
set.seed(123)
# Create a partition index for the training set (67% of the data) based on the 'credit_risk' variable
trainIndex <- createDataPartition(cred$credit_risk, p = 0.67, list = FALSE)
# Create training and testing datasets
credTrain <- cred[trainIndex, ]
credTest <- cred[-trainIndex, ]
# Verify dimensions of the resulting datasets
train_dims <- dim(credTrain)
test_dims <- dim(credTest)
# Display the dimensions of the training and test datasets
print(paste("Training set dimensions: ", train_dims[1], "rows and", train_dims[2], "columns"))
## [1] "Training set dimensions: 670 rows and 8 columns"
print(paste("Test set dimensions: ", test_dims[1], "rows and", test_dims[2], "columns"))
## [1] "Test set dimensions: 330 rows and 8 columns"
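createDataPartition() samples within each class of the outcome, so the training and test sets keep roughly the same class balance. The base R sketch below imitates that idea on made-up labels. The 300/700 label vector is an assumption chosen to mirror the 30/70 balance visible in the test-set output of this homework, and the per-class sampling is a simplification, not caret's actual implementation.

```r
# Stratified 67% / 33% split in base R on made-up labels.
# Assumption: 300 cases of class 0 and 700 of class 1 (illustrative only).
set.seed(123)
y <- factor(rep(c(0, 1), times = c(300, 700)))
p <- 0.67

# Sample the same fraction of row numbers inside each class separately.
by_class <- split(seq_along(y), y)
train_idx <- unlist(lapply(by_class, function(idx) {
  idx[sample.int(length(idx), size = round(p * length(idx)))]
}))
train_idx <- sort(unname(train_idx))

y_train <- y[train_idx]
y_test  <- y[-train_idx]

# Class proportions are preserved in both pieces.
prop.table(table(y_train))
prop.table(table(y_test))
```

A plain random sample would only match these proportions on average, which matters more as the minority class gets smaller.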
library(caret)
library(kernlab)
# Set a seed for reproducibility
set.seed(123)
# Build the SVM model using all variables to predict credit_risk
svmModel <- train(credit_risk ~ ., # Define the formula for the model
data = credTrain, # Specify the training dataset
method = "svmRadial", # Use the SVM with radial basis function kernel
trControl = trainControl(method = "cv", number = 10), # Use 10-fold cross-validation
preProcess = c("center", "scale")) # Center and scale the predictors
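The preProcess = c("center", "scale") option standardizes each predictor: subtract the mean, then divide by the standard deviation. A minimal base R sketch of that arithmetic follows. The duration values are hypothetical, and the point that the training-set mean and standard deviation are reused for new data reflects how caret is understood to apply preprocessing, stated here as an assumption.

```r
# Centering and scaling by hand on hypothetical credit durations (months).
train_duration <- c(12, 24, 36, 48, 60)     # made-up training values
mu <- mean(train_duration)
s  <- sd(train_duration)

z_train <- (train_duration - mu) / s        # standardized training values
z_check <- as.numeric(scale(train_duration))  # base R scale() does the same

# New (test) values are transformed with the TRAINING mean and sd,
# so no information from the test set leaks into the preprocessing.
new_duration <- c(18, 72)                   # made-up test values
z_new <- (new_duration - mu) / s

# A standardized value converts back to the original units like this:
back <- z_train * s + mu
```

The same back-transformation is what turns a standardized split point in a fitted model into a value in original units.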
B. Output the model
Hint: explore finalModel in the model
# Output the final model
print(svmModel$finalModel) # Print the details of the final SVM model
## Support Vector Machine object of class "ksvm"
##
## SV type: C-svc (classification)
## parameter : cost C = 0.25
##
## Gaussian Radial Basis kernel function.
## Hyperparameter : sigma = 0.137594931035687
##
## Number of Support Vectors : 447
##
## Objective Function Value : -98.7836
## Training error : 0.3
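The sigma value in this output is the width parameter of the Gaussian radial basis kernel. As a minimal sketch, assuming the kernlab parameterization k(x, y) = exp(-sigma * ||x - y||^2), the kernel value for two points can be computed in base R. The helper name rbf_kernel and the two example points are made up for illustration.

```r
# Gaussian RBF kernel under the assumed form exp(-sigma * squared distance).
rbf_kernel <- function(x, y, sigma) {
  exp(-sigma * sum((x - y)^2))
}

sigma <- 0.137594931035687   # value reported in the model output above

a <- c(0, 0)                 # hypothetical (already scaled) points
b <- c(1, 1)

k_same <- rbf_kernel(a, a, sigma)   # identical points: similarity is 1
k_diff <- rbf_kernel(a, b, sigma)   # similarity shrinks as distance grows
```

A larger sigma makes the similarity fall off faster with distance, which gives a more local and more flexible decision boundary.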
# Validate the model against test data
svmPred <- predict(svmModel, credTest) # Use the predict() function to make predictions on the test data
# Review the contents of svmPred
head(svmPred) # Display the first few predictions
## [1] 1 1 1 1 1 1
## Levels: 0 1
# Calculate the confusion matrix
# Creates a table to compare the predicted classifications (svmPred) with the actual classifications (credTest$credit_risk)
confMatrix <- table(Predicted = svmPred, Actual = credTest$credit_risk)
# Display the confusion matrix
print(confMatrix)
## Actual
## Predicted 0 1
## 0 0 0
## 1 99 231
The diag( ) command can be applied to the results of the table command you ran in the previous step. You can also use sum( ) to get the total of all four cells. Accuracy: sum of the diagonal divided by the table total. Error rate: 1 minus the accuracy.
# Calculate accuracy
# Sum of true positives and true negatives
correct_predictions <- sum(diag(confMatrix))
# This gives the total number of correctly classified instances
# Total number of predictions
total_predictions <- sum(confMatrix)
# This gives the sum of all the entries in the confusion matrix, representing the total number of instances
# Calculate accuracy
accuracy <- correct_predictions / total_predictions
# Accuracy is the ratio of correctly classified instances to the total number of instances
# Display the accuracy
accuracy
## [1] 0.7
library(caret)
# Use confusionMatrix() from the caret package to compute the confusion matrix and its summary statistics
confMatrix_caret <- confusionMatrix(svmPred, credTest$credit_risk)
print(confMatrix_caret)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 0 0
## 1 99 231
##
## Accuracy : 0.7
## 95% CI : (0.6474, 0.749)
## No Information Rate : 0.7
## P-Value [Acc > NIR] : 0.5271
##
## Kappa : 0
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.0
## Specificity : 1.0
## Pos Pred Value : NaN
## Neg Pred Value : 0.7
## Prevalence : 0.3
## Detection Rate : 0.0
## Detection Prevalence : 0.0
## Balanced Accuracy : 0.5
##
## 'Positive' Class : 0
##
# Extract and print the accuracy from the confusion matrix result
confMatrix_caret_accuracy <- confMatrix_caret$overall['Accuracy']
confMatrix_caret_accuracy # Display the accuracy calculated by the confusionMatrix function
## Accuracy
## 0.7
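An accuracy of 0.7 hides the fact that this SVM predicts class 1 for every test case. The sketch below recomputes several of the statistics above directly from the four cell counts, using the standard definitions with class 0 as the positive class. The function name binary_metrics is made up for illustration, and the formulas are the textbook ones, assumed to match what caret reports.

```r
# Rebuild the SVM confusion matrix from the counts printed above.
svm_cm <- matrix(c(0, 99, 0, 231), nrow = 2,
                 dimnames = list(Predicted = c("0", "1"),
                                 Actual    = c("0", "1")))

# Textbook binary-classification metrics from a 2x2 table.
binary_metrics <- function(cm, positive = "0") {
  negative <- setdiff(colnames(cm), positive)
  tp <- cm[positive, positive]   # predicted positive, actually positive
  fn <- cm[negative, positive]   # predicted negative, actually positive
  tn <- cm[negative, negative]
  fp <- cm[positive, negative]
  n  <- sum(cm)

  accuracy    <- (tp + tn) / n
  sensitivity <- tp / (tp + fn)
  specificity <- tn / (tn + fp)
  # Agreement expected by chance, used by Cohen's kappa.
  expected <- sum(rowSums(cm) * colSums(cm)) / n^2
  kappa    <- (accuracy - expected) / (1 - expected)

  c(accuracy = accuracy, sensitivity = sensitivity,
    specificity = specificity,
    balanced_accuracy = (sensitivity + specificity) / 2,
    kappa = kappa)
}

svm_metrics <- binary_metrics(svm_cm, positive = "0")
svm_metrics
```

A sensitivity of 0 and a kappa of 0 mean the model never identifies a class 0 borrower and does no better than always guessing the majority class.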
# 1) Evaluating on a separate test set exposes overfitting (where the model learns noise and details specific to the training data): it shows whether the model generalizes to new data and gives an unbiased estimate of its performance.
# 2) Ethical challenges include bias, lack of transparency, privacy concerns, and significant impacts on individuals' lives, such as unfair loan denials.
A. Build a model with rpart
Note: you might need to install the
e1071 package
#install.packages("e1071")
library(caret) # For model training and evaluation
library(rpart) # For decision tree modeling
library(e1071) # Required by caret for certain models
# Set a seed for reproducibility
set.seed(123)
# Use the train() function from the caret package to build the decision tree model using all variables to predict credit_risk
treeModel <- train(credit_risk ~ ., # Define the formula for the model
data = credTrain, # Specify the training dataset
method = "rpart", # Use the rpart method for decision tree
trControl = trainControl(method = "cv", number = 10), # Use 10-fold cross-validation
preProcess = c("center", "scale")) # Center and scale the predictors
B. Visualize the results using rpart.plot()
#install.packages("rpart.plot")
library(rpart.plot) # For visualizing the decision tree
# Visualize the decision tree using rpart.plot
rpart.plot(treeModel$finalModel) # Plot the decision tree
C. Use the predict() function to predict the testData, and then generate a confusion matrix to explore the results
# Predict the test data using the decision tree model
treePred <- predict(treeModel, credTest, type = "raw") # Use type = "raw" for class predictions
# Generate a confusion matrix to explore the results
treeConfMatrix <- confusionMatrix(treePred, credTest$credit_risk)
# Display the confusion matrix
print(treeConfMatrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 15 17
## 1 84 214
##
## Accuracy : 0.6939
## 95% CI : (0.6411, 0.7432)
## No Information Rate : 0.7
## P-Value [Acc > NIR] : 0.6208
##
## Kappa : 0.0966
##
## Mcnemar's Test P-Value : 5.125e-11
##
## Sensitivity : 0.15152
## Specificity : 0.92641
## Pos Pred Value : 0.46875
## Neg Pred Value : 0.71812
## Prevalence : 0.30000
## Detection Rate : 0.04545
## Detection Prevalence : 0.09697
## Balanced Accuracy : 0.53896
##
## 'Positive' Class : 0
##
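Both models can be compared against the No Information Rate, the accuracy obtained by always predicting the most common class. The sketch below does this from the two printed confusion matrices. The helper name nir_check is made up, and the one-sided exact binomial test is an assumption about how the "P-Value [Acc > NIR]" line is computed.

```r
# Confusion matrices rebuilt from the counts printed in this homework.
svm_cm  <- matrix(c(0, 99, 0, 231), nrow = 2,
                  dimnames = list(Predicted = c("0", "1"), Actual = c("0", "1")))
tree_cm <- matrix(c(15, 84, 17, 214), nrow = 2,
                  dimnames = list(Predicted = c("0", "1"), Actual = c("0", "1")))

# Accuracy, No Information Rate, and a one-sided test of accuracy > NIR.
nir_check <- function(cm) {
  n       <- sum(cm)
  correct <- sum(diag(cm))
  nir     <- max(colSums(cm)) / n          # share of the majority class
  p_value <- binom.test(correct, n, p = nir,
                        alternative = "greater")$p.value
  c(accuracy = correct / n, nir = nir, p_value = p_value)
}

comparison <- rbind(svm = nir_check(svm_cm), tree = nir_check(tree_cm))
round(comparison, 4)
```

Neither model's accuracy exceeds the baseline, although the tree at least identifies some class 0 borrowers, which the SVM does not.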
D. Review the attributes being used for this credit decision. Are there any that might not be appropriate, with respect to fairness? If so, which attribute, and how would you address this fairness situation? Answer in a comment block below.
# The attribute 'age' is used in the decision tree (e.g., the node 'age < -0.95', where age is in standardized units because the predictors were centered and scaled), which might lead to age discrimination.
# To address this, remove 'age' from the model or use fairness techniques such as reweighting, adversarial debiasing, or a disparate impact remover, so that decisions focus on financial performance and ability to repay.
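One way to check whether a model treats age groups differently is a disparate impact ratio: the rate of favorable predictions in one group divided by the rate in another. The sketch below is illustrative only. The predictions and age groups are entirely made up, the function name disparate_impact is hypothetical, and treating class 1 as the favorable outcome is an assumption.

```r
# Disparate impact ratio: favorable-prediction rate of one group
# divided by the favorable-prediction rate of a reference group.
disparate_impact <- function(pred, group, favorable, unprivileged, privileged) {
  rate <- function(g) mean(pred[group == g] == favorable)
  rate(unprivileged) / rate(privileged)
}

# Entirely made-up predictions and age groups, for illustration only.
pred  <- c(1, 0, 0, 0, 1, 1, 1, 0)
group <- c("young", "young", "young", "young",
           "older", "older", "older", "older")

di <- disparate_impact(pred, group, favorable = 1,
                       unprivileged = "young", privileged = "older")
di
```

A commonly cited rule of thumb flags ratios below 0.8 for review. If the attribute is dropped instead, the formula in the train() call could be written as credit_risk ~ . - age.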