Intro to Data Science HW 8

Copyright Jeffrey Stanton, Jeffrey Saltz, and Jasmina Tacheva

# Enter your name here: Gil Raitses

Attribution statement: (choose only one and delete the rest)

# 1. I did this homework by myself, with help from the book and the professor.

Supervised learning means that there is a criterion one is trying to predict. The typical strategy is to divide data into a training set and a test set (for example, two-thirds training and one-third test), train the model on the training set, and then see how well the model does on the test set.

Support vector machines (SVM) are a highly flexible and powerful method of doing supervised machine learning.

Another approach is to use partition trees (rpart).

In this homework, we will use another banking dataset to train an SVM model, as well as an rpart model, to classify potential borrowers into 2 groups of credit risk – reliable borrowers and borrowers posing a risk. You can learn more about the variables in the dataset here:
https://archive.ics.uci.edu/ml/datasets/Statlog+%28German+Credit+Data%29

Classification algorithms of this kind are used in many aspects of our lives, from credit card approvals to stock market predictions and even some medical diagnoses.

Part 1: Load and condition the data

Read the contents of the following .csv file into a dataframe called credit:

https://intro-datascience.s3.us-east-2.amazonaws.com/GermanCredit.csv

You will also need to install.packages( ) and library( ) several other libraries, such as kernlab and caret.

#install.packages(c("kernlab", "caret", "rpart"))
library(kernlab)
library(caret)

## Loading required package: ggplot2

## 
## Attaching package: 'ggplot2'

## The following object is masked from 'package:kernlab':
## 
##     alpha

## Loading required package: lattice

library(rpart)
credit <- read.csv("https://intro-datascience.s3.us-east-2.amazonaws.com/GermanCredit.csv")

Which variable contains the outcome we are trying to predict, credit risk? For the purposes of this analysis, we will focus only on the numeric variables and save them in a new dataframe called cred:

cred <- data.frame(duration=credit$duration, 
                   amount=credit$amount, 
                   installment_rate=credit$installment_rate, 
                   present_residence=credit$present_residence, 
                   age=credit$age, 
                   number_credits=credit$number_credits, 
                   people_liable=credit$people_liable, 
                   credit_risk=as.factor(credit$credit_risk))

Although all variables in cred except credit_risk are coded as numeric, the values of one of them are ordered categories rather than actual numbers. In consultation with the data description link from the intro, write a comment identifying the factor variable and briefly describe each variable in the dataframe.

cred <- data.frame(
  duration = credit$duration,  # Duration of the credit in months. (Numeric)
  amount = credit$amount,  # Credit amount. (Numeric)
  installment_rate = credit$installment_rate,  # Installment rate as a percentage of disposable income. (Ordered factor)
  present_residence = credit$present_residence,  # Length of time (in years) the borrower has lived at their current residence. (Numeric)
  age = credit$age,  # Age of the borrower in years. (Numeric)
  number_credits = credit$number_credits,  # Number of credits at this bank. (Numeric)
  people_liable = credit$people_liable,  # Number of people liable to provide maintenance for the borrower. (Numeric)
  credit_risk = as.factor(credit$credit_risk)  # Credit risk of the borrower, coded 0/1 in this file. This is the target variable. (Factor)
)

# 'installment_rate' is an ordered factor with values representing percentage ranges of disposable income.
# Other variables are numeric, and 'credit_risk' is the target factor.

Part 2: Create training and test data sets

Using techniques discussed in class, create two datasets – one for training and one for testing.

# Set a seed for reproducibility
set.seed(123)

# Create a partition index for the training set (67% of the data) based on the 'credit_risk' variable
trainIndex <- createDataPartition(cred$credit_risk, p = 0.67, list = FALSE)

# Create training and testing datasets
credTrain <- cred[trainIndex, ]
credTest <- cred[-trainIndex, ]

Use the dim( ) function to demonstrate that the resulting training data set and test data set contain the appropriate number of cases.

# Verify dimensions of the resulting datasets
train_dims <- dim(credTrain)
test_dims <- dim(credTest)

# Display the dimensions of the training and test datasets
print(paste("Training set dimensions: ", train_dims[1], "rows and", train_dims[2], "columns"))

## [1] "Training set dimensions:  670 rows and 8 columns"

print(paste("Test set dimensions: ", test_dims[1], "rows and", test_dims[2], "columns"))

## [1] "Test set dimensions:  330 rows and 8 columns"
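The split above is stratified on credit_risk, so both subsets should keep roughly the same class balance. The sketch below illustrates that idea by hand in base R. It is not caret's internal code: the outcome vector is made up (1000 cases with a 30/70 split), and the names outcome, train_idx, train_y, and test_y are hypothetical.

```r
# Illustrative outcome vector (hypothetical data, not the credit file):
# 1000 cases with a 30/70 class split.
set.seed(123)
outcome <- factor(rep(c(0, 1), times = c(300, 700)))

# Stratified sampling by hand: draw 67% of the row indices within each class.
train_idx <- unlist(
  lapply(split(seq_along(outcome), outcome), function(idx) {
    sample(idx, size = round(0.67 * length(idx)))
  }),
  use.names = FALSE
)

train_y <- outcome[train_idx]
test_y  <- outcome[-train_idx]

# Class proportions should be close to 0.3 / 0.7 in both subsets.
train_props <- prop.table(table(train_y))
test_props  <- prop.table(table(test_y))
print(round(train_props, 3))
print(round(test_props, 3))
```

Running the same prop.table(table(...)) check on credTrain$credit_risk and credTest$credit_risk is a quick way to confirm that the real split behaved the same way.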

Part 3: Build a Model using SVM

A. Using the caret package, build a support vector model using all of the variables to predict credit_risk.

library(caret)
library(kernlab)
# Set a seed for reproducibility
set.seed(123)

# Build the SVM model using all variables to predict credit_risk
svmModel <- train(credit_risk ~ .,                # Define the formula for the model
                  data = credTrain,               # Specify the training dataset
                  method = "svmRadial",           # Use the SVM with radial basis function kernel
                  trControl = trainControl(method = "cv", number = 10),  # Use 10-fold cross-validation
                  preProcess = c("center", "scale"))  # Center and scale the predictors

B. Output the model

Hint: explore finalModel in the model

# Output the final model
print(svmModel$finalModel)  # Print the details of the final SVM model

## Support Vector Machine object of class "ksvm" 
## 
## SV type: C-svc  (classification) 
##  parameter : cost C = 0.25 
## 
## Gaussian Radial Basis kernel function. 
##  Hyperparameter : sigma =  0.137594931035687 
## 
## Number of Support Vectors : 447 
## 
## Objective Function Value : -98.7836 
## Training error : 0.3
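The printout reports a Gaussian radial basis kernel with a sigma hyperparameter. As a rough illustration of what that kernel measures, the sketch below computes the similarity between two observations, assuming the parameterization exp(-sigma * ||x - y||^2). The function name rbf_kernel and the two observations are made up for this example.

```r
# Gaussian RBF kernel, assuming the form exp(-sigma * ||x - y||^2).
rbf_kernel <- function(x, y, sigma) {
  exp(-sigma * sum((x - y)^2))
}

sigma <- 0.137594931035687  # value reported in the model printout above

# Two made-up, already standardized observations (7 predictors each).
a <- c(0.5, -1.0, 0.2, 0.0, 1.3, -0.4, 0.0)
b <- c(0.4, -0.8, 0.1, 0.0, 1.0, -0.4, 0.0)

same    <- rbf_kernel(a, a, sigma)       # identical points give exactly 1
nearby  <- rbf_kernel(a, b, sigma)       # nearby points give a value just below 1
distant <- rbf_kernel(a, -5 * b, sigma)  # distant points give a value near 0
print(c(same = same, nearby = nearby, distant = distant))
```

A larger sigma makes the similarity fall off faster with distance, so each support vector influences a smaller neighborhood.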

Part 4: Predict Values in the Test Data and Create a Confusion Matrix

Use the predict( ) function to validate the model against test data. Store the predictions in a variable named svmPred.

# Validate the model against test data
svmPred <- predict(svmModel, credTest)  # Use the predict() function to make predictions on the test data

The svmPred object contains a list of classifications for reliable (=0) or risky (=1) borrowers. Review the contents of svmPred using head().

# Review the contents of svmPred
head(svmPred)  # Display the first few predictions

## [1] 1 1 1 1 1 1
## Levels: 0 1

Calculate a confusion matrix using the table function.

# Calculate the confusion matrix
# Creates a table to compare the predicted classifications (svmPred) with the actual classifications (credTest$credit_risk)
confMatrix <- table(Predicted = svmPred, Actual = credTest$credit_risk)  


# Display the confusion matrix
print(confMatrix)

##          Actual
## Predicted   0   1
##         0   0   0
##         1  99 231

What is the accuracy based on what you see in the confusion matrix?

The diag( ) command can be applied to the results of the table command you ran in the previous step. You can also use sum( ) to get the total of all four cells. Accuracy is the sum of the diagonal divided by the total of the table; the error rate is 1 minus the accuracy.

# Calculate accuracy

# Sum of true positives and true negatives
correct_predictions <- sum(diag(confMatrix))  
# This gives the total number of correctly classified instances

# Total number of predictions
total_predictions <- sum(confMatrix)  
# This gives the sum of all the entries in the confusion matrix, representing the total number of instances

# Calculate accuracy
accuracy <- correct_predictions / total_predictions 
# Accuracy is the ratio of correctly classified instances to the total number of instances

# Display the accuracy
accuracy

## [1] 0.7

# Print the accuracy to the console

Compare your calculations with the confusionMatrix() function from the caret package.

library(caret)

# Use confusionMatrix() from the caret package to calculate and display the confusion matrix
confMatrix_caret <- confusionMatrix(svmPred, credTest$credit_risk)
print(confMatrix_caret)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0   0   0
##          1  99 231
##                                          
##                Accuracy : 0.7            
##                  95% CI : (0.6474, 0.749)
##     No Information Rate : 0.7            
##     P-Value [Acc > NIR] : 0.5271         
##                                          
##                   Kappa : 0              
##                                          
##  Mcnemar's Test P-Value : <2e-16         
##                                          
##             Sensitivity : 0.0            
##             Specificity : 1.0            
##          Pos Pred Value : NaN            
##          Neg Pred Value : 0.7            
##              Prevalence : 0.3            
##          Detection Rate : 0.0            
##    Detection Prevalence : 0.0            
##       Balanced Accuracy : 0.5            
##                                          
##        'Positive' Class : 0              
##

# Extract and print the accuracy from the confusion matrix result
confMatrix_caret_accuracy <- confMatrix_caret$overall['Accuracy']
confMatrix_caret_accuracy  # Display the accuracy calculated by the confusionMatrix function

## Accuracy 
##      0.7
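The 0.7 accuracy needs to be read with care: every test case was predicted as class 1, so the accuracy simply equals the share of the larger class (the no-information rate). The sketch below rebuilds the confusion matrix from the counts printed above and recomputes a few of the reported statistics by hand in base R. The variable names are chosen for this example only.

```r
# Rebuild the confusion matrix printed above (rows = predicted, columns = actual).
cm <- matrix(c(0, 99, 0, 231), nrow = 2,
             dimnames = list(Predicted = c("0", "1"), Actual = c("0", "1")))

n <- sum(cm)
accuracy <- sum(diag(cm)) / n                  # (0 + 231) / 330
nir <- max(colSums(cm)) / n                    # share of the largest actual class
sensitivity <- cm["0", "0"] / sum(cm[, "0"])   # class 0 is treated as "positive"
specificity <- cm["1", "1"] / sum(cm[, "1"])
balanced_accuracy <- (sensitivity + specificity) / 2

print(c(accuracy = accuracy, nir = nir, sensitivity = sensitivity,
        specificity = specificity, balanced_accuracy = balanced_accuracy))
```

A sensitivity of 0 and a balanced accuracy of 0.5 indicate that this model does not separate the two classes at all, even though the headline accuracy looks acceptable.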

Explain, in a block comment:
1) why it is valuable to have a “test” dataset that is separate from a “training” dataset, and
2) what potential ethical challenges this type of automated classification may pose.

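# 1) A separate test set gives an unbiased estimate of how well the model generalizes to data it has not seen. It does not prevent overfitting (where the model learns noise or details specific to the training data), but it exposes it: an overfit model scores well on the training data and noticeably worse on the test data.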

# 2) Ethical challenges include bias inherited from historical data, lack of transparency about how decisions are made, privacy concerns, and significant impacts on individuals' lives, such as unfair loan denials.
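The first point can be illustrated with a small experiment. The sketch below uses synthetic data (not the credit file) in which the label is unrelated to the predictors, and grows a deliberately oversized rpart tree. The data frame toy and the control settings are assumptions chosen to make the effect obvious.

```r
library(rpart)

set.seed(123)
n <- 300
# Synthetic data (hypothetical): two noise predictors and a label unrelated to them.
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n),
                  y = factor(sample(c(0, 1), n, replace = TRUE)))
train <- toy[1:200, ]
test  <- toy[201:300, ]

# Grow a deliberately oversized tree so it can memorize the training rows.
fit <- rpart(y ~ x1 + x2, data = train, method = "class",
             control = rpart.control(cp = 0, minsplit = 2, minbucket = 1))

# Accuracy on the rows the tree was trained on versus rows it has never seen.
train_acc <- mean(predict(fit, train, type = "class") == train$y)
test_acc  <- mean(predict(fit, test,  type = "class") == test$y)
print(c(train = train_acc, test = test_acc))
```

Because the label is pure noise, any accuracy well above chance on the training rows reflects memorization, and only the held-out rows reveal that.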

Part 5: Now build a tree model (with rpart)

A. Build a model with rpart
Note: you might need to install the e1071 package

#install.packages("e1071")
library(caret)  # For model training and evaluation
library(rpart)  # For decision tree modeling
library(e1071)  # Required by caret for certain models

# Set a seed for reproducibility
set.seed(123)

# Use the train() function from the caret package to build the decision tree model using all variables to predict credit_risk
treeModel <- train(credit_risk ~ .,                # Define the formula for the model
                   data = credTrain,               # Specify the training dataset
                   method = "rpart",               # Use the rpart method for decision tree
                   trControl = trainControl(method = "cv", number = 10),  # Use 10-fold cross-validation
                   preProcess = c("center", "scale"))  # Center and scale the predictors

B. Visualize the results using rpart.plot()

#install.packages("rpart.plot")
library(rpart.plot)  # For visualizing the decision tree

# Visualize the decision tree using rpart.plot
rpart.plot(treeModel$finalModel)  # Plot the decision tree

C. Use the predict() function to make predictions on the test data, and then generate a confusion matrix to explore the results

# Predict the test data using the decision tree model
treePred <- predict(treeModel, credTest, type = "raw")  # Use type = "raw" for class predictions

# Generate a confusion matrix to explore the results
treeConfMatrix <- confusionMatrix(treePred, credTest$credit_risk)

# Display the confusion matrix
print(treeConfMatrix)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0  15  17
##          1  84 214
##                                           
##                Accuracy : 0.6939          
##                  95% CI : (0.6411, 0.7432)
##     No Information Rate : 0.7             
##     P-Value [Acc > NIR] : 0.6208          
##                                           
##                   Kappa : 0.0966          
##                                           
##  Mcnemar's Test P-Value : 5.125e-11       
##                                           
##             Sensitivity : 0.15152         
##             Specificity : 0.92641         
##          Pos Pred Value : 0.46875         
##          Neg Pred Value : 0.71812         
##              Prevalence : 0.30000         
##          Detection Rate : 0.04545         
##    Detection Prevalence : 0.09697         
##       Balanced Accuracy : 0.53896         
##                                           
##        'Positive' Class : 0               
##
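The Kappa value of 0.0966 shows how little the tree improves on chance agreement, even though its accuracy is close to 0.7. The sketch below recomputes Cohen's kappa by hand from the counts printed above. The variable names are chosen for this example only.

```r
# Rebuild the tree's confusion matrix (rows = prediction, columns = reference).
cm <- matrix(c(15, 84, 17, 214), nrow = 2,
             dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1")))

n <- sum(cm)
observed <- sum(diag(cm)) / n                      # observed agreement (accuracy)
expected <- sum(rowSums(cm) * colSums(cm)) / n^2   # agreement expected by chance
kappa_hat <- (observed - expected) / (1 - expected)

print(round(c(accuracy = observed, expected = expected, kappa = kappa_hat), 4))
```

Kappa rescales accuracy so that 0 means no better than chance given the row and column totals, and 1 means perfect agreement.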

D. Review the attributes being used for this credit decision. Are there any that might not be appropriate, with respect to fairness? If so, which attribute, and how would you address this fairness situation? Answer in a comment block below.

# The attribute 'age' is used in the decision tree (e.g., node 'age < -0.95', where the threshold is in standardized units because the predictors were centered and scaled), which might lead to age discrimination.

# To address this, remove 'age' from the model or use techniques like reweighting, adversarial debiasing, or a disparate impact remover, focusing decisions on financial performance and ability.
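One simple way to check whether a model treats age groups differently is to compare the rate of favorable predictions across groups. The sketch below uses entirely made-up predictions and ages. The coding (1 = favorable), the age cutoff of 30, and all variable names are assumptions for illustration, not results from the models above.

```r
# Hypothetical predictions (1 = favorable) and borrower ages, for illustration only.
pred <- c(1, 0, 1, 1, 0, 1, 1, 1, 0, 1)
age  <- c(22, 24, 23, 45, 21, 52, 38, 61, 25, 47)

young <- age < 30   # the cutoff of 30 is an arbitrary assumption

# Share of favorable predictions within each age group.
rate_young <- mean(pred[young] == 1)
rate_older <- mean(pred[!young] == 1)

# Disparate impact ratio: values far below 1 mean the younger group
# receives favorable predictions much less often.
ratio <- rate_young / rate_older
print(c(rate_young = rate_young, rate_older = rate_older, ratio = ratio))
```

A commonly cited rule of thumb flags ratios below 0.8 for closer review. The same comparison could be run on the real test predictions before and after dropping age from the model.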