Intro to Data Science HW 8

Attribution statement:

# 1. I did this homework by myself, with help from the book and the professor.

Supervised learning means that there is a criterion one is trying to predict. The typical strategy is to divide data into a training set and a test set (for example, two-thirds training and one-third test), train the model on the training set, and then see how well the model does on the test set.

Support vector machines (SVMs) are a highly flexible and powerful method for supervised machine learning.

Another approach is to use recursive partitioning trees (rpart).

In this homework, we will use another banking dataset to train an SVM model, as well as an rpart model, to classify potential borrowers into two groups of credit risk: reliable borrowers and borrowers posing a risk. You can learn more about the variables in the dataset here:
https://archive.ics.uci.edu/ml/datasets/Statlog+%28German+Credit+Data%29

This kind of classification algorithm is used in many aspects of our lives, from credit card approvals to stock market predictions, and even some medical diagnoses.

Part 1: Load and condition the data

A. Read the contents of the following .csv file into a dataframe called credit:

https://intro-datascience.s3.us-east-2.amazonaws.com/GermanCredit.csv

You will also need to install.packages( ) and library( ) several other packages, such as kernlab and caret.
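If any of these packages are missing, a one-time install (shown commented out so the document runs cleanly; the package names are taken from the library( ) calls below) would look like:

# install.packages(c("kernlab", "caret", "tidyverse", "rpart", "rpart.plot", "e1071"))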

library(kernlab)
library(caret)
## Warning: package 'caret' was built under R version 4.1.2
## Loading required package: ggplot2
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:kernlab':
## 
##     alpha
## Loading required package: lattice
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v tibble  3.1.5     v dplyr   1.0.7
## v tidyr   1.1.4     v stringr 1.4.0
## v readr   2.0.2     v forcats 0.5.1
## v purrr   0.3.4
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x ggplot2::alpha() masks kernlab::alpha()
## x purrr::cross()   masks kernlab::cross()
## x dplyr::filter()  masks stats::filter()
## x dplyr::lag()     masks stats::lag()
## x purrr::lift()    masks caret::lift()
library(rpart)
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 4.1.2
library(e1071)
## Warning: package 'e1071' was built under R version 4.1.2
credit <- read.csv("https://intro-datascience.s3.us-east-2.amazonaws.com/GermanCredit.csv")
B. Which variable contains the outcome we are trying to predict, credit risk? For the purposes of this analysis, we will focus only on the numeric variables and save them in a new dataframe called cred:
cred <- data.frame(duration=credit$duration, 
                   amount=credit$amount, 
                   installment_rate=credit$installment_rate, 
                   present_residence=credit$present_residence, 
                   age=credit$age, 
                   credit_history=credit$number_credits, 
                   people_liable=credit$people_liable, 
                   credit_risk=as.factor(credit$credit_risk))
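As an optional sanity check (a small sketch, not part of the assignment), str( ) confirms that cred has the expected columns and that credit_risk is a factor:

# Inspect the column types; credit_risk should appear as a factor
str(cred)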
C. Although all variables in cred except credit_risk are coded as numeric, the values of one of them represent ordered categories rather than true numbers. In consultation with the data description link from the intro, write a comment identifying the factor variable and briefly describing each variable in the dataframe.
# Although coded with numbers, installment_rate takes only the values 1-4,
# which represent ordered categories (bands of the installment rate) rather
# than true numbers, so it is really an ordered factor (present_residence is
# coded the same way).
# duration: duration of the credit in months
# amount: credit amount
# installment_rate: installment rate as a percentage of disposable income (ordered bands 1-4)
# present_residence: how long the borrower has lived at the present residence (coded 1-4)
# age: age in years
# credit_history: number of existing credits at this bank
# people_liable: number of people the borrower is liable to provide maintenance for
# credit_risk: the outcome we are trying to predict, the borrower's credit-risk classification (0 or 1)

Part 2: Create training and test data sets

A. Using techniques discussed in class, create two datasets – one for training and one for testing.
# Makes the sampling predictable
set.seed(123)

# Randomly sample elements to go into the training dataset
train_list <- createDataPartition(y = cred$credit_risk, p=2/3, list = FALSE)

training <- cred[train_list,]

testing <- cred[-train_list,]
B. Use the dim( ) function to demonstrate that the resulting training data set and test data set contain the appropriate number of cases.
dim(training)
## [1] 667   8
dim(testing)
## [1] 333   8
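Because createDataPartition( ) samples within each level of credit_risk, the class proportions should come out roughly equal in the two sets. An optional check (a small sketch):

# Compare the class balance of the two partitions
prop.table(table(training$credit_risk))
prop.table(table(testing$credit_risk))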

Part 3: Build a Model using SVM

A. Using the caret package, build a support vector model using all of the variables to predict credit_risk.
# Train the SVM model
svm_fit <- train(credit_risk ~ ., data=training, method = "svmRadial", preProc = c("center","scale"))

B. Output the model.

Hint: explore finalModel in the model you created in A.

svm_fit
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 667 samples
##   7 predictor
##   2 classes: '0', '1' 
## 
## Pre-processing: centered (7), scaled (7) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 667, 667, 667, 667, 667, 667, ... 
## Resampling results across tuning parameters:
## 
##   C     Accuracy   Kappa     
##   0.25  0.6987006  0.01570214
##   0.50  0.6971104  0.04523360
##   1.00  0.6931175  0.06907547
## 
## Tuning parameter 'sigma' was held constant at a value of 0.15342
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.15342 and C = 0.25.
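Per the hint, the underlying kernlab object can be inspected through finalModel (a quick optional look; exact values vary with the resampling):

# The fitted ksvm object chosen by caret's tuning
svm_fit$finalModel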

Part 4: Predict Values in the Test Data and Create a Confusion Matrix

A. Use the predict( ) function to validate the model against the test data. Store the predictions in a variable named svmPred.
svm_pred <- predict(svm_fit, newdata=testing)
B. The svmPred object contains a list of classifications for reliable (=0) or risky (=1) borrowers. Review the contents of svmPred using head( ).
head(svm_pred)
## [1] 1 1 1 1 1 1
## Levels: 0 1
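Tabulating the predictions (an optional sketch) shows how the test cases were distributed across the two classes before building the confusion matrix:

# Count how many test cases were assigned to each class
table(svm_pred)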
C. Explore the confusion matrix, using the caret package.
conf_matrix <- table(svm_pred, testing$credit_risk)
conf_matrix
##         
## svm_pred   0   1
##        0   0   0
##        1 100 233
prop.table(conf_matrix)
##         
## svm_pred         0         1
##        0 0.0000000 0.0000000
##        1 0.3003003 0.6996997
error_rate <- (sum(conf_matrix) - sum(diag(conf_matrix))) / sum(conf_matrix)
error_rate
## [1] 0.3003003
D. What is the accuracy, based on what you see in the confusion matrix?
# With an error rate of 0.3003, the SVM model has an accuracy of ~70%. Note,
# however, that the confusion matrix shows the model assigned every test case
# to class 1, so this accuracy simply matches the majority-class rate.
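Equivalently, accuracy can be computed directly from the confusion matrix as the proportion of cases on the diagonal (a small sketch using the conf_matrix object from above):

# Accuracy = correct predictions / total predictions = 1 - error_rate
accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)
accuracy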
E. Compare your calculations with the confusionMatrix() function from the caret package.
confusion <- confusionMatrix(svm_pred, testing$credit_risk)
confusion
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0   0   0
##          1 100 233
##                                           
##                Accuracy : 0.6997          
##                  95% CI : (0.6473, 0.7485)
##     No Information Rate : 0.6997          
##     P-Value [Acc > NIR] : 0.527           
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.0000          
##             Specificity : 1.0000          
##          Pos Pred Value :    NaN          
##          Neg Pred Value : 0.6997          
##              Prevalence : 0.3003          
##          Detection Rate : 0.0000          
##    Detection Prevalence : 0.0000          
##       Balanced Accuracy : 0.5000          
##                                           
##        'Positive' Class : 0               
## 
F. Explain, in a block comment:
    1) why it is valuable to have a “test” dataset that is separate from a “training” dataset, and
    2) what potential ethical challenges this type of automated classification may pose.
# 1) A separate "test" dataset is valuable because a model should be judged on
# data it has never seen. Performance measured on the training data is
# optimistically biased: a flexible model can memorize (overfit) the training
# cases, so only held-out data gives an honest estimate of how well the model
# will generalize to new borrowers.

# 2) Ethical challenges arise when a significant predictor of the outcome of
# interest is, or is strongly correlated with, a protected attribute such as an
# individual's race. If such a model were implemented by the federal
# government, it could violate the Equal Protection Clause of the 14th
# Amendment of the United States Constitution, as well as Title VII of the
# Civil Rights Act of 1964. Beyond legal concerns, this kind of classification
# can also give unfair advantages or disadvantages to groups that were not
# considered when the model was built.

Part 5: Now build a tree model (with rpart)

A. Build a model with rpart
Note: you might need to install the e1071 package

tree_fit <- train(credit_risk ~ ., data = training, method = "treebag", preProc = c("center","scale"))
# varImp(tree_fit)

B. Visualize the results using rpart.plot()

# The bagged-tree ensemble cannot be drawn directly, so fit a single rpart
# tree for visualization
cart_tree <- rpart(credit_risk ~ ., data = training, method = "class")

prp(cart_tree, faclen = 0, cex = 0.8, extra = 1)

C. Use the predict() function to predict the testData, and then generate a confusion matrix to explore the results

tree_pred <- predict(tree_fit, testing)

tree_confusion <- confusionMatrix(tree_pred, testing$credit_risk)
tree_confusion
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0  35  28
##          1  65 205
##                                           
##                Accuracy : 0.7207          
##                  95% CI : (0.6692, 0.7683)
##     No Information Rate : 0.6997          
##     P-Value [Acc > NIR] : 0.2195105       
##                                           
##                   Kappa : 0.257           
##                                           
##  Mcnemar's Test P-Value : 0.0001892       
##                                           
##             Sensitivity : 0.3500          
##             Specificity : 0.8798          
##          Pos Pred Value : 0.5556          
##          Neg Pred Value : 0.7593          
##              Prevalence : 0.3003          
##          Detection Rate : 0.1051          
##    Detection Prevalence : 0.1892          
##       Balanced Accuracy : 0.6149          
##                                           
##        'Positive' Class : 0               
## 
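For a side-by-side comparison, both confusionMatrix( ) results store their accuracy in the overall element (a small sketch using the objects created above):

# Pull the accuracy of each model from its confusion matrix object
confusion$overall["Accuracy"]       # SVM, about 0.70 per the output above
tree_confusion$overall["Accuracy"]  # bagged tree, about 0.72 per the output above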

D. Review the attributes being used for this credit decision. Are there any that might not be appropriate, with respect to fairness? If so, which attribute, and how would you address this fairness situation? Answer in a comment block below.

varImp(tree_fit)
## treebag variable importance
## 
##                   Overall
## amount             100.00
## age                 77.19
## duration            52.55
## present_residence   27.94
## installment_rate    25.57
## credit_history      11.33
## people_liable        0.00
# The age attribute might not be appropriate with respect to fairness. Age is
# generally not permitted as a deciding factor for creditworthiness in the
# United States, and older individuals tend to be favored from a bank's credit
# perspective. I would address this by excluding age as a deciding factor in
# the model that is ultimately implemented.
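One way to act on that comment (a sketch only; tree_fit_noage is a hypothetical name, and any retrained model should be re-evaluated for both accuracy and fairness) is to drop age from the formula and retrain:

# Hypothetical retrain that excludes age as a predictor
tree_fit_noage <- train(credit_risk ~ . - age, data = training,
                        method = "treebag", preProc = c("center","scale"))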