Creditworthiness Score Prediction using Machine Learning

Import Packages

# Importing Packages
library(tidyverse)
library(dplyr)
library(caret)
library(rpart)
library(rpart.plot)
library(randomForest)
library(e1071)
library(pROC)
cat("The requisite libraries are successfully imported.")

## The requisite libraries are successfully imported.

1. Load Dataset

# Read the creditworthiness.csv file
creditworthiness_data <- read.csv("creditworthiness.csv")
cat("The shape of the dataset - ", dim(creditworthiness_data))

## The shape of the dataset -  2500 46

2. Data Pre-processing

2.1 Rename Attributes

# Get the current column names
current_names <- colnames(creditworthiness_data)

# Replace each run of one or more "." with a single "_"
new_names <- gsub("\\.+", "_", current_names)

# Remove trailing "_"
new_names <- sub("_$", "", new_names)

# Set the new column names
colnames(creditworthiness_data) <- new_names

cat("All the column names are changed successfully.")

## All the column names are changed successfully.
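As a minimal sketch of the renaming step, the snippet below collapses any run of dots into one underscore and then drops a trailing underscore. The example column names are hypothetical stand-ins, not taken from the dataset.

```r
# Hypothetical column names, as read.csv() might produce them
raw_names <- c("credit.rating", "re.balanced..paid.back.", "FI3O.credit.score")

# Collapse each run of one or more dots into a single underscore
clean_names <- gsub("\\.+", "_", raw_names)

# Drop a trailing underscore left behind by a trailing dot
clean_names <- sub("_$", "", clean_names)

print(clean_names)
```

Using one pattern for runs of dots avoids depending on how the regex engine resolves an alternation between "." and "..".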

2.2 Remove Missing Values

# Filter the data to include only the rows with known ratings
known_ratings_data <- creditworthiness_data[complete.cases(creditworthiness_data$credit_rating), ]
cat("The observations with missing values are removed from the dataset.")

## The observations with missing values are removed from the dataset.

3. Categorical Classification

3.1 Train-Test Split

# Set the seed for reproducibility
set.seed(42)

# Splitting the dataset into X and y
X <- known_ratings_data[, !(names(known_ratings_data) %in% c("credit_rating"))]
y <- known_ratings_data$credit_rating

# Split the data into train and test sets
train_index <- createDataPartition(y, p = 0.5, list = FALSE)
train_X <- X[train_index, ]
train_y <- y[train_index]
test_X <- X[-train_index, ]
test_y <- y[-train_index]

# Print train and test data shapes
cat("Train Data Shape: ", dim(train_X), "\n")

## Train Data Shape:  1251 45

cat("Test Data Shape: ", dim(test_X), "\n")

## Test Data Shape:  1249 45

Q.1. We need to predict the credit rating that would be assigned to each individual. The data on 2500 customers have been collected, and credit rating for 1962 of them has been assessed as either A, B, or C, coded as 1, 2, or 3, respectively, with the remaining 538 needing to be classified. Write the code to split the dataset into 50% training set and 50% test set and only include the data with known ratings.

Q.1. Analysis -

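In this phase, the creditworthiness data is split into a 50% training set and a 50% test set, giving 1251 training and 1249 test observations with 45 features each. These counts sum to the full 2500 rows, so the complete.cases() filter removed no observations. The unrated customers therefore appear to be stored under a non-missing code (the label 0 in the outputs that follow) rather than as NA.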
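A quick way to check what a filtering step actually removes is to tabulate the label and compare row counts. The sketch below uses a small hypothetical data frame in which unrated customers are coded 0 rather than NA (an assumption for illustration). In that situation complete.cases() keeps every row, and an explicit comparison is needed instead.

```r
# Hypothetical toy data: 0 stands for "not yet rated" (assumption)
toy <- data.frame(
  credit_rating = c(1, 2, 0, 3, 0, 2),
  gender        = c(0, 1, 1, 0, 1, 0)
)

# Tabulate the labels, listing NA explicitly if any are present
print(table(toy$credit_rating, useNA = "ifany"))

# complete.cases() only drops NA values, so nothing is removed here
kept_na_filter <- toy[complete.cases(toy$credit_rating), ]

# An explicit filter on the placeholder code drops the unrated rows
kept_code_filter <- toy[toy$credit_rating != 0, ]

cat("Rows after NA filter:  ", nrow(kept_na_filter), "\n")
cat("Rows after code filter:", nrow(kept_code_filter), "\n")
```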

3.2 Decision Tree

3.2.1 Model Implementation

# Fit a decision tree model to the training set
dt_model <- rpart(train_y ~ ., data = train_X, method = "class")
cat("The Decision Tree Model is trained successfully.")

## The Decision Tree Model is trained successfully.

3.2.2 Model Tree

# Plot the decision tree
rpart.plot(dt_model)

Q.2.(a) Using default settings, fit a decision tree to the training set to predict the credit ratings of customers using all of the other variables in the dataset. Report the resulting tree.

Q.2.(a) Analysis -

The Decision Tree shown above indicates that the model considers four important features from the dataset as follows -

functionary
re_balanced_paid_back_a_recently_overdrawn_current_acount
FI3O_credit_score
savings_on_other_accounts

The model classifies the data based on the following if-else conditions -

if FI3O_credit_score = 0
- if savings_on_other_accounts < 4
  - predicted credit_rating = 0
- else
  - predicted credit_rating = 3
else
- if functionary = 1
  - predicted credit_rating = 1
- else
  - if re_balanced_paid_back_a_recently_overdrawn_current_acount = 0
    - if savings_on_other_accounts < 3
      - predicted credit_rating = 0
    - else
      - predicted credit_rating = 3
  - else
    - predicted credit_rating = 2
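The rules listed above can be sketched as a small hand-written R function. This is only a transcription of the listed conditions, not the fitted rpart object; the function name and the short argument name re_balanced (standing in for re_balanced_paid_back_a_recently_overdrawn_current_acount) are hypothetical.

```r
# Hand-coded version of the listed rules (a sketch, not the rpart model)
predict_rating_rules <- function(FI3O_credit_score,
                                 savings_on_other_accounts,
                                 functionary,
                                 re_balanced) {
  # Left branch: FI3O_credit_score equals 0
  if (FI3O_credit_score == 0) {
    if (savings_on_other_accounts < 4) return(0) else return(3)
  }
  # Right branch: functionary decides first
  if (functionary == 1) return(1)
  # Then the re-balanced flag, then savings
  if (re_balanced == 0) {
    if (savings_on_other_accounts < 3) return(0) else return(3)
  }
  2
}

# Two hypothetical customers
print(predict_rating_rules(0, 2, 0, 0))
print(predict_rating_rules(1, 0, 0, 1))
```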

3.2.3 Model Prediction

# Calculate medians for all attributes in train_X
attribute_medians <- sapply(train_X, median)

# Create the median_customer dataframe
median_customer <- data.frame(t(attribute_medians))

median_prediction <- predict(dt_model, newdata = median_customer, type = "class")
cat("The predicted credit rating using Decision Tree model is =", as.character(median_prediction[1]))

## The predicted credit rating using Decision Tree model is = 2

Q.2.(b) Based on this output, predict the credit rating of a hypothetical “median” customer, i.e., one observation considering all the attributes from the dataset. Show the steps involved.

Q.2.(b) Analysis -

The steps for predicting the credit rating of the median customer are as follows -

Identify the median values for all the 45 features from Table 1.
Create a dataframe with the identified median values for all the features.
Predict the credit rating using the Decision Tree model.

The predicted credit rating using Decision Tree model is = 2

# Predict credit ratings for the test set
predictions <- predict(dt_model, newdata = test_X, type = "class")

# Convert predictions and test_y into factors with the same levels
predictions <- factor(predictions, levels = c("0", "1", "2", "3"))
test_y <- factor(test_y, levels = c("0", "1", "2", "3"))

# Create a confusion matrix
confusion_matrix <- confusionMatrix(predictions, test_y)

# Get the confusion matrix table
confusion_table <- confusion_matrix$table
confusion_table

##           Reference
## Prediction   0   1   2   3
##          0 188   5  11  43
##          1  22 140  84  35
##          2  48  94 376 137
##          3  12   1  14  39

# Calculate the overall accuracy rate
accuracy <- sum(diag(confusion_table)) / sum(confusion_table)

# Print the overall accuracy rate
print(paste("Overall Accuracy Rate of Decision Tree model:", round(accuracy * 100, 2), "%"))

## [1] "Overall Accuracy Rate of Decision Tree model: 59.49 %"

Q.2.(c) Produce the confusion matrix for predicting the credit rating from this tree on the test set, and also report the overall accuracy rate.

Q.2.(c) Analysis -

Here the confusion matrix compares the predicted and actual credit ratings on the test data for the Decision Tree model. The credit_rating column takes 4 values: 0, 1, 2, and 3. Rows of the matrix are predictions and columns are actual (reference) values, so Pij below denotes the number of observations predicted as i whose actual rating is j -

Actual 0 Predicted 0 (P00) = 188 observations
Actual 1 Predicted 0 (P01) = 5 observations
Actual 2 Predicted 0 (P02) = 11 observations
Actual 3 Predicted 0 (P03) = 43 observations
Actual 0 Predicted 1 (P10) = 22 observations
Actual 1 Predicted 1 (P11) = 140 observations
Actual 2 Predicted 1 (P12) = 84 observations
Actual 3 Predicted 1 (P13) = 35 observations
Actual 0 Predicted 2 (P20) = 48 observations
Actual 1 Predicted 2 (P21) = 94 observations
Actual 2 Predicted 2 (P22) = 376 observations
Actual 3 Predicted 2 (P23) = 137 observations
Actual 0 Predicted 3 (P30) = 12 observations
Actual 1 Predicted 3 (P31) = 1 observation
Actual 2 Predicted 3 (P32) = 14 observations
Actual 3 Predicted 3 (P33) = 39 observations

The Decision Tree model achieved 59.49% accuracy on the test data (743 of 1249 observations classified correctly).

Q.2.(d) What is the numerical value of the gain in entropy corresponding to the first split at the top of the tree? (Use logarithms to base 2, and show the details of the calculation rather than just providing a final answer.)

Q.2.(d) Analysis -

To calculate the information gain for the first split at the top of the Decision Tree, the entropy of each node involved must be calculated first.

Entropy for root/parent node \[E_p = -(0.21 \times \log_{2} 0.21 + 0.19 \times \log_{2} 0.19 + 0.39 \times \log_{2} 0.39 + 0.20 \times \log_{2} 0.20) = 1.9222\]

Entropy for child 1 node (taking \(0 \times \log_{2} 0 = 0\) by convention) \[E_c^1 = -(0.76 \times \log_{2} 0.76 + 0.00 \times \log_{2} 0.00 + 0.04 \times \log_{2} 0.04 + 0.20 \times \log_{2} 0.20) = 0.9510\]

Entropy for child 2 node \[E_c^2 = -(0.10 \times \log_{2} 0.10 + 0.24 \times \log_{2} 0.24 + 0.46 \times \log_{2} 0.46 + 0.21 \times \log_{2} 0.21) = 1.8144\]

The root node contains all 1251 training observations, of which -

The number of child 1 instances (18% of root instances) \(= 1251 \times \frac{18}{100} \approx 225\)
The number of child 2 instances (82% of root instances) \(= 1251 \times \frac{82}{100} \approx 1026\)

Hence, information gain \[IG = E_p - \frac{225}{1251} E_c^1 - \frac{1026}{1251} E_c^2 = 1.9222 - \left(\frac{225}{1251} \times 0.9510\right) - \left(\frac{1026}{1251} \times 1.8144\right) = 0.2631\]
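The calculation above can be cross-checked with a few lines of base R. This is a sketch that uses the rounded class proportions quoted in the text, so its result agrees with the hand calculation only to about three decimal places; the helper name entropy is hypothetical.

```r
# Shannon entropy in bits; zero-probability classes contribute nothing
entropy <- function(p) {
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Rounded class proportions read off the tree plot
root   <- c(0.21, 0.19, 0.39, 0.20)
child1 <- c(0.76, 0.00, 0.04, 0.20)
child2 <- c(0.10, 0.24, 0.46, 0.21)

# Node sizes: 18% and 82% of the 1251 training rows
n_root <- 1251
n1 <- round(0.18 * n_root)
n2 <- round(0.82 * n_root)

# Information gain = parent entropy minus weighted child entropies
gain <- entropy(root) -
  (n1 / n_root) * entropy(child1) -
  (n2 / n_root) * entropy(child2)

cat("Child sizes:", n1, n2, "\n")
cat("Information gain (bits):", round(gain, 3), "\n")
```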

3.3 Random Forest

3.3.1 Model Implementation

# Convert the target variable to a factor with appropriate levels
train_y <- factor(train_y, levels = c("0", "1", "2", "3"))

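# Fit a random forest model
# (classification is chosen because train_y is a factor; `method` is not a
# randomForest() argument and has no effect here)
rf_model <- randomForest(train_y ~ ., data = train_X, method = "class")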

cat("The Random Forest Model is trained successfully.")

## The Random Forest Model is trained successfully.

# Print the model summary
print(rf_model)

## 
## Call:
##  randomForest(formula = train_y ~ ., data = train_X, method = "class") 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 6
## 
##         OOB estimate of  error rate: 41.25%
## Confusion matrix:
##     0  1   2  3 class.error
## 0 219  7  39  3   0.1828358
## 1   2 78 161  2   0.6790123
## 2  15 56 402 12   0.1711340
## 3  46 18 155 36   0.8588235

Q.2.(e) Fit a random forest model to the training set to try to improve prediction. Report the R output.

Q.2.(e) Analysis -

In this phase, the Random Forest model is trained on training data for predicting the credit ratings from the test data.

# Predict credit ratings for the test set
predictions <- predict(rf_model, newdata = test_X, type = "class")

# Convert predictions and test_y into factors with the same levels
predictions <- factor(predictions, levels = c("0", "1", "2", "3"))
test_y <- factor(test_y, levels = c("0", "1", "2", "3"))

# Create a confusion matrix
confusion_matrix <- confusionMatrix(predictions, test_y)

# Get the confusion matrix table
confusion_table <- confusion_matrix$table
confusion_table

##           Reference
## Prediction   0   1   2   3
##          0 219   4  12  45
##          1   4  78  37  18
##          2  39 157 421 158
##          3   8   1  15  33

# Calculate the overall accuracy rate
accuracy <- sum(diag(confusion_table)) / sum(confusion_table)

# Print the overall accuracy rate
print(paste("Overall Accuracy Rate of Random Forest model:", round(accuracy * 100, 2), "%"))

## [1] "Overall Accuracy Rate of Random Forest model: 60.13 %"

Q.2.(f) Produce the confusion matrix for predicting the credit rating from this forest on the test set, and also report the overall accuracy rate.

Q.2.(f) Analysis -

From the confusion matrix of Random Forest model, the different metrics can be identified as follows -

Actual 0 Predicted 0 (P00) = 219 observations
Actual 1 Predicted 0 (P01) = 4 observations
Actual 2 Predicted 0 (P02) = 12 observations
Actual 3 Predicted 0 (P03) = 45 observations
Actual 0 Predicted 1 (P10) = 4 observations
Actual 1 Predicted 1 (P11) = 78 observations
Actual 2 Predicted 1 (P12) = 37 observations
Actual 3 Predicted 1 (P13) = 18 observations
Actual 0 Predicted 2 (P20) = 39 observations
Actual 1 Predicted 2 (P21) = 157 observations
Actual 2 Predicted 2 (P22) = 421 observations
Actual 3 Predicted 2 (P23) = 158 observations
Actual 0 Predicted 3 (P30) = 8 observations
Actual 1 Predicted 3 (P31) = 1 observation
Actual 2 Predicted 3 (P32) = 15 observations
Actual 3 Predicted 3 (P33) = 33 observations

The Random Forest model achieved 60.13% accuracy on the test data. With default settings, it outperforms the Decision Tree model (59.49%) by 0.64 percentage points.

3.4 Support Vector Machine I

3.4.1 Model Implementation

# Fit a support vector machine
svm_model <- svm(train_y ~ ., data = train_X)
cat("The SVM Model is trained successfully.")

## The SVM Model is trained successfully.

# Print the model summary
print(svm_model)

## 
## Call:
## svm(formula = train_y ~ ., data = train_X)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
## 
## Number of Support Vectors:  1181

3.4.2 Model Prediction

median_prediction <- predict(svm_model, newdata = median_customer, type = "class")
cat("The predicted credit rating using SVM model is =", as.character(median_prediction[1]))

## The predicted credit rating using SVM model is = 2

Q.3.(a) Using default settings for svm() from the e1071 package, fit a support vector machine to predict the credit ratings of customers using all of the other variables in the dataset. Predict the credit rating of a hypothetical “median” customer, i.e., one observation considering all the attributes from the dataset. Report decision values as well.

Q.3.(a) Analysis -

Here the credit rating on the median values is predicted using the Support Vector Machine (SVM) model. The predicted credit rating using SVM model is = 2.
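The question also asks for decision values. A minimal sketch of how they can be obtained from predict() in e1071 is shown below; the built-in iris data stands in for the credit data (an assumption for illustration), so the numbers it prints are not results for this report.

```r
library(e1071)

# iris stands in for the credit data (assumption for illustration)
fit <- svm(Species ~ ., data = iris)

# One hypothetical "median" observation built from the numeric predictors
median_row <- as.data.frame(lapply(iris[, 1:4], median))

# Ask predict() to attach the pairwise decision values to its result
pred <- predict(fit, newdata = median_row, decision.values = TRUE)
dec_values <- attr(pred, "decision.values")

print(pred)
print(dec_values)
```

For a multi-class model the decision values come as one column per pair of classes, because svm() fits one-against-one binary classifiers.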

# Predict credit ratings for the test set
predictions <- predict(svm_model, newdata = test_X, type = "class")

# Convert predictions and test_y into factors with the same levels
predictions <- factor(predictions, levels = c("0", "1", "2", "3"))
test_y <- factor(test_y, levels = c("0", "1", "2", "3"))

# Create a confusion matrix
confusion_matrix <- confusionMatrix(predictions, test_y)

# Get the confusion matrix table
confusion_table <- confusion_matrix$table
confusion_table

##           Reference
## Prediction   0   1   2   3
##          0 203   3   9  44
##          1   9 127  71  30
##          2  39 107 370 133
##          3  19   3  35  47

# Calculate the overall accuracy rate
accuracy <- sum(diag(confusion_table)) / sum(confusion_table)

# Print the overall accuracy rate
print(paste("Overall Accuracy Rate of SVM model:", round(accuracy * 100, 2), "%"))

## [1] "Overall Accuracy Rate of SVM model: 59.81 %"

Q.3.(b) Produce the confusion matrix for predicting the credit rating from this SVM on the test set, and also report the overall accuracy rate.

Q.3.(b) Analysis -

From the confusion matrix of SVM model, the different metrics can be identified as follows -

Actual 0 Predicted 0 (P00) = 203 observations
Actual 1 Predicted 0 (P01) = 3 observations
Actual 2 Predicted 0 (P02) = 9 observations
Actual 3 Predicted 0 (P03) = 44 observations
Actual 0 Predicted 1 (P10) = 9 observations
Actual 1 Predicted 1 (P11) = 127 observations
Actual 2 Predicted 1 (P12) = 71 observations
Actual 3 Predicted 1 (P13) = 30 observations
Actual 0 Predicted 2 (P20) = 39 observations
Actual 1 Predicted 2 (P21) = 107 observations
Actual 2 Predicted 2 (P22) = 370 observations
Actual 3 Predicted 2 (P23) = 133 observations
Actual 0 Predicted 3 (P30) = 19 observations
Actual 1 Predicted 3 (P31) = 3 observations
Actual 2 Predicted 3 (P32) = 35 observations
Actual 3 Predicted 3 (P33) = 47 observations

The SVM model achieved 59.81% accuracy on the test data (747 of 1249 observations classified correctly).

3.4.3 Model Optimization

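# Define the parameter grid for tuning
# Note: e1071::svm() names its penalty parameter `cost` (not `C`), and gamma
# is not used by the linear kernel, so this grid may not change the fitted model
param_grid <- expand.grid(C = c(0.001, 0.005, 0.01),
                          gamma = c(0.001, 0.005, 0.01))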

# Perform the grid search for tuning
tuned_model <- tune(svm, train_X, train_y, kernel = "linear", ranges = param_grid)
cat("The optimized SVM Model is trained successfully.")

## The optimized SVM Model is trained successfully.

# Print the tuned model
print(tuned_model)

## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##      C gamma
##  0.001 0.001
## 
## - best performance: 0.3765206

# Predict credit ratings for the test set using the tuned model
predictions <- predict(tuned_model$best.model, newdata = test_X)

# Convert predictions and test_y into factors with the same levels
predictions <- factor(predictions, levels = c("0", "1", "2", "3"))
test_y <- factor(test_y, levels = c("0", "1", "2", "3"))

# Create a confusion matrix
confusion_matrix <- confusionMatrix(predictions, test_y)

# Get the confusion matrix table
confusion_table <- confusion_matrix$table
confusion_table

##           Reference
## Prediction   0   1   2   3
##          0 206   0   7  30
##          1   6 132  74  30
##          2  25  96 331 103
##          3  33  12  73  91

# Calculate the overall accuracy rate
accuracy <- sum(diag(confusion_table)) / sum(confusion_table)

# Print the overall accuracy rate
print(paste("Overall Accuracy Rate of optimized SVM model:", round(accuracy * 100, 2), "%"))

## [1] "Overall Accuracy Rate of optimized SVM model: 60.85 %"

Q.3.(c) Automatically or manually tune the SVM to improve prediction over that found in 3b. Report the resulting SVM settings and the resulting confusion matrix for predicting the test set. (Any amount of improvement is acceptable.)

Q.3.(c) Analysis -

In this phase, a grid search with 10-fold cross-validation is run over C and gamma using a linear kernel. The reported best parameters are 0.001 for both, with a cross-validation error of 0.3765. Because svm() in e1071 names its penalty parameter cost and the linear kernel does not use gamma, the gain over the default model is most plausibly due to the change from the radial to the linear kernel rather than to the grid values. The tuned model is then evaluated on the test set. From its confusion matrix, the different metrics can be identified as follows -

Actual 0 Predicted 0 (P00) = 206 observations
Actual 1 Predicted 0 (P01) = 0 observations
Actual 2 Predicted 0 (P02) = 7 observations
Actual 3 Predicted 0 (P03) = 30 observations
Actual 0 Predicted 1 (P10) = 6 observations
Actual 1 Predicted 1 (P11) = 132 observations
Actual 2 Predicted 1 (P12) = 74 observations
Actual 3 Predicted 1 (P13) = 30 observations
Actual 0 Predicted 2 (P20) = 25 observations
Actual 1 Predicted 2 (P21) = 96 observations
Actual 2 Predicted 2 (P22) = 331 observations
Actual 3 Predicted 2 (P23) = 103 observations
Actual 0 Predicted 3 (P30) = 33 observations
Actual 1 Predicted 3 (P31) = 12 observations
Actual 2 Predicted 3 (P32) = 73 observations
Actual 3 Predicted 3 (P33) = 91 observations

The optimized SVM model achieved 60.85% accuracy on the test data, outperforming the default-configuration SVM model (59.81%) by 1.04 percentage points.
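For reference, a grid search that varies the penalty and kernel width of a radial SVM can be sketched as follows. The built-in iris data and the grid values are assumptions for illustration only; the parameter names cost and gamma are the ones svm() in e1071 uses.

```r
library(e1071)

# Fix the cross-validation folds for reproducibility
set.seed(42)

# iris stands in for the credit data (assumption for illustration)
tuned <- tune(
  svm, Species ~ ., data = iris,
  kernel = "radial",
  ranges = list(cost = c(0.1, 1, 10), gamma = c(0.01, 0.1, 1))
)

# Best parameter combination and its cross-validation error
print(tuned$best.parameters)
print(tuned$best.performance)
```

Passing ranges as a named list lets tune() build the grid itself, and the refitted model is available as tuned$best.model.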

3.5 Naïve Bayes

3.5.1 Model Implementation

# Fit the Naive Bayes model
nb_model <- naiveBayes(train_y ~ ., data = train_X)
cat("The Naïve Bayes Model is trained successfully.")

## The Naïve Bayes Model is trained successfully.

3.5.2 Model Prediction

median_prediction <- predict(nb_model, newdata = median_customer, type = "class")
cat("The predicted credit rating using Naïve Bayes model is =", as.character(median_prediction[1]))

## The predicted credit rating using Naïve Bayes model is = 1

Q.4.(a) Fit the Naive Bayes model to predict the credit ratings of customers using all of the other variables in the dataset. Predict the credit rating of a hypothetical “median” customer, i.e., one observation considering all the attributes from the dataset. Report predicted probabilities as well.

Q.4.(a) Analysis -

Here the credit rating on the median values is predicted using the Naïve Bayes model. The predicted credit rating using Naïve Bayes model is = 1.

# Print the first 20 lines of the model summary
summary_output <- capture.output(print(nb_model))
first_20_lines <- head(summary_output, n = 20)
print(first_20_lines)

##  [1] ""                                                            
##  [2] "Naive Bayes Classifier for Discrete Predictors"              
##  [3] ""                                                            
##  [4] "Call:"                                                       
##  [5] "naiveBayes.default(x = X, y = Y, laplace = laplace)"         
##  [6] ""                                                            
##  [7] "A-priori probabilities:"                                     
##  [8] "Y"                                                           
##  [9] "        0         1         2         3 "                    
## [10] "0.2142286 0.1942446 0.3876898 0.2038369 "                    
## [11] ""                                                            
## [12] "Conditional probabilities:"                                  
## [13] "   functionary"                                              
## [14] "Y        [,1]      [,2]"                                     
## [15] "  0 0.2313433 0.4224803"                                     
## [16] "  1 0.6296296 0.4839006"                                     
## [17] "  2 0.2020619 0.4019527"                                     
## [18] "  3 0.2078431 0.4065619"                                     
## [19] ""                                                            
## [20] "   re_balanced_paid_back_a_recently_overdrawn_current_acount"

Q.4.(b) Reproduce the first 20 or so lines of the R output for the Naive Bayes fit, and use them to explain the steps involved in making this prediction.

Q.4.(b) Analysis -

In this phase, the first lines of the fitted Naïve Bayes model are shown. The output lists the a-priori probability of each class, followed by one table per predictor. For numeric predictors such as functionary, column [,1] is the class-conditional mean and column [,2] is the class-conditional standard deviation of a Gaussian density. To make a prediction, the model multiplies the prior of each class by the Gaussian likelihood of every predictor value under that class, normalizes the products so that they sum to 1, and returns the class with the largest posterior probability.
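The prediction steps can be illustrated by hand with the values printed above. The sketch below uses only the priors and the functionary table, so it shows the mechanics for a single predictor rather than the full 45-predictor model; the input value x is hypothetical.

```r
# Priors and functionary mean/sd copied from the printed model output
prior <- c(`0` = 0.2142286, `1` = 0.1942446, `2` = 0.3876898, `3` = 0.2038369)
mu    <- c(0.2313433, 0.6296296, 0.2020619, 0.2078431)
sigma <- c(0.4224803, 0.4839006, 0.4019527, 0.4065619)

# Hypothetical functionary value for the customer being scored
x <- 0

# Gaussian likelihood of x under each class
lik <- dnorm(x, mean = mu, sd = sigma)

# Prior times likelihood, then normalize to get posterior probabilities
unnorm    <- prior * lik
posterior <- unnorm / sum(unnorm)

print(round(posterior, 3))
cat("Predicted class:", names(posterior)[which.max(posterior)], "\n")
```

With all predictors, the likelihood term becomes a product over the 45 per-predictor densities; predict() with type = "raw" returns the resulting posterior probabilities directly.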

# Predict credit ratings for the test set
predictions <- predict(nb_model, newdata = test_X, type = "class")

# Convert predictions and test_y into factors with the same levels
predictions <- factor(predictions, levels = c("0", "1", "2", "3"))
test_y <- factor(test_y, levels = c("0", "1", "2", "3"))

# Create a confusion matrix
confusion_matrix <- confusionMatrix(predictions, test_y)

# Get the confusion matrix table
confusion_table <- confusion_matrix$table
confusion_table

##           Reference
## Prediction   0   1   2   3
##          0 206   5  14  48
##          1  38 200 287 101
##          2   6  27 117  29
##          3  20   8  67  76

# Calculate the overall accuracy rate
accuracy <- sum(diag(confusion_table)) / sum(confusion_table)

# Print the overall accuracy rate
print(paste("Overall Accuracy Rate of Naïve Bayes model:", round(accuracy * 100, 2), "%"))

## [1] "Overall Accuracy Rate of Naïve Bayes model: 47.96 %"

Q.4.(c) Produce the confusion matrix for predicting the credit rating using Naive Bayes on the test set, and also report the overall accuracy rate.

Q.4.(c) Analysis -

From the confusion matrix of Naïve Bayes model, the different metrics can be identified as follows -

Actual 0 Predicted 0 (P00) = 206 observations
Actual 1 Predicted 0 (P01) = 5 observations
Actual 2 Predicted 0 (P02) = 14 observations
Actual 3 Predicted 0 (P03) = 48 observations
Actual 0 Predicted 1 (P10) = 38 observations
Actual 1 Predicted 1 (P11) = 200 observations
Actual 2 Predicted 1 (P12) = 287 observations
Actual 3 Predicted 1 (P13) = 101 observations
Actual 0 Predicted 2 (P20) = 6 observations
Actual 1 Predicted 2 (P21) = 27 observations
Actual 2 Predicted 2 (P22) = 117 observations
Actual 3 Predicted 2 (P23) = 29 observations
Actual 0 Predicted 3 (P30) = 20 observations
Actual 1 Predicted 3 (P31) = 8 observations
Actual 2 Predicted 3 (P32) = 67 observations
Actual 3 Predicted 3 (P33) = 76 observations

The Naïve Bayes model achieved 47.96% accuracy on the test data (599 of 1249 observations classified correctly).

Q.5.(a) Based on the confusion matrices reported in the preceding parts, which of the classifiers looks to be the best? (Be specific, and specify the figures you used to answer this question.)

Q.5.(a) Analysis -

The optimized Support Vector Machine (SVM) classifier achieved 60.85% test accuracy, the highest among all implemented classifiers (Random Forest 60.13%, default SVM 59.81%, Decision Tree 59.49%, Naïve Bayes 47.96%). Therefore, this classifier is selected as the best-performing classifier for categorical classification.

Q.5.(b) Which looks to be the worst? (Be specific, and specify the figures you used to answer this question.)

Q.5.(b) Analysis -

The Naïve Bayes classifier achieved 47.96% test accuracy, the lowest among all classifiers and more than 11 percentage points below the next lowest (Decision Tree, 59.49%). Hence, this classifier is selected as the worst-performing classifier for categorical classification.

Q.5.(c) Are there any categories that all classifiers seem to have trouble with?

Q.5.(c) Analysis -

All classifiers have trouble with credit rating 3. Of the 254 test observations whose actual rating is 3, the Decision Tree, Random Forest, default SVM, optimized SVM, and Naïve Bayes models classify only 39, 33, 47, 91, and 76 correctly (15.4%, 13.0%, 18.5%, 35.8%, and 29.9%, respectively). Most of the remaining rating-3 observations are predicted as rating 2, or as rating 1 in the case of Naïve Bayes. Ratings 1 and 2 are also confused with each other by several models, but not consistently by all of them.
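Per-class recall makes this comparison explicit. The sketch below re-enters the five test-set confusion matrices reported above and computes, for each model, the share of each actual class that was predicted correctly; the helper and list names are hypothetical.

```r
# Build a 4x4 confusion matrix (rows = predicted, columns = actual)
make_cm <- function(v) {
  matrix(v, nrow = 4, byrow = TRUE,
         dimnames = list(Prediction = as.character(0:3),
                         Reference  = as.character(0:3)))
}

# Counts copied from the confusion matrices in this report
cms <- list(
  decision_tree = make_cm(c(188, 5, 11, 43, 22, 140, 84, 35,
                            48, 94, 376, 137, 12, 1, 14, 39)),
  random_forest = make_cm(c(219, 4, 12, 45, 4, 78, 37, 18,
                            39, 157, 421, 158, 8, 1, 15, 33)),
  svm_default   = make_cm(c(203, 3, 9, 44, 9, 127, 71, 30,
                            39, 107, 370, 133, 19, 3, 35, 47)),
  svm_tuned     = make_cm(c(206, 0, 7, 30, 6, 132, 74, 30,
                            25, 96, 331, 103, 33, 12, 73, 91)),
  naive_bayes   = make_cm(c(206, 5, 14, 48, 38, 200, 287, 101,
                            6, 27, 117, 29, 20, 8, 67, 76))
)

# Recall per class: correct predictions divided by actual class size
recall <- sapply(cms, function(cm) diag(cm) / colSums(cm))
print(round(100 * recall, 1))

# Overall accuracy per model, for comparison
accuracy <- sapply(cms, function(cm) sum(diag(cm)) / sum(cm))
print(round(100 * accuracy, 2))
```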

4. Binary Classification

4.1 Logistic Regression

4.1.1 Data Preparation

# Replace all values in the "credit_rating" attribute as "0" except for "1"
known_ratings_data$credit_rating[known_ratings_data$credit_rating != "1"] <- "0"

# Convert the "credit_rating" column to numeric
known_ratings_data$credit_rating <- as.numeric(known_ratings_data$credit_rating)
cat("The dataset is transformed as binary labeled data.")

## The dataset is transformed as binary labeled data.

# Set the seed for reproducibility
set.seed(42)

# Splitting the dataset into X and y
X <- known_ratings_data[, !(names(known_ratings_data) %in% c("credit_rating"))]
y <- known_ratings_data$credit_rating


# Reuse the train/test indices from Section 3.1 (train_index) so that the
# binary model is fitted and evaluated on the same rows as the earlier models
train_X <- X[train_index, ]
train_y <- y[train_index]
test_X <- X[-train_index, ]
test_y <- y[-train_index]

4.1.2 Model Implementation

# Fit logistic regression model
lr_model <- glm(train_y ~ ., data = train_X, family = binomial)
cat("The Logistic Regression Model is trained successfully.")

## The Logistic Regression Model is trained successfully.

Q.6.(a) Consider a simpler problem of predicting whether a customer gets a credit rating of A or not. Fit a logistic regression model to predict whether a customer gets a credit rating of A using all of the other variables in the dataset, with no interactions.

Q.6.(a) Analysis -

To perform this analysis, the label is converted to binary: 1 if the credit rating is A (coded 1) and 0 otherwise. The data is divided into training and test sets with the same 50/50 ratio as before. The Logistic Regression model is then trained to predict whether a customer's rating is “A” (1) or not (0).
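Once a logistic regression is fitted, class predictions come from thresholding the predicted probabilities. The sketch below shows this with the built-in mtcars data standing in for the credit data (an assumption for illustration); the 0.5 cutoff is a common default rather than a tuned value.

```r
# mtcars stands in for the credit data (assumption for illustration);
# am is a 0/1 outcome, like the binary "rating A or not" label
fit <- glm(am ~ wt, data = mtcars, family = binomial)

# Predicted probability of the positive class for each row
prob <- predict(fit, newdata = mtcars, type = "response")

# Convert probabilities to 0/1 labels with a 0.5 cutoff
pred_label <- as.integer(prob > 0.5)

# Confusion table of predicted vs. actual labels
print(table(Prediction = pred_label, Reference = mtcars$am))
```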

# Print the model summary
summary(lr_model)

## 
## Call:
## glm(formula = train_y ~ ., family = binomial, data = train_X)
## 
## Coefficients:
##                                                             Estimate Std. Error
## (Intercept)                                               -7.8290073  1.6677665
## functionary                                                1.8704272  0.1721814
## re_balanced_paid_back_a_recently_overdrawn_current_acount  2.7335508  0.7429015
## FI3O_credit_score                                          3.9018769  1.0241991
## gender                                                     0.3877227  0.1698910
## X0_accounts_at_other_banks                                -0.0087100  0.0600297
## credit_refused_in_past                                    -1.3394385  0.3666053
## years_employed                                             0.0649816  0.1256107
## savings_on_other_accounts                                 -0.0116092  0.0975863
## self_employed                                             -0.1302103  0.2213958
## max_account_balance_12_months_ago                         -0.0387255  0.0616430
## min_account_balance_12_months_ago                          0.0058569  0.0602593
## avrg_account_balance_12_months_ago                        -0.0534866  0.0606943
## max_account_balance_11_months_ago                         -0.0493602  0.0616380
## min_account_balance_11_months_ago                         -0.0132211  0.0600179
## avrg_account_balance_11_months_ago                         0.0439142  0.0614795
## max_account_balance_10_months_ago                          0.0150089  0.0594532
## min_account_balance_10_months_ago                         -0.0296392  0.0602040
## avrg_account_balance_10_months_ago                        -0.0485102  0.0614909
## max_account_balance_9_months_ago                          -0.0011398  0.0582369
## min_account_balance_9_months_ago                           0.0099618  0.0594209
## avrg_account_balance_9_months_ago                         -0.0101332  0.0585938
## max_account_balance_8_months_ago                           0.0322165  0.0587269
## min_account_balance_8_months_ago                           0.0308818  0.0600507
## avrg_account_balance_8_months_ago                         -0.1060412  0.0604374
## max_account_balance_7_months_ago                          -0.0475631  0.0592932
## min_account_balance_7_months_ago                          -0.0294268  0.0609789
## avrg_account_balance_7_months_ago                         -0.0007051  0.0590718
## max_account_balance_6_months_ago                           0.0457034  0.0613093
## min_account_balance_6_months_ago                           0.0703718  0.0597448
## avrg_account_balance_6_months_ago                          0.0118785  0.0609635
## max_account_balance_5_months_ago                          -0.0330777  0.0593192
## min_account_balance_5_months_ago                          -0.0577640  0.0601648
## avrg_account_balance_5_months_ago                          0.0325946  0.0594651
## max_account_balance_4_months_ago                           0.1003490  0.0596857
## min_account_balance_4_months_ago                          -0.0990256  0.0620236
## avrg_account_balance_4_months_ago                          0.0186868  0.0600654
## max_account_balance_3_months_ago                          -0.0489371  0.0590694
## min_account_balance_3_months_ago                          -0.0621361  0.0617342
## avrg_account_balance_3_months_ago                          0.0057960  0.0607878
## max_account_balance_2_months_ago                           0.0006822  0.0589490
## min_account_balance_2_months_ago                          -0.0293073  0.0591024
## avrg_account_balance_2_months_ago                          0.0251421  0.0585877
## max_account_balance_1_months_ago                           0.0626658  0.0607768
## min_account_balance_1_months_ago                           0.0276050  0.0598167
## avrg_account_balance_1_months_ago                         -0.0676082  0.0612855
##                                                           z value Pr(>|z|)    
## (Intercept)                                                -4.694 2.68e-06 ***
## functionary                                                10.863  < 2e-16 ***
## re_balanced_paid_back_a_recently_overdrawn_current_acount   3.680 0.000234 ***
## FI3O_credit_score                                           3.810 0.000139 ***
## gender                                                      2.282 0.022478 *  
## X0_accounts_at_other_banks                                 -0.145 0.884635    
## credit_refused_in_past                                     -3.654 0.000259 ***
## years_employed                                              0.517 0.604929    
## savings_on_other_accounts                                  -0.119 0.905304    
## self_employed                                              -0.588 0.556443    
## max_account_balance_12_months_ago                          -0.628 0.529858    
## min_account_balance_12_months_ago                           0.097 0.922571    
## avrg_account_balance_12_months_ago                         -0.881 0.378185    
## max_account_balance_11_months_ago                          -0.801 0.423243    
## min_account_balance_11_months_ago                          -0.220 0.825648    
## avrg_account_balance_11_months_ago                          0.714 0.475048    
## max_account_balance_10_months_ago                           0.252 0.800694    
## min_account_balance_10_months_ago                          -0.492 0.622498    
## avrg_account_balance_10_months_ago                         -0.789 0.430170    
## max_account_balance_9_months_ago                           -0.020 0.984384    
## min_account_balance_9_months_ago                            0.168 0.866861    
## avrg_account_balance_9_months_ago                          -0.173 0.862698    
## max_account_balance_8_months_ago                            0.549 0.583293    
## min_account_balance_8_months_ago                            0.514 0.607068    
## avrg_account_balance_8_months_ago                          -1.755 0.079334 .  
## max_account_balance_7_months_ago                           -0.802 0.422456    
## min_account_balance_7_months_ago                           -0.483 0.629398    
## avrg_account_balance_7_months_ago                          -0.012 0.990476    
## max_account_balance_6_months_ago                            0.745 0.455996    
## min_account_balance_6_months_ago                            1.178 0.238847    
## avrg_account_balance_6_months_ago                           0.195 0.845513    
## max_account_balance_5_months_ago                           -0.558 0.577102    
## min_account_balance_5_months_ago                           -0.960 0.337007    
## avrg_account_balance_5_months_ago                           0.548 0.583602    
## max_account_balance_4_months_ago                            1.681 0.092707 .  
## min_account_balance_4_months_ago                           -1.597 0.110360    
## avrg_account_balance_4_months_ago                           0.311 0.755719    
## max_account_balance_3_months_ago                           -0.828 0.407405    
## min_account_balance_3_months_ago                           -1.007 0.314171    
## avrg_account_balance_3_months_ago                           0.095 0.924039    
## max_account_balance_2_months_ago                            0.012 0.990767    
## min_account_balance_2_months_ago                           -0.496 0.619984    
## avrg_account_balance_2_months_ago                           0.429 0.667824    
## max_account_balance_1_months_ago                            1.031 0.302503    
## min_account_balance_1_months_ago                            0.461 0.644445    
## avrg_account_balance_1_months_ago                          -1.103 0.269955    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1231.78  on 1250  degrees of freedom
## Residual deviance:  910.72  on 1205  degrees of freedom
## AIC: 1002.7
## 
## Number of Fisher Scoring iterations: 8

Q.6.(b) Report the summary table of the logistic regression model fit.

Q.6.(b) Analysis -

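The summary of the logistic regression model lists, for each feature, the estimated coefficient, its standard error, the z value, and the p-value. The coefficients define a linear equation for the log-odds of the binary credit-rating outcome (assuming the default logit link of the binomial family), and the p-values indicate which features are statistically significant and therefore most relevant to the analysis. The fit reduces the deviance from 1231.78 (null, 1250 degrees of freedom) to 910.72 (residual, 1205 degrees of freedom), with an AIC of 1002.7.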
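As a minimal sketch of how a coefficient table turns into a prediction, the snippet below fits a small logistic regression and reproduces `predict()` by hand. The data frame `toy` and its columns `x1`, `x2`, `y` are hypothetical synthetic data, not the creditworthiness dataset, and a logit link is assumed.

```r
# Synthetic data for illustration only (not the creditworthiness dataset)
set.seed(1)
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- rbinom(n, 1, plogis(-0.5 + 1.2 * toy$x1))

# Fit a logistic regression with the binomial family (logit link by default)
toy_model <- glm(y ~ x1 + x2, data = toy, family = binomial)

# Linear predictor (log-odds): intercept + sum of coefficient * feature
beta <- coef(toy_model)
log_odds <- beta["(Intercept)"] + beta["x1"] * toy$x1 + beta["x2"] * toy$x2

# The logistic function maps log-odds to a probability in (0, 1)
manual_probs <- 1 / (1 + exp(-log_odds))

# Compare with the probabilities returned by predict()
builtin_probs <- predict(toy_model, newdata = toy, type = "response")
max_abs_diff <- max(abs(manual_probs - builtin_probs))
print(max_abs_diff)
```

The same arithmetic applies to the fitted model in this report: each estimate in the summary table multiplies its feature, and the sum is passed through the logistic function.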

Q.6.(c) Which predictors of credit rating appear to be significant at 5% significance level?

Q.6.(c) Analysis -

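The following predictors are significant at the 5% level (p < 0.05), with p-values in parentheses:

functionary (< 2e-16)
re_balanced_paid_back_a_recently_overdrawn_current_acount (0.000234)
FI3O_credit_score (0.000139)
credit_refused_in_past (0.000259)
gender (0.022478)

Two further predictors, avrg_account_balance_8_months_ago (0.079334) and max_account_balance_4_months_ago (0.092707), are significant only at the 10% level.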
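Rather than reading the p-values off the printed table, the significant predictors can be filtered programmatically. The sketch below assumes a fitted `glm` object and uses hypothetical synthetic data (`toy`, with columns `x1` to `x3`) in place of the creditworthiness data.

```r
# Synthetic data for illustration only: only x1 drives the outcome
set.seed(2)
n <- 300
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$y <- rbinom(n, 1, plogis(2 * toy$x1))

toy_model <- glm(y ~ ., data = toy, family = binomial)

# summary()$coefficients is a matrix with one row per term and a
# "Pr(>|z|)" column holding the p-values
coef_table <- summary(toy_model)$coefficients
p_values <- coef_table[, "Pr(>|z|)"]

# Keep predictors (not the intercept) with p < 0.05
is_predictor <- rownames(coef_table) != "(Intercept)"
significant <- coef_table[is_predictor & p_values < 0.05, , drop = FALSE]

# Print the selected rows, most significant first
print(significant[order(significant[, "Pr(>|z|)"]), , drop = FALSE])
```

Applied to the model in this report, the same filter on `summary(lr_model)$coefficients` would be expected to return the predictors listed above.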

4.2 Support Vector Machine II

4.2.1 Model Implementation

# Fit a support vector machine
svm_model <- svm(train_y ~ ., data = train_X, kernel = "linear")
cat("The SVM Model is trained successfully.")

## The SVM Model is trained successfully.

Q.6.(d) Fit an SVM model of your choice to the training set.

Q.6.(d) Analysis -

An SVM with a linear kernel is fitted to the training set for the same binary target as the logistic regression: whether a customer's credit rating is “A” (1) or not (0).
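One detail of `e1071::svm()` is worth illustrating: it fits a classifier when the response is a factor and a regression when the response is numeric, and class probabilities are only available when `probability = TRUE` is set at both fit and predict time. The sketch below shows the classification variant on hypothetical synthetic data (`toy`, `label`); it is an illustration under those assumptions, not the model fitted in this report.

```r
library(e1071)

# Synthetic two-class data for illustration only
set.seed(3)
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
noisy_score <- toy$x1 + 0.5 * toy$x2 + rnorm(n, sd = 0.5)
toy$label <- factor(ifelse(noisy_score > 0, "1", "0"), levels = c("0", "1"))

# A factor response makes svm() default to C-classification;
# probability = TRUE fits the extra model needed for probability estimates
toy_svm <- svm(label ~ ., data = toy, kernel = "linear", probability = TRUE)

# predict() returns the predicted classes; the class probabilities are
# attached as the "probabilities" attribute (one column per class level)
pred <- predict(toy_svm, newdata = toy, probability = TRUE)
prob_matrix <- attr(pred, "probabilities")
prob_class1 <- prob_matrix[, "1"]
print(head(prob_class1))
```

A numeric vector such as `prob_class1` is the kind of score that `roc()` expects as its predictor argument.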

4.3 Comparative Analysis

# Predict probabilities for logistic regression model
lr_probs <- predict(lr_model, newdata = test_X, type = "response")

# Predict scores for SVM model
# (probability = TRUE yields class probabilities only if svm() was also fitted with probability = TRUE)
svm_probs <- predict(svm_model, newdata = test_X, probability = TRUE)

# Create a ROC curve for logistic regression model
lr_roc <- roc(test_y, lr_probs, levels = c("0", "1"))

# Create a ROC curve for SVM model
svm_roc <- roc(test_y, svm_probs, levels = c("0", "1"))

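# Plot ROC curves for both models in a single plot
# legacy.axes = TRUE labels the x axis as 1 - specificity, matching the "False Positive Rate" label
plot(lr_roc, col = "blue", main = "ROC Curves", legacy.axes = TRUE, xlab = "False Positive Rate", ylab = "True Positive Rate")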
lines(svm_roc, col = "red")
legend("bottomright", legend = c("Logistic Regression", "SVM"), col = c("blue", "red"), lwd = 1)

# Add text for AUC values
text(0.05, 0.4, paste("LR AUC =", round(auc(lr_roc), 2)), col = "blue")
text(0.05, 0.3, paste("SVM AUC =", round(auc(svm_roc), 2)), col = "red")

Q.6.(e) Produce an ROC chart comparing the logistic regression and the SVM results of predicting the test set. Comment on any differences in their performance.

Q.6.(e) Analysis -

Both the logistic regression and the SVM classifier are evaluated on the test data, and their ROC curves, which plot the true positive rate against the false positive rate, are drawn on a single chart. The logistic regression model achieves a ROC AUC of 0.81, marginally higher than the SVM classifier (ROC AUC = 0.80). On this basis the logistic regression classifier is preferred for the binary classification task, although a difference of 0.01 is small and the two models perform almost identically.
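To make the quantities behind the ROC chart concrete, the sketch below computes the true positive rate, false positive rate, and AUC in base R. The `labels` and `scores` vectors are hypothetical synthetic data, not the predictions of either model in this report.

```r
# Synthetic labels and scores for illustration only
set.seed(4)
labels <- rbinom(500, 1, 0.4)
scores <- labels * 0.8 + rnorm(500)  # positives score higher on average

# Sweep the decision threshold from the highest score to the lowest
thresholds <- sort(unique(scores), decreasing = TRUE)
tpr <- sapply(thresholds, function(t) mean(scores[labels == 1] >= t))
fpr <- sapply(thresholds, function(t) mean(scores[labels == 0] >= t))

# Area under the curve by the trapezoidal rule, starting from (0, 0)
fpr_full <- c(0, fpr)
tpr_full <- c(0, tpr)
auc_trapezoid <- sum(diff(fpr_full) * (head(tpr_full, -1) + tail(tpr_full, -1)) / 2)

# Equivalent rank-based form: the probability that a randomly chosen
# positive receives a higher score than a randomly chosen negative
r <- rank(scores)
n_pos <- sum(labels == 1)
n_neg <- sum(labels == 0)
auc_rank <- (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

print(c(trapezoid = auc_trapezoid, rank = auc_rank))
```

Because AUC is estimated from a finite test set, a gap as small as 0.81 versus 0.80 may not be statistically meaningful; pROC's `roc.test()` applied to `lr_roc` and `svm_roc` is one way to check this.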