# Importing Packages
library(tidyverse)
library(dplyr)
library(caret)
library(rpart)
library(rpart.plot)
library(randomForest)
library(e1071)
library(pROC)
cat("The requisite libraries are successfully imported.")
## The requisite libraries are successfully imported.
# Read the creditworthiness.csv file
creditworthiness_data <- read.csv("creditworthiness.csv")
cat("The shape of the dataset - ", dim(creditworthiness_data))
## The shape of the dataset - 2500 46
# Get the current column names
current_names <- colnames(creditworthiness_data)
# Replace each run of one or more "." with a single "_"
new_names <- gsub("\\.+", "_", current_names)
# Remove trailing "_"
new_names <- sub("_$", "", new_names)
# Set the new column names
colnames(creditworthiness_data) <- new_names
cat("All the column names are changed successfully.")
## All the column names are changed successfully.
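The renaming step can be illustrated on made-up column names. This is a minimal sketch: the names in `raw_names` are hypothetical, and the pattern `\\.+` is used so that any run of dots becomes one separator (an alternation such as `\\.|\\.\\.` would turn `..` into `__`, because the single-dot branch always matches first).

```r
# Hypothetical raw names, similar to what read.csv() produces from headers with spaces
raw_names <- c("credit.rating",
               "savings.on.other.accounts",
               "X0..accounts.at.other.banks.")

# Collapse every run of dots into a single underscore
clean_names <- gsub("\\.+", "_", raw_names)

# Drop a trailing underscore, if any
clean_names <- sub("_$", "", clean_names)

print(clean_names)
```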
# Filter the data to include only the rows with known ratings
known_ratings_data <- creditworthiness_data[complete.cases(creditworthiness_data$credit_rating), ]
cat("The observations with missing values are removed from the dataset.")
## The observations with missing values are removed from the dataset.
# Set the seed for reproducibility
set.seed(42)
# Splitting the dataset into X and y
X <- known_ratings_data[, !(names(known_ratings_data) %in% c("credit_rating"))]
y <- known_ratings_data$credit_rating
# Split the data into train and test sets
train_index <- createDataPartition(y, p = 0.5, list = FALSE)
train_X <- X[train_index, ]
train_y <- y[train_index]
test_X <- X[-train_index, ]
test_y <- y[-train_index]
# Print train and test data shapes
cat("Train Data Shape: ", dim(train_X), "\n")
## Train Data Shape: 1251 45
cat("Test Data Shape: ", dim(test_X), "\n")
## Test Data Shape: 1249 45
Q.1. We need to predict the credit rating that would be assigned to each individual. The data on 2500 customers have been collected, and credit rating for 1962 of them has been assessed as either A, B, or C, coded as 1, 2, or 3, respectively, with the remaining 538 needing to be classified. Write the code to split the dataset into 50% training set and 50% test set and only include the data with known ratings.
Q.1. Analysis -
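In this phase, the creditworthiness data is divided into a 50% training set and a 50% test set, giving 1251 and 1249 observations respectively, each with 45 features. These two sets total 2500 rows, i.e. the full dataset rather than the 1962 rated customers: complete.cases() removes only NA values, and the 538 unrated customers are stored as class 0 (268 in the training set and 270 in the test set), so they remain in the data used below.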
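If the unrated customers are coded as 0 rather than NA (an assumption, suggested by the class 0 that appears in the confusion matrices), they can be excluded with a value filter instead of complete.cases(). A minimal sketch on a toy data frame; `toy`, `by_na`, and `by_value` are illustrative names, not objects from the analysis:

```r
# Toy data: 0 stands for "not yet rated" (assumption), 1 to 3 for ratings A to C
toy <- data.frame(customer_id   = 1:8,
                  credit_rating = c(1, 0, 2, 3, 0, 2, 1, 0))

# complete.cases() only drops NA values, so every row is kept here
by_na <- toy[complete.cases(toy$credit_rating), ]

# A value filter drops the rows whose rating is the placeholder 0
by_value <- toy[toy$credit_rating != 0, ]

cat("Rows kept by complete.cases():", nrow(by_na), "\n")
cat("Rows kept by value filter:", nrow(by_value), "\n")
```

Applying such a filter to the real data would change every result reported below, so it is shown only as an illustration.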
# Fit a decision tree model to the training set
dt_model <- rpart(train_y ~ ., data = train_X, method = "class")
cat("The Decision Tree Model is trained successfully.")
## The Decision Tree Model is trained successfully.
# Plot the decision tree
rpart.plot(dt_model)
Q.2.(a) Using default settings, fit a decision tree to the training set to predict the credit ratings of customers using all of the other variables in the dataset. Report the resulting tree.
Q.2.(a) Analysis -
The Decision Tree shown above indicates that the model uses four features from the dataset as follows -
functionary
re_balanced_paid_back_a_recently_overdrawn_current_acount
FI3O_credit_score
savings_on_other_accounts
The model classifies the data based on the following if-else conditions -
FI3O_credit_score = 0
savings_on_other_accounts < 4
functionary = 1
re_balanced_paid_back_a_recently_overdrawn_current_acount = 0
savings_on_other_accounts < 3
# Calculate medians for all attributes in train_X
attribute_medians <- sapply(train_X, median)
# Create the median_customer dataframe
median_customer <- data.frame(t(attribute_medians))
median_prediction <- predict(dt_model, newdata = median_customer, type = "class")
cat("The predicted credit rating using Decision Tree model is =", as.character(median_prediction[1]))
## The predicted credit rating using Decision Tree model is = 2
Q.2.(b) Based on this output, predict the credit rating of a hypothetical “median” customer, i.e., one observation considering all the attributes from the dataset. Show the steps involved.
Q.2.(b) Analysis -
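The steps for predicting the credit rating of the median customer are as follows -
Compute the median of every attribute in the training set.
Assemble these medians into a one-row data frame (median_customer).
Pass this row to predict() with the fitted tree and type = "class", so that it is routed through the split conditions listed in Q.2.(a) to a leaf.
The predicted credit rating using Decision Tree model is = 2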
# Predict credit ratings for the test set
predictions <- predict(dt_model, newdata = test_X, type = "class")
# Convert predictions and test_y into factors with the same levels
predictions <- factor(predictions, levels = c("0", "1", "2", "3"))
test_y <- factor(test_y, levels = c("0", "1", "2", "3"))
# Create a confusion matrix
confusion_matrix <- confusionMatrix(predictions, test_y)
# Get the confusion matrix table
confusion_table <- confusion_matrix$table
confusion_table
## Reference
## Prediction 0 1 2 3
## 0 188 5 11 43
## 1 22 140 84 35
## 2 48 94 376 137
## 3 12 1 14 39
# Calculate the overall accuracy rate
accuracy <- sum(diag(confusion_table)) / sum(confusion_table)
# Print the overall accuracy rate
print(paste("Overall Accuracy Rate of Decision Tree model:", round(accuracy * 100, 2), "%"))
## [1] "Overall Accuracy Rate of Decision Tree model: 59.49 %"
Q.2.(c) Produce the confusion matrix for predicting the credit rating from this tree on the test set, and also report the overall accuracy rate.
Q.2.(c) Analysis -
Here the confusion matrix is presented considering the predicted and actual credit ratings from the test data using the Decision Tree model. There are 4 classes in total in the credit_rating column, i.e., 0, 1, 2, and 3. From the confusion matrix of the Decision Tree model, the overall accuracy can be identified as follows -
The Decision Tree model classified 743 of the 1249 test observations correctly (the diagonal of the matrix), i.e. 59.49% test accuracy in predicting the credit ratings.
Q.2.(d) What is the numerical value of the gain in entropy corresponding to the first split at the top of the tree? (Use logarithms to base 2, and show the details of the calculation rather than just providing a final answer.)
Q.2.(d) Analysis -
In order to calculate the information gain of the first split at the top of the Decision Tree, the entropy of the root node and of its two child nodes needs to be calculated first. The class proportions are read from the tree plot (rounded to two decimals), and the term \(0 \cdot \log_{2} 0\) is taken as 0.
Entropy for root/parent node \[E_p = - (0.21 \cdot \log_{2}0.21 + 0.19 \cdot \log_{2}0.19 + 0.39 \cdot \log_{2}0.39 + 0.20 \cdot \log_{2}0.20) = 1.9222\]
Entropy for child 1 node \[E_c^1 = - (0.76 \cdot \log_{2}0.76 + 0.00 \cdot \log_{2}0.00 + 0.04 \cdot \log_{2}0.04 + 0.20 \cdot \log_{2}0.20) = 0.9510\]
Entropy for child 2 node \[E_c^2 = - (0.10 \cdot \log_{2}0.10 + 0.24 \cdot \log_{2}0.24 + 0.46 \cdot \log_{2}0.46 + 0.21 \cdot \log_{2}0.21) = 1.8145\]
The root node contains all 1251 training observations, among which -
The number of child 1 instances (18% of root instances) \(= 1251 \times \frac{18}{100} \approx 225\)
The number of child 2 instances (82% of root instances) \(= 1251 \times \frac{82}{100} \approx 1026\)
Hence, the information gain is the root entropy minus the weighted average of the child entropies \[IG = E_p - \frac{225}{1251} E_c^1 - \frac{1026}{1251} E_c^2 = 1.9222 - (\frac{225}{1251} \cdot 0.9510) - (\frac{1026}{1251} \cdot 1.8145) = 0.2630\]
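The calculation above can be checked numerically. A minimal sketch in base R, assuming the rounded class proportions and node counts quoted in the derivation; `entropy()` is a helper defined here for illustration. Because the inputs are rounded to two decimals, the result agrees with the hand calculation to about three decimal places.

```r
# Shannon entropy in bits; zero proportions are dropped, i.e. 0 * log2(0) is treated as 0
entropy <- function(p) {
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Class proportions (classes 0, 1, 2, 3) as quoted in the derivation
p_root   <- c(0.21, 0.19, 0.39, 0.20)
p_child1 <- c(0.76, 0.00, 0.04, 0.20)
p_child2 <- c(0.10, 0.24, 0.46, 0.21)

# Node sizes: about 18% and 82% of the 1251 training rows
n_root   <- 1251
n_child1 <- 225
n_child2 <- 1026

# Information gain = parent entropy minus weighted child entropies
info_gain <- entropy(p_root) -
  (n_child1 / n_root) * entropy(p_child1) -
  (n_child2 / n_root) * entropy(p_child2)

cat("Root entropy:", round(entropy(p_root), 4), "\n")
cat("Information gain:", round(info_gain, 4), "\n")
```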
# Convert the target variable to a factor with appropriate levels
train_y <- factor(train_y, levels = c("0", "1", "2", "3"))
# Fit a random forest model
rf_model <- randomForest(train_y ~ ., data = train_X)
cat("The Random Forest Model is trained successfully.")
## The Random Forest Model is trained successfully.
# Print the model summary
print(rf_model)
##
## Call:
## randomForest(formula = train_y ~ ., data = train_X)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 6
##
## OOB estimate of error rate: 41.25%
## Confusion matrix:
## 0 1 2 3 class.error
## 0 219 7 39 3 0.1828358
## 1 2 78 161 2 0.6790123
## 2 15 56 402 12 0.1711340
## 3 46 18 155 36 0.8588235
Q.2.(e) Fit a random forest model to the training set to try to improve prediction. Report the R output.
Q.2.(e) Analysis -
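In this phase, the Random Forest model is trained on the training data with default settings: 500 trees and 6 variables tried at each split. The out-of-bag (OOB) estimate of the error rate is 41.25%, with the highest class errors for class 3 (85.9%) and class 1 (67.9%).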
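The OOB error rate and the class.error column in the printed output follow directly from the OOB confusion matrix. A minimal sketch in base R that re-enters the matrix by hand (rows are the actual classes, columns the predicted classes); the object names are illustrative:

```r
# OOB confusion matrix copied from the printed randomForest output
oob <- matrix(c(219,  7,  39,  3,
                  2, 78, 161,  2,
                 15, 56, 402, 12,
                 46, 18, 155, 36),
              nrow = 4, byrow = TRUE,
              dimnames = list(actual = as.character(0:3),
                              predicted = as.character(0:3)))

# Overall OOB error: share of off-diagonal observations
oob_error <- 1 - sum(diag(oob)) / sum(oob)

# Per-class error: share of each actual class that is misclassified
class_error <- 1 - diag(oob) / rowSums(oob)

cat("OOB error rate (%):", round(oob_error * 100, 2), "\n")
print(round(class_error, 7))
```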
# Predict credit ratings for the test set
predictions <- predict(rf_model, newdata = test_X, type = "class")
# Convert predictions and test_y into factors with the same levels
predictions <- factor(predictions, levels = c("0", "1", "2", "3"))
test_y <- factor(test_y, levels = c("0", "1", "2", "3"))
# Create a confusion matrix
confusion_matrix <- confusionMatrix(predictions, test_y)
# Get the confusion matrix table
confusion_table <- confusion_matrix$table
confusion_table
## Reference
## Prediction 0 1 2 3
## 0 219 4 12 45
## 1 4 78 37 18
## 2 39 157 421 158
## 3 8 1 15 33
# Calculate the overall accuracy rate
accuracy <- sum(diag(confusion_table)) / sum(confusion_table)
# Print the overall accuracy rate
print(paste("Overall Accuracy Rate of Random Forest model:", round(accuracy * 100, 2), "%"))
## [1] "Overall Accuracy Rate of Random Forest model: 60.13 %"
Q.2.(f) Produce the confusion matrix for predicting the credit rating from this forest on the test set, and also report the overall accuracy rate.
Q.2.(f) Analysis - From the confusion matrix of the Random Forest model, the overall accuracy can be identified as follows -
The Random Forest model achieved 60.13% test accuracy (751 of 1249 observations) in predicting the credit ratings. With the default configuration it outperforms the Decision Tree model (59.49%) by 0.64 percentage points.
# Fit a support vector machine
svm_model <- svm(train_y ~ ., data = train_X)
cat("The SVM Model is trained successfully.")
## The SVM Model is trained successfully.
# Print the model summary
print(svm_model)
##
## Call:
## svm(formula = train_y ~ ., data = train_X)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
##
## Number of Support Vectors: 1181
median_prediction <- predict(svm_model, newdata = median_customer, type = "class")
cat("The predicted credit rating using SVM model is =", as.character(median_prediction[1]))
## The predicted credit rating using SVM model is = 2
Q.3.(a) Using default settings for svm() from the e1071 package, fit a support vector machine to predict the credit ratings of customers using all of the other variables in the dataset. Predict the credit rating of a hypothetical “median” customer, i.e., one observation considering all the attributes from the dataset. Report decision values as well.
Q.3.(a) Analysis -
The default fit is a C-classification SVM with a radial kernel and cost = 1, using 1181 support vectors. For the median customer, the SVM model predicts credit rating 2.
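Q.3.(a) also asks for decision values. A minimal sketch of how they can be requested from predict() in e1071, using the built-in iris data as a stand-in (an assumption; with the objects above, the same call would be made on svm_model and median_customer). For a multi-class SVM, e1071 returns one decision value per pair of classes.

```r
library(e1071)  # same package the analysis uses for svm()

# Stand-in model on iris with default settings (radial kernel, cost = 1)
fit <- svm(Species ~ ., data = iris)

# A "median" observation: one row holding the median of every predictor
median_flower <- data.frame(t(sapply(iris[, 1:4], median)))

# Ask predict() to attach the pairwise decision values to the prediction
pred <- predict(fit, newdata = median_flower, decision.values = TRUE)

print(pred)
print(attr(pred, "decision.values"))
```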
# Predict credit ratings for the test set
predictions <- predict(svm_model, newdata = test_X, type = "class")
# Convert predictions and test_y into factors with the same levels
predictions <- factor(predictions, levels = c("0", "1", "2", "3"))
test_y <- factor(test_y, levels = c("0", "1", "2", "3"))
# Create a confusion matrix
confusion_matrix <- confusionMatrix(predictions, test_y)
# Get the confusion matrix table
confusion_table <- confusion_matrix$table
confusion_table
## Reference
## Prediction 0 1 2 3
## 0 203 3 9 44
## 1 9 127 71 30
## 2 39 107 370 133
## 3 19 3 35 47
# Calculate the overall accuracy rate
accuracy <- sum(diag(confusion_table)) / sum(confusion_table)
# Print the overall accuracy rate
print(paste("Overall Accuracy Rate of SVM model:", round(accuracy * 100, 2), "%"))
## [1] "Overall Accuracy Rate of SVM model: 59.81 %"
Q.3.(b) Produce the confusion matrix for predicting the credit rating from this SVM on the test set, and also report the overall accuracy rate.
Q.3.(b) Analysis -
From the confusion matrix of the SVM model, the overall accuracy can be identified as follows -
The SVM model achieved 59.81% test accuracy (747 of 1249 observations) in predicting the credit ratings.
# Define the parameter grid for tuning
param_grid <- expand.grid(C = c(0.001, 0.005, 0.01),
gamma = c(0.001, 0.005, 0.01))
# Perform the grid search for tuning
tuned_model <- tune(svm, train_X, train_y, kernel = "linear", ranges = param_grid)
cat("The optimized SVM Model is trained successfully.")
## The optimized SVM Model is trained successfully.
# Print the tuned model
print(tuned_model)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## C gamma
## 0.001 0.001
##
## - best performance: 0.3765206
# Predict credit ratings for the test set using the tuned model
predictions <- predict(tuned_model$best.model, newdata = test_X)
# Convert predictions and test_y into factors with the same levels
predictions <- factor(predictions, levels = c("0", "1", "2", "3"))
test_y <- factor(test_y, levels = c("0", "1", "2", "3"))
# Create a confusion matrix
confusion_matrix <- confusionMatrix(predictions, test_y)
# Get the confusion matrix table
confusion_table <- confusion_matrix$table
confusion_table
## Reference
## Prediction 0 1 2 3
## 0 206 0 7 30
## 1 6 132 74 30
## 2 25 96 331 103
## 3 33 12 73 91
# Calculate the overall accuracy rate
accuracy <- sum(diag(confusion_table)) / sum(confusion_table)
# Print the overall accuracy rate
print(paste("Overall Accuracy Rate of optimized SVM model:", round(accuracy * 100, 2), "%"))
## [1] "Overall Accuracy Rate of optimized SVM model: 60.85 %"
Q.3.(c) Automatically or manually tune the SVM to improve prediction over that found in 3b. Report the resulting SVM settings and the resulting confusion matrix for predicting the test set. (Any amount of improvement is acceptable.)
Q.3.(c) Analysis -
In this phase, a grid search with 10-fold cross-validation is performed over the values 0.001, 0.005, and 0.01 for C and gamma with a linear kernel, and 0.001 is reported as the best value for both, with a cross-validation error of 0.3765. Note that svm() in e1071 names its regularisation argument cost rather than C, and the linear kernel does not use gamma, so the improvement most likely comes from switching the kernel from radial to linear rather than from the grid values. The resulting SVM model is evaluated on the test dataset to predict the credit ratings. From the confusion matrix, the overall accuracy can be identified as follows -
The optimized SVM model achieved 60.85% test accuracy (760 of 1249 observations) and outperforms the default SVM model (59.81%) by 1.04 percentage points.
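In e1071 the regularisation argument of svm() is named cost, and tune() takes its ranges as a named list whose names match svm() arguments. A minimal sketch of a grid search over cost and gamma, using the built-in iris data and a radial kernel as stand-ins (both are assumptions; the grid values are illustrative and not tuned for the credit data):

```r
library(e1071)

set.seed(42)  # the cross-validation folds are drawn at random

# Grid search over parameters that svm() actually accepts
tuned <- tune(svm, Species ~ ., data = iris,
              kernel = "radial",
              ranges = list(cost  = c(0.1, 1, 10),
                            gamma = c(0.01, 0.1, 1)))

# Best parameter combination and its cross-validated error
print(tuned$best.parameters)
print(tuned$best.performance)

# One row per grid point, which shows whether the parameters change the error at all
print(tuned$performances)
```

If every row of the performances table shows the same error, the parameters in the grid are not reaching the model.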
# Fit the Naive Bayes model
nb_model <- naiveBayes(train_y ~ ., data = train_X)
cat("The Naïve Bayes Model is trained successfully.")
## The Naïve Bayes Model is trained successfully.
median_prediction <- predict(nb_model, newdata = median_customer, type = "class")
cat("The predicted credit rating using Naïve Bayes model is =", as.character(median_prediction[1]))
## The predicted credit rating using Naïve Bayes model is = 1
Q.4.(a) Fit the Naive Bayes model to predict the credit ratings of customers using all of the other variables in the dataset. Predict the credit rating of a hypothetical “median” customer, i.e., one observation considering all the attributes from the dataset. Report predicted probabilities as well.
Q.4.(a) Analysis -
Here the credit rating of the median customer is predicted using the Naïve Bayes model. The predicted credit rating is 1, unlike the Decision Tree and SVM models, which both predict 2.
# Print the first 20 lines of the model summary
summary_output <- capture.output(print(nb_model))
first_20_lines <- head(summary_output, n = 20)
print(first_20_lines)
## [1] ""
## [2] "Naive Bayes Classifier for Discrete Predictors"
## [3] ""
## [4] "Call:"
## [5] "naiveBayes.default(x = X, y = Y, laplace = laplace)"
## [6] ""
## [7] "A-priori probabilities:"
## [8] "Y"
## [9] " 0 1 2 3 "
## [10] "0.2142286 0.1942446 0.3876898 0.2038369 "
## [11] ""
## [12] "Conditional probabilities:"
## [13] " functionary"
## [14] "Y [,1] [,2]"
## [15] " 0 0.2313433 0.4224803"
## [16] " 1 0.6296296 0.4839006"
## [17] " 2 0.2020619 0.4019527"
## [18] " 3 0.2078431 0.4065619"
## [19] ""
## [20] " re_balanced_paid_back_a_recently_overdrawn_current_acount"
Q.4.(b) Reproduce the first 20 or so lines of the R output for the Naive Bayes fit, and use them to explain the steps involved in making this prediction.
Q.4.(b) Analysis -
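In this phase of analysis, the first lines of the Naïve Bayes fit are presented. They show the a-priori probabilities of each class, i.e. the class proportions in the training set (0.214, 0.194, 0.388, and 0.204 for classes 0 to 3), followed by one table per predictor. Because the predictors are numeric, each table gives the mean ([,1]) and standard deviation ([,2]) of the predictor within each class, not a probability; for example, functionary has a mean of 0.63 in class 1 but about 0.20 to 0.23 in the other classes. To make a prediction, the model evaluates a normal density with these parameters at the customer's value of each predictor, multiplies the densities together with the class prior, and assigns the class with the largest product.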
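The prediction steps can be traced by hand for a single predictor. A minimal sketch in base R, using the priors and the functionary mean and standard deviation printed above, and assuming a Gaussian density for numeric predictors; the value `x = 0` is a hypothetical input, and a full prediction would multiply in the densities of all 45 predictors.

```r
# A-priori class probabilities from the printed output
prior <- c("0" = 0.2142286, "1" = 0.1942446, "2" = 0.3876898, "3" = 0.2038369)

# Mean ([,1]) and standard deviation ([,2]) of functionary within each class
mu    <- c(0.2313433, 0.6296296, 0.2020619, 0.2078431)
sigma <- c(0.4224803, 0.4839006, 0.4019527, 0.4065619)

# Hypothetical customer value for functionary
x <- 0

# Class-conditional Gaussian density of x, then prior times likelihood
likelihood   <- dnorm(x, mean = mu, sd = sigma)
unnormalised <- prior * likelihood

# Normalise so the class scores sum to 1
posterior <- unnormalised / sum(unnormalised)

print(round(posterior, 4))
cat("Most probable class from this one predictor:", names(which.max(posterior)), "\n")
```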
# Predict credit ratings for the test set
predictions <- predict(nb_model, newdata = test_X, type = "class")
# Convert predictions and test_y into factors with the same levels
predictions <- factor(predictions, levels = c("0", "1", "2", "3"))
test_y <- factor(test_y, levels = c("0", "1", "2", "3"))
# Create a confusion matrix
confusion_matrix <- confusionMatrix(predictions, test_y)
# Get the confusion matrix table
confusion_table <- confusion_matrix$table
confusion_table
## Reference
## Prediction 0 1 2 3
## 0 206 5 14 48
## 1 38 200 287 101
## 2 6 27 117 29
## 3 20 8 67 76
# Calculate the overall accuracy rate
accuracy <- sum(diag(confusion_table)) / sum(confusion_table)
# Print the overall accuracy rate
print(paste("Overall Accuracy Rate of Naïve Bayes model:", round(accuracy * 100, 2), "%"))
## [1] "Overall Accuracy Rate of Naïve Bayes model: 47.96 %"
Q.4.(c) Produce the confusion matrix for predicting the credit rating using Naive Bayes on the test set, and also report the overall accuracy rate.
Q.4.(c) Analysis -
From the confusion matrix of the Naïve Bayes model, the overall accuracy can be identified as follows -
The Naïve Bayes model achieved 47.96% test accuracy (599 of 1249 observations) in predicting the credit ratings.
Q.5.(a) Based on the confusion matrices reported in the preceding parts, which of the classifiers looks to be the best? (Be specific, and specify the figures you used to answer this question.)
Q.5.(a) Analysis -
The optimized Support Vector Machine (SVM) classifier achieved the highest test accuracy, 60.85%, ahead of Random Forest (60.13%), the default SVM (59.81%), the Decision Tree (59.49%), and Naïve Bayes (47.96%). It is also the only classifier that recovers more than a third of class 3 (91 of 254). Therefore, this classifier is selected as the best-performing classifier for multi-class classification.
Q.5.(b) Which looks to be the worst? (Be specific, and specify the figures you used to answer this question.)
Q.5.(b) Analysis -
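The Naïve Bayes classifier achieved 47.96% test accuracy, which is the lowest of all the classifiers and more than 11 percentage points below the next lowest (Decision Tree, 59.49%). It assigns 626 of the 1249 test observations to class 1, although only 240 belong to it. Hence, this classifier is selected as the worst-performing classifier for multi-class classification.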
Q.5.(c) Are there any categories that all classifiers seem to have trouble with?
Q.5.(c) Analysis -
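All the classifiers have trouble with the observations whose credit rating is 3. Of the 254 such observations in the test set, the Decision Tree recovers 39 (15.4%), Random Forest 33 (13.0%), the default SVM 47 (18.5%), the optimized SVM 91 (35.8%), and Naïve Bayes 76 (29.9%); most of the remainder are assigned to class 2, or to class 1 in the case of Naïve Bayes. Class 1 is a secondary weak point for Random Forest (78 of 240 recovered) and class 2 for Naïve Bayes (117 of 485 recovered). This poor per-class performance appears as a high false-negative rate for class 3 and a high false-positive rate for the class that absorbs its observations.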
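Per-class performance can be read off a confusion matrix by dividing its diagonal by the column totals (recall) or the row totals (precision). A minimal sketch in base R that re-enters the Decision Tree matrix from Q.2.(c) by hand, with predictions in rows and reference classes in columns; the object names are illustrative.

```r
# Decision Tree confusion matrix copied from Q.2.(c)
dt_table <- matrix(c(188,   5,  11,  43,
                      22, 140,  84,  35,
                      48,  94, 376, 137,
                      12,   1,  14,  39),
                   nrow = 4, byrow = TRUE,
                   dimnames = list(Prediction = as.character(0:3),
                                   Reference  = as.character(0:3)))

# Recall: share of each actual class that is predicted correctly
recall <- diag(dt_table) / colSums(dt_table)

# Precision: share of each predicted class that is actually correct
precision <- diag(dt_table) / rowSums(dt_table)

# Overall accuracy, for comparison with the reported figure
accuracy <- sum(diag(dt_table)) / sum(dt_table)

print(round(rbind(recall, precision), 3))
cat("Class with the lowest recall:", names(which.min(recall)), "\n")
```

The same three lines applied to the other confusion matrices give the per-class figures for the remaining classifiers.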
# Replace all values in the "credit_rating" attribute as "0" except for "1"
known_ratings_data$credit_rating[known_ratings_data$credit_rating != "1"] <- "0"
# Convert the "credit_rating" column to numeric
known_ratings_data$credit_rating <- as.numeric(known_ratings_data$credit_rating)
cat("The dataset is transformed as binary labeled data.")
## The dataset is transformed as binary labeled data.
# Set the seed for reproducibility
set.seed(42)
# Splitting the dataset into X and y
X <- known_ratings_data[, !(names(known_ratings_data) %in% c("credit_rating"))]
y <- known_ratings_data$credit_rating
# Reuse the earlier train/test indices so the binary task is evaluated on the same split
train_X <- X[train_index, ]
train_y <- y[train_index]
test_X <- X[-train_index, ]
test_y <- y[-train_index]
# Fit logistic regression model
lr_model <- glm(train_y ~ ., data = train_X, family = binomial)
cat("The Logistic Regression Model is trained successfully.")
## The Logistic Regression Model is trained successfully.
Q.6.(a) Consider a simpler problem of predicting whether a customer gets a credit rating of A or not. Fit a logistic regression model to predict whether a customer gets a credit rating of A using all of the other variables in the dataset, with no interactions.
Q.6.(a) Analysis -
In order to perform this analysis, the target is converted into a binary label: 1 if the credit rating is “A” and 0 otherwise. The data is divided into training and test sets with the same 50/50 split as before, and a Logistic Regression model with all predictors and no interactions is trained to predict whether a customer is rated “A” (1) or not (0).
# Print the model summary
summary(lr_model)
##
## Call:
## glm(formula = train_y ~ ., family = binomial, data = train_X)
##
## Coefficients:
## Estimate Std. Error
## (Intercept) -7.8290073 1.6677665
## functionary 1.8704272 0.1721814
## re_balanced_paid_back_a_recently_overdrawn_current_acount 2.7335508 0.7429015
## FI3O_credit_score 3.9018769 1.0241991
## gender 0.3877227 0.1698910
## X0_accounts_at_other_banks -0.0087100 0.0600297
## credit_refused_in_past -1.3394385 0.3666053
## years_employed 0.0649816 0.1256107
## savings_on_other_accounts -0.0116092 0.0975863
## self_employed -0.1302103 0.2213958
## max_account_balance_12_months_ago -0.0387255 0.0616430
## min_account_balance_12_months_ago 0.0058569 0.0602593
## avrg_account_balance_12_months_ago -0.0534866 0.0606943
## max_account_balance_11_months_ago -0.0493602 0.0616380
## min_account_balance_11_months_ago -0.0132211 0.0600179
## avrg_account_balance_11_months_ago 0.0439142 0.0614795
## max_account_balance_10_months_ago 0.0150089 0.0594532
## min_account_balance_10_months_ago -0.0296392 0.0602040
## avrg_account_balance_10_months_ago -0.0485102 0.0614909
## max_account_balance_9_months_ago -0.0011398 0.0582369
## min_account_balance_9_months_ago 0.0099618 0.0594209
## avrg_account_balance_9_months_ago -0.0101332 0.0585938
## max_account_balance_8_months_ago 0.0322165 0.0587269
## min_account_balance_8_months_ago 0.0308818 0.0600507
## avrg_account_balance_8_months_ago -0.1060412 0.0604374
## max_account_balance_7_months_ago -0.0475631 0.0592932
## min_account_balance_7_months_ago -0.0294268 0.0609789
## avrg_account_balance_7_months_ago -0.0007051 0.0590718
## max_account_balance_6_months_ago 0.0457034 0.0613093
## min_account_balance_6_months_ago 0.0703718 0.0597448
## avrg_account_balance_6_months_ago 0.0118785 0.0609635
## max_account_balance_5_months_ago -0.0330777 0.0593192
## min_account_balance_5_months_ago -0.0577640 0.0601648
## avrg_account_balance_5_months_ago 0.0325946 0.0594651
## max_account_balance_4_months_ago 0.1003490 0.0596857
## min_account_balance_4_months_ago -0.0990256 0.0620236
## avrg_account_balance_4_months_ago 0.0186868 0.0600654
## max_account_balance_3_months_ago -0.0489371 0.0590694
## min_account_balance_3_months_ago -0.0621361 0.0617342
## avrg_account_balance_3_months_ago 0.0057960 0.0607878
## max_account_balance_2_months_ago 0.0006822 0.0589490
## min_account_balance_2_months_ago -0.0293073 0.0591024
## avrg_account_balance_2_months_ago 0.0251421 0.0585877
## max_account_balance_1_months_ago 0.0626658 0.0607768
## min_account_balance_1_months_ago 0.0276050 0.0598167
## avrg_account_balance_1_months_ago -0.0676082 0.0612855
## z value Pr(>|z|)
## (Intercept) -4.694 2.68e-06 ***
## functionary 10.863 < 2e-16 ***
## re_balanced_paid_back_a_recently_overdrawn_current_acount 3.680 0.000234 ***
## FI3O_credit_score 3.810 0.000139 ***
## gender 2.282 0.022478 *
## X0_accounts_at_other_banks -0.145 0.884635
## credit_refused_in_past -3.654 0.000259 ***
## years_employed 0.517 0.604929
## savings_on_other_accounts -0.119 0.905304
## self_employed -0.588 0.556443
## max_account_balance_12_months_ago -0.628 0.529858
## min_account_balance_12_months_ago 0.097 0.922571
## avrg_account_balance_12_months_ago -0.881 0.378185
## max_account_balance_11_months_ago -0.801 0.423243
## min_account_balance_11_months_ago -0.220 0.825648
## avrg_account_balance_11_months_ago 0.714 0.475048
## max_account_balance_10_months_ago 0.252 0.800694
## min_account_balance_10_months_ago -0.492 0.622498
## avrg_account_balance_10_months_ago -0.789 0.430170
## max_account_balance_9_months_ago -0.020 0.984384
## min_account_balance_9_months_ago 0.168 0.866861
## avrg_account_balance_9_months_ago -0.173 0.862698
## max_account_balance_8_months_ago 0.549 0.583293
## min_account_balance_8_months_ago 0.514 0.607068
## avrg_account_balance_8_months_ago -1.755 0.079334 .
## max_account_balance_7_months_ago -0.802 0.422456
## min_account_balance_7_months_ago -0.483 0.629398
## avrg_account_balance_7_months_ago -0.012 0.990476
## max_account_balance_6_months_ago 0.745 0.455996
## min_account_balance_6_months_ago 1.178 0.238847
## avrg_account_balance_6_months_ago 0.195 0.845513
## max_account_balance_5_months_ago -0.558 0.577102
## min_account_balance_5_months_ago -0.960 0.337007
## avrg_account_balance_5_months_ago 0.548 0.583602
## max_account_balance_4_months_ago 1.681 0.092707 .
## min_account_balance_4_months_ago -1.597 0.110360
## avrg_account_balance_4_months_ago 0.311 0.755719
## max_account_balance_3_months_ago -0.828 0.407405
## min_account_balance_3_months_ago -1.007 0.314171
## avrg_account_balance_3_months_ago 0.095 0.924039
## max_account_balance_2_months_ago 0.012 0.990767
## min_account_balance_2_months_ago -0.496 0.619984
## avrg_account_balance_2_months_ago 0.429 0.667824
## max_account_balance_1_months_ago 1.031 0.302503
## min_account_balance_1_months_ago 0.461 0.644445
## avrg_account_balance_1_months_ago -1.103 0.269955
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1231.78 on 1250 degrees of freedom
## Residual deviance: 910.72 on 1205 degrees of freedom
## AIC: 1002.7
##
## Number of Fisher Scoring iterations: 8
Q.6.(b) Report the summary table of the logistic regression model fit.
Q.6.(b) Analysis -
The summary of the Logistic Regression model reports the estimated coefficients, which define a linear equation for the log-odds of a customer receiving rating “A” given the corresponding features. It also reports the standard error, z value, and p-value of each coefficient, which help to identify the features that are statistically significant. The residual deviance is 910.72 on 1205 degrees of freedom, against a null deviance of 1231.78 on 1250, and the AIC is 1002.7.
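The significant predictors can also be pulled out of the summary table programmatically instead of being read off by eye. A minimal sketch in base R, using a small logistic regression on the built-in mtcars data as a stand-in (an assumption; with the objects above, `fit` would be lr_model):

```r
# Stand-in logistic regression on built-in data
fit <- glm(am ~ wt + hp, data = mtcars, family = binomial)

# Coefficient table: Estimate, Std. Error, z value, Pr(>|z|)
coefs <- coef(summary(fit))

# Keep the terms whose p-value is below the 5% level
p_values    <- coefs[, "Pr(>|z|)"]
significant <- names(p_values)[p_values < 0.05]

# The intercept is not a predictor, so leave it out of the list
significant <- setdiff(significant, "(Intercept)")

print(significant)
```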
Q.6.(c) Which predictors of credit rating appear to be significant at 5% significance level?
Q.6.(c) Analysis -
During the analysis, it has been found that the following features are significant at the 5% significance level (p-value in parentheses) -
functionary (< 2e-16)
re_balanced_paid_back_a_recently_overdrawn_current_acount (0.000234)
FI3O_credit_score (0.000139)
credit_refused_in_past (0.000259)
gender (0.022478)
# Fit a support vector machine
svm_model <- svm(train_y ~ ., data = train_X, kernel = "linear")
cat("The SVM Model is trained successfully.")
## The SVM Model is trained successfully.
Q.6.(d) Fit an SVM model of your choice to the training set.
Q.6.(d) Analysis -
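A linear-kernel SVM is fitted to the same binary target, rating “A” (1) versus not “A” (0). Because train_y is numeric at this point, svm() defaults to eps-regression, so the model returns a continuous score for each customer rather than a class label.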
# Predict probabilities for logistic regression model
lr_probs <- predict(lr_model, newdata = test_X, type = "response")
# Predict continuous scores for SVM model (used to rank observations for the ROC curve)
svm_probs <- predict(svm_model, newdata = test_X)
# Create a ROC curve for logistic regression model
lr_roc <- roc(test_y, lr_probs, levels = c("0", "1"))
# Create a ROC curve for SVM model
svm_roc <- roc(test_y, svm_probs, levels = c("0", "1"))
# Plot ROC curves for both models in a single plot
plot(lr_roc, col = "blue", main = "ROC Curves", legacy.axes = TRUE, xlab = "False Positive Rate", ylab = "True Positive Rate")
lines(svm_roc, col = "red")
legend("bottomright", legend = c("Logistic Regression", "SVM"), col = c("blue", "red"), lwd = 1)
# Add text for AUC values
text(0.05, 0.4, paste("LR AUC =", round(auc(lr_roc), 2)), col = "blue")
text(0.05, 0.3, paste("SVM AUC =", round(auc(svm_roc), 2)), col = "red")
Q.6.(e) Produce an ROC chart comparing the logistic regression and the SVM results of predicting the test set. Comment on any differences in their performance.
Q.6.(e) Analysis -
In this phase of analysis, both the Logistic Regression and the SVM models are scored on the test data and their ROC curves are drawn on a single plot, showing the True Positive Rate against the False Positive Rate across all thresholds. The Logistic Regression model achieved a ROC AUC of 0.81, which is slightly higher than that of the SVM (ROC AUC = 0.80). The difference of 0.01 is small, so the two models perform almost equally well; on this measure the Logistic Regression classifier is selected as the best-performing classifier for binary classification.
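The AUC compared above can be understood as the probability that a randomly chosen positive observation receives a higher score than a randomly chosen negative one. A minimal sketch in base R that computes it from ranks (the Mann-Whitney formulation) rather than with pROC; `auc_rank()` and the toy labels and scores are illustrative, not part of the analysis.

```r
# AUC from ranks: ties receive average ranks, labels are coded 0/1
auc_rank <- function(labels, scores) {
  r     <- rank(scores)
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy example: 4 positives and 4 negatives with one misordered pair
labels <- c(0, 0, 1, 0, 1, 1, 0, 1)
scores <- c(0.10, 0.30, 0.35, 0.40, 0.60, 0.70, 0.20, 0.90)

auc_value <- auc_rank(labels, scores)
cat("AUC:", auc_value, "\n")
```

Only the ordering of the scores matters here, which is why probabilities from one model and uncalibrated scores from another can be compared on the same ROC chart.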