DATA 621 - Business Analytics and Data Mining (Homework 2)

pkges <- c("tidyverse", "kableExtra", "readr", "dplyr", "ggplot2", "zoo", "caret", "pROC")

# Install any missing packages, then load each one
for (p in pkges) {
  if (!requireNamespace(p, quietly = TRUE)) {
    install.packages(p) # Install the package if it is not already installed
  }
  library(p, character.only = TRUE) # Load the package
}

The code above loops through the list of required packages, installs any that are not already installed, and then loads each one.

OVERVIEW

In this homework assignment, you will work through various classification metrics. You will be asked to create functions in R to carry out the various calculations. You will also investigate some functions in packages that will let you obtain the equivalent results. Finally, you will create graphical output that also can be used to evaluate the output of classification models, such as binary logistic regression.

INSTRUCTIONS

Complete each of the following steps as instructed:

1. Download the classification output data set (attached in Blackboard to the assignment).

This code below downloads the dataset from GitHub and displays the first few rows of the data in kable style format.

#Load Classification Data from GITHUB Repository
classification_data <- read.csv("https://raw.githubusercontent.com/BeshkiaKvarnstrom/Data-621-Business-Analytics-and-Data-Mining/main/Data/classification-output-data.csv", header=T, stringsAsFactors = F, na.strings=c("","NA"))

# Display the first few rows of the data
head(classification_data) %>% kable() %>% 
kable_styling(bootstrap_options = "striped", font_size = 12) %>% 
scroll_box(height = "300px", width = "100%")
| pregnant | glucose | diastolic | skinfold | insulin | bmi | pedigree | age | class | scored.class | scored.probability |
|---|---|---|---|---|---|---|---|---|---|---|
| 7 | 124 | 70 | 33 | 215 | 25.5 | 0.161 | 37 | 0 | 0 | 0.3284523 |
| 2 | 122 | 76 | 27 | 200 | 35.9 | 0.483 | 26 | 0 | 0 | 0.2731904 |
| 3 | 107 | 62 | 13 | 48 | 22.9 | 0.678 | 23 | 1 | 0 | 0.1096604 |
| 1 | 91 | 64 | 24 | 0 | 29.2 | 0.192 | 21 | 0 | 0 | 0.0559984 |
| 4 | 83 | 86 | 19 | 0 | 29.3 | 0.317 | 34 | 0 | 0 | 0.1004907 |
| 1 | 100 | 74 | 12 | 46 | 19.5 | 0.149 | 28 | 0 | 0 | 0.0551546 |

2. The data set has three key columns we will use:

- class: the actual class for the observation

- scored.class: the predicted class for the observation (based on a threshold of 0.5)

- scored.probability: the predicted probability of success for the observation

Use the table() function to get the raw confusion matrix for this scored dataset. Make sure you understand the output.

In particular, do the rows represent the actual or predicted class? The columns?

In the confusion matrix displayed below, the rows represent the predicted classes while the columns represent the actual classes.

The confusion matrix is created with the base table() function, with scored.class passed first and class second, so the rows correspond to the predicted classes and the columns to the actual classes. The matrix shows how many observations were correctly or incorrectly classified.

# Assuming classification_data is your dataframe
confusion_matrix <- table(dplyr::select(classification_data, scored.class, class))

# Display the confusion matrix
confusion_matrix
##             class
## scored.class   0   1
##            0 119  30
##            1   5  27
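
As a quick check of the orientation (a minimal sketch using the same data frame), naming the table dimensions makes the roles of the rows and columns explicit:

# Same confusion matrix with labelled dimensions: rows = predicted, columns = actual
table(Predicted = classification_data$scored.class,
      Actual = classification_data$class)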

3. Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the accuracy of the predictions.

\(\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}\)

# This function takes a dataframe (classif_data) as input and calculates accuracy based on the True Positives (TP), False Positives (FP), True Negatives (TN), and False Negatives (FN)
calculate_accuracy <- function(classif_data) {
  TP <- sum(classif_data$class == 1 & classif_data$scored.class == 1)
  TN <- sum(classif_data$class == 0 & classif_data$scored.class == 0)
  FP <- sum(classif_data$class == 0 & classif_data$scored.class == 1)
  FN <- sum(classif_data$class == 1 & classif_data$scored.class == 0)

  accuracy <- (TP + TN) / (TP + FP + TN + FN)
  return(accuracy)
}

accuracy_result <- calculate_accuracy(classification_data)
print(paste("Accuracy of the predictions: ", round(accuracy_result, 4)))
## [1] "Accuracy of the predictions:  0.8066"

4. Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the classification error rate of the predictions.

\(\text{Classification Error Rate} = \frac{FP + FN}{TP + FP + TN + FN}\)

Verify that you get an accuracy and an error rate that sums to one.

# This function takes a dataframe (classif_data) as input and returns the classification error rate, the proportion of misclassified observations
calculate_error_rate <- function(classif_data) {
  TP <- sum(classif_data$class == 1 & classif_data$scored.class == 1)
  TN <- sum(classif_data$class == 0 & classif_data$scored.class == 0)
  FP <- sum(classif_data$class == 0 & classif_data$scored.class == 1)
  FN <- sum(classif_data$class == 1 & classif_data$scored.class == 0)

  error_rate <- (FP + FN) / (TP + FP + TN + FN)
  return(error_rate)
}

error_rate_result <- calculate_error_rate(classification_data)

print(paste("Error Rate of the predictions: ", round(error_rate_result, 4)))
## [1] "Error Rate of the predictions:  0.1934"
# Verify that you get an accuracy and an error rate that sums to one
total_acc_err <- calculate_accuracy(classification_data) + calculate_error_rate(classification_data)
print(paste0("The sum of the Accuracy and Error Rate of the predictions is: ", calculate_accuracy(classification_data), " + ", calculate_error_rate(classification_data), " = ", total_acc_err))
## [1] "The sum of the Accuracy and Error Rate of the predictions is: 0.806629834254144 + 0.193370165745856 = 1"

5. Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the precision of the predictions.

\(\text{Precision} = \frac{TP}{TP + FP}\)

# This function takes a dataframe as input, with columns named "class" and "scored.class" for actual and predicted classifications, and returns the precision: Precision = TP / (TP + FP)
calculate_precision <- function(classif_data) {
  TP <- sum(classif_data$class == 1 & classif_data$scored.class == 1)
  FP <- sum(classif_data$class == 0 & classif_data$scored.class == 1)

  precision <- TP / (TP + FP)
  return(precision)
}

precision_result <- calculate_precision(classification_data)
print(paste("Precision: ", round(precision_result, 4)))
## [1] "Precision:  0.8438"

6. Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the sensitivity of the predictions. Sensitivity is also known as recall.

\(\text{Sensitivity} = \frac{TP}{TP + FN}\)

# This function takes a dataframe as input, with columns named "class" and "scored.class" for actual and predicted classifications, and returns the sensitivity.
calculate_sensitivity <- function(classif_data) {
  TP <- sum(classif_data$class == 1 & classif_data$scored.class == 1)
  FN <- sum(classif_data$class == 1 & classif_data$scored.class == 0)

  sensitivity <- TP / (TP + FN)
  return(sensitivity)
}

sensitivity_result <- calculate_sensitivity(classification_data)
print(paste("Sensitivity (Recall): ", round(sensitivity_result, 4)))
## [1] "Sensitivity (Recall):  0.4737"

7. Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the specificity of the predictions.

\(\text{Specificity} = \frac{TN}{TN + FP}\)

# This function takes the classif_data dataframe as input, assumes that it has columns named "class" and "scored.class" for actual and predicted classifications, and returns the specificity. 
calculate_specificity <- function(classif_data) {
  TN <- sum(classif_data$class == 0 & classif_data$scored.class == 0)
  FP <- sum(classif_data$class == 0 & classif_data$scored.class == 1)

  specificity <- TN / (TN + FP)
  return(specificity)
}

specificity_result <- calculate_specificity(classification_data)
print(paste("Specificity: ", round(specificity_result, 4)))
## [1] "Specificity:  0.9597"

8. Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the F1 score of the predictions.

\(\text{F1 Score} = \frac{2 * Precision * Sensitivity}{Precision + Sensitivity}\)

# This function takes the dataframe classif_data as input, assumes that it has columns named "class" and "scored.class" for actual and predicted classifications, and returns the F1 score
calculate_f1_score <- function(classif_data) {
  TP <- sum(classif_data$class == 1 & classif_data$scored.class == 1)
  FP <- sum(classif_data$class == 0 & classif_data$scored.class == 1)
  FN <- sum(classif_data$class == 1 & classif_data$scored.class == 0)

  precision <- TP / (TP + FP)
  sensitivity <- TP / (TP + FN)

  f1_score <- 2 * (precision * sensitivity) / (precision + sensitivity)
  return(f1_score)
}

f1_score_result <- calculate_f1_score(classification_data)
print(paste("F1 Score: ", round(f1_score_result, 4)))
## [1] "F1 Score:  0.6067"

9. Before we move on, let’s consider a question that was asked: What are the bounds on the F1 score? Show that the F1 score will always be between 0 and 1. (Hint: If 0 < 𝑎 < 1 and 0 < 𝑏 < 1 then 𝑎𝑏 < 𝑎.)

The \(F1\) score is the harmonic mean of precision and recall (sensitivity). Precision and recall individually lie between \(0\) and \(1\) because each is a ratio of true positive \((TP)\), false positive \((FP)\), and false negative \((FN)\) counts in which the numerator is part of the denominator.

Writing \(a\) for precision and \(b\) for recall, the \(F1\) score is given by:

\(\text{F1 Score} = 2 * \frac{a * b}{a + b}\)

where \(0 < a \le 1\) and \(0 < b \le 1\).

Both the numerator and the denominator are positive, so the \(F1\) score is always greater than \(0\). Using the hint, if \(0 < a < 1\) and \(0 < b < 1\) then \(ab < a\) and \(ab < b\); adding these two inequalities gives \(2ab < a + b\), and dividing by \(a + b\) shows that \(\text{F1 Score} = \frac{2ab}{a + b} < 1\). The upper bound is reached only when \(a = b = 1\):

\(\text{F1 Score} = 2 * \frac{1 * 1}{1 + 1} = 1\)

Therefore, the \(F1\) score will always be between \(0\) and \(1\).

The code below generates a contour plot of \(F1\) scores for different values of \(a\) and \(b\). The dashed lines mark the boundaries at \(0\) and \(1\). As the plot shows, all \(F1\) scores fall between \(0\) and \(1\), confirming that the \(F1\) score is bounded within this range.

# Helper to compute the F1 score from precision (a) and recall (b); named differently so it
# does not overwrite the calculate_f1_score() function defined above
f1_from_precision_recall <- function(a, b) {
  return(2 * (a * b) / (a + b))
}

# Generate values for a and b
a_values <- seq(0.01, 0.99, by = 0.01)
b_values <- seq(0.01, 0.99, by = 0.01)

# Initialize a matrix to store F1 scores
f1_matrix <- matrix(0, nrow = length(a_values), ncol = length(b_values))

# Calculate F1 scores
for (i in seq_along(a_values)) {
  for (j in seq_along(b_values)) {
    a <- a_values[i]
    b <- b_values[j]
    f1_matrix[i, j] <- f1_from_precision_recall(a, b)
  }
}

# Plot F1 scores
contour(a_values, b_values, f1_matrix, xlab = "a", ylab = "b", main = "F1 Score Contour Plot")
abline(h = 0, col = "#381b45", lty = 2)
abline(h = 1, col = "#381b45", lty = 2)
abline(v = 0, col = "#381b45", lty = 2)
abline(v = 1, col = "#381b45", lty = 2)
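
As a numeric complement to the plot (a quick check on the grid computed above), the smallest and largest F1 values on the grid both lie strictly inside the interval \((0, 1)\):

# Range of F1 scores over the grid of a and b values
range(f1_matrix)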

10. Write a function that generates an ROC curve from a data set with a true classification column (class in our example) and a probability column (scored.probability in our example). Your function should return a list that includes the plot of the ROC curve and a vector that contains the calculated area under the curve (AUC). Note that I recommend using a sequence of thresholds ranging from 0 to 1 at 0.01 intervals.

# Function to calculate sensitivity (true positive rate) at a given probability threshold;
# renamed so it does not overwrite the calculate_sensitivity() function defined earlier
sensitivity_at_threshold <- function(data, threshold) {
  true_positives <- sum(data$class == 1 & data$scored.probability > threshold)
  actual_positives <- sum(data$class == 1)
  sensitivity <- true_positives / actual_positives
  return(sensitivity)
}

# Function to calculate specificity (true negative rate) at a given probability threshold;
# renamed so it does not overwrite the calculate_specificity() function defined earlier
specificity_at_threshold <- function(data, threshold) {
  true_negatives <- sum(data$class == 0 & data$scored.probability <= threshold)
  actual_negatives <- sum(data$class == 0)
  specificity <- true_negatives / actual_negatives
  return(specificity)
}

# Function to generate ROC curve using ggplot2
generate_roc_curve_ggplot <- function(data, intervals = 10000) {
  thresholds <- seq(0, 1, by = 1/intervals)

  # Sensitivity (true positive rate) and 1 - specificity (false positive rate) across the
  # threshold grid, each sorted in increasing order so the curve runs from (0, 0) to (1, 1)
  sensitivity <- sort(sapply(thresholds, function(x) sensitivity_at_threshold(data, threshold = x)))
  one_minus_specificity <- sort(1 - sapply(thresholds, function(x) specificity_at_threshold(data, threshold = x)))

  # Create data frame for ggplot (stores specificity itself; the plot maps 1 - Specificity)
  roc_data <- data.frame(Specificity = 1 - one_minus_specificity, Sensitivity = sensitivity)
  roc_data <- na.omit(roc_data)

  
  # Create ROC plot curve using ggplot2
  roc_plot <- ggplot(roc_data, aes(x = 1 - Specificity, y = Sensitivity)) + geom_point() + geom_line(color = "green") +
    geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "#381b45") +
    labs(x = "1 - Specificity (False Positive Rate)", y = "Sensitivity (True Positive Rate)",
         title = "Receiver Operator(ROC) Curve") +
    theme_minimal()
  
  
  # Show the plot
  print(roc_plot)

  # Approximate the AUC with the trapezoidal rule: pairwise means of specificity
  # (via zoo::rollmean) weighted by the increments in sensitivity
  auc <- sum(diff(roc_data$Sensitivity) * rollmean(roc_data$Specificity, 2))
  auc
}

classification_df <- subset(classification_data, select = c(scored.probability, class))
roc_result <- generate_roc_curve_ggplot(classification_df)

print(paste("Area Under Curve (AUC): ", round(roc_result, 4)))
## [1] "Area Under Curve (AUC):  0.8502"

11. Use your created R functions and the provided classification output data set to produce all of the classification metrics discussed above.

classification_metrics <- c(accuracy_result, error_rate_result, sensitivity_result, specificity_result, precision_result, f1_score_result)

names(classification_metrics) <- c("Prediction Accuracy", "Classification Error Rate", "Sensitivity", "Specificity", "Precision", "F1 Score")


head(classification_metrics)%>% kable() %>% 
kable_styling(bootstrap_options = "striped", font_size = 12) %>% 
scroll_box(height = "300px", width = "100%")
| Metric | Value |
|---|---|
| Prediction Accuracy | 0.8066298 |
| Classification Error Rate | 0.1933702 |
| Sensitivity | 0.4736842 |
| Specificity | 0.9596774 |
| Precision | 0.8437500 |
| F1 Score | 0.6067416 |

12. Investigate the caret package. In particular, consider the functions confusionMatrix, sensitivity, and specificity. Apply the functions to the data set.

How do the results compare with your own functions? The numbers agree with my own functions, but the labels are swapped: caret treats the first factor level ("0") as the positive class by default, so its reported Sensitivity (0.9597) matches my specificity and its Specificity (0.4737) matches my sensitivity, while its Precision and F1 refer to class 0. Setting positive = "1" reproduces my values directly, as shown in the sketch after the output below.

predicted_values <- as.factor(classification_data$scored.class)
actual_values <- as.factor(classification_data$class)

# Create a confusion matrix (by default caret treats the first factor level, "0", as the positive class)
Confusion_Matrix <- confusionMatrix(predicted_values, actual_values, mode = 'everything')

# Extract sensitivity and specificity from the confusion matrix
sensitivity_caret <- Confusion_Matrix$byClass["Sensitivity"]
specificity_caret <- Confusion_Matrix$byClass["Specificity"]

# Print the results
print(Confusion_Matrix)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 119  30
##          1   5  27
##                                           
##                Accuracy : 0.8066          
##                  95% CI : (0.7415, 0.8615)
##     No Information Rate : 0.6851          
##     P-Value [Acc > NIR] : 0.0001712       
##                                           
##                   Kappa : 0.4916          
##                                           
##  Mcnemar's Test P-Value : 4.976e-05       
##                                           
##             Sensitivity : 0.9597          
##             Specificity : 0.4737          
##          Pos Pred Value : 0.7987          
##          Neg Pred Value : 0.8438          
##               Precision : 0.7987          
##                  Recall : 0.9597          
##                      F1 : 0.8718          
##              Prevalence : 0.6851          
##          Detection Rate : 0.6575          
##    Detection Prevalence : 0.8232          
##       Balanced Accuracy : 0.7167          
##                                           
##        'Positive' Class : 0               
## 
print(paste("Sensitivity (caret):", sensitivity_caret))
## [1] "Sensitivity (caret): 0.959677419354839"
print(paste("Specificity (caret):", specificity_caret))
## [1] "Specificity (caret): 0.473684210526316"

13. Investigate the pROC package. Use it to generate an ROC curve for the data set. How do the results compare with your own functions?

The pROC package in R is a powerful tool for creating ROC curves and calculating the Area Under the Curve (AUC). The roc function from the pROC package is used to generate an ROC curve for the dataset.

# Convert predicted values to numeric if they are not already
predicted_values <- as.numeric(as.character(classification_data$scored.probability))
# Make sure actual values are binary (0 or 1)
actual_values <- as.factor(classification_data$class)

# Create a ROC curve
roc_curve <- roc(response = actual_values, predictor = predicted_values)

generate_roc_curve_ggplot(classification_data)

## [1] 0.8502405
plot(roc_curve, col = "#381b45", main = "ROC Curve - pROC Function", lwd = 2)

# Calculate AUC
auc_value <- round(auc(roc_curve),4)
print(paste("Area Under Curve (AUC) pROC:", auc_value))
## [1] "Area Under Curve (AUC) pROC: 0.8503"

The AUC from the custom function (0.8502) is nearly identical to the value reported by pROC (0.8503), and the two ROC curves have essentially the same shape.
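
For a ggplot2-styled version consistent with the earlier custom curve, pROC also provides ggroc() (an optional sketch, assuming the roc_curve object created above):

# Plot the pROC object with ggplot2
ggroc(roc_curve) +
  labs(title = "ROC Curve - pROC ggroc()") +
  theme_minimal()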

CONCLUSION

The solutions above show a comprehensive approach to evaluating a classification model, combining custom functions with established packages (caret and pROC) to cover a wide range of metrics and visualization techniques. The modular structure allows for easy adaptation to different data sets and model outputs.