HW3 – Evaluating a Binary Classification Model

2024-09-24

Credit

I modeled my EDA, model creation, and predictions on this tutorial: https://www.datacamp.com/tutorial/logistic-regression-R since I didn’t know how to do logistic regression on my own. However, I used my own data and did my own analysis.

Evaluating a (Binary) Classification Model

This presentation walks through a few ways to evaluate the quality of your binary classification model. Topics will include:

Accuracy
Confusion Matrix
Area under the ROC Curve
Precision and Recall

For this presentation we will be using the PimaIndiansDiabetes dataset to predict onset of diabetes (data example below)

##   pregnant glucose pressure triceps insulin mass pedigree age diabetes
## 1        6     148       72      35       0 33.6    0.627  50      pos
## 2        1      85       66      29       0 26.6    0.351  31      neg
## 3        8     183       64       0       0 23.3    0.672  32      pos
## 4        1      89       66      23      94 28.1    0.167  21      neg
## 5        0     137       40      35     168 43.1    2.288  33      pos
## 6        5     116       74       0       0 25.6    0.201  30      neg

Quick EDA for Proof of Concept

Let’s use ggplot2 to see how pregnancies relate to diabetes. I’ll just be displaying the output since the plotting code isn’t really related to the presentation, but we can see that as the number of pregnancies goes up, the rate of positive diabetes generally goes up too.

More EDA

Let’s use Plotly to make a scatterplot of body mass index (the mass column) and age, and see how these variables correspond to diabetes.

Model Creation/Fitting and Predictions

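library(tidymodels)  # attaches rsample, parsnip, broom, and the %>% pipe
# Assumption: the data comes from the mlbench package, which ships PimaIndiansDiabetes
data(PimaIndiansDiabetes, package = "mlbench")
df = PimaIndiansDiabetes

set.seed(42)
# 80/20 train/test split, stratified so both sets keep the same pos/neg balance
split = initial_split(df, prop = 0.8, strata = diabetes)
train = split %>% training()
test = split %>% testing()
# Logistic regression on all predictors (parsnip's default engine is glm)
model = logistic_reg() %>% fit(diabetes ~ ., data = train)
pred_class = predict(model, new_data = test, type = 'class')
pred_proba = predict(model, new_data = test, type = 'prob')
tidy(model)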

## # A tibble: 9 × 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept) -8.62      0.816     -10.6   4.42e-26
## 2 pregnant     0.165     0.0363      4.54  5.74e- 6
## 3 glucose      0.0370    0.00435     8.50  1.85e-17
## 4 pressure    -0.0154    0.00587    -2.62  8.86e- 3
## 5 triceps      0.00128   0.00798     0.160 8.73e- 1
## 6 insulin     -0.00138   0.00104    -1.33  1.85e- 1
## 7 mass         0.0957    0.0171      5.59  2.30e- 8
## 8 pedigree     1.15      0.340       3.38  7.28e- 4
## 9 age          0.00513   0.0107      0.480 6.31e- 1

Explaining a bit of how Logistic Regression Works

A logistic regression model is a linear regression model with an extra step

Note that our normal linear regression model looks like this, where the \(\omega_i\) are the predictor values and the \(\beta_i\) are the fitted coefficients:

\(x = \beta_0 + \beta_1 \cdot \omega_1 + \dots + \beta_n \cdot \omega_n\)

A logistic regression squeezes this prediction into a value between 0 and 1 using the sigmoid function (\(\dfrac{1}{1+\exp(-x)}\))

This gives us the predicted probability of the positive class, and we turn it into a class prediction by comparing it to a threshold (typically 0.5)
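To make the two steps concrete, here is a minimal base R sketch that applies them by hand to row 1 of the data preview. The coefficients are the rounded values from the tidy(model) output, so the result will differ slightly from what predict() returns, and the 0.5 cutoff is an assumption about the default threshold.

```r
# Coefficients copied (rounded) from the tidy(model) output
beta <- c(intercept = -8.62, pregnant = 0.165, glucose = 0.0370,
          pressure = -0.0154, triceps = 0.00128, insulin = -0.00138,
          mass = 0.0957, pedigree = 1.15, age = 0.00513)

# Predictor values from row 1 of the data preview
row1 <- c(pregnant = 6, glucose = 148, pressure = 72, triceps = 35,
          insulin = 0, mass = 33.6, pedigree = 0.627, age = 50)

# Step 1: the linear part, x = beta_0 + sum(beta_i * omega_i)
x <- beta[["intercept"]] + sum(beta[names(row1)] * row1)

# Step 2: the sigmoid squeezes x into a probability between 0 and 1
sigmoid <- function(x) 1 / (1 + exp(-x))
p_pos <- sigmoid(x)

# Step 3: turn the probability into a class (0.5 threshold is an assumption)
pred <- if (p_pos >= 0.5) "pos" else "neg"

print(c(x = x, p_pos = p_pos))
print(pred)
```

With these rounded coefficients the linear part comes out near 0.98 and the probability near 0.73, so this patient would be classified as pos, which matches the label in the data preview.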

Accuracy

Let’s start with the simplest metric: accuracy!

Accuracy is simply the number of predictions that you got correct over the total number of values in the sample
\(\text{Accuracy} = \dfrac{TP + TN}{TP + TN + FP + FN}\)
The true positives (TP) and true negatives (TN) are the predictions that we got correct, while the false positives (FP) and false negatives (FN) make up all the predictions we got wrong.

For our model it’s about 77%:

## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.773
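As a quick check of the formula, the sketch below plugs in the four counts from the confusion matrix output shown in this presentation (treating pos as the positive outcome). It also computes the accuracy of always guessing neg, which is what caret reports as the No Information Rate.

```r
# Counts read off the confusion matrix output (pos = has diabetes)
tp <- 31  # predicted pos, actually pos
tn <- 88  # predicted neg, actually neg
fp <- 12  # predicted pos, actually neg
fn <- 23  # predicted neg, actually pos

total <- tp + tn + fp + fn
accuracy <- (tp + tn) / total

# Baseline: accuracy we would get by predicting neg for everyone
baseline <- (tn + fp) / total

print(round(c(accuracy = accuracy, baseline = baseline), 4))
```

The gap between the two numbers is the part of the 77% that the model actually earns; on an imbalanced dataset the baseline alone can look deceptively good.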

Confusion Matrix

A confusion matrix gives you the number of true positives, false positives, true negatives, and false negatives in an easy-to-digest square matrix

Here is some code on how to generate a confusion matrix in R

# Credit: https://www.digitalocean.com/community/tutorials/confusion-matrix-in-r
# install.packages('caret')
library(caret)
expected_value <- test$diabetes
# pred_class is the tibble returned by predict(..., type = 'class') earlier
predicted_value <- as.factor(pred_class$.pred_class)
example <- confusionMatrix(data = predicted_value, reference = expected_value)
example

Output

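Let’s see our confusion matrix! Rows are our predictions and columns are the true (reference) labels. Treating pos (has diabetes) as the positive outcome:
- Row neg, column neg (88) is our true negatives (we predicted negative and it was negative)
- Row pos, column neg (12) is our false positives (we predicted positive but it was negative)
- Row neg, column pos (23) is our false negatives (we predicted negative but it was positive)
- Row pos, column pos (31) is our true positives (we predicted positive and it was positive)

Note that caret used neg as the 'Positive' class here because it is the first factor level, so the Sensitivity of 0.88 below is the share of actual negatives we identified, not the share of diabetes cases.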

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction neg pos
##        neg  88  23
##        pos  12  31
##                                           
##                Accuracy : 0.7727          
##                  95% CI : (0.6984, 0.8363)
##     No Information Rate : 0.6494          
##     P-Value [Acc > NIR] : 0.0006371       
##                                           
##                   Kappa : 0.4764          
##                                           
##  Mcnemar's Test P-Value : 0.0909689       
##                                           
##             Sensitivity : 0.8800          
##             Specificity : 0.5741          
##          Pos Pred Value : 0.7928          
##          Neg Pred Value : 0.7209          
##              Prevalence : 0.6494          
##          Detection Rate : 0.5714          
##    Detection Prevalence : 0.7208          
##       Balanced Accuracy : 0.7270          
##                                           
##        'Positive' Class : neg             
##
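Because the output above reports its statistics with neg as the 'Positive' class, it can help to recompute precision and recall for the diabetes (pos) class by hand. The following is a small base R sketch built only from the four counts in the table; the variable names are illustrative.

```r
# Rebuild the 2x2 table: rows = prediction, columns = reference
cm <- matrix(c(88, 12, 23, 31), nrow = 2,
             dimnames = list(Prediction = c("neg", "pos"),
                             Reference  = c("neg", "pos")))

tp <- cm["pos", "pos"]  # predicted pos, actually pos
fp <- cm["pos", "neg"]  # predicted pos, actually neg
fn <- cm["neg", "pos"]  # predicted neg, actually pos
tn <- cm["neg", "neg"]  # predicted neg, actually neg

precision   <- tp / (tp + fp)  # of predicted diabetes cases, how many were real
recall      <- tp / (tp + fn)  # of real diabetes cases, how many we caught
specificity <- tn / (tn + fp)  # of real negatives, how many we cleared
f1          <- 2 * precision * recall / (precision + recall)

print(round(c(precision = precision, recall = recall,
              specificity = specificity, f1 = f1), 4))
```

These line up with the caret output once the class swap is accounted for: recall for pos (about 0.574) is what caret labels Specificity, and precision for pos (about 0.721) is what it labels Neg Pred Value. Passing positive = "pos" to confusionMatrix() should make caret report them from the diabetes point of view directly.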

Precision-Recall Curve

Let’s use ggplot2 to plot a precision-recall curve! This shows the trade-off between precision and recall as we increase or decrease our classification threshold. The area under the PR curve also gives us an indication of the quality of our model
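To see where the points on a precision-recall curve come from, here is a minimal base R sketch that recomputes precision and recall at a few thresholds. The eight labels and probabilities are hypothetical values made up for illustration; they are not predictions from the diabetes model.

```r
# Hypothetical true labels and predicted probabilities of "pos" (illustrative only)
truth    <- c("pos", "pos", "neg", "pos", "neg", "neg", "pos", "neg")
prob_pos <- c(0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10)

# Precision and recall when everything at or above `threshold` is called pos
pr_at <- function(threshold) {
  pred_pos <- prob_pos >= threshold
  tp <- sum(pred_pos & truth == "pos")
  fp <- sum(pred_pos & truth == "neg")
  fn <- sum(!pred_pos & truth == "pos")
  c(threshold = threshold,
    precision = tp / (tp + fp),
    recall    = tp / (tp + fn))
}

# One row per threshold; each row is one point on the PR curve
pr_table <- as.data.frame(t(sapply(c(0.25, 0.50, 0.75), pr_at)))
print(pr_table)
```

Raising the threshold makes the model more selective, so precision tends to rise while recall falls. For the real model, yardstick's pr_curve() followed by autoplot() is one way to get the full curve from the predicted probabilities.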

ROC Curve

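The ROC curve plots sensitivity (the true positive rate) against 1 - specificity (the false positive rate) as the classification threshold varies. The area under the ROC curve (AUC) summarizes the quality of our model in a single number: 0.5 is no better than random guessing and 1 is a perfect ranking. Let’s use ggplot2 to construct the ROC curve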
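The sketch below shows how the ROC points and the AUC can be computed by hand in base R. It uses hypothetical labels and probabilities (not the diabetes model's predictions) and assumes there are no tied scores; the area is approximated with the trapezoid rule.

```r
# Hypothetical true labels and predicted probabilities of "pos" (illustrative only)
truth    <- c("pos", "pos", "neg", "pos", "neg", "neg", "pos", "neg")
prob_pos <- c(0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10)

# Sort from most to least confident; lowering the threshold one case at a
# time adds either a true positive or a false positive
ord    <- order(prob_pos, decreasing = TRUE)
is_pos <- truth[ord] == "pos"

tpr <- c(0, cumsum(is_pos) / sum(is_pos))    # sensitivity (y-axis)
fpr <- c(0, cumsum(!is_pos) / sum(!is_pos))  # 1 - specificity (x-axis)

# Area under the curve via the trapezoid rule
auc <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
print(auc)

# Optional quick look at the curve with base graphics:
# plot(fpr, tpr, type = "s", xlab = "1 - Specificity", ylab = "Sensitivity")
```

An equivalent reading of the AUC is the chance that a randomly chosen positive case gets a higher predicted probability than a randomly chosen negative one. For the real model, yardstick's roc_curve() and roc_auc() are one way to compute the same quantities from the predicted probabilities.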