I modeled my EDA, model creation, and predictions on this tutorial: https://www.datacamp.com/tutorial/logistic-regression-R since I didn’t know how to do logistic regression on my own. However, I used my own data and did my own analysis.
2024-09-24
This presentation covers a few ways to evaluate the quality of your binary classification model. Topics will include accuracy, the confusion matrix, the precision-recall curve, and the ROC curve.
For this presentation we will be using the PimaIndiansDiabetes dataset to predict the onset of diabetes (the first few rows are shown below).
##   pregnant glucose pressure triceps insulin mass pedigree age diabetes
## 1        6     148       72      35       0 33.6    0.627  50      pos
## 2        1      85       66      29       0 26.6    0.351  31      neg
## 3        8     183       64       0       0 23.3    0.672  32      pos
## 4        1      89       66      23      94 28.1    0.167  21      neg
## 5        0     137       40      35     168 43.1    2.288  33      pos
## 6        5     116       74       0       0 25.6    0.201  30      neg
Let’s use ggplot2 to see how pregnancies relate to diabetes. I’ll just display the output, since the plotting code isn’t really related to the presentation, but we can see that as the number of pregnancies goes up, the rate of positive diabetes diagnoses generally goes up too.
Let’s use Plotly to make a scatterplot of weight (the mass column, which records body mass index) and age, and see how these variables correspond to diabetes.
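library(tidymodels)  # provides initial_split(), logistic_reg(), tidy(), and %>%

# `df` is assumed to hold the PimaIndiansDiabetes data shown above
set.seed(42)

# 80/20 train/test split, stratified on the outcome
split <- initial_split(df, prop = 0.8, strata = diabetes)
train <- split %>% training()
test <- split %>% testing()

# Fit a logistic regression on all predictors
model <- logistic_reg() %>% fit(diabetes ~ ., data = train)

# Class predictions and class probabilities on the test set
pred_class <- predict(model, new_data = test, type = 'class')
pred_proba <- predict(model, new_data = test, type = 'prob')

tidy(model)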
## # A tibble: 9 × 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept) -8.62      0.816     -10.6   4.42e-26
## 2 pregnant     0.165     0.0363      4.54  5.74e- 6
## 3 glucose      0.0370    0.00435     8.50  1.85e-17
## 4 pressure    -0.0154    0.00587    -2.62  8.86e- 3
## 5 triceps      0.00128   0.00798     0.160 8.73e- 1
## 6 insulin     -0.00138   0.00104    -1.33  1.85e- 1
## 7 mass         0.0957    0.0171      5.59  2.30e- 8
## 8 pedigree     1.15      0.340       3.38  7.28e- 4
## 9 age          0.00513   0.0107      0.480 6.31e- 1
A logistic regression model is a linear regression model with an extra step
Note that our normal linear regression model looks like this, where the \(\beta_i\) are the coefficients and the \(\omega_i\) are the feature values:
\(x = \beta_0 + \beta_1 \cdot \omega_1 + \dots + \beta_n \cdot \omega_n\)
A logistic regression squeezes this prediction into a value between 0 and 1 using the sigmoid function (\(\dfrac{1}{1+\exp(-x)}\)).
The result is the predicted probability of the positive class. We then turn it into a class prediction by comparing it to a threshold (typically 0.5).
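As a rough illustration of these steps, the sketch below plugs the first row of the data preview into the rounded coefficients from the `tidy(model)` output and applies the sigmoid. The names `beta`, `first_row`, and `sigmoid` are illustrative, and the 0.5 cutoff is an assumption. Because the coefficients are rounded, the result only approximates what `predict()` would return.

```r
# Rounded coefficients copied from the tidy(model) output above
beta <- c(intercept = -8.62, pregnant = 0.165, glucose = 0.0370,
          pressure = -0.0154, triceps = 0.00128, insulin = -0.00138,
          mass = 0.0957, pedigree = 1.15, age = 0.00513)

# First row of the data preview (observed outcome: pos)
first_row <- c(pregnant = 6, glucose = 148, pressure = 72, triceps = 35,
               insulin = 0, mass = 33.6, pedigree = 0.627, age = 50)

# The sigmoid squeezes any real number into the interval (0, 1)
sigmoid <- function(x) 1 / (1 + exp(-x))

# Step 1: linear predictor x = beta_0 + sum(beta_i * omega_i)
x <- beta[["intercept"]] + sum(beta[names(first_row)] * first_row)

# Step 2: predicted probability of the "pos" class
prob_pos <- sigmoid(x)

# Step 3: class prediction with an assumed 0.5 threshold
pred <- ifelse(prob_pos >= 0.5, "pos", "neg")

print(round(c(linear_predictor = x, prob_pos = prob_pos), 3))
print(pred)
```

With these rounded numbers the linear predictor comes out just under 1 and the probability is roughly 0.73, so the patient is classified as pos, which agrees with the observed label for that row.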
Let’s start with the simplest metric: accuracy!
For our model it’s about 77%:
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.773
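Accuracy is simply the share of predictions that match the true label. Here is a minimal base-R sketch of that calculation; the `truth` and `predicted` vectors are made up for illustration and are not the model's actual predictions.

```r
# Hypothetical labels for eight patients (illustrative only)
truth     <- factor(c("pos", "neg", "neg", "pos", "neg", "pos", "neg", "neg"),
                    levels = c("neg", "pos"))
predicted <- factor(c("pos", "neg", "pos", "neg", "neg", "pos", "neg", "neg"),
                    levels = c("neg", "pos"))

# Accuracy = number of correct predictions / number of predictions
accuracy <- mean(predicted == truth)
print(accuracy)
```

Accuracy on its own can flatter a model when the classes are imbalanced. The No Information Rate in the confusion matrix output later in this presentation is about 0.65, so a model that always predicted neg would already be right about 65% of the time.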
A confusion matrix gives you the number of true positives, false positives, true negatives, and false negatives in an easy-to-digest square matrix
Here is some code that generates a confusion matrix in R with the caret package:
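# Credit: https://www.digitalocean.com/community/tutorials/confusion-matrix-in-r
# install.packages('caret')
library(caret)
# Assumption: `results` is not defined in the code shown, so the class
# predictions stored in `pred_class` (created earlier) are used directly.
expected_value <- test$diabetes
predicted_value <- as.factor(pred_class$.pred_class)
example <- confusionMatrix(data = predicted_value, reference = expected_value)
example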
## Confusion Matrix and Statistics
##
##           Reference
## Prediction neg pos
##        neg  88  23
##        pos  12  31
##
##                Accuracy : 0.7727
##                  95% CI : (0.6984, 0.8363)
##     No Information Rate : 0.6494
##     P-Value [Acc > NIR] : 0.0006371
##
##                   Kappa : 0.4764
##
##  Mcnemar's Test P-Value : 0.0909689
##
##             Sensitivity : 0.8800
##             Specificity : 0.5741
##          Pos Pred Value : 0.7928
##          Neg Pred Value : 0.7209
##              Prevalence : 0.6494
##          Detection Rate : 0.5714
##    Detection Prevalence : 0.7208
##       Balanced Accuracy : 0.7270
##
##        'Positive' Class : neg
##
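The headline statistics in this output can be reproduced from the four counts in the matrix. The base-R sketch below does that; the variable names are illustrative. Note that the output reports `'Positive' Class : neg`, so sensitivity here describes how well the model identifies non-diabetic patients.

```r
# Counts copied from the confusion matrix above
# (rows = prediction, columns = reference; matrix() fills column by column)
cm <- matrix(c(88, 12, 23, 31), nrow = 2,
             dimnames = list(Prediction = c("neg", "pos"),
                             Reference  = c("neg", "pos")))

# caret treated "neg" as the positive class in this output
tp <- cm["neg", "neg"]  # predicted neg, actually neg
fn <- cm["pos", "neg"]  # predicted pos, actually neg
fp <- cm["neg", "pos"]  # predicted neg, actually pos
tn <- cm["pos", "pos"]  # predicted pos, actually pos

accuracy    <- (tp + tn) / sum(cm)
sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)
ppv         <- tp / (tp + fp)  # Pos Pred Value
npv         <- tn / (tn + fn)  # Neg Pred Value

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, ppv = ppv, npv = npv), 4))
```

If the goal is to detect diabetes, it is usually more natural to treat pos as the event. `confusionMatrix()` accepts a `positive` argument (for example `positive = "pos"`) for this, which swaps the reported sensitivity and specificity.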
Let’s use ggplot2 to plot a precision-recall curve! This shows the trade-off between precision and recall as we increase or decrease our classification threshold. The area under the PR curve also gives us an indication of the quality of our model.
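To make the threshold trade-off concrete, here is a minimal base-R sketch that computes precision and recall at each threshold. The probabilities and labels are hypothetical, invented for illustration, and `pr_at` is an illustrative helper rather than a library function.

```r
# Hypothetical predicted probabilities of "pos" and true labels (illustrative only)
prob_pos <- c(0.95, 0.85, 0.70, 0.60, 0.55, 0.40, 0.30, 0.20, 0.10, 0.05)
truth    <- c("pos", "pos", "neg", "pos", "neg", "pos", "neg", "neg", "neg", "neg")

# Precision and recall for the "pos" class at a single threshold
pr_at <- function(threshold) {
  predicted_pos <- prob_pos >= threshold
  tp <- sum(predicted_pos & truth == "pos")
  c(threshold = threshold,
    precision = tp / sum(predicted_pos),   # of those flagged pos, how many are pos
    recall    = tp / sum(truth == "pos"))  # of the actual pos, how many were flagged
}

# Sweep the threshold over every observed probability
pr_curve <- as.data.frame(t(sapply(sort(unique(prob_pos)), pr_at)))
print(pr_curve)
```

Lowering the threshold flags more patients as pos, which raises recall but tends to lower precision. A data frame like `pr_curve` could then be plotted with something like `ggplot(pr_curve, aes(recall, precision)) + geom_line()`.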
The ROC curve plots sensitivity (the true positive rate) against 1 - specificity (the false positive rate) as the classification threshold varies, and the area under the ROC curve (AUC) gives us another summary of the quality of our model. Let’s use ggplot2 to construct the ROC curve.
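The points on a ROC curve and its AUC can be computed by hand, as in the base-R sketch below. The probabilities and labels are hypothetical, invented for illustration, and the trapezoidal rule is one common way to approximate the area, not necessarily what a plotting package does internally.

```r
# Hypothetical predicted probabilities of "pos" and true labels (illustrative only)
prob_pos <- c(0.95, 0.85, 0.70, 0.60, 0.55, 0.40, 0.30, 0.20, 0.10, 0.05)
is_pos   <- c(TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE)

# Thresholds from "flag nobody as pos" (Inf) down to "flag everybody as pos"
thresholds <- c(Inf, sort(unique(prob_pos), decreasing = TRUE))

# True positive rate (sensitivity) and false positive rate (1 - specificity)
tpr <- sapply(thresholds, function(t) sum(prob_pos >= t & is_pos) / sum(is_pos))
fpr <- sapply(thresholds, function(t) sum(prob_pos >= t & !is_pos) / sum(!is_pos))

# Area under the curve via the trapezoidal rule
auc <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)

roc_curve <- data.frame(threshold = thresholds, fpr = fpr, tpr = tpr)
print(roc_curve)
print(auc)
```

An AUC of 0.5 corresponds to random guessing and 1.0 to perfect ranking of positives above negatives. A data frame like `roc_curve` could be plotted with something like `ggplot(roc_curve, aes(fpr, tpr)) + geom_line()`.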