Question: Suppose you have a dataset called “PimaIndiansDiabetes2”, which contains information about diabetes diagnosis. After loading the dataset and removing missing values, you split it into training and test sets using the caret package. You then performed stepwise logistic regression using the stepAIC function from the MASS package. Afterward, you conducted forward selection and backward elimination with the same stepAIC function. Finally, you compared the performance of the forward-selection model and the both-direction model. Your task now is to calculate and compare the accuracy, precision, recall, and F1-score of the both-direction model on the test data.

Can you write R code to perform the required calculations and interpret the results obtained from the confusion matrix?

Explore our complete tutorial: https://www.data03.online/2023/08/stepwise-logistic-regression-in-r.html

Load the data and remove NAs
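The code for this step is not shown above; a minimal sketch, assuming the dataset is taken from the mlbench package and the cleaned copy is stored as pima.data (an illustrative name):

```r
# Load the Pima Indians diabetes data (the "2" version codes impossible zeros as NA)
library(mlbench)
data("PimaIndiansDiabetes2", package = "mlbench")

# Drop rows with missing values, leaving the 392 complete cases
pima.data <- na.omit(PimaIndiansDiabetes2)
```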

Inspect the data
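For example, using the cleaned data frame from the previous step:

```r
# Examine the structure: 392 observations of 9 variables
str(pima.data)
```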

## 'data.frame':    392 obs. of  9 variables:
##  $ pregnant: num  1 0 3 2 1 5 0 1 1 3 ...
##  $ glucose : num  89 137 78 197 189 166 118 103 115 126 ...
##  $ pressure: num  66 40 50 70 60 72 84 30 70 88 ...
##  $ triceps : num  23 35 32 45 23 19 47 38 30 41 ...
##  $ insulin : num  94 168 88 543 846 175 230 83 96 235 ...
##  $ mass    : num  28.1 43.1 31 30.5 30.1 25.8 45.8 43.3 34.6 39.3 ...
##  $ pedigree: num  0.167 2.288 0.248 0.158 0.398 ...
##  $ age     : num  21 33 26 53 59 51 31 33 32 27 ...
##  $ diabetes: Factor w/ 2 levels "neg","pos": 1 2 2 2 2 2 2 1 2 1 ...
##  - attr(*, "na.action")= 'omit' Named int [1:376] 1 2 3 6 8 10 11 12 13 16 ...
##   ..- attr(*, "names")= chr [1:376] "1" "2" "3" "6" ...

Split the data into training and test sets
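A sketch of the split, assuming an 80/20 stratified partition with caret's createDataPartition; the seed value and the name training.samples are illustrative, while train.data and test.data match the objects used later:

```r
library(caret)
library(dplyr)

set.seed(123)  # illustrative seed for a reproducible split

# Stratified 80/20 split on the outcome variable
training.samples <- pima.data$diabetes %>%
  createDataPartition(p = 0.8, list = FALSE)
train.data <- pima.data[training.samples, ]
test.data  <- pima.data[-training.samples, ]

# Check the dimensions of both sets
dim(train.data)
dim(test.data)
```

The dimensions printed below (314 x 9 and 78 x 9) are consistent with an 80/20 split of the 392 complete cases.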

## Loading required package: ggplot2
## Loading required package: lattice
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
## [1] 314   9
## [1] 78  9

Define the base model (intercept-only)
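For example, using the training data from the split above:

```r
# Intercept-only logistic regression: the starting point for forward/both-direction selection
base.model <- glm(diabetes ~ 1, data = train.data, family = binomial)
```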

Define the scope model (full model)
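A sketch of the full model, stored here as scope.model (an illustrative name), with all eight predictors:

```r
# Full logistic regression model: the upper limit of the stepwise search
scope.model <- glm(diabetes ~ ., data = train.data, family = binomial)
```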

Perform stepwise logistic regression
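A sketch of the both-direction search with stepAIC, assuming the scope runs from the intercept-only model up to the full model:

```r
library(MASS)

# Both-direction stepwise selection between the intercept-only and full models
step.model <- stepAIC(base.model,
                      scope = list(lower = formula(base.model),
                                   upper = formula(scope.model)),
                      direction = "both",
                      trace = FALSE)
```

Note that if the scope argument is omitted, stepAIC started from an intercept-only model has no terms it can add, so the "selected" model stays at diabetes ~ 1; the summary and AIC comparison below are consistent with that situation.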

## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select

Summarize the final selected model
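For example:

```r
# Inspect the model retained by the both-direction search
summary(step.model)
```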

## 
## Call:
## glm(formula = diabetes ~ 1, family = binomial, data = train.data)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.7027     0.1199  -5.861 4.61e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 398.8  on 313  degrees of freedom
## Residual deviance: 398.8  on 313  degrees of freedom
## AIC: 400.8
## 
## Number of Fisher Scoring iterations: 4

Perform forward selection and backward elimination
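A sketch of both searches, reusing the illustrative scope.model object from above; the names forward.model and backward.model match the AIC table further down:

```r
# Forward selection from the intercept-only model towards the full model
forward.model <- stepAIC(base.model,
                         scope = list(lower = formula(base.model),
                                      upper = formula(scope.model)),
                         direction = "forward",
                         trace = FALSE)

# Backward elimination starting from the full model
backward.model <- stepAIC(scope.model, direction = "backward", trace = FALSE)
```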

Compare the forward model and both-direction model
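One way to compare the two fitted models is an analysis-of-deviance (likelihood-ratio) test, for example:

```r
# Likelihood-ratio comparison of the forward and both-direction models
anova(forward.model, step.model, test = "Chisq")
```

Because both searches ended at the same intercept-only formula, the table below shows no difference in deviance between them.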

## Analysis of Deviance Table
## 
## Model 1: diabetes ~ 1
## Model 2: diabetes ~ 1
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1       313      398.8                     
## 2       313      398.8  0        0

Compare the AIC values of the four candidate models
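For example:

```r
# Side-by-side AIC comparison of the base, forward, backward, and both-direction models
AIC(base.model, forward.model, backward.model, step.model)
```

Only backward elimination, which starts from the full model, moves away from the intercept-only fit (AIC of roughly 279.8 versus 400.8).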

##                df      AIC
## base.model      1 400.8003
## forward.model   1 400.8003
## backward.model  6 279.7859
## step.model      1 400.8003

Performance: Calculate the accuracy, precision, recall, and F1-score of the both-direction model on the test data
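A sketch of the evaluation, assuming a 0.5 probability cut-off and the illustrative object names probabilities, predicted.classes, and cm. caret's confusionMatrix treats the first factor level ("neg") as the positive class unless positive = "pos" is supplied, which is why the output below reports "neg" as the positive class:

```r
# Predicted probabilities on the test set from the both-direction model
probabilities <- predict(step.model, newdata = test.data, type = "response")

# Convert probabilities to class labels at a 0.5 cut-off,
# keeping the same factor levels as the outcome
predicted.classes <- factor(ifelse(probabilities > 0.5, "pos", "neg"),
                            levels = levels(test.data$diabetes))

# Confusion matrix and associated statistics
cm <- confusionMatrix(predicted.classes, test.data$diabetes)
cm

# Accuracy, precision, recall, and F1-score for the reported positive class
accuracy  <- as.numeric(cm$overall["Accuracy"])
precision <- as.numeric(cm$byClass["Pos Pred Value"])
recall    <- as.numeric(cm$byClass["Sensitivity"])
f1        <- 2 * precision * recall / (precision + recall)
c(Accuracy = accuracy, Precision = precision, Recall = recall, F1 = f1)
```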

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction neg pos
##        neg  52  26
##        pos   0   0
##                                           
##                Accuracy : 0.6667          
##                  95% CI : (0.5508, 0.7694)
##     No Information Rate : 0.6667          
##     P-Value [Acc > NIR] : 0.553           
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : 9.443e-07       
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.0000          
##          Pos Pred Value : 0.6667          
##          Neg Pred Value :    NaN          
##              Prevalence : 0.6667          
##          Detection Rate : 0.6667          
##    Detection Prevalence : 1.0000          
##       Balanced Accuracy : 0.5000          
##                                           
##        'Positive' Class : neg             
##
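Interpretation: the both-direction model is intercept-only, so it predicts the majority class "neg" for every test case. Accuracy (0.6667) therefore equals the no-information rate, sensitivity for "neg" is 1, specificity is 0, and Kappa is 0. Viewed from the "pos" class, recall is 0, precision is undefined because no positive predictions were made, and the F1-score is likewise undefined. In other words, the both-direction model has no discriminative value on the test data; of the models compared, only the backward-elimination fit (AIC of about 279.8) retains any predictors.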