**Class Exercise 15: Chapter 15.9

Question 1: estimated logistic regression equation

Step 1: Load required libraries

library(readxl)

## Warning: package 'readxl' was built under R version 4.2.3

library(Hmisc)

## Warning: package 'Hmisc' was built under R version 4.2.3

## 
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:base':
## 
##     format.pval, units

library(pscl)

## Warning: package 'pscl' was built under R version 4.2.3

## Classes and Methods for R originally developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University (2002-2015),
## by and under the direction of Simon Jackman.
## hurdle and zeroinfl functions by Achim Zeileis.

library(pROC)

## Warning: package 'pROC' was built under R version 4.2.3

## Type 'citation("pROC")' for a citation.

## 
## Attaching package: 'pROC'

## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var

Step 2: Import & clean the data

df1 <- read_excel(file.choose())
df <- subset(df1, select = -c(Tire))

Step 3: Summarize the data

head(df)

## # A tibble: 6 × 4
##     Wet Noise Buy_Again Purchase
##   <dbl> <dbl>     <dbl>    <dbl>
## 1   8     7.2       6.1        0
## 2   8     7.2       6.6        1
## 3   7.6   7.5       6.9        1
## 4   6.6   5.4       6.6        0
## 5   5.8   6.3       4          0
## 6   6.3   5.7       4.5        0

summary(df)

##       Wet            Noise         Buy_Again        Purchase     
##  Min.   :4.300   Min.   :3.600   Min.   :1.400   Min.   :0.0000  
##  1st Qu.:6.450   1st Qu.:6.000   1st Qu.:3.850   1st Qu.:0.0000  
##  Median :7.750   Median :7.100   Median :6.150   Median :0.0000  
##  Mean   :7.315   Mean   :6.903   Mean   :5.657   Mean   :0.4412  
##  3rd Qu.:8.225   3rd Qu.:7.925   3rd Qu.:7.400   3rd Qu.:1.0000  
##  Max.   :9.200   Max.   :8.900   Max.   :8.900   Max.   :1.0000

Step 4: Feature selection (i.e., correlation analysis)

corr <- rcorr(as.matrix(df))

Interpretation: All the predictors are significant with the target variable. There are multicollinearity in the data.

Step 5: Build the logistic regression model

model <- glm(Purchase ~ Wet + Noise, data = df, family = binomial)
summary(model)

## 
## Call:
## glm(formula = Purchase ~ Wet + Noise, family = binomial, data = df)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.90595  -0.07829  -0.00213   0.21082   2.25564  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept) -39.4982    12.4779  -3.165  0.00155 **
## Wet           3.3745     1.2641   2.670  0.00760 **
## Noise         1.8163     0.8312   2.185  0.02887 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 93.325  on 67  degrees of freedom
## Residual deviance: 27.530  on 65  degrees of freedom
## AIC: 33.53
## 
## Number of Fisher Scoring iterations: 8

Interpretation: Both of the independent variables (Wet & Noise) are significant (p-value < 0.05).

Question 3: Overall Model Significance

Likelihood Ratio Test

#Fit a null model
null_model <- glm(Purchase ~ 1, data = df, family = binomial)

#Perform likelihood ratio test
anova(null_model, model, test = "Chisq")

## Analysis of Deviance Table
## 
## Model 1: Purchase ~ 1
## Model 2: Purchase ~ Wet + Noise
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1        67     93.325                          
## 2        65     27.530  2   65.795 5.162e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Pseudo-R-squared

pR2(model)

## fitting null model for pseudo-r2

##         llh     llhNull          G2    McFadden        r2ML        r2CU 
## -13.7649516 -46.6623284  65.7947536   0.7050093   0.6199946   0.8305269

Interpretaion: A McFadden R-squared of 0.7050 means that our LR model explains about 70.50% of the variability in the outcome relative to a model with no predictors.

Area under the curve (AUC)

The Area Under the Curve (AUC) score represents the ability of the model to correctly classify customers will purchase the tire and those who will not.

roc_curve <- roc(df$Purchase, fitted(model))

## Setting levels: control = 0, case = 1

## Setting direction: controls < cases

plot(roc_curve)

auc(roc_curve)

## Area under the curve: 0.9741

Interpretation: An AUC score of 0.97 indicates that the LR model has a high level of accuracy in predicting customer purchase.

Question 4 & 5: Predicting with New Information

# a Wet performance rating of 8 and a Noise performance rating of 8
new_data1 <- data.frame(Wet = 8, Noise = 8)
#Predict the probability
prob1 <- predict(model, newdata = new_data1, type = "response")
prob1 * 100

##        1 
## 88.36964

Interpretation: There is a 88.37% chance that the customer will probably or definitely purchase a particular tire again with a Wet performance rating of 8 and a Noise performance rating of 8.

# Wet and Noise performance ratings were 7
new_data2 <- data.frame(Wet = 7, Noise = 7)
#Predict the probability
prob2 <- predict(model, newdata = new_data2, type = "response")
prob2 * 100

##        1 
## 4.058753

Interpretation: There is a 4.06% chance that the customer will probably or definitely purchase a particular tire again with a Wet performance rating of 7 and a Noise performance rating of 7.