Question 1 & 2: Develop Logistic Regression Model & Assess
Predictor Significance
Step 1: Install and load required libraries
library(readxl)
library(Hmisc)
## Warning: package 'Hmisc' was built under R version 4.4.2
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
##
## format.pval, units
library(pscl)
## Warning: package 'pscl' was built under R version 4.4.2
## Classes and Methods for R originally developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University (2002-2015),
## by and under the direction of Simon Jackman.
## hurdle and zeroinfl functions by Achim Zeileis.
library(pROC)
## Warning: package 'pROC' was built under R version 4.4.2
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
Step 2: Import and Clean Data
cars_df <- read_excel(file.choose())
cars_df
## # A tibble: 68 × 5
## Tire Wet Noise Buy_Again Purchase
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 BFGoodrich g-Force Super Sport A/S 8 7.2 6.1 0
## 2 BFGoodrich g-Force Super Sport A/S H&V 8 7.2 6.6 1
## 3 BFGoodrich g-Force T/A KDWS 7.6 7.5 6.9 1
## 4 Bridgestone B381 6.6 5.4 6.6 0
## 5 Bridgestone Insignia SE200 5.8 6.3 4 0
## 6 Bridgestone Insignia SE200-02 6.3 5.7 4.5 0
## 7 Bridgestone Potenza G 019 Grid 7.7 5.2 5 0
## 8 Bridgestone Potenza RE92 5 6.2 2.5 0
## 9 Bridgestone Potenza RE92A 5.6 6.4 2.7 0
## 10 Bridgestone Potenza RE960AS Pole Position 8.8 8.5 8.1 1
## # ℹ 58 more rows
cars_df <- subset(cars_df, select = -c(Tire))
cars_df <- subset(cars_df, select = -c(Buy_Again))
Step 3: Summarize Data
head(cars_df)
## # A tibble: 6 × 3
## Wet Noise Purchase
## <dbl> <dbl> <dbl>
## 1 8 7.2 0
## 2 8 7.2 1
## 3 7.6 7.5 1
## 4 6.6 5.4 0
## 5 5.8 6.3 0
## 6 6.3 5.7 0
Data Description:
1. Wet: The values for the variable labeled Wet are the average of the ratings for each tire’s wet traction. Performance Respondents were also asked whether they would buy the tire again using the following 10-point scale
2. Noise:The values for the variable labeled Noise are the average of the ratings for the noise level generated by each tire. Performance Respondents were also asked whether they would buy the tire again using the following 10-point scale
3. Purchase: A binary independant variable representing a consumers decision to purchase the tire again. (1 = yes and 0 = no)
summary(cars_df)
## Wet Noise Purchase
## Min. :4.300 Min. :3.600 Min. :0.0000
## 1st Qu.:6.450 1st Qu.:6.000 1st Qu.:0.0000
## Median :7.750 Median :7.100 Median :0.0000
## Mean :7.315 Mean :6.903 Mean :0.4412
## 3rd Qu.:8.225 3rd Qu.:7.925 3rd Qu.:1.0000
## Max. :9.200 Max. :8.900 Max. :1.0000
Interpretation: The mean Wet score is 7.75, with a mean Noise score of 6.9. A binary purchase decision of 0.44 means that about 44% of consumers would purchase the tire again.
Step 4: Correlation Analysis
corr <- rcorr(as.matrix(cars_df))
corr
## Wet Noise Purchase
## Wet 1.00 0.76 0.74
## Noise 0.76 1.00 0.72
## Purchase 0.74 0.72 1.00
##
## n= 68
##
##
## P
## Wet Noise Purchase
## Wet 0 0
## Noise 0 0
## Purchase 0 0
Interpretation: All the predictors are significant with the target variable(Purchase). There is multicollinearity present in the data.
Step 5: Build the logistic regression model
model <- glm(Purchase ~ Wet + Noise, data = cars_df, family = binomial)
summary(model)
##
## Call:
## glm(formula = Purchase ~ Wet + Noise, family = binomial, data = cars_df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -39.4982 12.4779 -3.165 0.00155 **
## Wet 3.3745 1.2641 2.670 0.00760 **
## Noise 1.8163 0.8312 2.185 0.02887 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 93.325 on 67 degrees of freedom
## Residual deviance: 27.530 on 65 degrees of freedom
## AIC: 33.53
##
## Number of Fisher Scoring iterations: 8
Interpretation: All the independent variables were significant (p-value < 0.05)
Question 3: Overall Model Significance
Likelihood Ratio Test
null_model <- glm(Purchase ~ 1, data = cars_df, family = binomial)
anova(null_model, model, test = "Chisq")
## Analysis of Deviance Table
##
## Model 1: Purchase ~ 1
## Model 2: Purchase ~ Wet + Noise
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 67 93.325
## 2 65 27.530 2 65.795 5.162e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Interpretation: The inclusion of Wet traction ratings and Noise ratings in our LR model does indeed significantly predict the likelihood of consumers purchasing the tire again. Relative to a model that predicts repurchasing based solely on the mean of observed outcomes(i.e. null model)
Pseudo-R-Squared
pR2(model)
## fitting null model for pseudo-r2
## llh llhNull G2 McFadden r2ML r2CU
## -13.7649516 -46.6623284 65.7947536 0.7050093 0.6199946 0.8305269
Interpretation: A McFadden R-squared of 0.705 means that our LR model explains about %70.5 of the variability in the outcome relative to a model with no predictors. This value indicates a good fit.
Area Under the Curve (AUC)
roc_curve <- roc(cars_df$Purchase, fitted(model))
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(roc_curve)

auc(roc_curve)
## Area under the curve: 0.9741
Interpretation: An AUC score of 0.97 indicates that the LR model has an extremely high level of accuracy in predicting consumer tire repurchasing.