HW.knit

Question 1 & 2: Develop Logistic Regression Model & Assess Predictor Significance

Step 1: Install and load required libraries

library(readxl)
library(Hmisc)

## Warning: package 'Hmisc' was built under R version 4.4.2

## 
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:base':
## 
##     format.pval, units

library(pscl)

## Warning: package 'pscl' was built under R version 4.4.2

## Classes and Methods for R originally developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University (2002-2015),
## by and under the direction of Simon Jackman.
## hurdle and zeroinfl functions by Achim Zeileis.

library(pROC)

## Warning: package 'pROC' was built under R version 4.4.2

## Type 'citation("pROC")' for a citation.

## 
## Attaching package: 'pROC'

## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var

Step 2:Import and Clean Data

cars_df <- read_excel(file.choose())
cars_df <- subset(cars_df, select = -c(Tire))
cars_df <- subset(cars_df, select = -c(Buy_Again))

Step 3:Summarize Data

head(cars_df)

## # A tibble: 6 × 3
##     Wet Noise Purchase
##   <dbl> <dbl>    <dbl>
## 1   8     7.2        0
## 2   8     7.2        1
## 3   7.6   7.5        1
## 4   6.6   5.4        0
## 5   5.8   6.3        0
## 6   6.3   5.7        0

Data Description: 
1. Wet:  The values for the variable labeled Wet are the average of the ratings for each tire’s wet traction. Performance Respondents were also asked whether they would buy the tire again using the following 10-point scale
2. Noise:The values for the variable labeled Noise are the average of the ratings for the noise level generated by each tire. Performance Respondents were also asked whether they would buy the tire again using the following 10-point scale
3. Purchase: A binary independant variable representing a consumers decision to purchase the tire again. (1 = yes and 0 = no)

summary(cars_df)

##       Wet            Noise          Purchase     
##  Min.   :4.300   Min.   :3.600   Min.   :0.0000  
##  1st Qu.:6.450   1st Qu.:6.000   1st Qu.:0.0000  
##  Median :7.750   Median :7.100   Median :0.0000  
##  Mean   :7.315   Mean   :6.903   Mean   :0.4412  
##  3rd Qu.:8.225   3rd Qu.:7.925   3rd Qu.:1.0000  
##  Max.   :9.200   Max.   :8.900   Max.   :1.0000

Interpretation: The mean Wet score is 7.75, with a mean Noise score of 6.9. A binary purchase decision of 0.44 means that about 44% of consumers would purchase the tire again.

Step 4:Correlation Analysis

corr <- rcorr(as.matrix(cars_df))
corr

##           Wet Noise Purchase
## Wet      1.00  0.76     0.74
## Noise    0.76  1.00     0.72
## Purchase 0.74  0.72     1.00
## 
## n= 68 
## 
## 
## P
##          Wet Noise Purchase
## Wet           0     0      
## Noise     0         0      
## Purchase  0   0

Interpretation: All the predictors are significant with the target variable(Purchase). There's no multicollinearity in the data.

Step 5:Build the logistic regression model

model <- glm(Purchase ~ Wet + Noise, data = cars_df, family = binomial)
summary(model)

## 
## Call:
## glm(formula = Purchase ~ Wet + Noise, family = binomial, data = cars_df)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept) -39.4982    12.4779  -3.165  0.00155 **
## Wet           3.3745     1.2641   2.670  0.00760 **
## Noise         1.8163     0.8312   2.185  0.02887 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 93.325  on 67  degrees of freedom
## Residual deviance: 27.530  on 65  degrees of freedom
## AIC: 33.53
## 
## Number of Fisher Scoring iterations: 8

Interpretation: All the independent variables were significant (p-value < 0.05)

Question 3:Overall Model Significance

null_model <- glm(Purchase ~ 1, data = cars_df, family = binomial)

anova(null_model, model, test = "Chisq")

## Analysis of Deviance Table
## 
## Model 1: Purchase ~ 1
## Model 2: Purchase ~ Wet + Noise
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1        67     93.325                          
## 2        65     27.530  2   65.795 5.162e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Interpretation: The inclusion of Wet traction ratings and Noise ratings in our LR model does indeed significantly predict the likelihood of consumers purchasing the tire again. Relative to a model that predicts repurchasing based solely on the mean of observed outcomes(i.e. null model)

pR2(model)

## fitting null model for pseudo-r2

##         llh     llhNull          G2    McFadden        r2ML        r2CU 
## -13.7649516 -46.6623284  65.7947536   0.7050093   0.6199946   0.8305269

Interpretation: A McFadden R-squared of 0.705 means that our LR model explains about %70.5 of the variability in the outcome relative to a model with no predictors. This value indicates a good fit.

roc_curve <- roc(cars_df$Purchase, fitted(model))

## Setting levels: control = 0, case = 1

## Setting direction: controls < cases

plot(roc_curve)

auc(roc_curve)

## Area under the curve: 0.9741

Interpretation: An AUC score of 0.97 indicates that the LR model has an extremely high level of accuracy in predicting consumer tire repurchasing.

Question 4 & 5:Predicting with New Information

new_data1 <- data.frame(Wet = 8.0, Noise = 8.0)
new_data2 <- data.frame(Wet = 7.0, Noise = 7.0)

prob1 <- predict(model, newdata = new_data1, type = "response")
prob1 * 100

##        1 
## 88.36964

prob2 <- predict(model, newdata = new_data2, type = "response")
prob2 * 100

##        1 
## 4.058753

Interpretation:
(prob1) = There's a %88.37 chance that a consumer who rated Wet = 8.0 and Noise = 8.0 will repurchase said tire
(prob2) = There's a %4.06 chance that a consumer who rated Wet = 7.0 and Noise = 7.0 will repurchase said tire