Logistic Regression: Predicting Purchases

Projective Objective

To investigate the relationship between product performance ratings (Wet and Noise) and the likelihood of purchase for all season tires.

##Question 1 & 2: Develop the model & assess predictor significance

Step 1: Install and load required libraries

library(readxl) #allows us to import excel files
library(pROC) #allows us to run the area under the curve (AUC) package to get the plot and AUC score
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var

Step 2: Import & clean the data

data <- read_excel("/Users/ericheredia/Downloads/Class Exercise 15_TireRatings.xlsx")

Step 3: Summarize the data

print(head(data))
## # A tibble: 6 × 5
##   Tire                                     Wet Noise Buy_Again Purchase
##   <chr>                                  <dbl> <dbl>     <dbl>    <dbl>
## 1 BFGoodrich g-Force Super Sport A/S       8     7.2       6.1        0
## 2 BFGoodrich g-Force Super Sport A/S H&V   8     7.2       6.6        1
## 3 BFGoodrich g-Force T/A KDWS              7.6   7.5       6.9        1
## 4 Bridgestone B381                         6.6   5.4       6.6        0
## 5 Bridgestone Insignia SE200               5.8   6.3       4          0
## 6 Bridgestone Insignia SE200-02            6.3   5.7       4.5        0
Data Description: A description of some of the features are presented in the table below.
Variable  | Definition
----------|------------
1. Wet    | Tire's wet handling score
2. Noise  | Tire's noise rating score
3. Purchase| Whether a cusomter purchases the tire
print(summary(data))
##      Tire                Wet            Noise         Buy_Again    
##  Length:68          Min.   :4.300   Min.   :3.600   Min.   :1.400  
##  Class :character   1st Qu.:6.450   1st Qu.:6.000   1st Qu.:3.850  
##  Mode  :character   Median :7.750   Median :7.100   Median :6.150  
##                     Mean   :7.315   Mean   :6.903   Mean   :5.657  
##                     3rd Qu.:8.225   3rd Qu.:7.925   3rd Qu.:7.400  
##                     Max.   :9.200   Max.   :8.900   Max.   :8.900  
##     Purchase     
##  Min.   :0.0000  
##  1st Qu.:0.0000  
##  Median :0.0000  
##  Mean   :0.4412  
##  3rd Qu.:1.0000  
##  Max.   :1.0000
Interpretation: Higher wet handling and noise rating scores significantly increase the likelihood of tire purchase, with wet handling having the strongest impact.

Step 4: Feature selection

logit_model <- glm(Purchase ~ Wet + Noise, data = data, family = binomial)
print(summary(logit_model))
## 
## Call:
## glm(formula = Purchase ~ Wet + Noise, family = binomial, data = data)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept) -39.4982    12.4779  -3.165  0.00155 **
## Wet           3.3745     1.2641   2.670  0.00760 **
## Noise         1.8163     0.8312   2.185  0.02887 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 93.325  on 67  degrees of freedom
## Residual deviance: 27.530  on 65  degrees of freedom
## AIC: 33.53
## 
## Number of Fisher Scoring iterations: 8
Interpretation: All the predictors are significant with the target variable. 

Step 5: Build the logistic regression model

logit_model <- glm(Purchase ~ Wet + Noise, data = data, family = binomial)
print(summary(logit_model))
## 
## Call:
## glm(formula = Purchase ~ Wet + Noise, family = binomial, data = data)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept) -39.4982    12.4779  -3.165  0.00155 **
## Wet           3.3745     1.2641   2.670  0.00760 **
## Noise         1.8163     0.8312   2.185  0.02887 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 93.325  on 67  degrees of freedom
## Residual deviance: 27.530  on 65  degrees of freedom
## AIC: 33.53
## 
## Number of Fisher Scoring iterations: 8
Interpretation: All the independent variables were significant (p value < 0.05)

Question 3: Overall Model Significance

Likelihood Ratio Test

# Fit a null model
null_model <- glm(Purchase ~ 1, data = data, family = binomial)

# Perform the likelihood ratio test
anova(null_model, logit_model, test = "Chisq")
## Analysis of Deviance Table
## 
## Model 1: Purchase ~ 1
## Model 2: Purchase ~ Wet + Noise
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1        67     93.325                          
## 2        65     27.530  2   65.795 5.162e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Interpretation: The inclusion of Wet and Noise as predictors significantly improves the model's ability to predict Purchase compared to the null model, as indicated by the highly significant p-value (p < 0.001).

Pseudo R-squared

null_deviance <- logit_model$null.deviance
residual_deviance <- logit_model$deviance
mcfadden_r_squared <- 1 - (residual_deviance / null_deviance)
mcfadden_r_squared <- round(mcfadden_r_squared, 2)
print(paste("McFadden R-Squared:", mcfadden_r_squared))
## [1] "McFadden R-Squared: 0.71"
Interpretation: A McFadden R-squared of 0.71 means that our LR model explains approximately 71% of the variability in the likelihood of tire purchase, relative to a model with no predictors. This is considered a moderate to good fit, where values above 0.2 to 0.4 are often seen as indicative of a useful model.

### Area Under the Curve (AUC)

The Area Under the Curve (AUC) score represents the ability of the model to correctly classify customers who will purchase the tires and those who will not.



``` r
roc_curve <- roc(data$Purchase, fitted(logit_model)) # Create the ROC curve object
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(roc_curve, main = "ROC Curve for Logistic Regression Model")

auc_value <- auc(roc_curve) # Calculate the AUC value
print(paste("AUC:", round(auc_value, 2)))
## [1] "AUC: 0.97"
Interpretation:An AUC score of 0.97 indicates that the logistic regression model has a high level of accuracy in predicting between customers who are likely to purchase the tire and those who are not.

Question 4 & 5: Predicting with new information

new_data_1 <- data.frame(Wet = 8, Noise = 8)
prob_1 <- predict(logit_model, newdata = new_data_1, type = "response")
prob_1_percentage <- round(prob_1 * 100, 2)
print(paste("Probability (Wet=8, Noise=8):", prob_1_percentage))
## [1] "Probability (Wet=8, Noise=8): 88.37"
new_data_2 <- data.frame(Wet = 7, Noise = 7)
prob_2 <- predict(logit_model, newdata = new_data_2, type = "response")
prob_2_percentage <- round(prob_2 * 100, 2)
print(paste("Probability (Wet=7, Noise=7):", prob_2_percentage))
## [1] "Probability (Wet=7, Noise=7): 4.06"
Interpretation:
(1) There is an 88.37% chance that a customer will purchase the tire if it has a Wet performance rating of 8 and a Noise performance rating of 8.
(2) There is a 4.06% chance that a customer will purchase the tire if it has a Wet performance rating of 7 and a Noise performance rating of 7.