Predicting Tire Purchases Given Values of Ratings

Project Objective

To investigate the relationship between performance ratings of Wet and Noise on whether a person will purchase depending on the type of tire

Question 1 & 2: Develop the Model and Access Predictor Significance

Step 1: Install and load required libraries

library(readxl)
library(Hmisc)
## Warning: package 'Hmisc' was built under R version 4.4.2
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
## 
##     format.pval, units
library(pscl)
## Warning: package 'pscl' was built under R version 4.4.2
## Classes and Methods for R originally developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University (2002-2015),
## by and under the direction of Simon Jackman.
## hurdle and zeroinfl functions by Achim Zeileis.
library(pROC)
## Warning: package 'pROC' was built under R version 4.4.2
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var

Step 2: Import and Clean the Data

tireratings_df <- read_excel(file.choose())
ratings_df <- subset(tireratings_df, select = -c(Tire, Buy_Again))

Step 3: Summarize the Data

head(ratings_df)
## # A tibble: 6 × 3
##     Wet Noise Purchase
##   <dbl> <dbl>    <dbl>
## 1   8     7.2        0
## 2   8     7.2        1
## 3   7.6   7.5        1
## 4   6.6   5.4        0
## 5   5.8   6.3        0
## 6   6.3   5.7        0
Data Description: A Description of the Features are Presented in the Table Below.
Variable        Definition
------------- | -------------
1. Wet        | Performance Rating of the Tire's in Wet Conditions
2. Noise      | Performance Rating of the Tire's Noise Level
3. Purchase   | Purchase of Tire (1: Yes | 0: No)
summary(ratings_df)
##       Wet            Noise          Purchase     
##  Min.   :4.300   Min.   :3.600   Min.   :0.0000  
##  1st Qu.:6.450   1st Qu.:6.000   1st Qu.:0.0000  
##  Median :7.750   Median :7.100   Median :0.0000  
##  Mean   :7.315   Mean   :6.903   Mean   :0.4412  
##  3rd Qu.:8.225   3rd Qu.:7.925   3rd Qu.:1.0000  
##  Max.   :9.200   Max.   :8.900   Max.   :1.0000
Interpretation: The median of "wet" is 7.75 and "noise" is 7.10. However, the median of "purchase" is  

Step 4: Feature Selection (i.e., correlation analysis)

corr <- rcorr(as.matrix(ratings_df))
corr
##           Wet Noise Purchase
## Wet      1.00  0.76     0.74
## Noise    0.76  1.00     0.72
## Purchase 0.74  0.72     1.00
## 
## n= 68 
## 
## 
## P
##          Wet Noise Purchase
## Wet           0     0      
## Noise     0         0      
## Purchase  0   0
Interpretation: All the predictors are significant with the dependent variable (i.e., Purchase). There is multicollinearity between "wet" and "noise"

Step 5: Build the Logistic Regression Model

model <- glm(Purchase ~ Wet + Noise, data = ratings_df, family = binomial)
summary(model)
## 
## Call:
## glm(formula = Purchase ~ Wet + Noise, family = binomial, data = ratings_df)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept) -39.4982    12.4779  -3.165  0.00155 **
## Wet           3.3745     1.2641   2.670  0.00760 **
## Noise         1.8163     0.8312   2.185  0.02887 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 93.325  on 67  degrees of freedom
## Residual deviance: 27.530  on 65  degrees of freedom
## AIC: 33.53
## 
## Number of Fisher Scoring iterations: 8
Interpretation: All the independent variables were significant (p-value < 0.05)

Question 3: Overall Model Significance

Liklihood Ratio Test

# Fit a null model
null_model <- glm(Purchase ~ 1, data = ratings_df, family = binomial)

# Perform liklihood ratio test
anova(null_model, model, test = "Chisq")
## Analysis of Deviance Table
## 
## Model 1: Purchase ~ 1
## Model 2: Purchase ~ Wet + Noise
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1        67     93.325                          
## 2        65     27.530  2   65.795 5.162e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Interpretation: The utilization of wetness and noise level as our independent variables in our logistic regression model does significantly predict the likihood of a person making a purchase.

Pseudo-R-squared

pR2(model)
## fitting null model for pseudo-r2
##         llh     llhNull          G2    McFadden        r2ML        r2CU 
## -13.7649516 -46.6623284  65.7947536   0.7050093   0.6199946   0.8305269
Interpretation: A McFadden R-Squared of 0.71 means that our LR model explains about 71% of the variability in the outcome relative to a model with no predictors. This is considered a great fit as the value is above 0.2 to 0.4, which is the range for a useful model. 

Area Under the Curve (AUC)

The Area Under the Curve (AUC) score represents the ability of the model to correctly classify customers who will either purchase or not purchase tires.
roc_curve <- roc(ratings_df$Purchase, fitted(model))
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(roc_curve)

auc(roc_curve)
## Area under the curve: 0.9741
Intepretation: An AUC score of 0.9741 indicates that the LR model has a high level of accuracy in predicting purchases.

Question 4 & 5: Predicting with New Information

# Given new score information
new_data1 <- data.frame(Wet = 8, Noise = 8, Purchase = 1)
new_data2 <- data.frame(Wet = 7, Noise = 7, Purchase = 1)

# Prediction of probability
prob1 <- predict(model, newdata = new_data1, type = "response")
round((prob1 * 100), 2)
##     1 
## 88.37
prob2 <- predict(model, newdata = new_data2, type = "response")
round((prob2 * 100), 2)
##    1 
## 4.06
Interpretation:
(1)There is an 88.37% chance that a customer will purchase the tire if it has a Wet and Noise performance rating of 8. 
(2) There is a 4.06% chance that a customer will purchase the tire if it has a Wet and Noise performance rating of 7.