Predicting Customers Repurchasing at Tire Rack

Project Objective

To investigate the relationship between Wet and Noise performance ratings when customers are rebuying tires at Tire Rack.

Question 1 & 2: Develop the Model & Assess Predictor Significance

Step 1: Install and load required libraries

#install.packages("readxl")
#install.packages("Hmisc")
#install.packages("pscl")
#if(!require(pROC)) install.packages("pROC")

library(readxl) #allows us to import excel files
library(Hmisc) #allows us to call the correlation function
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
## 
##     format.pval, units
library(pscl) #allows us to call the pseudo R-square package to evaluate our model
## Classes and Methods for R originally developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University (2002-2015),
## by and under the direction of Simon Jackman.
## hurdle and zeroinfl functions by Achim Zeileis.
library(pROC) #allows us to run the area under the curve (AUC) package to get the plot and AUC score
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var

Step 2: Import & clean the data

tire_df <- read_excel("Class Exercise 15_TireRatings.xlsx")
ti_df <- subset(tire_df, select = -c(Tire)) #drop irrelevant column

Step 3: Summarize the data

head(ti_df)
## # A tibble: 6 × 4
##     Wet Noise Buy_Again Purchase
##   <dbl> <dbl>     <dbl>    <dbl>
## 1   8     7.2       6.1        0
## 2   8     7.2       6.6        1
## 3   7.6   7.5       6.9        1
## 4   6.6   5.4       6.6        0
## 5   5.8   6.3       4          0
## 6   6.3   5.7       4.5        0
Data Description: A description of the features are presented in the table below.
Variable    | Definition
------------|--------------
1. Wet      | Tire Wet Traction performance rating (1: Buy again and 0: Won't buy again)
2. Noise    | Tire noise level generated rating (1: Buy again and 0: Won't buy again)
3. Purchase | Average of tire purchases made
summary(ti_df)
##       Wet            Noise         Buy_Again        Purchase     
##  Min.   :4.300   Min.   :3.600   Min.   :1.400   Min.   :0.0000  
##  1st Qu.:6.450   1st Qu.:6.000   1st Qu.:3.850   1st Qu.:0.0000  
##  Median :7.750   Median :7.100   Median :6.150   Median :0.0000  
##  Mean   :7.315   Mean   :6.903   Mean   :5.657   Mean   :0.4412  
##  3rd Qu.:8.225   3rd Qu.:7.925   3rd Qu.:7.400   3rd Qu.:1.0000  
##  Max.   :9.200   Max.   :8.900   Max.   :8.900   Max.   :1.0000
Interpretation: The median purchase is 0, with a median of 1 meaning customers purchased tires based off of their Wet and Noise performance ratings.

Step 4: Feature selection (i.e., correlation analysis)

corr <- rcorr(as.matrix(ti_df))
corr
##            Wet Noise Buy_Again Purchase
## Wet       1.00  0.76      0.91     0.74
## Noise     0.76  1.00      0.83     0.72
## Buy_Again 0.91  0.83      1.00     0.83
## Purchase  0.74  0.72      0.83     1.00
## 
## n= 68 
## 
## 
## P
##           Wet Noise Buy_Again Purchase
## Wet            0     0         0      
## Noise      0         0         0      
## Buy_Again  0   0               0      
## Purchase   0   0     0
Interpretation: All the predictors are significant with the target variable (i.e., Purchase). There's no multicollinearity in the data.

Step 5: Build the logistic regression model

model <- glm(Purchase ~ Wet + Noise, family = binomial, data = ti_df)
summary(model)
## 
## Call:
## glm(formula = Purchase ~ Wet + Noise, family = binomial, data = ti_df)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept) -39.4982    12.4779  -3.165  0.00155 **
## Wet           3.3745     1.2641   2.670  0.00760 **
## Noise         1.8163     0.8312   2.185  0.02887 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 93.325  on 67  degrees of freedom
## Residual deviance: 27.530  on 65  degrees of freedom
## AIC: 33.53
## 
## Number of Fisher Scoring iterations: 8
Interpretation: ALl the independent variables were significant (p-value < 0.05)

QUestion 3: Overall Model Significance

Likelihood Ratio Test

# Fit a null model
null_model <- glm(Purchase ~ 1, data = ti_df, family = binomial)

# Perform likelihood ratio test
anova(null_model, model, test = "Chisq")
## Analysis of Deviance Table
## 
## Model 1: Purchase ~ 1
## Model 2: Purchase ~ Wet + Noise
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1        67     93.325                          
## 2        65     27.530  2   65.795 5.162e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Interpretation: The inclusion of Wet and Noise as predictors in our LR model does indeed significantly predict the likelihood
of customers purchasing the Tires based off of their wet and noise performances, relative to a model that predicts purchases based solely on the mean of observed outcomes (i.e. null model).

Pseudo-R-squared

pR2(model)
## fitting null model for pseudo-r2
##         llh     llhNull          G2    McFadden        r2ML        r2CU 
## -13.7649516 -46.6623284  65.7947536   0.7050093   0.6199946   0.8305269
Interpretation: A McFadden R-squared of 0.71 means that our LR model explains about 71% of the variability in the outcome relative to a model with no predictors. This is not considered a good
fit since the value is too high from where a useful model usually lies (0.2 to 0.4).

Area Under the curve (AUC)

The Area Under the Curve (AUC) score represents the ability of the model to correctly classify customers that will purchase tires from Tire Rack and those who will not.
# Compute ROC Curve and the AUC score
roc_curve <- roc(ti_df$Purchase, fitted(model))
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(roc_curve)

auc(roc_curve)
## Area under the curve: 0.9741
Interpretation: An AUC score of 0.97 indicates that the LR model has a high level of accuracy in predicting customer tire purchases.

Question 4 & 5: Predicting with New Information

# Given the following new performance ratings
new_data1 <- data.frame(Wet = 8, Noise = 8)
new_data2 <- data.frame(Wet = 7, Noise = 7)

# Predict the probability
# (a) probability that customers purchased tires based off the Wet and Noise performance rating being 8
prob1 <- predict(model, newdata = new_data1, type = "response")
prob1 * 100
##        1 
## 88.36964
# (b) probability that customers purchased tires based off the Wet and Noise performance rating being 7
prob2 <- predict(model, newdata = new_data2, type = "response")
prob2 * 100
##        1 
## 4.058753
Interpretation
(1) There's a 88.37% chance that the customer will purchase tires if the Wet and Noise performance rating are both at 8.
(2) There's a 4.06% chance that the customer will purchase tires if the Wet and Noise performance rating are both at 7.