Predicting Tire Purchasing Based Off Consumer Reports

Project Objective

To investigate the relationship of the perfomance score of various tires based on traction when wet and noise and if that effects the probabiliy of consumers buying those tires again.

Question 1 & 2: Develop the logistic regression model & Predictor Significance

Step 1: Install and load required libraries

#install.packages("readxl")
#install.packages("Hmisc")
#install.packages("pscl")
#if(!require(pROC)) install.packages("pROC")

library(readxl) # Allows us to import excel files 
library(Hmisc) # Allows us to call the correlation function
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
## 
##     format.pval, units
library(pscl) # Allows us to call the pseudo R-square package to evaluate our model 
## Classes and Methods for R originally developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University (2002-2015),
## by and under the direction of Simon Jackman.
## hurdle and zeroinfl functions by Achim Zeileis.
library(pROC) # Allows us to run the areas under the curve (AUC) package to get the plot and AUC score
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var

Step 2: Import & Clean the Data

tire_df <- read_excel("Class Exercise 15_TireRatings.xlsx")
rating_df <- subset(tire_df, select = -c(Tire, Buy_Again))

Step 3: Summarize Data

head(rating_df)
## # A tibble: 6 × 3
##     Wet Noise Purchase
##   <dbl> <dbl>    <dbl>
## 1   8     7.2        0
## 2   8     7.2        1
## 3   7.6   7.5        1
## 4   6.6   5.4        0
## 5   5.8   6.3        0
## 6   6.3   5.7        0
Data Description: A description of the features are presented in the table below.
Variable  | Definition
----------|------------
1. Wet    | average of the ratings for each tire’s wet traction performance
2. Noise  | average of the ratings for the noise level generated by each tire
3. Purchase| If consumer would buy or not buy a tire again (1 = Yes, 0 = No)
summary(rating_df)
##       Wet            Noise          Purchase     
##  Min.   :4.300   Min.   :3.600   Min.   :0.0000  
##  1st Qu.:6.450   1st Qu.:6.000   1st Qu.:0.0000  
##  Median :7.750   Median :7.100   Median :0.0000  
##  Mean   :7.315   Mean   :6.903   Mean   :0.4412  
##  3rd Qu.:8.225   3rd Qu.:7.925   3rd Qu.:1.0000  
##  Max.   :9.200   Max.   :8.900   Max.   :1.0000
interpretation: the mean wet tire ratings is 7.315 meaning the average score for wet traction for all tires is 7.315 and the mean sound score of the tire is 6.903 whihc mans the average score for noise the tire makes for all tires is 6.903. With a median of 0 for purchase variable, this means customers would likley not buy these tires again

Step 4: Feature selection (i.e correlation analysis)

corr <- rcorr(as.matrix(rating_df))
corr
##           Wet Noise Purchase
## Wet      1.00  0.76     0.74
## Noise    0.76  1.00     0.72
## Purchase 0.74  0.72     1.00
## 
## n= 68 
## 
## 
## P
##          Wet Noise Purchase
## Wet           0     0      
## Noise     0         0      
## Purchase  0   0
interpretation: all the predictors are significant with the target variable

Step 5: Build the logistic regression

model <- glm(Purchase ~ Wet + Noise, data = rating_df, family = binomial)
summary(model)
## 
## Call:
## glm(formula = Purchase ~ Wet + Noise, family = binomial, data = rating_df)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept) -39.4982    12.4779  -3.165  0.00155 **
## Wet           3.3745     1.2641   2.670  0.00760 **
## Noise         1.8163     0.8312   2.185  0.02887 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 93.325  on 67  degrees of freedom
## Residual deviance: 27.530  on 65  degrees of freedom
## AIC: 33.53
## 
## Number of Fisher Scoring iterations: 8
All of the indpednandt variable were significant (p-value < 0.05)

Question 3: Overall Model Significance

Likelihood Ratio Test

# Fit a null model 
null_model <- glm(Purchase ~ 1, data = rating_df, family = binomial)

# Perform likelihood ratio test 
anova(null_model, model, test = "Chisq")
## Analysis of Deviance Table
## 
## Model 1: Purchase ~ 1
## Model 2: Purchase ~ Wet + Noise
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1        67     93.325                          
## 2        65     27.530  2   65.795 5.162e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Interpretation: The inclusion of scores of tire traction when wet and noise level the tire makes as predictors in our LR model does significantly predict the likelihood of customers buying these wheels again relative to a model that predicts purchases based off the mean observed outcomes

Pseudo-R-squared

pR2(model)
## fitting null model for pseudo-r2
##         llh     llhNull          G2    McFadden        r2ML        r2CU 
## -13.7649516 -46.6623284  65.7947536   0.7050093   0.6199946   0.8305269
interpretation: A McFadden R-squared of 0.705 means that our LR model explains about 71% of the variability in the outcome relative to a model with no predictors. This is considered a very good fit and indicates a useful model

Area Under the Curve (AUC)

# Compute ROC Curve and the AUC score 
roc_curve <- roc(rating_df$Purchase, fitted(model))
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(roc_curve)

auc(roc_curve)
## Area under the curve: 0.9741
Interpretation: An AUC score of .97 indicates that the LR model has a very high level of accuracy in predicting 
if a consumer will buy the tires again. 

Question 4 & 5: Predicting with New Information

data1 <- data.frame(Wet = 8, Noise = 8) #Performance of the Wet and noise of the tire are both rated 8 
data2 <- data.frame(Wet = 7, Noise = 7) #Performance of the Wet and noise of the tire are both rated 7 

# Predict the probability 
prob1 <- predict(model, newdata = data1, type = "response") #probability of buying wheels based off a rating of 8 
prob1 * 100 
##        1 
## 88.36964
prob2 <- predict(model, newdata = data2, type = "response") #probability of buying wheels based off a rating of 7
prob2 * 100
##        1 
## 4.058753
Interpretation: 
1) 88.37% that person will probably/definitely buy the wheels again

2) 4.06% that person will probably/definitely buy the wheels again