Predicting Customer Purchase (The Tire Rack’s Independent Consumer Survey)

Project Objective

To investigate the relationship between the ratings for tire’s wet traction performance and the ratings for the noise level on likelihood of repurchasing.

Question 1 and 2: Develop the Logistic Regression Model and Assess Predictor Significance

Step 1: Install and load the required libraries

#install.packages('Hmisc')
#install.packages('pscl')
#install.packages('pROC')

library(readxl) #allows us to import excel files
library(Hmisc) #allows us to call the correlation function

## 
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:base':
## 
##     format.pval, units

library(pscl) #allows us to call the pseudo R-square package to evaluate our model

## Classes and Methods for R originally developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University (2002-2015),
## by and under the direction of Simon Jackman.
## hurdle and zeroinfl functions by Achim Zeileis.

library(pROC) #allows us to run the area under the curve (AUC) package to get the plot and AUC score

## Type 'citation("pROC")' for a citation.

## 
## Attaching package: 'pROC'

## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var

Step 2: Import and clean the data

data <- read_excel("Class Exercise 15_TireRatings.xlsx")
data1 <- subset(data, select = -c(Tire))

Step 3: Summarize the data

head(data1)

## # A tibble: 6 × 4
##     Wet Noise Buy_Again Purchase
##   <dbl> <dbl>     <dbl>    <dbl>
## 1   8     7.2       6.1        0
## 2   8     7.2       6.6        1
## 3   7.6   7.5       6.9        1
## 4   6.6   5.4       6.6        0
## 5   5.8   6.3       4          0
## 6   6.3   5.7       4.5        0

summary(data1)

##       Wet            Noise         Buy_Again        Purchase     
##  Min.   :4.300   Min.   :3.600   Min.   :1.400   Min.   :0.0000  
##  1st Qu.:6.450   1st Qu.:6.000   1st Qu.:3.850   1st Qu.:0.0000  
##  Median :7.750   Median :7.100   Median :6.150   Median :0.0000  
##  Mean   :7.315   Mean   :6.903   Mean   :5.657   Mean   :0.4412  
##  3rd Qu.:8.225   3rd Qu.:7.925   3rd Qu.:7.400   3rd Qu.:1.0000  
##  Max.   :9.200   Max.   :8.900   Max.   :8.900   Max.   :1.0000

Step 4:Feature selection (ie., correlation analysis)

corr <- rcorr(as.matrix(data1))
corr

##            Wet Noise Buy_Again Purchase
## Wet       1.00  0.76      0.91     0.74
## Noise     0.76  1.00      0.83     0.72
## Buy_Again 0.91  0.83      1.00     0.83
## Purchase  0.74  0.72      0.83     1.00
## 
## n= 68 
## 
## 
## P
##           Wet Noise Buy_Again Purchase
## Wet            0     0         0      
## Noise      0         0         0      
## Buy_Again  0   0               0      
## Purchase   0   0     0

Interpretation: All the predictors are significant with the target variabe (i.e., Purchace). There is no multicollinearity in the data.

Step 5:Build the logistic regression model

model <- glm(Purchase ~ Wet + Noise, data = data1, family = binomial)
summary(model)

## 
## Call:
## glm(formula = Purchase ~ Wet + Noise, family = binomial, data = data1)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept) -39.4982    12.4779  -3.165  0.00155 **
## Wet           3.3745     1.2641   2.670  0.00760 **
## Noise         1.8163     0.8312   2.185  0.02887 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 93.325  on 67  degrees of freedom
## Residual deviance: 27.530  on 65  degrees of freedom
## AIC: 33.53
## 
## Number of Fisher Scoring iterations: 8

Interpretation: All the independent variables were significant (p-value < 0.05)
The estimated logistic regression equation is y_hat = e-39.50+3.37x1+1.82x2 / 1+e-39.50+3.37x1+1.82x2.

Question 3: Overall Model Significance

Pseudo-R-squared

pR2(model)

## fitting null model for pseudo-r2

##         llh     llhNull          G2    McFadden        r2ML        r2CU 
## -13.7649516 -46.6623284  65.7947536   0.7050093   0.6199946   0.8305269

Interpretation: A McFadden R-squared of 0.705 means that our LR model explains about 70.5% of the variability in the outcome relative to a model with no predictors. This is considered an excellent fit since the value is above 0.7.

##Quetion 4 and 5: Predicting with New Information

# Given the following new customer information
# Suppose that the Wet and Noise performance ratings were 8
data3 <- data.frame(Wet = 8, Noise = 8)
# Predicting the probability
prob <- predict(model, newdata = data3, type = "response")
prob * 100

##        1 
## 88.36964

# Suppose that the Wet and Noise performance ratings were 7
data4 <- data.frame(Wet = 7, Noise = 7)
# Predicting the probability
prob1 <- predict(model, newdata = data4, type = "response")
prob1 * 100

##        1 
## 4.058753

Interpretation
(1) There is a 88.37% chance that the customer will buy the tire again if the both ratings were 8
(2) There is only a 4.06% chance that the customer will buy the tire again if the both ratings were 7.