Predicting Retention of Tires

Project Objective

To investigate the relationship between the tire traction when wet, the noise generated by each tire, and if they would purchase again.

Question 1: Develop the logistic regression model using x1 = Wet performance rating and x2 = Noise performance rating to y = Purchase. Write out the estimated logistic regression equation.

Step 1: Install and load required libraries

##install.packages("readxl")
##install.packages("Hmisc")
##install.packages("pscl")
##if(!require(pROC)) install.packages("pROC")

library(readxl)
library(Hmisc)

## 
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:base':
## 
##     format.pval, units

library(pscl)

## Classes and Methods for R originally developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University (2002-2015),
## by and under the direction of Simon Jackman.
## hurdle and zeroinfl functions by Achim Zeileis.

library(pROC)

## Type 'citation("pROC")' for a citation.

## 
## Attaching package: 'pROC'

## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var

Step 2: Import & clean the data

college_df <- read_excel("/Applications/R Markdowns/Class Exercise 15_TireRatings (1).xlsx")

coll_df <- subset(college_df, select = -c(Tire))

head(coll_df)

## # A tibble: 6 × 4
##     Wet Noise Buy_Again Purchase
##   <dbl> <dbl>     <dbl>    <dbl>
## 1   8     7.2       6.1        0
## 2   8     7.2       6.6        1
## 3   7.6   7.5       6.9        1
## 4   6.6   5.4       6.6        0
## 5   5.8   6.3       4          0
## 6   6.3   5.7       4.5        0

Data Description: A description of the features are presented in the table below. 
Variable      | Definition 
------------- | -------------
1. Wet        | Traction of tire when wet
2. Noise      | Noise performance rating 
3. Purchase   | Would they repurchase

summary(coll_df)

##       Wet            Noise         Buy_Again        Purchase     
##  Min.   :4.300   Min.   :3.600   Min.   :1.400   Min.   :0.0000  
##  1st Qu.:6.450   1st Qu.:6.000   1st Qu.:3.850   1st Qu.:0.0000  
##  Median :7.750   Median :7.100   Median :6.150   Median :0.0000  
##  Mean   :7.315   Mean   :6.903   Mean   :5.657   Mean   :0.4412  
##  3rd Qu.:8.225   3rd Qu.:7.925   3rd Qu.:7.400   3rd Qu.:1.0000  
##  Max.   :9.200   Max.   :8.900   Max.   :8.900   Max.   :1.0000

Interpretation: The median score for wet is 7.75 and noise is 7.1, which means consumers would not repurchase.

Step 4: Feature selection (i.e., correlation analysis)

corr <- rcorr(as.matrix(coll_df))
corr

##            Wet Noise Buy_Again Purchase
## Wet       1.00  0.76      0.91     0.74
## Noise     0.76  1.00      0.83     0.72
## Buy_Again 0.91  0.83      1.00     0.83
## Purchase  0.74  0.72      0.83     1.00
## 
## n= 68 
## 
## 
## P
##           Wet Noise Buy_Again Purchase
## Wet            0     0         0      
## Noise      0         0         0      
## Buy_Again  0   0               0      
## Purchase   0   0     0

Interpretation: All the predictators are significant with the target variable. There is no multicollinearity in the data.

Step 5: Build the logistic regrssion model

model <- glm(Purchase ~ Wet + Noise, data = coll_df, family = binomial)
summary(model)

## 
## Call:
## glm(formula = Purchase ~ Wet + Noise, family = binomial, data = coll_df)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept) -39.4982    12.4779  -3.165  0.00155 **
## Wet           3.3745     1.2641   2.670  0.00760 **
## Noise         1.8163     0.8312   2.185  0.02887 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 93.325  on 67  degrees of freedom
## Residual deviance: 27.530  on 65  degrees of freedom
## AIC: 33.53
## 
## Number of Fisher Scoring iterations: 8

Interpretation: All the independent varialbes were significant (p-value < 0.05)

Question 3: Overall Model Significance

Likelihood Ratio Test

null_model <- glm(Purchase ~ 1, data = coll_df, family = binomial)

anova(null_model, model, text = "Chisq")

## Warning in anova.glm(null_model, model, text = "Chisq"): the following
## arguments to 'anova.glm' are invalid and dropped: list(text = "Chisq")

## Analysis of Deviance Table
## 
## Model 1: Purchase ~ 1
## Model 2: Purchase ~ Wet + Noise
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1        67     93.325                          
## 2        65     27.530  2   65.795 5.162e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Interpretation: The score of traction of the tires when wet and their noise can significantly predict if consumers will repurchase

Pseudo-R-Squared

pR2(model)

## fitting null model for pseudo-r2

##         llh     llhNull          G2    McFadden        r2ML        r2CU 
## -13.7649516 -46.6623284  65.7947536   0.7050093   0.6199946   0.8305269

Interpretation: A score of 0.71 means our LR model explains about 71% of the variability in the outcome relative to a model with no predictators. This is considered a high to good fit.

Area under the curve (AUC)

The Area Under the Curve score represents the ability of the model to correctly classify students who will re purchase tires and those who will not.

roc_curve <- roc(coll_df$Purchase, fitted(model))

## Setting levels: control = 0, case = 1

## Setting direction: controls < cases

plot(roc_curve)

auc(roc_curve)

## Area under the curve: 0.9741

Interpretation: An AUC score of 0.97 indicates that the LR model has a high level of accuracy in predicting repurchase of tires

Predicting with new information

new_data1 <- data.frame(Wet = 8, Noise = 8)
new_data2 <- data.frame(Wet = 7, Noise = 7)

prob1 <- predict(model, new_data1, type= "response")
prob1 * 100

##        1 
## 88.36964

prob2 <- predict(model, new_data2, type= "response")
prob2 * 100

##        1 
## 4.058753

Interpretation: 
1. There is a 88.37% chance that customer will repurchase tires 
2. There is a 4.06% chance that customer will not repurchase tire