Class Exercise 15: Chapter 15.9 - Logistic Regression

Project Objective:

Develop the logistic regression model using x1 = Wet performance rating and x2 = Noise performance rating to y = Purchase.

Load Packages

#install.packages("Hmisc")
#install.packages("pscl")
#if(!require(pROC)) install.packages("pROC")

library(readxl)
library(Hmisc) #allow to call correlation function

## 
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:base':
## 
##     format.pval, units

library(pscl) # for psuedo r^2 package to eval model

## Classes and Methods for R originally developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University (2002-2015),
## by and under the direction of Simon Jackman.
## hurdle and zeroinfl functions by Achim Zeileis.

library(pROC) #for AUC package -> plot and score

## Type 'citation("pROC")' for a citation.

## 
## Attaching package: 'pROC'

## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var

Import Data

tires_df <- read_excel("Class Exercise 15_TireRatings.xlsx")

Data Cleaning: Remove Irrelevent column(s)

# Removing Tire Company Names because does not help with the analysis
tire_df <- subset(tires_df, select = -c(Tire, Buy_Again)) 
#Tire = unecessary & B_A was used for dummy variable, dont need

Summarize data | Descriptive Stats

head(tire_df)

## # A tibble: 6 × 3
##     Wet Noise Purchase
##   <dbl> <dbl>    <dbl>
## 1   8     7.2        0
## 2   8     7.2        1
## 3   7.6   7.5        1
## 4   6.6   5.4        0
## 5   5.8   6.3        0
## 6   6.3   5.7        0

Data Description:
The Tire Rack maintains an independent consumer survey to help drivers help each other by sharing their long-term tire experiences. The data contained in the file named TireRatings show survey results for 68 all-season tires.
The values for the variable labeled Wet are the average of the ratings for each tire’s wet traction performance and the values for the variable labeled Noise are the average of the ratings for the noise level generated by each tire.
The values for the variable labeled Buy Again are the average of the buy-again responses.For the purposes of this exercise, we created the following binary dependent variable: Purchase
Purchase = 1 if Buy Again > 7
Purchase = 0 if 0 < Buy Again < 7
Thus, if Purchase = 1, the respondent would probably or definitely buy the tire again.

summary(tire_df)

##       Wet            Noise          Purchase     
##  Min.   :4.300   Min.   :3.600   Min.   :0.0000  
##  1st Qu.:6.450   1st Qu.:6.000   1st Qu.:0.0000  
##  Median :7.750   Median :7.100   Median :0.0000  
##  Mean   :7.315   Mean   :6.903   Mean   :0.4412  
##  3rd Qu.:8.225   3rd Qu.:7.925   3rd Qu.:1.0000  
##  Max.   :9.200   Max.   :8.900   Max.   :1.0000

Feature Selection (eg correlation)

corr <- rcorr(as.matrix(tire_df))
corr

##           Wet Noise Purchase
## Wet      1.00  0.76     0.74
## Noise    0.76  1.00     0.72
## Purchase 0.74  0.72     1.00
## 
## n= 68 
## 
## 
## P
##          Wet Noise Purchase
## Wet           0     0      
## Noise     0         0      
## Purchase  0   0

All have a positive high correlation

Build Logistic Regression Model

model <- glm(Purchase ~ Wet + Noise, data = tire_df, family = binomial)
summary(model)

## 
## Call:
## glm(formula = Purchase ~ Wet + Noise, family = binomial, data = tire_df)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept) -39.4982    12.4779  -3.165  0.00155 **
## Wet           3.3745     1.2641   2.670  0.00760 **
## Noise         1.8163     0.8312   2.185  0.02887 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 93.325  on 67  degrees of freedom
## Residual deviance: 27.530  on 65  degrees of freedom
## AIC: 33.53
## 
## Number of Fisher Scoring iterations: 8

Model Significance

Fit a NULL Model

null_model <- glm(Purchase ~ 1, data = tire_df, family = binomial)

Perform liklihood ratio test

anova(null_model, model, test = "Chisq")

## Analysis of Deviance Table
## 
## Model 1: Purchase ~ 1
## Model 2: Purchase ~ Wet + Noise
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1        67     93.325                          
## 2        65     27.530  2   65.795 5.162e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

0.000000000000005162 = 0.00
Meaning that the Model is significant based on liklihood ratio  model 
Interpretation: The inclusion of ratings for each tire’s wet traction performance and noise level generated by each tire (Wet & Noise) as predictors in our LR model does indeed significantly predict the liklihood of a consumer purchasing the tire brand again, realtive to a model that predicts purchase solely based on the mean of observed outcomes (i.e. the null_model)

Psuedo R Squared

pR2(model)

## fitting null model for pseudo-r2

##         llh     llhNull          G2    McFadden        r2ML        r2CU 
## -13.7649516 -46.6623284  65.7947536   0.7050093   0.6199946   0.8305269

McFadden = 0.71
Useful if 0.2 - 0.4
Interpretation: A McFadden R-squared of 0.71 means that our predictors in the model explain a substantial amount (70.50%) of the variability in the outcome. This is considered a good fit.

Area under the curve (AUC)

The Area Under the Curve score represents the ability of the model to correctly classify students who will re purchase tires and those who will not.

roc_curve <- roc(tire_df$Purchase, fitted(model))

## Setting levels: control = 0, case = 1

## Setting direction: controls < cases

plot(roc_curve)

auc(roc_curve)

## Area under the curve: 0.9741

Interpretation: An AUC score of 0.97 indicates that the LR model has a high level of accuracy in predicting repurchase of tires

Predicting with new information

new_data1 <- data.frame(Wet = 8, Noise = 8)
new_data2 <- data.frame(Wet = 7, Noise = 7)

prob1 <- predict(model, new_data1, type= "response")
prob1 * 100

##        1 
## 88.36964

prob2 <- predict(model, new_data2, type= "response")
prob2 * 100

##        1 
## 4.058753

Interpretation: 
1. There is a 88.37% chance that customer will repurchase tires 
2. There is a 4.06% chance that customer will not repurchase tire