**Predicting Tire Rack maintains
Project Objective
To investigate from how Wet and Noise influence the tire purchase
Question 1 & 2 Develop the model & Assess Predictor
Significance
Step 1: install and load required libraries
#install.packages("readxl")
#install.packages("Hmisc")
#install.packages("pscl")
#if(!require(pROC))install.packages("pROC")
library(readxl)
## Warning: 套件 'readxl' 是用 R 版本 4.4.2 來建造的
library(Hmisc)
## Warning: 套件 'Hmisc' 是用 R 版本 4.4.2 來建造的
##
## 載入套件:'Hmisc'
## 下列物件被遮斷自 'package:base':
##
## format.pval, units
library(pscl)
## Warning: 套件 'pscl' 是用 R 版本 4.4.2 來建造的
## Classes and Methods for R originally developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University (2002-2015),
## by and under the direction of Simon Jackman.
## hurdle and zeroinfl functions by Achim Zeileis.
library(pROC)
## Warning: 套件 'pROC' 是用 R 版本 4.4.2 來建造的
## Type 'citation("pROC")' for a citation.
##
## 載入套件:'pROC'
## 下列物件被遮斷自 'package:stats':
##
## cov, smooth, var
Step 2: Import & clean the data
tirerating_df <- read_excel("TireRatings.xlsx")
tirerating_df <- subset(tirerating_df, select = -c(Tire))#drop irrelevant column
Step 3: Summarize the data
head(tirerating_df)
## # A tibble: 6 × 4
## Wet Noise Buy_Again Purchase
## <dbl> <dbl> <dbl> <dbl>
## 1 8 7.2 6.1 0
## 2 8 7.2 6.6 1
## 3 7.6 7.5 6.9 1
## 4 6.6 5.4 6.6 0
## 5 5.8 6.3 4 0
## 6 6.3 5.7 4.5 0
Data Description: A description of some of the features are presented in the table below.
Variable | Definition
-------------- |--------------
1. Wet | The average of the ratings for each tire’s wet traction performance
2. Noise | The average of the ratings for the noise level generated by each tire
3. Purchase | The respondent would probably or definitely buy or not buy the tire again
summary(tirerating_df)
## Wet Noise Buy_Again Purchase
## Min. :4.300 Min. :3.600 Min. :1.400 Min. :0.0000
## 1st Qu.:6.450 1st Qu.:6.000 1st Qu.:3.850 1st Qu.:0.0000
## Median :7.750 Median :7.100 Median :6.150 Median :0.0000
## Mean :7.315 Mean :6.903 Mean :5.657 Mean :0.4412
## 3rd Qu.:8.225 3rd Qu.:7.925 3rd Qu.:7.400 3rd Qu.:1.0000
## Max. :9.200 Max. :8.900 Max. :8.900 Max. :1.0000
Interprettation: The model Wet is 7.75, with the median of 7.31
Step 4: Feature selection (i.e., correlation anaylsis)
corr <- rcorr(as.matrix(tirerating_df))
corr
## Wet Noise Buy_Again Purchase
## Wet 1.00 0.76 0.91 0.74
## Noise 0.76 1.00 0.83 0.72
## Buy_Again 0.91 0.83 1.00 0.83
## Purchase 0.74 0.72 0.83 1.00
##
## n= 68
##
##
## P
## Wet Noise Buy_Again Purchase
## Wet 0 0 0
## Noise 0 0 0
## Buy_Again 0 0 0
## Purchase 0 0 0
Interpetation: All the predictors are significant with the target variables(i.e., Purchase). There's no multicollnearity in the data
Step 5: Build the logistic regression model
model <- glm(Purchase ~ Wet + Noise, data = tirerating_df, family = binomial)
summary(model)
##
## Call:
## glm(formula = Purchase ~ Wet + Noise, family = binomial, data = tirerating_df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -39.4982 12.4779 -3.165 0.00155 **
## Wet 3.3745 1.2641 2.670 0.00760 **
## Noise 1.8163 0.8312 2.185 0.02887 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 93.325 on 67 degrees of freedom
## Residual deviance: 27.530 on 65 degrees of freedom
## AIC: 33.53
##
## Number of Fisher Scoring iterations: 8
Interpretation: All the independant variables were significant (p-value < 0.05)
Question 3: Overall Model Significance
Likelihood ratio test
# Fit a null model
null_model <- glm(Purchase ~ 1, data = tirerating_df, family = binomial)
# Perform likelihood ratio test
anova(null_model, model, test = "Chisq")
## Analysis of Deviance Table
##
## Model 1: Purchase ~ 1
## Model 2: Purchase ~ Wet + Noise
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 67 93.325
## 2 65 27.530 2 65.795 5.162e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Interpretation: The inclusiuon of Wet and Noise as predictors in our LR does indeed significantly predict the likelihood of the customers will buy ihe tire again or not, relative to a model that predicts purchase based solely on the mean of observed outcomes (i.e., null model).
Pseudo-R-squared
pR2(model)
## fitting null model for pseudo-r2
## llh llhNull G2 McFadden r2ML r2CU
## -13.7649516 -46.6623284 65.7947536 0.7050093 0.6199946 0.8305269
Interpretation: A Mcfadden R-squared of 0.71 means that our LR model about 71% of the varibility in the outcome relative to the model with no predicotrs. This is not considered a moderate to good fit, where values not above 0.2 to 0.4.
Area Under the Curve (AUC)
The Area Under the Curve (AUC) score represents the ability of the model to correctly classify customers who will purchase the tire or not.
# Compute ROC Curve and the AUC score
roc_curve <- roc(tirerating_df$Purchase, fitted(model))
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(roc_curve)

auc(roc_curve)
## Area under the curve: 0.9741
Interpretation: An AUC score of 0.97 indicates that the LR model has a high level of accuracy in predicting tire rating.
Question 6: Odds ratio
# Extract the coefficients
coefficients <- summary(model)$coefficients
#Calculate the odd ratio for 'Program'
odds_ratio_program <- exp(coefficients["Wet", "Estimate"])
odds_ratio_program
## [1] 29.20949
Interpretation:
The odds ratio = r odds_ratio_program > 1