**Predicting Tire Rack maintains

Project Objective

To investigate from how Wet and Noise influence the tire purchase

Question 1 & 2 Develop the model & Assess Predictor Significance

Step 1: install and load required libraries

#install.packages("readxl")
#install.packages("Hmisc")
#install.packages("pscl")
#if(!require(pROC))install.packages("pROC")

library(readxl)

## Warning: 套件 'readxl' 是用 R 版本 4.4.2 來建造的

library(Hmisc)

## Warning: 套件 'Hmisc' 是用 R 版本 4.4.2 來建造的

## 
## 載入套件：'Hmisc'

## 下列物件被遮斷自 'package:base':
## 
##     format.pval, units

library(pscl)

## Warning: 套件 'pscl' 是用 R 版本 4.4.2 來建造的

## Classes and Methods for R originally developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University (2002-2015),
## by and under the direction of Simon Jackman.
## hurdle and zeroinfl functions by Achim Zeileis.

library(pROC)

## Warning: 套件 'pROC' 是用 R 版本 4.4.2 來建造的

## Type 'citation("pROC")' for a citation.

## 
## 載入套件：'pROC'

## 下列物件被遮斷自 'package:stats':
## 
##     cov, smooth, var

Step 2: Import & clean the data

tirerating_df <- read_excel("TireRatings.xlsx")
tirerating_df <- subset(tirerating_df, select = -c(Tire))#drop irrelevant column

Step 3: Summarize the data

head(tirerating_df)

## # A tibble: 6 × 4
##     Wet Noise Buy_Again Purchase
##   <dbl> <dbl>     <dbl>    <dbl>
## 1   8     7.2       6.1        0
## 2   8     7.2       6.6        1
## 3   7.6   7.5       6.9        1
## 4   6.6   5.4       6.6        0
## 5   5.8   6.3       4          0
## 6   6.3   5.7       4.5        0

Data Description: A description of some of the features are presented in the table below.
Variable       | Definition
-------------- |--------------
1. Wet         | The average of the ratings for each tire’s wet traction performance
2. Noise       | The average of the ratings for the noise level generated by each tire
3. Purchase    | The respondent would probably or definitely buy or not buy the tire again

summary(tirerating_df)

##       Wet            Noise         Buy_Again        Purchase     
##  Min.   :4.300   Min.   :3.600   Min.   :1.400   Min.   :0.0000  
##  1st Qu.:6.450   1st Qu.:6.000   1st Qu.:3.850   1st Qu.:0.0000  
##  Median :7.750   Median :7.100   Median :6.150   Median :0.0000  
##  Mean   :7.315   Mean   :6.903   Mean   :5.657   Mean   :0.4412  
##  3rd Qu.:8.225   3rd Qu.:7.925   3rd Qu.:7.400   3rd Qu.:1.0000  
##  Max.   :9.200   Max.   :8.900   Max.   :8.900   Max.   :1.0000

Interprettation: The model Wet is 7.75, with the median of 7.31

Step 4: Feature selection (i.e., correlation anaylsis)

corr <- rcorr(as.matrix(tirerating_df))
corr

##            Wet Noise Buy_Again Purchase
## Wet       1.00  0.76      0.91     0.74
## Noise     0.76  1.00      0.83     0.72
## Buy_Again 0.91  0.83      1.00     0.83
## Purchase  0.74  0.72      0.83     1.00
## 
## n= 68 
## 
## 
## P
##           Wet Noise Buy_Again Purchase
## Wet            0     0         0      
## Noise      0         0         0      
## Buy_Again  0   0               0      
## Purchase   0   0     0

Interpetation: All the predictors are significant with the target variables(i.e., Purchase). There's no multicollnearity in the data

Step 5: Build the logistic regression model

model <- glm(Purchase ~ Wet + Noise, data = tirerating_df, family = binomial)
summary(model)

## 
## Call:
## glm(formula = Purchase ~ Wet + Noise, family = binomial, data = tirerating_df)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept) -39.4982    12.4779  -3.165  0.00155 **
## Wet           3.3745     1.2641   2.670  0.00760 **
## Noise         1.8163     0.8312   2.185  0.02887 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 93.325  on 67  degrees of freedom
## Residual deviance: 27.530  on 65  degrees of freedom
## AIC: 33.53
## 
## Number of Fisher Scoring iterations: 8

Interpretation: All the independant variables were significant (p-value < 0.05)

Question 3: Overall Model Significance

Likelihood ratio test

# Fit a null model
null_model <- glm(Purchase ~ 1, data = tirerating_df, family = binomial)
# Perform likelihood ratio test
anova(null_model, model, test = "Chisq")

## Analysis of Deviance Table
## 
## Model 1: Purchase ~ 1
## Model 2: Purchase ~ Wet + Noise
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1        67     93.325                          
## 2        65     27.530  2   65.795 5.162e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Interpretation: The inclusiuon of Wet and Noise as predictors in our LR does indeed significantly predict the likelihood of the customers will buy ihe tire again or not, relative to a model that predicts purchase based solely on the mean of observed outcomes (i.e., null model).

Pseudo-R-squared

pR2(model)

## fitting null model for pseudo-r2

##         llh     llhNull          G2    McFadden        r2ML        r2CU 
## -13.7649516 -46.6623284  65.7947536   0.7050093   0.6199946   0.8305269

Interpretation: A Mcfadden R-squared of 0.71 means that our LR model about 71% of the varibility in the outcome relative to the model with no predicotrs. This is not considered a moderate to good fit, where values not above 0.2 to 0.4.

Area Under the Curve (AUC)

The Area Under the Curve (AUC) score represents the ability of the model to correctly classify customers who will purchase the tire or not.

# Compute ROC Curve and the AUC score
roc_curve <- roc(tirerating_df$Purchase, fitted(model))

## Setting levels: control = 0, case = 1

## Setting direction: controls < cases

plot(roc_curve)

auc(roc_curve)

## Area under the curve: 0.9741

Interpretation: An AUC score of 0.97 indicates that the LR model has a high level of accuracy in predicting tire rating.

Question 4 & 5: Predicting with New Information

# Given the following tires information
new_data1 <- data.frame(Wet = 8, Noise = 8)
new_data2 <- data.frame(Wet = 7, Noise = 7)

# Predict the probability
# (a) probability that the Wet is 8 and Noise is 8
prob1 <-predict(model, newdata = new_data1,type = "response")
prob1*100

##        1 
## 88.36964

# (b) probability that the Wet is 7 and Noise is 7
prob2 <-predict(model, newdata = new_data2,type = "response")
prob2*100

##        1 
## 4.058753

Interpretation
(1) There's a 88.37% that a customer will probably or definitely purchase a particular tire again with a Wet performance rating of 8 and a Noise performance rating of 8.
(2) There's a 4.06% that affect the probability that a customer will probably or definitely purchase a particular tire again with wet and noise performance ratings at 7.

Question 6: Odds ratio

# Extract the coefficients
coefficients <- summary(model)$coefficients

#Calculate the odd ratio for 'Program'
odds_ratio_program <- exp(coefficients["Wet", "Estimate"])
odds_ratio_program

## [1] 29.20949

Interpretation:
The odds ratio = r odds_ratio_program > 1