Project Objective

To investigate the relationship between wet and noise score of tires and purchasing the same tires again.

Introduction

The Tire Rack maintains an independent consumer survey to help drivers help each other by sharing their long-term tire experiences. The data contained in the file named TireRatings show survey results for 68 all-season tires. Performance traits are rated using a 10-point scale (1 being unacceptable and 10 being superior).

The values for the variable labeled Wet are the average of the ratings for each tire’s wet traction performance and the values for the variable labeled Noise are the average of the ratings for the noise level generated by each tire. Respondents were also asked whether they would buy the tire again using a 10-point scale (1 being definitely not and 10 being definitely).

The values for the variable labeled Buy Again are the average of the buy-again responses. For this analysis a binary dependent variable was created as follows:

-1 if the value of the Buy-Again variable is 7 or greater

-0 if the value of the Buy-Again variable is less than 7

Thus, if Purchase = 1, the respondent would probably or definitely buy the tire again.

First, let’s isntall and load the required libraries:

install.packages("Hmisc")
install.packages("pscl")
if(!require(pROC)) install.packages(("pROC"))
library(readxl)
library(Hmisc)
library(pscl)
library(pROC)

Now, let’s download the Tire dataset:

df <- read_excel("C:\\Users\\Belen\\OneDrive\\Homework\\Stats\\TireRatings.xlsx")

Summarize the data:

summary(df)
##      Tire                Wet            Noise         Buy_Again    
##  Length:68          Min.   :4.300   Min.   :3.600   Min.   :1.400  
##  Class :character   1st Qu.:6.450   1st Qu.:6.000   1st Qu.:3.850  
##  Mode  :character   Median :7.750   Median :7.100   Median :6.150  
##                     Mean   :7.315   Mean   :6.903   Mean   :5.657  
##                     3rd Qu.:8.225   3rd Qu.:7.925   3rd Qu.:7.400  
##                     Max.   :9.200   Max.   :8.900   Max.   :8.900  
##     Purchase     
##  Min.   :0.0000  
##  1st Qu.:0.0000  
##  Median :0.0000  
##  Mean   :0.4412  
##  3rd Qu.:1.0000  
##  Max.   :1.0000

Data cleaning

Drop tire name column:

new_data<- subset(df,select = -c(Tire))

Data Exploration

First, let’s do a correlation analysis to see what variables are important for this LR model

corr<- rcorr(as.matrix(new_data))
corr
##            Wet Noise Buy_Again Purchase
## Wet       1.00  0.76      0.91     0.74
## Noise     0.76  1.00      0.83     0.72
## Buy_Again 0.91  0.83      1.00     0.83
## Purchase  0.74  0.72      0.83     1.00
## 
## n= 68 
## 
## 
## P
##           Wet Noise Buy_Again Purchase
## Wet            0     0         0      
## Noise      0         0         0      
## Buy_Again  0   0               0      
## Purchase   0   0     0
Interpretation: All the predictors are significant with the target variable (i.e., Purchase).

Now, we build the logistic regression model:

model<- glm(Purchase ~ Noise + Wet, data= new_data, family = binomial)
summary(model)
## 
## Call:
## glm(formula = Purchase ~ Noise + Wet, family = binomial, data = new_data)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept) -39.4982    12.4779  -3.165  0.00155 **
## Noise         1.8163     0.8312   2.185  0.02887 * 
## Wet           3.3745     1.2641   2.670  0.00760 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 93.325  on 67  degrees of freedom
## Residual deviance: 27.530  on 65  degrees of freedom
## AIC: 33.53
## 
## Number of Fisher Scoring iterations: 8

The logistic regression model is defined by:

e^(-39.5+3.37 Wet + 1.82 Noise) / 1- e^(-39.5+3.37 Wet + 1.82 Noise)

Now we check the overall model significance:

Likelihood Ratio Test

  1. Fit a null model
null_model<- glm(Purchase ~ 1, data= new_data, family = binomial)

2)Perform likelihood ratio test

anova(null_model, model, test ="Chisq")
Interpretation: p-values for the independent variables are significant, which indicates that our model's likelihood is significant. The inclusion of Wet and Noise as predictors in this LR model does indeed significantly predict the likelihood or customers purchasing the tires again, relative to a model that predicts purchase solely on the mean of observed outcomes.

Pseudo R-squared Method

pR2(model) 
## fitting null model for pseudo-r2
##         llh     llhNull          G2    McFadden        r2ML        r2CU 
## -13.7649516 -46.6623284  65.7947536   0.7050093   0.6199946   0.8305269
Interpretation: A McFaddem R-Squared of 0.70 means that the LR model captures 70% variability in the 
outcome vs a model with no predictors.

Now let’s compute the Area Under the Curve (AUC) and AUC score

roc_curve<- roc(new_data$Purchase, fitted(model))
plot(roc_curve)

auc(roc_curve)
## Area under the curve: 0.9741
Interpretation: An AUC of 0.97 indicates that the logistic model has a high level of accuracy
in predicting purchase behavior.

Now, let’s try to predict purchase behavior based on wet score= 7 and noise score= 7, and wet score= 7 and noise score= 7

new_data1 <- data.frame(Wet = 7, Noise = 7, Buy_Again=1)
new_data2 <- data.frame(Wet = 8, Noise = 8, Buy_Again=1)

Predicting the probability:

prob1<- predict(model, newdata= new_data1, type="response") #Probability that customer will purchase the tire again
prob1*100
##        1 
## 4.058753
prob2<- predict(model, newdata= new_data2, type="response") #Probability that customer will purchase the tire again
prob2*100
##        1 
## 88.36964
Interpretation: 4.06% of customers would purchase tires with a 7 wet and noise score again, and 88.37% of customers would purchase tires with an 8 wet and noise score again.