To investigate the relationship between wet and noise score of tires and purchasing the same tires again.
Introduction
The Tire Rack maintains an independent consumer survey to help drivers help each other by sharing their long-term tire experiences. The data contained in the file named TireRatings show survey results for 68 all-season tires. Performance traits are rated using a 10-point scale (1 being unacceptable and 10 being superior).
The values for the variable labeled Wet are the average of the ratings for each tire’s wet traction performance and the values for the variable labeled Noise are the average of the ratings for the noise level generated by each tire. Respondents were also asked whether they would buy the tire again using a 10-point scale (1 being definitely not and 10 being definitely).
The values for the variable labeled Buy Again are the average of the buy-again responses. For this analysis a binary dependent variable was created as follows:
-1 if the value of the Buy-Again variable is 7 or greater
-0 if the value of the Buy-Again variable is less than 7
Thus, if Purchase = 1, the respondent would probably or definitely buy the tire again.
First, let’s isntall and load the required libraries:
install.packages("Hmisc")
install.packages("pscl")
if(!require(pROC)) install.packages(("pROC"))
library(readxl)
library(Hmisc)
library(pscl)
library(pROC)
Now, let’s download the Tire dataset:
df <- read_excel("C:\\Users\\Belen\\OneDrive\\Homework\\Stats\\TireRatings.xlsx")
Summarize the data:
summary(df)
## Tire Wet Noise Buy_Again
## Length:68 Min. :4.300 Min. :3.600 Min. :1.400
## Class :character 1st Qu.:6.450 1st Qu.:6.000 1st Qu.:3.850
## Mode :character Median :7.750 Median :7.100 Median :6.150
## Mean :7.315 Mean :6.903 Mean :5.657
## 3rd Qu.:8.225 3rd Qu.:7.925 3rd Qu.:7.400
## Max. :9.200 Max. :8.900 Max. :8.900
## Purchase
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.4412
## 3rd Qu.:1.0000
## Max. :1.0000
Drop tire name column:
new_data<- subset(df,select = -c(Tire))
First, let’s do a correlation analysis to see what variables are important for this LR model
corr<- rcorr(as.matrix(new_data))
corr
## Wet Noise Buy_Again Purchase
## Wet 1.00 0.76 0.91 0.74
## Noise 0.76 1.00 0.83 0.72
## Buy_Again 0.91 0.83 1.00 0.83
## Purchase 0.74 0.72 0.83 1.00
##
## n= 68
##
##
## P
## Wet Noise Buy_Again Purchase
## Wet 0 0 0
## Noise 0 0 0
## Buy_Again 0 0 0
## Purchase 0 0 0
Interpretation: All the predictors are significant with the target variable (i.e., Purchase).
Now, we build the logistic regression model:
model<- glm(Purchase ~ Noise + Wet, data= new_data, family = binomial)
summary(model)
##
## Call:
## glm(formula = Purchase ~ Noise + Wet, family = binomial, data = new_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -39.4982 12.4779 -3.165 0.00155 **
## Noise 1.8163 0.8312 2.185 0.02887 *
## Wet 3.3745 1.2641 2.670 0.00760 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 93.325 on 67 degrees of freedom
## Residual deviance: 27.530 on 65 degrees of freedom
## AIC: 33.53
##
## Number of Fisher Scoring iterations: 8
The logistic regression model is defined by:
e^(-39.5+3.37 Wet + 1.82 Noise) / 1- e^(-39.5+3.37 Wet + 1.82 Noise)
Now we check the overall model significance:
Likelihood Ratio Test
null_model<- glm(Purchase ~ 1, data= new_data, family = binomial)
2)Perform likelihood ratio test
anova(null_model, model, test ="Chisq")
Interpretation: p-values for the independent variables are significant, which indicates that our model's likelihood is significant. The inclusion of Wet and Noise as predictors in this LR model does indeed significantly predict the likelihood or customers purchasing the tires again, relative to a model that predicts purchase solely on the mean of observed outcomes.
Pseudo R-squared Method
pR2(model)
## fitting null model for pseudo-r2
## llh llhNull G2 McFadden r2ML r2CU
## -13.7649516 -46.6623284 65.7947536 0.7050093 0.6199946 0.8305269
Interpretation: A McFaddem R-Squared of 0.70 means that the LR model captures 70% variability in the
outcome vs a model with no predictors.
Now let’s compute the Area Under the Curve (AUC) and AUC score
roc_curve<- roc(new_data$Purchase, fitted(model))
plot(roc_curve)
auc(roc_curve)
## Area under the curve: 0.9741
Interpretation: An AUC of 0.97 indicates that the logistic model has a high level of accuracy
in predicting purchase behavior.
Now, let’s try to predict purchase behavior based on wet score= 7 and noise score= 7, and wet score= 7 and noise score= 7
new_data1 <- data.frame(Wet = 7, Noise = 7, Buy_Again=1)
new_data2 <- data.frame(Wet = 8, Noise = 8, Buy_Again=1)
Predicting the probability:
prob1<- predict(model, newdata= new_data1, type="response") #Probability that customer will purchase the tire again
prob1*100
## 1
## 4.058753
prob2<- predict(model, newdata= new_data2, type="response") #Probability that customer will purchase the tire again
prob2*100
## 1
## 88.36964
Interpretation: 4.06% of customers would purchase tires with a 7 wet and noise score again, and 88.37% of customers would purchase tires with an 8 wet and noise score again.