Question 1:
Step 1: Install and load required libraries
##install.packages("readxl")
##install.packages("Hmisc")
##install.packages("pscl")
##if(!require(pROC)) install.packages("pROC")
library(readxl)
library(Hmisc)
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
##
## format.pval, units
library(pscl)
## Classes and Methods for R originally developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University (2002-2015),
## by and under the direction of Simon Jackman.
## hurdle and zeroinfl functions by Achim Zeileis.
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
Step 2: Import & clean the data
college_df <- read_excel("Class Exercise 15_TireRatings.xlsx")
coll_df <- subset(college_df, select = -c(Tire))
head(coll_df)
## # A tibble: 6 × 4
## Wet Noise Buy_Again Purchase
## <dbl> <dbl> <dbl> <dbl>
## 1 8 7.2 6.1 0
## 2 8 7.2 6.6 1
## 3 7.6 7.5 6.9 1
## 4 6.6 5.4 6.6 0
## 5 5.8 6.3 4 0
## 6 6.3 5.7 4.5 0
Data Description: A description of the features are presented in the table below.
Variable | Definition
------------- | -------------
1. Wet | Traction of tire when wet
2. Noise | Noise performance rating
3. Purchase | Would they repurchase
summary(coll_df)
## Wet Noise Buy_Again Purchase
## Min. :4.300 Min. :3.600 Min. :1.400 Min. :0.0000
## 1st Qu.:6.450 1st Qu.:6.000 1st Qu.:3.850 1st Qu.:0.0000
## Median :7.750 Median :7.100 Median :6.150 Median :0.0000
## Mean :7.315 Mean :6.903 Mean :5.657 Mean :0.4412
## 3rd Qu.:8.225 3rd Qu.:7.925 3rd Qu.:7.400 3rd Qu.:1.0000
## Max. :9.200 Max. :8.900 Max. :8.900 Max. :1.0000
Interpretation: The median score for wet is 7.75 and noise score is 7.10, which shows that consumers wouldn't repurchase.
Step 4: Feature selection (i.e., correlation analysis)
corr <- rcorr(as.matrix(coll_df))
corr
## Wet Noise Buy_Again Purchase
## Wet 1.00 0.76 0.91 0.74
## Noise 0.76 1.00 0.83 0.72
## Buy_Again 0.91 0.83 1.00 0.83
## Purchase 0.74 0.72 0.83 1.00
##
## n= 68
##
##
## P
## Wet Noise Buy_Again Purchase
## Wet 0 0 0
## Noise 0 0 0
## Buy_Again 0 0 0
## Purchase 0 0 0
Interpretation: All the predictators are significant in terms of the target variable. This means their is no multicollinearity in this dataset.
Question 3: Overall Model Significance
Likelihood Ratio Test
null_model <- glm(Purchase ~ 1, data = coll_df, family = binomial)
anova(null_model, model, text = "Chisq")
## Warning in anova.glm(null_model, model, text = "Chisq"): the following
## arguments to 'anova.glm' are invalid and dropped: list(text = "Chisq")
## Analysis of Deviance Table
##
## Model 1: Purchase ~ 1
## Model 2: Purchase ~ Wet + Noise
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 67 93.325
## 2 65 27.530 2 65.795 5.162e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Interpretation: The main score of traction of the tires when wet and their noise can significantly predict if consumers will repurchase tires.
Pseudo-R-Squared
pR2(model)
## fitting null model for pseudo-r2
## llh llhNull G2 McFadden r2ML r2CU
## -13.7649516 -46.6623284 65.7947536 0.7050093 0.6199946 0.8305269
Interpretation: A score of 0.71 means our LR model explains about 71% of the variability in the outcome relative to a model with no predictators. This is shows that their is a high for good fit.
Area under the curve (AUC)
The Area Under the Curve score represents the ability of the model to correctly classify students who will re purchase tires and those who will not.
roc_curve <- roc(coll_df$Purchase, fitted(model))
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(roc_curve)

auc(roc_curve)
## Area under the curve: 0.9741
Interpretation: An main AUC score of 0.97 which shows that the LR model has a high level of accuracy in predicting repurchasing of tires