Question 1 & 2: Develop the Model and Access Predictor
Significance
Step 1: Install and load required libraries
library(readxl)
library(Hmisc)
## Warning: package 'Hmisc' was built under R version 4.4.2
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
##
## format.pval, units
library(pscl)
## Warning: package 'pscl' was built under R version 4.4.2
## Classes and Methods for R originally developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University (2002-2015),
## by and under the direction of Simon Jackman.
## hurdle and zeroinfl functions by Achim Zeileis.
library(pROC)
## Warning: package 'pROC' was built under R version 4.4.2
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
Step 2: Import and Clean the Data
tireratings_df <- read_excel(file.choose())
ratings_df <- subset(tireratings_df, select = -c(Tire, Buy_Again))
Step 3: Summarize the Data
head(ratings_df)
## # A tibble: 6 × 3
## Wet Noise Purchase
## <dbl> <dbl> <dbl>
## 1 8 7.2 0
## 2 8 7.2 1
## 3 7.6 7.5 1
## 4 6.6 5.4 0
## 5 5.8 6.3 0
## 6 6.3 5.7 0
Data Description: A Description of the Features are Presented in the Table Below.
Variable Definition
------------- | -------------
1. Wet | Performance Rating of the Tire's in Wet Conditions
2. Noise | Performance Rating of the Tire's Noise Level
3. Purchase | Purchase of Tire (1: Yes | 0: No)
summary(ratings_df)
## Wet Noise Purchase
## Min. :4.300 Min. :3.600 Min. :0.0000
## 1st Qu.:6.450 1st Qu.:6.000 1st Qu.:0.0000
## Median :7.750 Median :7.100 Median :0.0000
## Mean :7.315 Mean :6.903 Mean :0.4412
## 3rd Qu.:8.225 3rd Qu.:7.925 3rd Qu.:1.0000
## Max. :9.200 Max. :8.900 Max. :1.0000
Interpretation: The median of "wet" is 7.75 and "noise" is 7.10. However, the median of "purchase" is
Step 4: Feature Selection (i.e., correlation analysis)
corr <- rcorr(as.matrix(ratings_df))
corr
## Wet Noise Purchase
## Wet 1.00 0.76 0.74
## Noise 0.76 1.00 0.72
## Purchase 0.74 0.72 1.00
##
## n= 68
##
##
## P
## Wet Noise Purchase
## Wet 0 0
## Noise 0 0
## Purchase 0 0
Interpretation: All the predictors are significant with the dependent variable (i.e., Purchase). There is multicollinearity between "wet" and "noise"
Step 5: Build the Logistic Regression Model
model <- glm(Purchase ~ Wet + Noise, data = ratings_df, family = binomial)
summary(model)
##
## Call:
## glm(formula = Purchase ~ Wet + Noise, family = binomial, data = ratings_df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -39.4982 12.4779 -3.165 0.00155 **
## Wet 3.3745 1.2641 2.670 0.00760 **
## Noise 1.8163 0.8312 2.185 0.02887 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 93.325 on 67 degrees of freedom
## Residual deviance: 27.530 on 65 degrees of freedom
## AIC: 33.53
##
## Number of Fisher Scoring iterations: 8
Interpretation: All the independent variables were significant (p-value < 0.05)
Question 3: Overall Model Significance
Liklihood Ratio Test
# Fit a null model
null_model <- glm(Purchase ~ 1, data = ratings_df, family = binomial)
# Perform liklihood ratio test
anova(null_model, model, test = "Chisq")
## Analysis of Deviance Table
##
## Model 1: Purchase ~ 1
## Model 2: Purchase ~ Wet + Noise
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 67 93.325
## 2 65 27.530 2 65.795 5.162e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Interpretation: The utilization of wetness and noise level as our independent variables in our logistic regression model does significantly predict the likihood of a person making a purchase.
Pseudo-R-squared
pR2(model)
## fitting null model for pseudo-r2
## llh llhNull G2 McFadden r2ML r2CU
## -13.7649516 -46.6623284 65.7947536 0.7050093 0.6199946 0.8305269
Interpretation: A McFadden R-Squared of 0.71 means that our LR model explains about 71% of the variability in the outcome relative to a model with no predictors. This is considered a great fit as the value is above 0.2 to 0.4, which is the range for a useful model.
Area Under the Curve (AUC)
The Area Under the Curve (AUC) score represents the ability of the model to correctly classify customers who will either purchase or not purchase tires.
roc_curve <- roc(ratings_df$Purchase, fitted(model))
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(roc_curve)

auc(roc_curve)
## Area under the curve: 0.9741
Intepretation: An AUC score of 0.9741 indicates that the LR model has a high level of accuracy in predicting purchases.