This logistic regression analysis will help the card credit company understand what determines loyalty. The company provided data about transaction, payment, sex and whether the consumer is loyal or not. We create different regression models to predict loyalty.
setwd("C:/Users/12267/Desktop/UWindsor/Winter 2021/MSCI 3230 Data Science Tools & Methods/RSTUDIO Work")
######### List all Libraries used
library(caret)
## Warning: package 'caret' was built under R version 4.0.4
## Loading required package: lattice
## Warning: package 'lattice' was built under R version 4.0.4
## Loading required package: ggplot2
library(gains)
library(ggplot2)
library(pROC)
## Warning: package 'pROC' was built under R version 4.0.4
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
library(forecast)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
library(leaps)
## Warning: package 'leaps' was built under R version 4.0.4
options(digits = 2)
df <- read.csv("data/loyalty.csv")
##Change the column name to 'actual' from 'loyal'
names(df)[names(df) == 'loyal'] <- 'actual'
##Need to change value coded as "yes" "no" to a value of "1" or "0"
df$actual <- (ifelse(df$actual == "Yes",1,0))
We want to change column name “loyal” to “actual” class. As well as change value coded “yes” to “1” and “no” to “0”.
##Predict loyalty using transaction, payment, and sex
reg1 <- glm(actual ~ ., data = df, family = "binomial")
options(scipen=999)
summary(reg1)
##
## Call:
## glm(formula = actual ~ ., family = "binomial", data = df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7285 -0.2928 0.0044 0.4408 1.8556
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -26.163595 12.353251 -2.12 0.034 *
## transaction 0.001014 0.000508 2.00 0.046 *
## payment 0.010683 0.005376 1.99 0.047 *
## sex 0.073520 1.345292 0.05 0.956
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 33.271 on 23 degrees of freedom
## Residual deviance: 15.327 on 20 degrees of freedom
## AIC: 23.33
##
## Number of Fisher Scoring iterations: 6
psuedoR2 <- 1-reg1$deviance/reg1$null.deviance
psuedoR2
## [1] 0.54
Our first model contained all predictors to predict “actual” column. Psuedo R-squared gives us a value of 54%. We notice that “sex” does not contribute to this model as it holds no significance.
##Predict loyalty using transaction and payment
reg2 <- glm(actual ~ . -sex, data = df, family = "binomial")
options(scipen=999)
summary(reg2)
##
## Call:
## glm(formula = actual ~ . - sex, family = "binomial", data = df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7410 -0.2970 0.0044 0.4478 1.8672
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -25.931650 11.486623 -2.26 0.024 *
## transaction 0.001008 0.000494 2.04 0.041 *
## payment 0.010596 0.005089 2.08 0.037 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 33.271 on 23 degrees of freedom
## Residual deviance: 15.330 on 21 degrees of freedom
## AIC: 21.33
##
## Number of Fisher Scoring iterations: 6
psuedoR2 <- 1-reg2$deviance/reg2$null.deviance
psuedoR2
## [1] 0.54
Our second model contained all predictors expect for “sex” to predict “actual” column. Psuedo R-squared gives us a value of 54%. This results in the same R squared as Reg1 and shows that by taking out “sex” we remain to have the same value.
##Predict loyalty using transaction only
reg3 <- glm(actual ~ transaction, data = df, family = "binomial")
summary(reg3)
##
## Call:
## glm(formula = actual ~ transaction, family = "binomial", data = df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8139 -0.7576 -0.0474 0.6980 1.8035
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.793682 2.469457 -2.35 0.019 *
## transaction 0.000782 0.000332 2.35 0.019 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 33.271 on 23 degrees of freedom
## Residual deviance: 23.984 on 22 degrees of freedom
## AIC: 27.98
##
## Number of Fisher Scoring iterations: 5
psuedoR2 <- 1-reg3$deviance/reg3$null.deviance
psuedoR2
## [1] 0.28
Our third model contained the “transaction” predictor only to predict “actual” column. Psuedo R-squared gives us a value of 28%. This means that “transaction” contributes 28% in predicting loyalty alone.
##Predict Loyalty using payment only
reg4 <- glm(actual ~ payment, data = df, family = "binomial")
summary(reg4)
##
## Call:
## glm(formula = actual ~ payment, family = "binomial", data = df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6944 -0.8300 0.0176 0.7822 1.8017
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -12.17246 5.23656 -2.32 0.020 *
## payment 0.00705 0.00301 2.34 0.019 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 33.271 on 23 degrees of freedom
## Residual deviance: 24.733 on 22 degrees of freedom
## AIC: 28.73
##
## Number of Fisher Scoring iterations: 4
psuedoR2 <- 1-reg4$deviance/reg4$null.deviance
psuedoR2
## [1] 0.26
Our forth model contained the “payment” predictor only to predict “actual” column. Psuedo R-squared gives us a value of 26%. This means that “payment” contributes 26% in predicting loyalty alone.
##Predict loyalty using sex only
reg5 <- glm(actual ~ sex, data = df, family = "binomial")
summary(reg5)
##
## Call:
## glm(formula = actual ~ sex, family = "binomial", data = df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.2558 -1.1127 -0.0058 1.1367 1.2435
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.182 0.606 0.30 0.76
## sex -0.336 0.822 -0.41 0.68
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 33.271 on 23 degrees of freedom
## Residual deviance: 33.103 on 22 degrees of freedom
## AIC: 37.1
##
## Number of Fisher Scoring iterations: 3
psuedoR2 <- 1-reg5$deviance/reg5$null.deviance
psuedoR2
## [1] 0.0051
Our fifth model contained the “sex” predictor only to predict “actual” column. Psuedo R-squared gives us a value of 0.0051. As mentioned early, this makes sense because we said that the “sex” of the consumer do not help determine loyalty.
###Partitioning the dataset
set.seed(1) # set seed for reproducing the partition
train.rows <- sample(1:nrow(df), 15)
train.df <- df[train.rows, ]
valid.df <- df[-train.rows, ]
reg6 <- lm(actual ~ ., data = train.df)
summary(reg6)
##
## Call:
## lm(formula = actual ~ ., data = train.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.482 -0.216 -0.109 0.336 0.531
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.7717782 0.7951806 -2.23 0.0477 *
## transaction 0.0001308 0.0000373 3.51 0.0049 **
## payment 0.0007899 0.0004293 1.84 0.0929 .
## sex 0.0559817 0.1877666 0.30 0.7711
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.36 on 11 degrees of freedom
## Multiple R-squared: 0.606, Adjusted R-squared: 0.499
## F-statistic: 5.64 on 3 and 11 DF, p-value: 0.0137
This output shows us that the model predict the correct class 4/5 times. Which is 80% correct.
# use accuracy() to compute common accuracy measures.
t(accuracy(lm.pred, valid.df$actual))
## Test set
## ME -0
## RMSE 0
## MAE 0
## MPE -Inf
## MAPE Inf
Gives is accuracy measures of the model.
search <- regsubsets(actual ~ ., data = train.df, nbest = 1, nvmax = dim(train.df)[2],
method = "exhaustive")
sum <- summary(search)
# show models
t(sum$which)
## 1 2 3
## (Intercept) TRUE TRUE TRUE
## transaction TRUE TRUE TRUE
## payment FALSE TRUE TRUE
## sex FALSE FALSE TRUE
We use the “exhaustive” method to determine which model is the best at predicting loyalty. It displays the 3 possible outcomes.
# show metrics
sum$rsq
## [1] 0 1 1
sum$adjr2
## [1] 0 1 0
Upon displaying the outcomes, it displays the R square and adjusted R sqaure for each model.
##Find the model with the highest adjusted R2.
x <- which(sum$adjr2== max(sum$adjr2))
t(t(sum$which[x,]))
## [,1]
## (Intercept) TRUE
## transaction TRUE
## payment TRUE
## sex FALSE
The “exhaustive” method concludes that model 2 is the best model to predict loyalty. The predictors include: transaction and payment. As mentioned eariler, sex does not carry any signifance, therefore it must not be included. By running the Reg1 and Reg2, we got an R squared 0.54. Reg 1 included 3 predictors and Reg2 included 2. We always want to go with the model with the least predictors that does a just as good job at predicting. When displaying the outcome of the “exhaustive” method, the highest adjusted R squared is model 2, a value of 0.537 with only 2 predictors being transaction and payment.