This logistic regression analysis will help the card credit company understand what determines loyalty. The company provided data about transaction, payment, sex and whether the consumer is loyal or not. We create different regression models to predict loyalty.

setwd("C:/Users/12267/Desktop/UWindsor/Winter 2021/MSCI 3230 Data Science Tools & Methods/RSTUDIO Work")

######### List all Libraries used 
library(caret)
## Warning: package 'caret' was built under R version 4.0.4
## Loading required package: lattice
## Warning: package 'lattice' was built under R version 4.0.4
## Loading required package: ggplot2
library(gains)
library(ggplot2)
library(pROC)
## Warning: package 'pROC' was built under R version 4.0.4
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
library(forecast)
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
library(leaps)
## Warning: package 'leaps' was built under R version 4.0.4
options(digits = 2)
df <- read.csv("data/loyalty.csv")

##Change the column name to 'actual' from 'loyal'
names(df)[names(df) == 'loyal'] <- 'actual'

##Need to change value coded as "yes" "no" to a value of "1" or "0"
df$actual <- (ifelse(df$actual == "Yes",1,0))
We want to change column name “loyal” to “actual” class. As well as change value coded “yes” to “1” and “no” to “0”.
##Predict loyalty using transaction, payment, and sex
reg1 <- glm(actual ~ ., data = df, family = "binomial") 
options(scipen=999)
summary(reg1)
## 
## Call:
## glm(formula = actual ~ ., family = "binomial", data = df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7285  -0.2928   0.0044   0.4408   1.8556  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)  
## (Intercept) -26.163595  12.353251   -2.12    0.034 *
## transaction   0.001014   0.000508    2.00    0.046 *
## payment       0.010683   0.005376    1.99    0.047 *
## sex           0.073520   1.345292    0.05    0.956  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 33.271  on 23  degrees of freedom
## Residual deviance: 15.327  on 20  degrees of freedom
## AIC: 23.33
## 
## Number of Fisher Scoring iterations: 6
psuedoR2 <- 1-reg1$deviance/reg1$null.deviance
psuedoR2
## [1] 0.54

Our first model contained all predictors to predict “actual” column. Psuedo R-squared gives us a value of 54%. We notice that “sex” does not contribute to this model as it holds no significance.

##Predict loyalty using transaction and payment
reg2 <- glm(actual ~ . -sex, data = df, family = "binomial") 
options(scipen=999)
summary(reg2)
## 
## Call:
## glm(formula = actual ~ . - sex, family = "binomial", data = df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7410  -0.2970   0.0044   0.4478   1.8672  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)  
## (Intercept) -25.931650  11.486623   -2.26    0.024 *
## transaction   0.001008   0.000494    2.04    0.041 *
## payment       0.010596   0.005089    2.08    0.037 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 33.271  on 23  degrees of freedom
## Residual deviance: 15.330  on 21  degrees of freedom
## AIC: 21.33
## 
## Number of Fisher Scoring iterations: 6
psuedoR2 <- 1-reg2$deviance/reg2$null.deviance
psuedoR2
## [1] 0.54

Our second model contained all predictors expect for “sex” to predict “actual” column. Psuedo R-squared gives us a value of 54%. This results in the same R squared as Reg1 and shows that by taking out “sex” we remain to have the same value.

##Predict loyalty using transaction only
reg3 <- glm(actual ~ transaction, data = df, family = "binomial") 
summary(reg3)
## 
## Call:
## glm(formula = actual ~ transaction, family = "binomial", data = df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8139  -0.7576  -0.0474   0.6980   1.8035  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)  
## (Intercept) -5.793682   2.469457   -2.35    0.019 *
## transaction  0.000782   0.000332    2.35    0.019 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 33.271  on 23  degrees of freedom
## Residual deviance: 23.984  on 22  degrees of freedom
## AIC: 27.98
## 
## Number of Fisher Scoring iterations: 5
psuedoR2 <- 1-reg3$deviance/reg3$null.deviance
psuedoR2
## [1] 0.28

Our third model contained the “transaction” predictor only to predict “actual” column. Psuedo R-squared gives us a value of 28%. This means that “transaction” contributes 28% in predicting loyalty alone.

##Predict Loyalty using payment only
reg4 <- glm(actual ~ payment, data = df, family = "binomial") 
summary(reg4)
## 
## Call:
## glm(formula = actual ~ payment, family = "binomial", data = df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6944  -0.8300   0.0176   0.7822   1.8017  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)  
## (Intercept) -12.17246    5.23656   -2.32    0.020 *
## payment       0.00705    0.00301    2.34    0.019 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 33.271  on 23  degrees of freedom
## Residual deviance: 24.733  on 22  degrees of freedom
## AIC: 28.73
## 
## Number of Fisher Scoring iterations: 4
psuedoR2 <- 1-reg4$deviance/reg4$null.deviance
psuedoR2
## [1] 0.26

Our forth model contained the “payment” predictor only to predict “actual” column. Psuedo R-squared gives us a value of 26%. This means that “payment” contributes 26% in predicting loyalty alone.

##Predict loyalty using sex only
reg5 <- glm(actual ~ sex, data = df, family = "binomial") 
summary(reg5)
## 
## Call:
## glm(formula = actual ~ sex, family = "binomial", data = df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.2558  -1.1127  -0.0058   1.1367   1.2435  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)    0.182      0.606    0.30     0.76
## sex           -0.336      0.822   -0.41     0.68
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 33.271  on 23  degrees of freedom
## Residual deviance: 33.103  on 22  degrees of freedom
## AIC: 37.1
## 
## Number of Fisher Scoring iterations: 3
psuedoR2 <- 1-reg5$deviance/reg5$null.deviance
psuedoR2
## [1] 0.0051

Our fifth model contained the “sex” predictor only to predict “actual” column. Psuedo R-squared gives us a value of 0.0051. As mentioned early, this makes sense because we said that the “sex” of the consumer do not help determine loyalty.

###Partitioning the dataset
set.seed(1)  # set seed for reproducing the partition

train.rows <- sample(1:nrow(df), 15)

train.df <- df[train.rows, ]
valid.df <- df[-train.rows, ]
reg6 <- lm(actual ~ ., data = train.df)
summary(reg6)
## 
## Call:
## lm(formula = actual ~ ., data = train.df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -0.482 -0.216 -0.109  0.336  0.531 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -1.7717782  0.7951806   -2.23   0.0477 * 
## transaction  0.0001308  0.0000373    3.51   0.0049 **
## payment      0.0007899  0.0004293    1.84   0.0929 . 
## sex          0.0559817  0.1877666    0.30   0.7711   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.36 on 11 degrees of freedom
## Multiple R-squared:  0.606,  Adjusted R-squared:  0.499 
## F-statistic: 5.64 on 3 and 11 DF,  p-value: 0.0137

We want to partition the dataset into two groups. Training and Validation. Without the formula being there, we want to understand how well the model predicts loyalty.

# use predict() to make predictions on a new set. 
lm.pred <- predict(reg6, valid.df)
options(scipen=999, digits = 0)
some.residuals <- valid.df$actual[1:5] - lm.pred[1:5]
data.frame("Actual" = valid.df$actual[1:5],"Predicted" = lm.pred[1:5], 
           "Residual" = some.residuals)
##    Actual Predicted Residual
## 3       1         1        0
## 8       1         1       -0
## 12      1         1        0
## 13      0         1       -1
## 15      0         0       -0

This output shows us that the model predict the correct class 4/5 times. Which is 80% correct.

# use accuracy() to compute common accuracy measures.
t(accuracy(lm.pred, valid.df$actual))
##      Test set
## ME         -0
## RMSE        0
## MAE         0
## MPE      -Inf
## MAPE      Inf

Gives is accuracy measures of the model.

search <- regsubsets(actual ~ ., data = train.df, nbest = 1, nvmax = dim(train.df)[2],
                     method = "exhaustive")
sum <- summary(search)

# show models
t(sum$which)
##                 1     2    3
## (Intercept)  TRUE  TRUE TRUE
## transaction  TRUE  TRUE TRUE
## payment     FALSE  TRUE TRUE
## sex         FALSE FALSE TRUE

We use the “exhaustive” method to determine which model is the best at predicting loyalty. It displays the 3 possible outcomes.

# show metrics
sum$rsq
## [1] 0 1 1
sum$adjr2
## [1] 0 1 0

Upon displaying the outcomes, it displays the R square and adjusted R sqaure for each model.

##Find the model with the highest adjusted R2.
x <- which(sum$adjr2== max(sum$adjr2))
t(t(sum$which[x,]))
##              [,1]
## (Intercept)  TRUE
## transaction  TRUE
## payment      TRUE
## sex         FALSE

The “exhaustive” method concludes that model 2 is the best model to predict loyalty. The predictors include: transaction and payment. As mentioned eariler, sex does not carry any signifance, therefore it must not be included. By running the Reg1 and Reg2, we got an R squared 0.54. Reg 1 included 3 predictors and Reg2 included 2. We always want to go with the model with the least predictors that does a just as good job at predicting. When displaying the outcome of the “exhaustive” method, the highest adjusted R squared is model 2, a value of 0.537 with only 2 predictors being transaction and payment.