Logistic Regression

We will do a simple logistic regression problem.
The store manager wants to find out the loyalty of his customer. For this he collects data regarding customers’
* Brand preference = on a scale 1-7
* Shopping attitude = on a scale 1-7
* Product preference = on a scale 1-7

So we will load data into R.

head(loyalty)

##   Loyalty brand product shopping
## 1       1     4       3        5
## 2       1     6       4        4
## 3       1     5       2        4
## 4       1     7       5        5
## 5       1     6       3        4
## 6       1     3       4        5

dim(loyalty)

## [1] 30  4

str(loyalty)

## 'data.frame':    30 obs. of  4 variables:
##  $ Loyalty : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ brand   : int  4 6 5 7 6 3 5 5 7 7 ...
##  $ product : int  3 4 2 5 3 4 5 4 5 6 ...
##  $ shopping: int  5 4 4 5 4 5 5 2 4 4 ...

# Since our response variable is numeric, we will go ahead and change that to factor
loyalty$Loyalty <- as.factor(loyalty$Loyalty)
str(loyalty)

## 'data.frame':    30 obs. of  4 variables:
##  $ Loyalty : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ brand   : int  4 6 5 7 6 3 5 5 7 7 ...
##  $ product : int  3 4 2 5 3 4 5 4 5 6 ...
##  $ shopping: int  5 4 4 5 4 5 5 2 4 4 ...

We build the logistic regression model on the data

loyalty.logreg <- glm(Loyalty ~ brand + product + shopping, family = "binomial", data = loyalty)
summary(loyalty.logreg)

## 
## Call:
## glm(formula = Loyalty ~ brand + product + shopping, family = "binomial", 
##     data = loyalty)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5075  -0.5452   0.0459   0.5840   1.6774  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)   -8.642      3.346   -2.58   0.0098 **
## brand          1.274      0.479    2.66   0.0078 **
## product        0.186      0.322    0.58   0.5629   
## shopping       0.590      0.491    1.20   0.2297   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 41.589  on 29  degrees of freedom
## Residual deviance: 23.471  on 26  degrees of freedom
## AIC: 31.47
## 
## Number of Fisher Scoring iterations: 5

To check how the model performed, we will do model validation.
Classification table

loyalty.probs <- predict(loyalty.logreg, type = "response")
loyalty.probs[1:5]

##      1      2      3      4      5 
## 0.4905 0.8915 0.6129 0.9846 0.8721

loyalty.pred <- ifelse(loyalty.probs > .5, "1", "0")
table(loyalty.pred, loyalty$Loyalty)

##             
## loyalty.pred  0  1
##            0 12  3
##            1  3 12

mean(loyalty.pred == loyalty$Loyalty)

## [1] 0.8

ROC Curve

require(ROCR)
loyalty.roc <- prediction(loyalty.probs, loyalty$Loyalty)
plot(performance(loyalty.roc, "tpr", "fpr"), col = "red", main = "ROC Curve")
abline(0,1, lty = 8, col = "grey")

plot of chunk unnamed-chunk-5

Kolmogorov-Smirnov test

ks <- performance(loyalty.roc, "tpr", "fpr")
loyalty.ks <- max(attr(ks, "y.values")[[1]] - (attr(ks, "x.values")[[1]]))
loyalty.ks

## [1] 0.7333

Since we see only one variable is significant, we’ll run another model with this significant variable only.

loyalty.logreg2 <- glm(Loyalty ~ brand, family = "binomial", data = loyalty)
summary(loyalty.logreg2)

## 
## Call:
## glm(formula = Loyalty ~ brand, family = "binomial", data = loyalty)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0147  -0.4665   0.0767   0.5308   2.1317  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)    -6.21       2.21   -2.81    0.005 **
## brand           1.35       0.47    2.87    0.004 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 41.589  on 29  degrees of freedom
## Residual deviance: 25.100  on 28  degrees of freedom
## AIC: 29.1
## 
## Number of Fisher Scoring iterations: 5

We will do model validation for this model too.
Classification table

loyalty.probs2 <- predict(loyalty.logreg2, type = "response")
loyalty.probs2[1:5]

##      1      2      3      4      5 
## 0.3073 0.8686 0.6313 0.9623 0.8686

loyalty.pred2 <- ifelse(loyalty.probs > 0.5, "1", "0")
table(loyalty.pred2, loyalty$Loyalty)

##              
## loyalty.pred2  0  1
##             0 12  3
##             1  3 12

mean(loyalty.pred2 == loyalty$Loyalty)

## [1] 0.8

ROC Curve

loyalty.roc2 <- prediction(loyalty.probs2, loyalty$Loyalty)
plot(performance(loyalty.roc2, "tpr", "fpr"), col = "red", main = "ROC Curve")
abline(0,1, lty = 2, col = "grey")

plot of chunk unnamed-chunk-9

Kolmogorov-Smirnov test

ks2 <- performance(loyalty.roc2, "tpr", "fpr")
loyalty.ks2 <- max(attr(ks2, "y.values")[[1]] - (attr(ks2, "x.values")[[1]]))
loyalty.ks2

## [1] 0.6667

For every unit change in brand preference of the customer, the loyalty logit ratio increases by 1.35. Based on this model, if there is a new customer with a brand preference of 4, then what is the probability of him being loyal to the brand.

# Brand value from the equation is multiplied by 4
1.35*4

## [1] 5.4

# Result is added with the constant
(-6.21) + 5.4

## [1] -0.81

# We take the exponent of the outcome value
exp(0.81)

## [1] 2.248

# Probability
2.248/(1 + 2.248)

## [1] 0.6921

So a new customer with a brand preference of 4 rating can be said to have a loyalty probability of 0.69

Logistic Regression

Loy

Wednesday, January 21, 2015