In Lab 11, we will do some hands-on exercises on logistic regression. The data we will use today is the OJ dataset from the ISLR package (but because I added some modifications to the data, we won’t use the dataset straight from the ISLR package). Using the data, we will explain whether people purchased Minute Maid Orange Juice or Citrus Hill Orange Juice. The product purchased is stored in the Purchase variable, which will serve as our dependent variable. Before we begin, briefly examine this webpage to see the definition of each variable in the data (I added some new variables to the dataset, so the webpage does not contain the definitions of all variables).
Let’s first load the required packages in R. There is one .csv dataset that you can download from Canvas > Files > Lab Materials named ‘lab11_oj.csv’. Download the file and store it in a folder of your choice.
# Call the packages using library()
library(GGally)
library(car)
library(tidyverse)
library(DescTools)
# USE YOUR OWN working directory and file name
setwd("C:/Users/Bonwoo Koo/Dropbox/School/CP6025/Labs/Lab11") # Use your own pathname
oj <- read.csv("lab11_oj.csv")
str(oj)
## 'data.frame': 1070 obs. of 19 variables:
## $ year : int 2000 2000 2005 2000 2010 2010 2005 2005 2000 2005 ...
## $ Purchase : Factor w/ 2 levels "CH","MM": 1 1 1 2 1 1 1 1 1 1 ...
## $ WeekofPurchase: int 237 239 245 227 228 230 232 234 235 238 ...
## $ StoreID : int 1 1 1 1 7 7 7 7 7 7 ...
## $ PriceCH : num 1.75 1.75 1.86 1.69 1.69 1.69 1.69 1.75 1.75 1.75 ...
## $ PriceMM : num 1.99 1.99 2.09 1.69 1.69 1.99 1.99 1.99 1.99 1.99 ...
## $ DiscCH : num 0 0 0.17 0 0 0 0 0 0 0 ...
## $ DiscMM : num 0 0.3 0 0 0 0 0.4 0.4 0.4 0.4 ...
## $ SpecialCH : int 0 0 0 0 0 0 1 1 0 0 ...
## $ SpecialMM : int 0 1 0 0 0 1 1 0 0 0 ...
## $ LoyalCH : num 0.5 0.6 0.68 0.4 0.957 ...
## $ SalePriceMM : num 1.99 1.69 2.09 1.69 1.69 1.99 1.59 1.59 1.59 1.59 ...
## $ SalePriceCH : num 1.75 1.75 1.69 1.69 1.69 1.69 1.69 1.75 1.75 1.75 ...
## $ PriceDiff : num 0.24 -0.06 0.4 0 0 0.3 -0.1 -0.16 -0.16 -0.16 ...
## $ Store7 : Factor w/ 2 levels "No","Yes": 1 1 1 1 2 2 2 2 2 2 ...
## $ PctDiscMM : num 0 0.151 0 0 0 ...
## $ PctDiscCH : num 0 0 0.0914 0 0 ...
## $ ListPriceDiff : num 0.24 0.24 0.23 0 0 0.3 0.3 0.24 0.24 0.24 ...
## $ STORE : int 1 1 1 1 0 0 0 0 0 0 ...
So far, you have been given datasets that are well-cleaned by me or by the data provider (e.g., the Census Bureau), but that will not be the case once this class is over. When you are on your own, always take a close look at your data before you do any analysis.
Using summary() and str(), compare what you see in the data to the variable definitions and see if there is anything notable. In the summary() below, there are a few things that catch my attention.
- The year and WeekofPurchase variables are in numeric form. If you’d like to use those variables to represent the passage of time (i.e., a 1-unit change would then mean 1 year or 1 week), the current form is okay. But if you want to simply use them to represent different time points (i.e., as nominal variables), I’d consider converting them to character or factor format.
- The StoreID and STORE variables should be converted to character or factor format because each value of those variables is actually a store code and is thus nominal. The same applies to SpecialCH and SpecialMM. Treating these variables as numeric will seriously change the regression outcome, and we will get a misleading result.
- PctDiscMM and PctDiscCH appear to be proportions and range between 0 and 1 (rather than from 0 to 100). These unit issues will affect the interpretation of the regression coefficients and thus are important to note.
# Descriptive statistics
summary(oj)
## year Purchase WeekofPurchase StoreID PriceCH
## Min. :2000 CH:653 Min. :227.0 Min. :1.00 Min. :1.690
## 1st Qu.:2000 MM:417 1st Qu.:240.0 1st Qu.:2.00 1st Qu.:1.790
## Median :2005 Median :257.0 Median :3.00 Median :1.860
## Mean :2005 Mean :254.4 Mean :3.96 Mean :1.867
## 3rd Qu.:2010 3rd Qu.:268.0 3rd Qu.:7.00 3rd Qu.:1.990
## Max. :2010 Max. :278.0 Max. :7.00 Max. :2.090
## PriceMM DiscCH DiscMM SpecialCH
## Min. :1.690 Min. :0.00000 Min. :0.0000 Min. :0.0000
## 1st Qu.:1.990 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :2.090 Median :0.00000 Median :0.0000 Median :0.0000
## Mean :2.085 Mean :0.05186 Mean :0.1234 Mean :0.1477
## 3rd Qu.:2.180 3rd Qu.:0.00000 3rd Qu.:0.2300 3rd Qu.:0.0000
## Max. :2.290 Max. :0.50000 Max. :0.8000 Max. :1.0000
## SpecialMM LoyalCH SalePriceMM SalePriceCH
## Min. :0.0000 Min. :0.000011 Min. :1.190 Min. :1.390
## 1st Qu.:0.0000 1st Qu.:0.325257 1st Qu.:1.690 1st Qu.:1.750
## Median :0.0000 Median :0.600000 Median :2.090 Median :1.860
## Mean :0.1617 Mean :0.565782 Mean :1.962 Mean :1.816
## 3rd Qu.:0.0000 3rd Qu.:0.850873 3rd Qu.:2.130 3rd Qu.:1.890
## Max. :1.0000 Max. :0.999947 Max. :2.290 Max. :2.090
## PriceDiff Store7 PctDiscMM PctDiscCH
## Min. :-0.6700 No :714 Min. :0.0000 Min. :0.00000
## 1st Qu.: 0.0000 Yes:356 1st Qu.:0.0000 1st Qu.:0.00000
## Median : 0.2300 Median :0.0000 Median :0.00000
## Mean : 0.1465 Mean :0.0593 Mean :0.02731
## 3rd Qu.: 0.3200 3rd Qu.:0.1127 3rd Qu.:0.00000
## Max. : 0.6400 Max. :0.4020 Max. :0.25269
## ListPriceDiff STORE
## Min. :0.000 Min. :0.000
## 1st Qu.:0.140 1st Qu.:0.000
## Median :0.240 Median :2.000
## Mean :0.218 Mean :1.631
## 3rd Qu.:0.300 3rd Qu.:3.000
## Max. :0.440 Max. :4.000
Now, let’s fix some of those issues and assign the modified data.frame to oj.fix.
# Fixing the noted issues
oj.fix <- oj
oj.fix$year <- factor(oj.fix$year)
oj.fix$StoreID <- factor(oj.fix$StoreID)
oj.fix$STORE <- factor(oj.fix$STORE)
oj.fix$SpecialCH <- factor(oj.fix$SpecialCH)
oj.fix$SpecialMM <- factor(oj.fix$SpecialMM)
# In Tidyverse way
oj.fix <- oj %>%
mutate(year = factor(year),
StoreID = factor(StoreID),
STORE = factor(STORE),
SpecialCH = factor(SpecialCH),
SpecialMM = factor(SpecialMM))
A positive sign of a coefficient in a logistic regression suggests that a 1-unit increase in that IV is associated with a greater probability of the event happening. If we are doing a medical study that tests whether someone has cancer, the event would be the presence of cancer. Because the outcome in this case is about the presence or absence of something, it is natural to consider the presence the event; that is, having cancer is coded as 1 and not having cancer is coded as 0. If we develop a logistic regression explaining this variable using the age of patients, we would expect to see a positive relationship, meaning that a 1-unit increase in age is associated with a greater chance of the event (having cancer) happening.
In the example we are using, however, what it means for an event to happen is not straightforward because the two categories in the dependent variable are names of orange juice brands. In this case, we need to pick one and call it the event. Let’s define the event as the purchase of Minute Maid. There are two ways you can do this.
# Changing the reference category by creating a new factor
oj.fix$Purchase <- factor(oj.fix$Purchase,
levels = c("CH", "MM")) # Here, Buying CH is nonevent; buying MM is the event
# Changing the reference category by declaring 'ref'
oj.fix$Purchase <- relevel(oj.fix$Purchase,
ref = "CH") # Here, CH is nonevent. This is the 'base'.
As always, let’s take a look at the IVs. Since there are too many variables to fit into one plot, I selected some of the variables; the remainder of this lab document will use those selected variables. After drawing a plot using the selected variables, one thing in particular catches my eye: PriceDiff, for some reason, is much more strongly correlated with PctDiscMM than it is with PctDiscCH, with a high correlation coefficient. This may indicate a multicollinearity issue.
# This ggpairs groups data points based on the category in the dependent variable
ggpairs(oj.fix[,c("Purchase", "StoreID", "PriceCH", "PriceMM", "PctDiscMM", "PctDiscCH", "Store7", "PriceDiff", "LoyalCH")],
mapping = ggplot2::aes(color = Purchase, alpha = 0.5))
Let’s run a logistic regression using what we have so far. The function for logistic regression is glm(), which stands for generalized linear model. The arguments are identical to those for OLS except for the family = "binomial" argument. This argument is needed because the generalized linear model is actually not just one model but an umbrella term for various models.
For those who are curious to know why it is called generalized, here is a quote from Wikipedia. “In statistics, the generalized linear model (GLM) is a flexible generalization of ordinary linear regression that allows for response variables that have error distribution models other than a normal distribution. The GLM generalizes linear regression by allowing the linear model to be related to the response variable via a link function and by allowing the magnitude of the variance of each measurement to be a function of its predicted value” (Wikipedia, accessed Nov 10, 2019, https://en.wikipedia.org/wiki/Generalized_linear_model).
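To make the “umbrella” point concrete, the same glm() function fits other members of this family simply by changing the family argument. The calls below are only a hedged illustration (y, x, and d are hypothetical placeholders, not objects from this lab):
# glm(y ~ x, data = d, family = "gaussian") # equivalent to ordinary linear regression
# glm(y ~ x, data = d, family = "poisson")  # Poisson regression for count outcomes
# glm(y ~ x, data = d, family = "binomial") # logistic regression, which we use here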
# Fitting a logistic regression model
logit.model1 <- glm(Purchase ~ StoreID + PriceCH + PriceMM + PctDiscMM + PctDiscCH + Store7 + PriceDiff + LoyalCH, # Notice that I did not drop any variables
data = oj.fix,
family = "binomial")
summary(logit.model1)
##
## Call:
## glm(formula = Purchase ~ StoreID + PriceCH + PriceMM + PctDiscMM +
## PctDiscCH + Store7 + PriceDiff + LoyalCH, family = "binomial",
## data = oj.fix)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8157 -0.5446 -0.2408 0.5306 2.8042
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.527549 1.831745 1.926 0.0541 .
## StoreID2 0.009016 0.286318 0.031 0.9749
## StoreID3 -0.055000 0.329861 -0.167 0.8676
## StoreID4 -0.399226 0.374221 -1.067 0.2861
## StoreID7 -0.597578 0.270197 -2.212 0.0270 *
## PriceCH -13.053061 8.304722 -1.572 0.1160
## PriceMM 13.164038 8.060008 1.633 0.1024
## PctDiscMM -30.493872 17.323509 -1.760 0.0784 .
## PctDiscCH 24.859135 15.456068 1.608 0.1078
## Store7Yes NA NA NA NA
## PriceDiff -17.021085 8.262913 -2.060 0.0394 *
## LoyalCH -6.243439 0.407193 -15.333 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1430.85 on 1069 degrees of freedom
## Residual deviance: 821.15 on 1059 degrees of freedom
## AIC: 843.15
##
## Number of Fisher Scoring iterations: 5
Notice that there are NAs for the Store7 variable. This indicates that there is something I missed in the inspection (e.g., perfect collinearity). Let’s see what is happening between Store7 and StoreID. Since they are categorical variables, let’s use a contingency table.
You can see in the output below that you can perfectly predict the Store7 variable from the StoreID variable, because Store7 is “Yes” if the store’s ID is 7 and “No” for everything else (I have no idea why this variable exists in the first place). Anyway, we need to drop this variable from the regression.
# Store7 and StoreID
table(oj.fix$StoreID, oj.fix$Store7, dnn = c("StoreID", "Store7")) # dnn stands for dimension names
## Store7
## StoreID No Yes
## 1 157 0
## 2 222 0
## 3 196 0
## 4 139 0
## 7 0 356
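If you prefer a one-line check over the contingency table, a simple logical comparison (a minimal sketch, not part of the original lab code) confirms the same redundancy:
# TRUE means Store7 is completely determined by StoreID in every row
all((oj.fix$StoreID == "7") == (oj.fix$Store7 == "Yes"))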
# Revised logistic regression
logit.model2 <- glm(Purchase ~ StoreID + PriceCH + PriceMM + PctDiscMM + PctDiscCH + PriceDiff + LoyalCH, # I dropped Store7
data = oj.fix,
family = "binomial")
summary(logit.model2)
##
## Call:
## glm(formula = Purchase ~ StoreID + PriceCH + PriceMM + PctDiscMM +
## PctDiscCH + PriceDiff + LoyalCH, family = "binomial", data = oj.fix)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8157 -0.5446 -0.2408 0.5306 2.8042
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.527549 1.831745 1.926 0.0541 .
## StoreID2 0.009016 0.286318 0.031 0.9749
## StoreID3 -0.055000 0.329861 -0.167 0.8676
## StoreID4 -0.399226 0.374221 -1.067 0.2861
## StoreID7 -0.597578 0.270197 -2.212 0.0270 *
## PriceCH -13.053061 8.304722 -1.572 0.1160
## PriceMM 13.164038 8.060008 1.633 0.1024
## PctDiscMM -30.493872 17.323509 -1.760 0.0784 .
## PctDiscCH 24.859135 15.456068 1.608 0.1078
## PriceDiff -17.021085 8.262913 -2.060 0.0394 *
## LoyalCH -6.243439 0.407193 -15.333 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1430.85 on 1069 degrees of freedom
## Residual deviance: 821.15 on 1059 degrees of freedom
## AIC: 843.15
##
## Number of Fisher Scoring iterations: 5
Okay, at least the model did not blow up. But remember that we found a multicollinearity concern in section 3.3 above. Let’s check the VIF values. You can use the vif() function from the car package for logistic regression in the same way you used it for OLS.
As shown below, we have a serious multicollinearity issue because most of the IVs have VIFs higher than 10 (the highest is 592.51). Since there are multiple variables with high VIFs, it is difficult to figure out which one to exclude. I decided to test dropping PctDiscMM because (1) we observed a high correlation between PctDiscMM and PriceDiff and (2) we can estimate PctDiscMM using the remaining variables (see the quick check after the VIF output below).
# VIF values using car::vif
vif(logit.model2)
## GVIF Df GVIF^(1/(2*Df))
## StoreID 2.002056 4 1.090648
## PriceCH 100.736144 1 10.036740
## PriceMM 178.443294 1 13.358267
## PctDiscMM 413.983620 1 20.346587
## PctDiscCH 81.982961 1 9.054444
## PriceDiff 592.512897 1 24.341588
## LoyalCH 1.158587 1 1.076377
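Point (2) above can be checked directly. A VIF is just 1/(1 - R²) from regressing one IV on all the other IVs, so the auxiliary regression below (a minimal sketch; aux is just a throwaway name) should return an R-squared close to 1 for PctDiscMM — roughly 0.998, given its VIF of about 414:
# Auxiliary regression: how well do the other IVs predict PctDiscMM?
aux <- lm(PctDiscMM ~ StoreID + PriceCH + PriceMM + PctDiscCH + PriceDiff + LoyalCH,
          data = oj.fix)
summary(aux)$r.squared # VIF = 1/(1 - R^2), so an R^2 near 1 means severe multicollinearity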
# Revised logistic regression
logit.model3 <- glm(Purchase ~ StoreID + PriceCH + PriceMM + PctDiscCH + PriceDiff + LoyalCH, # I dropped Store7 & PctDiscMM
data = oj.fix,
family = "binomial") # Note that I excluded
vif(logit.model3)
## GVIF Df GVIF^(1/(2*Df))
## StoreID 1.771270 4 1.074077
## PriceCH 3.329685 1 1.824743
## PriceMM 2.720213 1 1.649307
## PctDiscCH 1.359359 1 1.165916
## PriceDiff 1.674619 1 1.294071
## LoyalCH 1.153348 1 1.073940
We can see that the VIF values are now in the safe range. Okay, I think it is now time to take a look at the regression result. When we interpret results from a logistic regression, it is much easier if we convert the coefficients back into odds ratios, which we can do using the exp() function.
# Regression result (standard output)
summary(logit.model3)
##
## Call:
## glm(formula = Purchase ~ StoreID + PriceCH + PriceMM + PctDiscCH +
## PriceDiff + LoyalCH, family = "binomial", data = oj.fix)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8025 -0.5493 -0.2416 0.5276 2.8006
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.7829 1.7688 1.573 0.1156
## StoreID2 0.1406 0.2751 0.511 0.6094
## StoreID3 -0.0193 0.3281 -0.059 0.9531
## StoreID4 -0.3468 0.3721 -0.932 0.3514
## StoreID7 -0.6134 0.2711 -2.263 0.0237 *
## PriceCH 1.3218 1.4996 0.881 0.3781
## PriceMM -0.9129 0.9876 -0.924 0.3553
## PctDiscCH -2.1370 1.9922 -1.073 0.2834
## PriceDiff -2.5131 0.4384 -5.732 9.9e-09 ***
## LoyalCH -6.2026 0.4052 -15.308 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1430.85 on 1069 degrees of freedom
## Residual deviance: 824.27 on 1060 degrees of freedom
## AIC: 844.27
##
## Number of Fisher Scoring iterations: 5
# Regression result (with odds ratio conversion)
round( # Rounds the numbers up to 3 digits
cbind( # Column-bind Odds Ratio to the regression output
"Odds Ratio" = exp(logit.model3$coefficients),
summary(logit.model3)$coefficients
),3)
## Odds Ratio Estimate Std. Error z value Pr(>|z|)
## (Intercept) 16.166 2.783 1.769 1.573 0.116
## StoreID2 1.151 0.141 0.275 0.511 0.609
## StoreID3 0.981 -0.019 0.328 -0.059 0.953
## StoreID4 0.707 -0.347 0.372 -0.932 0.351
## StoreID7 0.541 -0.613 0.271 -2.263 0.024
## PriceCH 3.750 1.322 1.500 0.881 0.378
## PriceMM 0.401 -0.913 0.988 -0.924 0.355
## PctDiscCH 0.118 -2.137 1.992 -1.073 0.283
## PriceDiff 0.081 -2.513 0.438 -5.732 0.000
## LoyalCH 0.002 -6.203 0.405 -15.308 0.000
Unlike lm(), the standard output from glm() does not provide an R-squared statistic. We need to use a package to calculate what are called pseudo R-squared statistics, which are designed to function similarly to the R-squared of OLS but are not quite the same.
PseudoR2(logit.model3, c("McFadden", "McFaddenAdj"))
## McFadden McFaddenAdj
## 0.4239337 0.4099560
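If you are curious what McFadden’s statistic is actually measuring, it is one minus the ratio of the fitted model’s log-likelihood to the log-likelihood of an intercept-only (null) model. The sketch below (null.model is just a throwaway name) should reproduce the McFadden value reported above:
# McFadden pseudo R-squared computed by hand: 1 - logLik(full) / logLik(null)
null.model <- glm(Purchase ~ 1, data = oj.fix, family = "binomial")
1 - as.numeric(logLik(logit.model3)) / as.numeric(logLik(null.model))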
If you cannot install the DescTools package, copy and paste the following code into your R console and run it. After that, the PseudoR2() function will be available in your R environment.
# A function for pseudo R-squared from DescTools package
PseudoR2 <- function (x, which = NULL)
{
  if (!(inherits(x, what = "glm") || inherits(x, what = "polr") ||
        inherits(x, what = "multinom") || inherits(x, what = "vglm")))
    return(NA)
  if (!(inherits(x, what = "vglm")) && !is.null(x$call$summ) &&
      !identical(x$call$summ, 0))
    stop("can NOT get Loglik when 'summ' argument is not zero")
  L.full <- logLik(x)
  D.full <- -2 * L.full
  if (inherits(x, what = "multinom"))
    L.base <- logLik(update(x, ~1, trace = FALSE))
  else if (inherits(x, what = "glm"))
    L.base <- logLik(glm(formula = reformulate("1",
                                               gsub(" .*$", "", deparse(x$formula))),
                         data = x$data, family = x$family))
  else L.base <- logLik(update(x, ~1))
  D.base <- -2 * L.base
  G2 <- -2 * (L.base - L.full)
  n <- attr(L.full, "nobs")
  if (inherits(x, "multinom"))
    edf <- x$edf
  else if (inherits(x, "vglm")) {
    edf <- x@rank
    n <- nobs(x)
  }
  else edf <- x$rank
  McFadden <- 1 - (L.full/L.base)
  McFaddenAdj <- 1 - ((L.full - edf)/L.base)
  Nagelkerke <- (1 - exp((D.full - D.base)/n))/(1 - exp(-D.base/n))
  CoxSnell <- 1 - exp(-G2/n)
  res <- c(McFadden = McFadden, McFaddenAdj = McFaddenAdj,
           CoxSnell = CoxSnell, Nagelkerke = Nagelkerke, AldrichNelson = NA,
           VeallZimmermann = NA, Efron = NA, McKelveyZavoina = NA,
           Tjur = NA, AIC = AIC(x), BIC = BIC(x), logLik = L.full,
           logLik0 = L.base, G2 = G2)
  if (inherits(x, what = "glm") || inherits(x, what = "vglm")) {
    if (inherits(x, what = "vglm")) {
      fam <- x@family@vfamily
      link <- if (all(x@extra$link == "logit")) {
        "logit"
      }
      else if (all(x@extra$link == "probit")) {
        "probit"
      }
      else {
        NA
      }
      y <- x@y
    }
    else {
      fam <- x$family$family
      link <- x$family$link
      y <- x$y
    }
    s2 <- switch(link, probit = 1, logit = pi^2/3, NA)
    res["AldrichNelson"] <- G2/(G2 + n)
    res["VeallZimmermann"] <- res["AldrichNelson"] *
      (2 * L.base - n)/(2 * L.base)
    y.hat <- predict(x, type = "link")
    sse <- sum((y.hat - mean(y.hat))^2)
    res["McKelveyZavoina"] <- sse/(n * s2 + sse)
    y.hat.resp <- predict(x, type = "response")
    res["Efron"] <- (1 - (sum((y - y.hat.resp)^2))/(sum((y -
                                                           mean(y))^2)))
    if (identical(fam, "binomial"))
      res["Tjur"] <- unname(diff(tapply(y.hat.resp,
                                        y, mean, na.rm = TRUE)))
  }
  if (is.null(which))
    which <- "McFadden"
  else which <- match.arg(which, c("McFadden", "AldrichNelson",
                                   "VeallZimmermann", "McFaddenAdj", "CoxSnell",
                                   "Nagelkerke", "Efron", "McKelveyZavoina",
                                   "Tjur", "AIC", "BIC", "logLik",
                                   "logLik0", "G2", "all"), several.ok = TRUE)
  if (any(which == "all"))
    return(res)
  else return(res[which])
}
As we learned in the lecture, odds ratios no longer have the additive interpretation that we are used to; the interpretation is now multiplicative. For example, 10 times 1 is 10, 10 times 0.8 is 8, and 10 times 1.2 is 12. Notice that a number does not change when it is multiplied by 1; it decreases if multiplied by something smaller than 1 and increases if multiplied by something greater than 1. In OLS, 0 is the point that separates a positive effect from a negative one. In logistic regression, an odds ratio of 1 is that point.
Let’s take a look at StoreID7. It has an odds ratio of 0.541 and an estimate of -0.613. This means that, compared to StoreID1, the odds of purchasing Minute Maid (the event, because the reference category is currently Citrus Hill) are lower by a factor of 0.541; in other words, a customer at Store 7 has 0.541 times the odds of purchasing Minute Maid that a customer at Store 1 has.
Similarly, the odds of purchasing Minute Maid are multiplied by 0.002 for a 1-unit increase in the LoyalCH variable. However, notice that the range of LoyalCH is between 0 and 1 (it is presumably measured as some sort of proportion). In this case, a 1-unit increase means going from the minimum level of loyalty all the way to the maximum level, which is a dramatic change. Assuming that this is in fact a proportion, multiplying it by 100 will yield a more reasonable coefficient.
Note that it is the odds of the event that gets multiplied by the odds ratio, not the probability and not the log of the odds.
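To see what “multiplying the odds” looks like in numbers, here is a small illustration using the StoreID7 odds ratio of 0.541: start from a baseline probability, convert it to odds, multiply by the odds ratio, and convert back. The baseline probability of 0.40 is made up purely for illustration.
# Illustration only: an odds ratio of 0.541 multiplies the odds, not the probability
p.base    <- 0.40                      # hypothetical baseline probability of buying MM
odds.base <- p.base / (1 - p.base)     # probability -> odds (about 0.67)
odds.new  <- odds.base * 0.541         # apply the odds ratio (about 0.36)
p.new     <- odds.new / (1 + odds.new) # odds -> probability (about 0.27)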
# Another regression with LoyalCH*100
logit.model4 <- glm(Purchase ~ StoreID + PriceCH + PriceMM + PctDiscCH + PriceDiff + I(LoyalCH*100),
data = oj.fix,
family = "binomial")
# Regression result (with odds ratio conversion)
round( # Rounds the numbers up to 3 digits
cbind( # Column-bind Odds Ratio to the regression output
"Odds Ratio" = exp(logit.model4$coefficients),
summary(logit.model4)$coefficients
),3)
## Odds Ratio Estimate Std. Error z value Pr(>|z|)
## (Intercept) 16.166 2.783 1.769 1.573 0.116
## StoreID2 1.151 0.141 0.275 0.511 0.609
## StoreID3 0.981 -0.019 0.328 -0.059 0.953
## StoreID4 0.707 -0.347 0.372 -0.932 0.351
## StoreID7 0.541 -0.613 0.271 -2.263 0.024
## PriceCH 3.750 1.322 1.500 0.881 0.378
## PriceMM 0.401 -0.913 0.988 -0.924 0.355
## PctDiscCH 0.118 -2.137 1.992 -1.073 0.283
## PriceDiff 0.081 -2.513 0.438 -5.732 0.000
## I(LoyalCH * 100) 0.940 -0.062 0.004 -15.308 0.000
You can see that nothing has changed except the Odds Ratio, Estimate, and Std. Error of LoyalCH. Now the odds ratio of the variable is not so extreme: the odds ratio of I(LoyalCH*100) represents the effect of a 1-percentage-point increase in LoyalCH.
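You can also verify that the rescaling only divides the coefficient by 100, so raising the new odds ratio to the 100th power recovers the original one. A quick hedged check:
# The two models imply the same effect: exp(-0.062)^100 equals exp(-6.203), about 0.002
exp(coef(logit.model4)["I(LoyalCH * 100)"])^100 # odds ratio for a full 0-to-1 change
exp(coef(logit.model3)["LoyalCH"])              # original odds ratio for LoyalCH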
One of you asked whether it is possible to plot the logistic regression curve to check whether the nonevent (i.e., the reference category) is correctly specified. I include code for this visualization below. There are a few steps that must be done to get the data ready for the graph shown below.
Step 1. Convert the Purchase variable into numeric form such that non-events are 0 and events are 1 (in the example below, buying Minute Maid is the event).
Step 2. Draw a scatterplot between the probability of the event and an independent variable of your choice (in the example below, I use LoyalCH).
Step 3. Add a smooth curve on top of the plot to represent the fitted logistic regression curve.
Note that the plot uses the ggplot2 package, which we did not cover in the lab. In case you want to apply the code to your own problems, you will need to (1) modify the code for Step 1 to specify the non-event and the event and (2) swap LoyalCH for the name of the variable you are interested in.
Also note that the curve shown below is NOT from the multivariable logistic regression model you fitted above; it is what you would get if you ran a logistic regression with just one independent variable (LoyalCH in the example below).
# Plot the curve between Y and X (data preparation in base R syntax)
oj.plot <- oj.fix
oj.plot$Purchase <- ifelse(oj.plot$Purchase == "CH", 0, 1) # Step 1
ggplot(oj.plot,aes(LoyalCH, Purchase)) +
geom_point(alpha = 0.15) + # Step 2
geom_smooth(method = "glm", method.args = list(family = "binomial")) + # Step 3
ggtitle("Logistic regression model fit") +
xlab("LoyalCH") +
ylab("Probability of Buying MM")
# Plot the curve between Y and X (using Tidyverse syntax)
oj.fix %>%
mutate(Purchase = ifelse(Purchase == "CH", 0, 1)) %>% # Step 1
ggplot(aes(LoyalCH, Purchase)) +
geom_point(alpha = .15) + # Step 2
geom_smooth(method = "glm", method.args = list(family = "binomial")) + # Step 3
ggtitle("Logistic regression model fit") +
xlab("LoyalCH") +
ylab("Probability of Buying MM")
As you can see in the graphs above, the probability of buying Minute Maid goes down as LoyalCH goes up.
The graph below shows what happens when the non-event is changed to buying Minute Maid (see the second line of code below). The curve is now flipped.
# Plot the curve between Y and X (using Tidyverse syntax)
oj.fix %>%
mutate(Purchase = ifelse(Purchase == "MM", 0, 1)) %>% # Step 1 (CH is changed to MM here)
ggplot(aes(LoyalCH, Purchase)) +
geom_point(alpha = .15) + # Step 2
geom_smooth(method = "glm", method.args = list(family = "binomial")) + # Step 3
ggtitle("Logistic regression model fit") +
xlab("LoyalCH") +
ylab("Probability of Buying CH")