Functions Tasks
relevel() Reorders the levels of a factor.
glm() Fits various generalized linear models.
round() Rounds a number to a specified number of digits.
cbind() Column-binds two data.frames.
exp() Takes the exponential of the input value.
PseudoR2() Calculates various pseudo R-squared statistics based on the log-likelihoods.

 

1. Introduction

In this lab, we will do some hands-on exercises on logistic regression. The data we will use today is the OJ dataset from the ISLR package (note: because I added some modifications to the data, we won’t use the dataset straight from the ISLR package).

Using the data, we will explain whether people purchased Minute Maid Orange Juice or Citrus Hill Orange Juice. The product purchased is recorded in the Purchase variable, which will serve as our dependent variable. Before we begin, briefly examine this webpage to see the definition of each variable in this data (I added some new variables to the dataset, so the webpage does not contain the definition of every variable).

 

2. Setting the working directory & reading data

Let’s first load the required packages in R using library(). There is one .csv dataset that you can download from Canvas > Files > Lab Materials, named ‘Week12_oj.csv’. Download the file and store it in a folder of your choice.

# Call the packages using library()
library(GGally)
library(car)
library(tidyverse)
library(DescTools)
# USE YOUR OWN working directory and file name
setwd("C:\\Users\\cod\\Desktop\\PhD Files\\GRA & GTA\\GTA\\CP6025 Fall 2021\\Week 12\\Lab10_2020") # Use your own pathname
oj <- read.csv("Week12_oj.csv")

str(oj)
## 'data.frame':    1070 obs. of  19 variables:
##  $ year          : int  2000 2000 2005 2000 2010 2010 2005 2005 2000 2005 ...
##  $ Purchase      : chr  "CH" "CH" "CH" "MM" ...
##  $ WeekofPurchase: int  237 239 245 227 228 230 232 234 235 238 ...
##  $ StoreID       : int  1 1 1 1 7 7 7 7 7 7 ...
##  $ PriceCH       : num  1.75 1.75 1.86 1.69 1.69 1.69 1.69 1.75 1.75 1.75 ...
##  $ PriceMM       : num  1.99 1.99 2.09 1.69 1.69 1.99 1.99 1.99 1.99 1.99 ...
##  $ DiscCH        : num  0 0 0.17 0 0 0 0 0 0 0 ...
##  $ DiscMM        : num  0 0.3 0 0 0 0 0.4 0.4 0.4 0.4 ...
##  $ SpecialCH     : int  0 0 0 0 0 0 1 1 0 0 ...
##  $ SpecialMM     : int  0 1 0 0 0 1 1 0 0 0 ...
##  $ LoyalCH       : num  0.5 0.6 0.68 0.4 0.957 ...
##  $ SalePriceMM   : num  1.99 1.69 2.09 1.69 1.69 1.99 1.59 1.59 1.59 1.59 ...
##  $ SalePriceCH   : num  1.75 1.75 1.69 1.69 1.69 1.69 1.69 1.75 1.75 1.75 ...
##  $ PriceDiff     : num  0.24 -0.06 0.4 0 0 0.3 -0.1 -0.16 -0.16 -0.16 ...
##  $ Store7        : chr  "No" "No" "No" "No" ...
##  $ PctDiscMM     : num  0 0.151 0 0 0 ...
##  $ PctDiscCH     : num  0 0 0.0914 0 0 ...
##  $ ListPriceDiff : num  0.24 0.24 0.23 0 0 0.3 0.3 0.24 0.24 0.24 ...
##  $ STORE         : int  1 1 1 1 0 0 0 0 0 0 ...

 

3. Understanding the data

3.1. Examining data itself

So far, you have been given datasets that were well cleaned either by me or by the data provider (e.g., the Census Bureau), but that will not be the case for you once this class is over. When you are on your own, always take a close look at your data before you do any analysis.

Using summary() and str(), compare what you see in the data to the variable definitions and see if there is anything notable. In the summary() below, there are a few things that catch my attention.

  1. year and WeekofPurchase are in numeric form. If you’d like to use those variables to represent the passage of time (i.e., a 1-unit change would then mean 1 year or 1 week), the current form is okay. But if you want to use them simply to represent different time points (i.e., as nominal variables), I’d consider converting them to character or factor format.
  2. Purchase and Store7 are not in numeric format. You can tell from the fact that they don’t have Min., 1st Qu., etc. under their names. I think the StoreID and STORE variables should be converted to character or factor format because their values are actually store codes and are thus nominal. The same applies to SpecialCH and SpecialMM (see the quick class check after the summary output below). Treating these variables as numeric will seriously distort the regression outcome and give us misleading results.
  3. The variables PctDiscMM and PctDiscCH appear to be proportions and range between 0 and 1 (rather than from 0 to 100). These unit issues affect the interpretation of the regression coefficients and are thus important to note.
  4. By comparing the summary to the variable definitions from the webpage, see if there are any variables that are incorrectly in numeric format (the opposite can also be the case). Please let me know if there is anything else I missed.
# Descriptive statistics
summary(oj)
##       year        Purchase         WeekofPurchase     StoreID    
##  Min.   :2000   Length:1070        Min.   :227.0   Min.   :1.00  
##  1st Qu.:2000   Class :character   1st Qu.:240.0   1st Qu.:2.00  
##  Median :2005   Mode  :character   Median :257.0   Median :3.00  
##  Mean   :2005                      Mean   :254.4   Mean   :3.96  
##  3rd Qu.:2010                      3rd Qu.:268.0   3rd Qu.:7.00  
##  Max.   :2010                      Max.   :278.0   Max.   :7.00  
##     PriceCH         PriceMM          DiscCH            DiscMM      
##  Min.   :1.690   Min.   :1.690   Min.   :0.00000   Min.   :0.0000  
##  1st Qu.:1.790   1st Qu.:1.990   1st Qu.:0.00000   1st Qu.:0.0000  
##  Median :1.860   Median :2.090   Median :0.00000   Median :0.0000  
##  Mean   :1.867   Mean   :2.085   Mean   :0.05186   Mean   :0.1234  
##  3rd Qu.:1.990   3rd Qu.:2.180   3rd Qu.:0.00000   3rd Qu.:0.2300  
##  Max.   :2.090   Max.   :2.290   Max.   :0.50000   Max.   :0.8000  
##    SpecialCH        SpecialMM         LoyalCH          SalePriceMM   
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.000011   Min.   :1.190  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.325257   1st Qu.:1.690  
##  Median :0.0000   Median :0.0000   Median :0.600000   Median :2.090  
##  Mean   :0.1477   Mean   :0.1617   Mean   :0.565782   Mean   :1.962  
##  3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:0.850873   3rd Qu.:2.130  
##  Max.   :1.0000   Max.   :1.0000   Max.   :0.999947   Max.   :2.290  
##   SalePriceCH      PriceDiff          Store7            PctDiscMM     
##  Min.   :1.390   Min.   :-0.6700   Length:1070        Min.   :0.0000  
##  1st Qu.:1.750   1st Qu.: 0.0000   Class :character   1st Qu.:0.0000  
##  Median :1.860   Median : 0.2300   Mode  :character   Median :0.0000  
##  Mean   :1.816   Mean   : 0.1465                      Mean   :0.0593  
##  3rd Qu.:1.890   3rd Qu.: 0.3200                      3rd Qu.:0.1127  
##  Max.   :2.090   Max.   : 0.6400                      Max.   :0.4020  
##    PctDiscCH       ListPriceDiff       STORE      
##  Min.   :0.00000   Min.   :0.000   Min.   :0.000  
##  1st Qu.:0.00000   1st Qu.:0.140   1st Qu.:0.000  
##  Median :0.00000   Median :0.240   Median :2.000  
##  Mean   :0.02731   Mean   :0.218   Mean   :1.631  
##  3rd Qu.:0.00000   3rd Qu.:0.300   3rd Qu.:3.000  
##  Max.   :0.25269   Max.   :0.440   Max.   :4.000

Now, let’s fix some of those issues and assign the modified data.frame to oj.fix.

# Fixing the noted issues
oj.fix <- oj
oj.fix$year <- factor(oj.fix$year)
oj.fix$StoreID <- factor(oj.fix$StoreID)
oj.fix$STORE <- factor(oj.fix$STORE)
oj.fix$SpecialCH <- factor(oj.fix$SpecialCH)
oj.fix$SpecialMM <- factor(oj.fix$SpecialMM)
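
If you prefer tidyverse syntax (the package is already loaded above), the same conversions can be written in a single mutate() call. This is just a sketch of an equivalent alternative, not an additional step.

# Equivalent factor conversion using mutate() and across() from dplyr
oj.fix <- oj %>%
  mutate(across(c(year, StoreID, STORE, SpecialCH, SpecialMM), factor))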

3.2. Examining the DV

A positive coefficient in a logistic regression suggests that a 1-unit increase in that IV is associated with a greater probability of the event happening. If we are doing a medical study that tests whether someone has cancer, the event would be the presence of cancer: having cancer is coded as 1 and not having cancer is coded as 0. Because the outcome in this case is about the presence or absence of something, it is natural to treat the presence as the event. If we then fit a logistic regression explaining this variable using patients’ age and see a positive coefficient, it means that a 1-unit increase in age is associated with a greater chance of the event (having cancer) happening.

In the example we are using, however, what it means for an event to happen is not straightforward because the two categories in the dependent variable are names of orange juice brands. In this case, we need to pick one and call it the event. Let’s define the event as the purchase of Minute Maid. There are two ways you can do this.

  1. By making the variable a factor (you can declare something a factor even if it is already a factor). Make sure you specify the levels of the factor so that they are written in “non-event”, “event” order.
# Changing the reference category by creating a new factor
oj.fix$Purchase <- factor(oj.fix$Purchase, 
                          levels = c("CH", "MM")) # Here, Buying CH is nonevent; buying MM is the event
  2. Or you can specify which category you want as the non-event using relevel().
# Changing the reference category by declaring 'ref'
oj.fix$Purchase <- relevel(oj.fix$Purchase, 
                           ref = "CH") # Here, CH is nonevent. This is the 'base'.

3.3. Examining the IVs

As always, let’s take a look at the IVs. Since there are too many variables to fit into one plot, I selected some of them; the remainder of this lab document will use those selected variables. After drawing a plot using the selected variables, a few things catch my eye:

  1. Citrus Hill appears to be a more popular choice.
  2. (Surprisingly) the prices of the two juices are not highly correlated.
  3. PriceDiff, for some reason, is more closely correlated with PctDiscMM than with PctDiscCH, with a high correlation coefficient. This may indicate a multicollinearity issue (see the quick correlation check after the plot code below).
  4. LoyalCH vividly separates the green and red dots (see the second-to-last column in the last row), indicating that it is likely to be one of the strongest predictors.
# This ggpairs groups data points based on the category in the dependent variable
ggpairs(oj.fix[,c("Purchase", "StoreID", "PriceCH", "PriceMM", "PctDiscMM", "PctDiscCH", "Store7", "PriceDiff", "LoyalCH")], 
        mapping = ggplot2::aes(color = Purchase, alpha = 0.5))
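
To put a number on item 3 above, the sketch below pulls the pairwise correlations among the discount and price-difference variables; these are the same quantities summarized in the correlation panels of the ggpairs plot.

# Pairwise correlations among the discount and price-difference variables
cor(oj.fix[, c("PriceDiff", "PctDiscMM", "PctDiscCH")])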

4. Running a logistic regression model

Let’s run a logistic regression using what we have so far. The function for logistic regression is glm(), which stands for generalized linear model. The arguments are identical to those of OLS (lm()) except for the family = "binomial" argument. This argument is needed because the generalized linear model is not just one model but an umbrella term for various models.

For those who are curious to know why it is called generalized, here is a quote from Wikipedia. “In statistics, the generalized linear model (GLM) is a flexible generalization of ordinary linear regression that allows for response variables that have error distribution models other than a normal distribution. The GLM generalizes linear regression by allowing the linear model to be related to the response variable via a link function and by allowing the magnitude of the variance of each measurement to be a function of its predicted value” (Wikipedia, accessed Nov 10, 2019, https://en.wikipedia.org/wiki/Generalized_linear_model).

# Fitting a logistic regression model
logit.model1 <- glm(Purchase ~ StoreID + PriceCH + PriceMM + PctDiscMM + PctDiscCH + Store7 + PriceDiff + LoyalCH, # Notice that I did not drop any variables
                   data = oj.fix,
                   family = "binomial") 
summary(logit.model1)
## 
## Call:
## glm(formula = Purchase ~ StoreID + PriceCH + PriceMM + PctDiscMM + 
##     PctDiscCH + Store7 + PriceDiff + LoyalCH, family = "binomial", 
##     data = oj.fix)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.8157  -0.5446  -0.2408   0.5306   2.8042  
## 
## Coefficients: (1 not defined because of singularities)
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   3.527549   1.831745   1.926   0.0541 .  
## StoreID2      0.009016   0.286318   0.031   0.9749    
## StoreID3     -0.055000   0.329861  -0.167   0.8676    
## StoreID4     -0.399226   0.374221  -1.067   0.2861    
## StoreID7     -0.597578   0.270197  -2.212   0.0270 *  
## PriceCH     -13.053061   8.304722  -1.572   0.1160    
## PriceMM      13.164038   8.060008   1.633   0.1024    
## PctDiscMM   -30.493872  17.323509  -1.760   0.0784 .  
## PctDiscCH    24.859135  15.456068   1.608   0.1078    
## Store7Yes           NA         NA      NA       NA    
## PriceDiff   -17.021085   8.262913  -2.060   0.0394 *  
## LoyalCH      -6.243439   0.407193 -15.333   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1430.85  on 1069  degrees of freedom
## Residual deviance:  821.15  on 1059  degrees of freedom
## AIC: 843.15
## 
## Number of Fisher Scoring iterations: 5

Notice that there is an NA for the Store7 variable. This indicates that there is something I missed in the inspection (e.g., perfect collinearity). Let’s see what is happening between Store7 and StoreID. Since they are both categorical variables, let’s use a contingency table.

You can see in the output below that you can perfectly predict the Store7 variable from the StoreID variable, because Store7 is “Yes” whenever the store ID is 7 and “No” for everything else (I have no idea why this variable exists in the first place). Anyway, we need to drop this variable from the regression.

# Store7 and StoreID
table(oj.fix$StoreID, oj.fix$Store7, dnn = c("StoreID", "Store7")) # dnn stands for dimension names
##        Store7
## StoreID  No Yes
##       1 157   0
##       2 222   0
##       3 196   0
##       4 139   0
##       7   0 356
# Revised logistic regression
logit.model2 <- glm(Purchase ~ StoreID + PriceCH + PriceMM + PctDiscMM + PctDiscCH + PriceDiff + LoyalCH, # I dropped Store7
                   data = oj.fix,
                   family = "binomial") 
summary(logit.model2)
## 
## Call:
## glm(formula = Purchase ~ StoreID + PriceCH + PriceMM + PctDiscMM + 
##     PctDiscCH + PriceDiff + LoyalCH, family = "binomial", data = oj.fix)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.8157  -0.5446  -0.2408   0.5306   2.8042  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   3.527549   1.831745   1.926   0.0541 .  
## StoreID2      0.009016   0.286318   0.031   0.9749    
## StoreID3     -0.055000   0.329861  -0.167   0.8676    
## StoreID4     -0.399226   0.374221  -1.067   0.2861    
## StoreID7     -0.597578   0.270197  -2.212   0.0270 *  
## PriceCH     -13.053061   8.304722  -1.572   0.1160    
## PriceMM      13.164038   8.060008   1.633   0.1024    
## PctDiscMM   -30.493872  17.323509  -1.760   0.0784 .  
## PctDiscCH    24.859135  15.456068   1.608   0.1078    
## PriceDiff   -17.021085   8.262913  -2.060   0.0394 *  
## LoyalCH      -6.243439   0.407193 -15.333   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1430.85  on 1069  degrees of freedom
## Residual deviance:  821.15  on 1059  degrees of freedom
## AIC: 843.15
## 
## Number of Fisher Scoring iterations: 5

Okay, at least the model did not blow up. But remember that we found a multicollinearity concern in Section 3.3 above. Let’s check the VIF values. You can use the vif() function from the car package for logistic regression the same way you used it for OLS.

As shown below, we have a serious multicollinearity issue: most of the IVs have VIFs higher than 10 (the highest is 592.51). Since multiple variables have high VIFs, it is difficult to figure out which one to exclude. I decided to test dropping PctDiscMM because (1) we observed a high correlation between PctDiscMM and PriceDiff and (2) we can estimate PctDiscMM using the remaining variables (see the quick check after the VIF output below).

# VIF values using car::vif
vif(logit.model2)
##                 GVIF Df GVIF^(1/(2*Df))
## StoreID     2.002056  4        1.090648
## PriceCH   100.736144  1       10.036740
## PriceMM   178.443294  1       13.358267
## PctDiscMM 413.983620  1       20.346587
## PctDiscCH  81.982961  1        9.054444
## PriceDiff 592.512897  1       24.341588
## LoyalCH     1.158587  1        1.076377
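
As a rough illustration of point (2) above, you can regress PctDiscMM on the other continuous predictors and look at the R-squared; for a single-degree-of-freedom term, VIF = 1 / (1 - R-squared) from this kind of auxiliary regression. The sketch below omits StoreID, so it will not exactly reproduce the GVIF reported above, but it shows why the value is so large.

# Auxiliary regression: how well do the other continuous predictors explain PctDiscMM?
aux <- lm(PctDiscMM ~ PriceCH + PriceMM + PctDiscCH + PriceDiff, data = oj.fix)
summary(aux)$r.squared            # likely very close to 1
1 / (1 - summary(aux)$r.squared)  # rough VIF for a single-df term
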
# Revised logistic regression
logit.model3 <- glm(Purchase ~ StoreID + PriceCH + PriceMM + PctDiscCH + PriceDiff + LoyalCH, # I dropped Store7 & PctDiscMM
                   data = oj.fix,
                   family = "binomial") # Note that I excluded PctDiscMM here
vif(logit.model3)
##               GVIF Df GVIF^(1/(2*Df))
## StoreID   1.771270  4        1.074077
## PriceCH   3.329685  1        1.824743
## PriceMM   2.720213  1        1.649307
## PctDiscCH 1.359359  1        1.165916
## PriceDiff 1.674619  1        1.294071
## LoyalCH   1.153348  1        1.073940

We can see that the VIF values are now in the safe range. Okay, I think it is now time to take a look at the regression result. When we interpret a result from a logistic regression, it is much easier if we convert the coefficients into odds ratios. We can do that using the exp() function.

# Regression result (standard output)
summary(logit.model3)
## 
## Call:
## glm(formula = Purchase ~ StoreID + PriceCH + PriceMM + PctDiscCH + 
##     PriceDiff + LoyalCH, family = "binomial", data = oj.fix)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.8025  -0.5493  -0.2416   0.5276   2.8006  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   2.7829     1.7688   1.573   0.1156    
## StoreID2      0.1406     0.2751   0.511   0.6094    
## StoreID3     -0.0193     0.3281  -0.059   0.9531    
## StoreID4     -0.3468     0.3721  -0.932   0.3514    
## StoreID7     -0.6134     0.2711  -2.263   0.0237 *  
## PriceCH       1.3218     1.4996   0.881   0.3781    
## PriceMM      -0.9129     0.9876  -0.924   0.3553    
## PctDiscCH    -2.1370     1.9922  -1.073   0.2834    
## PriceDiff    -2.5131     0.4384  -5.732  9.9e-09 ***
## LoyalCH      -6.2026     0.4052 -15.308  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1430.85  on 1069  degrees of freedom
## Residual deviance:  824.27  on 1060  degrees of freedom
## AIC: 844.27
## 
## Number of Fisher Scoring iterations: 5
# Regression result (with odds ratio conversion)
round( # Rounds the numbers to 3 digits
  cbind( # Column-bind the odds ratios to the regression output
    "Odds Ratio" = exp(logit.model3$coefficients),
    summary(logit.model3)$coefficients
    ),3)
##             Odds Ratio Estimate Std. Error z value Pr(>|z|)
## (Intercept)     16.166    2.783      1.769   1.573    0.116
## StoreID2         1.151    0.141      0.275   0.511    0.609
## StoreID3         0.981   -0.019      0.328  -0.059    0.953
## StoreID4         0.707   -0.347      0.372  -0.932    0.351
## StoreID7         0.541   -0.613      0.271  -2.263    0.024
## PriceCH          3.750    1.322      1.500   0.881    0.378
## PriceMM          0.401   -0.913      0.988  -0.924    0.355
## PctDiscCH        0.118   -2.137      1.992  -1.073    0.283
## PriceDiff        0.081   -2.513      0.438  -5.732    0.000
## LoyalCH          0.002   -6.203      0.405 -15.308    0.000

Unlike lm(), the standard output from glm() does not provide an R-squared statistic. We need a package to calculate what are called pseudo R-squared statistics, which are designed to play a role similar to that of the OLS R-squared but are not quite the same.

PseudoR2(logit.model3, c("McFadden", "McFaddenAdj"))
##    McFadden McFaddenAdj 
##   0.4239337   0.4099560
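
If you want to see where McFadden’s statistic comes from, it is 1 minus the ratio of the model’s log-likelihood to the log-likelihood of an intercept-only (null) model. A minimal sketch that should reproduce the McFadden value above:

# Intercept-only (null) model on the same data
null.model <- glm(Purchase ~ 1, data = oj.fix, family = "binomial")

# McFadden's pseudo R-squared: 1 - logLik(full) / logLik(null)
1 - as.numeric(logLik(logit.model3)) / as.numeric(logLik(null.model))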

If you cannot install the DescTools package, copy and paste the following code into your R console and run it. After that, the function will be available in your R environment.

# A function for pseudo R-squared from DescTools package
PseudoR2 <- function (x, which = NULL) 
{
    if (!(inherits(x, what = "glm") || inherits(x, what = "polr") || 
        inherits(x, what = "multinom") || inherits(x, what = "vglm"))) 
        return(NA)
    if (!(inherits(x, what = "vglm")) && !is.null(x$call$summ) && 
        !identical(x$call$summ, 0)) 
        stop("can NOT get Loglik when 'summ' argument is not zero")
    L.full <- logLik(x)
    D.full <- -2 * L.full
    if (inherits(x, what = "multinom")) 
        L.base <- logLik(update(x, ~1, trace = FALSE))
    else if (inherits(x, what = "glm")) 
        L.base <- logLik(glm(formula = reformulate("1", 
            gsub(" .*$", "", deparse(x$formula))), 
            data = x$data, family = x$family))
    else L.base <- logLik(update(x, ~1))
    D.base <- -2 * L.base
    G2 <- -2 * (L.base - L.full)
    n <- attr(L.full, "nobs")
    if (inherits(x, "multinom")) 
        edf <- x$edf
    else if (inherits(x, "vglm")) {
        edf <- x@rank
        n <- nobs(x)
    }
    else edf <- x$rank
    McFadden <- 1 - (L.full/L.base)
    McFaddenAdj <- 1 - ((L.full - edf)/L.base)
    Nagelkerke <- (1 - exp((D.full - D.base)/n))/(1 - exp(-D.base/n))
    CoxSnell <- 1 - exp(-G2/n)
    res <- c(McFadden = McFadden, McFaddenAdj = McFaddenAdj, 
        CoxSnell = CoxSnell, Nagelkerke = Nagelkerke, AldrichNelson = NA, 
        VeallZimmermann = NA, Efron = NA, McKelveyZavoina = NA, 
        Tjur = NA, AIC = AIC(x), BIC = BIC(x), logLik = L.full, 
        logLik0 = L.base, G2 = G2)
    if (inherits(x, what = "glm") || inherits(x, what = "vglm")) {
        if (inherits(x, what = "vglm")) {
            fam <- x@family@vfamily
            link <- if (all(x@extra$link == "logit")) {
                "logit"
            }
            else if (all(x@extra$link == "probit")) {
                "probit"
            }
            else {
                NA
            }
            y <- x@y
        }
        else {
            fam <- x$family$family
            link <- x$family$link
            y <- x$y
        }
        s2 <- switch(link, probit = 1, logit = pi^2/3, NA)
        res["AldrichNelson"] <- G2/(G2 + n)
        res["VeallZimmermann"] <- res["AldrichNelson"] * 
            (2 * L.base - n)/(2 * L.base)
        y.hat <- predict(x, type = "link")
        sse <- sum((y.hat - mean(y.hat))^2)
        res["McKelveyZavoina"] <- sse/(n * s2 + sse)
        y.hat.resp <- predict(x, type = "response")
        res["Efron"] <- (1 - (sum((y - y.hat.resp)^2))/(sum((y - 
            mean(y))^2)))
        if (identical(fam, "binomial")) 
            res["Tjur"] <- unname(diff(tapply(y.hat.resp, 
                y, mean, na.rm = TRUE)))
    }
    if (is.null(which)) 
        which <- "McFadden"
    else which <- match.arg(which, c("McFadden", "AldrichNelson", 
        "VeallZimmermann", "McFaddenAdj", "CoxSnell", 
        "Nagelkerke", "Efron", "McKelveyZavoina", 
        "Tjur", "AIC", "BIC", "logLik", 
        "logLik0", "G2", "all"), several.ok = TRUE)
    if (any(which == "all")) 
        return(res)
    else return(res[which])
}

5. Interpreting a logistic regression model

As we learned in the lecture, odds ratios no longer have the additive interpretation we are used to; the interpretation is now multiplicative. For example, 10 times 1 is 10, 10 times 0.8 is 8, and 10 times 1.2 is 12. Notice that a number does not change when it is multiplied by 1: it decreases if multiplied by something smaller than 1 and increases if multiplied by something greater than 1. In OLS, 0 is the point that separates positive and negative effects; in logistic regression, an odds ratio of 1 is that point.

“An easy way to understand this is that an Odds ratio of 1 means that there are no odds of having the event of interest (i.e. outcome being evaluated such as an infection) compared to the comparison group (e.g. a placebo/control group in a clinical trial or a group with no complication during surgery). In other words, an odds ratio of 1 means that there are no higher or lower odds of the outcome happening. An odds ratio of above 1 means that there is a greater likelihood of having the outcome and an Odds ratio of below 1 means that there is a lesser likelihood of having the outcome.

When the Odds ratio is above 1 and below 2, the likelihood of having the event is represented as XX % higher odds (where XX % is Odds ratio -1). That means that if odds ratio is 1.24, the likelihood of having the outcome is 24% higher (1.24 – 1 = 0.24 i.e. 24%) than the comparison group. If odds ratio is 1.66, the likelihood of having the outcome is 66% higher. You get the drift. Using the example from the paragraph above, you may read the following sentence in a published paper: “The odds of having a postoperative infection is 65% higher if the patient experienced a complication during the surgery as opposed to not experiencing the complication”. Here the odds ratio would be 1.65.

When Odds ratio is above 2, then it is simply put as XX times higher likelihood (XX=odds ratio). If odds ratio is 2.5, then there is a 2.5 times higher likelihood of having the outcome compared to the comparison group. From the above example, one might say “The odds of having a postoperative infection is 3.5 times higher if the patient experienced a complication during the surgery as opposed to not experiencing the complication”. Here the odds ratio would be 3.5.

When the odds ratio is lower than 1, the likelihood of having the outcome is XX% lower (XX% = 1-Odds ratio). For e.g. if odds ratio is 0.70, then there is a 30% lower likelihood of having the outcome. In the above example related to surgical infection, you might say “The odds of having a postoperative infection is 20% lower if the patient was given prophylactic antibiotics during the surgery as opposed to not getting the antibiotic”. Here the odds ratio would be 0.80”.

source: https://senguptasresearchacademy.com/odds-ratio/

Let’s take a look at StoreID7, which has an odds ratio of 0.541. It means that the odds of purchasing Minute Maid at store 7 (because the reference category is currently Citrus Hill) are 45.9% lower (1 - 0.541 = 0.459) than at store 1, holding all else constant (p = 0.024; the 99% CI for the odds ratio, shown further below, runs from 0.268 to 1.087, i.e., from 8.7% higher to 73.2% lower odds).

Similarly, the odds ratio for LoyalCH is 0.002, meaning that the odds of purchasing Minute Maid are about 99.8% lower for a 1-unit increase in LoyalCH, holding all else constant (p < 0.001). However, notice that the range of LoyalCH is between 0 and 1 (it is possibly measured as some sort of proportion). In that case, a 1-unit increase means changing from the minimum level of loyalty to the maximum level, which is a dramatic change. Assuming that this is in fact a proportion, multiplying it by 100 will yield a more reasonable coefficient: as shown below, the odds of purchasing Minute Maid are then about 6% lower per 1-point increase in loyalty (99% CI: 5.1% to 7.0% lower).

Note that it is the odds of the event that gets multiplied by the odds ratio, not the probability nor the log of odds.
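
As a small worked example of that point (the baseline probability below is made up purely for illustration; it does not come from the model): start from a probability, convert it to odds, multiply by an odds ratio, then convert back.

# Hypothetical baseline probability of the event (made up for illustration)
p0    <- 0.30
odds0 <- p0 / (1 - p0)        # baseline odds, about 0.43

# Apply the StoreID7 odds ratio from the output above (about 0.541)
odds1 <- odds0 * 0.541

# Convert the new odds back to a probability, about 0.19
p1 <- odds1 / (1 + odds1)
c(odds0 = odds0, odds1 = odds1, p1 = p1)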

# Another regression with LoyalCH*100
logit.model4 <- glm(Purchase ~ StoreID + PriceCH + PriceMM + PctDiscCH + PriceDiff + I(LoyalCH*100), 
                   data = oj.fix,
                   family = "binomial")

# Regression result (with odds ratio conversion)
round( # Rounds the numbers to 3 digits
  cbind( # Column-bind the odds ratios to the regression output
    "Odds Ratio" = exp(logit.model4$coefficients),
    summary(logit.model4)$coefficients
    ),3)
##                  Odds Ratio Estimate Std. Error z value Pr(>|z|)
## (Intercept)          16.166    2.783      1.769   1.573    0.116
## StoreID2              1.151    0.141      0.275   0.511    0.609
## StoreID3              0.981   -0.019      0.328  -0.059    0.953
## StoreID4              0.707   -0.347      0.372  -0.932    0.351
## StoreID7              0.541   -0.613      0.271  -2.263    0.024
## PriceCH               3.750    1.322      1.500   0.881    0.378
## PriceMM               0.401   -0.913      0.988  -0.924    0.355
## PctDiscCH             0.118   -2.137      1.992  -1.073    0.283
## PriceDiff             0.081   -2.513      0.438  -5.732    0.000
## I(LoyalCH * 100)      0.940   -0.062      0.004 -15.308    0.000

Odds ratio conversion and confidence intervals

# Regression result (with odds ratio conversion and confidence interval)
round( # Rounds the numbers to 3 digits
  exp(cbind(OR =logit.model4$coefficients, confint(logit.model4, level=.99)))
  ,3)
## Waiting for profiling to be done...
##                      OR 0.5 %   99.5 %
## (Intercept)      16.166 0.171 1581.132
## StoreID2          1.151 0.566    2.344
## StoreID3          0.981 0.420    2.282
## StoreID4          0.707 0.267    1.828
## StoreID7          0.541 0.268    1.087
## PriceCH           3.750 0.079  182.152
## PriceMM           0.401 0.031    5.068
## PctDiscCH         0.118 0.001   16.858
## PriceDiff         0.081 0.026    0.246
## I(LoyalCH * 100)  0.940 0.930    0.949

You can see that nothing has changed except the odds ratio, estimate, and standard error of LoyalCH. Now the odds ratio of the variable is not so extreme: the odds ratio of I(LoyalCH*100) represents the effect of a 1-point (out of 100) increase in LoyalCH.
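
You can verify the rescaling directly: multiplying a predictor by 100 simply divides its coefficient by 100 and leaves the other coefficients untouched, so the two quantities below should match.

# Coefficient on the rescaled variable equals the original coefficient / 100
coef(logit.model3)["LoyalCH"] / 100
coef(logit.model4)["I(LoyalCH * 100)"]

# Odds ratio for a 0.01-unit (i.e., 1-point-out-of-100) change in LoyalCH
exp(coef(logit.model3)["LoyalCH"] / 100)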

6. EXTRA: Plotting the Logistic Curve

A student from a past year asked whether it is possible to plot the fitted logistic regression curve to check that the non-event (i.e., the reference category) is correctly specified. I include code for the visualization below; it is optional, and you are not required to master the material shown here.

There are a few steps that must be done in order to get the data ready for the graph shown below.

Step 1. You need to convert the Purchase variable into numeric form such that non-events are 0 and events are 1 (in the example below, buying Minute Maid is the event).

Step 2. Draw a scatterplot of the 0/1 outcome against an independent variable of your choice (in the example below, I use LoyalCH).

Step 3. Add a smooth curve on top of the plot to represent the fitted logistic regression curve.

Note that the plots use the ggplot2 package, which we did not cover in the lab. In case you want to apply the code to your own problems, you will need to (1) modify the code for Step 1 to specify your non-event and event and (2) swap LoyalCH with the name of the variable you are interested in.

Also note that the curve shown below is NOT from the multivariable logistic regression model you fitted above; the curve is what you would get if you ran a logistic regression with just one independent variable (LoyalCH in the example below).
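
For reference, here is a minimal sketch of that one-predictor model; it is essentially what geom_smooth() fits behind the scenes in the plots below.

# Single-predictor logistic regression underlying the plotted curve
curve.model <- glm(Purchase ~ LoyalCH, data = oj.fix, family = "binomial")
summary(curve.model)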

# Plot the curve between Y and X (data prepared with base R syntax)
oj.plot <- oj.fix
oj.plot$Purchase <- ifelse(oj.plot$Purchase == "CH", 0, 1) # Step 1

ggplot(oj.plot,aes(LoyalCH, Purchase)) + 
  geom_point(alpha = 0.15) + # Step 2
  geom_smooth(method = "glm", method.args = list(family = "binomial")) + # Step 3
  ggtitle("Logistic regression model fit") +
  xlab("LoyalCH") +
  ylab("Probability of Buying MM")
## `geom_smooth()` using formula 'y ~ x'

# Plot the curve between Y and X (using Tidyverse syntax)
oj.fix %>%
  mutate(Purchase = ifelse(Purchase == "CH", 0, 1)) %>% # Step 1
  ggplot(aes(LoyalCH, Purchase)) +
  geom_point(alpha = .15) + # Step 2
  geom_smooth(method = "glm", method.args = list(family = "binomial")) + # Step 3
  ggtitle("Logistic regression model fit") +
  xlab("LoyalCH") +
  ylab("Probability of Buying MM")
## `geom_smooth()` using formula 'y ~ x'

As you can see in the graphs above, the probability of buying Minute Maid goes down as LoyalCH goes up.

The graph below shows the case when the non-event is changed to buying Minute Maid (see the second line of code below). The curve is now flipped, a mirror image of the one above.

# Plot the curve between Y and X (using Tidyverse syntax)
oj.fix %>%
  mutate(Purchase = ifelse(Purchase == "MM", 0, 1)) %>% # Step 1 (CH is changed to MM here)
  ggplot(aes(LoyalCH, Purchase)) +
  geom_point(alpha = .15) + # Step 2
  geom_smooth(method = "glm", method.args = list(family = "binomial")) + # Step 3
  ggtitle("Logistic regression model fit") +
  xlab("LoyalCH") +
  ylab("Probability of Buying CH")
## `geom_smooth()` using formula 'y ~ x'
