In Lab 11, we will do some hands-on exercises on logistic regression. The data we will use today is the OJ dataset from the ISLR package (but because I added some modifications to the data, we won’t use the dataset straight from the ISLR package). Using the data, we will explain whether people purchased Minute Maid Orange Juice or Citrus Hill Orange Juice. The product purchased is stored in the Purchase variable, which will serve as our dependent variable. Before we begin, briefly examine this webpage to see the definition of each variable in the data (I added some new variables to the dataset, so the webpage does not contain the definitions of all variables).
Let’s first load the required packages in R. There is one .csv dataset that you can download from Canvas > Files > Lab Materials named ‘lab11_oj.csv’. Download the file and store it in a folder of your choice.
# Call the packages using library()
library(GGally)
library(car)
library(tidyverse)
library(DescTools)
# USE YOUR OWN working directory and file name
setwd("C:/Users/Bonwoo Koo/Dropbox/School/CP6025/Labs/Lab11") # Use your own pathname
oj <- read.csv("lab11_oj.csv")
str(oj)
## 'data.frame': 1070 obs. of 19 variables:
## $ year : int 2000 2000 2005 2000 2010 2010 2005 2005 2000 2005 ...
## $ Purchase : Factor w/ 2 levels "CH","MM": 1 1 1 2 1 1 1 1 1 1 ...
## $ WeekofPurchase: int 237 239 245 227 228 230 232 234 235 238 ...
## $ StoreID : int 1 1 1 1 7 7 7 7 7 7 ...
## $ PriceCH : num 1.75 1.75 1.86 1.69 1.69 1.69 1.69 1.75 1.75 1.75 ...
## $ PriceMM : num 1.99 1.99 2.09 1.69 1.69 1.99 1.99 1.99 1.99 1.99 ...
## $ DiscCH : num 0 0 0.17 0 0 0 0 0 0 0 ...
## $ DiscMM : num 0 0.3 0 0 0 0 0.4 0.4 0.4 0.4 ...
## $ SpecialCH : int 0 0 0 0 0 0 1 1 0 0 ...
## $ SpecialMM : int 0 1 0 0 0 1 1 0 0 0 ...
## $ LoyalCH : num 0.5 0.6 0.68 0.4 0.957 ...
## $ SalePriceMM : num 1.99 1.69 2.09 1.69 1.69 1.99 1.59 1.59 1.59 1.59 ...
## $ SalePriceCH : num 1.75 1.75 1.69 1.69 1.69 1.69 1.69 1.75 1.75 1.75 ...
## $ PriceDiff : num 0.24 -0.06 0.4 0 0 0.3 -0.1 -0.16 -0.16 -0.16 ...
## $ Store7 : Factor w/ 2 levels "No","Yes": 1 1 1 1 2 2 2 2 2 2 ...
## $ PctDiscMM : num 0 0.151 0 0 0 ...
## $ PctDiscCH : num 0 0 0.0914 0 0 ...
## $ ListPriceDiff : num 0.24 0.24 0.23 0 0 0.3 0.3 0.24 0.24 0.24 ...
## $ STORE : int 1 1 1 1 0 0 0 0 0 0 ...
So far, you have been given datasets that are well-cleaned by me or by the data provider (e.g., the Census Bureau), but that will not be the case once this class is over. When you are on your own, always take a close look at your data before you do any analysis.
Using summary() and str(), compare what you see in the data to the variable definitions and see if there is anything notable. In the summary() below, there are a few things that catch my attention.
- The year and WeekofPurchase variables are in numeric form. If you’d like to use those variables to represent the passage of time (i.e., a 1-unit change would then mean 1 year or 1 week), the current form is okay. But if you want to simply use them to represent different time points (i.e., as nominal variables), I’d consider converting them to character or factor format.
- The StoreID and STORE variables should be converted to character or factor format because each value of those variables is actually a store code and is thus nominal. The same applies to SpecialCH and SpecialMM. Treating these variables as numeric will seriously change the regression outcome, and we will get a misleading result.
- PctDiscMM and PctDiscCH appear to be proportions and range between 0 and 1 (rather than from 0 to 100). These unit issues will affect the interpretation of the regression coefficients and thus are important to note.
# Descriptive statistics
summary(oj)
## year Purchase WeekofPurchase StoreID PriceCH
## Min. :2000 CH:653 Min. :227.0 Min. :1.00 Min. :1.690
## 1st Qu.:2000 MM:417 1st Qu.:240.0 1st Qu.:2.00 1st Qu.:1.790
## Median :2005 Median :257.0 Median :3.00 Median :1.860
## Mean :2005 Mean :254.4 Mean :3.96 Mean :1.867
## 3rd Qu.:2010 3rd Qu.:268.0 3rd Qu.:7.00 3rd Qu.:1.990
## Max. :2010 Max. :278.0 Max. :7.00 Max. :2.090
## PriceMM DiscCH DiscMM SpecialCH
## Min. :1.690 Min. :0.00000 Min. :0.0000 Min. :0.0000
## 1st Qu.:1.990 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :2.090 Median :0.00000 Median :0.0000 Median :0.0000
## Mean :2.085 Mean :0.05186 Mean :0.1234 Mean :0.1477
## 3rd Qu.:2.180 3rd Qu.:0.00000 3rd Qu.:0.2300 3rd Qu.:0.0000
## Max. :2.290 Max. :0.50000 Max. :0.8000 Max. :1.0000
## SpecialMM LoyalCH SalePriceMM SalePriceCH
## Min. :0.0000 Min. :0.000011 Min. :1.190 Min. :1.390
## 1st Qu.:0.0000 1st Qu.:0.325257 1st Qu.:1.690 1st Qu.:1.750
## Median :0.0000 Median :0.600000 Median :2.090 Median :1.860
## Mean :0.1617 Mean :0.565782 Mean :1.962 Mean :1.816
## 3rd Qu.:0.0000 3rd Qu.:0.850873 3rd Qu.:2.130 3rd Qu.:1.890
## Max. :1.0000 Max. :0.999947 Max. :2.290 Max. :2.090
## PriceDiff Store7 PctDiscMM PctDiscCH
## Min. :-0.6700 No :714 Min. :0.0000 Min. :0.00000
## 1st Qu.: 0.0000 Yes:356 1st Qu.:0.0000 1st Qu.:0.00000
## Median : 0.2300 Median :0.0000 Median :0.00000
## Mean : 0.1465 Mean :0.0593 Mean :0.02731
## 3rd Qu.: 0.3200 3rd Qu.:0.1127 3rd Qu.:0.00000
## Max. : 0.6400 Max. :0.4020 Max. :0.25269
## ListPriceDiff STORE
## Min. :0.000 Min. :0.000
## 1st Qu.:0.140 1st Qu.:0.000
## Median :0.240 Median :2.000
## Mean :0.218 Mean :1.631
## 3rd Qu.:0.300 3rd Qu.:3.000
## Max. :0.440 Max. :4.000
Now, let’s fix some of those issues and assign the modified data.frame to oj.fix.
# Fixing the noted issues
oj.fix <- oj
oj.fix$year <- factor(oj.fix$year)
oj.fix$StoreID <- factor(oj.fix$StoreID)
oj.fix$STORE <- factor(oj.fix$STORE)
oj.fix$SpecialCH <- factor(oj.fix$SpecialCH)
oj.fix$SpecialMM <- factor(oj.fix$SpecialMM)
# In Tidyverse way
oj.fix <- oj %>%
mutate(year = factor(year),
StoreID = factor(StoreID),
STORE = factor(STORE),
SpecialCH = factor(SpecialCH),
SpecialMM = factor(SpecialMM))
A positive sign of a coefficient in a logistic regression suggests that a 1-unit increase in that IV is associated with a greater probability of the event happening. If we are doing a medical study that tests whether someone has cancer, the event would be the presence of cancer. Because the outcome in this case is about the presence or absence of something, it is natural to consider the presence the event; that is, having cancer is coded as 1 and not having cancer is coded as 0. If we develop a logistic regression explaining this variable using the age of patients, we would expect to see a positive relationship, meaning that a 1-unit increase in age is associated with a greater chance of the event (having cancer) happening.
In the example we are using, however, what it means for an event to happen is not straightforward because the two categories in the dependent variable are names of orange juice brands. In this case, we need to pick one and call it the event. Let’s define the event as the purchase of Minute Maid. There are two ways you can do this.
# Changing the reference category by creating a new factor
oj.fix$Purchase <- factor(oj.fix$Purchase,
levels = c("CH", "MM")) # Here, Buying CH is nonevent; buying MM is the event
# Changing the reference category by declaring 'ref'
oj.fix$Purchase <- relevel(oj.fix$Purchase,
ref = "CH") # Here, CH is nonevent. This is the 'base'.
As always, let’s take a look at the IVs. Since there are too many variables to fit into one plot, I selected some of the variables; the remainder of this lab document will use those selected variables. After drawing a plot using the selected variables, one thing in particular catches my eye: PriceDiff, for some reason, is much more strongly correlated with PctDiscMM than it is with PctDiscCH, with a high correlation coefficient. This may indicate a multicollinearity issue.
# This ggpairs groups data points based on the category in the dependent variable
ggpairs(oj.fix[,c("Purchase", "StoreID", "PriceCH", "PriceMM", "PctDiscMM", "PctDiscCH", "Store7", "PriceDiff", "LoyalCH")],
mapping = ggplot2::aes(color = Purchase, alpha = 0.5))
Let’s run a logistic regression using what we have so far. The function for logistic regression is glm(), which stands for generalized linear model. The arguments are identical to those for OLS except for the family = "binomial" argument. This argument is needed because the generalized linear model is actually not just one model but an umbrella term for various models.
For those who are curious to know why it is called generalized, here is a quote from Wikipedia. “In statistics, the generalized linear model (GLM) is a flexible generalization of ordinary linear regression that allows for response variables that have error distribution models other than a normal distribution. The GLM generalizes linear regression by allowing the linear model to be related to the response variable via a link function and by allowing the magnitude of the variance of each measurement to be a function of its predicted value” (Wikipedia, accessed Nov 10, 2019, https://en.wikipedia.org/wiki/Generalized_linear_model).
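To make the “umbrella” point concrete, the same glm() function fits other members of this family simply by changing the family argument. The calls below are only a hedged illustration (y, x, and d are hypothetical placeholders, not objects from this lab):
# glm(y ~ x, data = d, family = "gaussian") # equivalent to ordinary linear regression
# glm(y ~ x, data = d, family = "poisson")  # Poisson regression for count outcomes
# glm(y ~ x, data = d, family = "binomial") # logistic regression, which we use here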
# Fitting a logistic regression model
logit.model1 <- glm(Purchase ~ StoreID + PriceCH + PriceMM + PctDiscMM + PctDiscCH + Store7 + PriceDiff + LoyalCH, # Notice that I did not drop any variables
data = oj.fix,
family = "binomial")
summary(logit.model1)
##
## Call:
## glm(formula = Purchase ~ StoreID + PriceCH + PriceMM + PctDiscMM +
## PctDiscCH + Store7 + PriceDiff + LoyalCH, family = "binomial",
## data = oj.fix)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8157 -0.5446 -0.2408 0.5306 2.8042
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.527549 1.831745 1.926 0.0541 .
## StoreID2 0.009016 0.286318 0.031 0.9749
## StoreID3 -0.055000 0.329861 -0.167 0.8676
## StoreID4 -0.399226 0.374221 -1.067 0.2861
## StoreID7 -0.597578 0.270197 -2.212 0.0270 *
## PriceCH -13.053061 8.304722 -1.572 0.1160
## PriceMM 13.164038 8.060008 1.633 0.1024
## PctDiscMM -30.493872 17.323509 -1.760 0.0784 .
## PctDiscCH 24.859135 15.456068 1.608 0.1078
## Store7Yes NA NA NA NA
## PriceDiff -17.021085 8.262913 -2.060 0.0394 *
## LoyalCH -6.243439 0.407193 -15.333 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1430.85 on 1069 degrees of freedom
## Residual deviance: 821.15 on 1059 degrees of freedom
## AIC: 843.15
##
## Number of Fisher Scoring iterations: 5
Notice that there are NAs for the Store7 variable. This indicates that there is something I missed in the inspection (e.g., perfect collinearity). Let’s see what is happening between Store7 and StoreID. Since they are categorical variables, let’s use a contingency table.
You can see in the output below that you can perfectly predict the Store7 variable from the StoreID variable, because Store7 is “Yes” if the store’s ID is 7 and “No” for everything else (I have no idea why this variable exists in the first place). Anyway, we need to drop this variable from the regression.
# Store7 and StoreID
table(oj.fix$StoreID, oj.fix$Store7, dnn = c("StoreID", "Store7")) # dnn stands for dimension names
## Store7
## StoreID No Yes
## 1 157 0
## 2 222 0
## 3 196 0
## 4 139 0
## 7 0 356
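If you prefer a one-line check over the contingency table, a simple logical comparison (a minimal sketch, not part of the original lab code) confirms the same redundancy:
# TRUE means Store7 is completely determined by StoreID in every row
all((oj.fix$StoreID == "7") == (oj.fix$Store7 == "Yes"))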
# Revised logistic regression
logit.model2 <- glm(Purchase ~ StoreID + PriceCH + PriceMM + PctDiscMM + PctDiscCH + PriceDiff + LoyalCH, # I dropped Store7
data = oj.fix,
family = "binomial")
summary(logit.model2)
##
## Call:
## glm(formula = Purchase ~ StoreID + PriceCH + PriceMM + PctDiscMM +
## PctDiscCH + PriceDiff + LoyalCH, family = "binomial", data = oj.fix)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8157 -0.5446 -0.2408 0.5306 2.8042
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.527549 1.831745 1.926 0.0541 .
## StoreID2 0.009016 0.286318 0.031 0.9749
## StoreID3 -0.055000 0.329861 -0.167 0.8676
## StoreID4 -0.399226 0.374221 -1.067 0.2861
## StoreID7 -0.597578 0.270197 -2.212 0.0270 *
## PriceCH -13.053061 8.304722 -1.572 0.1160
## PriceMM 13.164038 8.060008 1.633 0.1024
## PctDiscMM -30.493872 17.323509 -1.760 0.0784 .
## PctDiscCH 24.859135 15.456068 1.608 0.1078
## PriceDiff -17.021085 8.262913 -2.060 0.0394 *
## LoyalCH -6.243439 0.407193 -15.333 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1430.85 on 1069 degrees of freedom
## Residual deviance: 821.15 on 1059 degrees of freedom
## AIC: 843.15
##
## Number of Fisher Scoring iterations: 5
Okay, at least the model did not blow up. But remember that we found a multicollinearity concern in section 3.3 above. Let’s check the VIF values. You can use the vif() function from the car package for logistic regression in the same way you used it for OLS.
As shown below, we have a serious multicollinearity issue because most of the IVs have VIFs higher than 10 (the highest is 592.51). Since there are multiple variables with high VIFs, it is difficult to figure out which one to exclude. I decided to test dropping PctDiscMM because (1) we observed a high correlation between PctDiscMM and PriceDiff and (2) we can estimate PctDiscMM using the remaining variables (see the quick check after the VIF output below).
# VIF values using car::vif
vif(logit.model2)
## GVIF Df GVIF^(1/(2*Df))
## StoreID 2.002056 4 1.090648
## PriceCH 100.736144 1 10.036740
## PriceMM 178.443294 1 13.358267
## PctDiscMM 413.983620 1 20.346587
## PctDiscCH 81.982961 1 9.054444
## PriceDiff 592.512897 1 24.341588
## LoyalCH 1.158587 1 1.076377
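Point (2) above can be checked directly. A VIF is just 1/(1 - R²) from regressing one IV on all the other IVs, so the auxiliary regression below (a minimal sketch; aux is just a throwaway name) should return an R-squared close to 1 for PctDiscMM — roughly 0.998, given its VIF of about 414:
# Auxiliary regression: how well do the other IVs predict PctDiscMM?
aux <- lm(PctDiscMM ~ StoreID + PriceCH + PriceMM + PctDiscCH + PriceDiff + LoyalCH,
          data = oj.fix)
summary(aux)$r.squared # VIF = 1/(1 - R^2), so an R^2 near 1 means severe multicollinearity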
# Revised logistic regression
logit.model3 <- glm(Purchase ~ StoreID + PriceCH + PriceMM + PctDiscCH + PriceDiff + LoyalCH, # I dropped Store7 & PctDiscMM
data = oj.fix,
family = "binomial") # Note that I excluded
vif(logit.model3)
## GVIF Df GVIF^(1/(2*Df))
## StoreID 1.771270 4 1.074077
## PriceCH 3.329685 1 1.824743
## PriceMM 2.720213 1 1.649307
## PctDiscCH 1.359359 1 1.165916
## PriceDiff 1.674619 1 1.294071
## LoyalCH 1.153348 1 1.073940
We can see that the VIF values are now in the safe range. Okay, I think it is now time to take a look at the regression result. When we interpret results from a logistic regression, it is much easier if we convert the coefficients back into odds ratios, which we can do using the exp() function.
# Regression result (standard output)
summary(logit.model3)
##
## Call:
## glm(formula = Purchase ~ StoreID + PriceCH + PriceMM + PctDiscCH +
## PriceDiff + LoyalCH, family = "binomial", data = oj.fix)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8025 -0.5493 -0.2416 0.5276 2.8006
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.7829 1.7688 1.573 0.1156
## StoreID2 0.1406 0.2751 0.511 0.6094
## StoreID3 -0.0193 0.3281 -0.059 0.9531
## StoreID4 -0.3468 0.3721 -0.932 0.3514
## StoreID7 -0.6134 0.2711 -2.263 0.0237 *
## PriceCH 1.3218 1.4996 0.881 0.3781
## PriceMM -0.9129 0.9876 -0.924 0.3553
## PctDiscCH -2.1370 1.9922 -1.073 0.2834
## PriceDiff -2.5131 0.4384 -5.732 9.9e-09 ***
## LoyalCH -6.2026 0.4052 -15.308 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1430.85 on 1069 degrees of freedom
## Residual deviance: 824.27 on 1060 degrees of freedom
## AIC: 844.27
##
## Number of Fisher Scoring iterations: 5
# Regression result (with odds ratio conversion)
round( # Rounds the numbers up to 3 digits
cbind( # Column-bind Odds Ratio to the regression output
"Odds Ratio" = exp(logit.model3$coefficients),
summary(logit.model3)$coefficients
),3)
## Odds Ratio Estimate Std. Error z value Pr(>|z|)
## (Intercept) 16.166 2.783 1.769 1.573 0.116
## StoreID2 1.151 0.141 0.275 0.511 0.609
## StoreID3 0.981 -0.019 0.328 -0.059 0.953
## StoreID4 0.707 -0.347 0.372 -0.932 0.351
## StoreID7 0.541 -0.613 0.271 -2.263 0.024
## PriceCH 3.750 1.322 1.500 0.881 0.378
## PriceMM 0.401 -0.913 0.988 -0.924 0.355
## PctDiscCH 0.118 -2.137 1.992 -1.073 0.283
## PriceDiff 0.081 -2.513 0.438 -5.732 0.000
## LoyalCH 0.002 -6.203 0.405 -15.308 0.000
Unlike lm(), the standard output from glm() does not provide an R-squared statistic. We need to use a package to calculate what are called pseudo R-squared statistics, which are designed to function similarly to the R-squared of OLS but are not quite the same.
PseudoR2(logit.model3, c("McFadden", "McFaddenAdj"))
## McFadden McFaddenAdj
## 0.4239337 0.4099560
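If you are curious what McFadden’s statistic is actually measuring, it is one minus the ratio of the fitted model’s log-likelihood to the log-likelihood of an intercept-only (null) model. The sketch below (null.model is just a throwaway name) should reproduce the McFadden value reported above:
# McFadden pseudo R-squared computed by hand: 1 - logLik(full) / logLik(null)
null.model <- glm(Purchase ~ 1, data = oj.fix, family = "binomial")
1 - as.numeric(logLik(logit.model3)) / as.numeric(logLik(null.model))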
If you cannot install the DescTools package, copy and paste the following code into your R console and run it. After that, the PseudoR2() function will be available in your R environment.
# A function for pseudo R-squared from DescTools package
PseudoR2 <- function (x, which = NULL)
{
  if (!(inherits(x, what = "glm") || inherits(x, what = "polr") ||
        inherits(x, what = "multinom") || inherits(x, what = "vglm")))
    return(NA)
  if (!(inherits(x, what = "vglm")) && !is.null(x$call$summ) &&
      !identical(x$call$summ, 0))
    stop("can NOT get Loglik when 'summ' argument is not zero")
  L.full <- logLik(x)
  D.full <- -2 * L.full
  if (inherits(x, what = "multinom"))
    L.base <- logLik(update(x, ~1, trace = FALSE))
  else if (inherits(x, what = "glm"))
    L.base <- logLik(glm(formula = reformulate("1",
                                               gsub(" .*$", "", deparse(x$formula))),
                         data = x$data, family = x$family))
  else L.base <- logLik(update(x, ~1))
  D.base <- -2 * L.base
  G2 <- -2 * (L.base - L.full)
  n <- attr(L.full, "nobs")
  if (inherits(x, "multinom"))
    edf <- x$edf
  else if (inherits(x, "vglm")) {
    edf <- x@rank
    n <- nobs(x)
  }
  else edf <- x$rank
  McFadden <- 1 - (L.full/L.base)
  McFaddenAdj <- 1 - ((L.full - edf)/L.base)
  Nagelkerke <- (1 - exp((D.full - D.base)/n))/(1 - exp(-D.base/n))
  CoxSnell <- 1 - exp(-G2/n)
  res <- c(McFadden = McFadden, McFaddenAdj = McFaddenAdj,
           CoxSnell = CoxSnell, Nagelkerke = Nagelkerke, AldrichNelson = NA,
           VeallZimmermann = NA, Efron = NA, McKelveyZavoina = NA,
           Tjur = NA, AIC = AIC(x), BIC = BIC(x), logLik = L.full,
           logLik0 = L.base, G2 = G2)
  if (inherits(x, what = "glm") || inherits(x, what = "vglm")) {
    if (inherits(x, what = "vglm")) {
      fam <- x@family@vfamily
      link <- if (all(x@extra$link == "logit")) {
        "logit"
      }
      else if (all(x@extra$link == "probit")) {
        "probit"
      }
      else {
        NA
      }
      y <- x@y
    }
    else {
      fam <- x$family$family
      link <- x$family$link
      y <- x$y
    }
    s2 <- switch(link, probit = 1, logit = pi^2/3, NA)
    res["AldrichNelson"] <- G2/(G2 + n)
    res["VeallZimmermann"] <- res["AldrichNelson"] *
      (2 * L.base - n)/(2 * L.base)
    y.hat <- predict(x, type = "link")
    sse <- sum((y.hat - mean(y.hat))^2)
    res["McKelveyZavoina"] <- sse/(n * s2 + sse)
    y.hat.resp <- predict(x, type = "response")
    res["Efron"] <- (1 - (sum((y - y.hat.resp)^2))/(sum((y -
                                                           mean(y))^2)))
    if (identical(fam, "binomial"))
      res["Tjur"] <- unname(diff(tapply(y.hat.resp,
                                        y, mean, na.rm = TRUE)))
  }
  if (is.null(which))
    which <- "McFadden"
  else which <- match.arg(which, c("McFadden", "AldrichNelson",
                                   "VeallZimmermann", "McFaddenAdj", "CoxSnell",
                                   "Nagelkerke", "Efron", "McKelveyZavoina",
                                   "Tjur", "AIC", "BIC", "logLik",
                                   "logLik0", "G2", "all"), several.ok = TRUE)
  if (any(which == "all"))
    return(res)
  else return(res[which])
}
As we learned in the lecture, odds ratios no longer have the additive interpretation that we are used to; the interpretation is now multiplicative. For example, 10 times 1 is 10, 10 times 0.8 is 8, and 10 times 1.2 is 12. Notice that a number does not change when it is multiplied by 1; it decreases if multiplied by something smaller than 1 and increases if multiplied by something greater than 1. In OLS, 0 is the point that separates a positive effect from a negative one. In logistic regression, an odds ratio of 1 is that point.
Let’s take a look at StoreID7. It has an odds ratio of 0.541 and an estimate of -0.613. This means that, compared to StoreID1, the odds of purchasing Minute Maid (the event, because the reference category is currently Citrus Hill) are lower by a factor of 0.541; in other words, a customer at Store 7 has 0.541 times the odds of purchasing Minute Maid that a customer at Store 1 has.
Similarly, the odds of purchasing Minute Maid are multiplied by 0.002 for a 1-unit increase in the LoyalCH variable. However, notice that the range of LoyalCH is between 0 and 1 (it is presumably measured as some sort of proportion). In this case, a 1-unit increase means going from the minimum level of loyalty all the way to the maximum level, which is a dramatic change. Assuming that this is in fact a proportion, multiplying it by 100 will yield a more reasonable coefficient.
Note that it is the odds of the event that gets multiplied by the odds ratio, not the probability and not the log of the odds.
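To see what “multiplying the odds” looks like in numbers, here is a small illustration using the StoreID7 odds ratio of 0.541: start from a baseline probability, convert it to odds, multiply by the odds ratio, and convert back. The baseline probability of 0.40 is made up purely for illustration.
# Illustration only: an odds ratio of 0.541 multiplies the odds, not the probability
p.base    <- 0.40                      # hypothetical baseline probability of buying MM
odds.base <- p.base / (1 - p.base)     # probability -> odds (about 0.67)
odds.new  <- odds.base * 0.541         # apply the odds ratio (about 0.36)
p.new     <- odds.new / (1 + odds.new) # odds -> probability (about 0.27)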
# Another regression with LoyalCH*100
logit.model4 <- glm(Purchase ~ StoreID + PriceCH + PriceMM + PctDiscCH + PriceDiff + I(LoyalCH*100),
data = oj.fix,
family = "binomial")
# Regression result (with odds ratio conversion)
round( # Rounds the numbers up to 3 digits
cbind( # Column-bind Odds Ratio to the regression output
"Odds Ratio" = exp(logit.model4$coefficients),
summary(logit.model4)$coefficients
),3)
## Odds Ratio Estimate Std. Error z value Pr(>|z|)
## (Intercept) 16.166 2.783 1.769 1.573 0.116
## StoreID2 1.151 0.141 0.275 0.511 0.609
## StoreID3 0.981 -0.019 0.328 -0.059 0.953
## StoreID4 0.707 -0.347 0.372 -0.932 0.351
## StoreID7 0.541 -0.613 0.271 -2.263 0.024
## PriceCH 3.750 1.322 1.500 0.881 0.378
## PriceMM 0.401 -0.913 0.988 -0.924 0.355
## PctDiscCH 0.118 -2.137 1.992 -1.073 0.283
## PriceDiff 0.081 -2.513 0.438 -5.732 0.000
## I(LoyalCH * 100) 0.940 -0.062 0.004 -15.308 0.000
You can see that nothing has changed except the Odds Ratio, Estimate, and Std. Error of LoyalCH. Now the odds ratio of the variable is not so extreme: the odds ratio of I(LoyalCH*100) represents the effect of a 1-percentage-point increase in LoyalCH.
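You can also verify that the rescaling only divides the coefficient by 100, so raising the new odds ratio to the 100th power recovers the original one. A quick hedged check:
# The two models imply the same effect: exp(-0.062)^100 equals exp(-6.203), about 0.002
exp(coef(logit.model4)["I(LoyalCH * 100)"])^100 # odds ratio for a full 0-to-1 change
exp(coef(logit.model3)["LoyalCH"])              # original odds ratio for LoyalCH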
One of you asked whether it is possible to plot the logistic regression curve to check whether the nonevent (i.e., the reference category) is correctly specified. I include code for this visualization below. There are a few steps that must be done to get the data ready for the graph shown below.
Step 1. Convert the Purchase variable into numeric form such that non-events are 0 and events are 1 (in the example below, buying Minute Maid is the event).
Step 2. Draw a scatterplot between the probability of the event and an independent variable of your choice (in the example below, I use LoyalCH).
Step 3. Add a smooth curve on top of the plot to represent the fitted logistic regression curve.
Note that the plot uses the ggplot2 package, which we did not cover in the lab. In case you want to apply the code to your own problems, you will need to (1) modify the code for Step 1 to specify the non-event and the event and (2) swap LoyalCH for the name of the variable you are interested in.
Also note that the curve shown below is NOT from the multivariable logistic regression model you fitted above; it is what you would get if you ran a logistic regression with just one independent variable (LoyalCH in the example below).
# Plot the curve between Y and X (data preparation in base R syntax)
oj.plot <- oj.fix
oj.plot$Purchase <- ifelse(oj.plot$Purchase == "CH", 0, 1) # Step 1
ggplot(oj.plot,aes(LoyalCH, Purchase)) +
geom_point(alpha = 0.15) + # Step 2
geom_smooth(method = "glm", method.args = list(family = "binomial")) + # Step 3
ggtitle("Logistic regression model fit") +
xlab("LoyalCH") +
ylab("Probability of Buying MM")
# Plot the curve between Y and X (using Tidyverse syntax)
oj.fix %>%
mutate(Purchase = ifelse(Purchase == "CH", 0, 1)) %>% # Step 1
ggplot(aes(LoyalCH, Purchase)) +
geom_point(alpha = .15) + # Step 2
geom_smooth(method = "glm", method.args = list(family = "binomial")) + # Step 3
ggtitle("Logistic regression model fit") +
xlab("LoyalCH") +
ylab("Probability of Buying MM")
As you can see in the graphs above, the probability of buying Minute Maid goes down as LoyalCH goes up.
The graph below shows what happens when the non-event is changed to buying Minute Maid (see the second line of code below). The curve is now flipped.
# Plot the curve between Y and X (using Tidyverse syntax)
oj.fix %>%
mutate(Purchase = ifelse(Purchase == "MM", 0, 1)) %>% # Step 1 (CH is changed to MM here)
ggplot(aes(LoyalCH, Purchase)) +
geom_point(alpha = .15) + # Step 2
geom_smooth(method = "glm", method.args = list(family = "binomial")) + # Step 3
ggtitle("Logistic regression model fit") +
xlab("LoyalCH") +
ylab("Probability of Buying CH")