Introduction

There are many factors consumers take into consideration when making purchase decisions. Price, brand, and suggestions from friends and family are just some of these factors. In this demonstration, we will work through a simple example: predicting the probability that a consumer will choose Coke over Pepsi using a modeling technique called logistic regression.

A Brief Overview of Logistic Regression

In simplest terms, logistic regression is a type of regression model where the response variable is categorical. In this example, the categorical response variable is the consumer’s decision to choose either Coke or Pepsi. There are two main models for a binary response: the logit model and the probit model. The two are similar and often yield nearly identical results. The probit model is rooted in the standard normal distribution, while the logit model is rooted in the logistic distribution. Because the logit model’s coefficients have a direct interpretation as log odds, and because the logistic CDF is a closed-form expression (unlike the CDF of the standard normal distribution, which underlies the probit model), the logit model is generally easier to work with. Thus, we will be using the logit model to predict the probability a consumer will choose Coke over Pepsi.
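How close are the two? A quick sketch: using the classic 1.702 scaling constant, the logistic CDF and a rescaled standard normal CDF differ by less than 0.01 everywhere.

x <- seq(-6, 6, by = 0.01)
# Maximum gap between the logistic CDF and a rescaled normal CDF;
# with the 1.702 scaling constant this is roughly 0.0095.
max(abs(plogis(x) - pnorm(x / 1.702)))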

The Logit Model

To understand how the logit model works, recall in the previous section that it’s based on the logistic distribution. If \(L\) is a logistic random variable, then its probability density function is

\[ \begin{aligned} \lambda(l)=\frac{e^{-l}}{(1+e^{-l})^2}, \qquad -\infty<l<\infty \end{aligned} \]

The corresponding cumulative distribution function is

\[\begin{aligned} \Lambda(l)=P[L \le l]=\frac{1}{1+e^{-l}} \end{aligned} \]

The CDF of a logistic distribution has the following S-shaped curve:
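R’s built-in plogis() function is exactly this CDF, so the curve is easy to reproduce. A minimal base-graphics sketch:

curve(plogis, from = -6, to = 6,
      xlab = 'l', ylab = expression(Lambda(l)),
      main = 'CDF of the Logistic Distribution')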

Pay close attention to this S-shaped curve. Remember, we are trying to predict the probability that a consumer will make one choice over the other. By definition this means that the fitted values of our logit model (i.e. the probabilities) must fall between 0 and 1. Notice in the S-shaped curve that while \(l\) ranges over the entire real line, the values \(\Lambda(l)\) can take are limited to \((0, 1)\).

In the logit model, the probability \(p\) that the observed value of your response variable \(y\) (i.e. the probability that a consumer will choose Coke over Pepsi) takes the value 1 is:

\[ \begin{aligned} p=P[L \le \beta_1+\beta_2x]=\Lambda(\beta_1+\beta_2x)=\frac{1}{1+e^{-(\beta_1+\beta_2x)}} \end{aligned} \]

The above equation can be expressed in a more useful form. The probability that \(y=1\) can be expressed as

\[ \begin{aligned} p=\frac{1}{1+e^{-(\beta_1+\beta_2x)}}=\frac{\exp(\beta_1+\beta_2x)}{1+\exp(\beta_1+\beta_2x)} \end{aligned} \]

and the probability that \(y=0\) is

\[ \begin{aligned} 1-p=\frac{1}{1+\exp(\beta_1+\beta_2x)} \end{aligned} \]
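Dividing \(p\) by \(1-p\) gives the odds in favor of \(y=1\), and taking the natural log recovers the log-odds interpretation mentioned earlier:

\[ \begin{aligned} \ln\left(\frac{p}{1-p}\right)=\beta_1+\beta_2x \end{aligned} \]

This is why the logit coefficients can be read directly as the effect of \(x\) on the log odds.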

Enough math for now. Our goal is to estimate the parameters \(\beta_1\) and \(\beta_2\) from the data. Let’s do some actual coding!

Loading, Inspecting, and Transforming the Data

Let’s first load and inspect our data.

coke <- read.csv('/Users/cyobero/Desktop/coke.csv')
head(coke, n = 20)
##    coke pr.pepsi pr.coke
## 1     1     1.79    1.79
## 2     1     1.79    0.89
## 3     1     1.41    0.89
## 4     1     1.79    1.33
## 5     1     1.79    1.79
## 6     1     0.99    1.79
## 7     1     0.77    1.79
## 8     1     1.33    1.79
## 9     1     1.79    0.99
## 10    1     1.79    1.29
## 11    1     0.99    1.79
## 12    1     0.99    1.79
## 13    0     0.99    1.79
## 14    0     1.79    1.33
## 15    1     1.79    1.33
## 16    1     1.79    1.79
## 17    0     0.99    0.99
## 18    0     1.79    1.79
## 19    0     1.33    1.79
## 20    0     1.79    1.79
str(coke)
## 'data.frame':    1140 obs. of  3 variables:
##  $ coke    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ pr.pepsi: num  1.79 1.79 1.41 1.79 1.79 0.99 0.77 1.33 1.79 1.79 ...
##  $ pr.coke : num  1.79 0.89 0.89 1.33 1.79 1.79 1.79 1.79 0.99 1.29 ...

Our data consists of 1,140 observations and 3 variables. The coke variable indicates the consumer’s choice (1 if the consumer chose Coke and 0 if the consumer chose Pepsi). The pr.pepsi and pr.coke variables represent the price of Pepsi and Coke, respectively. Now, we could proceed with a logit model that uses both prices as independent variables, but to keep things simple, let’s create a new variable called cp.ratio, representing the price ratio of Coke to Pepsi, and use it as our single regressor.

coke$cp.ratio <- coke$pr.coke / coke$pr.pepsi
head(coke)
##   coke pr.pepsi pr.coke  cp.ratio
## 1    1     1.79    1.79 1.0000000
## 2    1     1.79    0.89 0.4972067
## 3    1     1.41    0.89 0.6312057
## 4    1     1.79    1.33 0.7430168
## 5    1     1.79    1.79 1.0000000
## 6    1     0.99    1.79 1.8080808

Splitting Data Into Training and Test Sets

Let’s now split our data into training and test sets.

coke.train <- coke[1:912, ]
coke.test <- coke[913:1140, ]
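One caveat: taking the first 912 rows as the training set assumes the rows are in random order. If they weren’t, a random split would be a safer default; a minimal sketch (the seed value is arbitrary):

set.seed(123)  # arbitrary seed, for reproducibility only
train.idx <- sample(nrow(coke), size = 912)
coke.train <- coke[train.idx, ]
coke.test <- coke[-train.idx, ]

We’ll keep the sequential split here so the output below matches.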

Fitting Our Model

The glm() function is what we’ll be using to fit our model.

coke.train.fit <- glm(coke ~ cp.ratio, family = binomial(link = 'logit'), data = coke.train)
summary(coke.train.fit)
## 
## Call:
## glm(formula = coke ~ cp.ratio, family = binomial(link = "logit"), 
##     data = coke.train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6580  -1.0082  -0.5911   1.1707   2.9541  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   2.5611     0.3230   7.928 2.22e-15 ***
## cp.ratio     -2.9732     0.3233  -9.197  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1227.7  on 911  degrees of freedom
## Residual deviance: 1120.5  on 910  degrees of freedom
## AIC: 1124.5
## 
## Number of Fisher Scoring iterations: 4
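Since this is a logit model, the coefficients are on the log-odds scale. Exponentiating them gives odds ratios, which are often easier to communicate:

exp(coef(coke.train.fit))
# exp(-2.9732) is about 0.051: a one-unit increase in the Coke/Pepsi price
# ratio multiplies the odds of choosing Coke by roughly 0.05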

Before we use our model to make predictions on our test data, let’s examine the fitted curve of our logit model.

library(ggplot2)
coke.train$prob.coke <- coke.train.fit$fitted.values
g <- ggplot(coke.train, aes(y = prob.coke, x = cp.ratio)) + geom_line(col = 'red')
g + labs(y = 'Probability of Choosing Coke', x = 'Coke/Pepsi Price Ratio')

Our plot seems to make sense: as the price of Coke relative to Pepsi increases, a consumer becomes less likely to choose Coke. What’s interesting here is that if the price of Coke is equivalent to the price of Pepsi, the data suggests that the average consumer has a preference for Pepsi, since the model puts the probability of choosing Coke at only about 0.4.
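You can check that 0.4 figure directly from the fitted coefficients. At cp.ratio = 1 the linear predictor is \(2.5611 - 2.9732 \approx -0.41\), and passing it through the logistic CDF gives:

plogis(2.5611 - 2.9732 * 1)  # about 0.398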

Let’s use our model to make predictions about our test data. Note that the threshold that a consumer will choose Coke is

\[\begin{aligned} \hat{y}= \begin{cases} 1, & \hat{p} \ge 0.5 \\ 0, & \hat{p}<0.5 \end{cases} \end{aligned} \]

coke.test$prob.coke <- predict.glm(coke.train.fit, coke.test)  # log-odds by default
coke.test$prob.coke <- plogis(coke.test$prob.coke)  # Converts log-odds to probabilities
coke.test$coke.hat <- ifelse(coke.test$prob.coke >= 0.5, 1, 0)
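Equivalently, predict.glm() can return probabilities directly via its type argument, collapsing the first two lines above into one:

coke.test$prob.coke <- predict.glm(coke.train.fit, coke.test, type = 'response')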

Let’s compare our results:

library(gmodels)
coke.test$coke <- ifelse(coke.test$coke == 1, 'Coke', 'Pepsi')
coke.test$coke.hat <- ifelse(coke.test$coke.hat == 1, 'Coke', 'Pepsi')
CrossTable(coke.test$coke, coke.test$coke.hat, prop.chisq = FALSE, prop.r = FALSE, prop.c = FALSE, 
           dnn = c('Actual', 'Predicted'))
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  228 
## 
##  
##              | Predicted 
##       Actual |      Coke |     Pepsi | Row Total | 
## -------------|-----------|-----------|-----------|
##         Coke |        48 |        97 |       145 | 
##              |     0.211 |     0.425 |           | 
## -------------|-----------|-----------|-----------|
##        Pepsi |        11 |        72 |        83 | 
##              |     0.048 |     0.316 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |        59 |       169 |       228 | 
## -------------|-----------|-----------|-----------|
## 
## 

The performance of our model is actually pretty poor: it correctly predicted only 52.6% of our test data. There are ways to improve our model performance. Perhaps the Coke/Pepsi price ratio isn’t a good indicator of whether a consumer will choose Coke. We could also potentially add more variables to our model. As mentioned previously, consumers tend to make purchase decisions based on more than one factor, so perhaps there are other variables that consumers take into consideration when choosing Coke or Pepsi.
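For reference, that accuracy figure comes straight from the confusion matrix:

(48 + 72) / 228  # 0.526: correct predictions / total
mean(coke.test$coke == coke.test$coke.hat)  # same accuracy, computed from the data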

Conclusion

Logistic regression is a regression model where the response variable is categorical. There are two main models for a binary response: the probit model and the logit model. Probit models are rooted in the standard normal distribution, whereas logit models are rooted in the logistic distribution. We used a logit model to predict the probability that a consumer will choose Coke. Although our model’s performance was poor, there are certainly ways we can improve on it. Adding more relevant variables is one possibility. Logistic regression can be a powerful tool when trying to predict the probability that one alternative will be chosen over the other.