The Multi-Nomial Logit (MNL) model is an extension of the Binary Logit (Logistic Regression) model we learned before.

The Binary Logit model describes a choice between two options, such as yes or no, or buy or not buy. The MNL model describes a choice among multiple options. For example, if a store carries four brands of coffee, which brand are you going to purchase?

The MNL model has its origin in the Bay Area. Daniel McFadden, Professor of Economics and 2000 Nobel Laureate in Economics, developed and applied the MNL model in 1972 to forecast the share of commuters who would ride BART. Back then, BART was still under construction and not yet operating. The official forecast was 15%; his model forecast 6.3%. The actual number turned out to be 6.2%. For reference, please see these speech notes by McFadden. This line of work earned him the Nobel Prize in 2000, a big bonus after many years.

The MNL model became popular in the area of transportation demand forecasting, especially in forecasting the ridership share of public transit. In 1983, a famous paper by Guadagni and Little, published in Marketing Science, applied the MNL model to study how customers make choices among multiple brands. This paper opened the door for the MNL model to enter the Marketing field, where it has gained a great deal of attention among marketers.

MNL Model Introduction

When a decision maker faces a set of alternatives and can choose only one of them, the MNL model is a widely used approach for understanding how different factors influence the choice.

To apply the MNL model, the setup requires two conditions:

Each decision maker faces a fixed, known set of options. In other words, as modelers, we know which options each decision maker had when making the decision.
Each decision maker makes exactly one choice, selecting a single alternative.

Here is an example list of possible applications of an MNL model, although it is far from complete:

Choosing a brand or a product among a few
Choosing a link to click among a list of links
Choosing a person to connect to among a list of recommended friends
Choosing an action to take among the possible actions
Assigning a customer to one segment among a few segments (discriminant analysis)

MNL model analysis can help achieve the following goals:

Identify how choices/trade-offs are made by customers
Identify the weights of each factor influencing these choices
Understand what options are more attractive to what customers
Calculate propensity scores
Find the target customers
Make predictions

To estimate an MNL model, we need data on:

The alternatives available to each decision maker at the time of the choice. For example, the transportation modes they can choose from when commuting, or the brands available on the grocery store shelf when deciding which brand to purchase.
The chosen alternative, i.e. the actual alternative each decision maker decided to choose.
The descriptions (variables) of each alternative, including the alternatives that the decision makers didn’t choose. This can be challenging, as in most cases the data only record information about the chosen alternative.

For example, scanner panel data from a grocery store usually do not record which brands of yogurt were available when each customer made a choice, let alone the prices of the brands that were not chosen. In this context, however, many people visit the store on the same day, and some of them purchase the other brands, so the information on those brands can be recovered from their transactions.

In some other cases, it may be much harder to gather information on all the choice alternatives. For example, when you search for a hotel room in Chicago, the list of hotels you see is generated by an algorithm in the search engine powering the website. Different customers are likely to see very different lists of hotels. It is therefore the responsibility of the website to make sure that the list is recorded in the database. However, it is a lot easier to record which hotel you clicked on than to record which hotel options you saw. Without such information, it is impossible to collect the descriptions of all the alternatives a customer faced before making a choice decision.

The description of the decision maker. Demographic information is always useful in marketing, especially for reaching more generalizable findings, such as that customers with certain demographics prefer certain types of products.

As collecting and recording customer-level data becomes common practice among companies, one important piece of information such data can provide is how each customer makes choice decisions through trade-offs. For example, some customers may think brand name is very important and only buy a particular brand, without paying attention to minor price changes. Other customers may be more interested in finding deals and are very sensitive to price changes, without paying attention to the brand they get.

As another example, an online travel website creates a list of hotels for a customer who’s looking to book a hotel. After seeing those options, the customer chooses a hotel room to book. Each hotel room is represented by a bundle of features, such as price, distance to downtown, hotel brand, whether free breakfast is offered, etc. The customer is choosing a desirable bundle of these features, although not every feature is at its most desirable level. For example, the customer may choose a hotel that is far away from the downtown area but cheaper than another one located downtown. The MNL model can help us understand such trade-offs. Once we understand how each customer makes decisions, we can provide the list of options that we expect to give the customer the highest chance of making a purchase. This is the idea behind a recommendation system.

Finally, in the Binary Logit session, we presented a case on how the BL model helps to identify the right customers to target by calculating propensity scores. The MNL model can do that as well, in the context of multiple alternatives.

In the next sessions, we are going to analyze the following case:

A grocery store sells three brands of yogurt and has collected individual-level purchase data for each brand over a certain period of time. The manufacturer of Stonyfield is interested in working with the store to identify the right customers to send 20% discount coupons to.

To achieve that, we developed the following analysis:

Develop an MNL model to understand how each customer makes decisions, and to estimate the price parameter that governs choices among these three alternatives.
Create the scenario in which a 20% coupon is applied to Stonyfield, and recalculate the purchase probabilities of each customer. The manufacturer wants to target those customers whose purchase probabilities increase the most.

MNL Model Specifications

Let’s get ready to run R and get started by loading the data.

#install.packages("mlogit") #it needs R version to be 3.5 or newer
#install.packages("data.table")
library(mlogit)
library(data.table)
yogurtdata = fread("yogurt_3brands.csv")
names(yogurtdata)

##  [1] "Index"      "Stonyfield" "Yoplait"    "Dannon"     "Feature_S" 
##  [6] "Feature_Y"  "Feature_D"  "Price_S"    "Price_Y"    "Price_D"   
## [11] "Income"     "HH Size"    "Pan ID"

head(yogurtdata)

##    Index Stonyfield Yoplait Dannon Feature_S Feature_Y Feature_D Price_S
## 1:     1          0       0      1         0         0         0   0.108
## 2:     2          0       1      0         0         0         0   0.108
## 3:     3          0       1      0         0         0         0   0.108
## 4:     4          0       1      0         0         0         0   0.108
## 5:     5          0       1      0         0         0         0   0.125
## 6:     6          0       1      0         0         0         0   0.108
##    Price_Y Price_D Income HH Size Pan ID
## 1:   0.081   0.061      9       2      1
## 2:   0.098   0.064      9       2      1
## 3:   0.098   0.061      9       2      1
## 4:   0.098   0.061      9       2      1
## 5:   0.098   0.049      9       2      1
## 6:   0.092   0.050      9       2      1

In this dataset, each row is one purchase incidence by an individual. The individual is identified by the column Pan ID. The same individual may appear in multiple rows, indicating multiple purchases.

The three columns after Index (Stonyfield, Yoplait and Dannon) tell us which one of the three brands was chosen in that particular purchase incidence by that particular individual.
The next three columns list the feature variable for each of the three brands.
The three columns after that list the price variable for each of the three brands.
The 11th column is the income value for the household.
The 12th column is the size of the household.

These data contain two sets of information: product-related variables (feature and price) and customer-related variables (income and HH size). Each product-related variable varies across the three brands, so the data has three columns for each of them. Each customer-related variable has only one column.

To set up a model, we need to introduce the concept of Latent Utility. It is “latent” because analysts cannot observe it in the data, but it describes the utility that a customer derives from purchasing each product. The probability model assumes a customer will choose the product that provides the highest latent utility to him/her.

The latent utility can be specified as a function of both the product information and the customer demographic variables. The latent utility for each customer (indexed by \(i\)) for each brand can be specified as

Stonyfield \(U_{is}=\beta_{0s}+\beta_1Price_s+\beta_2Feature_s+\beta_3Income_i+\beta_4HHsize_i+\epsilon_{is}\)
Yoplait \(U_{iy}=\beta_{0y}+\beta_1Price_y+\beta_2Feature_y+\beta_3Income_i+\beta_4HHsize_i+\epsilon_{iy}\)
Dannon \(U_{id}=\beta_{0d}+\beta_1Price_d+\beta_2Feature_d+\beta_3Income_i+\beta_4HHsize_i+\epsilon_{id}\)

A particular brand is chosen because it has the highest latent utility for that customer. That is,

\(P(Choice_i=s)=P(U_{is}\ge U_{iy},U_{is}\ge U_{id})\)
\(P(Choice_i=y)=P(U_{iy}\ge U_{id},U_{iy}\ge U_{is})\)
\(P(Choice_i=d)=P(U_{id}\ge U_{is},U_{id}\ge U_{iy})\)

In other words, we do not actually care about the exact value of these latent utilities \(U_{is},U_{iy},U_{id}\), but just their relative values.

To derive the above probabilities, we need to know the distributions of \(\epsilon_{is},\epsilon_{iy},\epsilon_{id}\). They are assumed to be independent, and all of them follow the same distribution, the Type I Extreme Value (Gumbel) distribution. Under this assumption, the difference between any two of them follows the Logistic Distribution.

The Logistic Distribution is very similar to the Normal distribution, except that it has slightly thicker tails.

With these distributional assumptions, we can derive the following probability functions

\[P(Choice_i=s)=\frac{\exp(V_{is})}{\exp(V_{is})+\exp(V_{iy})+\exp(V_{id})}\]

\[P(Choice_i=y)=\frac{\exp(V_{iy})}{\exp(V_{is})+\exp(V_{iy})+\exp(V_{id})}\]

\[P(Choice_i=d)=\frac{\exp(V_{id})}{\exp(V_{is})+\exp(V_{iy})+\exp(V_{id})}\]

Each \(V\) represents the observed part of the latent utility, which is the part with \(\beta\) and \(X\). That is

\[V_{is}=U_{is}-\epsilon_{is}=\beta_{0s}+\beta_1Price_s+\beta_2Feature_s+\beta_3Income_i+\beta_4HHsize_i\]

\[V_{iy}=U_{iy}-\epsilon_{iy}=\beta_{0y}+\beta_1Price_y+\beta_2Feature_y+\beta_3Income_i+\beta_4HHsize_i\]

\[V_{id}=U_{id}-\epsilon_{id}=\beta_{0d}+\beta_1Price_d+\beta_2Feature_d+\beta_3Income_i+\beta_4HHsize_i\]
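The link between "choose the brand with the highest latent utility" and the closed-form probabilities can be checked numerically. The following is a minimal sketch in base R, not part of the original analysis: the \(V\) values are hypothetical, the helper name mnl_prob is made up for illustration, and the errors are drawn from a standard Gumbel distribution, which is the usual MNL assumption.

```r
# Closed-form MNL probabilities: exp(V_j) / sum_k exp(V_k)
mnl_prob <- function(V) exp(V) / sum(exp(V))

# Hypothetical observed utilities for one customer (illustrative values only)
V <- c(Stonyfield = -1.5, Yoplait = -1.2, Dannon = -1.4)
p_formula <- mnl_prob(V)

# Monte Carlo check: add i.i.d. standard Gumbel errors to V and
# record which brand has the highest latent utility U = V + epsilon
set.seed(1)
n_draws <- 200000
eps <- matrix(-log(-log(runif(n_draws * 3))), ncol = 3)  # standard Gumbel draws
U <- sweep(eps, 2, V, "+")          # add V[j] to every entry of column j
choice <- max.col(U)                # column index of the largest utility per row
p_sim <- tabulate(choice, nbins = 3) / n_draws
names(p_sim) <- names(V)

# The two rows should agree up to simulation noise
print(round(rbind(formula = p_formula, simulated = p_sim), 3))
```

The simulated choice shares should be close to the formula values, which is what the derivation claims.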

If we could get the \(V\) or the \(U\) values for all brands, we could estimate all the model parameters, as we learned in the Linear Regression model. However, can we get the \(V\) or the \(U\) values?

Note that these three probability calculations share the same denominator but differ in the numerator. Now, let’s examine the probability function for choosing Stonyfield. Suppose we add an arbitrary constant \(\color{red}C\) to all the \(V\) values. The probability calculation for choosing Stonyfield becomes \[P'(Choice_i=s)=\frac{\exp(V_{is}+\color{red}{C})}{\exp(V_{is}+\color{red}{C})+\exp(V_{iy}+\color{red}{C})+\exp(V_{id}+\color{red}{C})}\]

\[=\frac{\exp(V_{is})\times\color{red}{\exp(C)}}{\exp(V_{is})\color{red}{\exp(C)}+\exp(V_{iy})\color{red}{\exp(C)}+\exp(V_{id}) \color{red}{\exp(C)}}\]

In this equation, all the \(\exp(\color{red}C)\) can be canceled out, therefore the above calculation is exactly the same as the calculation we got without adding that constant \(\color{red}C\), that is \[P(Choice_i=s)=P'(Choice_i=s)\] This causes a problem, as it indicates even though we can get the probability values to fit the data, the \(V\) values are not well defined. This is called the \(\color{red}\text{Identification Problem}\), as the data would not be able to provide enough information for the analysts to obtain estimates for all the model parameters.
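The cancellation can also be seen numerically. This is a small sketch with hypothetical \(V\) values and an arbitrary constant, not output from the yogurt data:

```r
# Hypothetical observed utilities and an arbitrary constant C
V <- c(Stonyfield = 0.4, Yoplait = 1.1, Dannon = 0.3)
C <- 5

p_original <- exp(V) / sum(exp(V))          # probabilities from V
p_shifted  <- exp(V + C) / sum(exp(V + C))  # probabilities from V + C

# TRUE: the data cannot distinguish V from V + C
print(all.equal(p_original, p_shifted))
```

Any other value of C gives the same probabilities, so only differences between the \(V\) values carry information.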

The Identification Problem arises because the model has too many parameters. To deal with that, we need to get rid of some of them. The values of the \(V\)’s are not identified, but their differences are: the alternative with the highest \(V\) value has the highest choice probability, no matter what constant is added. Therefore, we just need to fix one of the intercepts to be 0. For example, we can choose the intercept for Dannon to be 0, \(\beta_{0d}=0\).

When we have \(B\) alternatives, we can only estimate \(B-1\) alternative specific intercepts.

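Similarly, as the variables describing the decision makers are the same across the three alternatives for the same decision maker, we cannot include these variables with a free parameter in all three alternatives either. With a common coefficient, as in the utility equations above, the terms \(\beta_3Income_i+\beta_4HHsize_i\) act exactly like the constant \(C\) and cancel out.

In general, when we have \(B\) alternatives, we can estimate at most \(B-1\) alternative-specific parameters for each variable that takes the same value across all \(B\) alternatives.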

Comparing the Binary Logit model with the MNL model

In a Binary Logit model, we get \[P(Y=1)=\frac{\exp(V)}{1+\exp(V)}\] Noting that \(1=\exp(0)\) in the denominator, the above equation can be written as \[P(Y=1)=\frac{\exp(V)}{\exp(0)+\exp(V)}\] In other words, we are comparing two alternatives, with the other alternative set to have \(V=0\), so no additional parameters need to be estimated for it. This is for the same reason, Identification, that we discussed for the MNL model.
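As a quick check of this equivalence, the sketch below compares the Binary Logit formula with a two-alternative MNL in which the second alternative has \(V=0\). The \(V\) values are hypothetical; plogis() is base R's logistic CDF.

```r
# Hypothetical observed utilities for the "Y = 1" alternative
V <- c(-2, -0.5, 0, 0.5, 2)

# Binary Logit formula
p_binary <- exp(V) / (1 + exp(V))

# Two-alternative MNL where the other alternative has V = 0
p_mnl <- sapply(V, function(v) {
  expV <- exp(c(v, 0))
  expV[1] / sum(expV)
})

print(all.equal(p_binary, p_mnl))      # same probabilities
print(all.equal(p_binary, plogis(V)))  # and both match the logistic CDF
```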

Likelihood Function for the MNL model

As discussed in the Binary Logit model class, the Likelihood function is defined as the probability that each data point takes the value it takes. To obtain the likelihood value for each data point in the context of the MNL model, we need to calculate the probability of the chosen alternative.

For example, if the \(i\)th data point says the chosen alternative is Stonyfield, the likelihood for that data point is the probability of Stonyfield being chosen, that is \(P(Y_i=Stonyfield)\)

The MNL model gives the functional form for calculating the probability of all three brands at each data point. Let \(Y_{ij}=1\) if alternative \(j\) is chosen at data point \(i\), and \(0\) otherwise. The likelihood for data point \(i\) can then be written as \[L_i=\sum_{j=1}^3[P(Y_{ij}=1)\times Y_{ij}]\]
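The formula simply picks out the probability of the chosen brand. Here is a minimal sketch with made-up probabilities for three choice occasions (the numbers are hypothetical, not estimates from the yogurt data); the log-likelihood that estimation maximizes is the sum of \(\log L_i\).

```r
# Hypothetical choice probabilities: one row per choice occasion
P <- rbind(c(0.2, 0.5, 0.3),
           c(0.6, 0.3, 0.1),
           c(0.3, 0.3, 0.4))
colnames(P) <- c("Stonyfield", "Yoplait", "Dannon")

# Indicator matrix Y: 1 for the chosen brand, 0 otherwise
Y <- rbind(c(0, 1, 0),   # occasion 1: Yoplait chosen
           c(1, 0, 0),   # occasion 2: Stonyfield chosen
           c(0, 0, 1))   # occasion 3: Dannon chosen

L_i <- rowSums(P * Y)      # likelihood of each data point: 0.5, 0.6, 0.4
log_lik <- sum(log(L_i))   # log-likelihood of the whole sample

print(L_i)
print(log_lik)
```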

Limitations of the MNL model

The MNL model is very popular, for two main reasons:

It can predict choice behavior quite accurately.
Its probability function is very easy to calculate. With that, we can always get an easy likelihood function and estimate the model.

It does come with limitations as well. The most important limitation arises from its nice closed form in probability calculations. This can be seen from the following scenario.

In the above example, we can calculate the ratio between the purchase probabilities of Stonyfield and Yoplait \[\frac{P(Choice_i=s)}{P(Choice_i=y)}=\frac{\frac{\exp(V_{is})}{\exp(V_{is})+\exp(V_{iy})+\exp(V_{id})}}{\frac{\exp(V_{iy})}{\exp(V_{is})+\exp(V_{iy})+\exp(V_{id})}}=\frac{\exp(V_{is})}{\exp(V_{iy})}\] This is due to the fact that both probability calculations share the same denominator.

This implies that the ratio between the purchase probabilities of these two brands has nothing to do with the third brand, Dannon. This property is known as the Independence of Irrelevant Alternatives (IIA). It may be an issue if Dannon is much closer to one brand than to the other. Suppose Dannon is much closer to Yoplait: a price reduction in Dannon would then be expected to lead more Yoplait customers than Stonyfield customers to switch to Dannon, yet the MNL model forces both brands to lose share in the same proportion.
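The constant ratio can be illustrated with a small numerical sketch. The \(V\) values are hypothetical, and the increase in Dannon's \(V\) stands in for a Dannon price cut:

```r
mnl_prob <- function(V) exp(V) / sum(exp(V))  # illustrative helper

# Hypothetical observed utilities before and after Dannon becomes more attractive
V_before <- c(Stonyfield = 0.4, Yoplait = 1.1, Dannon = 0.3)
V_after  <- V_before
V_after["Dannon"] <- V_after["Dannon"] + 1

p_before <- mnl_prob(V_before)
p_after  <- mnl_prob(V_after)

# The Stonyfield/Yoplait ratio is unchanged by the change in Dannon
ratio_before <- unname(p_before["Stonyfield"] / p_before["Yoplait"])
ratio_after  <- unname(p_after["Stonyfield"] / p_after["Yoplait"])
print(c(before = ratio_before, after = ratio_after))

# Both brands lose share to Dannon in exactly the same proportion
share_kept <- p_after[c("Stonyfield", "Yoplait")] / p_before[c("Stonyfield", "Yoplait")]
print(share_kept)
```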

To solve such problems, more sophisticated models have been developed that incorporate a tree-type structure to capture similarities among alternatives (the Nested Logit model is one example). They are not covered in this note.

Estimating an MNL model

Note that in estimating an MNL model, it is not as easy as a regression or logistic regression (Binary Logit) model to just find the Y variable and the X variables. In addition, we need to know

which X variables should enter the latent utility equation of which alternative
which X variables should have the same parameters across multiple alternatives, and which should have different parameters across these alternatives.

To do that, we first need to get the data into a format from which the estimation software can parse the above information.

First, it requires a Choice variable specifying the chosen alternatives as factors.

# Create a Choice variable that lists the choice made
yogurtdata[Stonyfield==1, Choice := "Stonyfield"]
yogurtdata[Dannon==1, Choice := "Dannon"]
yogurtdata[Yoplait==1, Choice := "Yoplait"]
yogurtdata[, Choice := as.factor(Choice)]
yogurtdata[, c("Stonyfield","Dannon","Yoplait"):= NULL]#remove these three columns

setnames(yogurtdata, c("Feature_S", "Feature_D", "Feature_Y", "HH Size", "Pan ID"), 
                c("Feature.Stonyfield", "Feature.Dannon", "Feature.Yoplait",
                  "HHSize", "PanID"))
setnames(yogurtdata, c("Price_S", "Price_D", "Price_Y"), 
                  c("Price.Stonyfield", "Price.Dannon", "Price.Yoplait"))

head(yogurtdata)

##    Index Feature.Stonyfield Feature.Yoplait Feature.Dannon Price.Stonyfield
## 1:     1                  0               0              0            0.108
## 2:     2                  0               0              0            0.108
## 3:     3                  0               0              0            0.108
## 4:     4                  0               0              0            0.108
## 5:     5                  0               0              0            0.125
## 6:     6                  0               0              0            0.108
##    Price.Yoplait Price.Dannon Income HHSize PanID  Choice
## 1:         0.081        0.061      9      2     1  Dannon
## 2:         0.098        0.064      9      2     1 Yoplait
## 3:         0.098        0.061      9      2     1 Yoplait
## 4:         0.098        0.061      9      2     1 Yoplait
## 5:         0.098        0.049      9      2     1 Yoplait
## 6:         0.092        0.050      9      2     1 Yoplait

Now we need to tell R about the data, using the function mlogit.data(). Within this function, we can specify

the data format, using shape. If each row is a choice occasion containing the information for all the choice alternatives, use shape="wide"; if each choice occasion is spread over multiple rows, one row per choice alternative, use shape="long"
which variables are alternative-specific, meaning they take different values for each alternative, such as price and feature, using the option varying=1:6, meaning the first 6 columns once Index is dropped
which variable identifies the individual, using id="PanID"
which variable is the choice decision, using choice="Choice"

# Create dataset in the "mlogit" format using mlogit.data() command
yl = mlogit.data(yogurtdata[,-c("Index" )], shape="wide", 
                 choice="Choice", id="PanID", varying=1:6)
head(yl)

##      Income HHSize PanID Choice        alt Feature Price chid
## 1319      9      2     1   TRUE     Dannon       0 0.061    1
## 1         9      2     1  FALSE Stonyfield       0 0.108    1
## 660       9      2     1  FALSE    Yoplait       0 0.081    1
## 1320      9      2     1  FALSE     Dannon       0 0.064    2
## 2         9      2     1  FALSE Stonyfield       0 0.108    2
## 661       9      2     1   TRUE    Yoplait       0 0.098    2

The data is ready. Now we need to

Get the formula
Estimate the model

When writing the formula to be estimated, the parameters for each variable can be alternative specific, or common to all choice options. The pattern in the formula is:

Choice Variable ~ Alternative-specific variables (feature, price) with a common coefficient | Individual-specific variables (income and hhsize) with an alternative-specific coefficient | Alternative-specific variables (feature and price) with an alternative-specific coefficient

f <- mFormula(Choice ~ Feature+Price | Income + HHSize)


# Estimate the model
ml <- mlogit(f, yl, reflevel="Dannon")
summary(ml)

## 
## Call:
## mlogit(formula = Choice ~ Feature + Price | Income + HHSize, 
##     data = yl, reflevel = "Dannon", method = "nr")
## 
## Frequencies of alternatives:
##     Dannon Stonyfield    Yoplait 
##    0.33687    0.33080    0.33232 
## 
## nr method
## 4 iterations, 0h:0m:0s 
## g'(-H)^-1g = 8.68E-08 
## gradient close to zero 
## 
## Coefficients :
##                          Estimate Std. Error z-value  Pr(>|z|)    
## Stonyfield:(intercept)   1.572326   0.369253  4.2581 2.061e-05 ***
## Yoplait:(intercept)      2.848940   0.318431  8.9468 < 2.2e-16 ***
## Feature                  0.371186   0.206549  1.7971   0.07232 .  
## Price                  -23.480763   3.667916 -6.4017 1.537e-10 ***
## Stonyfield:Income       -0.125584   0.030431 -4.1268 3.678e-05 ***
## Yoplait:Income          -0.218509   0.030981 -7.0529 1.752e-12 ***
## Stonyfield:HHSize        0.265701   0.116981  2.2713   0.02313 *  
## Yoplait:HHSize          -0.096554   0.115666 -0.8348   0.40385    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Log-Likelihood: -632.78
## McFadden R^2:  0.12595 
## Likelihood ratio test : chisq = 182.36 (p.value = < 2.22e-16)

Estimation Results Interpretations

The values of the parameters are hard to interpret, so let’s focus on the signs and some relative values.

The intercepts for both Stonyfield and Yoplait are positive, indicating that, everything else being equal, Dannon is the least preferred brand.
The Feature parameter is the same for all brands. It is positive, but only marginally significant (p is about 0.07).
The Price parameter is the same for all brands. It is negative and statistically significant.
The Income parameters for both brands are negative, meaning that, holding everything else the same, families with higher income tend to prefer Dannon. The effect is smaller in magnitude for Stonyfield (-0.126) than for Yoplait (-0.219), so as income rises, households shift away from Yoplait faster than from Stonyfield.
The HHSize parameter for Stonyfield is positive and significant, meaning that, holding everything else constant, larger families tend to prefer Stonyfield over Dannon. The parameter for Yoplait is not statistically different from zero, meaning household size does not appear to shift preference between Dannon and Yoplait.
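One way to put the magnitudes on a more readable scale is to exponentiate them. Because the ratio of two choice probabilities is \(\exp(V_j)/\exp(V_k)\), a coefficient times a change in its variable gives a multiplicative change in the odds. The sketch below copies the estimates from the summary output above; the 0.01 price change is an arbitrary illustrative step in whatever units the Price columns use.

```r
# Estimates copied from the summary(ml) output above
b_price  <- -23.480763
b_income <- c(Stonyfield = -0.125584, Yoplait = -0.218509)

# Multiplicative change in a brand's odds against any other brand
# when its own price rises by 0.01, holding everything else fixed
odds_price_up <- exp(b_price * 0.01)

# Multiplicative change in the odds of each brand versus Dannon
# for a one-unit increase in Income
odds_income <- exp(b_income)

print(round(odds_price_up, 3))
print(round(odds_income, 3))
```

Values below 1 mean the odds shrink: a price increase hurts the brand, and higher income moves households toward Dannon.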

This concludes our first step in solving the case problem.

In the second step, we change the price value for Stonyfield and recalculate the purchase probabilities for each brand by each individual.

ydatanew = copy(yogurtdata)  # copy() so that := below does not also modify yogurtdata
ydatanew[, Price.Stonyfield := Price.Stonyfield*.8]  # apply the 20% coupon
ylnew = mlogit.data(ydatanew[,-c("Index" )], shape="wide", 
                    choice="Choice", id="PanID", varying=1:6)
prob = predict(ml,yl)
probnew=predict(ml,ylnew)
colMeans(prob)

##     Dannon Stonyfield    Yoplait 
##  0.3368741  0.3308042  0.3323217

colMeans(probnew)

##     Dannon Stonyfield    Yoplait 
##  0.2800955  0.4360050  0.2838995
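To target customers, the manufacturer needs the increase in Stonyfield's purchase probability per purchase occasion rather than the averages. The sketch below recomputes the probabilities by hand for the six rows printed by head(yogurtdata), using the estimates copied from the summary output. It is an illustration of the calculation under those copied values, not a replacement for predict(); the helper name p_stonyfield is made up, and the Feature term is left out because Feature is 0 for every brand in these rows.

```r
# Estimates copied from the summary(ml) output (Dannon is the reference brand)
b <- list(int_S = 1.572326, int_Y = 2.848940, price = -23.480763,
          inc_S = -0.125584, inc_Y = -0.218509,
          hh_S = 0.265701, hh_Y = -0.096554)

# The six rows shown by head(yogurtdata); all belong to PanID 1
d <- data.frame(Price_S = c(0.108, 0.108, 0.108, 0.108, 0.125, 0.108),
                Price_Y = c(0.081, 0.098, 0.098, 0.098, 0.098, 0.092),
                Price_D = c(0.061, 0.064, 0.061, 0.061, 0.049, 0.050),
                Income = 9, HHSize = 2)

# Probability of choosing Stonyfield, with an optional discount on its price
p_stonyfield <- function(d, discount = 0) {
  V_s <- b$int_S + b$price * d$Price_S * (1 - discount) +
    b$inc_S * d$Income + b$hh_S * d$HHSize
  V_y <- b$int_Y + b$price * d$Price_Y +
    b$inc_Y * d$Income + b$hh_Y * d$HHSize
  V_d <- b$price * d$Price_D   # intercept and demographic terms fixed at 0
  exp(V_s) / (exp(V_s) + exp(V_y) + exp(V_d))
}

d$p_base   <- p_stonyfield(d)                   # current prices
d$p_coupon <- p_stonyfield(d, discount = 0.20)  # 20% coupon on Stonyfield
d$lift     <- d$p_coupon - d$p_base             # increase in purchase probability

# Occasions sorted by how much the coupon raises the Stonyfield probability
print(d[order(-d$lift), ])
```

In the full session, the analogous per-occasion lift would be probnew[, "Stonyfield"] - prob[, "Stonyfield"], which could then be averaged by PanID to rank households. That assumes the rows of the prediction matrices follow the row order of yogurtdata.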

Multi-Nomial Logit model

Xiaojing Dong

February 12, 2020