The Multinomial Logit (MNL) model is an extension of the Binary Logit (Logistic Regression) model we learned before.
The Binary Logit model describes a choice between two options, such as yes or no, buy or not buy. The MNL model describes a choice among multiple options. For example, if there are four brands of coffee in the store, which brand are you going to purchase?
The MNL model has its origin in the Bay Area. Daniel McFadden, Professor of Economics and 2000 Nobel Laureate in Economics, developed the MNL model and applied it in 1972 to forecast the ride share of BART. Back then, BART was still under construction and not yet operating. The official forecast was 15%; his model forecast 6.3%. The actual number turned out to be 6.2%. For reference, please see the speech notes by McFadden. His work on discrete choice models earned him the Nobel Prize in Economics in 2000, a big bonus after many years.
The MNL model became popular in the area of transportation demand forecasting, especially in forecasting the ride share for public transit. In 1983, a famous paper by Guadagni and Little, published in Marketing Science, applied the MNL model to study how customers choose among multiple brands. This paper opened the door for the MNL model to enter the Marketing field, where it has gained a great deal of attention among marketers.
When a decision maker faces a set of alternatives and can choose only one of them, the MNL model is a natural approach for understanding how different factors influence the decision.
To apply the MNL model, the setup requires two conditions:
Here is an example list of possible applications of an MNL model, but this is far from a complete list
The MNL model analysis can help achieve the following goals:
In order to estimate an MNL model, we would need information/data regarding
For example, scanner panel data from a grocery store usually do not record which brands of yogurt were available when each customer made a choice, let alone the prices of the brands that were not chosen. With grocery scanner panel data, this gap can often be filled: many people visit the store on the same day, and some of them purchase the other brands, so the information on those brands is recorded through their purchases.
In some other cases, it may be much harder to gather information on all the choice alternatives. For example, when you search for a hotel room in Chicago, the list of hotels you see is generated by an algorithm in the search engine powering the website. Different customers are likely to see very different lists of hotels. It is therefore the responsibility of the website to make sure that the list is recorded in the database. However, it is a lot easier to record which hotel you clicked on than to record which hotel options you saw. Without such information, it is impossible to collect the descriptions of all the alternatives a customer faces before making a choice decision.
As collecting and recording customer-level data becomes common practice among companies, one important piece of information such data can provide is how each customer makes choice decisions through trade-offs. For example, some customers may think the brand name is very important and only buy a particular brand, without paying attention to minor price changes. Other customers may be more interested in finding deals and are very sensitive to price changes, without paying attention to the brand they get.
As another example, an online travel website creates a list of hotels for a customer who is looking to book a hotel. After seeing those options, the customer chooses a hotel room to book. Each hotel room is represented by a bundle of features, such as price, distance to downtown, hotel brand, whether free breakfast is offered, etc. The customer is choosing a desirable bundle of these features, although not every feature is at its desirable level. For example, the customer may choose a hotel that is far away from the downtown area but cheaper than another one located downtown. The MNL model can help us understand such trade-offs. Once we understand how each customer makes decisions, we can provide the list of options that we expect to give the customer a higher chance of making a purchase. This is the idea behind a recommendation system.
Finally, in the Binary Logit session, we presented a case on how the BL model helps identify the right customers to target by calculating propensity scores. The MNL model can do that as well, in the context of multiple alternatives.
In the next sessions, we are going to analyze the following case:
A grocery store sells three brands of yogurt and has collected individual-level purchase data for each brand over a certain period of time. The manufacturer of Stonyfield is interested in working with the store to identify the right customers to send 20% discount coupons to.
To achieve that, we developed the following analysis:
Let’s get ready to run R and get started by loading the data.
#install.packages("mlogit") #it needs R version to be 3.5 or newer
#install.packages("data.table")
library(mlogit)
library(data.table)
yogurtdata = fread("yogurt_3brands.csv")
names(yogurtdata)
## [1] "Index" "Stonyfield" "Yoplait" "Dannon" "Feature_S"
## [6] "Feature_Y" "Feature_D" "Price_S" "Price_Y" "Price_D"
## [11] "Income" "HH Size" "Pan ID"
head(yogurtdata)
## Index Stonyfield Yoplait Dannon Feature_S Feature_Y Feature_D Price_S
## 1: 1 0 0 1 0 0 0 0.108
## 2: 2 0 1 0 0 0 0 0.108
## 3: 3 0 1 0 0 0 0 0.108
## 4: 4 0 1 0 0 0 0 0.108
## 5: 5 0 1 0 0 0 0 0.125
## 6: 6 0 1 0 0 0 0 0.108
## Price_Y Price_D Income HH Size Pan ID
## 1: 0.081 0.061 9 2 1
## 2: 0.098 0.064 9 2 1
## 3: 0.098 0.061 9 2 1
## 4: 0.098 0.061 9 2 1
## 5: 0.098 0.049 9 2 1
## 6: 0.092 0.050 9 2 1
In this dataset, each row is one purchase incident by an individual. The individual is identified by the column Pan ID. The same individual may appear in multiple rows, indicating multiple purchases.
The data include a feature variable and a price variable for each of the three brands. Overall, we have two sets of information: product-related variables (feature and price) and customer-related variables (income and HH size). Each product-related variable varies across the three brands, so the data has three columns for each of them. Each customer-related variable has only one column.
To set up a model, we need to introduce a concept called Latent Utility. It is “latent” because analysts cannot observe it from the data, but it describes the utility that a customer derives from purchasing each product. The probability model assumes a customer will choose the product that provides the highest latent utility to him/her.
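The latent utility can be specified as a function of both the product information and the customer demographic variables. For customer \(i\), the latent utility of each brand (with \(s\), \(y\), \(d\) denoting Stonyfield, Yoplait and Dannon) can be specified as
\[U_{is}=\beta_{0s}+\beta_1Price_s+\beta_2Feature_s+\beta_3Income_i+\beta_4HHsize_i+\epsilon_{is}\]
\[U_{iy}=\beta_{0y}+\beta_1Price_y+\beta_2Feature_y+\beta_3Income_i+\beta_4HHsize_i+\epsilon_{iy}\]
\[U_{id}=\beta_{0d}+\beta_1Price_d+\beta_2Feature_d+\beta_3Income_i+\beta_4HHsize_i+\epsilon_{id}\]
where each \(\epsilon\) is the part of the utility that the analyst cannot observe.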
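A particular brand is chosen because it provides the highest latent utility to that customer. For example, the probability that customer \(i\) chooses Stonyfield is
\[P(Choice_i=s)=P(U_{is}>U_{iy}\ \text{and}\ U_{is}>U_{id})\]
and the probabilities of choosing Yoplait and Dannon are defined in the same way.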
In other words, we do not actually care about the exact value of these latent utilities \(U_{is},U_{iy},U_{id}\), but just their relative values.
To derive the above probabilities, we need to know the distributions of \(\epsilon_{is},\epsilon_{iy},\epsilon_{id}\). They are assumed to be independent, and all of them follow the same distribution, the Type I Extreme Value (Gumbel) distribution. Under this assumption, the difference between any two of them follows the Logistic Distribution.
The Logistic Distribution is very similar to the Normal distribution, only that the logistic distribution has slightly thicker tails than the normal distribution.
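The link between "choose the highest latent utility" and the logistic distribution can be checked by simulation. The sketch below is only an illustration: the \(V\) values are made up, and the errors are drawn as independent Type I Extreme Value (Gumbel) variables, the standard assumption behind the logit family. It compares simulated choice shares with the closed-form logit probabilities, and the difference of two error draws with the logistic CDF (`plogis`).

```r
# Simulation sketch: Gumbel errors + "pick the highest utility" vs. the logit formula
set.seed(123)

# Hypothetical observed utilities for one customer (not estimated from any data)
V <- c(Stonyfield = 0.5, Yoplait = 1.0, Dannon = 0.0)
n_draws <- 200000

# Gumbel draws via the inverse CDF: -log(-log(U)), with U uniform on (0, 1)
eps <- matrix(-log(-log(runif(n_draws * length(V)))), ncol = length(V))

# Latent utility U = V + epsilon, one row per simulated choice occasion
U <- sweep(eps, 2, V, "+")

# The customer picks the alternative with the highest latent utility
choice <- max.col(U, ties.method = "first")
sim_share <- tabulate(choice, nbins = length(V)) / n_draws
names(sim_share) <- names(V)

# Closed-form logit probabilities for comparison
logit_prob <- exp(V) / sum(exp(V))

# Difference of two independent Gumbel draws, compared with the logistic CDF at 1
emp_cdf_at_1 <- mean(eps[, 1] - eps[, 2] <= 1)

print(round(rbind(simulated = sim_share, formula = logit_prob), 3))
print(c(empirical = emp_cdf_at_1, logistic = plogis(1)))
```

The two rows of the first table, and the two numbers in the second printout, should agree up to simulation noise.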
With that distributional assumption, we can derive the following probability functions:
\[P(Choice_i=s)=\frac{\exp(V_{is})}{\exp(V_{is})+\exp(V_{iy})+\exp(V_{id})}\]
\[P(Choice_i=y)=\frac{\exp(V_{iy})}{\exp(V_{is})+\exp(V_{iy})+\exp(V_{id})}\]
\[P(Choice_i=d)=\frac{\exp(V_{id})}{\exp(V_{is})+\exp(V_{iy})+\exp(V_{id})}\]
Each \(V\) represents the observed part of the latent utility, which is the part with \(\beta\) and \(X\). That is
\[V_{is}=U_{is}-\epsilon_{is}=\beta_{0s}+\beta_1Price_s+\beta_2Feature_s+\beta_3Income_i+\beta_4HHsize_i\]
\[V_{iy}=U_{iy}-\epsilon_{iy}=\beta_{0y}+\beta_1Price_y+\beta_2Feature_y+\beta_3Income_i+\beta_4HHsize_i\]
\[V_{id}=U_{id}-\epsilon_{id}=\beta_{0d}+\beta_1Price_d+\beta_2Feature_d+\beta_3Income_i+\beta_4HHsize_i\]
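As a small worked example of the probability formulas, the sketch below computes the three choice probabilities for one choice occasion. The coefficient values are hypothetical and chosen only for illustration (they are not estimates), and the income and household size terms are left out to keep the arithmetic short.

```r
# Hypothetical coefficients, for illustration only (not estimates from the data)
beta0        <- c(Stonyfield = 1.0, Yoplait = 2.0, Dannon = 0.0)  # brand intercepts
beta_price   <- -20
beta_feature <- 0.4

# One choice occasion: price and feature indicator for each brand
price   <- c(Stonyfield = 0.108, Yoplait = 0.098, Dannon = 0.061)
feature <- c(Stonyfield = 0,     Yoplait = 0,     Dannon = 0)

# Observed part of the latent utility for each brand
V <- beta0 + beta_price * price + beta_feature * feature

# MNL probabilities: exp(V_j) divided by the sum of exp(V) over all brands
p <- exp(V) / sum(exp(V))

print(round(V, 3))
print(round(p, 3))
```

The brand with the highest \(V\) gets the highest probability, and the three probabilities always sum to 1.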
If we could get the \(V\) or the \(U\) values for all brands, we could estimate all the model parameters, as we learned in the Linear Regression model. However, can we get the \(V\) or the \(U\) values?
Note that these three probability calculations share the same denominator but differ in the numerator. Now, let’s examine the probability function for choosing Stonyfield. Suppose we add an arbitrary constant \(\color{red}C\) to all the \(V\) values. The probability calculation for choosing Stonyfield becomes
\[P'(Choice_i=s)=\frac{\exp(V_{is}+\color{red}{C})}{\exp(V_{is}+\color{red}{C})+\exp(V_{iy}+\color{red}{C})+\exp(V_{id}+\color{red}{C})}\]
\[=\frac{\exp(V_{is})\times\color{red}{\exp(C)}}{\exp(V_{is})\color{red}{\exp(C)}+\exp(V_{iy})\color{red}{\exp(C)}+\exp(V_{id}) \color{red}{\exp(C)}}\]
In this equation, all the \(\exp(\color{red}C)\) can be canceled out, therefore the above calculation is exactly the same as the calculation we got without adding that constant \(\color{red}C\), that is \[P(Choice_i=s)=P'(Choice_i=s)\] This causes a problem, as it indicates even though we can get the probability values to fit the data, the \(V\) values are not well defined. This is called the \(\color{red}\text{Identification Problem}\), as the data would not be able to provide enough information for the analysts to obtain estimates for all the model parameters.
The Identification Problem arises because there are too many parameters. To deal with it, we need to get rid of some of them. The values of the \(V\)’s are not identified, but their differences are: the alternative with the highest \(V\) value has the highest choice probability. Therefore, we just need to fix one of the intercepts to be 0. For example, we can choose the intercept for Dannon to be 0, \(\beta_{0d}=0\).
When we have \(B\) alternatives, we can only estimate \(B-1\) alternative specific intercepts.
Similarly, as the variables describing the decision maker take the same values across the three alternatives for the same decision maker, we cannot estimate a separate parameter for these variables in all three alternatives either.
In general, when we have \(B\) alternatives, we can estimate at most \(B-1\) parameters for each variable that is the same across all \(B\) alternatives.
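The identification argument can be verified numerically. In the sketch below, the \(V\) values, the constant \(C\), and the income coefficient are all hypothetical. It shows that adding a constant, adding a common income term, or subtracting Dannon's \(V\) from every alternative all leave the choice probabilities unchanged.

```r
mnl_prob <- function(V) exp(V) / sum(exp(V))

# Hypothetical observed utilities for one customer
V <- c(Stonyfield = -1.16, Yoplait = 0.04, Dannon = -1.22)
p0 <- mnl_prob(V)

# 1) Adding an arbitrary constant C to every V changes nothing
C  <- 5
p1 <- mnl_prob(V + C)

# 2) An income term with the same coefficient for every brand is just such a
#    constant for a given customer, so it cannot be identified
b3 <- 0.7      # hypothetical common income coefficient
income <- 9
p2 <- mnl_prob(V + b3 * income)

# 3) Normalization: measure every V relative to Dannon, so that V_d = 0
V_norm <- V - V["Dannon"]
p3 <- mnl_prob(V_norm)

print(rbind(p0, p1, p2, p3))
```

All four rows are identical, which is why only differences in \(V\) (and hence only \(B-1\) intercepts or individual-specific coefficients) can be learned from choice data.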
In a Binary Logit model, we get
\[P(Y=1)=\frac{\exp(V)}{1+\exp(V)}\]
If you notice that, in the denominator, \(1=\exp(0)\), the above equation can be written as
\[P(Y=1)=\frac{\exp(V)}{\exp(0)+\exp(V)}\]
In other words, we are comparing two alternatives, and the other alternative is set to have \(V=0\), so no additional parameters need to be estimated for it. This is for the same reason, Identification, that we discussed for the MNL model.
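A quick numerical check of this equivalence, using a few arbitrary \(V\) values: the Binary Logit formula, a two-alternative MNL with the other alternative fixed at \(V=0\), and base R's logistic CDF `plogis()` all give the same probabilities.

```r
# Arbitrary V values for illustration
V <- c(-2, 0, 1.5)

# Binary Logit formula
p_binary <- exp(V) / (1 + exp(V))

# Two-alternative MNL where the other alternative has V = 0
p_mnl <- sapply(V, function(v) exp(v) / sum(exp(c(0, v))))

# Base R's logistic CDF
p_plogis <- plogis(V)

print(cbind(V, p_binary, p_mnl, p_plogis))
```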
As discussed in the Binary Logit model class, the Likelihood function is defined as the probability that each data point takes the value it actually takes. To obtain the likelihood value for each data point in the context of the MNL model, we need to calculate the probability of the chosen alternative.
For example, if the \(i\)th data point says the chosen alternative is Stonyfield, the likelihood for that data point is the probability of Stonyfield being chosen, that is \(P(Y_i=Stonyfield)\).
The MNL model gives the functional form for calculating the probability of all three brands at each data point. The likelihood for data point \(i\) can then be written as
\[L_i=\sum_{j=1}^3[P(Y_{ij}=1)\times Y_{ij}]\]
where \(Y_{ij}\) equals 1 if alternative \(j\) is the chosen one at data point \(i\) and 0 otherwise, so the sum simply picks out the probability of the chosen alternative.
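The likelihood formula can be illustrated with a toy example. The probabilities and choices below are made up. The last step, summing the logs of the \(L_i\) over data points to get a log-likelihood, is the standard construction for independent observations and is stated here as an assumption rather than something derived in this note.

```r
# Hypothetical model probabilities for three data points (rows) and three brands
P <- rbind(c(0.20, 0.50, 0.30),
           c(0.60, 0.30, 0.10),
           c(0.25, 0.25, 0.50))
colnames(P) <- c("Stonyfield", "Yoplait", "Dannon")

# Observed choices, one per data point
chosen <- c("Yoplait", "Stonyfield", "Dannon")

# Indicator matrix: Y[i, j] = 1 if brand j was chosen at data point i
Y <- outer(chosen, colnames(P), "==") * 1

# L_i = sum over j of P(Y_ij = 1) * Y_ij, i.e. the probability of the chosen brand
L_i <- rowSums(P * Y)

# Log-likelihood of the whole sample, assuming independent data points
loglik <- sum(log(L_i))

print(L_i)
print(loglik)
```

Estimation searches for the parameter values that make this log-likelihood as large as possible.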
The MNL model is very popular for two main reasons:
It does come with limitations as well. The most important limitation arises from its nice closed form in probability calculations. This can be seen from the following scenario.
In the above example, we can calculate the ratio between the purchase probabilities of Stonyfield and Yoplait
\[\frac{P(Choice_i=s)}{P(Choice_i=y)}=\frac{\frac{\exp(V_{is})}{\exp(V_{is})+\exp(V_{iy})+\exp(V_{id})}}{\frac{\exp(V_{iy})}{\exp(V_{is})+\exp(V_{iy})+\exp(V_{id})}}=\frac{\exp(V_{is})}{\exp(V_{iy})}\]
The simplification is possible because both probability calculations share the same denominator.
This implies that the ratio between the purchase probabilities of these two brands has nothing to do with the third brand, Dannon. This property is known as the Independence of Irrelevant Alternatives (IIA). It may be an issue if Dannon is a much closer substitute for one brand than for the other. Suppose Dannon is much closer to Yoplait: in reality, a price reduction in Dannon would lead more Yoplait customers than Stonyfield customers to switch to Dannon, but the MNL model forces both brands to lose customers in the same proportion.
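The limitation can be seen in a short numerical sketch. The \(V\) values are hypothetical, and a Dannon price cut is represented simply as an increase in Dannon's \(V\).

```r
mnl_prob <- function(V) exp(V) / sum(exp(V))

# Hypothetical observed utilities before the Dannon price cut
V_before <- c(Stonyfield = 0.2, Yoplait = 0.9, Dannon = 0.0)

# A Dannon price cut raises Dannon's V (by 1 here, an arbitrary amount)
V_after <- V_before
V_after["Dannon"] <- V_after["Dannon"] + 1

p_before <- mnl_prob(V_before)
p_after  <- mnl_prob(V_after)

# Ratio of Stonyfield to Yoplait probabilities, before and after
ratio_before <- p_before[["Stonyfield"]] / p_before[["Yoplait"]]
ratio_after  <- p_after[["Stonyfield"]]  / p_after[["Yoplait"]]

# Proportional loss in probability for the two brands
loss_share <- 1 - p_after[c("Stonyfield", "Yoplait")] / p_before[c("Stonyfield", "Yoplait")]

print(c(ratio_before = ratio_before, ratio_after = ratio_after))
print(loss_share)
```

The ratio stays at \(\exp(V_{is}-V_{iy})\) and both brands lose the same fraction of their probability, no matter how similar Dannon is to either of them.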
To solve such problems, more sophisticated models are developed, by incorporating a tree type structure that captures similarities among alternatives. That is not covered in this note.
Note that estimating an MNL model is not as easy as estimating a regression or logistic regression (Binary Logit) model, where we just need to identify the Y variable and the X variables. In addition, we need to know
To do that, we first need to get the data into a format from which the estimation software can parse the above information.
First, it requires a Choice variable specifying the chosen alternatives as factors.
# Create a Choice variable that lists the choice made
yogurtdata[Stonyfield==1, Choice := "Stonyfield"]
yogurtdata[Dannon==1, Choice := "Dannon"]
yogurtdata[Yoplait==1, Choice := "Yoplait"]
yogurtdata[, Choice := as.factor(Choice)]
yogurtdata[, c("Stonyfield","Dannon","Yoplait") := NULL] # remove these three columns
# Rename to the Variable.Alternative pattern so the brand can be parsed from the column name
setnames(yogurtdata, c("Feature_S", "Feature_D", "Feature_Y", "HH Size", "Pan ID"),
         c("Feature.Stonyfield", "Feature.Dannon", "Feature.Yoplait",
           "HHSize", "PanID"))
setnames(yogurtdata, c("Price_S", "Price_D", "Price_Y"),
         c("Price.Stonyfield", "Price.Dannon", "Price.Yoplait"))
head(yogurtdata)
## Index Feature.Stonyfield Feature.Yoplait Feature.Dannon Price.Stonyfield
## 1: 1 0 0 0 0.108
## 2: 2 0 0 0 0.108
## 3: 3 0 0 0 0.108
## 4: 4 0 0 0 0.108
## 5: 5 0 0 0 0.125
## 6: 6 0 0 0 0.108
## Price.Yoplait Price.Dannon Income HHSize PanID Choice
## 1: 0.081 0.061 9 2 1 Dannon
## 2: 0.098 0.064 9 2 1 Yoplait
## 3: 0.098 0.061 9 2 1 Yoplait
## 4: 0.098 0.061 9 2 1 Yoplait
## 5: 0.098 0.049 9 2 1 Yoplait
## 6: 0.092 0.050 9 2 1 Yoplait
Now we need to tell R about the data, using the function mlogit.data(). Within this function, we can specify
- shape. If each row is an observation, with information on all the choice alternatives, use shape="wide"; if each choice occasion is spread over multiple rows, with one row per choice alternative, use shape="long"
- varying=1:6, meaning the first 6 columns (the alternative-specific Feature and Price columns)
- id=PanID, the column identifying the individual
- choice=Choice, the column recording the chosen alternative
# Create dataset in the "mlogit" format using mlogit.data() command
yl = mlogit.data(yogurtdata[,-c("Index" )], shape="wide",
                 choice="Choice", id="PanID", varying=1:6)
head(yl)
## Income HHSize PanID Choice alt Feature Price chid
## 1319 9 2 1 TRUE Dannon 0 0.061 1
## 1 9 2 1 FALSE Stonyfield 0 0.108 1
## 660 9 2 1 FALSE Yoplait 0 0.081 1
## 1320 9 2 1 FALSE Dannon 0 0.064 2
## 2 9 2 1 FALSE Stonyfield 0 0.108 2
## 661 9 2 1 TRUE Yoplait 0 0.098 2
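To see what the wide-to-long conversion does, here is a rough base R sketch on a two-row toy table that mimics the first rows of the data. It uses `reshape()` rather than `mlogit.data()`, so it is an analogy and not the package's actual implementation; the `chid` column (choice occasion id) is added by hand, and the `alt` column name is chosen to match the output above.

```r
# Toy wide data: one row per choice occasion (values mimic the rows shown earlier)
wide <- data.frame(
  chid = 1:2,
  Feature.Stonyfield = c(0, 0), Feature.Yoplait = c(0, 0), Feature.Dannon = c(0, 0),
  Price.Stonyfield = c(0.108, 0.108), Price.Yoplait = c(0.081, 0.098),
  Price.Dannon = c(0.061, 0.064),
  Income = c(9, 9),
  Choice = c("Dannon", "Yoplait"),
  stringsAsFactors = FALSE
)

# Columns that vary across alternatives follow the Variable.Alternative pattern
varying_cols <- grep("^(Feature|Price)\\.", names(wide), value = TRUE)

# Long format: one row per (choice occasion, alternative)
long <- reshape(wide, direction = "long",
                varying = varying_cols, sep = ".",
                timevar = "alt", idvar = "chid")

# Turn Choice into TRUE/FALSE: was this row's alternative the chosen one?
long$Choice <- long$Choice == long$alt

# Sort so that the three alternatives of each choice occasion sit together
long <- long[order(long$chid, long$alt), ]
print(long)
```

Each choice occasion now occupies three rows, one per brand, with exactly one row marked as chosen.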
The data is ready. Now we need to specify the model formula and estimate the model.
When writing the formula to be estimated, the parameters for each variable can be alternative specific, or common to all choice options. The pattern in the formula is:
Choice variable ~ Alternative-specific variables (feature, price) with a common coefficient | Individual-specific variables (income and hhsize) with an alternative-specific coefficient | Alternative-specific variables (feature and price) with an alternative-specific coefficient
f <- mFormula(Choice ~ Feature+Price | Income + HHSize)
# Estimate the model
ml <- mlogit(f, yl, reflevel="Dannon")
summary(ml)
##
## Call:
## mlogit(formula = Choice ~ Feature + Price | Income + HHSize,
## data = yl, reflevel = "Dannon", method = "nr")
##
## Frequencies of alternatives:
## Dannon Stonyfield Yoplait
## 0.33687 0.33080 0.33232
##
## nr method
## 4 iterations, 0h:0m:0s
## g'(-H)^-1g = 8.68E-08
## gradient close to zero
##
## Coefficients :
## Estimate Std. Error z-value Pr(>|z|)
## Stonyfield:(intercept) 1.572326 0.369253 4.2581 2.061e-05 ***
## Yoplait:(intercept) 2.848940 0.318431 8.9468 < 2.2e-16 ***
## Feature 0.371186 0.206549 1.7971 0.07232 .
## Price -23.480763 3.667916 -6.4017 1.537e-10 ***
## Stonyfield:Income -0.125584 0.030431 -4.1268 3.678e-05 ***
## Yoplait:Income -0.218509 0.030981 -7.0529 1.752e-12 ***
## Stonyfield:HHSize 0.265701 0.116981 2.2713 0.02313 *
## Yoplait:HHSize -0.096554 0.115666 -0.8348 0.40385
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Log-Likelihood: -632.78
## McFadden R^2: 0.12595
## Likelihood ratio test : chisq = 182.36 (p.value = < 2.22e-16)
The values of the parameters are hard to interpret, so let’s focus on the signs and some relative values:
- The Feature parameter is the same for all brands. It is positive, but only marginally significant (p = 0.072, significant at the 10% level but not at the 5% level).
- The Price parameter is the same for all brands. It is negative and statistically significant.
- The Income parameters for both Stonyfield and Yoplait are negative, meaning that, holding everything else the same, families with higher income tend to prefer Dannon over either brand. The effect is stronger for Yoplait (-0.219) than for Stonyfield (-0.126).
- The HHSize parameter for Stonyfield is positive, meaning that, holding everything else constant, larger families tend to prefer Stonyfield over Dannon. The parameter for Yoplait is not statistically different from zero, meaning household size does not appear to shift the preference between Dannon and Yoplait.
This concludes our first step in solving the case problem.
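To make the signs more concrete, the estimates from the summary output can be plugged into the MNL probability formula by hand. The sketch below does this for a household like the one in the first data row (Income 9 and HHSize 2, as coded in the data). The helper `predict_share()` is a hypothetical name introduced for this illustration and is not part of mlogit.

```r
# Estimates copied from the summary output (Dannon is the reference brand)
b0      <- c(Dannon = 0, Stonyfield = 1.572326,  Yoplait = 2.848940)
b_inc   <- c(Dannon = 0, Stonyfield = -0.125584, Yoplait = -0.218509)
b_hh    <- c(Dannon = 0, Stonyfield = 0.265701,  Yoplait = -0.096554)
b_feat  <- 0.371186
b_price <- -23.480763

# Hypothetical helper: MNL probabilities for one household and one set of prices
predict_share <- function(price, feature, income, hhsize) {
  brands <- names(b0)
  V <- b0 + b_feat * feature[brands] + b_price * price[brands] +
       b_inc * income + b_hh * hhsize
  exp(V) / sum(exp(V))
}

price   <- c(Dannon = 0.061, Stonyfield = 0.108, Yoplait = 0.081)
feature <- c(Dannon = 0,     Stonyfield = 0,     Yoplait = 0)

p_base <- predict_share(price, feature, income = 9, hhsize = 2)

# Same household, 20% lower Stonyfield price
price_cut <- price
price_cut["Stonyfield"] <- 0.8 * price["Stonyfield"]
p_cut <- predict_share(price_cut, feature, income = 9, hhsize = 2)

# Same prices, higher income
p_rich <- predict_share(price, feature, income = 12, hhsize = 2)

print(round(rbind(p_base, p_cut, p_rich), 3))
```

Comparing the rows shows the negative price coefficient at work (a cheaper Stonyfield gains share) and the negative income coefficients (a higher-income household leans toward Dannon).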
In the second step, we reduce the price of Stonyfield by 20% and recalculate the purchase probabilities of each brand for each individual.
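# copy() is needed: with data.table, "ydatanew = yogurtdata" only creates a second
# name for the same table, and := would then change yogurtdata as well
ydatanew = copy(yogurtdata)
ydatanew[, Price.Stonyfield := Price.Stonyfield * 0.8] # 20% discount on Stonyfield
ylnew = mlogit.data(ydatanew[,-c("Index" )], shape="wide",
                    choice="Choice", id="PanID", varying=1:6)
prob = predict(ml, yl)          # predicted probabilities at the original prices
probnew = predict(ml, ylnew)    # predicted probabilities with the discounted price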
colMeans(prob)
## Dannon Stonyfield Yoplait
## 0.3368741 0.3308042 0.3323217
colMeans(probnew)
## Dannon Stonyfield Yoplait
## 0.2800955 0.4360050 0.2838995
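One possible way to go from these two sets of probabilities to a target list is to rank customers by how much the discount raises their probability of buying Stonyfield. The sketch below uses made-up probabilities for five purchase occasions by three customers. The ranking rule (average lift per customer) is an assumption made for illustration, not necessarily the analysis intended for this case, and it ignores coupon cost and margin.

```r
# Hypothetical inputs: customer id and Stonyfield probability per purchase occasion
pan_id        <- c(1, 1, 2, 3, 3)
prob_s_before <- c(0.20, 0.25, 0.60, 0.10, 0.15)  # at the original price
prob_s_after  <- c(0.35, 0.40, 0.68, 0.30, 0.33)  # with the 20% discount

# Lift: how much the discount raises the purchase probability on each occasion
lift <- prob_s_after - prob_s_before

# Average lift per customer across that customer's purchase occasions
lift_by_customer <- tapply(lift, pan_id, mean)

# Rank customers from the largest to the smallest average lift
ranked <- sort(lift_by_customer, decreasing = TRUE)

# Target the top 2 customers (the cutoff is arbitrary here)
top_customers <- names(ranked)[1:2]

print(round(ranked, 3))
print(top_customers)
```

With the objects created earlier, the analogous inputs would presumably be `probnew[, "Stonyfield"] - prob[, "Stonyfield"]` and `yogurtdata$PanID`, assuming the rows of `prob` are in the same order as the rows of `yogurtdata`.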