First, load the data

#install.packages("mlogit") #requires R version 3.5 or newer
#install.packages("data.table")
library(mlogit)
library(data.table)
yogurtdata = fread("yogurt_3brands.csv")

#Rename the columns so that the software can recognize which column belongs to which choice alternative
#The name of each choice alternative needs to be consistent with the names used in the Choice variable

setnames(yogurtdata, c("Feature_S", "Feature_D", "Feature_Y", "HH Size", "Pan ID"), 
                c("Feature.Stonyfield", "Feature.Dannon", "Feature.Yoplait",
                  "HHSize", "PanID"))
setnames(yogurtdata, c("Price_S", "Price_D", "Price_Y"), 
                  c("Price.Stonyfield", "Price.Dannon", "Price.Yoplait"))

Second, set up the model

Prepare the data for estimating the following MNL model, specified through the latent utility function of each of the three brands:

Stonyfield \(U_{is}= \beta_1Price_s+\beta_2Feature_s+\epsilon_{is}\)
Yoplait \(U_{iy}=\beta_{0y}+\beta_1Price_y+\beta_2Feature_y+\beta_3Income_i+\beta_4HHsize_i+\epsilon_{iy}\)
Dannon \(U_{id}=\beta_{0d}+\beta_1Price_d+\beta_2Feature_d+\beta_3Income_i+\beta_4HHsize_i+\epsilon_{id}\)
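
As a worked equation for the utilities above: under the standard MNL assumption that the error terms \(\epsilon_{ij}\) are i.i.d. type I extreme value, the choice probabilities have a closed form. The symbol \(V_{ij}\) is introduced here for illustration only; it denotes the deterministic part of \(U_{ij}\), that is, \(U_{ij}=V_{ij}+\epsilon_{ij}\).

\[P_{ij}=\frac{\exp(V_{ij})}{\exp(V_{is})+\exp(V_{iy})+\exp(V_{id})}, \quad j \in \{s, y, d\}\]

Only differences in utility matter for these probabilities, which is why one brand carries no intercept and no Income or HHsize terms.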

You need to create an additional column in yogurtdata, called “Choice”, indicating the choices made by each person, and set this column to be a factor.

# Create a Choice variable that lists the choice made
yogurtdata[Stonyfield==1, Choice := "Stonyfield"]
yogurtdata[Dannon==1, Choice := "Dannon"]
yogurtdata[Yoplait==1, Choice := "Yoplait"]
yogurtdata[, Choice := as.factor(Choice)]
yogurtdata[, c("Stonyfield","Dannon","Yoplait"):= NULL]#remove these three columns
head(yogurtdata)

##    Index Feature.Stonyfield Feature.Yoplait Feature.Dannon Price.Stonyfield
## 1:     1                  0               0              0            0.108
## 2:     2                  0               0              0            0.108
## 3:     3                  0               0              0            0.108
## 4:     4                  0               0              0            0.108
## 5:     5                  0               0              0            0.125
## 6:     6                  0               0              0            0.108
##    Price.Yoplait Price.Dannon Income HHSize PanID  Choice
## 1:         0.081        0.061      9      2     1  Dannon
## 2:         0.098        0.064      9      2     1 Yoplait
## 3:         0.098        0.061      9      2     1 Yoplait
## 4:         0.098        0.061      9      2     1 Yoplait
## 5:         0.098        0.049      9      2     1 Yoplait
## 6:         0.092        0.050      9      2     1 Yoplait

Then you need to put the data into a format that the package understands, using mlogit.data()

yl = mlogit.data(yogurtdata[,-c("Index" )], shape="wide", 
                 choice="Choice", id="PanID", varying=1:6)
head(yl)

##      Income HHSize PanID Choice        alt Feature Price chid
## 1319      9      2     1   TRUE     Dannon       0 0.061    1
## 1         9      2     1  FALSE Stonyfield       0 0.108    1
## 660       9      2     1  FALSE    Yoplait       0 0.081    1
## 1320      9      2     1  FALSE     Dannon       0 0.064    2
## 2         9      2     1  FALSE Stonyfield       0 0.108    2
## 661       9      2     1   TRUE    Yoplait       0 0.098    2

Third, now estimate the model

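The formula passed to mFormula() has three parts, separated by |:

Choice ~ X different, beta same | X same, beta different | X different, beta different

Here "X different" means the variable varies across alternatives (e.g., Price), "X same" means it takes the same value for every alternative (e.g., Income), and "beta same" or "beta different" says whether the coefficient is shared across alternatives or is alternative-specific.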

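# Part 1: Feature and Price vary by brand and share one coefficient each
# Part 2: Income and HHSize are household-level and get brand-specific coefficients
f <- mFormula(Choice ~ Feature+Price | Income + HHSize)
# Estimate the model; reflevel="Dannon" makes Dannon the reference alternative,
# so intercepts and household coefficients are reported relative to Dannon
ml <- mlogit(f, yl, reflevel="Dannon")
summary(ml)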

## 
## Call:
## mlogit(formula = Choice ~ Feature + Price | Income + HHSize, 
##     data = yl, reflevel = "Dannon", method = "nr")
## 
## Frequencies of alternatives:
##     Dannon Stonyfield    Yoplait 
##    0.33687    0.33080    0.33232 
## 
## nr method
## 4 iterations, 0h:0m:0s 
## g'(-H)^-1g = 8.68E-08 
## gradient close to zero 
## 
## Coefficients :
##                          Estimate Std. Error z-value  Pr(>|z|)    
## Stonyfield:(intercept)   1.572326   0.369253  4.2581 2.061e-05 ***
## Yoplait:(intercept)      2.848940   0.318431  8.9468 < 2.2e-16 ***
## Feature                  0.371186   0.206549  1.7971   0.07232 .  
## Price                  -23.480763   3.667916 -6.4017 1.537e-10 ***
## Stonyfield:Income       -0.125584   0.030431 -4.1268 3.678e-05 ***
## Yoplait:Income          -0.218509   0.030981 -7.0529 1.752e-12 ***
## Stonyfield:HHSize        0.265701   0.116981  2.2713   0.02313 *  
## Yoplait:HHSize          -0.096554   0.115666 -0.8348   0.40385    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Log-Likelihood: -632.78
## McFadden R^2:  0.12595 
## Likelihood ratio test : chisq = 182.36 (p.value = < 2.22e-16)

Q: Please interpret the model estimation results.

The intercepts for Stonyfield (1.57) and Yoplait (2.85) are both positive and significant. With Dannon as the reference alternative, this means that when the other variables in the model are held equal (and the household covariates are at zero), Dannon has the lowest baseline utility of the three brands.
The Feature coefficient is constrained to be the same for all brands. It is positive (0.37) but only marginally significant (p = 0.072, significant at the 10% level but not at the 5% level). Featuring a brand raises that brand's own utility by about 0.37, holding everything else constant.
The Price coefficient is also constrained to be the same for all brands. It is negative (-23.48) and highly significant. Prices in the rows shown above range from about 0.05 to 0.13, so a one-unit price increase is far outside the observed range; a 0.01 increase lowers a brand's utility by about 0.23, making it less competitive against the other brands.
The Income coefficients for Stonyfield (-0.13) and Yoplait (-0.22) are both negative and significant, meaning that, holding everything else the same, higher-income households tend to prefer Dannon over both brands. The effect is stronger for Yoplait, so higher income shifts preference away from Yoplait more than away from Stonyfield.
The HHSize coefficient for Stonyfield is positive (0.27) and significant at the 5% level, meaning that, holding everything else constant, larger households tend to prefer Stonyfield over Dannon. The coefficient for Yoplait (-0.10) is not statistically different from zero (p = 0.40), so household size does not detectably shift preference between Dannon and Yoplait.
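
To make the interpretation concrete, the sketch below plugs the estimates from the summary(ml) output into the MNL probability formula for the first purchase occasion shown in head(yogurtdata). The variable names (b_price, intercept, v, p, and so on) are chosen here for illustration and are not part of the original analysis; the coefficient and data values are copied from the outputs above.

```r
# Coefficients copied from the summary(ml) output (Dannon is the reference level,
# so its intercept and household coefficients are fixed at zero)
b_price   <- -23.480763
b_feature <-   0.371186
intercept <- c(Dannon = 0, Stonyfield = 1.572326,  Yoplait = 2.848940)
b_income  <- c(Dannon = 0, Stonyfield = -0.125584, Yoplait = -0.218509)
b_hhsize  <- c(Dannon = 0, Stonyfield = 0.265701,  Yoplait = -0.096554)

# First purchase occasion, copied from head(yogurtdata)
price   <- c(Dannon = 0.061, Stonyfield = 0.108, Yoplait = 0.081)
feature <- c(Dannon = 0, Stonyfield = 0, Yoplait = 0)
income  <- 9
hhsize  <- 2

# Deterministic utility of each brand
v <- intercept + b_price * price + b_feature * feature +
  b_income * income + b_hhsize * hhsize

# MNL choice probabilities: exponentiate and normalize
p <- exp(v) / sum(exp(v))
print(round(p, 3))
```

For this household and these prices, Yoplait gets the highest predicted probability. Changing an entry of price by 0.01 and rerunning shows how the shared Price coefficient shifts the probabilities.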

Change the model

In the above model, all brands are constrained to have the same price parameter. Re-estimate the above model, but instead allow the price parameter to be brand specific, that is different across brands.

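# Rebuild the long-format data (same content as yl above)
ydatanew = yogurtdata
ylnew = mlogit.data(ydatanew[,-c("Index" )], shape="wide", 
                    choice="Choice", id="PanID", varying=1:6)
# Price moves to the third part of the formula, so each brand gets its own price coefficient
fnew <- mFormula(Choice ~ Feature | Income + HHSize | Price)
# Estimate the model (yl and ylnew hold the same data, so the results are identical)
mlnew <- mlogit(fnew, yl, reflevel="Dannon")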
summary(mlnew)

## 
## Call:
## mlogit(formula = Choice ~ Feature | Income + HHSize | Price, 
##     data = yl, reflevel = "Dannon", method = "nr")
## 
## Frequencies of alternatives:
##     Dannon Stonyfield    Yoplait 
##    0.33687    0.33080    0.33232 
## 
## nr method
## 4 iterations, 0h:0m:0s 
## g'(-H)^-1g = 2.79E-07 
## gradient close to zero 
## 
## Coefficients :
##                          Estimate Std. Error z-value  Pr(>|z|)    
## Stonyfield:(intercept)   0.359542   0.765243  0.4698 0.6384689    
## Yoplait:(intercept)      1.707557   1.018121  1.6772 0.0935103 .  
## Feature                  0.325126   0.211679  1.5359 0.1245541    
## Stonyfield:Income       -0.139728   0.031674 -4.4114 1.027e-05 ***
## Yoplait:Income          -0.229798   0.031981 -7.1854 6.701e-13 ***
## Stonyfield:HHSize        0.293239   0.118332  2.4781 0.0132082 *  
## Yoplait:HHSize          -0.074036   0.117278 -0.6313 0.5278521    
## Dannon:Price           -42.594788  11.100031 -3.8374 0.0001244 ***
## Stonyfield:Price       -21.209482   4.115792 -5.1532 2.561e-07 ***
## Yoplait:Price          -21.622927   9.535203 -2.2677 0.0233478 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Log-Likelihood: -631.07
## McFadden R^2:  0.12831 
## Likelihood ratio test : chisq = 185.79 (p.value = < 2.22e-16)

prob = predict(ml,yl)
probnew=predict(mlnew,ylnew)
colMeans(prob)

##     Dannon Stonyfield    Yoplait 
##  0.3368741  0.3308042  0.3323217

colMeans(probnew)

##     Dannon Stonyfield    Yoplait 
##  0.3368741  0.3308042  0.3323217

Q: Compare the above two models

First model: constrains the price parameters to be the same across brands
Second model: allows the price parameters to differ across brands

First, based on the price parameters, do you think it makes sense to constrain them to be the same across the three brands?

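Setting aside the fit of this particular model, it makes more sense not to constrain the price parameters to be the same across the three brands. The estimated price coefficient for Dannon (-42.6) is about twice those for Stonyfield (-21.2) and Yoplait (-21.6), and it is unlikely that consumers have exactly the same marginal utility of price for every brand in a product category.

The first model looks better in terms of statistically significant coefficients (its intercepts are significant, while those in the second model are not), and a common price coefficient is a convenient assumption in theory. In practice, however, it is more realistic to allow the price parameters to differ across brands.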

Second, compare the model fits using the AIC values that we learned before, \[AIC = -2\,LogLikelihood + 2K\] where K is the number of model parameters.

AIC(ml)

## [1] 1281.567

AIC(mlnew)

## [1] 1282.146
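
As a check on the formula, the AIC values can be reproduced by hand from the log-likelihoods and coefficient counts reported in the two summaries (8 parameters in the first model, 10 in the second). This is a minimal sketch; aic_by_hand is a helper name introduced here, and because the log-likelihoods are taken from the rounded printed output, the results match AIC() only to about two decimal places.

```r
# AIC = -2 * logLik + 2 * K, with K the number of estimated parameters
aic_by_hand <- function(loglik, k) -2 * loglik + 2 * k

# Log-likelihoods and parameter counts copied from the summary() outputs
aic_ml    <- aic_by_hand(loglik = -632.78, k = 8)   # common price coefficient
aic_mlnew <- aic_by_hand(loglik = -631.07, k = 10)  # brand-specific price coefficients

print(c(ml = aic_ml, mlnew = aic_mlnew, difference = aic_mlnew - aic_ml))
```

The second model improves the log-likelihood by 1.71, but the penalty for its two extra parameters more than offsets that gain, which is why its AIC is slightly higher.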

Looking at the AIC values, it is difficult to separate the two models because they are so close. A lower AIC indicates a better fit after penalizing for the number of parameters, and the first model's AIC is lower, but only by about 0.6. A second, more informal way to choose between the models is to see which one has more statistically significant coefficients.

There is no single right answer here, though, since constraining three different brands to share the same price coefficient is also hard to justify.

Homework Solutions: MNL model

Abir Chakraborty

Due on March 2, 2020
