This notebook introduces you to choice-based conjoint (CBC) analysis. This is much like regular ratings-based conjoint analysis, with the difference that the dependent variable is now a choice between different profiles, rather than a rating of each one of them. Consequently, the dependent variable is categorical, which means we have to employ the appropriate regression techniques (e.g., logit).
The goal is to build a choice-based conjoint model that helps the firm understand consumer preferences across the levels of each laptop attribute (number of screens, screen size, weight, brand, and price), the relative importance of those attributes, the monetary trade-offs between attribute levels, how market shares change as attribute levels change, and whether preferences vary across the population.
#Installs required packages
if(!require(mlogit)) install.packages("mlogit",repos = "http://cran.us.r-project.org")
if(!require(MASS)) install.packages("MASS",repos = "http://cran.us.r-project.org")
if(!require(readxl)) install.packages("readxl",repos = "http://cran.us.r-project.org")
if(!require(tidyr)) install.packages("tidyr",repos = "http://cran.us.r-project.org")
if(!require(dplyr)) install.packages("dplyr",repos = "http://cran.us.r-project.org")
library(readr)
library(mlogit)
library(MASS)
library(tidyr)
library(dplyr)
set.seed(1234) # setting seed for random draws
# Load the choice data
cbc.df <- read.csv("/Users/ayandacollins/Desktop/MISDI/Marketing Analytics/MG4F2_11488/CBC_NEO v5.csv")
# View data frame
#View(cbc.df)
# Getting an overview of the data
summary(cbc.df)
## choice id numberofscreens1 numberofscreens2
## Min. :1.000 Min. : 1 Length:2260 Length:2260
## 1st Qu.:1.000 1st Qu.: 29 Class :character Class :character
## Median :2.000 Median : 57 Mode :character Mode :character
## Mean :2.041 Mean : 57
## 3rd Qu.:3.000 3rd Qu.: 85
## Max. :3.000 Max. :113
## numberofscreens3 screensize1 screensize2 screensize3
## Length:2260 Min. :13.0 Min. :13.0 Min. :13.0
## Class :character 1st Qu.:13.0 1st Qu.:13.0 1st Qu.:13.0
## Mode :character Median :15.0 Median :13.0 Median :13.0
## Mean :14.1 Mean :13.9 Mean :13.9
## 3rd Qu.:15.0 3rd Qu.:15.0 3rd Qu.:15.0
## Max. :15.0 Max. :15.0 Max. :15.0
## weight1 weight2 weight3 brand1
## Length:2260 Length:2260 Length:2260 Length:2260
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## brand2 brand3 pricing1 pricing2
## Length:2260 Length:2260 Min. : 700 Min. : 700
## Class :character Class :character 1st Qu.: 925 1st Qu.: 700
## Mode :character Mode :character Median :1000 Median :1000
## Mean :1060 Mean : 970
## 3rd Qu.:1300 3rd Qu.:1075
## Max. :1300 Max. :1300
## pricing3 X X.1 X.2
## Min. : 700 Min. :1 Min. :1 Min. :1
## 1st Qu.: 700 1st Qu.:1 1st Qu.:1 1st Qu.:1
## Median :1000 Median :1 Median :1 Median :1
## Mean : 970 Mean :1 Mean :1 Mean :1
## 3rd Qu.:1300 3rd Qu.:1 3rd Qu.:1 3rd Qu.:1
## Max. :1300 Max. :1 Max. :1 Max. :1
# Examining the structure of the data
str(cbc.df)
## 'data.frame': 2260 obs. of 20 variables:
## $ choice : int 2 3 3 1 1 2 3 3 1 2 ...
## $ id : int 1 1 1 1 1 1 1 1 1 1 ...
## $ numberofscreens1: chr "Dual-screen" "Single-screen" "Single-screen" "Dual-screen" ...
## $ numberofscreens2: chr "Dual-screen" "Single-screen" "Single-screen" "Single-screen" ...
## $ numberofscreens3: chr "Single-screen" "Dual-screen" "Dual-screen" "Single-screen" ...
## $ screensize1 : int 13 13 13 15 15 13 15 15 15 15 ...
## $ screensize2 : int 15 13 15 13 15 13 15 13 13 15 ...
## $ screensize3 : int 15 15 13 15 13 13 13 15 13 13 ...
## $ weight1 : chr "1kg" "1.5kg" "1.5kg" "1kg" ...
## $ weight2 : chr "1.5kg" "1.5kg" "1kg" "1kg" ...
## $ weight3 : chr "1.5kg" "1.5kg" "1.5kg" "1.5kg" ...
## $ brand1 : chr "Microsoft" "Apple" "Microsoft" "Microsoft" ...
## $ brand2 : chr "Microsoft" "Microsoft" "Apple" "Apple" ...
## $ brand3 : chr "Microsoft" "Lenovo" "Apple" "Lenovo" ...
## $ pricing1 : int 1300 1000 1000 1000 700 1300 1300 1300 700 1000 ...
## $ pricing2 : int 700 1000 1300 700 1000 1000 1300 1300 1000 700 ...
## $ pricing3 : int 1300 1000 700 700 1000 1000 700 700 700 1300 ...
## $ X : int 1 1 1 1 1 1 1 1 1 1 ...
## $ X.1 : int 1 1 1 1 1 1 1 1 1 1 ...
## $ X.2 : int 1 1 1 1 1 1 1 1 1 1 ...
You can see that the variables are not quite in the appropriate format: the attribute columns such as numberofscreens, weight, and brand are characters, while screensize and pricing are of type int. We need to reshape the data into the long format that mlogit expects and convert the attributes to factors, which we do below.
# Getting data ready for `mlogit` package
cbc.df$chid <- 1:nrow(cbc.df)
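# dfidx() (attached along with mlogit) reshapes the data from wide format (one row per
# choice task, with alternative-specific columns numberofscreens1-3 ... pricing1-3 sitting
# in columns 3 to 17, hence varying = 3:17) to long format with one row per alternative;
# choice = "choice" identifies the column recording which of the three alternatives was chosen.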
cbc.mlogit <- dfidx(cbc.df, idx = list(c("chid", "id")), choice = "choice", varying = 3:17, sep = "")
# Transforming variables (to factor)
cbc.mlogit$numberofscreens = as.factor(cbc.mlogit$numberofscreens)
cbc.mlogit$screensize = as.factor(cbc.mlogit$screensize)
cbc.mlogit$weight = as.factor(cbc.mlogit$weight)
cbc.mlogit$brand = as.factor(cbc.mlogit$brand)
cbc.mlogit$price_fac = as.factor(cbc.mlogit$pricing)
# Examining the structure of the data
str(cbc.mlogit)
## Classes 'dfidx' and 'data.frame': 6780 obs. of 11 variables:
## $ choice : 'xseries' logi FALSE TRUE FALSE FALSE FALSE TRUE ...
## ..- attr(*, "idx")=Classes 'idx' and 'data.frame': 6780 obs. of 3 variables:
## .. ..$ chid: int [1:6780] 1 1 1 2 2 2 3 3 3 4 ...
## .. ..$ id : int [1:6780] 1 1 1 1 1 1 1 1 1 1 ...
## .. ..$ id2 : Factor w/ 3 levels "1","2","3": 1 2 3 1 2 3 1 2 3 1 ...
## .. ..- attr(*, "ids")= num [1:3] 1 1 2
## $ X : 'xseries' int 1 1 1 1 1 1 1 1 1 1 ...
## ..- attr(*, "idx")=Classes 'idx' and 'data.frame': 6780 obs. of 3 variables:
## .. ..$ chid: int [1:6780] 1 1 1 2 2 2 3 3 3 4 ...
## .. ..$ id : int [1:6780] 1 1 1 1 1 1 1 1 1 1 ...
## .. ..$ id2 : Factor w/ 3 levels "1","2","3": 1 2 3 1 2 3 1 2 3 1 ...
## .. ..- attr(*, "ids")= num [1:3] 1 1 2
## $ X.1 : 'xseries' int 1 1 1 1 1 1 1 1 1 1 ...
## ..- attr(*, "idx")=Classes 'idx' and 'data.frame': 6780 obs. of 3 variables:
## .. ..$ chid: int [1:6780] 1 1 1 2 2 2 3 3 3 4 ...
## .. ..$ id : int [1:6780] 1 1 1 1 1 1 1 1 1 1 ...
## .. ..$ id2 : Factor w/ 3 levels "1","2","3": 1 2 3 1 2 3 1 2 3 1 ...
## .. ..- attr(*, "ids")= num [1:3] 1 1 2
## $ X.2 : 'xseries' int 1 1 1 1 1 1 1 1 1 1 ...
## ..- attr(*, "idx")=Classes 'idx' and 'data.frame': 6780 obs. of 3 variables:
## .. ..$ chid: int [1:6780] 1 1 1 2 2 2 3 3 3 4 ...
## .. ..$ id : int [1:6780] 1 1 1 1 1 1 1 1 1 1 ...
## .. ..$ id2 : Factor w/ 3 levels "1","2","3": 1 2 3 1 2 3 1 2 3 1 ...
## .. ..- attr(*, "ids")= num [1:3] 1 1 2
## $ numberofscreens: Factor w/ 2 levels "Dual-screen",..: 1 1 2 2 2 1 2 2 1 1 ...
## ..- attr(*, "idx")=Classes 'idx' and 'data.frame': 6780 obs. of 3 variables:
## .. ..$ chid: int [1:6780] 1 1 1 2 2 2 3 3 3 4 ...
## .. ..$ id : int [1:6780] 1 1 1 1 1 1 1 1 1 1 ...
## .. ..$ id2 : Factor w/ 3 levels "1","2","3": 1 2 3 1 2 3 1 2 3 1 ...
## .. ..- attr(*, "ids")= num [1:3] 1 1 2
## $ screensize : Factor w/ 2 levels "13","15": 1 2 2 1 1 2 1 2 1 2 ...
## ..- attr(*, "idx")=Classes 'idx' and 'data.frame': 6780 obs. of 3 variables:
## .. ..$ chid: int [1:6780] 1 1 1 2 2 2 3 3 3 4 ...
## .. ..$ id : int [1:6780] 1 1 1 1 1 1 1 1 1 1 ...
## .. ..$ id2 : Factor w/ 3 levels "1","2","3": 1 2 3 1 2 3 1 2 3 1 ...
## .. ..- attr(*, "ids")= num [1:3] 1 1 2
## $ weight : Factor w/ 2 levels "1.5kg","1kg": 2 1 1 1 1 1 1 2 1 2 ...
## ..- attr(*, "idx")=Classes 'idx' and 'data.frame': 6780 obs. of 3 variables:
## .. ..$ chid: int [1:6780] 1 1 1 2 2 2 3 3 3 4 ...
## .. ..$ id : int [1:6780] 1 1 1 1 1 1 1 1 1 1 ...
## .. ..$ id2 : Factor w/ 3 levels "1","2","3": 1 2 3 1 2 3 1 2 3 1 ...
## .. ..- attr(*, "ids")= num [1:3] 1 1 2
## $ brand : Factor w/ 3 levels "Apple","Lenovo",..: 3 3 3 1 3 2 3 1 1 3 ...
## ..- attr(*, "idx")=Classes 'idx' and 'data.frame': 6780 obs. of 3 variables:
## .. ..$ chid: int [1:6780] 1 1 1 2 2 2 3 3 3 4 ...
## .. ..$ id : int [1:6780] 1 1 1 1 1 1 1 1 1 1 ...
## .. ..$ id2 : Factor w/ 3 levels "1","2","3": 1 2 3 1 2 3 1 2 3 1 ...
## .. ..- attr(*, "ids")= num [1:3] 1 1 2
## $ pricing : 'xseries' int 1300 700 1300 1000 1000 1000 1000 1300 700 1000 ...
## ..- attr(*, "idx")=Classes 'idx' and 'data.frame': 6780 obs. of 3 variables:
## .. ..$ chid: int [1:6780] 1 1 1 2 2 2 3 3 3 4 ...
## .. ..$ id : int [1:6780] 1 1 1 1 1 1 1 1 1 1 ...
## .. ..$ id2 : Factor w/ 3 levels "1","2","3": 1 2 3 1 2 3 1 2 3 1 ...
## .. ..- attr(*, "ids")= num [1:3] 1 1 2
## $ idx :Classes 'idx' and 'data.frame': 6780 obs. of 3 variables:
## ..$ chid: int 1 1 1 2 2 2 3 3 3 4 ...
## ..$ id : int 1 1 1 1 1 1 1 1 1 1 ...
## ..$ id2 : Factor w/ 3 levels "1","2","3": 1 2 3 1 2 3 1 2 3 1 ...
## ..- attr(*, "ids")= num [1:3] 1 1 2
## $ price_fac : Factor w/ 3 levels "700","1000","1300": 3 1 3 2 2 2 2 3 1 2 ...
## ..- attr(*, "idx")=Classes 'idx' and 'data.frame': 6780 obs. of 3 variables:
## .. ..$ chid: int [1:6780] 1 1 1 2 2 2 3 3 3 4 ...
## .. ..$ id : int [1:6780] 1 1 1 1 1 1 1 1 1 1 ...
## .. ..$ id2 : Factor w/ 3 levels "1","2","3": 1 2 3 1 2 3 1 2 3 1 ...
## .. ..- attr(*, "ids")= num [1:3] 1 1 2
## - attr(*, "clseries")= chr "xseries"
## - attr(*, "choice")= chr "choice"
We can get a sense of the results we are likely to get by looking at the raw data. In particular, let us examine choice counts, which tell us how many times each level of each attribute was picked across the sample.
# Obtaining choice counts for price_fac
xtabs(choice ~ price_fac, data=cbc.mlogit)
## price_fac
## 700 1000 1300
## 1282 713 265
People picked the $700 option most often, followed by $1000 and then $1300. Not surprising.
# Obtaining choice counts for brand
xtabs(choice ~ brand, data=cbc.mlogit)
## brand
## Apple Lenovo Microsoft
## 1185 233 842
# Obtaining choice counts for weight
xtabs(choice ~ weight, data=cbc.mlogit)
## weight
## 1.5kg 1kg
## 1124 1136
# Obtaining choice counts for number of screens
xtabs(choice ~ numberofscreens, data=cbc.mlogit)
## numberofscreens
## Dual-screen Single-screen
## 1199 1061
# Obtaining choice counts for screen size
xtabs(choice ~ screensize, data=cbc.mlogit)
## screensize
## 13 15
## 1132 1128
Time to analyse the data using a choice-based conjoint model. As mentioned earlier, we'll be using a logit model to analyse the data; specifically, a conditional logit model (you may also see this referred to as a multinomial logit, but the latter term is also used for a different model).
The logit model says that choice depends on number of screens, screen size, weight, brand, and price. Every attribute enters as a factor, and the intercept is set to zero.
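To make the model concrete: under the conditional logit, the probability that a respondent chooses profile j out of the three shown is exp(v_j) / (exp(v_1) + exp(v_2) + exp(v_3)), where v_j is the sum of the part-worth coefficients of profile j's attribute levels. The coefficients we estimate below are exactly those part-worths.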
# specifying the regression
fml1 = as.formula(choice ~ numberofscreens + screensize + weight + brand + price_fac | 0)
Observe above that we have set the alternative-specific intercepts to zero (the | 0 part of the formula). Because the three alternatives in each choice task are unlabelled, their position carries no meaning, so this makes no difference to the interpretation of the results.
# running the logit and estimating the choice model
m1 <- mlogit(fml1, data = cbc.mlogit)
summary(m1)
##
## Call:
## mlogit(formula = choice ~ numberofscreens + screensize + weight +
## brand + price_fac | 0, data = cbc.mlogit, method = "nr")
##
## Frequencies of alternatives:choice
## 1 2 3
## 0.31018 0.33850 0.35133
##
## nr method
## 5 iterations, 0h:0m:0s
## g'(-H)^-1g = 4.18E-05
## successive function values within tolerance limits
##
## Coefficients :
## Estimate Std. Error z-value Pr(>|z|)
## numberofscreensSingle-screen -0.760433 0.071325 -10.6615 < 2.2e-16 ***
## screensize15 0.354853 0.065656 5.4048 6.489e-08 ***
## weight1kg 0.246182 0.061042 4.0330 5.508e-05 ***
## brandLenovo -1.220215 0.089118 -13.6921 < 2.2e-16 ***
## brandMicrosoft -0.379922 0.078215 -4.8574 1.189e-06 ***
## price_fac1000 -1.001197 0.076237 -13.1327 < 2.2e-16 ***
## price_fac1300 -1.755387 0.081638 -21.5020 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Log-Likelihood: -1842.1
We get importance weights by calculating the range of part-worths within each attribute (with the omitted base level counting as zero) and dividing it by the total of these ranges across all attributes.
#calculating importance weights
iw.total <- abs(m1$coefficients["numberofscreensSingle-screen"]) + abs(m1$coefficients["screensize15"]) + abs(m1$coefficients["weight1kg"]) +abs(m1$coefficients["brandLenovo"]) +abs(m1$coefficients["price_fac1300"])
abs(m1$coefficients["numberofscreensSingle-screen"])/iw.total
## numberofscreensSingle-screen
## 0.1753333
abs(m1$coefficients["weight1kg"])/iw.total
## weight1kg
## 0.05676223
abs(m1$coefficients["screensize15"])/iw.total
## screensize15
## 0.08181869
abs(m1$coefficients["brandLenovo"])/iw.total
## brandLenovo
## 0.2813454
abs(m1$coefficients["price_fac1300"])/iw.total
## price_fac1300
## 0.4047403
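The calculation above picks out, for each attribute, the coefficient furthest from the base level (whose part-worth is zero). A slightly more compact sketch of the same computation, which works directly from the range of part-worths per attribute and reproduces the numbers above (the names are the coefficient names from m1), is:
# Alternative sketch: range of part-worths per attribute (base level = 0), normalised
pw <- coef(m1)
ranges <- c(
  numberofscreens = diff(range(c(0, pw["numberofscreensSingle-screen"]))),
  screensize      = diff(range(c(0, pw["screensize15"]))),
  weight          = diff(range(c(0, pw["weight1kg"]))),
  brand           = diff(range(c(0, pw["brandLenovo"], pw["brandMicrosoft"]))),
  price           = diff(range(c(0, pw["price_fac1000"], pw["price_fac1300"])))
)
round(ranges / sum(ranges), 3)  # importance weights, summing to 1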
There are two obvious approaches to calculating a monetary equivalent for the trade-offs we see in the estimates above. First, one could calculate the exchange rate using the estimates for price between any two levels; we have already seen how to do that in the lecture. The second approach is to estimate a model with price as a continuous variable (rather than as a factor with three levels, which is what we have had so far), and use the coefficient of price as our exchange rate. Below we briefly illustrate the first approach, then implement the second.
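A quick sketch of the first approach, using the m1 estimates shown above: moving from the $700 base level (utility 0) to $1300 costs about 1.755 utils for a $600 price increase, so one util is worth roughly $600 / 1.755, i.e. about $342.
# Sketch: dollars-per-util exchange rate implied by the factor-price model m1
600 / abs(coef(m1)["price_fac1300"])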
# Specifying the formula
fml2 = as.formula(choice ~ numberofscreens + weight + screensize + brand + pricing | 0)
# Actually running the logit
m2 <- mlogit(fml2, data = cbc.mlogit)
# Viewing results
summary(m2)
##
## Call:
## mlogit(formula = choice ~ numberofscreens + weight + screensize +
## brand + pricing | 0, data = cbc.mlogit, method = "nr")
##
## Frequencies of alternatives:choice
## 1 2 3
## 0.31018 0.33850 0.35133
##
## nr method
## 5 iterations, 0h:0m:0s
## g'(-H)^-1g = 4E-05
## successive function values within tolerance limits
##
## Coefficients :
## Estimate Std. Error z-value Pr(>|z|)
## numberofscreensSingle-screen -0.75851848 0.07101969 -10.6804 < 2.2e-16 ***
## weight1kg 0.22530096 0.05995695 3.7577 0.0001715 ***
## screensize15 0.32106158 0.06307076 5.0905 3.571e-07 ***
## brandLenovo -1.20910172 0.08916348 -13.5605 < 2.2e-16 ***
## brandMicrosoft -0.39116494 0.07749024 -5.0479 4.466e-07 ***
## pricing -0.00296180 0.00013571 -21.8251 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Log-Likelihood: -1843.7
Now let us use the exchange rate to monetise some trade-offs. Let us calculate the willingness to pay for a larger screen (i.e., going from a 13-inch to a 15-inch screen). Since the price coefficient is about -0.00296 per dollar, one util is worth roughly 1/0.00296, i.e. about $338, and going from 13 inches to 15 inches gives the consumer an extra 0.32 utils, so the larger screen is worth about 0.32 * 338 = $108.
Note. Since price is continuous, you can also get the willingness to pay for any attribute directly by dividing the coefficient of that attribute by the (negative of the) price coefficient. This is shown below.
# Computing willingness to pay
coef(m2)["numberofscreensSingle-screen"]/ (-coef(m2)["pricing"]/1)
## numberofscreensSingle-screen
## -256.1009
coef(m2)["weight1kg"]/ (-coef(m2)["pricing"]/1)
## weight1kg
## 76.06905
coef(m2)["screensize15"]/ (-coef(m2)["pricing"]/1)
## screensize15
## 108.401
coef(m2)["brandLenovo"]/ (-coef(m2)["pricing"]/1)
## brandLenovo
## -408.2327
coef(m2)["brandMicrosoft"]/ (-coef(m2)["pricing"]/1)
## brandMicrosoft
## -132.0702
Before we proceed, it is good to do a test to check that the model with price as a continuous variable is not very different, statistically, from the one where price was a factor. The formal way of doing this is to conduct a “likelihood ratio test”, which is done below.
# Compare models with and without price as continuous
lrtest(m1, m2)
Note that the null hypothesis, that the two models are statistically equivalent, is not rejected at the 5% level: the continuous-price model's log-likelihood (-1843.7) is only slightly below that of the factor-price model (-1842.1). In what follows we'll just use the model with continuous price.
# Creating a list of all the attributes and their levels
attrib <- list(numberofscreens = c("Single-screen", "Dual-screen"),
weight = c("1kg", "1.5kg"),
screensize = c("15", "13"),brand = c("Apple", "Lenovo","Microsoft"),
price_fac = c("700", "1000", "1300"))
## Create a full model matrix from the specs below (or whatever you want as market-space)
#### screens screensize weight brand price
#### 1 Dual 15 1kg Lenovo 1000
#### 2 Single 13 1kg Microsoft 1000
#### 3 Dual 15 1.5kg Lenovo 1000
#### 4 Single 13 1.5kg Apple 700
#### 5 Single 15 1kg Apple 700
#### 6 Dual 13 1kg Lenovo 1300
Observe above that we have specified six profiles as the competitive set. These are just illustrative choices; the manager would create an appropriate set of profiles reflecting real-world conditions. The next few steps essentially create a matrix describing these profiles, to feed them into the prediction function.
For these six products, we will see what predicted market shares look like.
# First, create an empty Market Space Data Frame with ALL the columnnames in order
msdf1 <- data.frame(matrix(ncol = 2+2+2+3+3, nrow = 0))
c1 <- c("numberofscreensSingle-screen", "numberofscreensDual-screen", "weight1kg", "weight1.5kg", "screensize15", "screensize13", "brandLenovo", "brandMicrosoft", "brandApple", "price_fac700", "price_fac1000","price_fac1300")
colnames(msdf1) <- c1
msdf1
## Second, append the data points appropriately (reproduces choice set from above)
### column order: screens (Single, Dual); weight (1kg, 1.5kg); screensize (15, 13); brand (Lenovo, Microsoft, Apple); price (700, 1000, 1300)
msdf1[1, ] <- c( 0,1, 1,0, 1,0, 1,0,0, 0,1,0)
msdf1[2, ] <- c( 1, 0, 1,0, 0,1, 0,1,0, 0,1,0)
msdf1[3, ] <- c( 0, 1, 0,1, 1,0, 1,0,0, 0,1,0)
msdf1[4, ] <- c( 1, 0, 0,1, 0,1, 0,0,1, 1,0,0)
msdf1[5, ] <- c( 1, 0, 1,0, 1,0, 0,0,1, 1,0,0)
msdf1[6, ] <- c( 0, 1, 1,0, 0,1, 1,0,0, 0,0,1)
# Third, drop the columns that the logit model uses as the base level for each attribute
# here from m1 they are: numberofscreensDual-screen, screensize13, weight1.5kg, brandApple, price_fac700
drops <- c("numberofscreensDual-screen", "screensize13", "weight1.5kg", "brandApple","price_fac700")
msdf2 <- msdf1[ , !(names(msdf1) %in% drops)]
#View(msdf2)
# Fourth, create a readable form of the matrix above
new_data <- msdf1
new_data$weight <- ifelse(new_data$weight1kg==1,1, ifelse(new_data$weight1.5kg==1,1.5,0))
new_data$screensize <- ifelse(new_data$screensize15==1,15, ifelse(new_data$screensize13==1,13,0))
new_data$price_fac <- ifelse(new_data$price_fac700==1,700, ifelse(new_data$price_fac1000==1,1000, ifelse(new_data$price_fac1300==1,1300,0)))
new_data$brand <- ifelse(new_data$brandLenovo==1,"Lenovo", ifelse(new_data$brandMicrosoft==1,"Microsoft", ifelse(new_data$brandApple==1,"Apple",0)))
new_data$nscreens <- ifelse(new_data[["numberofscreensSingle-screen"]]==1,1, ifelse(new_data[["numberofscreensDual-screen"]]==1,2,0))
keeps <- c("nscreens", "screensize", "weight","brand", "price_fac")
new_data <- new_data[ , (names(new_data) %in% keeps)]
#View(new_data)
We now have everything in place to run the prediction.
# Obtaining market share
predict.mnl_2(m1, new_data, msdf2)
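Note that predict.mnl_2() is a custom helper that is not defined in this notebook; presumably it computes predicted shares for the supplied profiles. As a rough idea of what such a share predictor might do, here is a minimal sketch under that assumption. The function name predict_shares_sketch is ours (not part of any package); it assumes the design matrix's columns can be matched by name to coef(m1), computes each profile's utility from the part-worths, and applies the logit share rule.
# Sketch of a logit share predictor (hypothetical stand-in for predict.mnl_2)
predict_shares_sketch <- function(model, profiles, design) {
  design <- as.matrix(design[, names(coef(model)), drop = FALSE])  # align columns with coefficient names
  utility <- design %*% coef(model)                                # deterministic utility of each profile
  share <- as.vector(exp(utility) / sum(exp(utility)))             # logit share rule
  data.frame(share = round(share, 3), profiles)
}
# e.g. predict_shares_sketch(m1, new_data, msdf2)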
What we have above is the basis of a very useful market simulator. We can use it to conduct a real-world thought experiment. Suppose the manager wants to know how market shares for her product will vary as she varies the levels of each attribute. The competitive set is as defined before, and we will not be changing the attributes of competitor products; that too can easily be done (e.g., to think through what your market share would be if you changed attributes and competitors responded with changes of their own). A rough sketch of such a sensitivity exercise is given below.
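This is only a sketch: it assumes that profile 1 in msdf2 is "our" product, and it uses the hypothetical predict_shares_sketch() helper from above in place of predict.mnl_2().
# Sketch: vary profile 1's price across the three tested levels and record its share
sens_price <- sapply(c("700", "1000", "1300"), function(p) {
  d <- msdf2
  d[1, c("price_fac1000", "price_fac1300")] <- 0     # reset profile 1's price dummies ($700 is the base)
  if (p != "700") d[1, paste0("price_fac", p)] <- 1  # switch on the chosen price level
  predict_shares_sketch(m1, new_data, d)[1, "share"] # predicted share of profile 1
})
sens_price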
We have answered most of the questions we had posed at the beginning of this notebook. We know consumer preferences across various levels of each attribute, we know the relative importance of various attributes, we know what the monetary trade-off between levels of attributes are, and we know how market shares vary as we change attribute levels. What follows is slightly more advanced material which looks at the last remaining question, i.e., do consumer preferences vary in the population (i.e., is there heterogeneity). This is the basis of coming up with segmentation schemes, with possibly differentiated product offerings to these segments.
Note that in our choice models so far, we have assumed that all individuals have the same preferences, i.e., our estimates above were averages for the entire population. It is reasonable to think that this is not the case – the very existence of segments, in fact, depends on this not being the case. One can do a couple of things to enrich the model with a view to accommodating heterogeneity in preferences. First, we can run the model separately for each individual, to get a set of individual-level parameter estimates. This is the easiest thing to do, but is not the most efficient; for one, there are only 20 observations (choice tasks) per individual, which means the parameters may not be estimated with precision. The second is to model the coefficients as being drawn from a distribution, so that every individual has a different set of coefficients. The advantage of this is that one only needs to estimate the mean and variance of the distribution for each coefficient, rather than a complete set of coefficients for each individual. We show the extension for the model with price as a continuous variable. Nothing changes if you were to use price as a factor – you'll just use a different base model.
# Specifying a normal distribution for all coefficients.
m2.rpar <- rep("n", length=length(m2$coef))
names(m2.rpar) <- names(m2$coef)
names(m2.rpar)
## [1] "numberofscreensSingle-screen" "weight1kg"
## [3] "screensize15" "brandLenovo"
## [5] "brandMicrosoft" "pricing"
The above just tells the model to use a normal distribution for each coefficient. In other words, we are assuming that the preference for each level varies according to a normal distribution in the population.
# Run hierarchical model
m2.hier <- mlogit(fml2, data = cbc.mlogit, rpar=c("numberofscreensSingle-screen" = 'n', "weight1kg" = 'n', "screensize15" = 'n', "brandLenovo" = 'n', "brandMicrosoft" = 'n', "pricing" = 'n'), panel = TRUE, correlation=FALSE)
We are telling the program that we have multiple observations per respondent (panel = TRUE). In addition, we are specifying that preferences across attributes are not correlated with each other, i.e., an individual's preference for a single screen over a dual screen is independent of her preference for a 15-inch over a 13-inch display (correlation = FALSE).
# View results of logit estimation
summary(m2.hier)
##
## Call:
## mlogit(formula = choice ~ numberofscreens + weight + screensize +
## brand + pricing | 0, data = cbc.mlogit, rpar = c(`numberofscreensSingle-screen` = "n",
## weight1kg = "n", screensize15 = "n", brandLenovo = "n", brandMicrosoft = "n",
## pricing = "n"), correlation = FALSE, panel = TRUE)
##
## Frequencies of alternatives:choice
## 1 2 3
## 0.31018 0.33850 0.35133
##
## bfgs method
## 28 iterations, 0h:0m:18s
## g'(-H)^-1g = 2.63E-07
## gradient close to zero
##
## Coefficients :
## Estimate Std. Error z-value Pr(>|z|)
## numberofscreensSingle-screen -1.26129742 0.09929722 -12.7022 < 2.2e-16 ***
## weight1kg 0.36308434 0.07759978 4.6789 2.884e-06 ***
## screensize15 0.45314163 0.08041281 5.6352 1.749e-08 ***
## brandLenovo -1.83434052 0.13810318 -13.2824 < 2.2e-16 ***
## brandMicrosoft -0.24552977 0.09877481 -2.4858 0.01293 *
## pricing -0.00390228 0.00019877 -19.6325 < 2.2e-16 ***
## sd.numberofscreensSingle-screen -1.59964799 0.12655884 -12.6396 < 2.2e-16 ***
## sd.weight1kg 0.08822157 0.09695232 0.9099 0.36285
## sd.screensize15 -0.27229152 0.10751730 -2.5325 0.01132 *
## sd.brandLenovo -2.14786032 0.15064172 -14.2581 < 2.2e-16 ***
## sd.brandMicrosoft 1.87776190 0.15358871 12.2259 < 2.2e-16 ***
## sd.pricing -0.00410818 0.00027900 -14.7245 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Log-Likelihood: -1556.2
##
## random coefficients
## Min. 1st Qu. Median Mean
## numberofscreensSingle-screen -Inf -2.340243596 -1.261297425 -1.261297425
## weight1kg -Inf 0.303579795 0.363084338 0.363084338
## screensize15 -Inf 0.269483784 0.453141626 0.453141626
## brandLenovo -Inf -3.283050293 -1.834340524 -1.834340524
## brandMicrosoft -Inf -1.512060925 -0.245529767 -0.245529767
## pricing -Inf -0.006673209 -0.003902284 -0.003902284
## 3rd Qu. Max.
## numberofscreensSingle-screen -0.182351253 Inf
## weight1kg 0.422588881 Inf
## screensize15 0.636799468 Inf
## brandLenovo -0.385630755 Inf
## brandMicrosoft 1.021001391 Inf
## pricing -0.001131359 Inf
# update model to account for correlated preferences
m3.hier <- update(m2.hier, correlation=TRUE)
summary(m3.hier)
##
## Call:
## mlogit(formula = choice ~ numberofscreens + weight + screensize +
## brand + pricing | 0, data = cbc.mlogit, rpar = c(`numberofscreensSingle-screen` = "n",
## weight1kg = "n", screensize15 = "n", brandLenovo = "n", brandMicrosoft = "n",
## pricing = "n"), correlation = TRUE, panel = TRUE)
##
## Frequencies of alternatives:choice
## 1 2 3
## 0.31018 0.33850 0.35133
##
## bfgs method
## 3 iterations, 0h:0m:15s
## g'(-H)^-1g = 5.13E+03
## last step couldn't find higher value
##
## Coefficients :
## Estimate
## numberofscreensSingle-screen -1.29085293
## weight1kg 1.21433156
## screensize15 1.99100413
## brandLenovo -1.69615322
## brandMicrosoft 0.08219975
## pricing 0.00105614
## chol.numberofscreensSingle-screen:numberofscreensSingle-screen -0.54051253
## chol.numberofscreensSingle-screen:weight1kg -0.18577910
## chol.weight1kg:weight1kg 0.40647270
## chol.numberofscreensSingle-screen:screensize15 0.56960851
## chol.weight1kg:screensize15 0.48075027
## chol.screensize15:screensize15 -0.12128043
## chol.numberofscreensSingle-screen:brandLenovo 1.75236845
## chol.weight1kg:brandLenovo 1.36980956
## chol.screensize15:brandLenovo 1.84972819
## chol.brandLenovo:brandLenovo 0.82230419
## chol.numberofscreensSingle-screen:brandMicrosoft 1.05983288
## chol.weight1kg:brandMicrosoft 1.47480211
## chol.screensize15:brandMicrosoft 1.65952359
## chol.brandLenovo:brandMicrosoft 0.90302256
## chol.brandMicrosoft:brandMicrosoft 1.28598704
## chol.numberofscreensSingle-screen:pricing 0.09544325
## chol.weight1kg:pricing 0.09454956
## chol.screensize15:pricing 0.09573733
## chol.brandLenovo:pricing 0.09588848
## chol.brandMicrosoft:pricing 0.09314978
## chol.pricing:pricing 0.09662673
## Std. Error
## numberofscreensSingle-screen 0.06555700
## weight1kg 0.05941428
## screensize15 0.06376989
## brandLenovo 0.08937329
## brandMicrosoft 0.07648593
## pricing 0.00010059
## chol.numberofscreensSingle-screen:numberofscreensSingle-screen 0.06833123
## chol.numberofscreensSingle-screen:weight1kg 0.06787648
## chol.weight1kg:weight1kg 0.06652760
## chol.numberofscreensSingle-screen:screensize15 0.06852054
## chol.weight1kg:screensize15 0.06918265
## chol.screensize15:screensize15 0.07883235
## chol.numberofscreensSingle-screen:brandLenovo 0.11724981
## chol.weight1kg:brandLenovo 0.11172375
## chol.screensize15:brandLenovo 0.13008660
## chol.brandLenovo:brandLenovo 0.11368189
## chol.numberofscreensSingle-screen:brandMicrosoft 0.09584108
## chol.weight1kg:brandMicrosoft 0.09990137
## chol.screensize15:brandMicrosoft 0.09888237
## chol.brandLenovo:brandMicrosoft 0.10122799
## chol.brandMicrosoft:brandMicrosoft 0.10141139
## chol.numberofscreensSingle-screen:pricing 0.00018666
## chol.weight1kg:pricing 0.00018630
## chol.screensize15:pricing 0.00018404
## chol.brandLenovo:pricing 0.00018181
## chol.brandMicrosoft:pricing 0.00019490
## chol.pricing:pricing 0.00017887
## z-value
## numberofscreensSingle-screen -19.6905
## weight1kg 20.4384
## screensize15 31.2217
## brandLenovo -18.9783
## brandMicrosoft 1.0747
## pricing 10.4992
## chol.numberofscreensSingle-screen:numberofscreensSingle-screen -7.9102
## chol.numberofscreensSingle-screen:weight1kg -2.7370
## chol.weight1kg:weight1kg 6.1098
## chol.numberofscreensSingle-screen:screensize15 8.3130
## chol.weight1kg:screensize15 6.9490
## chol.screensize15:screensize15 -1.5385
## chol.numberofscreensSingle-screen:brandLenovo 14.9456
## chol.weight1kg:brandLenovo 12.2607
## chol.screensize15:brandLenovo 14.2192
## chol.brandLenovo:brandLenovo 7.2334
## chol.numberofscreensSingle-screen:brandMicrosoft 11.0582
## chol.weight1kg:brandMicrosoft 14.7626
## chol.screensize15:brandMicrosoft 16.7828
## chol.brandLenovo:brandMicrosoft 8.9207
## chol.brandMicrosoft:brandMicrosoft 12.6809
## chol.numberofscreensSingle-screen:pricing 511.3307
## chol.weight1kg:pricing 507.5216
## chol.screensize15:pricing 520.2022
## chol.brandLenovo:pricing 527.4227
## chol.brandMicrosoft:pricing 477.9419
## chol.pricing:pricing 540.1927
## Pr(>|z|)
## numberofscreensSingle-screen < 2.2e-16 ***
## weight1kg < 2.2e-16 ***
## screensize15 < 2.2e-16 ***
## brandLenovo < 2.2e-16 ***
## brandMicrosoft 0.2825
## pricing < 2.2e-16 ***
## chol.numberofscreensSingle-screen:numberofscreensSingle-screen 2.665e-15 ***
## chol.numberofscreensSingle-screen:weight1kg 0.0062 **
## chol.weight1kg:weight1kg 9.973e-10 ***
## chol.numberofscreensSingle-screen:screensize15 < 2.2e-16 ***
## chol.weight1kg:screensize15 3.679e-12 ***
## chol.screensize15:screensize15 0.1239
## chol.numberofscreensSingle-screen:brandLenovo < 2.2e-16 ***
## chol.weight1kg:brandLenovo < 2.2e-16 ***
## chol.screensize15:brandLenovo < 2.2e-16 ***
## chol.brandLenovo:brandLenovo 4.712e-13 ***
## chol.numberofscreensSingle-screen:brandMicrosoft < 2.2e-16 ***
## chol.weight1kg:brandMicrosoft < 2.2e-16 ***
## chol.screensize15:brandMicrosoft < 2.2e-16 ***
## chol.brandLenovo:brandMicrosoft < 2.2e-16 ***
## chol.brandMicrosoft:brandMicrosoft < 2.2e-16 ***
## chol.numberofscreensSingle-screen:pricing < 2.2e-16 ***
## chol.weight1kg:pricing < 2.2e-16 ***
## chol.screensize15:pricing < 2.2e-16 ***
## chol.brandLenovo:pricing < 2.2e-16 ***
## chol.brandMicrosoft:pricing < 2.2e-16 ***
## chol.pricing:pricing < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Log-Likelihood: -3057.1
##
## random coefficients
## Min. 1st Qu. Median Mean
## numberofscreensSingle-screen -Inf -1.6554231 -1.290852927 -1.290852927
## weight1kg -Inf 0.9128913 1.214331565 1.214331565
## screensize15 -Inf 1.4816489 1.991004133 1.991004133
## brandLenovo -Inf -3.7246582 -1.696153220 -1.696153220
## brandMicrosoft -Inf -1.8867464 0.082199745 0.082199745
## pricing -Inf -0.1562936 0.001056143 0.001056143
## 3rd Qu. Max.
## numberofscreensSingle-screen -0.9262828 Inf
## weight1kg 1.5157718 Inf
## screensize15 2.5003594 Inf
## brandLenovo 0.3323518 Inf
## brandMicrosoft 2.0511459 Inf
## pricing 0.1584059 Inf
As you can see, we now have many more estimates, reflecting the fact that we are now allowing preferences for different attributes to be correlated with each other. A note of caution: the optimiser reports that the last step could not find a higher value, and the log-likelihood (-3057.1) is far worse than that of the uncorrelated model (-1556.2), so this particular run has not converged properly and its estimates should be treated with care.
Aside. In terms of predicting market shares, there is nothing conceptually different about predicting market shares in a model with heterogeneity versus one without (that we did earlier). Computationally, the added complication in a model with heterogeneity is that we now need to use each individual’s estimated preferences, compute her choices, then aggregate over the population to get share predictions. End Aside
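As a rough illustration of that simulation logic, the sketch below draws hypothetical respondents from the estimated normal distributions of the uncorrelated model m2.hier and averages their implied logit shares. This is a simplification under stated assumptions: it ignores the respondent-level conditional estimates and any correlation between coefficients, the function name is ours, and the design matrix passed in must have one column per mean coefficient of m2.hier (so a continuous pricing column in dollars, not the price dummies).
# Sketch: share prediction under heterogeneity by simulating respondents from the
# estimated normal distributions of the uncorrelated model m2.hier
predict_hier_shares_sketch <- function(model, design, ndraws = 1000) {
  cf <- coef(model)
  mu <- cf[!grepl("^sd\\.", names(cf))]              # estimated population means
  sdev <- abs(cf[paste0("sd.", names(mu))])          # matching standard deviations (sign is not identified)
  design <- as.matrix(design[, names(mu), drop = FALSE])
  shares <- matrix(0, nrow(design), ndraws)
  for (r in 1:ndraws) {
    beta <- rnorm(length(mu), mean = mu, sd = sdev)  # one simulated respondent's part-worths
    u <- design %*% beta
    shares[, r] <- exp(u) / sum(exp(u))              # that respondent's logit shares
  }
  rowMeans(shares)                                   # average over simulated respondents
}
# Usage sketch (hypothetical): build a design matrix for the six profiles with a
# continuous pricing column, then call the function
# design.h <- cbind(msdf2[, !grepl("^price_fac", names(msdf2))], pricing = new_data$price_fac)
# predict_hier_shares_sketch(m2.hier, design.h, ndraws = 2000)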
We’ve seen above that there is significant heterogeneity in the population. How can we use this to come up with a possible segmentation scheme? Here we turn to another technique we’ve used before, i.e., cluster analysis.
What do we have to use as input data for the clustering algorithm? We certainly don't have any customer descriptors in the traditional sense (e.g., demographics). However, we do have something very powerful, namely estimates of each respondent's preferences for the various attributes.
We start by getting the individual-level estimates in a useful form.
# Estimating a separate logit model for each respondent (ids 1 to 113)
coefs.resp <- data.frame()
for (i in 1:113) {
  cbc.mlogit.resp <- cbc.mlogit[which(cbc.mlogit$id == i), ]
  m.resp <- mlogit(fml2, data = cbc.mlogit.resp)
  coefs.resp <- rbind(coefs.resp, data.frame(t(m.resp$coefficients)))
}
# View dataframe
#View(coefs.resp)
We now go through the standard k-means clustering approach we've used before (i.e., standardising the coefficients, producing an elbow plot, and deciding on the number of clusters).
# read library tidyr
library(tidyr)
library(dplyr)
# standardizing the data
coefs.resp.sc <- scale(coefs.resp)
summary(coefs.resp.sc)
## numberofscreensSingle.screen weight1kg screensize15
## Min. :-5.2362 Min. :-3.3352 Min. :-3.58342
## 1st Qu.:-0.2449 1st Qu.:-0.2667 1st Qu.:-0.13370
## Median : 0.3843 Median :-0.1858 Median :-0.07031
## Mean : 0.0000 Mean : 0.0000 Mean : 0.00000
## 3rd Qu.: 0.4360 3rd Qu.: 0.2108 3rd Qu.: 0.23237
## Max. : 2.2908 Max. : 2.8062 Max. : 3.15402
## brandLenovo brandMicrosoft pricing
## Min. :-2.8244 Min. :-4.5308 Min. :-3.7968
## 1st Qu.:-0.4498 1st Qu.:-0.1148 1st Qu.:-0.2117
## Median : 0.3212 Median : 0.3310 Median : 0.5283
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6412 3rd Qu.: 0.3743 3rd Qu.: 0.6403
## Max. : 2.0414 Max. : 2.5332 Max. : 1.7793
# set seed for kmeans clustering
set.seed(1234567)
# deciding the optimal number of clusters
SSE_curve <- c()
for (n in 1:10) {
kcluster <- kmeans(coefs.resp.sc, n)
sse <- sum(kcluster$withinss)
SSE_curve[n] <- sse
}
print("SSE curve for the ideal k value")
## [1] "SSE curve for the ideal k value"
plot(1:10, SSE_curve, type="b", xlab="Number of Clusters", ylab="SSE")
Hard to tell, but for now, let’s go with 3 clusters.
# doing k-means clustering.
kcluster <- kmeans(coefs.resp.sc,3)
kcluster$size
## [1] 6 19 88
clus <- kcluster$cluster
coefs.resp<-cbind(coefs.resp,clus)
summary(coefs.resp)
## numberofscreensSingle.screen weight1kg screensize15
## Min. :-280.81162 Min. :-51.5610 Min. :-74.8951
## 1st Qu.: -33.62683 1st Qu.: -0.5348 1st Qu.: -0.4843
## Median : -2.46276 Median : 0.8102 Median : 0.8830
## Mean : -21.49667 Mean : 3.8998 Mean : 2.3996
## 3rd Qu.: 0.09619 3rd Qu.: 7.4049 3rd Qu.: 7.4119
## Max. : 91.95192 Max. : 50.5641 Max. : 70.4319
## brandLenovo brandMicrosoft pricing clus
## Min. :-172.24172 Min. :-179.8921 Min. :-0.466063 Min. :1.000
## 1st Qu.: -54.22085 1st Qu.: -17.5478 1st Qu.:-0.092269 1st Qu.:3.000
## Median : -15.89499 Median : -1.1561 Median :-0.015107 Median :3.000
## Mean : -31.86205 Mean : -13.3260 Mean :-0.070192 Mean :2.726
## 3rd Qu.: 0.00744 3rd Qu.: 0.4335 3rd Qu.:-0.003433 3rd Qu.:3.000
## Max. : 69.60300 Max. : 79.8026 Max. : 0.115330 Max. :3.000
# comparing clusters, shows the mean values in groups:
aggregate(coefs.resp, list(coefs.resp$clus), function(x) mean(as.numeric(x)))
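To use these segments downstream, it helps to keep the cluster assignment alongside the respondent id. A small sketch (it assumes, as in the estimation loop above, that the rows of coefs.resp correspond to ids 1 to 113 in order):
# Sketch: attach cluster membership to respondent ids for later profiling
segments <- data.frame(id = 1:113, segment = clus)
head(segments)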