This notebook introduces you to choice-based conjoint (CBC) analysis. This is much like regular ratings-based conjoint analysis, with the difference that the dependent variable is now a choice between different profiles, rather than a rating of each one of them. Consequently, the dependent variable is categorical, which means we have to employ the appropriate regression techniques (e.g., logit).
The goal is to build a choice-based conjoint model that helps the firm understand consumer preferences across the levels of each laptop attribute (number of screens, screen size, weight, brand, and price), the relative importance of those attributes, the monetary trade-offs between attribute levels, how market shares change as attribute levels change, and whether preferences vary across the population.
#Installs required packages
if(!require(mlogit)) install.packages("mlogit",repos = "http://cran.us.r-project.org")
if(!require(MASS)) install.packages("MASS",repos = "http://cran.us.r-project.org")
if(!require(readxl)) install.packages("readxl",repos = "http://cran.us.r-project.org")
if(!require(tidyr)) install.packages("tidyr",repos = "http://cran.us.r-project.org")
if(!require(dplyr)) install.packages("dplyr",repos = "http://cran.us.r-project.org")
library(readr)
library(mlogit)
library(MASS)
library(tidyr)
library(dplyr)
set.seed(1234) # setting seed for random draws
# Load the choice data
cbc.df <- read.csv("/Users/ayandacollins/Desktop/MISDI/Marketing Analytics/MG4F2_11488/CBC_NEO v5.csv")
# View data frame
#View(cbc.df)
# Getting an overview of the data
summary(cbc.df)
## choice id numberofscreens1 numberofscreens2
## Min. :1.000 Min. : 1 Length:2260 Length:2260
## 1st Qu.:1.000 1st Qu.: 29 Class :character Class :character
## Median :2.000 Median : 57 Mode :character Mode :character
## Mean :2.041 Mean : 57
## 3rd Qu.:3.000 3rd Qu.: 85
## Max. :3.000 Max. :113
## numberofscreens3 screensize1 screensize2 screensize3
## Length:2260 Min. :13.0 Min. :13.0 Min. :13.0
## Class :character 1st Qu.:13.0 1st Qu.:13.0 1st Qu.:13.0
## Mode :character Median :15.0 Median :13.0 Median :13.0
## Mean :14.1 Mean :13.9 Mean :13.9
## 3rd Qu.:15.0 3rd Qu.:15.0 3rd Qu.:15.0
## Max. :15.0 Max. :15.0 Max. :15.0
## weight1 weight2 weight3 brand1
## Length:2260 Length:2260 Length:2260 Length:2260
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## brand2 brand3 pricing1 pricing2
## Length:2260 Length:2260 Min. : 700 Min. : 700
## Class :character Class :character 1st Qu.: 925 1st Qu.: 700
## Mode :character Mode :character Median :1000 Median :1000
## Mean :1060 Mean : 970
## 3rd Qu.:1300 3rd Qu.:1075
## Max. :1300 Max. :1300
## pricing3 X X.1 X.2
## Min. : 700 Min. :1 Min. :1 Min. :1
## 1st Qu.: 700 1st Qu.:1 1st Qu.:1 1st Qu.:1
## Median :1000 Median :1 Median :1 Median :1
## Mean : 970 Mean :1 Mean :1 Mean :1
## 3rd Qu.:1300 3rd Qu.:1 3rd Qu.:1 3rd Qu.:1
## Max. :1300 Max. :1 Max. :1 Max. :1
# Examining the structure of the data
str(cbc.df)
## 'data.frame': 2260 obs. of 20 variables:
## $ choice : int 2 3 3 1 1 2 3 3 1 2 ...
## $ id : int 1 1 1 1 1 1 1 1 1 1 ...
## $ numberofscreens1: chr "Dual-screen" "Single-screen" "Single-screen" "Dual-screen" ...
## $ numberofscreens2: chr "Dual-screen" "Single-screen" "Single-screen" "Single-screen" ...
## $ numberofscreens3: chr "Single-screen" "Dual-screen" "Dual-screen" "Single-screen" ...
## $ screensize1 : int 13 13 13 15 15 13 15 15 15 15 ...
## $ screensize2 : int 15 13 15 13 15 13 15 13 13 15 ...
## $ screensize3 : int 15 15 13 15 13 13 13 15 13 13 ...
## $ weight1 : chr "1kg" "1.5kg" "1.5kg" "1kg" ...
## $ weight2 : chr "1.5kg" "1.5kg" "1kg" "1kg" ...
## $ weight3 : chr "1.5kg" "1.5kg" "1.5kg" "1.5kg" ...
## $ brand1 : chr "Microsoft" "Apple" "Microsoft" "Microsoft" ...
## $ brand2 : chr "Microsoft" "Microsoft" "Apple" "Apple" ...
## $ brand3 : chr "Microsoft" "Lenovo" "Apple" "Lenovo" ...
## $ pricing1 : int 1300 1000 1000 1000 700 1300 1300 1300 700 1000 ...
## $ pricing2 : int 700 1000 1300 700 1000 1000 1300 1300 1000 700 ...
## $ pricing3 : int 1300 1000 700 700 1000 1000 700 700 700 1300 ...
## $ X : int 1 1 1 1 1 1 1 1 1 1 ...
## $ X.1 : int 1 1 1 1 1 1 1 1 1 1 ...
## $ X.2 : int 1 1 1 1 1 1 1 1 1 1 ...
You can see that the variables are not quite in the appropriate format: the attribute columns such as numberofscreens, weight, and brand are characters, while screensize and pricing are of type int. We need to reshape the data into the long format that mlogit expects and convert the attributes to factors, which we do below.
# Getting data ready for `mlogit` package
cbc.df$chid <- 1:nrow(cbc.df)
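# dfidx() (attached along with mlogit) reshapes the data from wide format (one row per
# choice task, with alternative-specific columns numberofscreens1-3 ... pricing1-3 sitting
# in columns 3 to 17, hence varying = 3:17) to long format with one row per alternative;
# choice = "choice" identifies the column recording which of the three alternatives was chosen.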
cbc.mlogit <- dfidx(cbc.df, idx = list(c("chid", "id")), choice = "choice", varying = 3:17, sep = "")
# Transforming variables (to factor)
cbc.mlogit$numberofscreens = as.factor(cbc.mlogit$numberofscreens)
cbc.mlogit$screensize = as.factor(cbc.mlogit$screensize)
cbc.mlogit$weight = as.factor(cbc.mlogit$weight)
cbc.mlogit$brand = as.factor(cbc.mlogit$brand)
cbc.mlogit$price_fac = as.factor(cbc.mlogit$pricing)
# Examining the structure of the data
str(cbc.mlogit)
## Classes 'dfidx' and 'data.frame': 6780 obs. of 11 variables:
## $ choice : 'xseries' logi FALSE TRUE FALSE FALSE FALSE TRUE ...
## ..- attr(*, "idx")=Classes 'idx' and 'data.frame': 6780 obs. of 3 variables:
## .. ..$ chid: int [1:6780] 1 1 1 2 2 2 3 3 3 4 ...
## .. ..$ id : int [1:6780] 1 1 1 1 1 1 1 1 1 1 ...
## .. ..$ id2 : Factor w/ 3 levels "1","2","3": 1 2 3 1 2 3 1 2 3 1 ...
## .. ..- attr(*, "ids")= num [1:3] 1 1 2
## $ X : 'xseries' int 1 1 1 1 1 1 1 1 1 1 ...
## ..- attr(*, "idx")=Classes 'idx' and 'data.frame': 6780 obs. of 3 variables:
## .. ..$ chid: int [1:6780] 1 1 1 2 2 2 3 3 3 4 ...
## .. ..$ id : int [1:6780] 1 1 1 1 1 1 1 1 1 1 ...
## .. ..$ id2 : Factor w/ 3 levels "1","2","3": 1 2 3 1 2 3 1 2 3 1 ...
## .. ..- attr(*, "ids")= num [1:3] 1 1 2
## $ X.1 : 'xseries' int 1 1 1 1 1 1 1 1 1 1 ...
## ..- attr(*, "idx")=Classes 'idx' and 'data.frame': 6780 obs. of 3 variables:
## .. ..$ chid: int [1:6780] 1 1 1 2 2 2 3 3 3 4 ...
## .. ..$ id : int [1:6780] 1 1 1 1 1 1 1 1 1 1 ...
## .. ..$ id2 : Factor w/ 3 levels "1","2","3": 1 2 3 1 2 3 1 2 3 1 ...
## .. ..- attr(*, "ids")= num [1:3] 1 1 2
## $ X.2 : 'xseries' int 1 1 1 1 1 1 1 1 1 1 ...
## ..- attr(*, "idx")=Classes 'idx' and 'data.frame': 6780 obs. of 3 variables:
## .. ..$ chid: int [1:6780] 1 1 1 2 2 2 3 3 3 4 ...
## .. ..$ id : int [1:6780] 1 1 1 1 1 1 1 1 1 1 ...
## .. ..$ id2 : Factor w/ 3 levels "1","2","3": 1 2 3 1 2 3 1 2 3 1 ...
## .. ..- attr(*, "ids")= num [1:3] 1 1 2
## $ numberofscreens: Factor w/ 2 levels "Dual-screen",..: 1 1 2 2 2 1 2 2 1 1 ...
## ..- attr(*, "idx")=Classes 'idx' and 'data.frame': 6780 obs. of 3 variables:
## .. ..$ chid: int [1:6780] 1 1 1 2 2 2 3 3 3 4 ...
## .. ..$ id : int [1:6780] 1 1 1 1 1 1 1 1 1 1 ...
## .. ..$ id2 : Factor w/ 3 levels "1","2","3": 1 2 3 1 2 3 1 2 3 1 ...
## .. ..- attr(*, "ids")= num [1:3] 1 1 2
## $ screensize : Factor w/ 2 levels "13","15": 1 2 2 1 1 2 1 2 1 2 ...
## ..- attr(*, "idx")=Classes 'idx' and 'data.frame': 6780 obs. of 3 variables:
## .. ..$ chid: int [1:6780] 1 1 1 2 2 2 3 3 3 4 ...
## .. ..$ id : int [1:6780] 1 1 1 1 1 1 1 1 1 1 ...
## .. ..$ id2 : Factor w/ 3 levels "1","2","3": 1 2 3 1 2 3 1 2 3 1 ...
## .. ..- attr(*, "ids")= num [1:3] 1 1 2
## $ weight : Factor w/ 2 levels "1.5kg","1kg": 2 1 1 1 1 1 1 2 1 2 ...
## ..- attr(*, "idx")=Classes 'idx' and 'data.frame': 6780 obs. of 3 variables:
## .. ..$ chid: int [1:6780] 1 1 1 2 2 2 3 3 3 4 ...
## .. ..$ id : int [1:6780] 1 1 1 1 1 1 1 1 1 1 ...
## .. ..$ id2 : Factor w/ 3 levels "1","2","3": 1 2 3 1 2 3 1 2 3 1 ...
## .. ..- attr(*, "ids")= num [1:3] 1 1 2
## $ brand : Factor w/ 3 levels "Apple","Lenovo",..: 3 3 3 1 3 2 3 1 1 3 ...
## ..- attr(*, "idx")=Classes 'idx' and 'data.frame': 6780 obs. of 3 variables:
## .. ..$ chid: int [1:6780] 1 1 1 2 2 2 3 3 3 4 ...
## .. ..$ id : int [1:6780] 1 1 1 1 1 1 1 1 1 1 ...
## .. ..$ id2 : Factor w/ 3 levels "1","2","3": 1 2 3 1 2 3 1 2 3 1 ...
## .. ..- attr(*, "ids")= num [1:3] 1 1 2
## $ pricing : 'xseries' int 1300 700 1300 1000 1000 1000 1000 1300 700 1000 ...
## ..- attr(*, "idx")=Classes 'idx' and 'data.frame': 6780 obs. of 3 variables:
## .. ..$ chid: int [1:6780] 1 1 1 2 2 2 3 3 3 4 ...
## .. ..$ id : int [1:6780] 1 1 1 1 1 1 1 1 1 1 ...
## .. ..$ id2 : Factor w/ 3 levels "1","2","3": 1 2 3 1 2 3 1 2 3 1 ...
## .. ..- attr(*, "ids")= num [1:3] 1 1 2
## $ idx :Classes 'idx' and 'data.frame': 6780 obs. of 3 variables:
## ..$ chid: int 1 1 1 2 2 2 3 3 3 4 ...
## ..$ id : int 1 1 1 1 1 1 1 1 1 1 ...
## ..$ id2 : Factor w/ 3 levels "1","2","3": 1 2 3 1 2 3 1 2 3 1 ...
## ..- attr(*, "ids")= num [1:3] 1 1 2
## $ price_fac : Factor w/ 3 levels "700","1000","1300": 3 1 3 2 2 2 2 3 1 2 ...
## ..- attr(*, "idx")=Classes 'idx' and 'data.frame': 6780 obs. of 3 variables:
## .. ..$ chid: int [1:6780] 1 1 1 2 2 2 3 3 3 4 ...
## .. ..$ id : int [1:6780] 1 1 1 1 1 1 1 1 1 1 ...
## .. ..$ id2 : Factor w/ 3 levels "1","2","3": 1 2 3 1 2 3 1 2 3 1 ...
## .. ..- attr(*, "ids")= num [1:3] 1 1 2
## - attr(*, "clseries")= chr "xseries"
## - attr(*, "choice")= chr "choice"
We can get a sense of the results we are likely to get by looking at the raw data. In particular, let us examine choice counts, which tell us how many times each level of each attribute was picked across the sample.
# Obtaining choice counts for price_fac
xtabs(choice ~ price_fac, data=cbc.mlogit)
## price_fac
## 700 1000 1300
## 1282 713 265
People picked the $700 option most often, followed by $1000 and then $1300. Not surprising.
# Obtaining choice counts for brand
xtabs(choice ~ brand, data=cbc.mlogit)
## brand
## Apple Lenovo Microsoft
## 1185 233 842
# Obtaining choice counts for weight
xtabs(choice ~ weight, data=cbc.mlogit)
## weight
## 1.5kg 1kg
## 1124 1136
# Obtaining choice counts for number of screens
xtabs(choice ~ numberofscreens, data=cbc.mlogit)
## numberofscreens
## Dual-screen Single-screen
## 1199 1061
# Obtaining choice counts for screen size
xtabs(choice ~ screensize, data=cbc.mlogit)
## screensize
## 13 15
## 1132 1128
Time to analyse the data using a choice-based conjoint model. As mentioned earlier, we'll be using a logit model to analyse the data; specifically, a conditional logit model (you may also see this referred to as a multinomial logit, but the latter term is also used for a different model).
The logit model says that choice depends on number of screens, screen size, weight, brand, and price. Every attribute enters as a factor, and the intercept is set to zero.
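To make the model concrete: under the conditional logit, the probability that a respondent chooses profile j out of the three shown is exp(v_j) / (exp(v_1) + exp(v_2) + exp(v_3)), where v_j is the sum of the part-worth coefficients of profile j's attribute levels. The coefficients we estimate below are exactly those part-worths.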
# specifying the regression
fml1 = as.formula(choice ~ numberofscreens + screensize + weight + brand + price_fac | 0)
Observe above that we have set the alternative-specific intercepts to zero (the | 0 part of the formula). Because the three alternatives in each choice task are unlabelled, their position carries no meaning, so this makes no difference to the interpretation of the results.
# running the logit and estimating the choice model
m1 <- mlogit(fml1, data = cbc.mlogit)
summary(m1)
##
## Call:
## mlogit(formula = choice ~ numberofscreens + screensize + weight +
## brand + price_fac | 0, data = cbc.mlogit, method = "nr")
##
## Frequencies of alternatives:choice
## 1 2 3
## 0.31018 0.33850 0.35133
##
## nr method
## 5 iterations, 0h:0m:0s
## g'(-H)^-1g = 4.18E-05
## successive function values within tolerance limits
##
## Coefficients :
## Estimate Std. Error z-value Pr(>|z|)
## numberofscreensSingle-screen -0.760433 0.071325 -10.6615 < 2.2e-16 ***
## screensize15 0.354853 0.065656 5.4048 6.489e-08 ***
## weight1kg 0.246182 0.061042 4.0330 5.508e-05 ***
## brandLenovo -1.220215 0.089118 -13.6921 < 2.2e-16 ***
## brandMicrosoft -0.379922 0.078215 -4.8574 1.189e-06 ***
## price_fac1000 -1.001197 0.076237 -13.1327 < 2.2e-16 ***
## price_fac1300 -1.755387 0.081638 -21.5020 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Log-Likelihood: -1842.1
We get importance weights by calculating the range of part-worths within each attribute (with the omitted base level counting as zero) and dividing it by the total of these ranges across all attributes.
#calculating importance weights
iw.total <- abs(m1$coefficients["numberofscreensSingle-screen"]) + abs(m1$coefficients["screensize15"]) + abs(m1$coefficients["weight1kg"]) +abs(m1$coefficients["brandLenovo"]) +abs(m1$coefficients["price_fac1300"])
abs(m1$coefficients["numberofscreensSingle-screen"])/iw.total
## numberofscreensSingle-screen
## 0.1753333
abs(m1$coefficients["weight1kg"])/iw.total
## weight1kg
## 0.05676223
abs(m1$coefficients["screensize15"])/iw.total
## screensize15
## 0.08181869
abs(m1$coefficients["brandLenovo"])/iw.total
## brandLenovo
## 0.2813454
abs(m1$coefficients["price_fac1300"])/iw.total
## price_fac1300
## 0.4047403
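The calculation above picks out, for each attribute, the coefficient furthest from the base level (whose part-worth is zero). A slightly more compact sketch of the same computation, which works directly from the range of part-worths per attribute and reproduces the numbers above (the names are the coefficient names from m1), is:
# Alternative sketch: range of part-worths per attribute (base level = 0), normalised
pw <- coef(m1)
ranges <- c(
  numberofscreens = diff(range(c(0, pw["numberofscreensSingle-screen"]))),
  screensize      = diff(range(c(0, pw["screensize15"]))),
  weight          = diff(range(c(0, pw["weight1kg"]))),
  brand           = diff(range(c(0, pw["brandLenovo"], pw["brandMicrosoft"]))),
  price           = diff(range(c(0, pw["price_fac1000"], pw["price_fac1300"])))
)
round(ranges / sum(ranges), 3)  # importance weights, summing to 1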
There are two obvious approaches to calculating a monetary equivalent for the trade-offs we see in the estimates above. First, one could calculate the exchange rate using the estimates for price between any two levels; we have already seen how to do that in the lecture. The second approach is to estimate a model with price as a continuous variable (rather than as a factor with three levels, which is what we have had so far), and use the coefficient of price as our exchange rate. Below we briefly illustrate the first approach, then implement the second.
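A quick sketch of the first approach, using the m1 estimates shown above: moving from the $700 base level (utility 0) to $1300 costs about 1.755 utils for a $600 price increase, so one util is worth roughly $600 / 1.755, i.e. about $342.
# Sketch: dollars-per-util exchange rate implied by the factor-price model m1
600 / abs(coef(m1)["price_fac1300"])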
# Specifying the formula
fml2 = as.formula(choice ~ numberofscreens + weight + screensize + brand + pricing | 0)
# Actually running the logit
m2 <- mlogit(fml2, data = cbc.mlogit)
# Viewing results
summary(m2)
##
## Call:
## mlogit(formula = choice ~ numberofscreens + weight + screensize +
## brand + pricing | 0, data = cbc.mlogit, method = "nr")
##
## Frequencies of alternatives:choice
## 1 2 3
## 0.31018 0.33850 0.35133
##
## nr method
## 5 iterations, 0h:0m:0s
## g'(-H)^-1g = 4E-05
## successive function values within tolerance limits
##
## Coefficients :
## Estimate Std. Error z-value Pr(>|z|)
## numberofscreensSingle-screen -0.75851848 0.07101969 -10.6804 < 2.2e-16 ***
## weight1kg 0.22530096 0.05995695 3.7577 0.0001715 ***
## screensize15 0.32106158 0.06307076 5.0905 3.571e-07 ***
## brandLenovo -1.20910172 0.08916348 -13.5605 < 2.2e-16 ***
## brandMicrosoft -0.39116494 0.07749024 -5.0479 4.466e-07 ***
## pricing -0.00296180 0.00013571 -21.8251 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Log-Likelihood: -1843.7
Now let us use the exchange rate to monetise some trade-offs. Let us calculate the willingness to pay for a larger screen (i.e., going from a 13-inch to a 15-inch screen). Since the price coefficient is about -0.00296 per dollar, one util is worth roughly 1/0.00296, i.e. about $338, and going from 13 inches to 15 inches gives the consumer an extra 0.32 utils, so the larger screen is worth about 0.32 * 338 = $108.
Note. Since price is continuous, you can also get the willingness to pay for any attribute directly by dividing the coefficient of that attribute by the (negative of the) price coefficient. This is shown below.
# Computing willingness to pay
coef(m2)["numberofscreensSingle-screen"]/ (-coef(m2)["pricing"]/1)
## numberofscreensSingle-screen
## -256.1009
coef(m2)["weight1kg"]/ (-coef(m2)["pricing"]/1)
## weight1kg
## 76.06905
coef(m2)["screensize15"]/ (-coef(m2)["pricing"]/1)
## screensize15
## 108.401
coef(m2)["brandLenovo"]/ (-coef(m2)["pricing"]/1)
## brandLenovo
## -408.2327
coef(m2)["brandMicrosoft"]/ (-coef(m2)["pricing"]/1)
## brandMicrosoft
## -132.0702
Before we proceed, it is good to do a test to check that the model with price as a continuous variable is not very different, statistically, from the one where price was a factor. The formal way of doing this is to conduct a “likelihood ratio test”, which is done below.
# Compare models with and without price as continuous
lrtest(m1, m2)
Note that the null hypothesis, that the two models are statistically equivalent, is not rejected at the 5% level: the continuous-price model's log-likelihood (-1843.7) is only slightly below that of the factor-price model (-1842.1). In what follows we'll just use the model with continuous price.
# Creating a list of all the attributes and their levels
attrib <- list(numberofscreens = c("Single-screen", "Dual-screen"),
weight = c("1kg", "1.5kg"),
screensize = c("15", "13"),brand = c("Apple", "Lenovo","Microsoft"),
price_fac = c("700", "1000", "1300"))
## Create a full model matrix from the specs below (or whatever you want as market-space)
#### screens screensize weight brand price
#### 1 Dual 15 1kg Lenovo 1000
#### 2 Single 13 1kg Microsoft 1000
#### 3 Dual 15 1.5kg Lenovo 1000
#### 4 Single 13 1.5kg Apple 700
#### 5 Single 15 1kg Apple 700
#### 6 Dual 13 1kg Lenovo 1300
Observe above that we have specified six profiles as the competitive set. These are just illustrative choices; the manager would create an appropriate set of profiles reflecting real-world conditions. The next few steps essentially create a matrix describing these profiles, to feed them into the prediction function.
For these six products, we will see what predicted market shares look like.
# First, create an empty Market Space Data Frame with ALL the columnnames in order
msdf1 <- data.frame(matrix(ncol = 2+2+2+3+3, nrow = 0))
c1 <- c("numberofscreensSingle-screen", "numberofscreensDual-screen", "weight1kg", "weight1.5kg", "screensize15", "screensize13", "brandLenovo", "brandMicrosoft", "brandApple", "price_fac700", "price_fac1000","price_fac1300")
colnames(msdf1) <- c1
msdf1
## Second, append the data points appropriately (reproduces choice set from above)
### column order: screens (Single, Dual); weight (1kg, 1.5kg); screensize (15, 13); brand (Lenovo, Microsoft, Apple); price (700, 1000, 1300)
msdf1[1, ] <- c( 0,1, 1,0, 1,0, 1,0,0, 0,1,0)
msdf1[2, ] <- c( 1, 0, 1,0, 0,1, 0,1,0, 0,1,0)
msdf1[3, ] <- c( 0, 1, 0,1, 1,0, 1,0,0, 0,1,0)
msdf1[4, ] <- c( 1, 0, 0,1, 0,1, 0,0,1, 1,0,0)
msdf1[5, ] <- c( 1, 0, 1,0, 1,0, 0,0,1, 1,0,0)
msdf1[6, ] <- c( 0, 1, 1,0, 0,1, 1,0,0, 0,0,1)
# Third, drop the columns that the logit model uses as the base level for each attribute
# here from m1 they are: numberofscreensDual-screen, screensize13, weight1.5kg, brandApple, price_fac700
drops <- c("numberofscreensDual-screen", "screensize13", "weight1.5kg", "brandApple","price_fac700")
msdf2 <- msdf1[ , !(names(msdf1) %in% drops)]
#View(msdf2)
# Fourth, create a readable form of the matrix above
new_data <- msdf1
new_data$weight <- ifelse(new_data$weight1kg==1,1, ifelse(new_data$weight1.5kg==1,1.5,0))
new_data$screensize <- ifelse(new_data$screensize15==1,15, ifelse(new_data$screensize13==1,13,0))
new_data$price_fac <- ifelse(new_data$price_fac700==1,700, ifelse(new_data$price_fac1000==1,1000, ifelse(new_data$price_fac1300==1,1300,0)))
new_data$brand <- ifelse(new_data$brandLenovo==1,"Lenovo", ifelse(new_data$brandMicrosoft==1,"Microsoft", ifelse(new_data$brandApple==1,"Apple",0)))
new_data$nscreens <- ifelse(new_data[["numberofscreensSingle-screen"]]==1,1, ifelse(new_data[["numberofscreensDual-screen"]]==1,2,0))
keeps <- c("nscreens", "screensize", "weight","brand", "price_fac")
new_data <- new_data[ , (names(new_data) %in% keeps)]
#View(new_data)
We now have everything in place to run the prediction.
# Obtaining market share
predict.mnl_2(m1, new_data, msdf2)
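Note that predict.mnl_2() is a custom helper that is not defined in this notebook; presumably it computes predicted shares for the supplied profiles. As a rough idea of what such a share predictor might do, here is a minimal sketch under that assumption. The function name predict_shares_sketch is ours (not part of any package); it assumes the design matrix's columns can be matched by name to coef(m1), computes each profile's utility from the part-worths, and applies the logit share rule.
# Sketch of a logit share predictor (hypothetical stand-in for predict.mnl_2)
predict_shares_sketch <- function(model, profiles, design) {
  design <- as.matrix(design[, names(coef(model)), drop = FALSE])  # align columns with coefficient names
  utility <- design %*% coef(model)                                # deterministic utility of each profile
  share <- as.vector(exp(utility) / sum(exp(utility)))             # logit share rule
  data.frame(share = round(share, 3), profiles)
}
# e.g. predict_shares_sketch(m1, new_data, msdf2)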
What we have above is the basis of a very useful market simulator. We can use it to conduct a real-world thought experiment. Suppose the manager wants to know how market shares for her product will vary as she varies the levels of each attribute. The competitive set is as defined before, and we will not be changing the attributes of competitor products; that too can easily be done (e.g., to think through what your market share would be if you changed attributes and competitors responded with changes of their own). A rough sketch of such a sensitivity exercise is given below.
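This is only a sketch: it assumes that profile 1 in msdf2 is "our" product, and it uses the hypothetical predict_shares_sketch() helper from above in place of predict.mnl_2().
# Sketch: vary profile 1's price across the three tested levels and record its share
sens_price <- sapply(c("700", "1000", "1300"), function(p) {
  d <- msdf2
  d[1, c("price_fac1000", "price_fac1300")] <- 0     # reset profile 1's price dummies ($700 is the base)
  if (p != "700") d[1, paste0("price_fac", p)] <- 1  # switch on the chosen price level
  predict_shares_sketch(m1, new_data, d)[1, "share"] # predicted share of profile 1
})
sens_price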
We have answered most of the questions we had posed at the beginning of this notebook. We know consumer preferences across various levels of each attribute, we know the relative importance of various attributes, we know what the monetary trade-off between levels of attributes are, and we know how market shares vary as we change attribute levels. What follows is slightly more advanced material which looks at the last remaining question, i.e., do consumer preferences vary in the population (i.e., is there heterogeneity). This is the basis of coming up with segmentation schemes, with possibly differentiated product offerings to these segments.
Note that in our choice models so far, we have assumed that all individuals have the same preferences, i.e., our estimates above were averages for the entire population. It is reasonable to think that this is not the case – the very existence of segments, in fact, depends on this not being the case. One can do a couple of things to enrich the model with a view to accommodating heterogeneity in preferences. First, we can run the model separately for each individual, to get a set of individual-level parameter estimates. This is the easiest thing to do, but is not the most efficient; for one, there are only 20 observations (choice tasks) per individual, which means the parameters may not be estimated with precision. The second is to model the coefficients as being drawn from a distribution, so that every individual has a different set of coefficients. The advantage of this is that one only needs to estimate the mean and variance of the distribution for each coefficient, rather than a complete set of coefficients for each individual. We show the extension for the model with price as a continuous variable. Nothing changes if you were to use price as a factor – you'll just use a different base model.
# Specifying a normal distribution for all coefficients.
m2.rpar <- rep("n", length=length(m2$coef))
names(m2.rpar) <- names(m2$coef)
names(m2.rpar)
## [1] "numberofscreensSingle-screen" "weight1kg"
## [3] "screensize15" "brandLenovo"
## [5] "brandMicrosoft" "pricing"
The above just tells the model to use a normal distribution for each coefficient. In other words, we are assuming that the preference for each level varies according to a normal distribution in the population.
# Run hierarchical model
m2.hier <- mlogit(fml2, data = cbc.mlogit, rpar=c("numberofscreensSingle-screen" = 'n', "weight1kg" = 'n', "screensize15" = 'n', "brandLenovo" = 'n', "brandMicrosoft" = 'n', "pricing" = 'n'), panel = TRUE, correlation=FALSE)
We are telling the program that we have multiple observations per respondent (panel = TRUE). In addition, we are specifying that preferences across attributes are not correlated with each other, i.e., an individual's preference for a single screen over a dual screen is independent of her preference for a 15-inch over a 13-inch display (correlation = FALSE).
# View results of logit estimation
summary(m2.hier)
##
## Call:
## mlogit(formula = choice ~ numberofscreens + weight + screensize +
## brand + pricing | 0, data = cbc.mlogit, rpar = c(`numberofscreensSingle-screen` = "n",
## weight1kg = "n", screensize15 = "n", brandLenovo = "n", brandMicrosoft = "n",
## pricing = "n"), correlation = FALSE, panel = TRUE)
##
## Frequencies of alternatives:choice
## 1 2 3
## 0.31018 0.33850 0.35133
##
## bfgs method
## 28 iterations, 0h:0m:18s
## g'(-H)^-1g = 2.63E-07
## gradient close to zero
##
## Coefficients :
## Estimate Std. Error z-value Pr(>|z|)
## numberofscreensSingle-screen -1.26129742 0.09929722 -12.7022 < 2.2e-16 ***
## weight1kg 0.36308434 0.07759978 4.6789 2.884e-06 ***
## screensize15 0.45314163 0.08041281 5.6352 1.749e-08 ***
## brandLenovo -1.83434052 0.13810318 -13.2824 < 2.2e-16 ***
## brandMicrosoft -0.24552977 0.09877481 -2.4858 0.01293 *
## pricing -0.00390228 0.00019877 -19.6325 < 2.2e-16 ***
## sd.numberofscreensSingle-screen -1.59964799 0.12655884 -12.6396 < 2.2e-16 ***
## sd.weight1kg 0.08822157 0.09695232 0.9099 0.36285
## sd.screensize15 -0.27229152 0.10751730 -2.5325 0.01132 *
## sd.brandLenovo -2.14786032 0.15064172 -14.2581 < 2.2e-16 ***
## sd.brandMicrosoft 1.87776190 0.15358871 12.2259 < 2.2e-16 ***
## sd.pricing -0.00410818 0.00027900 -14.7245 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Log-Likelihood: -1556.2
##
## random coefficients
## Min. 1st Qu. Median Mean
## numberofscreensSingle-screen -Inf -2.340243596 -1.261297425 -1.261297425
## weight1kg -Inf 0.303579795 0.363084338 0.363084338
## screensize15 -Inf 0.269483784 0.453141626 0.453141626
## brandLenovo -Inf -3.283050293 -1.834340524 -1.834340524
## brandMicrosoft -Inf -1.512060925 -0.245529767 -0.245529767
## pricing -Inf -0.006673209 -0.003902284 -0.003902284
## 3rd Qu. Max.
## numberofscreensSingle-screen -0.182351253 Inf
## weight1kg 0.422588881 Inf
## screensize15 0.636799468 Inf
## brandLenovo -0.385630755 Inf
## brandMicrosoft 1.021001391 Inf
## pricing -0.001131359 Inf
# update model to account for correlated preferences
m3.hier <- update(m2.hier, correlation=TRUE)
summary(m3.hier)
##
## Call:
## mlogit(formula = choice ~ numberofscreens + weight + screensize +
## brand + pricing | 0, data = cbc.mlogit, rpar = c(`numberofscreensSingle-screen` = "n",
## weight1kg = "n", screensize15 = "n", brandLenovo = "n", brandMicrosoft = "n",
## pricing = "n"), correlation = TRUE, panel = TRUE)
##
## Frequencies of alternatives:choice
## 1 2 3
## 0.31018 0.33850 0.35133
##
## bfgs method
## 3 iterations, 0h:0m:15s
## g'(-H)^-1g = 5.13E+03
## last step couldn't find higher value
##
## Coefficients :
## Estimate
## numberofscreensSingle-screen -1.29085293
## weight1kg 1.21433156
## screensize15 1.99100413
## brandLenovo -1.69615322
## brandMicrosoft 0.08219975
## pricing 0.00105614
## chol.numberofscreensSingle-screen:numberofscreensSingle-screen -0.54051253
## chol.numberofscreensSingle-screen:weight1kg -0.18577910
## chol.weight1kg:weight1kg 0.40647270
## chol.numberofscreensSingle-screen:screensize15 0.56960851
## chol.weight1kg:screensize15 0.48075027
## chol.screensize15:screensize15 -0.12128043
## chol.numberofscreensSingle-screen:brandLenovo 1.75236845
## chol.weight1kg:brandLenovo 1.36980956
## chol.screensize15:brandLenovo 1.84972819
## chol.brandLenovo:brandLenovo 0.82230419
## chol.numberofscreensSingle-screen:brandMicrosoft 1.05983288
## chol.weight1kg:brandMicrosoft 1.47480211
## chol.screensize15:brandMicrosoft 1.65952359
## chol.brandLenovo:brandMicrosoft 0.90302256
## chol.brandMicrosoft:brandMicrosoft 1.28598704
## chol.numberofscreensSingle-screen:pricing 0.09544325
## chol.weight1kg:pricing 0.09454956
## chol.screensize15:pricing 0.09573733
## chol.brandLenovo:pricing 0.09588848
## chol.brandMicrosoft:pricing 0.09314978
## chol.pricing:pricing 0.09662673
## Std. Error
## numberofscreensSingle-screen 0.06555700
## weight1kg 0.05941428
## screensize15 0.06376989
## brandLenovo 0.08937329
## brandMicrosoft 0.07648593
## pricing 0.00010059
## chol.numberofscreensSingle-screen:numberofscreensSingle-screen 0.06833123
## chol.numberofscreensSingle-screen:weight1kg 0.06787648
## chol.weight1kg:weight1kg 0.06652760
## chol.numberofscreensSingle-screen:screensize15 0.06852054
## chol.weight1kg:screensize15 0.06918265
## chol.screensize15:screensize15 0.07883235
## chol.numberofscreensSingle-screen:brandLenovo 0.11724981
## chol.weight1kg:brandLenovo 0.11172375
## chol.screensize15:brandLenovo 0.13008660
## chol.brandLenovo:brandLenovo 0.11368189
## chol.numberofscreensSingle-screen:brandMicrosoft 0.09584108
## chol.weight1kg:brandMicrosoft 0.09990137
## chol.screensize15:brandMicrosoft 0.09888237
## chol.brandLenovo:brandMicrosoft 0.10122799
## chol.brandMicrosoft:brandMicrosoft 0.10141139
## chol.numberofscreensSingle-screen:pricing 0.00018666
## chol.weight1kg:pricing 0.00018630
## chol.screensize15:pricing 0.00018404
## chol.brandLenovo:pricing 0.00018181
## chol.brandMicrosoft:pricing 0.00019490
## chol.pricing:pricing 0.00017887
## z-value
## numberofscreensSingle-screen -19.6905
## weight1kg 20.4384
## screensize15 31.2217
## brandLenovo -18.9783
## brandMicrosoft 1.0747
## pricing 10.4992
## chol.numberofscreensSingle-screen:numberofscreensSingle-screen -7.9102
## chol.numberofscreensSingle-screen:weight1kg -2.7370
## chol.weight1kg:weight1kg 6.1098
## chol.numberofscreensSingle-screen:screensize15 8.3130
## chol.weight1kg:screensize15 6.9490
## chol.screensize15:screensize15 -1.5385
## chol.numberofscreensSingle-screen:brandLenovo 14.9456
## chol.weight1kg:brandLenovo 12.2607
## chol.screensize15:brandLenovo 14.2192
## chol.brandLenovo:brandLenovo 7.2334
## chol.numberofscreensSingle-screen:brandMicrosoft 11.0582
## chol.weight1kg:brandMicrosoft 14.7626
## chol.screensize15:brandMicrosoft 16.7828
## chol.brandLenovo:brandMicrosoft 8.9207
## chol.brandMicrosoft:brandMicrosoft 12.6809
## chol.numberofscreensSingle-screen:pricing 511.3307
## chol.weight1kg:pricing 507.5216
## chol.screensize15:pricing 520.2022
## chol.brandLenovo:pricing 527.4227
## chol.brandMicrosoft:pricing 477.9419
## chol.pricing:pricing 540.1927
## Pr(>|z|)
## numberofscreensSingle-screen < 2.2e-16 ***
## weight1kg < 2.2e-16 ***
## screensize15 < 2.2e-16 ***
## brandLenovo < 2.2e-16 ***
## brandMicrosoft 0.2825
## pricing < 2.2e-16 ***
## chol.numberofscreensSingle-screen:numberofscreensSingle-screen 2.665e-15 ***
## chol.numberofscreensSingle-screen:weight1kg 0.0062 **
## chol.weight1kg:weight1kg 9.973e-10 ***
## chol.numberofscreensSingle-screen:screensize15 < 2.2e-16 ***
## chol.weight1kg:screensize15 3.679e-12 ***
## chol.screensize15:screensize15 0.1239
## chol.numberofscreensSingle-screen:brandLenovo < 2.2e-16 ***
## chol.weight1kg:brandLenovo < 2.2e-16 ***
## chol.screensize15:brandLenovo < 2.2e-16 ***
## chol.brandLenovo:brandLenovo 4.712e-13 ***
## chol.numberofscreensSingle-screen:brandMicrosoft < 2.2e-16 ***
## chol.weight1kg:brandMicrosoft < 2.2e-16 ***
## chol.screensize15:brandMicrosoft < 2.2e-16 ***
## chol.brandLenovo:brandMicrosoft < 2.2e-16 ***
## chol.brandMicrosoft:brandMicrosoft < 2.2e-16 ***
## chol.numberofscreensSingle-screen:pricing < 2.2e-16 ***
## chol.weight1kg:pricing < 2.2e-16 ***
## chol.screensize15:pricing < 2.2e-16 ***
## chol.brandLenovo:pricing < 2.2e-16 ***
## chol.brandMicrosoft:pricing < 2.2e-16 ***
## chol.pricing:pricing < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Log-Likelihood: -3057.1
##
## random coefficients
## Min. 1st Qu. Median Mean
## numberofscreensSingle-screen -Inf -1.6554231 -1.290852927 -1.290852927
## weight1kg -Inf 0.9128913 1.214331565 1.214331565
## screensize15 -Inf 1.4816489 1.991004133 1.991004133
## brandLenovo -Inf -3.7246582 -1.696153220 -1.696153220
## brandMicrosoft -Inf -1.8867464 0.082199745 0.082199745
## pricing -Inf -0.1562936 0.001056143 0.001056143
## 3rd Qu. Max.
## numberofscreensSingle-screen -0.9262828 Inf
## weight1kg 1.5157718 Inf
## screensize15 2.5003594 Inf
## brandLenovo 0.3323518 Inf
## brandMicrosoft 2.0511459 Inf
## pricing 0.1584059 Inf
As you can see, we now have many more estimates, reflecting the fact that we are now allowing preferences for different attributes to be correlated with each other. A note of caution: the optimiser reports that the last step could not find a higher value, and the log-likelihood (-3057.1) is far worse than that of the uncorrelated model (-1556.2), so this particular run has not converged properly and its estimates should be treated with care.
Aside. In terms of predicting market shares, there is nothing conceptually different about predicting market shares in a model with heterogeneity versus one without (that we did earlier). Computationally, the added complication in a model with heterogeneity is that we now need to use each individual’s estimated preferences, compute her choices, then aggregate over the population to get share predictions. End Aside
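As a rough illustration of that simulation logic, the sketch below draws hypothetical respondents from the estimated normal distributions of the uncorrelated model m2.hier and averages their implied logit shares. This is a simplification under stated assumptions: it ignores the respondent-level conditional estimates and any correlation between coefficients, the function name is ours, and the design matrix passed in must have one column per mean coefficient of m2.hier (so a continuous pricing column in dollars, not the price dummies).
# Sketch: share prediction under heterogeneity by simulating respondents from the
# estimated normal distributions of the uncorrelated model m2.hier
predict_hier_shares_sketch <- function(model, design, ndraws = 1000) {
  cf <- coef(model)
  mu <- cf[!grepl("^sd\\.", names(cf))]              # estimated population means
  sdev <- abs(cf[paste0("sd.", names(mu))])          # matching standard deviations (sign is not identified)
  design <- as.matrix(design[, names(mu), drop = FALSE])
  shares <- matrix(0, nrow(design), ndraws)
  for (r in 1:ndraws) {
    beta <- rnorm(length(mu), mean = mu, sd = sdev)  # one simulated respondent's part-worths
    u <- design %*% beta
    shares[, r] <- exp(u) / sum(exp(u))              # that respondent's logit shares
  }
  rowMeans(shares)                                   # average over simulated respondents
}
# Usage sketch (hypothetical): build a design matrix for the six profiles with a
# continuous pricing column, then call the function
# design.h <- cbind(msdf2[, !grepl("^price_fac", names(msdf2))], pricing = new_data$price_fac)
# predict_hier_shares_sketch(m2.hier, design.h, ndraws = 2000)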
We’ve seen above that there is significant heterogeneity in the population. How can we use this to come up with a possible segmentation scheme? Here we turn to another technique we’ve used before, i.e., cluster analysis.
What do we have to use as input data for the clustering algorithm? We certainly don't have any customer descriptors in the traditional sense (e.g., demographics). However, we do have something very powerful, namely estimates of each respondent's preferences for the various attributes.
We start by getting the individual-level estimates in a useful form.
# Estimating a separate logit model for each respondent (ids 1 to 113)
coefs.resp <- data.frame()
for (i in 1:113) {
  cbc.mlogit.resp <- cbc.mlogit[which(cbc.mlogit$id == i), ]
  m.resp <- mlogit(fml2, data = cbc.mlogit.resp)
  coefs.resp <- rbind(coefs.resp, data.frame(t(m.resp$coefficients)))
}
# View dataframe
#View(coefs.resp)
We now go through the standard k-means clustering approach we've used before (i.e., standardising the coefficients, producing an elbow plot, and deciding on the number of clusters).
# read library tidyr
library(tidyr)
library(dplyr)
# standardizing the data
coefs.resp.sc <- scale(coefs.resp)
summary(coefs.resp.sc)
## numberofscreensSingle.screen weight1kg screensize15
## Min. :-5.2362 Min. :-3.3352 Min. :-3.58342
## 1st Qu.:-0.2449 1st Qu.:-0.2667 1st Qu.:-0.13370
## Median : 0.3843 Median :-0.1858 Median :-0.07031
## Mean : 0.0000 Mean : 0.0000 Mean : 0.00000
## 3rd Qu.: 0.4360 3rd Qu.: 0.2108 3rd Qu.: 0.23237
## Max. : 2.2908 Max. : 2.8062 Max. : 3.15402
## brandLenovo brandMicrosoft pricing
## Min. :-2.8244 Min. :-4.5308 Min. :-3.7968
## 1st Qu.:-0.4498 1st Qu.:-0.1148 1st Qu.:-0.2117
## Median : 0.3212 Median : 0.3310 Median : 0.5283
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6412 3rd Qu.: 0.3743 3rd Qu.: 0.6403
## Max. : 2.0414 Max. : 2.5332 Max. : 1.7793
# set seed for kmeans clustering
set.seed(1234567)
# deciding the optimal number of clusters
SSE_curve <- c()
for (n in 1:10) {
kcluster <- kmeans(coefs.resp.sc, n)
sse <- sum(kcluster$withinss)
SSE_curve[n] <- sse
}
print("SSE curve for the ideal k value")
## [1] "SSE curve for the ideal k value"
plot(1:10, SSE_curve, type="b", xlab="Number of Clusters", ylab="SSE")
Hard to tell, but for now, let’s go with 3 clusters.
# doing k-means clustering.
kcluster <- kmeans(coefs.resp.sc,3)
kcluster$size
## [1] 6 19 88
clus <- kcluster$cluster
coefs.resp<-cbind(coefs.resp,clus)
summary(coefs.resp)
## numberofscreensSingle.screen weight1kg screensize15
## Min. :-280.81162 Min. :-51.5610 Min. :-74.8951
## 1st Qu.: -33.62683 1st Qu.: -0.5348 1st Qu.: -0.4843
## Median : -2.46276 Median : 0.8102 Median : 0.8830
## Mean : -21.49667 Mean : 3.8998 Mean : 2.3996
## 3rd Qu.: 0.09619 3rd Qu.: 7.4049 3rd Qu.: 7.4119
## Max. : 91.95192 Max. : 50.5641 Max. : 70.4319
## brandLenovo brandMicrosoft pricing clus
## Min. :-172.24172 Min. :-179.8921 Min. :-0.466063 Min. :1.000
## 1st Qu.: -54.22085 1st Qu.: -17.5478 1st Qu.:-0.092269 1st Qu.:3.000
## Median : -15.89499 Median : -1.1561 Median :-0.015107 Median :3.000
## Mean : -31.86205 Mean : -13.3260 Mean :-0.070192 Mean :2.726
## 3rd Qu.: 0.00744 3rd Qu.: 0.4335 3rd Qu.:-0.003433 3rd Qu.:3.000
## Max. : 69.60300 Max. : 79.8026 Max. : 0.115330 Max. :3.000
# comparing clusters, shows the mean values in groups:
aggregate(coefs.resp, list(coefs.resp$clus), function(x) mean(as.numeric(x)))
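To use these segments downstream, it helps to keep the cluster assignment alongside the respondent id. A small sketch (it assumes, as in the estimation loop above, that the rows of coefs.resp correspond to ids 1 to 113 in order):
# Sketch: attach cluster membership to respondent ids for later profiling
segments <- data.frame(id = 1:113, segment = clus)
head(segments)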