Connect with me on Open Science Framework | Contact me via LinkedIn
R analysis script for analyzing Alpha (distribution of attribute level-wise means of the participants’ utilities) across successive iteration steps. Estimation uses the PlayStation 4 bundle example (see Chapter 4 in lecture). Estimation was done in Sawtooth Software Lighthouse Studio v.9.8
Beware: R is a context-sensitive language. Thus, ‘data’ will be interpreted not in the same way as ‘Data’ will.
In R most functionality is provided by additional packages.
Most of the packages are well documented, See: https://cran.r-project.org/
The code chunk below first evaluates if the package pacman is already installed to your machine. If yes, the corresponding package will be loaded. If not, R will install the package.
Alternatively, you can do this manually first by executing install.packages(“pacman”) and then library(pacman).
The second line then loads the package pacman
The third line uses the function p_load() from the pacman package to install (if necessary) and load all packages that we provide as arguments. We can see which arguments a function understands by pressing ‘F1’ while setting the cursor into the function’s name.
if(!"pacman" %in% rownames(installed.packages())) install.packages("pacman")
library(pacman)
pacman::p_load(tidyverse, haven, labelled, knitr, bayestestR, bayesplot)
In case you are not able to load a package, try to first install it via for example ‘install.packages(foreign)’ whereby foreign is the name of the package.
Here is the r session info:
sessionInfo()
## R version 4.0.2 (2020-06-22)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19041)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=German_Germany.1252 LC_CTYPE=German_Germany.1252
## [3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C
## [5] LC_TIME=German_Germany.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] bayesplot_1.7.2 bayestestR_0.7.2 knitr_1.29 labelled_2.5.0
## [5] haven_2.3.1 forcats_0.5.0 stringr_1.4.0 dplyr_1.0.1
## [9] purrr_0.3.4 readr_1.3.1 tidyr_1.1.2 tibble_3.0.3
## [13] ggplot2_3.3.2 tidyverse_1.3.0 pacman_0.5.1
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.1.0 xfun_0.16 colorspace_1.4-1 vctrs_0.3.2
## [5] generics_0.0.2 htmltools_0.5.0 yaml_2.2.1 blob_1.2.1
## [9] rlang_0.4.7 pillar_1.4.6 glue_1.4.1 withr_2.2.0
## [13] DBI_1.1.0 dbplyr_1.4.4 modelr_0.1.8 readxl_1.3.1
## [17] plyr_1.8.6 lifecycle_0.2.0 munsell_0.5.0 gtable_0.3.0
## [21] cellranger_1.1.0 rvest_0.3.6 evaluate_0.14 fansi_0.4.1
## [25] broom_0.7.0.9001 Rcpp_1.0.5 scales_1.1.1 backports_1.1.8
## [29] jsonlite_1.7.1 fs_1.5.0 hms_0.5.3 digest_0.6.25
## [33] stringi_1.4.6 insight_0.9.0 grid_4.0.2 cli_2.0.2
## [37] tools_4.0.2 magrittr_1.5 crayon_1.3.4 pkgconfig_2.0.3
## [41] ellipsis_0.3.1 xml2_1.3.2 ggridges_0.5.2 reprex_0.3.0
## [45] lubridate_1.7.9 assertthat_0.2.1 rmarkdown_2.3 httr_1.4.2
## [49] rstudioapi_0.11 R6_2.4.1 compiler_4.0.2
Make sure that the file ‘Playstation4_hb_alpha.csv’ (see E-Learning platform) is located at your RStudio project’s directory.
Save the file into a new dataframe called d
R is a functional language, meaning we could define a function that takes some arguments as input and delivers a result as an output.
We can see which arguments a function understands by pressing ‘F1’ while setting the cursor into the function’s name (i.e., read.csv() below).
In the code chunk below, we hand-in the file name as well as the arguments header, sep, and dec to the read.csv() function.
In this way we tell the function that:
To execute the call, just set the cursor somewhere in the corresponding code line and press ‘Ctr’+‘Enter’.
If you encounter any problems with the read functions, please try to copy your csv-file to the desktop and then use:
‘d<-read.csv(file.choose(), header = TRUE, sep = ’,’, dec = ’.’)’.
This will open a dialogue that allows for a manual selection of the file.
d<-read.csv("PlayStation4_hb_alpha.csv", header = TRUE, sep = ',', dec = '.')
Now, the object d (a dataframe) became part of our working environment. We can access the dataframe.
For Example we can view the dataframe d via the function View(). We can also preview the first part of the data set by head()
The code below returns the first six cases (head()) of d, while only considering the first 5 variables (d[,1:5]).
head(d[,1:5])
| ï..Iteration | X500.gigabyte | X1.terabyte | black | white |
|---|---|---|---|---|
| 1 | -0.0654039 | 0.0654039 | -0.0017922 | 0.0017922 |
| 2 | -0.0100344 | 0.0100344 | -0.0024558 | 0.0024558 |
| 3 | -0.0346806 | 0.0346806 | 0.0019213 | -0.0019213 |
| 4 | 0.0197559 | -0.0197559 | -0.0212265 | 0.0212265 |
| 5 | -0.0092769 | 0.0092769 | 0.0257134 | -0.0257134 |
| 6 | -0.0850587 | 0.0850587 | -0.0267374 | 0.0267374 |
We see that each row corresponds to one step of iteration. Keep in mind: We assume the Hastings-Algorithm to have converged from the step 40001 on.
Next, allocate appropriate variable names. The following code chunk sets the names of d (the variable names) to a vector of names we provide as strings. c() is a generic R function which combines its arguments to form a vector.
names(d) <- c("Iteration", "500_gigabyte", "1_terabyte", "black", "white", "OneDualShockController", "TwoDualShockController", "charging_station_for_2_PS4_controller", "PS4_wireless_Stereo_Headset", "no_accessories", "Far_Cry_Primal", "GTA_V", "Life_is_strange", "Tom_Clancy_The_Division", "no_action_adventure_game", "The_Witcher_3", "Fallout_4", "Final Fantasy_X_X2", "Dark_Souls_3", "no_role_playing_game", "Just_Dance_2016", "Guitar_Hero_Live_incl_Guitar", "FIFA_16", "no_game_for_family_companionship", "price_202.3", "price_775.489", "NONE")
Afterwards we could see the new names of d by calling the names() function.
The call head(names(d)) presents the first few names in the dataframe
head(names(d))
## [1] "Iteration" "500_gigabyte" "1_terabyte"
## [4] "black" "white" "OneDualShockController"
In R we can assess the elements of a dataframe in multiple ways.
For example we can assess the variable “500_gigabyte” by calling d$500_gigabyte.
Alternatively, as we know that the variable is the second one in the dataframe, we could also use d[,2].
Analogous, we can use d[2,] to return all variable values of the second case (row) in the dataframe d.
In addition, we can use d[2,2] to return the value of the second variable for the second case.
Finally, we can use d[1:4, 2] to return the values of the second variable for the first four cases (rows) in d.
You might have assumed that Bayesian tests are going to be much harder to understand than classical null hypothesis tests. This has not to be true!
Statistical testing for ACBC HB models requires a shift in mindset.
We examine the distribution of posterior Alpha draws to see if a strong majority of these draws falls on either one side or the other of the null hypothesis (Kruschke, 2015; Orme & Chrzan, 2017).
One thing market researchers are very familiar with are 95% confidence intervals of estimated parameters in regression models. Bayesian analysis of the posterior distribution of Alphas offers a similar interval, Bayesians commonly refer to as Highest-Density-Intervals (HDI).
The HDI indicates which points of the distribution are most credible, and which cover most of the distribution. Thus, the HDI summarizes the distribution by specifying an interval that spans most of the distribution, say 95% of it, such that every point inside the interval has higher credibility than any point outside the interval. Therefore, also the term credibility interval is very common.
To develop a 95% HDI, we simply sort the 80,000 draws following the Burn-In (warm-up) stage from lowest to highest and then pick out the lowest and highest 2.5% from each tail. Put differently, we search for the 2.5% and the 97.5% quantiles. That’s all!
Let’s show how to do this in R for the part worth utility of 1 Dualshock Controller included in the PS4 bundle.
The following code chunk used the function subset() with the arguments ‘d’ and ‘Iteration>40000’.
the subset() function selects the subset of cases in d for which the variable ‘Iteration’ has values higher than 40,000.
The resulting subset of data is then replacing the old dataframe d.
d <- subset(d, Iteration>40000)
Below, we call the hdi() function of the package bayestestR. We feed the function with 2 arguments:
in the code below, we explicitly call bayestestR::hdi() instead of hdi(). This is because other packages in the R universe also provides a function hdi() producing a different output (e.g., bayestestR). By calling bayestestR::hdi(), we make sure to apply the correct hdi() function.
bayestestR::hdi(d$OneDualShockController, 0.95)
## # Highest Density Interval
##
## 95% HDI
## --------------
## [-1.23, -0.66]
Interpretation: The lower limit of the 95% HDI is -1.2300384. The upper limit is -0.6591412. As the 95% HDI does not include 0, we can conclude that, at a confidence level of 95%, having only 1 instead of 2 Dualshock Controller with the PS4 bundle significantly decreases the likelihood of being chosen1.
R is especially strong in visualizing data. We can generate a plot that shows us the entire a posteriori distribution of Alpha values (if you not believe that HB draws in the upper model from a normal distribution), together with the limits of the 95% HDI.
A description of the syntax is given below.
Density_Plot <- ggplot(d, aes(x = d$OneDualShockController)) +
geom_density(alpha = .2, fill = "#31B404" ) +
labs(title = "A posteriori density of Alpha estimates") +
labs(x = "Part worth utilities", y = "Density") +
geom_vline(aes(xintercept = bayestestR::hdi(d$OneDualShockController, 0.95)$CI_low),
color = "red", linetype = "dashed", size = 1) +
geom_vline(aes(xintercept = bayestestR::hdi(d$OneDualShockController, 0.95)$CI_high),
color = "red", linetype = "dashed", size = 1) +
annotate(
geom = "text",
x = (bayestestR::hdi(d$OneDualShockController, 0.95)$CI_low) + 0.1, y = max(density(d$OneDualShockController)$y), label = round(bayestestR::hdi(d$OneDualShockController, 0.95)$CI_low, digits = 3)) +
annotate(
geom = "text",
x = (bayestestR::hdi(d$OneDualShockController, 0.95)$CI_high) + 0.1, y = max(density(d$OneDualShockController)$y), label = round(bayestestR::hdi(d$OneDualShockController, 0.95)$CI_high, digits = 3))
plot(Density_Plot)
Explanation of the syntax creating the above plot: We use the package ggpolot2, to create a density plot of the a posteriori Alpha draws. The first line uses the function ggplot() to set up a new plot named Density_Plot. We feed the function with the dataframe d and tell it furthermore to use the variable ‘d$OneDualShockController’ at the x-axes.
Line 2 adds the density line to the plot. The argument alpha=.2 and fill=“#FF6666” define a filling for the area under the curve that is semi-transparent (fill uses hexadecimal code for the color).
Lines 3 and 4 add a plot title as well as labels describing both axes.
Line 5 and 6 add a vertical line at the lower limit of the 95% HDI, which is red, dashed and of size 1.
Line 7 and 8 add an analogous line at the upper limit.
In the rest of the code, we use the annotate function to add text labels with the rounded values of the limits to the plot. The position of these text labels is govern by x and y coordinates. We set x to the limits of the 95% HDI and add some shift to the right (+0.1), whereas we set y to the maximum of the density function in the diagram.
Whereas in the example above, we have used the ggplot() package to create a density plot of the posterior distribution from the scratch, we next use the bayesplot package to make things easier.
First, we use the function bayesplot::mcmc_areas() to recreate the above density plot. Here, the function only uses 3 arguments:
mcmc_areas(d, pars = vars(OneDualShockController), prob = 0.95)
Interpretation: The bold blue vertical line constitutes the median of the distribution. The limits of the light-blue area represent the 95% HDI of the posterior draws. Similar to the approaches above, the 95% HDI does not include 0. We can conclude that, at a confidence level of 95%, having only 1 instead of 2 Dualshock Controller with the PS4 bundle significantly decreases the likelihood of being chosen.
The bayesplot package also provides a function to create density plots for multiple attribute levels simultaneously, namely mcmc_areas_ridges(). In the code chunk below, we call mcmc_areas_ridges() with values for 3 of its arguments:
Line 2-5 of the code below just add a title and a subtitle to the plot.
mcmc_areas_ridges(d, pars = vars(-Iteration, -NONE, -price_202.3, -price_775.489), prob = 0.95) +
ggplot2::labs(
title = "Posterior distributions",
subtitle = "with 95% HDI intervals"
)
Interpretation: The plot very much mirrors information seen before, but for multiple attribute levels at once. We can see, for example, that on average the color black is significantly increasing the choice likelihood of a PlayStation 4 bundle as compared to the color white. Conversely, differences in hard disk capacity (500GB vs. 1TB) are on average not significant, since the HDI’s for the corresponding attribute levels are both crossing 0. Within the category of action-adventures on average (5 levels: Far Cry Primal, GTA V, Life is Strange, Tom Clancy’s The Devision, and no action-adventure), GTA V spends significantly more utility as compared to the average utility of the attribute’s levels. Furthermore, the plot is very useful to highlight differences in uncertainty of point estimates for different attribute levels. By comparing the breadth of the density plots of different attribute levels, we have to acknowledge that point estimates for levels with a thin density plot are of a higher precision as compared to those point estimates with an underlying density of greater width. Thus, we can trust more in the point estimates for the color white as compared to the point estimates for Just Dance 2016. Although both attribute levels are decreasing choice likelihood, the estimates for a white color of the console are affected by less uncertainty. Uncertainty in this context can also indicate that different consumer segments exist in the sample, which do hold different levels of negative utility for Just Dance 2016.
Finally, the bayesplot package also provides a function to solely illustrate HDI’s of multiple attribute levels, namely mcmc_intervals(). In the code chunk below, we call mcmc_intervals() using 4 arguments:
mcmc_intervals(d, pars = vars(-Iteration, -NONE, -price_202.3, -price_775.489), prob_outer = 0.95, prob = 0.89)
Interpretation: We can obtain the same information as before, without information on the form of distribution of attribute levels.
Next, we can evaluate how many percent of the Alpha draws for a part worth utility actually falls below the value of 0.
This time, we will use the part worth utility of the attribute level ‘Wireless Stereo Headset’ for the attribute ‘Accessories’ describing the PS4 bundles. This attribute also has two other levels: ‘Charging Station’ and ‘no accessories’. Thus, evaluating whether we can be very sure that the Alphas for ‘Wireless Stereo Headset’ fall below 0 means evaluating whether we can be sure that this accessory spends less utility to consumers as compared to the average utility any considered level of the attribute is doing (utilities within an attribute are zero-centered).
In the code chunk below use the function nrow() to calculate the fraction of posteriori draws that fall below 0. Here, nrow(d[d$PS4_wireless_Stereo_Headset<0, ]) counts the number of iterations a Alpha draw is less than 0. Dividing by nrow(d), which gives the total number of iterations in d, we obtain the fraction of draws falling below 0.
nrow(d[d$PS4_wireless_Stereo_Headset < 0, ]) / nrow(d) * 100
## [1] 98.1325
Interpretation: We see that 98.1325% of the draws are lower than 0. Thus, we can be very sure that adding a Wireless Stereo Headset to the PS4 bundle is actually decreasing total utility of the bundles as compared to the other options for the attribute (‘Charging Station’ and ‘no accessories’). The Bayesian p-value would correspond to 0.018675, which one might call a significant result. Keep in mind: if we test if a certain attribute level’s utility is < 0, we are testing whether this level’s utility is lower than the average utility across all attribute levels of the corresponding attribute (utilities are mean-centered within an attribute).
We can apply the same logic in order to evaluate whether, on average, the utility of one attribute level A is significantly lower as compared to another level B within the same attribute (Keep in mind: We ca not compare magnitude of raw Alpha draws across different attributes).
To do this, we just count the fraction of iterations after the Warm-up stage for which the Alpha draw of attribute level A is lower than those for attribute level B.
In this illustration we focus on the attribute ‘Action-Adventure’. We assess if the game ‘Tom Clancy’s The Division’ spends significantly less utility as compared to the game ‘Far Cry Primal’.
nrow(d[d$Tom_Clancy_The_Division < d$Far_Cry_Primal, ]) / nrow(d) * 100
## [1] 59.08625
Interpretation: Only in slightly more than one half of iterations the Alpha draws for ‘Tom Clancy’s The Division’ fall below those obtained for ‘Far Cry Primal’. The Bayesian p-value corresponds to 0.409. Thus, the former game can not be assumed to be significantly less preferred as compared to the latter.
Think back to our lecture. Just as we were able to use the individual Beta draws for each consumer to calculate the total utility of a product bundle for that consumers, we can now use the Alpha draws to calculate the total utility for an average consumer.
Again, simply counting the number of iterations after Burn-In for which the total utility of Bundle A is higher as the total utility of bundle B allows for an assessment of significant differences from a Bayesian perspective.
Imagine, we want to evaluate if A, on average is significantly more liked than B
The first thing we do is to calculate the total utilities across each attribute except for the price in each iteration step after the Warm-up stage. Therefore: we simply add the Alphas of the corresponding attribute levels, as shown below:
d$Util_A_without_Price <- d$`500_gigabyte` + d$black + d$OneDualShockController + d$charging_station_for_2_PS4_controller + d$no_action_adventure_game + d$no_role_playing_game + d$no_game_for_family_companionship
d$Util_B_without_Price <- d$`1_terabyte` + d$white + d$OneDualShockController + d$no_accessories + d$Far_Cry_Primal + d$`Final Fantasy_X_X2` + d$no_game_for_family_companionship
We can now preview the first few values of the new variable:
head(d$Util_A_without_Price)
## [1] -0.103480282 -0.237708920 -0.189473538 -0.613720912 0.003270605
## [6] -0.435947492
head(d$Util_B_without_Price)
## [1] -1.611101 -1.166206 -1.702064 -2.133918 -2.246294 -1.847212
For the utility of the price of the bundle, we have to apply a linear interpolation (see slides of Chapter 2).
We use the approx() function to interpolate linearly. We pass-in 3 arguments to the function:
We first establish two new variables ‘Util_A_price’ and ‘Util_B_price’ that, later on, store the interpolated utility values. The problem in our case is, that we have to apply the interpolation for each row of the dataframe d.
Therefore, we use a for() loop.
‘for(i in 1:nrow(d))’ instructs R to apply a certain command for each i in the interval [1|number of lines in d]. The subsequent command is then applied for each line i in the dataframe d. By using as.numeric() as a funtion on the result of our linear interpolation, we make sure, that the new variables Util_A_price and Util_B_price in d are filled with numeric values (approx() otherwise would give a list of values as result.)
d$Util_A_price <- 0
d$Util_B_price <- 0
for (i in 1:nrow(d)) {
d$Util_A_price[i] <- as.numeric(approx(x = c(202.3, 775.489), y = c(d$price_202.3[i], d$price_775.489[i]), xout = 299.99)[2])
d$Util_B_price[i] <- as.numeric(approx(x = c(202.3, 775.489), y = c(d$price_202.3[i], d$price_775.489[i]), xout = 349.99)[2])
}
Next, we build the total utility of the PS4 bundles A and B, by adding up the utilities without price and the utility of the price of A and B.
d$Total_Utility_A <- d$Util_A_without_Price + d$Util_A_price
d$Total_Utility_B <- d$Util_B_without_Price + d$Util_B_price
Finally, we evaluate, if A, on average, is significantly more liked than B, by counting the iterations for which the total utility of bundle A is higher as the utility of bundle B.
nrow(d[d$Total_Utility_A > d$Total_Utility_B, ]) / nrow(d) * 100
## [1] 100
Interpretation: Across all of the iterations the Alpha-based estimates for the mean total utility of the PS4 bundle A are higher then those obtained for bundle B. The Bayesian p-value corresponds to 0, a significant difference.
A description of the syntax is given below. We see, that the posteriori distributions are nearly non-overlapping highlighting the significant differences in mean total utilities for A and B.
Density_Plot <- ggplot(d) +
geom_density(aes(x = d$Total_Utility_A), alpha = .2, fill = "#31B404") +
geom_density(aes(x = d$Total_Utility_B), alpha = .2, fill = "#FFBF00") +
labs(title = "A posteriori density of Total utility estimates for A and B") +
labs(x = "Part worth utilities", y = "Density") +
geom_vline(aes(xintercept = bayestestR::hdi(d$Total_Utility_A, 0.95)$CI_low),
color = "black", linetype = "dashed", size = 0.7
) +
geom_vline(aes(xintercept = bayestestR::hdi(d$Total_Utility_A, 0.95)$CI_high),
color = "black", linetype = "dashed", size = 0.7
) +
geom_vline(aes(xintercept = bayestestR::hdi(d$Total_Utility_B, 0.95)$CI_low),
color = "red", linetype = "longdash", size = 0.7
) +
geom_vline(aes(xintercept = bayestestR::hdi(d$Total_Utility_B, 0.95)$CI_high),
color = "red", linetype = "longdash", size = 0.7
) +
annotate(
geom = "text",
x = (bayestestR::hdi(d$Total_Utility_A, 0.95)$CI_low) + 0.6, y = max(density(d$Total_Utility_A)$y), label = round(bayestestR::hdi(d$Total_Utility_A, 0.95)$CI_low, digits = 2)
) +
annotate(
geom = "text",
x = (bayestestR::hdi(d$Total_Utility_A, 0.95)$CI_high) + 0.6, y = max(density(d$Total_Utility_A)$y), label = round(bayestestR::hdi(d$Total_Utility_A, 0.95)$CI_high, digits = 2)
) +
annotate(
geom = "text",
x = (bayestestR::hdi(d$Total_Utility_B, 0.95)$CI_low) + 0.6, y = max(density(d$Total_Utility_B)$y), label = round(bayestestR::hdi(d$Total_Utility_B, 0.95)$CI_low, digits = 2)
) +
annotate(
geom = "text",
x = (bayestestR::hdi(d$Total_Utility_B, 0.95)$CI_high) + 0.6, y = max(density(d$Total_Utility_B)$y), label = round(bayestestR::hdi(d$Total_Utility_B, 0.95)$CI_high, digits = 2)
) +
annotate(
geom = "text",
x = median(d$Total_Utility_A), y = max(density(d$Total_Utility_A)$y) / 7, label = "A"
) +
annotate(
geom = "text",
x = median(d$Total_Utility_B), y = max(density(d$Total_Utility_B)$y) / 7, label = "B"
)
plot(Density_Plot)
Explanation of the syntax creating the above plot: The syntax is pretty much the same as for the figure in Example 1. However, we have to add some elements twice, as we now are plotting the values of two variables simultaneously. At the end of the chunk we added a new text label at the mean of each bundle’s distribution.
Kruschke, J. K. (2015). Doing bayesian data analysis: A tutorial with r, jags, and stan (2. ed.). Amsterdam: Academic Press.
Orme, B. K., & Chrzan, K. (2017). Becoming an expert in conjoint analysis: Choice modeling for pros. Orem, UT: Sawtooth Software.
(Bayesian experts recently propose to use of 89% intervals, which are deemed to be more stable than, 95% intervals (Kruschke, 2015). Moreover, 89% indicates the arbitrariness of interval limits.)↩︎