Survey Experiments: Design and Analysis

Author

Jordi Muñoz

1 Introduction

In this lab session we will work on the analysis of experimental data. In experiments, the most important thing is the design, not the analysis — which is relatively straightforward for standard designs. Today we will cover the following topics:

Statistical power (Section 2): How to calculate the minimum sample size needed to reliably detect a true effect, and avoid running underpowered experiments.
Randomization (Section 3): How to implement and verify randomization procedures in R.
Experimental analysis (Section 4): How to analyze data from real experiments with different designs and complications: a simple 2-group randomized experiment and a 4-arm experiment.
Analysis of conjoint experiments (Section 5): How to analyze a conjoint experiment using R, including subgroup analysis.

# Remove scientific notation for cleaner output
options(scipen = 999)

2 Statistical Power

Key concepts

A well-powered experiment prevents two types of errors: a false positive (Type I error, α) — concluding there is an effect when there isn’t — and a false negative (Type II error, β) — missing a real effect. Statistical power is 1 − β, the probability of correctly detecting a true effect.

We have already discussed the logic of statistical power in class. To run power analyses in R, we use the pwr package.

# install.packages("pwr")
library(pwr)

Power analyses require you to specify three of the following four quantities; R will solve for the fourth:

Effect size: how large a difference do you expect?
Sample size (n): how many observations per group?
Significance level (α): the false positive rate, conventionally 0.05
Power (1 − β): the probability of detecting the effect, conventionally 0.80

2.1 Sample Size for a Difference in Proportions

We use pwr.p.test() when we have two groups and a binary (yes/no) outcome. The function requires an effect size in Cohen’s h, which we calculate with ES.h() from the two proportions.

Scenario A: Suppose we expect 50% success in the control group and 75% in the treatment group — a fairly large effect.

# ES.h() converts two proportions into Cohen's h (the appropriate effect size metric
# for proportions). pwr.p.test() then finds the n needed per group.
sample1 <- pwr.p.test(
  h          = ES.h(p1 = 0.75, p2 = 0.50),  # expected proportions in treatment vs. control
  sig.level  = 0.05,   # alpha (Type I error rate)
  power      = 0.80,   # desired power (1 - Type II error rate)
  alternative = "two.sided"  # we don't assume a direction in advance
)
sample1


     proportion power calculation for binomial distribution (arcsine transformation) 

              h = 0.5235988
              n = 28.62923
      sig.level = 0.05
          power = 0.8
    alternative = two.sided

plot(sample1)

Interpretation: The n shown is the required sample size per group. Double it for the total study size. The power curve shows how power rises as sample size grows; the red dot marks your design.

Scenario B: Most interventions produce smaller effects. Let’s try 35% vs. 40% — a much more realistic expectation.

sample2 <- pwr.p.test(
  h          = ES.h(p1 = 0.40, p2 = 0.35),
  sig.level  = 0.05,
  power      = 0.80,
  alternative = "two.sided"
)
plot(sample2)

Notice how dramatically the required sample size increases for a smaller effect. This is a common surprise for new researchers — realistic effect sizes demand large samples.

2.2 Computing Power for a Fixed Sample Size

Sometimes budget or logistics fix your sample size, so instead of asking “how many do I need?”, you ask “what is my power given what I have?” Simply replace power = with n =:

# Here we fix n = 80 and ask: what is our power?
sample3 <- pwr.p.test(
  h          = ES.h(p1 = 0.75, p2 = 0.50),
  sig.level  = 0.05,
  n          = 80,
  alternative = "greater"  # one-sided: we expect treatment > control
)
sample3


     proportion power calculation for binomial distribution (arcsine transformation) 

              h = 0.5235988
              n = 80
      sig.level = 0.05
          power = 0.9988106
    alternative = greater

plot(sample3)

Interpretation: A power below 0.80 means a high probability of a false negative — we might miss a real effect. If power is very low (e.g., < 0.50), running the experiment may not be worthwhile.

2.3 Sample Size for a Difference in Means

When the outcome is continuous (e.g., a 0–10 attitude scale), we think in terms of differences in means. We need two inputs:

delta (Δ): the expected difference in group means
sd (σ): the expected standard deviation of the outcome

These two together determine the standardized effect size d = Δ/σ.

# delta and sd are on the same scale as your outcome variable
sample4 <- power.t.test(
  delta     = 0.75,   # expected difference in means (e.g., 0.75 points on a scale)
  sd        = 2.25,   # expected SD of the outcome
  sig.level = 0.05,
  power     = 0.80,
  n         = NULL    # solve for n
)
sample4


     Two-sample t test power calculation 

              n = 142.2466
          delta = 0.75
             sd = 2.25
      sig.level = 0.05
          power = 0.8
    alternative = two.sided

NOTE: n is number in *each* group

plot(sample4)

Again, we can flip this around to assess power for a given sample:

sample5 <- power.t.test(
  delta     = 0.75,
  sd        = 2.25,
  sig.level = 0.05,
  n         = 80      # solve for power
)
sample5


     Two-sample t test power calculation 

              n = 80
          delta = 0.75
             sd = 2.25
      sig.level = 0.05
          power = 0.5538507
    alternative = two.sided

NOTE: n is number in *each* group

If you find it hard to “guess” delta and sd separately, you can work directly with Cohen’s d. A common rule of thumb: d = 0.2 (small effect size), 0.5 (medium), 0.8 (large):

# d = delta / sd. Using d = 0.2 (a small but realistic effect)
sample6 <- power.t.test(
  d         = 0.2,
  power     = 0.80,
  sig.level = 0.05
)
sample6


     Two-sample t test power calculation 

              n = 393.4067
          delta = 0.2
             sd = 1
      sig.level = 0.05
          power = 0.8
    alternative = two.sided

NOTE: n is number in *each* group

plot(sample6)

Finally, if the two groups have unequal sizes (e.g., because one condition is rarer or more expensive), use pwr.t2n.test():

# n1 and n2 are the sizes of the two groups; d is Cohen's d
sample7 <- pwr.t2n.test(n1 = 28, n2 = 35, d = 0.5)
sample7


     t test power calculation 

             n1 = 28
             n2 = 35
              d = 0.5
      sig.level = 0.05
          power = 0.4924588
    alternative = two.sided

Rule of thumb for power analyses

If you have no prior data to estimate effect size, use a small-to-medium effect (Cohen’s d ≈ 0.3, or a 5–10 percentage point difference in proportions) as a conservative planning assumption. This ensures you are powered for realistic but interesting effects rather than only for large, optimistic ones.

an alternative way of thinking is to power the experiment for the minimum substantively relevant effect you consider is worth reporting. Powering the experiments to detect super tiny effects may be a waste of resources.

3 Randomization

In survey experiments, randomization is usually handled by the survey software. In field or fieldwork experiments, however, we often need to randomize manually — for example, assigning villages or polling stations to treatment. This section shows how to do it in R and, crucially, how to verify that it worked.

3.1 Basic Randomization Procedure

The core idea: generate a vector of random numbers, rank them, and assign the top m units to treatment.

set.seed(1234567)  # Set a seed: guarantees the same random draw every time you run this.
                   # Use any integer; record it for your pre-registration/replication file.

n  <- 400          # Total number of units
m  <- 200          # Number to assign to treatment (here, 50%)

data1 <- data.frame(ID = 1:n)          # Create data frame with unit IDs
sel   <- rnorm(n)                      # Draw n random numbers from a standard normal
sel   <- order(sel, decreasing = TRUE) # Rank them (highest first)

data1$Tr <- 0                          # Initialize all units as control (0)
data1$Tr[sel[1:m]] <- 1               # Assign the top m units to treatment (1)

head(data1)

table(data1$Tr)  # Check: should be 200 treated, 200 control


  0   1 
200 200

3.2 Randomization Check (Balance Test)

One of the most important steps after randomization is checking balance — that the treatment and control groups are similar on pre-treatment characteristics. With true randomization, any differences are due to chance, but we should verify this formally.

We create a larger dataset and add a non-random variable (gender assigned by ID order) to test against:

N2   <- 2000
m2   <- N2 / 2  # 50% treated

data2 <- data.frame(id = 1:N2)

# Assign the first 1000 observations as Male, the rest as Female.
# This is deliberately non-random to give the balance test something to detect IF
# our randomization fails.
data2$genere <- ifelse(data2$id <= N2 / 2, "Male", "Female")

table(data2$genere)  # 1000 male, 1000 female


Female   Male 
  1000   1000

Now randomize:

sel2         <- rnorm(N2)
sel2         <- order(sel2, decreasing = TRUE)
data2$Tr     <- 0
data2$Tr[sel2[1:m2]] <- 1

And check balance. A t-test should find no statistically significant difference in gender or ID across treatment arms — if randomization worked, treatment assignment is independent of these variables:

# Proportions table: should be roughly 50/50 male-female within each arm
check.gender <- table(data2$Tr, data2$genere)
prop.table(check.gender, margin = 1)  # row proportions (by treatment arm)

   
    Female  Male
  0  0.493 0.507
  1  0.507 0.493

# Formal tests: p > 0.05 means no evidence of imbalance (which is what we want)
t.test(data2$Tr ~ data2$genere)  # Is treatment related to gender? (Should be: NO)


    Welch Two Sample t-test

data:  data2$Tr by data2$genere
t = 0.62585, df = 1998, p-value = 0.5315
alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
95 percent confidence interval:
 -0.02987034  0.05787034
sample estimates:
mean in group Female   mean in group Male 
               0.507                0.493

t.test(data2$id ~ data2$Tr)      # Is treatment related to ID order? (Should be: NO)


    Welch Two Sample t-test

data:  data2$id by data2$Tr
t = -0.13742, df = 1998, p-value = 0.8907
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
95 percent confidence interval:
 -54.21182  47.11182
sample estimates:
mean in group 0 mean in group 1 
        998.725        1002.275

Interpretation: Non-significant p-values here are good news — they suggest that the random assignment is truly independent of pre-existing characteristics. If a test were significant, it would indicate a flaw in the randomization procedure.

4 Analysis of Experiments

Now we turn to real experimental data. We cover three examples that progressively introduce common complications:

A simple two-arm experiment (election observers)
A multi-arm experiment (housing and immigration)
An experiment with non-compliance (GOTV)

library(readr)
library(tidyverse)
library(haven)
library(stargazer)
library(rempsyc)
library(sjmisc)
library(AER)
library(here)
library(tableone)
library(janitor)

What does here() do?

here() constructs file paths relative to the project root (where your .Rproj file lives), not your current working directory. This makes scripts portable — they run correctly regardless of which subfolder you open them from.

4.1 Experiment 1: Electoral Fraud in Russia

Paper: Enikolopov, R., et al. (2013). “Field experiment estimate of electoral fraud in Russian parliamentary elections.” PNAS, 110(2), 448–452.

Design: During the December 2011 Russian parliamentary election, polling stations were randomly assigned to receive independent election observers (observers = 1) or not (observers = 0). If observers deter fraud, we expect higher turnout and lower incumbent vote share in observed stations.

Key identifying assumption: Because assignment was random, any differences in outcomes between observed and unobserved stations are caused by the observers (not by pre-existing differences between stations).

# --- File path ---
# Adjust these paths to match your project structure
exprussia <- here("data_exprussia.csv")
exprussia <- read_delim(exprussia,delim = ";", 
                        escape_double = FALSE,
                        locale = locale(decimal_mark = ","))

names(exprussia)

[1] "observers" "turnout"   "ershare"

summary(exprussia)

   observers         turnout          ershare      
 Min.   :0.0000   Min.   :0.3348   Min.   :0.1273  
 1st Qu.:0.0000   1st Qu.:0.5223   1st Qu.:0.2891  
 Median :0.0000   Median :0.5989   Median :0.4558  
 Mean   :0.0493   Mean   :0.6105   Mean   :0.4438  
 3rd Qu.:0.0000   3rd Qu.:0.6852   3rd Qu.:0.5563  
 Max.   :1.0000   Max.   :1.0000   Max.   :0.9400

table(exprussia$observers)  # How many stations in each condition?


   0    1 
3008  156

4.1.1 Descriptive Comparison

The simplest analysis: compare group means.

Russia_summary <- exprussia %>%
  group_by(observers) %>%
  summarise(
    mean_turnout = mean(turnout),
    mean_ershare = mean(ershare)
  )

knitr::kable(Russia_summary, format = "html", digits = 2, align = "lcc",
             caption = "Mean turnout and United Russia vote share by observer presence")

Mean turnout and United Russia vote share by observer presence
observers	mean_turnout	mean_ershare
0	0.61	0.45
1	0.55	0.34

4.1.2 Significance Testing

We already see important differences in turnout and vot for United Russia. To assess whether these differences are larger than we would expect from chance alone, we run t-tests:

# nice_t_test() from rempsyc runs t-tests and formats results in APA style
t.test.table <- nice_t_test(
  data     = exprussia,
  response = c("turnout", "ershare"),
  group    = "observers",
  warning  = FALSE
) |>
  nice_table()
t.test.table

Dependent Variable	t	df	p	d	95% CI
turnout	8.31	178.95	< .001***	0.58	[0.42, 0.74]
ershare	9.25	177.91	< .001***	0.66	[0.50, 0.82]

4.1.3 Visualization: Bar Plot with Confidence Intervals

Bar plots summarizing group means are common in experimental papers. The error bars represent 95% confidence intervals.

# Compute descriptive stats including SE and 95% CI for each group
Russia_summary <- exprussia |>
  group_by(observers) |>
  summarize(
    mean_turnout = mean(turnout),
    mean_ershare = mean(ershare),
    n            = n(),
    SE_turnout   = sd(turnout) / sqrt(n()),
    SE_ershare   = sd(ershare) / sqrt(n()),
    CI_turnout   = 1.96 * SE_turnout,   # half-width of 95% CI
    CI_ershare   = 1.96 * SE_ershare
  )

Russia_summary$observers <- as.factor(Russia_summary$observers)

ggplot(Russia_summary, aes(observers, mean_turnout, fill = observers)) +
  geom_col(width = 0.7) +
  geom_errorbar(
    aes(ymin = mean_turnout - CI_turnout, ymax = mean_turnout + CI_turnout),
    width = 0.2
  ) +
  labs(
    y       = "Turnout ± 95% CI",
    x       = "Observers present",
    title   = "Effect of election observers on voter turnout",
    caption = "Enikolopov et al. (2013). Error bars = 95% CI."
  ) +
  theme_classic() +
  guides(fill = "none")

4.1.4 Visualization: Violin + Box Plot (Showing Full Distributions)

Bar plots only show the mean and CI, which can hide important features of the data (e.g., skewness, bimodality). A violin plot shows the full distribution:

# Recode 0/1 to meaningful labels before plotting
exprussia <- exprussia |>
  mutate(observers = factor(observers,
                            levels = c(0, 1),
                            labels = c("No observers", "Observers")))

ggplot(exprussia, aes(x = observers, y = turnout, fill = observers)) +
  geom_violin(trim = FALSE, alpha = 0.6) +
  geom_boxplot(width = 0.1, alpha = 0.5) +  # Shows median and IQR inside violin
  stat_summary(                              # Adds mean ± SD as a point + range
    fun.data = mean_sdl, mult = 1,
    geom = "pointrange", color = "black"
  ) +
  labs(
    y       = "Turnout",
    x       = NULL,
    title   = "Distribution of turnout by observer assignment",
    caption = "Point = mean; range = ±1 SD. Enikolopov et al. (2013)."
  ) +
  theme_classic() +
  guides(fill = "none")

Why prefer the violin plot? It reveals whether distributions are symmetric, whether there are outliers, and whether the two groups differ in spread — none of which is visible from a bar chart.

4.2 Experiment 2: Public Housing and Immigration Attitudes

This experiment uses a factorial vignette design — more complex than a simple treatment/control setup. Respondents were shown a scenario about a municipal housing allocation and asked which family should receive the apartment. The first family in the waiting list, or another family with a delicate situation ( a sick daughter) that may deserve priority.

The immigrant status of the first and second family in the scenario was randomly varied, creating four conditions:

Condition	1st family	2nd family
1 (I–I)	Immigrant	Immigrant
2 (I–NI)	Immigrant	Non-immigrant
3 (NI–I)	Non-immigrant	Immigrant
4 (NI–NI)	Non-immigrant	Non-immigrant

By comparing these conditions, we can estimate whether respondents systematically favor one group over another and whether the order matters.

exp_pis <- here("exp_pis.dta")
exp_pis <- read_dta(exp_pis)
names(exp_pis)

[1] "gender"             "age"                "education"         
[4] "housing1"           "housing1_dv"        "leftright"         
[7] "immigrants_economy" "immigrants_culture" "housing_dic"

4.2.1 Randomization Check (Balance)

Before analyzing outcomes, we verify that the four groups are similar on pre-treatment characteristics. If randomization worked, we should see no significant differences across groups.

Pis_summary <- exp_pis %>%
  group_by(housing1) %>%
  summarise(
    mean_age           = mean(age),
    mean_education     = mean(education),
    mean_leftright     = mean(leftright),
    mean_imm_economy   = mean(immigrants_economy),
    mean_imm_culture   = mean(immigrants_culture),
    mean_gender        = mean(gender),
    n_obs              = n()
  )

knitr::kable(Pis_summary, format = "html", digits = 2, align = "lccccccc",
             caption = "Pre-treatment characteristics by experimental condition")

Pre-treatment characteristics by experimental condition
housing1	mean_age	mean_education	mean_leftright	mean_imm_economy	mean_imm_culture	mean_gender	n_obs
1	50.03	3.86	5.26	4.55	4.16	1.48	148
2	47.64	4.00	5.40	4.43	4.15	1.51	145
3	47.41	3.98	5.37	4.75	4.17	1.50	167
4	49.95	3.64	5.51	4.67	4.26	1.46	152

We use the tableone package, which is purpose-built for balance tables, handles both continuous and binary variables appropriately, and includes significance tests across multiple groups in one clean output.

# Label the treatment variable for cleaner column headers
exp_pis <- exp_pis |>
  mutate(condition = factor(housing1,
                            labels = c("I–I", "I–NI", "NI–I", "NI–NI")))

# Variables to check for balance
covariates <- c("age", "education", "leftright",
                "immigrants_economy", "immigrants_culture", "gender")

# CreateTableOne runs the omnibus F-test (continuous) or chi-square (binary)
# across all four groups simultaneously
tab1 <- CreateTableOne(
  vars     = covariates,
  strata   = "condition",   # grouping variable
  data     = exp_pis,
  factorVars = "gender"     # treat gender as categorical, not continuous
)

# Print with SMDs and the omnibus p-value
# smd = TRUE adds standardized mean differences between each group and the first
print(tab1,
      smd     = TRUE,
      showAllLevels = FALSE,
      printToggle   = FALSE) |>
  knitr::kable(caption = "Table 1. Covariate balance across experimental conditions",
               notes   = "p: omnibus F-test (continuous) or χ² (categorical). 
                          SMD: standardized mean difference vs. I–I condition.")

Table 1. Covariate balance across experimental conditions
	I–I	I–NI	NI–I	NI–NI	p	SMD
n	148	145	167	152
age (mean (SD))	50.03 (15.73)	47.64 (16.41)	47.41 (16.13)	49.95 (15.65)	0.301	0.106
education (mean (SD))	3.86 (1.52)	4.00 (1.59)	3.98 (1.66)	3.64 (1.54)	0.186	0.126
leftright (mean (SD))	5.26 (1.74)	5.40 (1.80)	5.37 (1.77)	5.51 (1.72)	0.654	0.077
immigrants_economy (mean (SD))	4.55 (2.87)	4.43 (2.58)	4.75 (2.73)	4.67 (2.45)	0.743	0.067
immigrants_culture (mean (SD))	4.16 (2.87)	4.15 (2.84)	4.17 (2.74)	4.26 (2.67)	0.987	0.020
gender = 2 (%)	71 (48.0)	74 (51.0)	84 (50.3)	70 (46.1)	0.816	0.058

How to read the table: Each row is a pre-treatment variable. Columns show the mean (SD) for each condition, the omnibus p-value testing equality across all four groups simultaneously, and the SMD for each group relative to the first. All p-values should be > 0.05 and SMDs should be close to 0.

For a more formal pairwise test across all four groups, we can add Bonferroni-corrected pairwise t-tests. Because we run many comparisons simultaneously, we need to adjust the p-values to control the family-wise error rate:

# Cross-tabulate gender by treatment to check proportions
balance <- table(exp_pis$gender, exp_pis$housing1)
prop.table(balance, margin = 2)  # Column proportions

   
            1         2         3         4
  1 0.5202703 0.4896552 0.4970060 0.5394737
  2 0.4797297 0.5103448 0.5029940 0.4605263

# Pairwise t-tests with Bonferroni correction (adjusts for multiple comparisons).
# Goal: ALL p-values should be > 0.05.
pairwise.t.test(exp_pis$leftright, exp_pis$housing1, p.adjust.method = "bonferroni")


    Pairwise comparisons using t tests with pooled SD 

data:  exp_pis$leftright and exp_pis$housing1 

  1 2 3
2 1 - -
3 1 1 -
4 1 1 1

P value adjustment method: bonferroni

pairwise.t.test(exp_pis$age, exp_pis$housing1, p.adjust.method = "bonferroni")


    Pairwise comparisons using t tests with pooled SD 

data:  exp_pis$age and exp_pis$housing1 

  1    2    3   
2 1.00 -    -   
3 0.89 1.00 -   
4 1.00 1.00 0.95

P value adjustment method: bonferroni

pairwise.t.test(exp_pis$gender, exp_pis$housing1, p.adjust.method = "bonferroni")


    Pairwise comparisons using t tests with pooled SD 

data:  exp_pis$gender and exp_pis$housing1 

  1 2 3
2 1 - -
3 1 1 -
4 1 1 1

P value adjustment method: bonferroni

pairwise.t.test(exp_pis$education, exp_pis$housing1, p.adjust.method = "bonferroni")


    Pairwise comparisons using t tests with pooled SD 

data:  exp_pis$education and exp_pis$housing1 

  1    2    3   
2 1.00 -    -   
3 1.00 1.00 -   
4 1.00 0.32 0.37

P value adjustment method: bonferroni

Why Bonferroni correction? When we run many tests simultaneously, some will be significant by chance (the multiple comparisons problem). The Bonferroni correction controls the family-wise error rate by making each individual test more stringent.

4.2.2 Main Analysis: Pairwise Comparisons

We first produce a crosstabulation to observe the percentages across conditions:

library(gtsummary)

# Label the variables for clean output
exp_pis_labelled <- exp_pis |>
  mutate(
    housing1_dv = factor(housing1_dv,
                         levels = c(1, 2),
                         labels = c("1st family", "2nd family")),
    condition   = factor(housing1, levels = c(1, 2, 3, 4),
                         labels = c("I-I", "I-NI", "NI-I","NI-NI"))
  )
# tbl_cross() produces a clean crosstab with row/column totals and
# an omnibus chi-square test across all cells
tbl_cross(
  data       = exp_pis_labelled,
  row        = housing1_dv,       # outcome in rows
  col        = condition,         # treatment in columns
  percent    = "column",          # show column percentages (proportion within each arm)
  statistic  = "{n} ({p}%)",      # format: count (percentage)
  label      = list(
    housing1_dv ~ "Family chosen",
    condition   ~ "Experimental condition"
  )
) |>
  add_p(source_note = TRUE) |>   # adds chi-square test p-value as a footnote
  bold_labels()

	Experimental condition				Total
	I-I	I-NI	NI-I	NI-NI	Total
Family chosen
1st family	40 (27%)	48 (33%)	94 (56%)	48 (32%)	230 (38%)
2nd family	108 (73%)	97 (67%)	73 (44%)	104 (68%)	382 (62%)
Total	148 (100%)	145 (100%)	167 (100%)	152 (100%)	612 (100%)
Pearson’s Chi-squared test, p<0.001

We can also use the pairwise t-test to check for significant differences.

# Are there significant differences across conditions in the outcome?
pairwise.t.test(exp_pis$housing1_dv, exp_pis$housing1, p.adjust.method = "bonferroni")


    Pairwise comparisons using t tests with pooled SD 

data:  exp_pis$housing1_dv and exp_pis$housing1 

  1          2      3         
2 1.0000     -      -         
3 0.00000034 0.0001 -         
4 1.0000     1.0000 0.00002189

P value adjustment method: bonferroni

4.2.3 Regression Analysis

Regression is often preferable to pairwise t-tests because: (1) it gives you a single reference category to interpret against, (2) you can easily add covariates, and (3) outputs are easy to present in publication tables.

# The DV currently codes choice as 1 = first family, 2 = second family.
# Recode to 0/1 so coefficients represent the *proportion choosing the 2nd family*.
exp_pis$housing1_dv <- exp_pis$housing1_dv - 1

# Convert treatment to a factor with meaningful labels
exp_pis$housing1 <- as.factor(exp_pis$housing1)
exp_pis$housing2 <- fct_recode(exp_pis$housing1,
                                "I-I"   = "1",
                                "I-NI"  = "2",
                                "NI-I"  = "3",
                                "NI-NI" = "4")

Model 1: Treatment dummies with “I–I” as the reference category:

fit1 <- lm(housing1_dv ~ housing2, data = exp_pis)
stargazer(fit1, type = "html",
          title = "Effect of family immigrant status on housing allocation",
          dep.var.labels = "Chose 2nd family (0/1)",
          notes = "Reference category: I-I (both families immigrant).")

**Effect of family immigrant status on housing allocation**

	Dependent variable:

	Chose 2nd family (0/1)

housing2I-NI	-0.061
	(0.055)

housing2NI-I	-0.293^***
	(0.053)

housing2NI-NI	-0.046
	(0.054)

Constant	0.730^***
	(0.039)


Observations	612
R²	0.058
Adjusted R²	0.053
Residual Std. Error	0.472 (df = 608)
F Statistic	12.485^*** (df = 3; 608)

Note:	p<0.1; p<0.05; p<0.01
	Reference category: I-I (both families immigrant).

Reading the table: Each coefficient shows how much more (or less) likely respondents were to choose the 2nd family relative to the I–I condition. The intercept is the predicted probability of choosing the 2nd family when both are immigrant.

We observe how in the case of a 1st non-migrant family and a 2nd immigrant origin family, the likelihood of our respondents to advocate for a switch in the order of the list is significantly lower. Changing the reference category can make substantive comparisons clearer. Let’s use “NI–NI” as the baseline:

exp_pis$housing2 <- relevel(exp_pis$housing2, ref = "NI-NI")

fit2 <- lm(housing1_dv ~ housing2, data = exp_pis)
stargazer(fit2, type = "html",
          title = "Effect of family immigrant status on housing allocation",
          dep.var.labels = "Chose 2nd family (0/1)",
          notes = "Reference category: NI-NI (both families non-immigrant).")

**Effect of family immigrant status on housing allocation**

	Dependent variable:

	Chose 2nd family (0/1)

housing2I-I	0.046
	(0.054)

housing2I-NI	-0.015
	(0.055)

housing2NI-I	-0.247^***
	(0.053)

Constant	0.684^***
	(0.038)


Observations	612
R²	0.058
Adjusted R²	0.053
Residual Std. Error	0.472 (df = 608)
F Statistic	12.485^*** (df = 3; 608)

Note:	p<0.1; p<0.05; p<0.01
	Reference category: NI-NI (both families non-immigrant).

4.2.4 Adding Covariates

In a randomized experiment, adding pre-treatment covariates to the regression should not change the treatment effect estimates (because of randomization), but it can improve precision (reduce standard errors). If randomization worked, treatment coefficients should not change substantially. This is a useful check:

fit3 <- lm(
  housing1_dv ~ housing2 + gender + age + education + leftright +
    immigrants_economy + immigrants_culture,
  data = exp_pis
)

stargazer(fit1, fit3, type = "html",
          title = "Treatment effects with and without covariates",
          dep.var.labels = "Chose 2nd family (0/1)",
          column.labels = c("Without covariates", "With covariates"))

**Treatment effects with and without covariates**

	Dependent variable:

	Chose 2nd family (0/1)
	Without covariates	With covariates
	(1)	(2)

housing2I-I		0.047
		(0.054)

housing2I-NI	-0.061	-0.003
	(0.055)	(0.055)

housing2NI-I	-0.293^***	-0.239^***
	(0.053)	(0.053)

housing2NI-NI	-0.046
	(0.054)

gender		-0.025
		(0.038)

age		0.001
		(0.001)

education		-0.028^**
		(0.013)

leftright		-0.019^*
		(0.011)

immigrants_economy		0.007
		(0.010)

immigrants_culture		-0.007
		(0.009)

Constant	0.730^***	0.878^***
	(0.039)	(0.132)


Observations	612	612
R²	0.058	0.073
Adjusted R²	0.053	0.059
Residual Std. Error	0.472 (df = 608)	0.470 (df = 602)
F Statistic	12.485^*** (df = 3; 608)	5.270^*** (df = 9; 602)

Note:	p<0.1; p<0.05; p<0.01

4.2.5 Heterogeneous Treatment Effects

Does the treatment affect everyone equally? Probably not — people with strong anti-immigration attitudes may react differently. We test this by interacting the treatment with prior immigration attitudes.

# Inspect the immigration culture variable (0 = enriches, 10 = undermines)
frq(exp_pis$immigrants_culture)

Y, usando esta escala, ¿dirías que la vida cultural española se enriquece o se e (x) <numeric> 
# total N=612 valid N=612 mean=4.19 sd=2.77

Value |        Label |   N | Raw % | Valid % | Cum. %
-----------------------------------------------------
    0 |  Enriquece 0 |  74 | 12.09 |   12.09 |  12.09
    1 |            1 |  43 |  7.03 |    7.03 |  19.12
    2 |            2 |  66 | 10.78 |   10.78 |  29.90
    3 |            3 |  80 | 13.07 |   13.07 |  42.97
    4 |            4 |  51 |  8.33 |    8.33 |  51.31
    5 |            5 | 134 | 21.90 |   21.90 |  73.20
    6 |            6 |  36 |  5.88 |    5.88 |  79.08
    7 |            7 |  51 |  8.33 |    8.33 |  87.42
    8 |            8 |  28 |  4.58 |    4.58 |  91.99
    9 |            9 |  14 |  2.29 |    2.29 |  94.28
   10 | Empobrece 10 |  35 |  5.72 |    5.72 | 100.00
 <NA> |         <NA> |   0 |  0.00 |    <NA> |   <NA>

# Simplify to 3 ordered categories for interpretability
exp_pis <- exp_pis |>
  mutate(immigrants_cult3 = case_when(
    immigrants_culture >= 0 & immigrants_culture <= 4 ~ "Positive",
    immigrants_culture == 5                           ~ "Neutral",
    immigrants_culture >= 6 & immigrants_culture <= 10 ~ "Negative"
  ))

# Set "Negative" as the reference (most restrictive attitudes)
exp_pis$immigrants_cult3 <- factor(exp_pis$immigrants_cult3,
                                    levels = c("Negative", "Neutral", "Positive"))

# The * operator includes main effects AND their interaction
fit4 <- lm(housing1_dv ~ housing2 * immigrants_cult3, data = exp_pis)

stargazer(fit4, type = "html",
          title = "Heterogeneous treatment effects by immigration attitudes",
          dep.var.labels = "Chose 2nd family (0/1)",
          notes = "Interaction terms show how treatment effects differ by attitude group.")

**Heterogeneous treatment effects by immigration attitudes**

	Dependent variable:

	Chose 2nd family (0/1)

housing2I-I	0.032
	(0.109)

housing2I-NI	0.063
	(0.105)

housing2NI-I	-0.405^***
	(0.101)

immigrants_cult3Neutral	-0.051
	(0.109)

immigrants_cult3Positive	-0.043
	(0.092)

housing2I-I:immigrants_cult3Neutral	-0.008
	(0.152)

housing2I-NI:immigrants_cult3Neutral	0.040
	(0.160)

housing2NI-I:immigrants_cult3Neutral	0.239
	(0.154)

housing2I-I:immigrants_cult3Positive	0.035
	(0.134)

housing2I-NI:immigrants_cult3Positive	-0.161
	(0.129)

housing2NI-I:immigrants_cult3Positive	0.213^*
	(0.125)

Constant	0.718^***
	(0.075)


Observations	612
R²	0.076
Adjusted R²	0.059
Residual Std. Error	0.470 (df = 600)
F Statistic	4.512^*** (df = 11; 600)

Note:	p<0.1; p<0.05; p<0.01
	Interaction terms show how treatment effects differ by attitude group.

Interaction coefficients in a table are hard to read. It is much clearer to plot predicted values for each treatment × attitude combination:

# Create a grid of all treatment × attitude combinations
new_data <- expand.grid(
  housing2          = levels(exp_pis$housing2),
  immigrants_cult3  = levels(exp_pis$immigrants_cult3)
)

# Get predicted values and 95% CI from the model
predictions    <- predict(fit4, newdata = new_data, interval = "confidence")
predictions_df <- as.data.frame(predictions)
colnames(predictions_df) <- c("fit", "lwr", "upr")

new_data <- cbind(new_data, predictions_df)

knitr::kable(new_data, digits = 2, format = "html", align = "lcccc",
             caption = "Predicted probability of choosing 2nd family by treatment and attitudes")

Predicted probability of choosing 2nd family by treatment and attitudes
housing2	immigrants_cult3	fit	lwr	upr
NI-NI	Negative	0.72	0.57	0.87
I-I	Negative	0.75	0.60	0.90
I-NI	Negative	0.78	0.64	0.92
NI-I	Negative	0.31	0.18	0.45
NI-NI	Neutral	0.67	0.51	0.82
I-I	Neutral	0.69	0.55	0.83
I-NI	Neutral	0.77	0.59	0.95
NI-I	Neutral	0.50	0.33	0.67
NI-NI	Positive	0.68	0.57	0.78
I-I	Positive	0.74	0.63	0.85
I-NI	Positive	0.58	0.47	0.68
NI-I	Positive	0.48	0.39	0.58

ggplot(new_data, aes(x = housing2, y = fit, fill = immigrants_cult3)) +
  geom_bar(stat = "identity", position = "dodge") +
  geom_errorbar(
    aes(ymin = lwr, ymax = upr),
    position = position_dodge(width = 0.9),
    width = 0.25
  ) +
  labs(
    title    = "Predicted probability of choosing 2nd family",
    subtitle = "By experimental condition and attitudes towards immigration",
    x        = "Experimental condition",
    y        = "Predicted probability",
    fill     = "Immigration attitudes"
  ) +
  theme_minimal() +
  theme(legend.position = "bottom")

Interpretation guide: If the treatment effects (differences across bars within each attitude group) vary in size or direction across attitude groups, that is evidence of heterogeneous treatment effects — the experiment affected different people differently.

4.3 Experiment 3: ITT and CACE — Get Out the Vote

The third experiment illustrates a common real-world complication: non-compliance. In the Gerber & Green (2005) GOTV study, voters were randomly assigned to receive a personal visit encouraging them to vote. However, some assigned to the treatment group were not at home and thus never received the visit — these are “never-takers.”

This creates a gap between:

Intent-to-Treat (ITT): the effect of being assigned to treatment (regardless of whether treatment was received)
Complier Average Causal Effect (CACE): the effect of actually receiving treatment, estimated only among those who complied

The CACE is what we really care about, but we can only estimate it indirectly, using the assignment as an instrumental variable for actual treatment receipt.

data1 <- read_csv("Gerber_Green_APSRsubset_2005.csv")

# Keep only single-person households in the door-to-door treatment arm
# (filtering out mail and phone treatments for clean comparison)
data2 <- data1 %>%
  filter(onetreat == 1, mailings == 0, phongotv == 0, persons == 1)

4.3.1 Intent-to-Treat (ITT) Estimate

The ITT is the simplest estimator: regress the outcome on the assignment indicator. This is always valid because assignment is random, even if compliance is imperfect.

# persngrp = 1 if assigned to personal visit condition
# v98_1 = voted in 1998 election (0/1)
itt_fit <- lm(v98_1 ~ persngrp, data = data2)

stargazer(itt_fit, type = "html",
          title = "Intent-to-Treat estimate (effect of assignment)",
          dep.var.labels = "Voted (0/1)",
          covariate.labels = c("Assigned to personal visit"),
          notes = "ITT: effect of being assigned to treatment, regardless of compliance.")

**Intent-to-Treat estimate (effect of assignment)**

	Dependent variable:

	Voted (0/1)

Assigned to personal visit	-0.121
	(0.657)

Constant	5.742^***
	(0.296)


Observations	7,090
R²	0.00000
Adjusted R²	-0.0001
Residual Std. Error	22.269 (df = 7088)
F Statistic	0.034 (df = 1; 7088)

Note:	p<0.1; p<0.05; p<0.01
	ITT: effect of being assigned to treatment, regardless of compliance.

Interpretation: The ITT estimates the effect of offering the treatment to the full assigned group, diluted by non-compliers who did not actually receive it.

4.3.2 First Stage: Effect of Assignment on Compliance (ITT_D)

We next estimate how many of those assigned to treatment actually received it (were visited). This is called the “first stage” in IV terminology, or ITT_D:

# cntany = 1 if the voter was actually contacted
itt_d_fit <- lm(cntany ~ persngrp, data = data2)

stargazer(itt_d_fit, type = "html",
          title = "First stage: Effect of assignment on treatment receipt (ITT_D)",
          dep.var.labels = "Actually contacted (0/1)",
          covariate.labels = c("Assigned to personal visit"),
          notes = "ITT_D: proportion of assigned units who complied.")

**First stage: Effect of assignment on treatment receipt (ITT**

	Dependent variable:

	Actually contacted (0/1)

Assigned to personal visit	0.273^***
	(0.006)

Constant	0.000
	(0.003)


Observations	7,090
R²	0.230
Adjusted R²	0.230
Residual Std. Error	0.201 (df = 7088)
F Statistic	2,122.996^*** (df = 1; 7088)

Note:	p<0.1; p<0.05; p<0.01
	ITTproportion of assigned units who complied.

Key quantity: The coefficient on persngrp is the compliance rate — the share of assigned units who actually received the treatment.

4.3.3 Complier Average Causal Effect (CACE)

The CACE answers the question: “For those who would comply with assignment, what is the effect of actually receiving the treatment?”

By the exclusion restriction (assignment only affects voting through actual contact), we can recover CACE as:

\[\text{CACE} = \frac{\text{ITT}}{\text{ITT}_D}\]

# Manual calculation: ratio of the two regression coefficients
cace_manual <- coef(itt_fit)[2] / coef(itt_d_fit)[2]
cat("CACE (manual):", round(cace_manual, 4))

CACE (manual): -0.4431

This is algebraically equivalent to Two-Stage Least Squares (2SLS), where we use assignment (persngrp) as an instrument for actual contact (cntany):

# ivreg() syntax: outcome ~ endogenous_variable | instrument
# The | separates the "structural equation" from the "instrument"
iv_model <- ivreg(v98_1 ~ cntany | persngrp, data = data2)
summary(iv_model)


Call:
ivreg(formula = v98_1 ~ cntany | persngrp, data = data2)

Residuals:
   Min     1Q Median     3Q    Max 
-5.742 -5.742 -5.742 -4.742 93.701 

Coefficients:
            Estimate Std. Error t value            Pr(>|t|)    
(Intercept)   5.7419     0.2964  19.375 <0.0000000000000002 ***
cntany       -0.4431     2.4015  -0.185               0.854    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 22.27 on 7088 degrees of freedom
Multiple R-Squared: 0.0002198,  Adjusted R-squared: 7.873e-05 
Wald test: 0.03405 on 1 and 7088 DF,  p-value: 0.8536

Interpretation: The CACE estimate is larger than the ITT estimate. This makes intuitive sense: the ITT averages across compliers and never-takers (who were unaffected), so it underestimates the effect for those who actually received the visit.

Summary of estimators

Estimator	What it estimates	When to use
ITT	Effect of assignment	Always valid; conservative
ITT_D	Compliance rate	Needed to compute CACE
CACE (IV)	Effect of receipt among compliers	When you care about the treatment, not just the offer

The CACE is only identified if: (1) assignment affects the outcome only through treatment receipt (exclusion restriction), and (2) assignment increases (not decreases) treatment receipt for all units (monotonicity).

5 Analysis of Conjoint Experiments

This part walks through a conjoint analysis of a public policy experiment conducted during the COVID-19 pandemic. Respondents were shown pairs of hypothetical exit-from-lockdown policies and asked which they preferred. Each policy profile varied randomly across five attributes.

Our goal is to estimate and visualize two complementary quantities:

AMCEs (Average Marginal Component Effects): How much does each attribute level change the probability of a profile being chosen, relative to a baseline?
Marginal Means (MMs): What is the average predicted probability that a profile is chosen when a given level is present?

We will also conduct a subgroup analysis by partisanship to see whether government supporters and non-supporters weight these attributes differently.

Why both AMCEs and Marginal Means?

AMCEs answer “compared to the baseline level, how much does this level matter?” They are intuitive for testing hypotheses but depend on what the baseline is. Marginal Means don’t require a baseline and are easier to interpret substantively: a MM above 0.5 means the level makes a profile more likely to be chosen than average. For a good discussion, see Leeper, Hobolt & Tilley (2020).

5.1 Setup

We use four packages:

tidyverse — data manipulation and plotting
haven — reading SPSS .sav files
cregg — conjoint analysis (AMCEs and Marginal Means)
here — robust relative file paths

#Remove scientific notation
options(scipen=999)

# Install any missing packages automatically
required_packages <- c("tidyverse", "haven", "here")
to_install <- required_packages[!sapply(required_packages, requireNamespace, quietly = TRUE)]
if (length(to_install) > 0) install.packages(to_install)

# Cregg is not on CRAN, we install it from github page 
if (!require("remotes")) {
    install.packages("remotes")
}
remotes::install_github("leeper/cregg")


library(tidyverse)
library(haven)
library(cregg)
library(here)

5.2 Data Loading and Merging

The experiment produces two separate files:

Conjoint results (resultados Cj.sav): one row per profile shown (each respondent saw multiple pairs)
Survey data (UBCNES_196285_20200625.sav): one row per respondent, containing background variables like partisanship

We merge them on the respondent ID (numericalId) to attach partisanship to the choice data.

# --- File paths ---
# Adjust these paths to match your project structure
cj_path     <- here("UBCNES_196285_20200625_resultados Cj.sav")
survey_path <- here("UBCNES_196285_20200625_short.sav")

conjoint_raw <- haven::read_spss(cj_path)
survey_raw   <- haven::read_spss(survey_path)

# From the survey, extract partisanship (P5) and recode into a binary variable.
# P5 == 1 or P5 == 4 are coded as government supporters in the original Stata script.
survey_partisanship <- survey_raw |>
  select(numericalId, P5) |>
  mutate(
    govt_supporter = factor(
      if_else(P5 %in% c(1, 4), "Govt Supporter", "Non-supporter"),
      levels = c("Govt Supporter", "Non-supporter")
    )
  )

5.3 Data Preparation

This is the most important step. We need to:

Merge the conjoint data with the partisanship variable
Derive the choice variable: which profile did the respondent pick?
Label the attribute levels so they are human-readable in plots

5.3.1 Merge and Create the Choice Variable

In the raw data, selection contains the position (1 or 2) of the profile the respondent chose. Within each task (set), the chosen profile is the one where pos == selection.

merged_data <- left_join(conjoint_raw, survey_partisanship, by = "numericalId")

# `choice` is TRUE for the profile the respondent selected, FALSE for the other.
# We check this *within* each (respondent, task) pair.
merged_data <- merged_data |>
  group_by(numericalId, set) |>
  mutate(choice = (max(selection) == position)) |>
  ungroup()

Understanding max(selection) == position

In a forced-choice conjoint, each task shows exactly two profiles. selection records which position (1 or 2) was chosen. position records the position of this row’s profile (1 or 2). So for each group, max(selection) pulls out the chosen value, then compares it to each row’s position. The result is TRUE for the selected option and FALSE for all others.

5.3.2 Label Attributes and Rename ID

labeled_data <- merged_data |>
  mutate(
    `End of Confinement` = factor(
      atr1, levels = 1:3,
      labels = c("July", "September", "November")
    ),
    `Economic Impact` = factor(
      atr2, levels = 1:3,
      labels = c("Same unemployment as now", "1M more unemployed", "2M more unemployed")
    ),
    `Risk of Second Wave` = factor(
      atr3, levels = 1:4,
      labels = c("5% prob 2nd wave", "20% prob 2nd wave", "50% prob 2nd wave", "80% prob 2nd wave")
    ),
    `Privacy` = factor(
      atr4, levels = 1:2,
      labels = c("No cellphone control", "Cellphone control")
    ),
    `Executive Power` = factor(
      atr5, levels = 1:2,
      labels = c("No special powers", "Special powers")
    )
  ) |>
  rename(id = numericalId)  # we rename for clarity

5.4 Data Structure

Before modelling, let’s understand what the data looks like. Each row is one profile shown to one respondent in one task.

cat("Number of unique respondents:", n_distinct(labeled_data$id), "\n")

Number of unique respondents: 1236

cat("Total choice tasks shown:", nrow(distinct(labeled_data, id, set)), "\n")

Total choice tasks shown: 12360

cat("Rows per task (profiles per task):", nrow(labeled_data) / nrow(distinct(labeled_data, id, set)), "\n")

Rows per task (profiles per task): 2

cat("\nSample: first task for the first respondent\n")


Sample: first task for the first respondent

labeled_data |>
  filter(id == min(id), set == 1) |>
  select(id, set, position, choice, `End of Confinement`, `Economic Impact`,
         `Risk of Second Wave`, `Privacy`, `Executive Power`) |>
  print()

# A tibble: 2 × 9
         id   set position choice `End of Confinement` `Economic Impact` 
      <dbl> <dbl>    <dbl> <lgl>  <fct>                <fct>             
1 378656406     1        1 TRUE   July                 1M more unemployed
2 378656406     1        2 FALSE  November             2M more unemployed
# ℹ 3 more variables: `Risk of Second Wave` <fct>, Privacy <fct>,
#   `Executive Power` <fct>

What does the data structure tell us?

Each respondent sees 10 tasks. Each task has exactly 2 rows — one per profile. In each task, exactly one row has choice = TRUE. The attributes vary randomly across profiles and tasks. This randomization is what allows us to estimate causal effects of each attribute.

5.5 Conjoint Analysis: Full Sample

We define the formula once and reuse it throughout. This ensures consistency and makes it easy to add or remove attributes.

# Define the conjoint formula — reused for all models below
cj_formula <- choice ~ `End of Confinement` + `Economic Impact` +
                       `Risk of Second Wave` + `Privacy` + `Executive Power`

# Average Marginal Component Effects (AMCEs)
amce_results <- cj(labeled_data, cj_formula, id = ~id, estimate = "amce")

# Marginal Means
mm_results <- cj(labeled_data, cj_formula, id = ~id, estimate = "mm")

What is cregg estimating?

cj() fits a linear probability model (OLS) for AMCEs, or simply computes weighted means for MMs, with standard errors clustered by respondent. The id = ~id argument tells it which column identifies respondents for clustering.

The reference (baseline) level for AMCEs is automatically set to the first factor level in each attribute — in our case, “July”, “Same unemployment as now”, “5% prob 2nd wave”, “No cellphone control”, and “No special powers”. These appear as zero-effect points in the AMCE plot.

5.6 Visualization: Full Sample

We define a shared theme so all plots look consistent.

plot_theme <- theme_minimal(base_size = 14) +
  theme(
    plot.title    = element_text(face = "bold", hjust = 0.5),
    plot.subtitle = element_text(hjust = 0.5, size = 12),
    legend.position  = "none",               # suppress redundant legend (levels already on y-axis)
    strip.text.y     = element_text(angle = 0, face = "bold"),
    panel.border     = element_rect(color = "grey90", fill = NA),
    strip.background = element_rect(color = "grey90", fill = "grey95")
  )

5.6.1 AMCE Plot

The AMCE shows how much each level changes the probability of a profile being chosen, relative to the baseline level of that attribute.

ggplot(amce_results, aes(x = estimate, y = level)) +
  geom_vline(xintercept = 0, linetype = "dashed", color = "grey40") +
  geom_pointrange(aes(xmin = lower, xmax = upper), size = 0.5) +
  facet_grid(feature ~ ., scales = "free_y", space = "free_y") +
  labs(
    title    = "Which Policy Features Drive Public Choice?",
    subtitle = "AMCEs with 95% Confidence Intervals (reference levels shown at 0)",
    x = "Change in Probability of Choosing Profile",
    y = NULL
  ) +
  plot_theme

Average Marginal Component Effects (AMCEs) — Full Sample

Reading the AMCE plot

Points at exactly 0 are baseline levels (they don’t show up as lines because the estimate is zero by construction).
A point to the right of 0 means this level makes the profile more likely to be chosen than the baseline.
A point to the left of 0 means less likely.
Bars show 95% confidence intervals — if a bar crosses zero, the effect is not statistically distinguishable from zero at the 5% level.

5.6.2 Marginal Means Plot

Marginal Means show the predicted probability that a profile is chosen when a given level is present, averaged across all combinations of other attributes. The reference point is 0.5 (random chance in a 2-profile task).

ggplot(mm_results, aes(x = estimate, y = level)) +
  geom_vline(xintercept = 0.5, linetype = "dashed", color = "grey40") +
  geom_pointrange(aes(xmin = lower, xmax = upper), size = 0.5) +
  facet_grid(feature ~ ., scales = "free_y", space = "free_y") +
  scale_x_continuous(limits = c(0, 0.7)) +
  labs(
    title    = "Predicted Public Support for Policy Features",
    subtitle = "Marginal Means with 95% CIs (dashed line = random selection at 0.5)",
    x = "Predicted Probability of Choosing Profile",
    y = NULL
  ) +
  plot_theme

AMCE vs Marginal Means: a quick comparison

Both plots convey similar information, but:

	AMCE	Marginal Mean
Baseline needed?	Yes	No
Intuitive scale	Change in probability	Absolute probability
Best for	Hypothesis testing	Substantive interpretation

Notice that the rank order of levels within an attribute is the same in both plots. The difference is just the reference point (0 vs 0.5).

5.7 Subgroup Analysis: Partisanship

Does partisanship shape how citizens evaluate these trade-offs? We now estimate Marginal Means separately for government supporters and non-supporters using the by argument in cj().

mm_subgroup <- cj(
  labeled_data,
  cj_formula,
  id      = ~id,
  estimate = "mm",
  by      = ~govt_supporter
)

ggplot(mm_subgroup, aes(x = estimate, y = level, color = BY, shape = BY)) +
  geom_vline(xintercept = 0.5, linetype = "dashed", color = "grey40") +
  geom_pointrange(
    aes(xmin = lower, xmax = upper),
    size = 0.5,
    position = position_dodge(width = 0.6)
  ) +
  scale_color_manual(
    values = c("Govt Supporter" = "#2166ac", "Non-supporter" = "#d6604d")
  ) +
  scale_shape_manual(
    values = c("Govt Supporter" = 16, "Non-supporter" = 17)
  ) +
  scale_x_continuous(limits = c(0.2, 0.8)) +
  facet_grid(feature ~ ., scales = "free_y", space = "free_y") +
  labs(
    title    = "Do Government Supporters and Non-supporters Have Different Priorities?",
    subtitle = "Marginal Means estimated separately for each partisan group",
    x        = "Predicted Probability of Choosing Profile",
    y        = NULL,
    color    = "Partisanship",
    shape    = "Partisanship"
  ) +
  plot_theme +
  theme(legend.position = "bottom")  # restore legend for subgroup plot

5.7.1 Interpreting the Subgroup Results

The plot above reveals where partisan groups converge and where they diverge:

Economic Impact & Second Wave Risk: Both groups respond similarly to economic and health risks — higher unemployment and higher probability of a second wave both reduce support, roughly equally across groups. This suggests these considerations are less politically filtered.
Privacy (Cellphone Control): Non-supporters show markedly lower support for policies involving cellphone tracking compared to government supporters. The difference in predicted probabilities is substantively meaningful and statistically clear (non-overlapping CIs).
Executive Power (Special Powers): The starkest divergence. Non-supporters strongly penalize the granting of special powers to the government, while supporters are considerably more permissive. This is consistent with well-established findings that in-group members are more tolerant of power concentration.

Subgroup comparisons: a methodological note

The subgroup MMs tell us that the two groups differ in their predicted probabilities, but to formally test whether these differences are statistically significant, you should compute interaction AMCEs using cj() with estimate = "amce" and by = ~govt_supporter, then use cj_anova() to test for differences. Visualizing the subgroup plots is a useful first step, but it is not a formal test.

5.7.2 Formal test of subgroup comparison

We now test formally the interaction using the AMCES

# Step 1: Estimate interaction AMCEs separately by partisanship
amce_subgroup <- cj(
  labeled_data,
  cj_formula,
  id       = ~id,
  estimate = "amce",
  by       = ~govt_supporter
)

# Step 2: Formal test of whether AMCEs differ across partisan groups
cj_anova(labeled_data, cj_formula, id = ~id, by = ~govt_supporter)

Analysis of Deviance Table

Model 1: choice ~ `End of Confinement` + `Economic Impact` + `Risk of Second Wave` + 
    Privacy + `Executive Power`
Model 2: choice ~ `End of Confinement` + `Economic Impact` + `Risk of Second Wave` + 
    Privacy + `Executive Power` + govt_supporter + `End of Confinement`:govt_supporter + 
    `Economic Impact`:govt_supporter + `Risk of Second Wave`:govt_supporter + 
    Privacy:govt_supporter + `Executive Power`:govt_supporter
  Resid. Df Resid. Dev Df Deviance      F                Pr(>F)    
1     24710     5291.4                                             
2     24700     5251.9 10   39.509 18.582 < 0.00000000000000022 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

5.8 Conclusion

This analysis demonstrates a complete conjoint workflow in R:

Data preparation: merging conjoint and survey data, deriving the choice variable, and labeling attributes
Estimation: using cregg::cj() to compute AMCEs and Marginal Means
Visualization: readable ggplot2 plots with appropriate reference lines
Subgroup analysis: comparing partisan groups to uncover heterogeneity in preferences

The central substantive finding is that while economic and health considerations are shared across partisan lines, civil liberties trade-offs (cellphone surveillance, executive power) are strongly moderated by partisanship. Government supporters are considerably more willing to accept restrictions on civil liberties in exchange for security during a health crisis.