PS 5708 Week 11: Survey experiments and handling sensitive questions

List Experiment

A list experiment is a survey method designed to measure attitudes on sensitive topics by asking respondents to report the total number of items they agree with on a list, without specifying which ones. Respondents are randomly assigned to either a control group (with a list of non-sensitive items) or a treatment group (with the same non-sensitive items plus a sensitive one). By comparing the average count from both groups, researchers can estimate the proportion of respondents who hold the sensitive opinion.

Here we run a simulated (fake) experiment specifying four non-sensitive behaviors and one \(\textbf{sensitive}\) behavior:

Donated to a charity

Voted in a national election

Attended a religious service

Watched the evening news

\(\textbf{Used illegal drugs in the past years}\)!!!

To test for the valence of this behavior, we randomly assigned simulated 1,000 responses to two groups: the control group contains four non-sensitive items, the treatment group contains four non-sensitive items plus this highly sensitive item. After viewing the listed items, respondents were asked to choose how many such behaviors they had done in the past years without indicating the exact behavior (item). We also randomized the order of the four non-sensitive items to mitigate the ordering effect emanated from the sensitive item, as cautioned by Blair and Imai (2012).

The result of this simulated study is presented at below.

### Load required packages (uncomment if you haven't downloaded these packages)
# install.packages("list")     # list package by Blair & Imai
# install.packages("dplyr")    # For data wrangling
# install.packages("ggplot2")  # For visualization

library(list)
library(dplyr)
library(ggplot2)

### Import and prepare Qualtrics data
# treat = 0 for control list, 1 for treatment list
# plus covariates (e.g., age, gender, partisanship, income level)
list_df <- read.csv(url("https://www.dropbox.com/scl/fi/99ci5nzc07muhwq7g5yzy/list_experiment1.csv?rlkey=w3s4qscixyo12mwz30mpge85z&st=g5ue1bb3&dl=1"))

# Examine the data
dim(list_df)

## [1] 2000    7

names(list_df)

## [1] "gender"                                                          
## [2] "age"                                                             
## [3] "liberal"                                                         
## [4] "income_level"                                                    
## [5] "How.many.of.these.activities.have.you.done...Choose.from.0.to.4."
## [6] "How.many.of.these.activities.have.you.done...Choose.from.0.to.5."
## [7] "treat"

str(list_df)

## 'data.frame':    2000 obs. of  7 variables:
##  $ gender                                                          : int  1 1 2 2 1 2 1 2 1 2 ...
##  $ age                                                             : int  4 3 1 1 4 1 2 5 1 4 ...
##  $ liberal                                                         : int  8 6 3 6 6 9 7 5 3 8 ...
##  $ income_level                                                    : int  4 5 4 5 3 4 4 2 3 1 ...
##  $ How.many.of.these.activities.have.you.done...Choose.from.0.to.4.: int  NA 5 3 NA NA 1 NA 3 5 NA ...
##  $ How.many.of.these.activities.have.you.done...Choose.from.0.to.5.: int  2 NA NA 2 3 NA 2 NA NA 3 ...
##  $ treat                                                           : int  1 0 0 1 1 0 1 0 0 1 ...

# Collapse treatment & control group columns
list_df$outcome <- list_df$How.many.of.these.activities.have.you.done...Choose.from.0.to.4.
# Fill NA values in 'outcome' with corresponding values from treatment group
list_df$outcome[is.na(list_df$outcome)] <- list_df$How.many.of.these.activities.have.you.done...Choose.from.0.to.5.[is.na(list_df$outcome)]

# Minus 1 to let outcome value within proper range (0-4 for control group; 0-5 for treatment group)
list_df$outcome <- list_df$outcome - 1

# Rename for convenience
list_df <- list_df %>%
  rename(
    y = outcome,        # Count response
    treat = treat,     # Treatment indicator
    age = age,       # Example covariate
    gender = gender,
    liberal = liberal,
    income = income_level
  )

# Remove unused columns
list_df <- list_df[, -c(5, 6)]


# Summary of treatment status and outcome (y)
table(list_df$treat)

## 
##    0    1 
## 1000 1000

summary(list_df$y)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.000   2.000   2.228   4.000   5.000

### Visualize distribution of counts by treatment
ggplot(list_df, aes(x = y, fill = factor(treat))) +
  geom_bar(position = "dodge") +
  labs(fill = "Treatment",
       title = "Distribution of Counts by Treatment Group",
       x = "Number of Items Reported",
       y = "Count of Respondents") +
  theme(plot.title = element_text(hjust = 0.5))

### Simple difference-in-means estimator
# This gives the estimated "prevalence" of the sensitive item
mean_treat <- mean(list_df$y[list_df$treat == 1], na.rm = TRUE)
mean_control <- mean(list_df$y[list_df$treat == 0], na.rm = TRUE)
sensitive_estimate <- mean_treat - mean_control   # Quite high, but since this result is estimated from a simulated dataset, rest assured!

cat("Estimated prevalence of the sensitive item:", round(sensitive_estimate, 3), "\n")

## Estimated prevalence of the sensitive item: 0.5

### Blair and Imai's (2012) design effect test
# Check whether including the sensitive item changes responses to non-sensitive items
# J = number of non-sensitive items (e.g., 4 in our example)
ict.test(y = list_df$y, treat = list_df$treat, J = 4)

## 
## Test for List Experiment Design Effects
## 
## Estimated population proportions 
##                          est.   s.e.
## pi(Y_i(0) = 0, Z_i = 1) 0.048 0.0173
## pi(Y_i(0) = 1, Z_i = 1) 0.059 0.0215
## pi(Y_i(0) = 2, Z_i = 1) 0.109 0.0221
## pi(Y_i(0) = 3, Z_i = 1) 0.132 0.0194
## pi(Y_i(0) = 4, Z_i = 1) 0.152 0.0114
## pi(Y_i(0) = 0, Z_i = 0) 0.160 0.0116
## pi(Y_i(0) = 1, Z_i = 0) 0.131 0.0197
## pi(Y_i(0) = 2, Z_i = 0) 0.102 0.0221
## pi(Y_i(0) = 3, Z_i = 0) 0.066 0.0214
## pi(Y_i(0) = 4, Z_i = 0) 0.041 0.0169
## 
##  Y_i(0) is the (latent) count of 'yes' responses to the control items. Z_i is the (latent) binary response to the sensitive item.
## 
## Bonferroni-corrected p-value
## If this value is below alpha, you reject the null hypothesis of no design effect. If it is above alpha, you fail to reject the null.
## 
## Sensitive Item 1 
##                1

# If the p-value is > 0.05, no evidence of a design effect (good).
# If < 0.05, it suggests the sensitive item altered responses to other items.
# All insignificant, nothing to worry about even though it's a fake dataset

### Regression-adjusted analysis using ictreg() function
# Allows estimation of covariate effects on the probability of endorsing the sensitive item
# Method = "ml" (maximum likelihood) or "nls" (nonlinear least squares)

list.fit <- ictreg(
  formula = y ~ age + gender + liberal + income,
  data = list_df,
  treat = "treat",
  J = 4,
  method = "ml"  # Maximum likelihood
)

summary(list.fit)

## 
## Item Count Technique Regression 
## 
## Call: ictreg(formula = y ~ age + gender + liberal + income, data = list_df, 
##     treat = "treat", J = 4, method = "ml")
## 
## Sensitive item 
##                 Est.    S.E.
## (Intercept)  0.28360 0.47797
## age          0.01947 0.05719
## gender      -0.14777 0.19790
## liberal     -0.02985 0.03695
## income       0.00543 0.07029
## 
## Control items 
##                 Est.    S.E.
## (Intercept) -0.22729 0.11333
## age          0.02909 0.01415
## gender       0.01778 0.04866
## liberal      0.00095 0.00902
## income       0.02435 0.01707
## 
## Log-likelihood: -3977.355
## 
## Number of control items J set to 4. Treatment groups were indicated by '1' and the control group by '0'.

### Interpreting the output
# Intercept: baseline probability of endorsing sensitive item (at covariate = 0)
# Covariate coefficients: marginal effects on probability of endorsement


### Visualize in coef. plot

# Extract coefficients and SEs
treat_df <- data.frame(
     term = names(list.fit$par.treat),
 estimate = as.numeric(list.fit$par.treat),
       se = as.numeric(list.fit$se.treat),
    group = "Treatment"
)

control_df <- data.frame(
      term = names(list.fit$par.control),
  estimate = as.numeric(list.fit$par.control),
        se = as.numeric(list.fit$se.control),
     group = "Control"
)

# Combine into one dataframe
coef_df <- bind_rows(treat_df, control_df)
 
# Optional: remove intercepts (if we don't need to assess baseline effect)
coef_df <- coef_df %>% filter(term != "(Intercept)")
 
# Convert terms to factors for plotting
coef_df$term <- factor(coef_df$term, levels = unique(coef_df$term))


# Plotting

ggplot(coef_df, aes(x = term, y = estimate, color = group)) +     
  geom_point(position = position_dodge(width = 0.5), size = 3) +
  geom_errorbar(aes(ymin = estimate - 1.96*se,
                    ymax = estimate + 1.96*se),
                   width = 0.2,
                   position = position_dodge(width = 0.5)) +
         coord_flip() +
         theme_minimal(base_size = 13) +
         labs(
             title = "Coefficient Estimates from List Experiment",
             x = "",
             y = "Estimate +/- 1.96 SE",
             color = "Group"
       )

# Most effects are insignificant, but, hey, they are from a simulated dataset.

### Predict estimated probabilities using the example of "age"
# Get predicted expected counts
pred_mean <- predict(list.fit, type = "expected")

pred_df <- data.frame(
  covariate = list_df$age,
  y = pred_mean$fit
)

# Plot predicted counts as a function of age group

ggplot(pred_df, aes(x = covariate, y = y)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "loess") +
  labs(
    title = "Predicted Expected Number of Items Endorsed \n as a function of Age",
    x = "Age",
    y = "Expected Count"
  ) +
  scale_x_continuous(
    breaks = 1:6,
    labels = c("18–29", "30–39", "40–49", "50–59", "60–69", "≥70")
  ) + 
  theme(plot.title = element_text(hjust = 0.5))

## `geom_smooth()` using formula = 'y ~ x'

Conjoint Analysis

Conjoint analysis is a statistical (not “experimental” really) technique often used in market research to determine how respondents value different product or service features by presenting them with trade-offs. It breaks down products into attributes (like price, brand, or features), shows respondents different combinations, and analyzes their choices to understand the relative importance of each feature. For instance, what “feature(s)” do you prioritize when purchasing a smartphone?

This helps companies with product design, pricing strategies, market segmentation, and forecasting demand. Political scientists and policy makers also apply this approach to study candidate attributes and policy trade-offs in order to improve their campaign/nomination strategies (e.g., Bansak et al. 2023) and structure a better trade deal (e.g., Hahm et al. 2019).

There are many commercial packages for implementing conjoint analysis. The license of our university’s Qualtrics conjoint module, sadly, has expired. For pedagogical purpose, I turned to Anton Strezhnev’s Python-based conjointSDT (SDT stands for Survey Development Tool) to script our Qualtrics survey (kudos to our grad student Yuhang Si) and then used relevant functions from \(\textsf{R}\)’s \(\textsf{cjoint}\) and \(\textsf{cregg}\) packages to estimate the so-called Average Marginal Component Effect (AMCE) and Marginal Mean (MM) as social scientists care more about the \(\textit{causal effect}\) of an attribute or the \(\textit{predicted probability}\) of a profile with a given level than merely an attribute’s part worth.

In this simulated conjoint analysis, we ask respondents to choose between five pairs of candidates that differ in one of the following five attriutes:

Gender: {Male, Female}
: Age: {< 30, 30-50, > 50}
: Education: {High school, College, Grad school}
: Party ID (PID): {TPP, Independent, KMT, DPP}
: Policy: {Independence, Unification}

Each choice represents a “level” of an attribute. For convenience, we keep keep the order of these five attributes constant for all five choice tasks. We also intentionally restrict the combination of {DPP, Unification} as this would become an empty set in today’s Taiwanese politics :D

### Load required packages (uncomment if you haven't downloaded these packages)
# install.packages("cjoint")   # cjoint package by Strezhnev et al.
# install.packages("dplyr")    # For data wrangling
# install.packages("ggplot2")  # For visualization

library(cjoint)
library(dplyr)
library(ggplot2)


# Load design file
design <- cjoint::makeDesign(type="file", filename="C://Users/hktse/Dropbox/Syllabus/Public opinion analysis/conjoint/design.dat")

# Load individual choice data
choice <- read.csv(url("https://www.dropbox.com/scl/fi/dzm01nh8nhrxz2f5ldqk3/choice.csv?rlkey=v1igg6yd0qjbyw797qc8p93wu&st=txc28wn0&dl=1"))


# To make our dataset cjoint readable, we first need to take a look at the structure of cjoint's example immigration data and then reverse-engineer to make our dataset looks like cjoint's

data("immigrationconjoint")
head(immigrationconjoint)

##   CaseID contest_no       Education Gender Country of Origin
## 1      4          1     high school   male              Iraq
## 2      4          1       no formal female            France
## 3      4          2 graduate degree female             Sudan
## 4      4          2       4th grade female           Germany
## 5      4          3     high school female       Philippines
## 6      4          3  college degree   male             Sudan
##   Reason for Application                 Job Job Experience
## 1        seek better job               nurse       5+ years
## 2        seek better job child care provider      3-5 years
## 3     escape persecution            gardener      3-5 years
## 4    reunite with family construction worker       5+ years
## 5        seek better job               nurse       5+ years
## 6        seek better job child care provider           none
##                  Job Plans            Prior Entry          Language Skills
## 1   contract with employer        once as tourist tried English but unable
## 2 interviews with employer once w/o authorization         used interpreter
## 3   contract with employer        once as tourist           fluent English
## 4 interviews with employer                  never           fluent English
## 5 interviews with employer once w/o authorization           broken English
## 6   contract with employer                  never           fluent English
##   Chosen_Immigrant ethnocentrism profile LangPos PriorPos
## 1                1            50       1       5        4
## 2                0            50       2       5        4
## 3                0            50       1       5        4
## 4                1            50       2       5        4
## 5                1            50       1       5        4
## 6                0            50       2       5        4

Okay, so we need to reshape our data from wide to long format and then append “CaseID” (for each respondent), “contest_no” (two per task), “profile” (a unique combination of attributes at given levels), and the dichotomous outcome variable for a task (the profile that’s being chosen).

# Let's write a function to collapse collected responses into a long-format dataframe 
# choice_df: data.frame or matrix where
# each ROW = one respondent
# each COLUMN = one conjoint task (J tasks, here we have 5 tasks)
# cell values are 1 or 2 indicating which profile was chosen

make_conjoint_long <- function(choice_df) {

  # Bust be a numeric matrix
  mat <- as.matrix(choice_df)
  mode(mat) <- "numeric"
  
  J <- ncol(mat)      # number of tasks per respondent
  N <- nrow(mat)      # number of respondents
  
  # CaseID: each respondent has J*2 rows, respondents stacked
  CaseID <- rep(1:N, each = J * 2)
  
  # contest_no: for each respondent, produce (1,1,2,2,...,J,J)
  contest_no <- rep(rep(1:J, each = 2), times = N)
  
  # profile: for each task produce (1,2) and repeat for all tasks and respondents
  profile <- rep(rep(c(1,2), times = J), times = N)   # Equivalently we can do rep(c(1,2), times = J * N))
  
  # Choice: for each respondent, for each task produce two rows:
  # if respondent chose profile 1 (value == 1) -> c(1,0)
  # if respondent chose profile 2 (value == 2) -> c(0,1)
  # Concatenate respondents in the same order as CaseID
  Choice <- unlist(lapply(1:N, function(i) {
    resp_choices <- mat[i, ]            # length J, entries = 1 or 2
    # Map each task to a length-2 vector
    unlist(lapply(resp_choices, function(ch) {
      if (is.na(ch)) {
        # if missing, code both as NA (or you can choose 0,0)
        c(NA_integer_, NA_integer_)
      } else if (ch == 1) {
        c(1L, 0L)
      } else if (ch == 2) {
        c(0L, 1L)
      } else {
        stop("Choice values must be 1 or 2 (or NA). Found: ", ch)
      }
    }))
  }))
  
  # Build the dataframe
  out <- data.frame(
    CaseID = CaseID,
    contest_no = contest_no,
    profile = profile,
    Choice = Choice
  )
  
  # Ensure correct types
  out$CaseID <- as.integer(out$CaseID)
  out$contest_no <- as.integer(out$contest_no)
  out$profile <- as.integer(out$profile)
  out$Choice <- as.integer(out$Choice)
  
  return(out)
}

# Check
conjoint_long <- make_conjoint_long(choice)
head(conjoint_long, 12)

##    CaseID contest_no profile Choice
## 1       1          1       1      1
## 2       1          1       2      0
## 3       1          2       1      0
## 4       1          2       2      1
## 5       1          3       1      1
## 6       1          3       2      0
## 7       1          4       1      0
## 8       1          4       2      1
## 9       1          5       1      1
## 10      1          5       2      0
## 11      2          1       1      1
## 12      2          1       2      0

### Convert profile to long format

# Load profile data
profile <- read.csv(url("https://www.dropbox.com/scl/fi/w1g3lch0jjvhdmbi3jgdm/profile.csv?rlkey=uofp5sqejtcepgc56aobke8cy&st=umidcfs5&dl=1"))

# Function
make_profile_long <- function(profile_df, choice_df) {
  # profile_df: 1-row data frame, 75 columns
  # choice_df: respondent-level choices (used only for replication)
  
  J <- 5                # Number of tasks
  K <- 5                # Number of attributes per profile
  profiles_per_task <- 2
  
  # Total respondents (need to get this from previous file)
  N <- nrow(choice_df)
  
  # Each task has 3 blocks of 5 columns:
  # 1st = attribute names (whose order is the same across all J tasks)
  # 2nd = profile 1 values
  # 3rd = profile 2 values
  #
  # => pattern for columns:
  # [Task1: attr1-5 | p1_1-5 | p2_1-5]
  # [Task2: attr1-5 | p1_1-5 | p2_1-5]
  # ...
  
  # Initialize list to store results
  profile_list <- list()
  
  for (t in 1:J) {
    # Column index boundaries for this task
    start_col <- (t - 1) * 15 + 1
    attr_names <- as.character(profile_df[1, start_col:(start_col + 4)])
    
    # profile 1 and profile 2 values
    prof1 <- as.character(profile_df[1, (start_col + 5):(start_col + 9)])
    prof2 <- as.character(profile_df[1, (start_col + 10):(start_col + 14)])
    
    # Combine into a 2×5 dataframe with proper column names
    df_task <- data.frame(
      matrix(c(prof1, prof2), nrow = 2, byrow = TRUE),
      stringsAsFactors = FALSE
    )
    colnames(df_task) <- attr_names
    
    profile_list[[t]] <- df_task
  }
  
  # Combine all 5 tasks into 10×5 dataframe
  profile_long <- do.call(rbind, profile_list)
  rownames(profile_long) <- NULL
  
  # Replicate this entire 10×5 block for all respondents
  profile_long_full <- profile_long[rep(1:nrow(profile_long), times = N), ]
  
  return(profile_long_full)
  
}


# Check
profile_long <- make_profile_long(profile, choice)
# Convert to factor level for fully factoral design
profile_long[] <- lapply(profile_long, factor)
head(profile_long, 12)

##     Gender   Age         Edu         PID       Policy
## 1     Male 30-50 High School Independent  Unification
## 2   Female 30-50 Grad School Independent  Unification
## 3   Female 30-50 Grad School         DPP Independence
## 4     Male 30-50     College         KMT  Unification
## 5   Female 30-50 High School         DPP Independence
## 6     Male   >50     College Independent  Unification
## 7   Female   >50 High School Independent Independence
## 8   Female   >50 High School Independent Independence
## 9     Male   >50 Grad School         KMT Independence
## 10  Female 30-50 High School         TPP  Unification
## 1.1   Male 30-50 High School Independent  Unification
## 2.1 Female 30-50 Grad School Independent  Unification

nrow(profile_long)

## [1] 9610

### Replicate each respondent's attributes by the number of tasks (5) x profile choices (2)

# Load Respondent attributes
respondent_attributes <- read.csv(url("https://www.dropbox.com/scl/fi/9x1u2ivybpgckczuaykfz/personal_attributes.csv?rlkey=z4qn8vijrsvlddffb9nru3cbk&st=9wvzfljv&dl=1"))

# Number of repetitions (number of respondents)
J = 5
C = 2
n_choice <- J * C

# Replicate each row of personal_attributes by nrow(choice) times
respondent_attributes_long <- respondent_attributes[rep(1:nrow(respondent_attributes), each = n_choice), ]

# Recode gender
respondent_attributes_long$gender <- ifelse(respondent_attributes_long$gender == 2, 0, 1)


# Combine data
candidateconjoint <- cbind(conjoint_long, profile_long, respondent_attributes_long)

# Take a look at the data
dim(candidateconjoint)

## [1] 9610   13

names(candidateconjoint)

##  [1] "CaseID"     "contest_no" "profile"    "Choice"     "Gender"    
##  [6] "Age"        "Edu"        "PID"        "Policy"     "gender"    
## [11] "age"        "edu"        "partisan"

str(candidateconjoint)

## 'data.frame':    9610 obs. of  13 variables:
##  $ CaseID    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ contest_no: int  1 1 2 2 3 3 4 4 5 5 ...
##  $ profile   : int  1 2 1 2 1 2 1 2 1 2 ...
##  $ Choice    : int  1 0 0 1 1 0 0 1 1 0 ...
##  $ Gender    : Factor w/ 2 levels "Female","Male": 2 1 1 2 1 2 1 1 2 1 ...
##  $ Age       : Factor w/ 2 levels ">50","30-50": 2 2 2 2 2 1 1 1 1 2 ...
##  $ Edu       : Factor w/ 3 levels "College","Grad School",..: 3 2 2 1 3 1 3 3 2 3 ...
##  $ PID       : Factor w/ 4 levels "DPP","Independent",..: 2 2 1 3 1 2 2 2 3 4 ...
##  $ Policy    : Factor w/ 2 levels "Independence",..: 2 2 1 2 1 2 1 1 1 2 ...
##  $ gender    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ age       : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ edu       : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ partisan  : int  1 1 1 1 1 1 1 1 1 1 ...

Average Marginal Component Effect (AMCE)

We then run AMCE estimator using all attributes in the design. The \(\textsf{amce()}\) function fits a linear probability model (i.e., OLS) or logit model (depending on your options), treats each attribute as a factor. One level of each factor is the reference category (often the first or alphabetically first level, akin to fixed effects. Estimated coefficients are interpreted as the differences in the probability of being chosen relative to that baseline.

cj_amce.fit <- cjoint::amce(Choice ~ `Gender` + `Age` + `Edu` + `PID` + `Policy`, data = candidateconjoint, 
                cluster = TRUE, respondent.id = "CaseID", 
                design = design)

summary(cj_amce.fit)

## ------------------------------------------
## Average Marginal Component Effects (AMCE):
## ------------------------------------------
##  Attribute       Level   Estimate Std. Err  z value Pr(>|z|)  
##        Age       30-50  0.0098855 0.042438  0.23294 0.815809  
##        Edu Grad School  0.0036420 0.028223  0.12905 0.897322  
##        Edu High School -0.0327784 0.043287 -0.75723 0.448910  
##     Gender        Male  0.0062435 0.038971  0.16021 0.872717  
##        PID Independent  0.0176899 0.051139  0.34592 0.729406  
##        PID         KMT -0.0286160 0.016118 -1.77544 0.075825  
##        PID         TPP  0.0426639 0.082614  0.51643 0.605557  
##     Policy Unification -0.0312175 0.067974 -0.45926 0.646049  
## ---
## Number of Obs. = 9610
## ---
## Number of Respondents = 961
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05
## 
## --------------------
## AMCE Baseline Levels:
## --------------------
##  Attribute        Level
##        Age          >50
##        Edu      College
##     Gender       Female
##        PID          DPP
##     Policy Independence

# So if Gender has two levels {Male, Female} (and baseline = Female), then the AMCE coefficient for GenderMale = 0.006 means:
# Holding other attributes constant, a profile being Male (vs Female) increases its choice probability by 0.6 percentage points on average.

# Visalize AMCE
plot(cj_amce.fit, vline = 0.5)

# None of these effects is statistically significant, but this is expected as we use simulated data.

Marginal Mean (MM)

Another often-used effect metric is “marginal mean” (mm), which measures the average predicted probability that “a profile with a given level is chosen, regardless of baseline.” Let’s use the \(\textsf{cj()}\) function from the \(\textsf{cregg}\) package to obtain MM estimator.

# install.packages("cregg")    # Removed from CRAN, but you can load into your Rstudio manually as a ZIP file, get it here: https://cran.r-project.org/src/contrib/Archive/cregg/

library(cregg)

cj_mm.fit <- cregg::cj(Choice ~ Gender + Age + Edu + PID + Policy, data = candidateconjoint,
                id = ~ CaseID,
                estimate = "mm",  # Estimate marginal mean
                h0 = 0.5)  # A numeric value specifying a null hypothesis value to use when generating z-statistics and p-values, so correspondingly the alpha level is, by default, 0.95

## Warning in logLik.svyglm(x): svyglm not fitted by maximum likelihood.
## Warning in logLik.svyglm(x): svyglm not fitted by maximum likelihood.
## Warning in logLik.svyglm(x): svyglm not fitted by maximum likelihood.
## Warning in logLik.svyglm(x): svyglm not fitted by maximum likelihood.
## Warning in logLik.svyglm(x): svyglm not fitted by maximum likelihood.

print(cj_mm.fit)

##    outcome statistic feature        level  estimate   std.error          z
## 1   Choice        mm  Gender       Female 0.5065904 0.005377737  1.2254890
## 2   Choice        mm  Gender         Male 0.4901145 0.008066605 -1.2254890
## 3   Choice        mm     Age          >50 0.5010406 0.005557983  0.1872231
## 4   Choice        mm     Age        30-50 0.4993063 0.003705322 -0.1872231
## 5   Choice        mm     Edu      College 0.4895942 0.011256475 -0.9244304
## 6   Choice        mm     Edu  Grad School 0.5133541 0.009342494  1.4293982
## 7   Choice        mm     Edu  High School 0.4961498 0.005477122 -0.7029524
## 8   Choice        mm     PID          DPP 0.5104058 0.011256475  0.9244304
## 9   Choice        mm     PID  Independent 0.5015609 0.003227093  0.4836781
## 10  Choice        mm     PID          KMT 0.4838710 0.011523137 -1.3997084
## 11  Choice        mm     PID          TPP 0.5036420 0.016137002  0.2256949
## 12  Choice        mm  Policy Independence 0.5034339 0.005594886  0.6137610
## 13  Choice        mm  Policy  Unification 0.4965661 0.005594886 -0.6137610
##            p     lower     upper
## 1  0.2203910 0.4960502 0.5171305
## 2  0.2203910 0.4743042 0.5059247
## 3  0.8514857 0.4901471 0.5119340
## 4  0.8514857 0.4920440 0.5065686
## 5  0.3552623 0.4675319 0.5116565
## 6  0.1528898 0.4950432 0.5316651
## 7  0.4820854 0.4854149 0.5068848
## 8  0.3552623 0.4883435 0.5324681
## 9  0.6286144 0.4952359 0.5078859
## 10 0.1616007 0.4612860 0.5064559
## 11 0.8214387 0.4720141 0.5352700
## 12 0.5393733 0.4924681 0.5143997
## 13 0.5393733 0.4856003 0.5075319

# Ths function computes the "predicted probability" of being chosen for each attribute level, averaging over all respondents AND all combinations of other attributes. Yet there’s no reference category - each level stands on its own scale (between 0 and 1). Hence, while AMCEs are the differences relative to a base level (zero-centered), MMs are absolute probabilities (bounded between 0 and 1), representing the mean outcome across all appearances of a given conjoint feature level, holding all other features to their means (recall your knowledge of logit and probit models)


# Visualize MM using coef. plot

plot(cj_mm.fit, vline = 0.5)

# Plot the difference in AMCE (of candidate attributes) by Gender
# cregg supports literally all cjoint applications, including the latter's AMCE estimator, all you need to do is to tweak the estimate argument inside cj()
# First convert contest_no to factor class
candidateconjoint$gender <- factor(ifelse(candidateconjoint$gender == 1, "Male", "Female"))

amce_g <- cregg::cj(Choice ~ Gender + Age + Edu + PID + Policy, data = candidateconjoint, 
                id = ~ CaseID,
                estimate = "amce",  # Estimate AMCE
                by = ~gender # Notice here
                )

## Warning in logLik.svyglm(x): svyglm not fitted by maximum likelihood.
## Warning in logLik.svyglm(x): svyglm not fitted by maximum likelihood.

amce_diff_g <- cregg::cj(Choice ~ Gender + Age + Edu + PID + Policy, data = candidateconjoint, 
                id = ~ CaseID,
                estimate = "amce_diff",  # Estimate AMCE by gender
                by = ~gender # Notice here
                )

## Warning in logLik.svyglm(x): svyglm not fitted by maximum likelihood.

p <- plot(rbind(amce_g, amce_diff_g)) + ggplot2::facet_wrap(~BY, ncol = 3L)
p + scale_color_manual(values = rep("black", 9))

## Scale for colour is already present.
## Adding another scale for colour, which will replace the existing scale.

# Again, since we used simulated data, the result reported here should not be taken seriously.

PS 5708 Week 11: Survey experiments and handling sensitive questions

HK Tseng

2025-11-07

List Experiment

Conjoint Analysis

Average Marginal Component Effect (AMCE)

Marginal Mean (MM)