Data cleaning, preparation & first analysis

Christoph Schulze

Introduction and outline

Today we will have a detailed look in our data analysis. This includes:

Good habits I

  • structure your work such that others understand it as well
  • in other words, make it reproducible
  • good practice: one parent folder and separate folders for
  1. raw data,
  2. scripts,
  3. processed data, and
  4. output

Good habits II

  • use R Projects as “parent” files
  • set working directory to Project directory

flowchart LR
  A(Session) --> B(Set Working Directory)
  B --> C[To Project Directory]

  • everything you do, include it in the scripts (for the sake of reproducibility and your organisation)
setwd("~/ich/Teaching/Osnabrück/slides/Block 2/Quarto_B2D1")

Good habits III

  • provide a good script header
## Script_title: manipulation checks of treatments
##
## Notes: comparison of beliefs and self assessed knowledge
##
## Date created:
## Author(s): Christoph Schulze
  • comment your code
#loading another script in advance
source("01_scripts/00_header.R")

#clear working space
rm(list=ls())

#loading data into an object called 'data'
data <- read.csv("02_processed/02_pooled_data/all_milk_ppp.csv")

Good habits IV

  • provide sections in your code by adding ---- or ####
  • that way you can easily navigate in your code
# data cleaning ----
# data cleaning ####

Mock data

Study on agri-environmental contracts

For this session we will use data from a DCE on agri-environmental contracts choice (see Schulze et al. (2026)).

  • labelled DCE: private or public contract
  • 4 attributes
  • \((2 \times 2 \times 2 \times 8) \times (2 \times 2 \times 2 \times 8) = \\(2 \times 2 \times 2 \times 8)^2=4096\)
  • 12 per respondent (3 Blocks)
  • Bayesian efficient design

Study on agri-environmental contracts

Table 1: Attributes used to characterise AECM contract
Attribute Description Level Type
Payment mechanism Payment criteria of the agri-environmental contract Practice based; Result based Binary (0-Practice based, 1-Result based)
Contract partners Option to involve partners in the contract and implement measure cooperatively none; cooperative Binary (0-none, 1-cooperatives)
Advisory services Option to have advisory services included in the contract none; inclusive Binary (0-none, 1-inclusive)
Compensation Compensation payments per hectare 350–830 Euros/ha/year 8 equally-separated levels

Data preparation

Merging datasets

  • depending on how you collect your data, it may come in different formats
  • be aware what format is required from the program you working with
  • oftentimes we have one dataset (wide format) that includes participant characteristics (one observation per respondent)
  • then, we also have responses to choice tasks (long format) that include peoples choices

Merging datasets

The R package we are working with here (Apollo) will require the long format.

  • each row is treated as a respondents choice

\(\rightarrow\) transform data accordingly

Merging datasets

In our mock dataset, we have two distinct files: choice data and respondent characteristics.

 # include the R script which includes all the packages
 source("01_scripts/00_header.R")

 #load the database including choice data
1 database_choice <- read_excel("00_data/agora_live_1_v1.xlsx")

 #load the database including the covariates
 database_covar <- read_excel("00_data/agora_live_1_covariates.xlsx")
 
 #merge
2 database <- database_choice %>%
  left_join(database_covar, by="RID")
1
loading both datasets
2
merging data by variable RID

Clean dataset

In our raw datafile, we can hardly see what variables mean. For that reason, we rename and recode them accordingly.

# Recoding variables ----
database <- database %>% 
  mutate(pay_mode_1 = ifelse(a1_x1==2,0,1),
         partner_1 = ifelse(a1_x2==1,0,1),
         consult_1 = ifelse(a1_x3==1,0,1),
         compensation_1 = (case_when((a1_x4==1) ~ 350,
                                     (a1_x4==2) ~ 410,
                                     (a1_x4==3) ~ 470,
                                     (a1_x4==4) ~ 530,
                                     (a1_x4==5) ~ 590,
                                     (a1_x4==6) ~ 650,
                                     (a1_x4==7) ~ 710,
                                     (a1_x4==8) ~ 770,
                                     (a1_x4==9) ~ 830)),
         pay_mode_2 = ifelse(a2_x1==2,0,1),
         partner_2 = ifelse(a2_x2==1,0,1),
         consult_2 = ifelse(a2_x3==1,0,1),
         compensation_2 = (case_when((a2_x4==1) ~ 350,
                                     (a2_x4==2) ~ 410,
                                     (a2_x4==3) ~ 470,
                                     (a2_x4==4) ~ 530,
                                     (a2_x4==5) ~ 590,
                                     (a2_x4==6) ~ 650,
                                     (a2_x4==7) ~ 710,
                                     (a2_x4==8) ~ 770,
                                     (a2_x4==9) ~ 830)))
# Generating opt out alternatives and relocating them to the correct place ----

database <- database %>% 
  mutate(pay_mode_3 = 0,
         partner_3 = 0,
         consult_3 = 0,
         compensation_3 = 0) %>%
  relocate(pay_mode_3, .after=compensation_2) %>%
  relocate(partner_3, .after=pay_mode_3) %>%
  relocate(consult_3, .after=partner_3) %>%
  relocate(compensation_3, .after=consult_3) %>%
  as.data.frame()

Filter out incomplete responses

We only want to include individuals that completed the survey.

  • check STATUS variable for levels
  • 7 means complete
  • 10 means incomplete
database <- database %>% 
  filter(STATUS==7)

Protest responses

There might be individuals in our dataset that always chose the opt out alternative.

  • need to find out: why?
  • are they protesting or is the offered amount not sufficient?
Variable N N = 3471
ID 347 174.00 (100.31)
Someone else should carry out nature conservation (yes) 83 14 / 83 (17%)
The compensation amount was not sufficient (yes) 83 21 / 83 (25%)
I have already done a lot for nature conservation in the past (yes) 83 28 / 83 (34%)
There was not enough information available (yes) 83 17 / 83 (20%)
The measure does not fit into my farm operations (yes) 83 51 / 83 (61%)
1 Mean (SD); n / N (%)

Protest responses

Creating the dataset which excludes all serial protest responses.

database_np <- database %>%
  group_by(ID) %>%
1  summarise(always_opt_out = all(pref1 == 3 & q30_2==1)) %>%
  filter(!always_opt_out) %>%
  select(ID) %>%
  inner_join(database, by = "ID")

cset <- 12
nsets <- nrow(database_np)
respondents <- nsets/cset %>% 
  as.numeric()

database_np <- database_np %>% 
  mutate(ID = rep(1:respondents, each=cset))

2write.csv(database_np, "02_processed/agora_data.csv")
1
identify respondents who always choose option 3 and would never participate
2
save database excluding protesters

Descriptive statistics

Sample description

We may want to know a few things prior to getting into the choice modelling analysis about the representativeness of our sample.

  • age
  • gender
  • income
  • education
  • if you are a farmers: farm size, years of experience, share of owned vs. leased land etc.

Choice modelling

Initiating Apollo

source("01_scripts/00_header.R")

1apollo_initialise()

nalt=12

2database <- read.csv("02_processed/agora_data.csv")

3apollo_control = list(
  modelName  ="Agora baseline",
  modelDescr ="Simple MNL model",
  indivID    ="ID",
  outputDirectory = "output",
  panelData = TRUE
)
1
we start apollo
2
we load the database
3
we name a model, identify the respondent specific ID variable, and create an output directory for all estimation files

Logit models

Maximum Likelihood Estimation

Which values of \(\beta\) make the choices we observe (\(d_{nj}=1\)) most probable?

  • define starting values for search
# set apollo prior values ----
apollo_beta=c(asc_pub          = 0,
              b_pay_mode_pub   = 0,
              b_partner_pub    = 0,
              b_consult_pub    = 0,
              asc_priv         = 0,
              b_pay_mode_priv  = 0,
              b_partner_priv   = 0,
              b_consult_priv   = 0,
              b_comp           = 0)

Logit models

Building utlity functions \[\begin{align} U_i = \underbrace{V_i}_{\text{deterministic}} + \underbrace{\epsilon_i}_{\text{random}} \label{eq:rum} \end{align}\] with \[\begin{align} V_i = ASC + \beta_1X_{1_i} + \beta_2X_{2_i} + \cdots + \beta_{COST_i}X_{COST_i} \label{eq:det_part} \end{align}\]

  V[['first']]   = asc_pub + b_pay_mode_pub * pay_mode_1 + b_partner_pub * partner_1 + b_consult_pub * consult_1 + b_comp * compensation_1
  V[['second']]  = asc_priv + b_pay_mode_priv * pay_mode_2 + b_partner_priv * partner_2 + b_consult_priv * consult_2 + b_comp *  compensation_2 
  V[['third']]   = 0 

Logit models

Finally, we estimate the model…

AgoraBaselineMNL = apollo_estimate(apollo_beta, apollo_fixed, apollo_probabilities, apollo_inputs,
                                                  estimate_settings=list(maxIterations=400, estimationRoutine="bfgs",
                                                                         hessianRoutine="analytic"))
Preparing user-defined functions.

Testing likelihood function...

Overview of choices for MNL model component :
                                   first  second   third
Times available                  3420.00 3420.00 3420.00
Times chosen                     1374.00  845.00 1201.00
Percentage chosen overall          40.18   24.71   35.12
Percentage chosen when available   40.18   24.71   35.12


Pre-processing likelihood function...

Testing influence of parameters
Starting main estimation
Initial function value: -3757.254 
Initial gradient value:
        asc_pub  b_pay_mode_pub   b_partner_pub   b_consult_pub        asc_priv 
       234.0000        103.0000         94.0000        143.0000       -295.0000 
b_pay_mode_priv  b_partner_priv  b_consult_priv          b_comp 
      -172.3333       -168.0000       -104.6667      83310.0000 
initial  value 3757.254027 
iter   2 value 3745.828230
iter   3 value 3684.427878
iter   4 value 3600.239896
iter   5 value 3595.030966
iter   6 value 3587.807902
iter   7 value 3579.101758
iter   8 value 3536.370271
iter   9 value 3536.231559
iter  10 value 3531.239777
iter  11 value 3456.412293
iter  12 value 3454.913940
iter  13 value 3450.303799
iter  14 value 3450.295732
iter  15 value 3450.295635
iter  15 value 3450.295634
iter  15 value 3450.295634
final  value 3450.295634 
converged

Estimated parameters with approximate standard errors from BHHH matrix:
                   Estimate     BHHH se BHH t-ratio (0)
asc_pub           -2.191634     0.12037        -18.2076
b_pay_mode_pub    -0.131633     0.07954         -1.6548
b_partner_pub     -0.039843     0.07582         -0.5255
b_consult_pub      0.225681     0.09753          2.3141
asc_priv          -2.696137     0.12172        -22.1495
b_pay_mode_priv   -0.227101     0.08420         -2.6971
b_partner_priv    -0.068259     0.09859         -0.6924
b_consult_priv     0.278566     0.07897          3.5277
b_comp             0.003876  1.4192e-04         27.3091

Final LL: -3450.2956

Calculating log-likelihood at equal shares (LL(0)) for applicable
  models...
Calculating log-likelihood at observed shares from estimation data
  (LL(c)) for applicable models...
Calculating LL of each model component...
Calculating other model fit measures
Computing covariance matrix using numerical jacobian of analytical
  gradient.
 0%....25%....50%....75%...100%
Negative definite Hessian with maximum eigenvalue: -30.659283

Please acknowledge the use of Apollo by citing Hess & Palma (2019) -
  doi.org/10.1016/j.jocm.2019.100170

Logit models

… and save the output in the output directory.

apollo_modelOutput(AgoraBaselineMNL)
Model run by cschulze using Apollo 0.3.5 on R 4.5.0 for Windows.
Please acknowledge the use of Apollo by citing Hess & Palma (2019)
  DOI 10.1016/j.jocm.2019.100170
  www.ApolloChoiceModelling.com

Model name                                  : Agora baseline
Model description                           : Simple MNL model
Model run at                                : 2026-03-05 16:52:35.367829
Estimation method                           : bfgs
Model diagnosis                             : successful convergence
Optimisation diagnosis                      : Maximum found
     hessian properties                     : Negative definite
     maximum eigenvalue                     : -30.659283
     reciprocal of condition number         : 1.02635e-07
Number of individuals                       : 285
Number of rows in database                  : 3420
Number of modelled outcomes                 : 3420

Number of cores used                        :  1 
Model without mixing

LL(start)                                   : -3757.25
LL at equal shares, LL(0)                   : -3757.25
LL at observed shares, LL(C)                : -3691.16
LL(final)                                   : -3450.3
Rho-squared vs equal shares                  :  0.0817 
Adj.Rho-squared vs equal shares              :  0.0793 
Rho-squared vs observed shares               :  0.0653 
Adj.Rho-squared vs observed shares           :  0.0634 
AIC                                         :  6918.59 
BIC                                         :  6973.83 

Estimated parameters                        : 9
Time taken (hh:mm:ss)                       :  00:00:1.34 
     pre-estimation                         :  00:00:0.6 
     estimation                             :  00:00:0.37 
     post-estimation                        :  00:00:0.37 
Iterations                                  :  17  

Unconstrained optimisation.

Estimates:
                   Estimate        s.e.   t.rat.(0)    Rob.s.e. Rob.t.rat.(0)
asc_pub           -2.191634     0.13367    -16.3955     0.17500      -12.5236
b_pay_mode_pub    -0.131633     0.07303     -1.8024     0.07134       -1.8452
b_partner_pub     -0.039843     0.07304     -0.5455     0.07307       -0.5453
b_consult_pub      0.225681     0.07304      3.0898     0.05607        4.0250
asc_priv          -2.696137     0.13914    -19.3772     0.17740      -15.1984
b_pay_mode_priv   -0.227101     0.08287     -2.7405     0.09130       -2.4874
b_partner_priv    -0.068259     0.08322     -0.8203     0.07461       -0.9149
b_consult_priv     0.278566     0.08312      3.3512     0.09250        3.0114
b_comp             0.003876  1.8683e-04     20.7443  2.6245e-04       14.7671
apollo_saveOutput(AgoraBaselineMNL)

Old result file "output/Agora baseline_output.txt" 
 renamed to: "output/Agora baseline_OLD49_output.txt"
Old result file "output/Agora baseline_estimates.csv" 
 renamed to: "output/Agora baseline_OLD49_estimates.csv"
Old result file "output/Agora baseline_model.rds" 
 renamed to: "output/Agora baseline_OLD49_model.rds"
Model output saved to output/Agora baseline_output.txt 
Estimates saved to output/Agora baseline_estimates.csv 
Model object saved to output/Agora baseline.rds 

Welfare analysis

Remember, willingness to accept is the marginal rate of substitution, hence the quantification of how respondents trade-off different attributes.

How do farmers trade off different attributes, specifically compensation payments to other contract attributes, while supposedly maintaining the same level of utility?

Welfare analysis

Whenever we need standard errors from a nonlinear function of estimated coefficients, we use the delta method. The delta method turns uncertainty in your estimated \(\beta\)’s into uncertainty for a derived measure (like WTP or CV) by linearizing the formula around \(\hat\beta\) (Daly, Hess, and De Jong 2012).

  1. Approximate \(g(\beta)\) by a straight linear line near \(\hat \beta\) (a first-order Taylor expansion).
  2. Use that linear approximation to propagate the variance from \(\hat \beta\) to \(g(\hat \beta)\)

Welfare analysis

If \(g(\beta)\) is in our case WTP or CV, then \[\begin{align} \operatorname{Var}\!\big(g(\hat{\boldsymbol{\beta}})\big) &\approx \nabla g(\hat{\boldsymbol{\beta}})^{\top}\, \widehat{\boldsymbol{\Sigma}}\, \nabla g(\hat{\boldsymbol{\beta}}) \\ \operatorname{SE}\!\big(g(\hat{\boldsymbol{\beta}})\big) &\approx \sqrt{\nabla g(\hat{\boldsymbol{\beta}})^{\top}\, \widehat{\boldsymbol{\Sigma}}\, \nabla g(\hat{\boldsymbol{\beta}})} \end{align}\]

  • \(g(\hat{\boldsymbol{\beta}})\) is the vector of partial derivatives
  • \(\widehat{\boldsymbol{\Sigma}}\) is the estimated covariance matrix of \(\hat \beta\)

Welfare analysis

Constructing WTA measures from parameters estimates.

deltaMethod_settings=list(expression=c(
wta_asc_pub ="asc_pub/b_comp",
wta_paymode_pub = "b_pay_mode_pub/b_comp",
wta_consult_pub = "b_consult_pub/b_comp",
wta_asc_priv ="asc_priv/b_comp",
wta_paymode_priv = "b_pay_mode_priv/b_comp",
wta_consult_priv = "b_consult_priv/b_comp"
))
  
apollo_deltaMethod(AgoraBaselineMNL, deltaMethod_settings)  
Running Delta method computation for user-defined function using robust standard errors

       Expression     Value    s.e. t-ratio (0)
      wta_asc_pub -565.4905 28.9924      -19.50
  wta_paymode_pub  -33.9642 18.3939       -1.85
  wta_consult_pub   58.2309 15.0827        3.86
     wta_asc_priv -695.6637 35.9319      -19.36
 wta_paymode_priv  -58.5972 23.7508       -2.47
 wta_consult_priv   71.8763 24.3206        2.96
INFORMATION: The results of the Delta method calculations are returned invisibly as
  an output from this function. Calling the function via
  result=apollo_deltaMethod(...) will save this output in an object
  called result (or otherwise named object). 

Welfare measures

Constructing compensating variation from parameters estimates.

deltaMethod_settings <- list(expression = c(
  # Scenario utilities (partner not significant)
  V1_pub  = "asc_pub  + 1*b_pay_mode_pub  + 1*b_consult_pub  + 0*b_partner_pub",
  V1_priv = "asc_priv + 1*b_pay_mode_priv + 1*b_consult_priv + 0*b_partner_priv",

  # Compensating variation (simple): CV = -(V1 - V0) / b_comp
  cv_pub  = "-((asc_pub  + 1*b_pay_mode_pub  + 1*b_consult_pub  + 0*b_partner_pub) - 0)/b_comp",
  cv_priv = "-((asc_priv + 1*b_pay_mode_priv + 1*b_consult_priv + 0*b_partner_priv) - 0)/b_comp"
))

apollo_deltaMethod(AgoraBaselineMNL, deltaMethod_settings)  
Running Delta method computation for user-defined function using robust standard errors

 Expression    Value    s.e. t-ratio (0)
     V1_pub  -2.0976  0.1853      -11.32
    V1_priv  -2.6447  0.1769      -14.95
     cv_pub 541.2238 31.0348       17.44
    cv_priv 682.3846 34.5646       19.74
INFORMATION: The results of the Delta method calculations are returned invisibly as
  an output from this function. Calling the function via
  result=apollo_deltaMethod(...) will save this output in an object
  called result (or otherwise named object). 

Subgroup analysis

What about impacts of:

  • age
  • education
  • gender
  • share of income on overall income (part time farmers)
  • soil quality
  • past AECM participation
  • claiming nature conservation advice
  • focus of farm

Subgroup analysis

We could either:

  • interact certain variables with particular attributes of interest
  • split the sample based on certain criteria of interest

However, be aware of sample sizes of subsamples!

Subgroup analysis

Split sample

We could split the sample, for example, based on certified organic and non-organic farms (variable c9 in our data).

# create a new object called `database_organic` that only includes farmers that
# stated that they are certified organic
database_organic <- database %>% 
  filter(c9==1)

And then run the analysis for the split sample.

Subgroup analysis

Interactions

  • Alternatively, we could also use interactions to investigate sensitivity of particular attributes.
  • This might make sense when main effects were insignificant.
  • For example, the interaction of past use of nature conservation services (c11) and the partner attribute.
  • Our utility function would then change accordingly.
# renaming and recoding the variable for easier interpretation
database <- database %>% 
  mutate(advice_past=ifelse(c11==1,1,0))

Subgroup analysis

Interactions

# set apollo prior values ----
apollo_beta=c(asc_pub          = 0,
              b_pay_mode_pub   = 0,
              b_partner_pub    = 0,
              b_consult_pub    = 0,
              asc_priv         = 0,
              b_pay_mode_priv  = 0,
              b_partner_priv   = 0,
              b_consult_priv   = 0,
              b_comp           = 0,
1              int_partner_advice_pub=0,
              int_partner_advice_priv=0)
1
Adding new interaction terms to our list of prior values.

Subgroup analysis

Interactions

Including new dummy coded variable in utility

apollo_probabilities=function(apollo_beta, apollo_inputs, functionality="estimate"){
  
  apollo_attach(apollo_beta, apollo_inputs)
  on.exit(apollo_detach(apollo_beta, apollo_inputs))
  
  P = list()
  V = list()
  
  V[['first']]   = asc_pub + b_pay_mode_pub * pay_mode_1 + b_partner_pub * partner_1 + int_partner_advice_pub * advice_past * partner_1 + b_consult_pub * consult_1 + b_comp * compensation_1
  V[['second']]  = asc_priv + b_pay_mode_priv * pay_mode_2 + b_partner_priv * partner_2 + int_partner_advice_priv * advice_past * partner_1 + b_consult_priv * consult_2 + b_comp *  compensation_2 
  V[['third']]   = 0 
  
  mnl_settings = list(
    alternatives  = c(first=1, second=2, third=3), 
    avail         = list(first=1, second=1, third=1), 
    choiceVar     = pref1,
    V             = V
  )
  
  P[['model']] = apollo_mnl(mnl_settings, functionality)
  P = apollo_panelProd(P, apollo_inputs, functionality)
  P = apollo_prepareProb(P, apollo_inputs, functionality)
  return(P)
}

Subgroup analysis

Interactions

AgoraBaselineMNL_int = apollo_estimate(apollo_beta, apollo_fixed, apollo_probabilities, apollo_inputs,
                                                  estimate_settings=list(maxIterations=400, estimationRoutine="bfgs",
                                                                         hessianRoutine="analytic"))
Preparing user-defined functions.

Testing likelihood function...

Overview of choices for MNL model component :
                                   first  second   third
Times available                  3420.00 3420.00 3420.00
Times chosen                     1374.00  845.00 1201.00
Percentage chosen overall          40.18   24.71   35.12
Percentage chosen when available   40.18   24.71   35.12


Pre-processing likelihood function...

Testing influence of parameters
Starting main estimation
Initial function value: -3757.254 
Initial gradient value:
                asc_pub          b_pay_mode_pub           b_partner_pub 
               234.0000                103.0000                 94.0000 
          b_consult_pub                asc_priv         b_pay_mode_priv 
               143.0000               -295.0000               -172.3333 
         b_partner_priv          b_consult_priv                  b_comp 
              -168.0000               -104.6667              83310.0000 
 int_partner_advice_pub int_partner_advice_priv 
                56.0000                -34.0000 
initial  value 3757.254027 
iter   2 value 3745.828213
iter   3 value 3683.862865
iter   4 value 3595.432140
iter   5 value 3519.048374
iter   6 value 3502.433816
iter   7 value 3498.853609
iter   8 value 3496.198891
iter   9 value 3473.747313
iter  10 value 3456.913127
iter  11 value 3455.016138
iter  12 value 3454.533078
iter  13 value 3453.972938
iter  14 value 3444.988131
iter  15 value 3442.834990
iter  16 value 3442.830708
iter  17 value 3442.830571
iter  17 value 3442.830569
iter  17 value 3442.830569
final  value 3442.830569 
converged

Estimated parameters with approximate standard errors from BHHH matrix:
                           Estimate     BHHH se BHH t-ratio (0)
asc_pub                   -2.203688     0.12115        -18.1893
b_pay_mode_pub            -0.141271     0.07993         -1.7673
b_partner_pub             -0.155886     0.07700         -2.0246
b_consult_pub              0.231883     0.09769          2.3736
asc_priv                  -2.722525     0.12170        -22.3716
b_pay_mode_priv           -0.227577     0.08469         -2.6872
b_partner_priv            -0.067603     0.09863         -0.6854
b_consult_priv             0.267883     0.08000          3.3485
b_comp                     0.003881  1.4208e-04         27.3145
int_partner_advice_pub     0.506046     0.07193          7.0352
int_partner_advice_priv    0.219875     0.09235          2.3809

Final LL: -3442.8306

Calculating log-likelihood at equal shares (LL(0)) for applicable
  models...
Calculating log-likelihood at observed shares from estimation data
  (LL(c)) for applicable models...
Calculating LL of each model component...
Calculating other model fit measures
Computing covariance matrix using numerical jacobian of analytical
  gradient.
 0%....25%....50%....75%..100%
Negative definite Hessian with maximum eigenvalue: -29.886489

Please acknowledge the use of Apollo by citing Hess & Palma (2019) -
  doi.org/10.1016/j.jocm.2019.100170

Subgroup analysis

Interactions

apollo_modelOutput(AgoraBaselineMNL_int)
Model run by cschulze using Apollo 0.3.5 on R 4.5.0 for Windows.
Please acknowledge the use of Apollo by citing Hess & Palma (2019)
  DOI 10.1016/j.jocm.2019.100170
  www.ApolloChoiceModelling.com

Model name                                  : Agora baseline interaction
Model description                           : Simple MNL model with interaction
Model run at                                : 2026-03-05 16:52:38.332485
Estimation method                           : bfgs
Model diagnosis                             : successful convergence
Optimisation diagnosis                      : Maximum found
     hessian properties                     : Negative definite
     maximum eigenvalue                     : -29.886489
     reciprocal of condition number         : 1.00342e-07
Number of individuals                       : 285
Number of rows in database                  : 3420
Number of modelled outcomes                 : 3420

Number of cores used                        :  1 
Model without mixing

LL(start)                                   : -3757.25
LL at equal shares, LL(0)                   : -3757.25
LL at observed shares, LL(C)                : -3691.16
LL(final)                                   : -3442.83
Rho-squared vs equal shares                  :  0.0837 
Adj.Rho-squared vs equal shares              :  0.0808 
Rho-squared vs observed shares               :  0.0673 
Adj.Rho-squared vs observed shares           :  0.0648 
AIC                                         :  6907.66 
BIC                                         :  6975.17 

Estimated parameters                        : 11
Time taken (hh:mm:ss)                       :  00:00:1.15 
     pre-estimation                         :  00:00:0.3 
     estimation                             :  00:00:0.39 
     post-estimation                        :  00:00:0.45 
Iterations                                  :  20  

Unconstrained optimisation.

Estimates:
                           Estimate        s.e.   t.rat.(0)    Rob.s.e.
asc_pub                   -2.203688     0.13403    -16.4414     0.17528
b_pay_mode_pub            -0.141271     0.07320     -1.9299     0.07172
b_partner_pub             -0.155886     0.08035     -1.9401     0.09723
b_consult_pub              0.231883     0.07319      3.1681     0.05619
asc_priv                  -2.722525     0.13981    -19.4729     0.17984
b_pay_mode_priv           -0.227577     0.08292     -2.7445     0.09128
b_partner_priv            -0.067603     0.08326     -0.8119     0.07468
b_consult_priv             0.267883     0.08339      3.2123     0.09222
b_comp                     0.003881  1.8729e-04     20.7212  2.6322e-04
int_partner_advice_pub     0.506046     0.13218      3.8285     0.24693
int_partner_advice_priv    0.219875     0.13756      1.5984     0.21803
                        Rob.t.rat.(0)
asc_pub                      -12.5725
b_pay_mode_pub                -1.9696
b_partner_pub                 -1.6032
b_consult_pub                  4.1265
asc_priv                     -15.1387
b_pay_mode_priv               -2.4931
b_partner_priv                -0.9053
b_consult_priv                 2.9049
b_comp                        14.7443
int_partner_advice_pub         2.0493
int_partner_advice_priv        1.0085
apollo_saveOutput(AgoraBaselineMNL_int)

Old result file "output/Agora baseline interaction_output.txt" 
 renamed to: "output/Agora baseline interaction_OLD47_output.txt"
Old result file "output/Agora baseline interaction_estimates.csv" 
 renamed to: "output/Agora baseline interaction_OLD47_estimates.csv"
Old result file "output/Agora baseline interaction_model.rds" 
 renamed to: "output/Agora baseline interaction_OLD47_model.rds"
Model output saved to output/Agora baseline interaction_output.txt 
Estimates saved to output/Agora baseline interaction_estimates.csv 
Model object saved to output/Agora baseline interaction.rds 

Subgroup analysis

Interactions

deltaMethod_settings=list(expression=c(
wta_asc_pub ="asc_pub/b_comp",
wta_paymode_pub = "b_pay_mode_pub/b_comp",
wta_partner_advice_pub = "(b_partner_pub + int_partner_advice_pub)/b_comp",
wta_partner_pub = "b_partner_pub/b_comp",
wta_consult_pub = "b_consult_pub/b_comp",
wta_asc_priv ="asc_priv/b_comp",
wta_paymode_priv = "b_pay_mode_priv/b_comp",
wta_partner_advice_priv = "(b_partner_pub + int_partner_advice_priv)/b_comp",
wta_partner_priv = "b_partner_priv/b_comp",
wta_consult_priv = "b_consult_priv/b_comp"
))
  
apollo_deltaMethod(AgoraBaselineMNL_int, deltaMethod_settings)  
Running Delta method computation for user-defined function using robust standard errors

              Expression     Value    s.e. t-ratio (0)
             wta_asc_pub -567.8231 28.9293      -19.63
         wta_paymode_pub  -36.4013 18.4762       -1.97
  wta_partner_advice_pub   90.2255 52.3928        1.72
         wta_partner_pub  -40.1671 25.0477       -1.60
         wta_consult_pub   59.7491 15.1144        3.95
            wta_asc_priv -701.5114 36.9420      -18.99
        wta_paymode_priv  -58.6395 23.6923       -2.48
 wta_partner_advice_priv   16.4878 55.2773        0.30
        wta_partner_priv  -17.4192 19.0249       -0.92
        wta_consult_priv   69.0253 24.2324        2.85
INFORMATION: The results of the Delta method calculations are returned invisibly as
  an output from this function. Calling the function via
  result=apollo_deltaMethod(...) will save this output in an object
  called result (or otherwise named object). 

Addtional analysis

From parameters to policy

If we force both contracts to be result-based + include advisory services, how does uptake change as compensation increases?

pred_in <- apollo_prediction(
  model                = AgoraBaselineMNL,
  apollo_probabilities  = apollo_probabilities,
  apollo_inputs         = apollo_inputs,
  prediction_settings   = list(summary = FALSE, modelComponent = "model")
)
Running predictions from model using parameter estimates...

Predicted probabilities

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.08363 0.27175 0.37849 0.39096 0.49881 0.71242 
    first    second     third 
0.4017534 0.2470772 0.3511694 

Prediction by varying compensation

Prediction by varying compensation

Prediction by varying compensation

Simulation of different sample sizes

Role of sample size

Simulating data

We simulate a Discrete Choice Experiment dataset that follows the Random Utility Model (RUM): \[\begin{align} U_{njt}=V_{njt}+\epsilon_{njt} \end{align}\]

  • n: respondent
  • t: choice task
  • j: alternative

Role of sample size

Simulating data

We create data where we know the truth, so we can later estimate an MNL and check whether we recover it.

We will:

  1. Create hypothetical choice tasks (attribute levels for each alternative)
  2. Specify “true” preference parameters \(\beta\) (the data-generating process)
  3. Compute deterministic utilities \(V\) from attributes and \(\beta\)
  4. Add random shocks \(\epsilon\) (Gumbel) to get total utilities \(U\)
  5. Generate a choice by picking the alternative with the highest \(U\)
  6. Retrieve \(\beta\) parameters from artificial sample

Role of sample size

Simulating data

Building a dataset with n rows (No choices yet!).


SIM_N <- 300  # number of respondents
SIM_T <- 12     # choice tasks per respondent
SIM_SEED <- 123

sim_agora <- function(N = SIM_N, T = SIM_T, seed = SIM_SEED){

  comp_levels <- c(350,410,470,530,590,650,710,770,830)

  n <- N*T
  df <- data.frame(
    ID = rep(1:N, each = T),

    pay_mode_1 = rbinom(n, 1, 0.5),
    partner_1  = rbinom(n, 1, 0.5),
    consult_1  = rbinom(n, 1, 0.5),
    compensation_1 = sample(comp_levels, n, replace=TRUE),

    pay_mode_2 = rbinom(n, 1, 0.5),
    partner_2  = rbinom(n, 1, 0.5),
    consult_2  = rbinom(n, 1, 0.5),
    compensation_2 = sample(comp_levels, n, replace=TRUE),

    pay_mode_3 = 0,
    partner_3  = 0,
    consult_3  = 0,
    compensation_3 = 0
  )

Role of sample size

Simulating data

Defining “true” preference parameters (the data-generating process)


  # Choose "true" parameters (pick values that yield sensible choices)
  beta_true <- c(
    asc_pub        = 0.3,
    asc_priv       = 0.05,
    b_pay_mode_pub = 0.4,
    b_partner_pub  = 0.15,
    b_consult_pub  = 0.5,
    b_pay_mode_priv= 0.2,
    b_partner_priv = 0.1,
    b_consult_priv = 0.35,
    b_comp         = -0.003
  )

Role of sample size

Simulating data

Building deterministic utilities \(V\)


  V1 <- with(df, beta_true["asc_pub"] +
                beta_true["b_pay_mode_pub"]*pay_mode_1 +
                beta_true["b_partner_pub"]*partner_1 +
                beta_true["b_consult_pub"]*consult_1 +
                beta_true["b_comp"]*compensation_1)

  V2 <- with(df, beta_true["asc_priv"] +
                beta_true["b_pay_mode_priv"]*pay_mode_2 +
                beta_true["b_partner_priv"]*partner_2 +
                beta_true["b_consult_priv"]*consult_2 +
                beta_true["b_comp"]*compensation_2)

  V3 <- 0

Role of sample size

Simulating data

Adding randomness: \(\epsilon\) shocks

  # i.i.d. Gumbel shocks (MNL kernel)
  gumbel <- function(n) -log(-log(runif(n)))
  U1 <- V1 + gumbel(n)
  U2 <- V2 + gumbel(n)
  U3 <- V3 + gumbel(n)
  • in the RUM, we never observe everything that drives choices.
  • The Gumbel errors are the “unobserved factors” \(\epsilon\)

This gives us \[\begin{align} U_{njt}=V_{njt}+\epsilon_{njt} \end{align}\]

Role of sample size

Simulating data

Turning utilities into choices

  df$pref1 <- max.col(cbind(U1,U2,U3))  

So the rule is

the chosen alternative is the one with the highest utility

That is the behavioral assumption of random utility maximization.

Role of sample size

Simulating data

apollo_control <- list(
  modelName  ="Simulated Agora baseline",
  modelDescr ="MNL on simulated data",
  indivID    ="ID",
  outputDirectory = "output_sim",
  panelData = TRUE
)

apollo_beta <- c(
  asc_pub          = 0,
  b_pay_mode_pub   = 0,
  b_partner_pub    = 0,
  b_consult_pub    = 0,
  asc_priv         = 0,
  b_pay_mode_priv  = 0,
  b_partner_priv   = 0,
  b_consult_priv   = 0,
  b_comp           = 0
)

apollo_fixed <- c()
apollo_inputs <- apollo_validateInputs()
SimMNL <- apollo_estimate(apollo_beta, apollo_fixed, apollo_probabilities, apollo_inputs)

Role of sample size

Simulating data

apollo_modelOutput(SimMNL)
Model run by cschulze using Apollo 0.3.5 on R 4.5.0 for Windows.
Please acknowledge the use of Apollo by citing Hess & Palma (2019)
  DOI 10.1016/j.jocm.2019.100170
  www.ApolloChoiceModelling.com

Model name                                  : Simulated Agora baseline
Model description                           : MNL on simulated data
Model run at                                : 2026-03-05 16:52:44.170779
Estimation method                           : bgw
Model diagnosis                             : Relative function convergence
Optimisation diagnosis                      : Maximum found
     hessian properties                     : Negative definite
     maximum eigenvalue                     : -28.46654
     reciprocal of condition number         : 1.03905e-07
Number of individuals                       : 300
Number of rows in database                  : 3600
Number of modelled outcomes                 : 3600

Number of cores used                        :  1 
Model without mixing

LL(start)                                   : -3955
LL at equal shares, LL(0)                   : -3955
LL at observed shares, LL(C)                : -3418.98
LL(final)                                   : -3239.6
Rho-squared vs equal shares                  :  0.1809 
Adj.Rho-squared vs equal shares              :  0.1786 
Rho-squared vs observed shares               :  0.0525 
Adj.Rho-squared vs observed shares           :  0.0504 
AIC                                         :  6497.2 
BIC                                         :  6552.9 

Estimated parameters                        : 9
Time taken (hh:mm:ss)                       :  00:00:0.67 
     pre-estimation                         :  00:00:0.16 
     estimation                             :  00:00:0.17 
     post-estimation                        :  00:00:0.34 
Iterations                                  :  7  

Unconstrained optimisation.

Estimates:
                   Estimate        s.e.   t.rat.(0)    Rob.s.e. Rob.t.rat.(0)
asc_pub            0.368745     0.13848     2.66289     0.13420       2.74780
b_pay_mode_pub     0.527084     0.08098     6.50865     0.08276       6.36893
b_partner_pub      0.031601     0.08032     0.39342     0.08052       0.39244
b_consult_pub      0.500961     0.08083     6.19785     0.08285       6.04628
asc_priv           0.251496     0.14828     1.69605     0.15708       1.60102
b_pay_mode_priv    0.106741     0.09248     1.15425     0.09353       1.14127
b_partner_priv     0.003690     0.09220     0.04002     0.08982       0.04108
b_consult_priv     0.362350     0.09298     3.89720     0.09466       3.82801
b_comp            -0.003185  2.0330e-04   -15.66826  2.0696e-04     -15.39111

Role of sample size

Simulating data

est <- get_apollo_estimates(SimMNL)

comp <- data.frame(
  param = names(beta_true),
  true  = as.numeric(beta_true),
  est   = as.numeric(est[names(beta_true)])
)

# add diagnostics
comp$diff      <- comp$est - comp$true
comp$abs_diff  <- abs(comp$diff)
comp$rel_diff  <- comp$diff / comp$true

comp
            param   true          est          diff     abs_diff     rel_diff
1         asc_pub  0.300  0.368745049  0.0687450494 0.0687450494  0.229150165
2        asc_priv  0.050  0.251496272  0.2014962722 0.2014962722  4.029925445
3  b_pay_mode_pub  0.400  0.527083718  0.1270837183 0.1270837183  0.317709296
4   b_partner_pub  0.150  0.031601110 -0.1183988897 0.1183988897 -0.789325931
5   b_consult_pub  0.500  0.500961199  0.0009611994 0.0009611994  0.001922399
6 b_pay_mode_priv  0.200  0.106741015 -0.0932589849 0.0932589849 -0.466294925
7  b_partner_priv  0.100  0.003690012 -0.0963099884 0.0963099884 -0.963099884
8  b_consult_priv  0.350  0.362350353  0.0123503530 0.0123503530  0.035286723
9          b_comp -0.003 -0.003185373 -0.0001853733 0.0001853733  0.061791103

Role of sample size

Comparison: true vs. simulation

Role of sample size

Simulating data

What happens if we change size of N?

Re-run code with N=100, N=500 and N=2000!

Role of sample size

Comparison: true vs. simulation

Role of sample size

Comparison: true vs. simulation

References

Daly, Andrew, Stephane Hess, and Gerard De Jong. 2012. “Calculating Errors for Measures Derived from Choice Modelling Estimates.” Transportation Research Part B: Methodological 46 (2): 333–41.
Hess, Stephane, and David Palma. 2019. “Apollo: A Flexible, Powerful and Customisable Freeware Package for Choice Model Estimation and Application.” Journal of Choice Modelling 32: 100170.
Schulze, Christoph, Klaus Glenk, Julian Sagebiel, and Bettina Matzdorf. 2026. “Private or Public? Farmer Preferences and Identities in Agri-Environmental Contract Implementation.” Journal of Agricultural Economics 77 (1): 107–29.