Sandbox for exploring models for validating digital log data with ESM

Background

Starting point/premise:

Over the course of the semester, we use ESM to ping students every time (or nearly every time) they view the calendar or view the syllabus or view assignment instruction. What we find is that 95% of the time whenever students “view the calendar” they report via ESM that they are “planning”, whereas 90% of the time whenever they “view the syllabus” they report via ESM that they are “planning”, and only 50% of the time when they “view assignment instructions” they are “planning”. It turns out that the “view assignment instructions” ESM data systematically varies: if a student does this prior to the assignment due date, they report “planning” but if they do this on the day the assignment is due, they report “doing assignment”. To me, the ESM data are our “ground truth” – I think in most cases we would “believe” what a student reports via ESM unless their report is uninterpretable (e.g., “I was chillin’”) or really implausible (e.g., they viewed the syllabus but reported they were “submitting a quiz response”). I would interpret this to mean that “view the calendar” and “view the syllabus” can be inferred as evidence of “planning” regardless of when they enact that behavior whereas the “view assignment instructions” digital trace can be interpreted with high validity when we take into account timing (i.e., either “planning” or “doing assignment”). ## Loading, setting up

library(tidyverse)
library(palmerpenguins)
library(margins)
library(prediction)
library(corrr)
library(sjPlot)
library(lme4)
library(brms)

Let’s make this a relevant-appearing situation. While we’re using data on penguins, let’s imagine that we’re modeling the presence of a planning reported via ESM as a function of a certain behavior measured through log data. For now, let us hold aside whether ESM is really ground truth (and/or how to account for measurement error in those responses. To simplify further, let’s consider any independent variables to be standardized with M = 0, SD = 1. For now, we’ll consider planning to be a binary variable (1 = planning, 0 = not planning)

penguins <- penguins %>%
    mutate(large_flipper = if_else(flipper_length_mm < mean(flipper_length_mm, na.rm = TRUE), 0, 1))

penguins <- penguins %>% 
    mutate(body_mass_g = as.numeric(scale(body_mass_g))) %>% 
    rename(planning_esm = large_flipper,
           planning_behavior = body_mass_g)

Examining bivariate Pearson correlations

penguins %>% 
    select(planning_esm, planning_behavior) %>% 
    correlate() %>% 
    knitr::kable()

term	planning_esm	planning_behavior
planning_esm	NA	0.7781774
planning_behavior	0.7781774	NA

Modeling

Let’s predict planning_esm on the basis of planning_behavior.

model_logit <- glm(planning_esm ~ planning_behavior,
                   data = penguins, 
                   family = binomial(link = "logit"))

it’s a simple model:

equatiomatic::extract_eq(model_logit)

\[ \log\left[ \frac { P( \operatorname{planning\_esm} = \operatorname{0} ) }{ 1 - P( \operatorname{planning\_esm} = \operatorname{0} ) } \right] = \alpha + \beta_{1}(\operatorname{planning\_behavior}) \]

these estimates are in log-odds units; they can be exponentiated to obtain an odds ratio,

summary(model_logit)

## 
## Call:
## glm(formula = planning_esm ~ planning_behavior, family = binomial(link = "logit"), 
##     data = penguins)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1978  -0.3788  -0.1192   0.2624   2.8251  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -0.3022     0.1909  -1.583    0.113    
## planning_behavior   3.6707     0.3974   9.237   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 467.91  on 341  degrees of freedom
## Residual deviance: 180.96  on 340  degrees of freedom
##   (2 observations deleted due to missingness)
## AIC: 184.96
## 
## Number of Fisher Scoring iterations: 6

https://cran.r-project.org/web/packages/margins/vignettes/TechnicalDetails.pdf

For a logistic regression model, for example, we may want to interpret marginal effects (sometimes “partial effects”) on the scale of the observed outcome, so that we can understand the marginal effects as changes in the predicted probability of the outcome. By default, margins sets type = “response”. This can, however, be modified. For GLMs, this could be set to type = “link” in order to calculate true marginal effects on the scale of the linear predictor.

model_logit_margins <- margins(model_logit)
summary(model_logit_margins)

##             factor    AME     SE       z      p  lower  upper
##  planning_behavior 0.2946 0.0062 47.2972 0.0000 0.2824 0.3068

Obtaining predictions on the response/probability scale

Using the estimates, let’s obtain predictions for all of the cases. The predicted values are on a probabiity scale (0-1). Let’s just look at 30 predictions.

Generally, to the extent that this is a predictive model, when fitted is higher, planning_esm should be higher, too,

model_logit_predictions <- prediction::prediction(model_logit, penguins)

as_tibble(model_logit_predictions) %>%
    select(planning_esm, fitted, se.fitted, planning_behavior, ) %>% 
    arrange(fitted) %>% 
    sample_n(30) %>% 
    knitr::kable()

planning_esm	fitted	se.fitted	planning_behavior
0	0.0402744	0.0139911	-0.7815336
0	0.0769623	0.0214096	-0.5944905
0	0.0074848	0.0039299	-1.2491411
0	0.2695759	0.0401826	-0.1892307
1	0.9195391	0.0262747	0.7459845
0	0.0131873	0.0061394	-1.0932719
0	0.1051689	0.0257511	-0.5009690
0	0.1893112	0.0347451	-0.3139260
0	0.0323021	0.0119798	-0.8438812
1	0.9415518	0.0212964	0.8395060
0	0.0206838	0.0086479	-0.9685766
1	0.8206367	0.0407325	0.4965938
1	0.9475400	0.0197741	0.8706798
1	0.9999088	0.0000967	2.6164147
1	0.9475400	0.0197741	0.8706798
1	0.9929911	0.0041968	1.4318089
1	0.9349270	0.0228908	0.8083321
0	0.6468527	0.0493587	0.2472031
0	0.1051689	0.0257511	-0.5009690
1	0.9977573	0.0016147	1.7435472
0	0.0231333	0.0093982	-0.9374027
1	0.8785088	0.0335638	0.6212891
1	0.9912045	0.0050527	1.3694612
1	0.8206367	0.0407325	0.4965938
1	0.8785088	0.0335638	0.6212891
0	0.0692167	0.0200405	-0.6256644
1	0.8518929	0.0372412	0.5589414
1	0.9944168	0.0034787	1.4941565
0	0.0692167	0.0200405	-0.6256644
0	0.8785088	0.0335638	0.6212891

So what?

The predictions on a probability scale are valuable. For instance, this code shows us that students are reporting planning around 43% (+- 4%) of the time.

my_subset_predictions <- model_logit_predictions %>% 
    select(planning_esm, planning_behavior, sex, fitted, se.fitted) %>% 
    as_tibble()

# overall predicted probability and SE (in probability units)
my_subset_predictions %>% 
    summarize(mean_fitted = mean(fitted, na.rm = TRUE),
              mean_se.fitted = mean(se.fitted, na.rm = TRUE))

## # A tibble: 1 x 2
##   mean_fitted mean_se.fitted
##         <dbl>          <dbl>
## 1       0.433         0.0207

We can also examine differences in (in this case) sex:

# sex predicted probability and SE (in probability units)
my_subset_predictions %>% 
    filter(!is.na(sex)) %>% 
    group_by(sex) %>% 
    summarize(mean_fitted = mean(fitted, na.rm = TRUE),
              mean_se.fitted = mean(se.fitted, na.rm = TRUE))

## # A tibble: 2 x 3
##   sex    mean_fitted mean_se.fitted
##   <fct>        <dbl>          <dbl>
## 1 female       0.310         0.0186
## 2 male         0.555         0.0227

Females were planning around 31% of the time (+- ~4%) and males around 56% of the time (+- 4.5%).

Extending the models to account for the multi-level structure

Let’s predict planning_esm on the basis of planning_behavior. We can add time_point (year in the original data) and student (species in the original data), estimating a model that considers responses to be nested within both time points and classes. We see that much of the variation is within students - around 74%. 8% is within time point.

penguins <- penguins %>% 
    rename(time_point = year,
           student = species)

library(lme4)

model_logit_nested <- glmer(planning_esm ~ planning_behavior + (1|time_point) + (1|student),
                            data = penguins, 
                            family = binomial(link = "logit"))

summary(model_logit_nested)

## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: binomial  ( logit )
## Formula: planning_esm ~ planning_behavior + (1 | time_point) + (1 | student)
##    Data: penguins
## 
##      AIC      BIC   logLik deviance df.resid 
##    115.4    130.7    -53.7    107.4      338 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -1.4182 -0.1630 -0.0311  0.0149 12.0711 
## 
## Random effects:
##  Groups     Name        Variance Std.Dev.
##  time_point (Intercept)  1.486   1.219   
##  student    (Intercept) 13.881   3.726   
## Number of obs: 342, groups:  time_point, 3; student, 3
## 
## Fixed effects:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       1.256103   0.002258   556.4   <2e-16 ***
## planning_behavior 3.297524   0.002258  1460.3   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr)
## plnnng_bhvr 0.000 
## optimizer (Nelder_Mead) convergence code: 0 (OK)
## Model failed to converge with max|grad| = 0.0313688 (tol = 0.002, component 1)

performance::icc(model_logit_nested, by_group = TRUE)

## # ICC by Group
## 
## Group      |   ICC
## ------------------
## time_point | 0.080
## student    | 0.744

equatiomatic::extract_eq(model_logit_nested)

\[ \begin{aligned} \operatorname{planning\_esm}_{i} &\sim \operatorname{Binomial}(n = 1, \operatorname{prob}_{\operatorname{planning\_esm} = 1} = \widehat{P}) \\ \log\left[\frac{\hat{P}}{1 - \hat{P}} \right] &=\alpha_{j[i],k[i]} + \beta_{1}(\operatorname{planning\_behavior}) \\ \alpha_{j} &\sim N \left(\mu_{\alpha_{j}}, \sigma^2_{\alpha_{j}} \right) \text{, for time_point j = 1,} \dots \text{,J} \\ \alpha_{k} &\sim N \left(\mu_{\alpha_{k}}, \sigma^2_{\alpha_{k}} \right) \text{, for student k = 1,} \dots \text{,K} \end{aligned} \]

So what (take 2)?

We can again examine differences in (in this case) sex:

model_logit_nested_predictions <- prediction::prediction(model_logit_nested, penguins)

my_subset_nested_predictions <- model_logit_nested_predictions %>% 
    select(planning_esm, planning_behavior, sex, fitted, se.fitted) %>% 
    as_tibble()

# sex predicted probability and SE (in probability units)
my_subset_nested_predictions %>% 
    filter(!is.na(sex)) %>% 
    group_by(sex) %>% 
    summarize(mean_fitted = mean(fitted, na.rm = TRUE),
              mean_se.fitted = mean(se.fitted, na.rm = TRUE))

## # A tibble: 2 x 3
##   sex    mean_fitted mean_se.fitted
##   <fct>        <dbl>          <dbl>
## 1 female       0.381            NaN
## 2 male         0.483            NaN

It looks like we may have to fit a model with MCMC to propogate uncertainty with our multi-level model. There seems to be much more uncertainty; let’s explore that further next.

model_logit_nested_brms <- brm(planning_esm ~ planning_behavior + (1|time_point) + (1|student),
                               data = penguins, 
                               family = bernoulli(),
                               cores = 4,
                               chains = 4,
                               iter = 1000)

## Running /Library/Frameworks/R.framework/Resources/bin/R CMD SHLIB foo.c
## clang -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG   -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/Rcpp/include/"  -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/"  -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/unsupported"  -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/BH/include" -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/StanHeaders/include/src/"  -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/StanHeaders/include/"  -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppParallel/include/"  -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/rstan/include" -DEIGEN_NO_DEBUG  -DBOOST_DISABLE_ASSERTS  -DBOOST_PENDING_INTEGER_LOG2_HPP  -DSTAN_THREADS  -DBOOST_NO_AUTO_PTR  -include '/Library/Frameworks/R.framework/Versions/4.0/Resources/library/StanHeaders/include/stan/math/prim/mat/fun/Eigen.hpp'  -D_REENTRANT -DRCPP_PARALLEL_USE_TBB=1   -I/usr/local/include   -fPIC  -Wall -g -O2  -c foo.c -o foo.o
## In file included from <built-in>:1:
## In file included from /Library/Frameworks/R.framework/Versions/4.0/Resources/library/StanHeaders/include/stan/math/prim/mat/fun/Eigen.hpp:13:
## In file included from /Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/Eigen/Dense:1:
## In file included from /Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/Eigen/Core:88:
## /Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/Eigen/src/Core/util/Macros.h:628:1: error: unknown type name 'namespace'
## namespace Eigen {
## ^
## /Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/Eigen/src/Core/util/Macros.h:628:16: error: expected ';' after top level declarator
## namespace Eigen {
##                ^
##                ;
## In file included from <built-in>:1:
## In file included from /Library/Frameworks/R.framework/Versions/4.0/Resources/library/StanHeaders/include/stan/math/prim/mat/fun/Eigen.hpp:13:
## In file included from /Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/Eigen/Dense:1:
## /Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/Eigen/Core:96:10: fatal error: 'complex' file not found
## #include <complex>
##          ^~~~~~~~~
## 3 errors generated.
## make: *** [foo.o] Error 1

model_logit_nested_brms

##  Family: bernoulli 
##   Links: mu = logit 
## Formula: planning_esm ~ planning_behavior + (1 | time_point) + (1 | student) 
##    Data: penguins (Number of observations: 342) 
## Samples: 4 chains, each with iter = 1000; warmup = 500; thin = 1;
##          total post-warmup samples = 2000
## 
## Group-Level Effects: 
## ~student (Number of levels: 3) 
##               Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sd(Intercept)     4.25      2.04     1.73     9.96 1.00     1150     1006
## 
## ~time_point (Number of levels: 3) 
##               Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sd(Intercept)     1.85      1.18     0.47     5.07 1.00      666      582
## 
## Population-Level Effects: 
##                   Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept             0.62      1.87    -3.05     4.51 1.00     1101      890
## planning_behavior     3.39      0.70     2.05     4.79 1.00     1932     1345
## 
## Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

model_logit_nested_predictions_brms <- predict(model_logit_nested_brms, penguins)

model_logit_nested_predictions_brms <- bind_cols(as_tibble(model_logit_nested_predictions_brms), penguins)

my_subset_nested_predictions_brms <- model_logit_nested_predictions_brms %>% 
    select(planning_esm, planning_behavior, sex, Estimate, Est.Error) %>% 
    as_tibble()

# sex predicted probability and SE (in probability units) 
my_subset_nested_predictions_brms %>% 
    filter(!is.na(sex)) %>% 
    group_by(sex) %>% 
    summarize(mean_fitted = mean(Estimate, na.rm = TRUE),
              mean_se.fitted = mean(Est.Error, na.rm = TRUE))

## # A tibble: 2 x 3
##   sex    mean_fitted mean_se.fitted
##   <fct>        <dbl>          <dbl>
## 1 female       0.381          0.111
## 2 male         0.483          0.175

Why so much more uncertainty? Maybe it’s the multi-level structure, somehow? Let’s fit a simpler model. It’s even more uncertain (which makes sense since we removed degrees of freedom when we added the multi-level structure). Not sure why.

model_logit_nested_brms_1 <- brm(planning_esm ~ planning_behavior,
                               data = penguins, 
                               family = bernoulli(),
                               cores = 4,
                               chains = 4,
                               iter = 1000)

## Running /Library/Frameworks/R.framework/Resources/bin/R CMD SHLIB foo.c
## clang -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG   -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/Rcpp/include/"  -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/"  -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/unsupported"  -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/BH/include" -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/StanHeaders/include/src/"  -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/StanHeaders/include/"  -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppParallel/include/"  -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/rstan/include" -DEIGEN_NO_DEBUG  -DBOOST_DISABLE_ASSERTS  -DBOOST_PENDING_INTEGER_LOG2_HPP  -DSTAN_THREADS  -DBOOST_NO_AUTO_PTR  -include '/Library/Frameworks/R.framework/Versions/4.0/Resources/library/StanHeaders/include/stan/math/prim/mat/fun/Eigen.hpp'  -D_REENTRANT -DRCPP_PARALLEL_USE_TBB=1   -I/usr/local/include   -fPIC  -Wall -g -O2  -c foo.c -o foo.o
## In file included from <built-in>:1:
## In file included from /Library/Frameworks/R.framework/Versions/4.0/Resources/library/StanHeaders/include/stan/math/prim/mat/fun/Eigen.hpp:13:
## In file included from /Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/Eigen/Dense:1:
## In file included from /Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/Eigen/Core:88:
## /Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/Eigen/src/Core/util/Macros.h:628:1: error: unknown type name 'namespace'
## namespace Eigen {
## ^
## /Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/Eigen/src/Core/util/Macros.h:628:16: error: expected ';' after top level declarator
## namespace Eigen {
##                ^
##                ;
## In file included from <built-in>:1:
## In file included from /Library/Frameworks/R.framework/Versions/4.0/Resources/library/StanHeaders/include/stan/math/prim/mat/fun/Eigen.hpp:13:
## In file included from /Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/Eigen/Dense:1:
## /Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/Eigen/Core:96:10: fatal error: 'complex' file not found
## #include <complex>
##          ^~~~~~~~~
## 3 errors generated.
## make: *** [foo.o] Error 1

model_logit_nested_brms_1

##  Family: bernoulli 
##   Links: mu = logit 
## Formula: planning_esm ~ planning_behavior 
##    Data: penguins (Number of observations: 342) 
## Samples: 4 chains, each with iter = 1000; warmup = 500; thin = 1;
##          total post-warmup samples = 2000
## 
## Population-Level Effects: 
##                   Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept            -0.30      0.18    -0.68     0.06 1.00     1656     1369
## planning_behavior     3.75      0.39     3.05     4.56 1.00     1287     1179
## 
## Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

model_logit_nested_predictions_brms_1 <- predict(model_logit_nested_brms_1, penguins)

model_logit_nested_predictions_brms_1 <- bind_cols(as_tibble(model_logit_nested_predictions_brms_1), penguins)

my_subset_nested_predictions_brms_1 <- model_logit_nested_predictions_brms_1 %>% 
    select(planning_esm, planning_behavior, sex, Estimate, Est.Error) %>% 
    as_tibble()

# sex predicted probability and SE (in probability units) 
my_subset_nested_predictions_brms_1 %>% 
    filter(!is.na(sex)) %>% 
    group_by(sex) %>% 
    summarize(mean_fitted = mean(Estimate, na.rm = TRUE),
              mean_se.fitted = mean(Est.Error, na.rm = TRUE))

## # A tibble: 2 x 3
##   sex    mean_fitted mean_se.fitted
##   <fct>        <dbl>          <dbl>
## 1 female       0.310          0.220
## 2 male         0.554          0.258

Adding a prior

Without a prior, the log-odds coefficient for the relationship between planning_behavior and planning_esm was 3.74 (SE = 0.40). Let’s say we think a priori that it’s very high. I’m getting a bit lost in the marginal effects on the response scale and calculating the probability from the log-odds and odds, so this will just be a very very rough start.

exp(3.74) # odds

## [1] 42.09799

exp(3.74)/(exp(3.74)+1) # probability - this seems too high, especially given the marginal effects

## [1] 0.9767971

Let’s say our prior is a bit lower:

x_to_p <- function(x) {
    exp(x)/(exp(x)+1)
}

seq(0, 10, by = .10) %>% 
    map_dbl(x_to_p) # oh, it's 0

##   [1] 0.5000000 0.5249792 0.5498340 0.5744425 0.5986877 0.6224593 0.6456563
##   [8] 0.6681878 0.6899745 0.7109495 0.7310586 0.7502601 0.7685248 0.7858350
##  [15] 0.8021839 0.8175745 0.8320184 0.8455347 0.8581489 0.8698915 0.8807971
##  [22] 0.8909032 0.9002495 0.9088770 0.9168273 0.9241418 0.9308616 0.9370266
##  [29] 0.9426758 0.9478464 0.9525741 0.9568927 0.9608343 0.9644288 0.9677045
##  [36] 0.9706878 0.9734030 0.9758730 0.9781187 0.9801597 0.9820138 0.9836975
##  [43] 0.9852260 0.9866131 0.9878716 0.9890131 0.9900482 0.9909867 0.9918374
##  [50] 0.9926085 0.9933071 0.9939402 0.9945137 0.9950332 0.9955037 0.9959299
##  [57] 0.9963158 0.9966652 0.9969816 0.9972680 0.9975274 0.9977622 0.9979747
##  [64] 0.9981671 0.9983412 0.9984988 0.9986415 0.9987706 0.9988875 0.9989932
##  [71] 0.9990889 0.9991756 0.9992540 0.9993249 0.9993891 0.9994472 0.9994998
##  [78] 0.9995474 0.9995904 0.9996294 0.9996646 0.9996966 0.9997254 0.9997515
##  [85] 0.9997752 0.9997966 0.9998159 0.9998334 0.9998493 0.9998636 0.9998766
##  [92] 0.9998883 0.9998990 0.9999086 0.9999173 0.9999252 0.9999323 0.9999387
##  [99] 0.9999446 0.9999498 0.9999546

We can specify a prior of 0 with a SD of, let’s say, .50. This should lead to a smaller estimated relationship between planning_behavior and planning_esm.

model_logit_nested_brms_2 <- brm(planning_esm ~ planning_behavior,
                               prior = c(prior(normal(0, .50), 
                                               coef = planning_behavior)),
                               data = penguins, 
                               family = bernoulli(),
                               cores = 4,
                               chains = 4,
                               iter = 1000)

## Running /Library/Frameworks/R.framework/Resources/bin/R CMD SHLIB foo.c
## clang -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG   -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/Rcpp/include/"  -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/"  -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/unsupported"  -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/BH/include" -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/StanHeaders/include/src/"  -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/StanHeaders/include/"  -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppParallel/include/"  -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/rstan/include" -DEIGEN_NO_DEBUG  -DBOOST_DISABLE_ASSERTS  -DBOOST_PENDING_INTEGER_LOG2_HPP  -DSTAN_THREADS  -DBOOST_NO_AUTO_PTR  -include '/Library/Frameworks/R.framework/Versions/4.0/Resources/library/StanHeaders/include/stan/math/prim/mat/fun/Eigen.hpp'  -D_REENTRANT -DRCPP_PARALLEL_USE_TBB=1   -I/usr/local/include   -fPIC  -Wall -g -O2  -c foo.c -o foo.o
## In file included from <built-in>:1:
## In file included from /Library/Frameworks/R.framework/Versions/4.0/Resources/library/StanHeaders/include/stan/math/prim/mat/fun/Eigen.hpp:13:
## In file included from /Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/Eigen/Dense:1:
## In file included from /Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/Eigen/Core:88:
## /Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/Eigen/src/Core/util/Macros.h:628:1: error: unknown type name 'namespace'
## namespace Eigen {
## ^
## /Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/Eigen/src/Core/util/Macros.h:628:16: error: expected ';' after top level declarator
## namespace Eigen {
##                ^
##                ;
## In file included from <built-in>:1:
## In file included from /Library/Frameworks/R.framework/Versions/4.0/Resources/library/StanHeaders/include/stan/math/prim/mat/fun/Eigen.hpp:13:
## In file included from /Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/Eigen/Dense:1:
## /Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/Eigen/Core:96:10: fatal error: 'complex' file not found
## #include <complex>
##          ^~~~~~~~~
## 3 errors generated.
## make: *** [foo.o] Error 1

model_logit_nested_brms_2

##  Family: bernoulli 
##   Links: mu = logit 
## Formula: planning_esm ~ planning_behavior 
##    Data: penguins (Number of observations: 342) 
## Samples: 4 chains, each with iter = 1000; warmup = 500; thin = 1;
##          total post-warmup samples = 2000
## 
## Population-Level Effects: 
##                   Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept            -0.31      0.16    -0.62     0.00 1.00     1875     1446
## planning_behavior     2.62      0.23     2.18     3.11 1.00     1767     1237
## 
## Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

model_logit_nested_predictions_brms_2 <- predict(model_logit_nested_brms_2, penguins)

model_logit_nested_predictions_brms_2 <- bind_cols(as_tibble(model_logit_nested_predictions_brms_2), penguins)

my_subset_nested_predictions_brms_2 <- model_logit_nested_predictions_brms_2 %>% 
    select(planning_esm, planning_behavior, sex, Estimate, Est.Error) %>% 
    as_tibble()

# sex predicted probability and SE (in probability units) 
my_subset_nested_predictions_brms_2 %>% 
    filter(!is.na(sex)) %>% 
    group_by(sex) %>% 
    summarize(mean_fitted = mean(Estimate, na.rm = TRUE),
              mean_se.fitted = mean(Est.Error, na.rm = TRUE))

## # A tibble: 2 x 3
##   sex    mean_fitted mean_se.fitted
##   <fct>        <dbl>          <dbl>
## 1 female       0.310          0.296
## 2 male         0.557          0.318

Key questions

Is a logistic/linear regression fundamentally the right way to think about this? And should the ESM outcomes be the dependent variables?
IF ESM is ground truth, how do we account for measurement error / uncertainty in those reports? Is there a way to model this as latent?
Why do confidence intervals/why does uncertainty seem to be so much higher when we use MCMC/a Bayesian estimation approach? Without any priors specified, the estimates should be comparable/nearly identical. I wonder if I am “processing” the predictions from the Bayesian model incorrectly as the function (prediction from the prediction package) I normally use doesn’t appear to work with the Bayesian estimation package (brms/Stan).
What do we specify priors for – the relationships? The relationships + the intercept?
How do we specify a prior for a binomial outcome?