A total of 2694 respondents completed the survey as far as the goal progress questions. The next plot shows the pattern of missing data. Some of these missing data are entirely expected (such as the missing data among the GP variables ending in b). Others are missing because respondents did not get that far in the survey, or because they did not want to tell us their age, for instance. Some of the results (describing the goals endorsed by respondents and progress towards them) will include only the GP data (2694 observations), with missing patterns shown in the next figure.
However, the statistical analysis will be conducted with those who also have the situational optimism data (2336 observations), shown below.
Finally, you can see the absence of missing data (apart from where it is expected) following multiple imputation, below.
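As a reminder of how these imputed data were generated, here is a minimal sketch of the kind of mice call involved (survey_df is a hypothetical name for the pre-imputation data frame; the actual call set the predictor matrix and imputation methods earlier in the analysis).
# A sketch only: assumes the pre-imputation data are in survey_df (hypothetical name)
library(mice)
# Generate ten imputed datasets
imp <- mice(survey_df, m = 10, seed = 12345)
# Store the completed datasets as a list, matching the mice.imp.GP object used below
mice.imp.GP <- lapply(1:10, function(i) mice::complete(imp, action = i))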
The goal progress statements were based on the Value-Belief-Norm theory, which suggests that pro-environmental behavioural intentions are motivated by egoistic, altruistic, and biospheric values. In the figure below, the left panel shows the number of people who consider each goal to be important, grouped into egoistic, altruistic, and biospheric categories.
The right panel shows how satisfied or dissatisfied respondents are with the progress being made towards each goal that has been endorsed.
We can also explore the correlations between continuous variables.
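A sketch of one way to do this, using the continuous variables that appear in the VIF output below (the use of pairwise-complete Pearson correlations on the first imputed dataset is an assumption):
# The continuous variables of interest
cont_vars <- c("SO_total", "LOTR_total", "age_year_scaled", "years_cons_scaled", "WH_scaled")
# Pairwise-complete Pearson correlations
round(cor(mice.imp.GP[[1]][, cont_vars], use = "pairwise.complete.obs"), 2)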
We can also examine the variance inflation factors (VIF), using the mean GP as the response. All values are below 3, which is somewhat surprising given the high correlation between age and years in conservation.
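A sketch of the sort of call that would produce the output below (GP_mean is a hypothetical name for the mean goal-progress response; the use of car::vif() on a simple linear model is an assumption):
# Assumes a mean goal-progress variable, here called GP_mean (hypothetical name)
library(car)
vif_mod <- lm(GP_mean ~ SO_total + LOTR_total + age_year_scaled + years_cons_scaled + WH_scaled, data = mice.imp.GP[[1]])
vif(vif_mod)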
## SO_total LOTR_total age_year_scaled years_cons_scaled
## 1.068449 1.047971 2.932077 2.920725
## WH_scaled
## 1.011587
Next, we explore how the prioritisation of goals varies between groups and places. Three models correspond to the three goal types. For each goal type, the observations for each goal are appended to create a single longer data frame. To account for the dependence between observations from the same individual, the respondent's ID is used as a random effect. To account for differences in the propensity for each goal to be endorsed, the goal name is included as an explanatory variable. In other words, we create a variable called "goal endorsement", which includes duplicate observations for each of the three goals, plus an identifier for the respondent (each of whom is represented by three observations), plus an identifier for each of the three goals. We will run this model as a Bayesian logistic regression. The main explanatory variables are:
First, we want to be able to use plausible values derived from the SEM describing respondents' dispositional optimism. It does not currently appear possible to extract plausible values for latent variables estimated from ordinal data in lavaan. As such, we want to compare the factor scores when treating the LOTR items as ordinal versus treating them as continuous, as shown below.
# Create a dummy dataset
test.df <- mice.imp.GP[[1]]
# The structure of the DO model
model_OP_method <- '
###### Dispositional optimism
# Dispositional optimism
OP =~ LOTR_1 + LOTR_2 + LOTR_3 + LOTR_4 + LOTR_5 + LOTR_6
# The method effect
method =~ LOTR_1 + LOTR_3 + LOTR_6
# OP and the method effect are orthogonal
OP ~~ 0*method
'
# Run the SEM, treating these data as ordinal.
fit_OP_meth <- lavaan::sem(model=model_OP_method, estimator = "WLSMVS", data= test.df , ordered = c("LOTR_1" , "LOTR_2" , "LOTR_3" , "LOTR_4", "LOTR_5", "LOTR_6") )
factors_1 <- data.frame(lavaan::lavPredict(fit_OP_meth))
factors_1$factor <- rep("ord", nrow(factors_1))
# Now let's convert the ordinal variables to numeric and run the same analysis.
test.df[,LOTR_row_name] <- apply(test.df[,LOTR_row_name], 2, as.numeric)
fit_OP_meth_num <- lavaan::sem(model=model_OP_method, estimator = "WLSMVS", data=test.df )
factors_2 <- data.frame(lavaan::lavPredict(fit_OP_meth_num))
factors_2$factor <- rep("num", nrow(factors_2))
# Combine the results
factors <- rbind(factors_1, factors_2)
# And look at the distribution of factor scores
ggdensity(factors, x = "OP",
add = "mean", rug = TRUE,
color = "factor", palette = c("#00AFBB", "#E7B800"))
When comparing these results, we can see there is remarkably good overlap when treating the data as ordinal or numeric. It therefore seems reasonable to treat these data as numeric and attempt to extract plausible values. The figure below shows the distribution of the n sets of plausible values, which, unsurprisingly, overlap considerably.
# Extract the plausible values (via semTools::plausibleValues)
pv_DO <- plausibleValues(fit_OP_meth_num, nDraws = 20, seed = 12345)
# Create DF
pv_DO_DF <- list()
for (i in seq_along(pv_DO)){
pv_DO_DF[[i]] <- data.frame(pv_DO[[i]])
pv_DO_DF[[i]]$ID <- as.factor(i)
}
# Create a single DF
pv_DO_com <- do.call(rbind, pv_DO_DF)
# And look at the distribution of plausible values
qual_col_pals <- brewer.pal.info[brewer.pal.info$category == 'qual',]
col_vector <- unlist(mapply(brewer.pal, qual_col_pals$maxcolors, rownames(qual_col_pals)))
ggdensity(pv_DO_com, x = "OP", color = "ID", palette = col_vector[1:20])
And now we can extract these plausible values across the ten imputed datasets.
# Convert to numeric, across all imputed DF
for (i in seq_along(mice.imp.GP)){
mice.imp.GP[[i]][,LOTR_row_name] <- apply(mice.imp.GP[[i]][,LOTR_row_name], 2, as.numeric)
}
# Run the SEM on all imputed DF
model_GP_1_MI <- runMI(model_OP_method, data = mice.imp.GP, fun="sem", estimator = "WLSMVS", FUN = fitMeasures)
# Extract the plausible values from each imputed DF - we're doing just one draw from each of the ten imputed DF given the computational time
pv_DO <- plausibleValues(model_GP_1_MI, nDraws = 1, seed = 12345)
# Match each PV to each DF - technically, it should be ten PV for each imputed dataset (totalling 100), but this is likely massive overkill given we have complete cases for the LOTR questions, and would make the resulting computation prohibitive
mice.imp.GP.sim <- mice.imp.GP
for (i in seq_along(mice.imp.GP)){
mice.imp.GP.sim[[i]]$DO <- pv_DO[[i]]$OP
}
# Save this DF
save(mice.imp.GP.sim, file ="mice.imp.GP.sim.RData")
Now, let's see how long it takes to run a simple Bayesian logistic regression with the MI dataset (using the default prior and other arguments).
# Endorsement of goal 1, depending on years working in conservation
b.gp1.years <- brm(GP_1_a ~ years_cons_scaled, family = 'bernoulli', data = mice.imp.GP.sim)
## Compiling Stan program...
## Start sampling
summary(b.gp1.years)
## Family: bernoulli
## Links: mu = logit
## Formula: GP_1_a ~ years_cons_scaled
## Data: mice.imp.GP.sim (Number of observations: 2336)
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
## total post-warmup samples = 4000
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept 2.97 0.10 2.78 3.17 1.00 2245 2461
## years_cons_scaled -0.54 0.07 -0.68 -0.41 1.00 1984 2349
##
## Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
OK, now we're going to group complete cases for each goal by type: personal (egoistic), people (altruistic), and nature (biospheric) goals.
# A function that returns complete cases for each goal (they should be, and are, all complete cases) and the name of that goal
complete.goals.a <- function(DF, goal){
DF_goal <- DF[complete.cases(DF[,goal]),]
DF_goal$goal_end <- DF_goal[,goal]
DF_goal$goal_num <- as.factor(goal)
return(DF_goal)
}
# Lists of the stacked DF (stacked by complete case of each goal) for each MI dataset
mice.imp_personal.a <- list_along(mice.imp.GP.sim)
mice.imp_people.a <- list_along(mice.imp.GP.sim)
mice.imp_nature.a <- list_along(mice.imp.GP.sim)
for (i in seq_along(mice.imp.GP.sim)){
# Personal
mice.imp_personal.a[[i]] <- rbind(complete.goals.a(mice.imp.GP.sim[[i]], "GP_2_a"), complete.goals.a(mice.imp.GP.sim[[i]], "GP_3_a"), complete.goals.a(mice.imp.GP.sim[[i]], "GP_4_a"))
# People
mice.imp_people.a[[i]] <- rbind(complete.goals.a(mice.imp.GP.sim[[i]], "GP_6_a"), complete.goals.a(mice.imp.GP.sim[[i]], "GP_7_a"), complete.goals.a(mice.imp.GP.sim[[i]], "GP_8_a"))
# Nature
mice.imp_nature.a[[i]] <- rbind(complete.goals.a(mice.imp.GP.sim[[i]], "GP_5_a"), complete.goals.a(mice.imp.GP.sim[[i]], "GP_9_a"), complete.goals.a(mice.imp.GP.sim[[i]], "GP_10_a"))
}
Now we can specify a simple mixed-effects model formula, where goal endorsement is the response, the goal itself is one explanatory variable, years in conservation is another, and we account for the dependence between observations from the same respondent by including ID as a random effect.
# The model formula
brm_form.1 <- formula("goal_end ~ goal_num + years_cons_scaled + (1|ID)")
# Get priors
get_prior(brm_form.1, data = mice.imp_personal.a[[1]], family ='bernoulli' )
# Set priors
prior_all <- c(
set_prior(prior = "normal(0,100)", class = "Intercept"),
set_prior(prior = "normal(0,100)", class = "b"),
set_prior(prior = "student_t(3, 0, 2.5)", class = "sd"))
Let's run a simple logit model, with 15000 iterations (7500 of them warmup) and 4 chains (all of these are lower than what we'll actually use).
burn_in <- 7500
iterations <- 15000
# Personal
start_time_personal.a <- Sys.time()
personal_model.a <- brm(brm_form.1, family = 'bernoulli', data = mice.imp_personal.a[[1]], warmup = burn_in, iter = iterations, chains = 4, seed = 123, prior = prior_all)
end_time_personal.a <- Sys.time()
end_time_personal.a - start_time_personal.a
# Summary
summary(personal_model.a)
We can see this approach works, but it also takes a long time to run. For this reason, we might otherwise prefer to do the model construction in a frequentist framework. However, GLMMs are notoriously prone to convergence issues, and even some of the relatively simple models we're running encounter them. For this reason, we're going to use a Bayesian approach, whose diagnostics can be more informative when trying to understand convergence issues. However, given my local memory limits, I've done this through an AWS EC2 instance, saving the models (which exceed 12 GB) directly to an S3 bucket.
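For reference, a minimal sketch of saving a fitted model straight to S3 (assuming the aws.s3 package, credentials already configured on the instance, and a hypothetical bucket name):
# A sketch only: requires AWS credentials configured on the EC2 instance
library(aws.s3)
# Save a fitted brms model directly to an S3 bucket (bucket name is hypothetical)
s3saveRDS(personal_model.a, object = "personal_model_a.rds", bucket = "my-model-bucket")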
The following checks were made when implementing the Bayesian logistic regressions using the first imputed dataset. This follows the When-to-Worry-and-How-to-Avoid-the-Misuse-of-Bayesian-Statistics (WAMBS) checklist. As uninformative priors were used, the steps comparing informative to uninformative priors were skipped.
The priors were the same for all models, so these are only presented once; steps 2-6, however, are conducted and presented for each model (corresponding to goal type), as illustrated below.
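For illustration, the kinds of checks involved can be sketched with standard brms calls on the simple model fitted above (the same diagnostics apply to the full models):
# A sketch of typical convergence diagnostics, applied here to the simple model above
plot(personal_model.a)      # density and trace plots for each parameter
summary(personal_model.a)   # Rhat and bulk/tail effective sample sizes
pp_check(personal_model.a)  # posterior predictive check against the observed data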
Finally, we can actually look at the results of the logistic regression.
# Extracting coefficients, converting them to odds-ratios, etc.
coef_list_a <- list(
# Personal
data.frame(Response = "Personal",Variable=as.factor(rownames(personal_model_full.a.summary$fixed)), exp(personal_model_full.a.summary$fixed[,c(1,3,4)])),
# People
data.frame(Response = "People", Variable= as.factor(rownames(people_model_full.a.summary$fixed)), exp(people_model_full.a.summary$fixed[,c(1,3,4)])),
# Nature
data.frame(Response = "Nature", Variable= as.factor(rownames(nature_model_full.a.summary$fixed)), exp(nature_model_full.a.summary$fixed[,c(1,3,4)])))
# Create single DF
coef_df_a <- do.call(rbind, coef_list_a)
# Subset to variables of interest
vars_in_a <- c("DO", "education_simpleUniversity", "genderMale", "position_simplePractice", "SO_RegionEasternandSouthMEasternAsia", "SO_RegionEuropeandNorthernAmerica" , "SO_RegionLatinAmericaandtheCaribbean","SO_RegionNorthernAfricaandWesternAsia","SO_RegionOceania","SO_RegionSubMSaharanAfrica", "WH_scaled","years_cons_scaled" )
coef_df_a.s <- coef_df_a[ which(coef_df_a$Variable %in% vars_in_a), ]
# Plot
ggplot(coef_df_a.s, aes(x = Variable, y = Estimate)) + ylab("Goal endorsement (odds-ratio)") +
geom_pointrange(aes(color = Response, shape = Response, ymax = u.95..CI, ymin = l.95..CI),
position = position_dodge(width = 0.6)) +
geom_hline(yintercept = 1, linetype = "dashed", colour = "grey73") + coord_flip() + theme(axis.title.y = element_blank(), legend.position = "top", legend.direction = "vertical")
Next, we explore how perceived progress towards goals varies between groups and places. As above, three models correspond to the three goal types. For each goal type, the observations for each goal are appended to create a single longer data frame, using respondent ID as a random effect, and goal name as an explanatory variable. Respondents are only asked to evaluate progress towards goals they endorse, and so there are different numbers of observations for each goal. We will run this model as a Bayesian ordinal regression. As above, the main explanatory variables are:
As above, we use the same set of plausible values estimated for DO for each individual.
# A function that returns complete cases for each goal, progress towards each goal, and the name of that goal, converting progress back from numeric to an ordered factor
complete.goals <- function(DF, goal){
DF_goal <- DF[complete.cases(DF[,goal]),]
DF_goal$goal_num <- as.factor(goal)
DF_goal$goal_prog <- DF_goal[,goal]
DF_goal$goal_prog <- as.ordered(DF_goal$goal_prog)
return(DF_goal)
}
# Lists of the stacked DF (stacked by complete case of each goal) for each MI dataset
mice.imp_personal.b <- list_along(mice.imp.GP.sim)
mice.imp_people.b <- list_along(mice.imp.GP.sim)
mice.imp_nature.b <- list_along(mice.imp.GP.sim)
for (i in seq_along(mice.imp.GP.sim)){
# Personal
mice.imp_personal.b[[i]] <- rbind(complete.goals(mice.imp.GP.sim[[i]], "GP_2_b" ), complete.goals(mice.imp.GP.sim[[i]], "GP_3_b" ), complete.goals(mice.imp.GP.sim[[i]], "GP_4_b" ))
mice.imp_personal.b[[i]]$goal_prog <- as.ordered(mice.imp_personal.b[[i]]$goal_prog)
mice.imp_personal.b[[i]]$goal_prog <- plyr::mapvalues(mice.imp_personal.b[[i]]$goal_prog, from = c("1", "2","3", "4", "5"), to = c("Very dissatisfied" , "Dissatisfied" , "Neutral" , "Satisfied" , "Very satisfied"))
# People
mice.imp_people.b[[i]] <- rbind(complete.goals(mice.imp.GP.sim[[i]], "GP_6_b" ), complete.goals(mice.imp.GP.sim[[i]], "GP_7_b" ), complete.goals(mice.imp.GP.sim[[i]], "GP_8_b" ))
mice.imp_people.b[[i]]$goal_prog <- as.ordered(mice.imp_people.b[[i]]$goal_prog)
mice.imp_people.b[[i]]$goal_prog <- plyr::mapvalues(mice.imp_people.b[[i]]$goal_prog, from = c("1", "2","3", "4", "5"), to = c("Very dissatisfied" , "Dissatisfied" , "Neutral" , "Satisfied" , "Very satisfied"))
# Nature
mice.imp_nature.b[[i]] <- rbind(complete.goals(mice.imp.GP.sim[[i]], "GP_5_b" ), complete.goals(mice.imp.GP.sim[[i]], "GP_9_b" ), complete.goals(mice.imp.GP.sim[[i]], "GP_10_b" ))
mice.imp_nature.b[[i]]$goal_prog <- as.ordered(mice.imp_nature.b[[i]]$goal_prog)
mice.imp_nature.b[[i]]$goal_prog <- plyr::mapvalues(mice.imp_nature.b[[i]]$goal_prog, from = c("1", "2","3", "4", "5"), to = c("Very dissatisfied" , "Dissatisfied" , "Neutral" , "Satisfied" , "Very satisfied"))
}
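With the stacked data frames in place, the ordinal models take the same general form as the logistic ones, swapping the Bernoulli family for a cumulative logit. A minimal sketch, using the first imputed dataset (the full models also include the other explanatory variables):
# The ordinal model formula, mirroring brm_form.1 but with goal progress as the response
brm_form.2 <- formula("goal_prog ~ goal_num + years_cons_scaled + (1|ID)")
# A sketch only: a cumulative-logit model on the first imputed dataset
personal_model.b <- brm(brm_form.2, family = cumulative('logit'), data = mice.imp_personal.b[[1]], warmup = burn_in, iter = iterations, chains = 4, seed = 123, prior = prior_all)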
The degree to which the proportional odds assumption was likely to hold was visually inspected for each goal-type model. Although not ideal (at low levels of goal progress), this appears adequate.
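The inspection itself is not reproduced here; one simple way to sketch such a check (binning years in conservation into three coarse groups is an arbitrary choice for illustration) is to compare empirical logits of the cumulative probabilities across groups, which should shift roughly in parallel under proportional odds:
# A sketch of one possible check of the proportional odds assumption
df.b <- mice.imp_personal.b[[1]]
df.b$years_grp <- cut(df.b$years_cons_scaled, breaks = 3)
for (k in 2:nlevels(df.b$goal_prog)) {
p <- tapply(as.numeric(df.b$goal_prog) >= k, df.b$years_grp, mean)
cat(levels(df.b$goal_prog)[k], ":", round(qlogis(p), 2), "\n")
}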
As above, the WAMBS-checklist was implemented, the results of which can be found here:
Once again, we can look at the results of the ordinal regression.
# Extracting coefficients, converting them to odds-ratios, etc.
coef_list_b <- list(
# Personal
data.frame(Response = "Personal",Variable=as.factor(rownames(personal_model_full.b.summary$fixed)), exp(personal_model_full.b.summary$fixed[,c(1,3,4)])),
# People
data.frame(Response = "People", Variable= as.factor(rownames(people_model_full.b.summary$fixed)), exp(people_model_full.b.summary$fixed[,c(1,3,4)])),
# Nature
data.frame(Response = "Nature", Variable= as.factor(rownames(nature_model_full.b.summary$fixed)), exp(nature_model_full.b.summary$fixed[,c(1,3,4)])))
# Create single DF
coef_df_b <- do.call(rbind, coef_list_b)
# Subset to variables of interest
vars_in_b <- c("DO", "education_simpleUniversity", "genderMale", "position_simplePractice", "SO_RegionEasternandSouthMEasternAsia", "SO_RegionEuropeandNorthernAmerica" , "SO_RegionLatinAmericaandtheCaribbean","SO_RegionNorthernAfricaandWesternAsia","SO_RegionOceania","SO_RegionSubMSaharanAfrica", "WH_scaled","years_cons_scaled" )
coef_df_b.s <- coef_df_b[ which(coef_df_b$Variable %in% vars_in_b), ]
# Plot
ggplot(coef_df_b.s, aes(x = Variable, y = Estimate)) + ylab("Goal progress (odds-ratio)") +
geom_pointrange(aes(color = Response, shape = Response, ymax = u.95..CI, ymin = l.95..CI),
position = position_dodge(width = 0.6)) +
geom_hline(yintercept = 1, linetype = "dashed", colour = "grey73") + coord_flip() + theme(axis.title.y = element_blank(), legend.position = "top", legend.direction = "vertical")