2020-06-18
This document summarises the online survey data collected as part of the Life in Conservation project. This is the first of three walkthroughs. The second explores associations between variables, found here. The third implements the full structural equation model, using multiple imputation for missing data, found here.
This document also accompanies some theory walkthroughs (e.g. exploratory factor analysis, item-response theory, and SEM). These walkthroughs provide the methodological justification for some of the modelling decisions. There are also some additional side analyses, linked to during the course of the walkthrough.
You can see a detailed description of each variable, including associated questions, here. The dataset contains 86 variables. 2906 people started the survey, of whom 224 actively chose to “leave the survey” for unknown reasons. A further 401 did not complete the psychological distress questions (or, in a few cases, completed the survey twice), so will be excluded. We will use all other responses (2281). These remaining responses include 187 people who said they do not work or do research in conservation (but are likely still considered conservationists). We’ll repeat the analysis to see if this affects the results.
The table below shows the pattern of missing data. Of the 2281 responses in the dataset, 55 are from French speakers, 59 are from Spanish speakers, 23 are from Portuguese speakers, 3 are from Kiswahili speakers, and the rest are from English speakers.
The vast majority of these missing data are from variables we expected some people not to want to answer, like the age questions. We’ll talk about the consequences of this missing data as we proceed.
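A missing-data summary of this kind can be sketched in base R as below. This is a minimal illustration on a made-up data frame (`toy` and its columns are hypothetical stand-ins, not the survey data), counting and ranking missing values per variable.

```r
# Toy data frame standing in for the survey data (hypothetical columns).
toy <- data.frame(
  age    = c(34, NA, 41, 29, NA),
  gender = c("Female", "Male", NA, "Female", "Male"),
  K10_1  = c(1, 2, 3, 2, 1)
)

# Number and percentage of missing values per variable.
n_missing   <- colSums(is.na(toy))
pct_missing <- round(100 * n_missing / nrow(toy), 1)
missing_summary <- data.frame(
  variable    = names(n_missing),
  n_missing   = as.integer(n_missing),
  pct_missing = as.numeric(pct_missing)
)

# Sort so the variables with the most missing data come first.
missing_summary <- missing_summary[order(-missing_summary$n_missing), ]
print(missing_summary)
```

Ranking variables this way makes it easy to see whether missingness is concentrated in a few sensitive questions, as it is for age here.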
We can also have a quick look at the number of observations over time (within 30-day smoothed bins). The dashed line is an arbitrary date (1st January 2020). Those to the left are categorised as “pre-COVID” and those to the right as “COVID”. Interestingly, dates and durations are missing for some observations (which are otherwise complete).
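As a rough sketch of this categorisation, the snippet below splits some made-up dates at the cut-off. The vector `start_date` is hypothetical, and treating the cut-off day itself as “COVID” is an assumption rather than something stated above.

```r
# Hypothetical survey start dates, including one missing value.
start_date <- as.Date(c("2019-11-20", "2019-12-31", "2020-01-01", "2020-03-15", NA))

# The cut-off used in the figure (1st January 2020).
cutoff <- as.Date("2020-01-01")

# Observations before the cut-off are "pre-COVID"; those on or after it are "COVID"
# (assumption: the cut-off day itself counts as "COVID").
# Missing dates stay NA rather than being forced into either group.
period <- ifelse(start_date < cutoff, "pre-COVID", "COVID")
period <- factor(period, levels = c("pre-COVID", "COVID"))

table(period, useNA = "ifany")
```

Keeping missing dates as `NA` means those otherwise complete observations can still be used elsewhere without being misclassified.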
Let’s look at some summary statistics for each variable. We’ll start with the K10 questions for measuring psychological distress, since that will be used in later parts of the analysis. We’ll then proceed through the questions in the order they are asked.
Let’s start with the K10, which indicates the risk of psychological distress, and look at the frequency of responses per item. Questions 3, 6, and 8 are skip-pair questions, meaning that they’re only asked if a respondent gives any response other than “none of the time” to the previous question.
To score the K10, each item is scored from 1, for ‘none of the time’, to 5, for ‘all of the time’, and the scores across the ten items for each respondent are added together (giving a total ranging from 10 to 50). There is no universally agreed interpretation of the K10 scores. The Australian Bureau of Statistics uses the following scoring:
The 2001 Victorian Population Health Survey uses the following scoring:
Now, let’s sum the scores for each person, and look at the distribution of total scores, and the thresholds between psychological distress severity levels (based on the Australian Bureau of Statistics scoring). Within our dataset, 25% have low distress, 34% have moderate distress, 29% have high distress, and 13% experience very high distress.
# Percentage of respondents with high (22 to 29) or very high (30 to 50) distress
round((table(DF2$K10_total >= 22 & DF2$K10_total <= 29)["TRUE"] / nrow(DF2) + table(DF2$K10_total >= 30 & DF2$K10_total <= 50)["TRUE"] / nrow(DF2)) * 100, 2)
## TRUE
## 41.25
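The scoring and banding steps can be sketched as follows, using made-up responses for three respondents. The high (22 to 29) and very high (30 to 50) bands match the code above; the low (10 to 15) and moderate (16 to 21) bands are my assumption about the Australian Bureau of Statistics cut-offs, and `k10_items` is a hypothetical matrix rather than the survey data.

```r
# Hypothetical item responses for three respondents, coded 1 ("none of the time")
# to 5 ("all of the time") on the ten K10 items.
k10_items <- rbind(
  rep(1, 10),                         # total 10
  c(2, 2, 1, 3, 2, 1, 2, 1, 3, 2),    # total 19
  rep(4, 10)                          # total 40
)
colnames(k10_items) <- paste0("K10_", 1:10)

# Sum across the ten items to get a total between 10 and 50.
k10_total <- rowSums(k10_items)

# Band the totals. Breaks are upper-inclusive: (9,15], (15,21], (21,29], (29,50].
# The two lower bands are assumed cut-offs; the two upper bands follow the code above.
k10_band <- cut(k10_total,
                breaks = c(9, 15, 21, 29, 50),
                labels = c("Low", "Moderate", "High", "Very high"))

# Percentage of respondents in each band.
round(100 * prop.table(table(k10_band)), 1)
```

Using `cut()` keeps all four bands in a single ordered factor, which avoids repeating the threshold logic for each band.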
A walkthrough of the theory underpinning this exploratory factor analysis can be found here.
A few points:
- The data are ordinal, so we’re going to examine the polychoric rather than the Pearson correlation.
- Furthermore, we’re then going to implement the confirmatory factor analysis in SEM, using WLSMV.
- Additionally, items 2 and 3, 5 and 6, and 7 and 8 are answered using a skip-pattern, meaning they’re interdependent. It will quickly become apparent that this affects our model fit. To correct for this, I’ve specified correlated error terms for each of the three pairs.
- We’ll conduct the exploratory factor analysis on 70% of the data, and the confirmatory analysis on the other 30% (randomly split). This isn’t really necessary for the K10, because it has already been validated multiple times. We’ll do it anyway though.
Let’s split the data.
# Choose the training sample size (the seed has been set above, so this is reproducible)
smp_size <- floor(0.7 * nrow(DF2))
# Row indices of the training observations
train_ind <- sample(seq_len(nrow(DF2)), size = smp_size)
# Subset into training and test datasets
train_DF2 <- DF2[train_ind, ]
test_DF2 <- DF2[-train_ind, ]
First, let’s look at the polychoric correlation between items. We see high correlations between items, as we expect.
Next, let’s perform parallel analysis, to explore how many factors to extract. The vast majority of variation is explained by the first factor, which we expect to be psychological distress.
## Parallel analysis suggests that the number of factors = 3 and the number of components = NA
Let’s confirm this by looking at the model fit (reflected by the RMSEA) of EFA models with different numbers of extracted factors (using WLS, and oblimin rotation). As described in the EFA theory walkthrough, RMSEA values closer to 0 indicate better model fit:
Factors | RMSEA | lower | upper |
---|---|---|---|
1 | 0.204 | 0.197 | 0.211 |
2 | 0.171 | 0.163 | 0.179 |
3 | 0.165 | 0.155 | 0.175 |
4 | 0.122 | 0.110 | 0.135 |
5 | 0.122 | 0.110 | 0.135 |
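As a brief reminder of why values near 0 indicate better fit, the sketch below computes a common textbook form of the RMSEA point estimate from a chi-square statistic, its degrees of freedom, and the sample size. This is only an illustration: the function `rmsea()` and its inputs are hypothetical, and software fitting WLS-type estimators may use scaled statistics and a slightly different denominator, so it will not reproduce the values in the table.

```r
# RMSEA from a model chi-square statistic, its degrees of freedom and sample size.
# This is a common textbook point estimate; software may differ in detail.
rmsea <- function(chisq, df, n) {
  sqrt(max(chisq - df, 0) / (df * (n - 1)))
}

# Hypothetical values, not taken from the models in this walkthrough.
rmsea(chisq = 350, df = 35, n = 1000)

# A chi-square at or below its degrees of freedom gives an RMSEA of exactly 0.
rmsea(chisq = 30, df = 35, n = 1000)
```

The estimate grows with the amount by which the chi-square exceeds its degrees of freedom, so larger values reflect more misfit per degree of freedom.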
The fit is very poor. Let’s have a look at the results and fit if we include the pair-wise correlated error terms (with coefficients presented as standardized parameter estimates). Here we’re moving into confirmatory factor analysis territory, but I’m not sure how you’d do exploratory factor analysis with pair-wise correlated error terms. We’ll still do this with the training dataset though.
# K10 CFA
model_K10 <- '
# PD
PD =~ K10_1 + K10_2 + K10_3 + K10_4 + K10_5 + K10_6 + K10_7 + K10_8 + K10_9 + K10_10
# Correlated error terms
K10_2 ~~ K10_3
K10_5 ~~ K10_6
K10_7 ~~ K10_8
'
# Model
fit_K10 <- lavaan::cfa(model=model_K10, estimator = "WLSMVS", data=train_DF2[,K10_row_name], ordered = c("K10_1" , "K10_2" , "K10_3" , "K10_4", "K10_5", "K10_6", "K10_7", "K10_8", "K10_9", "K10_10"))
The RMSEA of this model is 0.051 (95% CI 0.04 - 0.06), which indicates a close model fit. Clearly, the correlated error terms are needed. For peace of mind, let’s also fit the same model to the test dataset. When doing this, we get an RMSEA of 0.053 (95% CI 0.04 - 0.07), which is fine.
We can further explore the relationship between each item and psychological distress using Item Response Theory. However, since this is tangential, I’ve relegated it to here.
The mean (blue dashed line) and median age of respondents are 37 and 34 respectively, and the standard deviation (SD) is 11.2. As we’ve seen, there are 169 (7.4%) missing observations. This is a bit high, and it is unlikely that these data are missing at random. We’ll see how this affects the results, through sensitivity analysis, later.
Also, this variable will be scaled and centered in future analysis.
The goal progress statements were based on the Value-Belief-Norm theory, which suggests that pro-environmental behavioural intentions are motivated by egoistic, altruistic, and biospheric values. In the figure below, the left panel shows the number of people that consider each goal to be important, grouped into egoistic, altruistic, and biospheric categories.
The biospheric goals were the most endorsed, with an average of 70.5% of the goals being endorsed, followed by egoistic goals (56.9%) and altruistic goals (54.3%). However, if we exclude “… making a meaningful contribution to conservation” then the average endorsement for the egoistic goals falls to 44.3%.
The right panel shows how satisfied or dissatisfied respondents are with progress being made to each goal that has been endorsed. Broadly, respondents tend towards being satisfied with progress towards egoistic goals, relatively neutral about progress towards altruistic goals, but dissatisfied with progress towards biospheric goals (which were the most important goals to conservationists). The goal of “… making a meaningful contribution to conservation” was simultaneously the most important goal, and the one which respondents were most satisfied with. The goals of “… stopping human-driven species loss” and “… stopping damage to the natural world” were the second and third most important goals, respectively, but also the ones which respondents were least satisfied with.
This highlights an interesting contrast - respondents tend to be satisfied with their contribution to conservation, yet are dissatisfied with the state of conservation as a whole. This might be because the first is something within an individual’s control, whereas the state of the natural world is strongly influenced by external forces.
Our goal response variable is a composite variable, derived from the goal progress variables. One of the strengths of SEM is the ability to construct composite variables, neatly summarised here: “This composite is still an unmeasured quantity – like a latent variable – but with no error variance, and with “indicators” actually driving the variable, rather than having the unmeasured variable causing the expression of its indicators."
This would be a great method for exploring the association between goal progress and psychological distress. The only problem is that we have “missing data”. We know the mechanism that generated this missing data (people did not endorse the goal), so we can’t use methods normally reserved for missing-at-random data. I am still looking for an elegant way of constructing the composite variable (see here). However, for the time being we’ll take the mean of endorsed goals (i.e. the average across those goals that were endorsed). This makes a number of assumptions:
We’re also going to scale and center the goal progress variable, to make it easier to interpret the results.
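The interim approach of averaging over endorsed goals and then standardising can be sketched as below. The data frame `goal_progress`, its four columns, and the 1 to 5 satisfaction coding are all hypothetical; `NA` stands for a goal the respondent did not endorse.

```r
# Hypothetical satisfaction scores (1 = very dissatisfied, 5 = very satisfied)
# for four goals. NA means the respondent did not endorse that goal.
goal_progress <- data.frame(
  goal_1 = c(4, NA, 2, NA),
  goal_2 = c(5, 3, NA, NA),
  goal_3 = c(NA, 2, 1, NA),
  goal_4 = c(3, NA, 3, NA)
)

# Mean across endorsed goals only.
goal_mean <- rowMeans(goal_progress, na.rm = TRUE)

# rowMeans() returns NaN when a respondent endorsed no goals; recode to NA.
goal_mean[is.nan(goal_mean)] <- NA

# Centre on the sample mean and scale to SD units.
goal_scaled <- as.numeric(scale(goal_mean))
```

Note that a respondent who endorsed a single goal and one who endorsed all of them are weighted equally here, which is one of the simplifications of taking a plain mean.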
Prior to the goal progress questions, we ask respondents to describe the area or context they are thinking about. We’ll think about a more sophisticated analysis of this, but for now, here is a word cloud of the top 300 words.
The Effort-Reward Imbalance (ERI) model is used to understand occupational risk factors for mental illness, specifically whether rewards are commensurate with effort. A value greater than one suggests that efforts outweigh rewards. The figure below shows the agreement and disagreement for each item (NB the colour inversion between the two plots, and the reverse coding in the second plot).
This first plot simply shows the levels of agreement or disagreement with each ERI statement. However, the ERI is supposed to be a single value calculated by comparing the balance of efforts and rewards. The red line indicates a balance of “efforts” and “rewards”. Observations to the left are where efforts are less than rewards. Observations to the right are where efforts are greater than rewards, i.e. a value greater than one suggests that efforts outweigh rewards. The ERI value for the original set of statements included in the ERI is shown in pink. However, we added some additional questions that capture some additional challenges and rewards that conservationists might experience. The value of our updated ERI is shown in blue.
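For illustration, an effort-reward ratio of this kind is usually computed as the effort sum divided by the reward sum, with a correction factor for the unequal number of items. The sketch below uses made-up scores; the item names, the 1 to 4 coding, and the item counts are hypothetical and are not the items used in the survey.

```r
# Hypothetical item scores (1 = strongly disagree ... 4 = strongly agree),
# with reward items assumed to be already reverse coded where needed.
effort <- data.frame(E1 = c(4, 2, 3), E2 = c(3, 2, 3), E3 = c(4, 1, 3))
reward <- data.frame(R1 = c(2, 4, 3), R2 = c(1, 3, 3), R3 = c(2, 4, 3), R4 = c(2, 3, 3))

# Correction factor for the unequal number of effort and reward items.
k <- ncol(effort) / ncol(reward)

# ERI ratio: effort sum over the reward sum, rescaled by k.
# Values above 1 mean efforts outweigh rewards; values below 1 mean the reverse.
eri <- rowSums(effort) / (rowSums(reward) * k)
eri
```

In this toy example the first respondent falls to the right of the balance line (efforts outweigh rewards), the second to the left, and the third sits on it.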
We can also look at the correlation between efforts and rewards, both of which I’ve scaled and centred (SD units). This is somewhat surprising. We would have expected rewards to increase as effort increases. However, this suggests the opposite.
Let’s start by looking at which countries people are thinking about when asked “Which country’s conservation context are you most familiar with?”
It appears we have a reasonable spread - although results are clustered in the USA, UK, and India.
Let’s look at the distribution of responses in each situational optimism item.
As above, we might ask what latent variables drive these responses. We used the Aichi Biodiversity Targets as an indicator of the general breadth of what people consider to be desirable future conservation outcomes. We created two statements for each of the five targets (and one extra target, discussed a bit later). We expect those two statements to be associated with each other, and the target, but less associated with statements linked to other targets. As a result, we expect to see five latent variables, corresponding to optimism about each of the five Aichi targets. However, we expect that each of those five latent constructs is also driven by a second-order latent variable: conservation optimism.
The potential contribution of the final item (SO_11) is less clear - all other questions are about general conservation outcomes in a selected country. However, the final item is specific to an individual conservation context, and so might be much more sensitive to local contextual conditions. There is a strong argument for including this as a separate variable in the analysis. For the time being, we’ll exclude it from the analysis.
There seems to be reasonable correlation between items (within the training dataset), particularly among the adjacent pairs of items (that correspond to the pairs for each Strategic goal), among the first half of the set of items.
## Parallel analysis suggests that the number of factors = 3 and the number of components = NA
The parallel analysis suggests between two and three factors, so let’s have a look at the associated RMSEA.
Factors | RMSEA | lower | upper |
---|---|---|---|
1 | 0.169 | 0.162 | 0.176 |
2 | 0.123 | 0.115 | 0.131 |
3 | 0.106 | 0.096 | 0.116 |
4 | 0.128 | 0.116 | 0.141 |
5 | 0.128 | 0.116 | 0.141 |
None of these models provides a very good fit. Let’s look at the results of a five factor model, which corresponds to the five latent variables we expected (although, in the table above, the three factor model has the lowest, but still high, RMSEA).
Factor 1 | Factor 2 | Factor 3 | Factor 4 | Factor 5 | |
---|---|---|---|---|---|
Public support for conservation will grow over the next ten years (SO_1) | 0.66 | ||||
Government spending on conservation will grow over the next ten years (SO_2) | 0.77 | ||||
The harmful impact of people on nature will be less in ten years’ time than it is now (SO_3) | 0.89 | ||||
Human society will be more environmentally sustainable in ten years’ time than it is now (SO_4) | 0.69 | ||||
There will be more wildlife in ten years’ time than there is today (SO_5) | 0.85 | ||||
There will be more natural areas and habitats in ten years’ time than there are today (SO_6) | 0.83 | ||||
People will spend more recreational time in nature in ten years’ time than they do now (SO_7) | 0.72 | ||||
Nature will be able to provide the same benefits to people in ten years’ time as now (SO_8) | 0.41 | ||||
There will be more local participation in conservation in ten years’ time than now (SO_9) | 0.62 | ||||
Conservationists will have better tools and knowledge in ten years’ time than now (SO_10) | 0.9 |
Item 10 is only loaded on one factor, and item 8 is weakly loaded in general. Furthermore, we know items 1 & 2, 3 & 4, and 5 & 6 relate to the same Aichi targets, and 7 and 9 appear to relate to people’s engagement with nature. Let’s look at a model where we include correlated error terms between those grouped items, and exclude items 8 and 10.
# The model
model_SO_simple <- '
SO =~ SO_1 + SO_2 + SO_3 +SO_4 + SO_5+ SO_6 + SO_7 + SO_9
# Correlated error terms
SO_1 ~~ SO_2
SO_3 ~~ SO_4
SO_5 ~~ SO_6
SO_7 ~~ SO_9
'
# Now let's fit the model to the training dataset
fit_SO_simple <- lavaan::cfa(model = model_SO_simple, estimator = "WLSMVS", data=train_DF2[,SO_row_name], ordered = c("SO_1" , "SO_2" , "SO_3" , "SO_4", "SO_5", "SO_6", "SO_7", "SO_9") )
The RMSEA of this model is 0.066 (95% CI 0.06 - 0.08), which is adequate.
# Now let's fit the same model to the test dataset
fit_SO_simple_test <- lavaan::cfa(model = model_SO_simple, estimator = "WLSMVS", data = test_DF2[,SO_row_name], ordered = c("SO_1" , "SO_2" , "SO_3" , "SO_4", "SO_5", "SO_6", "SO_7", "SO_9") )
Let’s see how it performs on the test dataset. The RMSEA of this model is 0.076 (95% CI 0.06 - 0.09), which again, appears adequate.
Next, we turn to dispositional optimism, measured using the Life Orientation Test-Revised (LOT-R). The LOT-R contains four filler items, which we discarded because we wanted to minimise the survey length. Items are scored from 0 to 4, and the scores for each item are added together.
Let’s look at the distribution of total scores, ranging from 0 to 24, with people above 19 considered to have high optimism.
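A minimal sketch of this scoring is shown below, using made-up responses. It assumes the raw responses are coded 0 to 4 as marked by the respondent, and that LOTR_2, LOTR_4, and LOTR_5 are the negatively worded items that need reverse coding; the data frame `lotr` is hypothetical.

```r
# Hypothetical responses to the six retained items, each coded 0 to 4
# as marked by the respondent (before any reverse coding).
lotr <- data.frame(
  LOTR_1 = c(4, 2, 0), LOTR_2 = c(0, 2, 4), LOTR_3 = c(4, 2, 1),
  LOTR_4 = c(1, 2, 4), LOTR_5 = c(0, 2, 3), LOTR_6 = c(4, 2, 0)
)

# Reverse code the negatively worded items so that higher always means more optimistic.
negative_items <- c("LOTR_2", "LOTR_4", "LOTR_5")
lotr[negative_items] <- 4 - lotr[negative_items]

# Total score (0 to 24) and the "high optimism" flag (above 19).
lotr_total <- rowSums(lotr)
high_optimism <- lotr_total > 19

data.frame(lotr_total, high_optimism)
```

Reverse coding before summing matters: without it, agreement with the pessimistically worded items would raise the total rather than lower it.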
Since we’re also including this as a latent variable, we should run through some of the steps to examine if this is valid. First, we examine the correlations between items, which are good.
Now let’s examine the number of factors to consider extracting. There is ongoing discussion about whether the LOT-R is a one factor model, describing optimism, or a two factor model, describing optimism and pessimism as distinct constructs. Alternatively, others have argued that it is a two factor model, with one factor describing optimism, and the other factor simply being a method effect associated with the positive wording of questions.
## Parallel analysis suggests that the number of factors = 2 and the number of components = NA
In fact, it’s clear that a two factor model has much better fit - I was unable to fit a three factor model.
Factors | RMSEA | lower | upper |
---|---|---|---|
1 | 0.234 | 0.220 | 0.248 |
2 | 0.065 | 0.044 | 0.087 |
OK, let’s examine the loadings for a two factor model. These correspond to the loadings on each item found in other studies (e.g. here). NB because we discarded the fillers, the coding we used (ranging from LOTR_1 to LOTR_6) does not correspond to the codes used in other studies (LOTR_1 to LOTR_10).
Factor 1 | Factor 2 | |
---|---|---|
In uncertain times, I usually expect the best (LOTR_1) | 0.72 | |
If something can go wrong for me, it will (reverse coding) (LOTR_2) | 0.61 | |
I’m always optimistic about my future (LOTR_3) | 0.80 | |
I hardly ever expect things to go my way (reverse coding) (LOTR_4) | 0.90 | |
I rarely count on good things happening to me (reverse coding) (LOTR_5) | 0.67 | |
Overall, I expect more good things to happen to me than bad (LOTR_6) | 0.56 |
I’ve done a bit of IRT analysis on these data, which can be found here. Also, we’re ‘undoing’ the reverse coding for the pessimism items. Originally all response levels had the same coding. We then reverse coded those for pessimism, so one response level was coded as 0 for optimism questions, but 4 for pessimism questions (for instance). We’re now undoing that, so a given response level will have the same code for both sets of questions. This makes it clearer that those items are negatively correlated with each other.
So far we’re assuming that optimism and pessimism are correlated. Let’s test that this is true, by including the correlation between the two factors.
model_OP_PES <- '
###### Dispositional optimism and pessimism
# Dispositional optimism
OP =~ LOTR_1 + LOTR_3 + LOTR_6
# Dispositional pessimism
PES =~ LOTR_2 + LOTR_4 + LOTR_5
######
OP~~PES
'
# Now let's fit the model to the training dataset
fit_OP_PES <- lavaan::sem(model=model_OP_PES, estimator = "WLSMVS", data= train_DF2 , ordered = c("LOTR_1" , "LOTR_2" , "LOTR_3" , "LOTR_4", "LOTR_5", "LOTR_6") )
There does appear to be a strong correlation between the two, but see below.
The RMSEA of this model is 0.099 (95% CI 0.08 - 0.11), which is poor.
Others have argued that the model is actually best fit as a single factor describing dispositional optimism, and a second latent variable accounting for the method effect of positively worded questions (see here). These factors should be orthogonal to each other (and in fact, need to be if the model is to be identified), as described in the paper.
model_OP_method <- '
###### Dispositional optimism
# Dispositional optimism
OP =~ LOTR_1 + LOTR_2 + LOTR_3 + LOTR_4 + LOTR_5 + LOTR_6
# The method effect
method =~ LOTR_1 + LOTR_3 + LOTR_6
# OP and the method effect are orthogonal
OP ~~0*method
'
# Now let's fit the model to the training dataset
fit_OP_meth <- lavaan::sem(model=model_OP_method, estimator = "WLSMVS", data=train_DF2 , ordered = c("LOTR_1" , "LOTR_2" , "LOTR_3" , "LOTR_4", "LOTR_5", "LOTR_6") )
The RMSEA of this model is 0.062 (95% CI 0.05 - 0.08), which is better.
# Now let's fit the same model to the test dataset
fit_OP_meth_test <- lavaan::sem(model=model_OP_method, estimator = "WLSMVS", data= test_DF2 , ordered = c("LOTR_1" , "LOTR_2" , "LOTR_3" , "LOTR_4", "LOTR_5", "LOTR_6") )
Let’s see how well this structure works with the test dataset. The RMSEA of this model is 0.045 (95% CI 0.01 - 0.08), which is adequate.
Health | Percentage |
---|---|
Very bad | 0.7 |
Bad | 3.7 |
Fair | 21.6 |
Good | 51.2 |
Very good | 22.1 |
Presented in percentage terms.
Statement | Item | Strongly disagree | Disagree | Neither | Agree | Strongly agree |
---|---|---|---|---|---|---|
It is dangerous to go outside at night alone | PS_1 | 31.8 | 32.4 | 9.8 | 19.2 | 6 |
My work puts me in dangerous situations | PS_2 | 26.6 | 33.3 | 15.1 | 19.7 | 4.5 |
I do not feel safe, even where I live | PS_3 | 49.4 | 34.2 | 8.6 | 5.5 | 1.6 |
For the large analysis, I’m going to code into desk-based, non-desk-based, and other/unknown.
Position | Percentage |
---|---|
Administration | 2.4 |
Bachelors student | 1.6 |
Consultant/self-employed | 5.7 |
Fieldworker | 6.0 |
Graduate student | 10.4 |
Intern | 0.9 |
Manager | 14.1 |
Other | 16.9 |
Policymaker | 1.4 |
Ranger | 1.1 |
Researcher | 35.3 |
Unknown | 4.2 |
Gender | Percentage |
---|---|
Female | 52.0 |
Male | 42.1 |
Other | 0.2 |
Prefer not to say | 1.4 |
Education | Percentage |
---|---|
College | 4.2 |
Unknown | 4.4 |
Primary | 0.5 |
Secondary | 1.5 |
University | 89.4 |
Within our sample, 22.1% are working in countries that they are not nationals of.
The next steps will be to explore bivariate relationships between variables, found here.
Social support
Presented in percentage terms.