This document constitutes a preliminary report on the analyses carried out on the latest available version (as of Thu Nov 05 10:13:58 2020) of the survey dataset, collated from the questionnaires collected over the study period and curated by Simone Tomaz, Anna Whittaker and Gemma Ryde. This is a work in progress and will be updated regularly to reflect the current state of the research and to incorporate feedback from the team. Ultimately, this document should provide the backbone for a fully-fledged and rigorous analysis report.
In line with the desiderata in the shared tasks document, the following (slightly adapted) list of questions has been identified as of interest and will be addressed here.
With respect to missing data, is there any indication as to which mechanisms underlie the process? Also, look into differences in missingness by SurveyCompleted (encoding which survey was completed).
Produce data completeness heat maps for IPAQ, UCLA, and BPSSQ scores.
Can the sample be deemed representative of its parent population, with respect to the basic set of variables identified in the design phase?
Do the \(70+/<70\) and \(60+/<60\) groups differ (and if they do, how) in terms of i) socio-demographics, ii) self-reported health status, iii) physical activity, and iv) socialisation?
Consider the analyses in question 4 (the age-group comparisons), stratified by sex.
Consider the analyses in question 4, stratified by likelytoshield. Also look into differences in UCLA and BPSSQ.
Consider the analyses in question 4, stratified by quantiles of deprivation.
Is there any suggestion of an association between urban/rural index (3-fold) and all outcomes?
Is there any suggestion of differences between carers and non-carers, stratified by age group, in all outcomes?
Is there evidence of association between socio-demographic variables and health variables?
Does current physical activity relate to current wellbeing (EQ5D)?
How are change variables related to each other?
Are changes in physical activity associated with the level of physical activity undertaken?
Does change in social contact/loneliness relate to the current level of BPSSQ/UCLA?
Do COVID variables relate to any other outcomes and/or differ across the age groups?
Explore the sleep variables, also with respect to health variables.
The study dataset comprises 1429 unique observations (subjects) on 157 variables. The observations were collected online, using a questionnaire developed for the occasion, which also included a number of pre-existing, separately validated tools.
Data collection: the questionnaire was advertised on social media (Facebook, Twitter) and through other channels [specify which ones] administered over the period [fill in details]. [expand]
Data preprocessing: [expand]
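To be able to properly take action with respect to gaps in information (missing data), it is paramount to achieve some level of understanding of the underlying process that led to incomplete observations. A useful framework to adopt when discussing missingness was proposed by Rubin (1976), who identified three possible mechanisms responsible for it:
Missing Completely At Random (MCAR): the probability of a value being missing is unrelated to both the observed and the unobserved data.
Missing At Random (MAR): the probability of a value being missing depends only on the observed data.
Missing Not At Random (MNAR): the probability of a value being missing depends on unobserved data, possibly including the missing value itself.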
Without going into too much detail, Rubin’s framework provides a simple way to assess the validity of many commonly used missing data handling methods under the three mechanisms above. Broadly speaking, we would like to be dealing with MCAR (hardly ever occurring in practice, but the only situation where listwise and pairwise deletion do not bias the estimates; mean imputation, even then, preserves the means but understates variability). In practice, however, MAR and MNAR are far more likely situations, both calling for increased care in how missing observations are treated. As no formal statistical test can discriminate among all three possibilities (MAR and MNAR, in particular, cannot be told apart using the observed data alone), inspection of patterns of missingness in the data (also with the aid of visual instruments) paired with expert knowledge is typically the most robust strategy.
The dataset contains 8753 missing observations, which is ~3.9% of the total (1429 rows \(\times\) 157 columns \(=\) 224353). Approximately 52.87% of the variables contain at least one missing value, as do ~97.69% of the individual records. If of interest, it is possible to inspect the frequency tables of each of the variables and individuals.
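The summary figures above can be recomputed with a few lines of base R. What follows is a minimal sketch, not the code used for this report: `toy` is a made-up data frame standing in for the survey dataset `x`, and `miss_summary()` is a hypothetical helper name.

```r
# Toy stand-in for the survey data frame; the values are illustrative only.
toy <- data.frame(
  age        = c(71, 64, NA, 58),
  income     = c(NA, 2, 3, NA),
  medication = c("a", NA, NA, "b"),
  sex        = c("F", "M", "F", "M")
)

# Hypothetical helper: overall share of missing cells, share of variables
# with at least one missing value, share of records with at least one.
miss_summary <- function(d) {
  m <- is.na(d)  # logical missingness indicator matrix (TRUE = missing)
  c(
    n_missing     = sum(m),
    pct_cells     = 100 * mean(m),
    pct_variables = 100 * mean(colSums(m) > 0),
    pct_records   = 100 * mean(rowSums(m) > 0)
  )
}

res <- miss_summary(toy)
res
```

Note that `is.na()` only catches values coded as `NA`. Fields stored as empty strings (as appears to be the case for `postcode` later in this report) would need recoding to `NA` first to be counted.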
The relatively low missingness rate and the large number of variables and individuals involved would suggest that the observations are missing in a relatively sparse way. There is, however, no guarantee that local clusters of missingness patterns are not occurring. A simple way to get an idea of the occurrence of patterns is to produce a visualisation of the whole dataset after recoding it to a missingness indicator matrix (where each value maps to either 0, not missing, or 1, missing), possibly sorting the variables from highest to lowest number of missing observations. The next figure contains such a visualisation.
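library(visdat)  # vis_miss() is provided by the visdat package
vis_miss(x, sort_miss = TRUE)  # columns sorted from most to least missing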
Each of the vertical grey ticks indicates one column (variable), whereas each row is a subject. While pretty crowded (we have 157 variables in the dataset), the plot helps provide an overall look at missingness. What emerges is that the overall missingness is low (as observed earlier, ~3.9%), and that most of the missing values are concentrated in a small number of variables. It is certainly of interest to focus on these, to see if any noticeable pattern arises.
In order to do so, a very useful visual tool is the upset plot. Originally developed to depict intersections between sets (think of it as visualising a contingency table with some additional information on top), it can be used to compactly highlight where the missing observations occur in a dataset, and which combinations of missingness across a set of variables are more frequent. What follows is a missingness upset plot for the 10 variables with the largest number of missing observations, showing up to the 30 most frequent missingness patterns across them.
The 10 variables considered account for ~78.82% of the total missingness. With 46.54% missing values, medication is by far the variable with the highest level of missingness. Inspecting the upset plot reveals that pintsbeer, shotsspirits and income follow (26.38%, 17.42% and 16.66%, respectively), often presenting joint missingness. While the most frequently observed patterns are medication and income missing alone, combinations of missing values on these four variables account for a large part of the remaining observed patterns. It might be worth looking into why this is happening: is the questionnaire designed in such a way that non-drinkers (for example) would tend not to report on alcohol consumption, rather than just indicating ‘zero’ as a response? How is the variable medication linked to alcohol consumption? Are respondents reluctant to report using alcohol in conjunction with medications?
The next set of variables that present some level of missingness includes deprivation and rurality measures. Quintiles, vigintiles and ranks of SIMD (and similarly rurality measures) are computed based on the same piece of information (possibly the postcode?), hence missingness is expected to be shared across them. Once this is confirmed, it might be of interest to produce another upset plot, removing the redundant information. This will be important in order to better understand where exactly the missing data occur, and whether there is any suspicion that they can be linked to specific characteristics of the individuals (which would then warrant an additional piece of analysis).
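The counts that an upset plot displays can also be tabulated directly, which may help when checking whether two variables always go missing together. The following is a minimal base R sketch under stated assumptions: `toy` is a made-up data frame, `miss_patterns()` is a hypothetical helper, and the defaults `k = 10` and `n_patterns = 30` simply mirror the settings described above.

```r
# Toy stand-in for the survey data frame; the values are illustrative only.
toy <- data.frame(
  age        = c(71, 64, NA, 58),
  income     = c(NA, 2, 3, NA),
  medication = c("a", NA, NA, "b"),
  sex        = c("F", "M", "F", "M")
)

# Hypothetical helper: frequency of joint missingness patterns across the
# k variables with the largest number of missing values.
miss_patterns <- function(d, k = 10, n_patterns = 30) {
  m <- is.na(d)
  # Names of the k variables with most missing values
  top <- names(sort(colSums(m), decreasing = TRUE))[seq_len(min(k, ncol(d)))]
  # Label each record with the variables it is missing on,
  # e.g. "age+medication"; fully observed records get "(complete)"
  pattern <- apply(m[, top, drop = FALSE], 1, function(r) {
    if (any(r)) paste(sort(top[r]), collapse = "+") else "(complete)"
  })
  # Most frequent patterns first
  head(sort(table(pattern), decreasing = TRUE), n_patterns)
}

patterns <- miss_patterns(toy, k = 3)
patterns
```

If the deprivation and rurality measures are indeed derived from the same field, they would only ever appear together in the same pattern label, never in separate ones.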
With respect to the variable SurveyCompleted, we were interested in checking whether the survey version is associated with missingness, specifically of the postcode variable. The reason for this is that the entry box for the postcode changed over the three versions of the survey (1, 2, 3).
table(x$Surveycompleted, x$postcode == "")  # rows: survey version; columns: postcode empty?
##    
##     FALSE TRUE
##   1  1002  100
##   2   168    9
##   3   137   13
There appears to be no convincing evidence that postcode was missing (TRUE in the above table) in a consistently different manner after adjusting the postcode box in the survey (1 vs the rest). The proportions of missing postcodes were 9.07%, 5.08% and 8.67% for survey versions 1, 2 and 3, respectively.
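The share of missing postcodes within each survey version, computed as missing / (missing + non-missing), can be obtained from the counts in the table above. The sketch below re-enters those counts by hand. Pooling versions 2 and 3 to mirror the "1 vs the rest" comparison is an assumption about how that comparison is meant, and the chi-squared test is added for illustration only.

```r
# Counts copied from the table above: rows are survey versions,
# columns flag whether the postcode field was empty.
tab <- matrix(
  c(1002, 100,
     168,   9,
     137,  13),
  nrow = 3, byrow = TRUE,
  dimnames = list(survey = c("1", "2", "3"),
                  postcode_empty = c("FALSE", "TRUE"))
)

# Percentage of records with an empty postcode within each survey version
pct_missing <- round(100 * prop.table(tab, margin = 1)[, "TRUE"], 2)
pct_missing

# Version 1 against versions 2 and 3 pooled (assumed reading of "1 vs the rest")
pooled <- rbind(v1 = tab["1", ], v2_3 = colSums(tab[c("2", "3"), ]))
test_pooled <- chisq.test(pooled)  # 2x2 table: Yates' correction is applied by default
test_pooled
```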
A quick inspection of IPAQvalid, UCLAtotal and BPSSQtotal suggests that missingness on these variables is negligible (0%, 5.6% and 2.45%, respectively). Therefore, there seems to be no point in running a georeferenced analysis of completeness for these measures. [we need to discuss whether these were the variables to look at]
Until then, here is a rough map of the locations of the respondents, which can be useful to get an idea of the geographic spread. Note that respondents without a valid postcode cannot be represented on the map; 89.36% of the respondents either did not report a postcode or provided an invalid one.
## Warning: Ignoring unknown parameters: legend
## Warning: Removed 1277 rows containing non-finite values (stat_density2d).
## Warning: Removed 1277 rows containing missing values (geom_point).
The map indicates that the highest participation density comes (unsurprisingly) from the most densely populated areas. If of interest, it is possible to adjust for that and see if any difference still exists.
An unequivocal definition of what a representative sample is, with respect to a population, does not exist. In this situation, we will try to assess the balance of the sample with respect to a set of pre-defined population variables that we deemed important to be adequately represented.
These variables are sex, age, education level, deprivation (as measured by SIMD quintile), and rurality (as measured by the 6-fold Urban/Rurality index).
Age distribution, also by sex
Unfortunately, we do not possess population-level information for those at risk but below 60 years of age. This makes it impossible to compare the age distribution for this group directly. We observe a steep increase in responses from people aged 60+ as opposed to below 60, a reason for which was identified in the way the survey message was formulated, which explains the imbalance. We have, however, no way of knowing whether such an imbalance exists in the target population as well (the increase at the age of 60 is probably not as steep). For individuals aged 60 and above, the definition of target population coincides with that of the general population. It should be enough to look at the age pyramid of the general Scottish population to get an idea of how close the sample is to it. The following image comes from the 2018 NRS official report on Scotland’s population, available here. I have superimposed the red line at 60 years of age to facilitate interpretation; the focus is on the part of the plot above the line.
A visual comparison with the frequency by age of those aged 60+ in the sample appears to suggest that the distribution is preserved. While this is not a fully-fledged comparison of conditional distributions (which we would have no data to carry out properly), the shapes of the distributions seem comparable overall.
##                 rurality 60+ in population (%) 60+ in sample (%)
## 1      Large Urban Areas                  29.3              27.5
## 2      Other Urban Areas                  36.6              32.7
## 3 Accessible Small Towns                   9.4              12.3
## 4     Remote Small Towns                   4.3               2.9
## 5       Accessible Rural                  12.7              18.2
## 6           Remote Rural                   7.7               6.3
##
##   deprivation 60+ in population (%) 60+ in sample (%)
## 1           1                  16.6               8.1
## 2           2                  19.1              13.0
## 3           3                  21.3              20.9
## 4           4                  21.7              25.8
## 5           5                  21.2              32.2
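One way to condense each of the two comparisons above into a single number is the index of dissimilarity, that is, half the sum of the absolute differences between the sample and population shares. This is a descriptive summary introduced here for illustration, not an analysis carried out in this report. The sketch re-enters the percentages from the tables above, and `compare_shares()` is a hypothetical helper name.

```r
# Percentages copied from the two tables above (60+ only).
rurality <- data.frame(
  level      = c("Large Urban Areas", "Other Urban Areas",
                 "Accessible Small Towns", "Remote Small Towns",
                 "Accessible Rural", "Remote Rural"),
  population = c(29.3, 36.6, 9.4, 4.3, 12.7, 7.7),
  sample     = c(27.5, 32.7, 12.3, 2.9, 18.2, 6.3)
)
deprivation <- data.frame(
  level      = as.character(1:5),
  population = c(16.6, 19.1, 21.3, 21.7, 21.2),
  sample     = c(8.1, 13.0, 20.9, 25.8, 32.2)
)

# Hypothetical helper: percentage-point differences (sample minus population)
# and the index of dissimilarity, also in percentage points.
compare_shares <- function(d) {
  d$diff <- d$sample - d$population
  list(table = d, dissimilarity = sum(abs(d$diff)) / 2)
}

cmp_rurality    <- compare_shares(rurality)
cmp_deprivation <- compare_shares(deprivation)
cmp_deprivation$table
c(rurality = cmp_rurality$dissimilarity,
  deprivation = cmp_deprivation$dissimilarity)
```

The index can be read as the share of the sample that would have to change category for the two distributions to coincide, so larger values point to a larger imbalance. A formal goodness-of-fit test would require the underlying counts rather than the rounded percentages.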
[Figure panels: Relationship; Deprivation (quintiles); Rurality. Self-reported health status: EQ5D; Health rating. Physical activity: weekly minutes of light, moderate and vigorous physical activity, of walking, and of physical activity overall; level of physical activity; physical activity guidelines. Socialisation: UCLA; BPSSQ; social network size and social hours per week.]
UCLA loneliness total score: the observed range in the sample is [6,24].
Variable likelytoshield. In the plots: No - not likely to shield, Yes - likely to shield.
[Figure panels: Relationship; Deprivation (quintiles); Rurality. Self-reported health status: EQ5D; Health rating. Physical activity: weekly minutes of light, moderate and vigorous physical activity, of walking, and of physical activity overall; level of physical activity; physical activity guidelines. Socialisation: UCLA; BPSSQ; social network size and social hours per week.]
[DROPPING THE AGE GROUPING STRATIFICATION FROM NOW ON, as per mail today 27/10/2020]
[Figure panels: Sex (binary); Relationship; Rurality. Self-reported health status: EQ5D; Health rating. Physical activity: weekly minutes of light, moderate and vigorous physical activity, of walking, and of physical activity overall; level of physical activity; physical activity guidelines. Socialisation: UCLA; BPSSQ; social network size and social hours per week.]
Urban/rurality 3-fold index: 1 - rest of Scotland, 2 - accessible rural, 3 - remote rural.
[Figure panels: Sex (binary); Relationship; Deprivation. Self-reported health status: EQ5D; Health rating. Physical activity: weekly minutes of light, moderate and vigorous physical activity, of walking, and of physical activity overall; level of physical activity; physical activity guidelines. Socialisation: UCLA; BPSSQ; social network size and social hours per week.]
[Figure panels: Relationship; Deprivation (quintiles); Rurality. Self-reported health status: EQ5D; Health rating. Physical activity: weekly minutes of light, moderate and vigorous physical activity, of walking, and of physical activity overall; level of physical activity; physical activity guidelines. Socialisation: UCLA; BPSSQ; social network size and social hours per week.]
UCLA loneliness total score: the observed range in the sample is [6,24].
Unless there is a specific interest in a given three-way association, most of the two-way associations can be deduced from the previous analyses. For anything more specific than that, just ask.
[Figure panels: Light PA; Moderate PA; Vigorous PA; Walking; Total PA; All together.]
The average EQ5D score decreases steeply with the amount of physical activity of all kinds up to ~500 reported minutes, then decays slowly. The y axis range is truncated for readability; the observed range of the EQ5D score is 5 to 13, as evident from the previous plots.
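The smoothing method behind the plots is not stated in this report. As an illustration of how an average trend of this kind can be estimated, the sketch below fits a loess smoother to simulated data. Everything in it is an assumption made for the example: the variable names `pa_minutes` and `eq5d`, the simulated relationship, and the `span` value. None of the numbers come from the survey.

```r
set.seed(42)

# Simulated data, for illustration only: a score that drops quickly with
# weekly minutes of activity and then flattens out, plus noise.
n <- 300
pa_minutes <- runif(n, 0, 1500)
eq5d <- 6 + 3 * exp(-pa_minutes / 250) + rnorm(n, sd = 0.8)
sim <- data.frame(pa_minutes = pa_minutes, eq5d = eq5d)

# Local regression (loess) of the score on minutes of activity
fit <- loess(eq5d ~ pa_minutes, data = sim, span = 0.75)

# Smoothed average on a regular grid inside the observed range
# (by default, loess does not extrapolate beyond the data)
grid <- data.frame(pa_minutes = seq(50, 1450, by = 100))
grid$smoothed <- predict(fit, newdata = grid)
grid
```

A smaller `span` follows the data more closely at the cost of a wigglier curve, so it is worth checking that a change of slope around a given value does not depend on that choice.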
[Figure panels: Light physical activity; Moderate physical activity; Vigorous physical activity; Walking; Sitting; Strength training; Screen time.]
Variables considered: x$COVIDyou, x$COVIDclose, x$Covid_othersyesorno.
[The small numbers in some of the categories make running all comparisons impractical. As noted for question 12, can we reduce it a bit and focus more on certain aspects?]
Health variables: healthcond, comorbid_bi.
Sleep guideline category
##    
##       1   2   3   4   5
##   1 129 309 494  33  11
##   2  27  85 110   7   2
##   3  11  25  43   2   0
## Warning in chisq.test(table(x$SIMDurbrural3fold, x$sleepguideline)): Chi-squared
## approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: table(x$SIMDurbrural3fold, x$sleepguideline)
## X-squared = 3.66, df = 8, p-value = 0.8864
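The warning above is raised because some expected cell counts are small, which makes the asymptotic chi-squared approximation unreliable. One option, sketched below, is to obtain the p-value by Monte Carlo simulation via the `simulate.p.value` argument of `chisq.test()`. The counts are re-entered by hand from the table above; the seed and the number of replicates `B` are arbitrary choices made for this example.

```r
# Counts copied from the table above:
# rows = SIMDurbrural3fold (1-3), columns = sleepguideline (1-5).
tab <- matrix(
  c(129, 309, 494, 33, 11,
     27,  85, 110,  7,  2,
     11,  25,  43,  2,  0),
  nrow = 3, byrow = TRUE,
  dimnames = list(SIMDurbrural3fold = as.character(1:3),
                  sleepguideline    = as.character(1:5))
)

# Asymptotic test, as in the output above (warning suppressed here)
asym <- suppressWarnings(chisq.test(tab))
min(asym$expected)  # smallest expected count; small values trigger the warning

# Same statistic, p-value from simulated tables with the observed margins
set.seed(1)
sim <- chisq.test(tab, simulate.p.value = TRUE, B = 10000)
sim
```

Merging the sparsest sleep guideline categories before testing would be another way to address the warning, provided the merged categories remain meaningful.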
##    
##       1   2   3   4   5
##   1  34  41  41   4   1
##   2  24  59  82   8   3
##   3  37  81 140   7   2
##   4  39 111 159  10   3
##   5  33 127 225  13   4
## Warning in chisq.test(table(x$SIMDquintile, x$sleepguideline)): Chi-squared
## approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: table(x$SIMDquintile, x$sleepguideline)
## X-squared = 41.966, df = 16, p-value = 0.0003992
## 
##   1   2   3   4 
## 210 761 397  61 