Introduction

This document constitutes a preliminary report on the analyses carried out on the last available version (as of Thu Nov 05 10:13:58 2020) of the survey dataset, collated from the questionnaires collected over the study period and curated by Simone Tomaz, Anna Whittaker, Gemma Ryde. This is a work in progress and will be constantly updated to reflect the current state of the research and to incorporate feedback from the team. Ultimately, this document should provide the backbone to write up a fully-fledged and rigorous analysis report.

Following the desiderata in the shared tasks document, the following (slightly adapted) list of questions has been identified as of interest and shall be addressed here.

  1. With respect to missing data, is there any indication as to which mechanisms underlie the process? Also, look into differences in missingness by SurveyCompleted (encoding which survey was completed).

  2. Produce data completeness heat maps for IPAQ, UCLA, and BPSSQ scores.

  3. Can the sample be deemed representative of its parent population, with respect to the basic set of variables identified in the design phase?

  4. Do the \(70+/<70\) and \(60+/<60\) groups differ (and if they do, how) in terms of i) socio-demographics, ii) self-reported health status, iii) physical activity, and iv) socialisation?

  5. Consider the analyses in 4, stratified by sex.

  6. Consider the analyses in 4, stratified by likelytoshield. Also look into differences in UCLA, BPSSQ.

  7. Consider the analyses in 4, stratified by quantiles of deprivation.

  8. Is there any suggestion of an association between urban/rural index (3-fold) and all outcomes?

  9. Is there any suggestion of differences between carer and non-carer, stratified by age group, in all outcomes?

  10. Is there evidence of association between socio-demographic variables and health variables?

  11. Does current physical activity relate to current wellbeing (EQ5D)?

  12. How are change variables related to each other?

  13. Do changes in physical activity associate to the level of physical activity undertaken?

  14. Does Change in Social contact/loneliness relate to level of BPSSQ/UCLA now?

  15. Do COVID variables relate to any other outcomes and/or differ across the age groups?

  16. Explore the sleep variables, also with respect to health variables

Analysis

The study dataset encompasses 1429 unique observations (subjects) on 157 variables. The observations were collected online, using a questionnaire developed for the occasion, inclusive also of a number of pre-existing, separately validated tools.

Data collection: the questionnaire was advertised on social media (Facebook, Twitter) and through other channels [specify which ones] administered over the period [fill in details]. [expand]

Data preprocessing: [expand]

1. With respect to missing data, is there any indication as to which mechanisms underlie the process? Also, look into differences in missingness by SurveyCompleted (encoding which survey was completed).

To be able to properly take action with respect to gaps in information (missing data), it is paramount to achieve some level of understanding of the underlying process that led to incomplete observations. A useful framework to adopt when discussing missingness was proposed by Rubin (1976), who identified three possible mechanisms responsible for it:

  • Missing Completely at Random (MCAR, the probability to be missing is unrelated to observed and unobserved information)
  • Missing at Random (MAR, the probability to be missing may depend on observed information)
  • Missing Not at Random (MNAR, the probability to be missing may also depend on unobserved information).

Without going into too much detail, Rubin’s framework provides a simple way to assess the validity of many commonly used missing data handling methods, under the three mechanisms above. Broadly speaking, we would like to be dealing with MCAR (hardly ever occurring in practice, but the only situation where listwise and pairwise deletion - as well as mean imputation - do not affect the validity of inference). In practice, however, MAR and MNAR are far more likely situations, both calling for increased care in how missing observations are treated. As no formal statistical testing exists to choose from the three possibilities, inspection of patterns of missingness in the data (also with the aid of visual instruments) paired with expert knowledge is typically the most robust strategy.

The dataset contains 8753 missing observations, which is ~3.9% of the total (1429 rows \(\times\) 157 columns \(=\) 224353). Approximately 52.87% of the variables contain at least one missing value, as do ~97.69% of the individual records. If of interest, it is possible to inspect the frequency tables of each of the variables and individuals.

The relatively low missingness rate and large number of involved variables and individuals, would suggest that the observations are missing in a relatively sparse way. There is, however, no guarantee that local clusters of missing patterns are not occurring. A simple way to get an idea of occurrence of patterns, is to produce a visualisation of the whole dataset after recoding it to a missingness indicator matrix (where each values maps to either 0 - not missing - or 1 - missing), possibly sorting the variables from highest to lowest number of missing observations. The next figure contains such visualisation.

vis_miss(x,sort_miss=TRUE)

Each of the vertical grey ticks indicates one column (variable), whereas each row is a subject. While pretty crowded (we have 157 variables in the dataset), it helps provide an overall look at missingness. What emerges, is that the overall missingness is low (as observed earlier, ~3.9%), and it appears clear that most of those missing values are specific to a small number of variables. It is certainly of interest to focus on these, to see if any noticeable pattern arises.

In order to do so, a very useful visual tool is the upset plot. Originally developed to depict intersections between sets (think of it as visualising a contingency table with some additional information on top), it can be used to compactly highlight where the missing observations occur in a dataset, and what combinations of missingness across a set of variables are more frequent. What follows is a missingness upset plot for the 10 variables with the largest amount of missing observations over up to 30 most frequent missingness patterns across them.

The 10 variables considered account for ~78.82% of the total missingness. With 46.54% missing values, medication is by large the variable with the highest level of missingness. Inspecting the upset plot reveals that pintsbeer, shotsspirits and income closely follow (26.38%, 17.42% and 16.66%, respectively), often presenting joint missingness. While the most frequently observed are medication and income missing alone, combinations of missing values on these four variables account for a large part of the remaining the observed patterns. It might be worth looking into why this is happening: is the questionnaire designed in such a way that non-drinkers (for example) would tend not to report on alcohol consumption, rather than just indicating ‘zero’ as a response? How is the variable medication linked to alcohol consumption? Are respondents reluctant to report using alcohol in conjuction with medications? The next set of variables that present some level of missingness include deprivation and rurality measures. Quintiles, vigintiles and ranks of SIMD (and similarly rurality measures) are computed based on the same piece of information (possibly the postcode?), hence missingness is expected to be shared across them. Once this will be confirmed, it might be of interest to produce another upset plot, removing the redundant information. This will be important in order to better understand where exactly the missing data occurs, and whether there is a suspect that it can be linked to specific characteristics of the individuals (which would then warrant an additional piece of analysis).

With respect to the variable SurveyCompleted, we were interested in checking whether the particular type survey is found to be associated with missingness specifically of the postcode variable. The reason for this is that the entry box for the postcode changed over the three version of the survey (1, 2, 3)

table(x$Surveycompleted,x$postcode=='')
##    
##     FALSE TRUE
##   1  1002  100
##   2   168    9
##   3   137   13

There appears to be no convincing evidence that postcode was missing (TRUE in the above table) in a consistently different manner after adjusting the postcode box in the survey (1 vs the rest). The relative proportions of missingness were as follows: 9.98%, 5.36%, 9.49%).

2. Produce data completeness heat maps for IPAQ, UCLA, and BPSSQ scores.

A quick inspection of IPAQvalid, UCLAtotal and BPSSQtotal suggests that missingness on these variables is negligible (0%, 5.6% and 2.45%, respectively). Therefore, there seems to be no point in running a georeferenced analysis of completeness for these measures. [we need to discuss about whether these were the variables to look at]

Until then, here is a rough map of the locations of the respondents, which can be useful to get an idea of the geographic spread. Note that not all locations possessed valid postcodes, hence cannot be represented on the map (89.36% of the respondents either did not report a postcode or provided an invalid one).

## Warning: Ignoring unknown parameters: legend
## Warning: Removed 1277 rows containing non-finite values (stat_density2d).
## Warning: Removed 1277 rows containing missing values (geom_point).

The map indicates that the highest participation density comes (unsurprisingly)from the most densely populated areas. If of interest, it is possible to adjust for that and see if any difference still exists.

3. Can the sample be deemed representative of its parent population, with respect to the basic set of variables identified in the design phase?

An unequivocal definition of what a representative sample is, with respect to a population, does not exist. In this situation, we will try and assess the balance of the sample with respect to a set of pre-defined population variables, that we deemed important to be adequately represented.

These variables are sex, age, education level, deprivation (as measured by SIMD quintile), and rurality (as measured by the 6-fold Urban/Rurality index).

Age distribution, also by sex

Unfortunately, we do not possess information at the population level for those at risk, but below 60 years of age. This makes it impossible to compare the age distribution for those directly. We observe a steep increment in responses from people 60+ as opposed to below 60, a reason for which was identified in the way the survey message was formulated, thus justifyin the imbalance. We have, however, no way to know whether such imbalance exists in the target population as well (probably the increment is not so steep the age of 60). For individuals above the age of 60+, the definition of target population coincides with that of general population. It should be enough to look at the pyramid of age for the general Scottish population to get an idea of how close the sample is to them. The following image comes from the 2018 NRS official report on Scotland’s population, available here. I have overimposed the red line at 60 years of age to facilitate interpretation, focus is on the part of the plot above the line.

A visual comparison with the frequency by age of 60+ in the sample appears to suggest that the distribution is preserved. While not a fully-fledged comparison of conditional distributions (which we would have no data to carry out properly), the shape of the distributions seem overall comparable.

##                 rurality 60+ in population 60+ in sample
## 1      Large Urban Areas              29.3          27.5
## 2      Other Urban Areas              36.6          32.7
## 3 Accessible Small Towns               9.4          12.3
## 4     Remote Small Towns               4.3           2.9
## 5       Accessible Rural              12.7          18.2
## 6           Remote Rural               7.7           6.3
##   deprivation 60+ in population 60+ in sample
## 1           1              16.6           8.1
## 2           2              19.1          13.0
## 3           3              21.3          20.9
## 4           4              21.7          25.8
## 5           5              21.2          32.2

4. Do the \(70+/<70\) and \(60+/<60\) groups differ (and if they do, how) in terms of i) socio-demographics, ii) self-reported health status, iii) physical activity, and iv) socialisation?

Socio-demographics by age group

Sex (binary)

While no striking differences can be observed when stratifying by <60/60+, there seems to be a slight imbalance in the sex distribution when considering the <70/70+ categorisation. Specifically, there appear to be proportionately more males (sex==1) than females in the latter group.

Relationship

Where 1 is single never married, 2 is divorced or widowed, 3 is relationship/married living apart, and 4 is married/relationship cohabiting. The frequency distribution of relationship type in the sample seem to differ depending on the age categorisation adopted. It is, however, important to note that this is likely due to the fact that the ‘youngs’ in the two groups (white bars in the above plot) have very different numbers, specifically they constitute ~16.11% of the group <60/60+, and are the ~74.37% in the <70/70+. So what we observe is some sort of ‘switching’, where most of the people in the ‘olds’ category in the first group, become classified as ‘youngs’ in the second. Looking back at the overall distribution of age in the sample

here depicted via kernel density estimate (top panel) and boxplot (bottom panel), and with the threshold ages of 60 and 70 highlighted (dashed lines), it appears immediately evident that approximately half of the sample lies between 60 and 70 years of age, hence the switch. All this to say, we should not be tempted to compare across the two type of groupings, rather within the groupings separately.

In a sense, the ‘youngs’ and ‘olds’ groups swap roles in the two stratifications. A better approach might then be to compare the conditional distributions by </+ within each grouping as follows.

At a glance, there seem to exist differences in the distribution of relationship status, conditional on being ‘young’ or ‘old’, within each of the two groupings.

Deprivation (quintiles)

In analogy with what done with the relationship status, we can inspect the conditional distributions of deprivation quintiles (where 1 indicates least deprived, and 5 most deprived) for the sample.

There does not appear to be any striking difference between ‘youngs’ and ‘olds’ under either grouping. If anything, considering the <60/60+ dichotomy, more deprived individuals seem to be relatively more represented in the 60+ group than in its <60 counterpart.

Rurality

Rurality is assessed using the 3-fold Urban/Rural Classification, where 3 indicates remote rural, 2 indicates accessible rural, and 1 indicates rest of Scotland; for the official definitions, see here.

Visual inspection does not suggest remarkable differences in distributions. As expected from the high response rate across the central belt (highlighted previously by the response density map), most of the sample is located in postcodes associated to 1 (settlements of 3,000 or more people).

Self-reported health status

EQ5D

[add something on EQ5D here]

No overall remarkable differences. Distributions are a bit more similar in the <70/70+ grouping than they are in the <60/60+. In every case, we observe a marked positive skew, with most of the sample reporting values on the lowest categories.

Health rating

[add something on health rating]

Very similar distributions throughout. In the <60/60+ group, it would appear than younger people were more likely to report worse health the older group; no such difference is observable in the <70/70+ group.

Physical activity

The survey collected measurements on self-reported weekly time spent doing physical activity (PA). He, we look at differences between ‘youngs’ and ‘olds’, by age groups, for each category of PA (light, moderate, vigorous, total, walking) and by classification of activity (PAcat) or meeting PA guidelines (strengthPAguidelines)

Weekly minutes of light physical activity

Weekly minutes of moderate physical activity

Weekly minutes of vigorous physical activity

Weekly minutes of walking

Weekly minutes of physical activity

The total weekly physical activity is obtained by summing the reported activity for moderate PA (weeklyMPAmin), vigorous PA (weeklyVPAmin) and walking (weeklyWALKmin).

Level of physical activity

Within the comparison <60/60+, the ‘youngs’ appear to be more uniform than the ‘olds’ (more likely to be classifed as moderately or highly active). This difference does not seem to be present when looking at the <70/70+ dichotomisation.

Phyical activity guidelines

There appear to be no relevant differences in the distribution of adherence to guidelines - which is very low in the sample - conditional to age grouping, within the two comparisons.

Socialisation

UCLA

UCLA loneliness total score. The observed range in the sample is [6,24].

No overall striking differences can be observed in the <70/70+ grouping. Similarly to what happened with self reported health, however, the ‘youngs’ in the <60/60+ comparison were slightly more likely to report higher loneliness than the ‘olds’.

BPSSQ

BPSSQ perceived social support total score. The observed range in the sample is [6,30].

No overall striking differences can be observed in the <70/70+ grouping. Similarly to what happened with self reported health and loneliness, however, the ‘youngs’ in the <60/60+ comparison were slightly more likely to report lower perceived social support than the ‘olds’.

Social network size and social hours per week

[not sure whether self-reported or computed based on other information?]. The observed range in the sample for social network size is [0,100], while for social hours per week is [0,84]. The next plot shows both the joint and the marginal distributions (trimmed of the highest \(1\%\) of observations for better visualisation).

There appears to be little evidence of association between the two variables, that is to say, a larger social network size seems to be only loosely associated with more social hours reported (this is an unadjusted analysis, differences may come up when considering relationship status).

Visual inspection of the conditional distributions by age within the two groupings (carried out, again, on the trimmed distributions), shows absence of substantial departures: ‘youngs’ and ‘olds’ have reported overall similar social network sizes and social hours per week, with a slightly (but not remarkably so) increased social network size and activity being reported by the ‘olds’.

5. Consider the analyses in 4, stratified by sex.

Relationship

Deprivation (quintiles)

Rurality

Self-reported health status

EQ5D

Health rating

Physical activity

Weekly minutes of light physical activity

Weekly minutes of moderate physical activity

Weekly minutes of vigorous physical activity

Weekly minutes of walking

Weekly minutes of physical activity

Level of physical activity

Phyical activity guidelines

Socialisation

UCLA

UCLA loneliness total score. The observed range in the sample is [6,24].

BPSSQ

Social network size and social hours per week

6. Consider the analyses in 4, stratified by likelytoshield.

Relationship

Variable likelytoshield. In the plots: No - not likely to shield, Yes - likely to shield.

Deprivation (quintiles)

Rurality

Self-reported health status

EQ5D

Health rating

Physical activity

Weekly minutes of light physical activity

Weekly minutes of moderate physical activity

Weekly minutes of vigorous physical activity

Weekly minutes of walking

Weekly minutes of physical activity

Level of physical activity

Phyical activity guidelines

Socialisation

UCLA

BPSSQ

Social network size and social hours per week

7. Consider the analyses in 4, stratified by quantiles of deprivation.

[DROPPING THE AGE GROUPING STRATIFICATION FROM NOW ON, as per mail today 27/10/2020]

Sex (binary)

Relationship

Rurality

Self-reported health status

EQ5D

Health rating

Physical activity

Weekly minutes of light physical activity

Weekly minutes of moderate physical activity

Weekly minutes of vigorous physical activity

Weekly minutes of walking

Weekly minutes of physical activity

Level of physical activity

Phyical activity guidelines

Socialisation

UCLA

BPSSQ

Social network size and social hours per week

8. Is there any suggestion of an association between urban/rural index (3-fold) and all outcomes?

Urban/rurality 3-fold index 1 - rest of Scotland, 2 - accessible rural, 3 - remote rural

Sex (binary)

Relationship

Deprivation

Self-reported health status

EQ5D

Health rating

Physical activity

Weekly minutes of light physical activity

Weekly minutes of moderate physical activity

Weekly minutes of vigorous physical activity

Weekly minutes of walking

Weekly minutes of physical activity

Level of physical activity

Phyical activity guidelines

Socialisation

UCLA

BPSSQ

Social network size and social hours per week

9. Is there any suggestion of differences between carer and non-carer, stratified by age group, in all outcomes?

Relationship

Deprivation (quintiles)

Rurality

Self-reported health status

EQ5D

Health rating

Physical activity

Weekly minutes of light physical activity

Weekly minutes of moderate physical activity

Weekly minutes of vigorous physical activity

Weekly minutes of walking

Weekly minutes of physical activity

Level of physical activity

Phyical activity guidelines

Socialisation

UCLA

UCLA loneliness total score. The observed range in the sample is [6,24].

BPSSQ

Social network size and social hours per week

10. Is there evidence of association between socio-demographic variables and health variables?

Unless there is a specific interest in a given three-way association, most of the two-way can be deduced by the previous analyses. Anything more specific than that, just ask.

11. Does current physical activity relate to current wellbeing (EQ5D)?

Light PA

Moderate PA

Vigorous PA

Walking

Total PA

All together

The average EQ5D score steeply decreases with the amount of physical activities of all kind up to ~500 reported minutes, then slowly decays. The y axis range is truncated for readability, the observed range of the EQ5D score is 5 to 13, as evident from the previous plots.

13. Do changes in physical activity associate to the level of physical activity undertaken?

Light physical activity

Moderate physical activity

Vigorous physical activity

Walking

Sitting

Strength training

Screen time

14. Does change in social contact/loneliness relate to level of BPSSQ/UCLA now?

Change in social support

Another simple visualisation is possible (kernel density estimates of the conditional distributions, as used previously). Just let me know which one you prefer. I’ll stick with boxplots in this section for a change, but they can be readily swapped around.

Social contact number of people change

Social contact frequency change

Social contact time change

Social support quality satisfaction change

Giving social support change

Social contact amount change

Loneliness change since social distancing

15. Do COVID variables relate to any other outcomes and/or differ across the age groups?

x$COVIDyou, x$COVIDclose, x$Covid_othersyesorno

[The small numbers in some of the categories make running all comparisons not reasonable. As said for question 12, can we reduce it a bit and focus more on certain aspects?]

16. Explore the sleep variables, also with respect to health variables (healthcond, comorbid_bi)

Sleep guideline category

##    
##       1   2   3   4   5
##   1 129 309 494  33  11
##   2  27  85 110   7   2
##   3  11  25  43   2   0
## Warning in chisq.test(table(x$SIMDurbrural3fold, x$sleepguideline)): Chi-squared
## approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  table(x$SIMDurbrural3fold, x$sleepguideline)
## X-squared = 3.66, df = 8, p-value = 0.8864

##    
##       1   2   3   4   5
##   1  34  41  41   4   1
##   2  24  59  82   8   3
##   3  37  81 140   7   2
##   4  39 111 159  10   3
##   5  33 127 225  13   4
## Warning in chisq.test(table(x$SIMDquintile, x$sleepguideline)): Chi-squared
## approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  table(x$SIMDquintile, x$sleepguideline)
## X-squared = 41.966, df = 16, p-value = 0.0003992

## 
##   1   2   3   4 
## 210 761 397  61