HTML Link: https://rpubs.com/junekam17/verificationreport
Paper: Smith L., Mottershaw A., Egan M., Waller J., Marteau T., et al. (2020) The impact of believing you have had COVID-19 on self-reported behaviour: Cross-sectional survey. PLOS ONE 15(11): e0240399. https://doi.org/10.1371/journal.pone.0240399 pmid:33147219
Directory
COVID-19 was, and remains, a great cause for concern in public health, and the primary method we have used to combat its effects is the presence of lockdown laws and government mandates. Adherence to these measures is essential in decreasing transmission rates, saving countless lives. However, those who believe they have already had COVID-19 may not adhere to these measures as they believe they have immunity against the virus. This may be problematic due to potential false test results and inaccurate self-diagnosis, but primarily because our knowledge regarding immunity is limited; it remains unknown whether we can be infectious more than once (Waller et al (2020)). For this reason, it is important to know the factors that may affect adherence to lockdown laws could inform us on targeted interventions to reduce them; however, at time of publication, there was no literature about whether belief of already having COVID-19 affects adherence to lockdown measures or attitudes towards it.
Smith et al (2020) investigated whether this belief affected self-reported behaviour, in terms adherence to lockdown measures, and beliefs early in the pandemic by employing an online cross-sectional survey of 6149 participants. Participants were asked how worried they were about COVID-19, whether they believed they had already had it, whether they believed they had immunity to it, and the extent to which they thought COVID-19 posed a risk to themselves and to others. They were also asked to indicate the number of outings they had for various reasons, both government-mandated as well as not, as well as total number of outings. They also recorded participants’ social and demographic characteristics.
They found that those who believed they already had COVID-19 were more likely to believe they had immunity, and also less likely to adhere to social distancing in that they had a higher total number of outings. They were also less likely to be able to identify symptoms. Surprisingly, there was no correlation between belief of having it and perceived risk. There were several limitations, such as complete reliance on self-report measures and the cross-sectional design prevented researchers from determining the direction of associations. They also noted that the study may have been overpowered, in that the large sample size meant even small differences were statistically significant. However, it may have important implications in guiding educational communications targeted at those who believe they have had COVID-19 regarding immunity (or lack thereof) and the importance of continuing to adhere to lockdown measures.
I can see that this relates to real world public health and safety implications, particularly relevant in this case as public attitudes guide social norms regarding compliance, which in turn affect the rate of transmission. It could also have important implications in public health policy, which could impact the lives of the population in terms of health as well as economic and social wellbeing. As Smith et al discuss, it may lead to a vicious cycle in which belief in having had coronavirus leads to lower adherence, leading to lowered social expectations, further decreased adherence and an even larger proportion of the population believing they have caught the disease due to increased perceived exposure. It reminds me of a paper by Waller et al (2020), which investigated the impact of ‘immunity passports’ and found that participants who underwent an ‘immunity’ test instead of an ‘antibody’ test were more likely to perceive no risk of catching coronavirus. This forces consideration of the connotations and public perceptions of the concept of ‘immunity’. Following from this would be consideration of the Health Belief Model (Acosta & Wolfe, 2008), typically used to guide disease prevention programs. The model predicts individual health-related behaviours based on the drivers of cues to action and perceived risk, severity, benefits, self-efficacy, and barriers to action. This study could definitely be used in conjunction with the Model.
I was surprised that there was little evidence for the association between belief of having had COVID-19 before and perceived risk of it, particularly as it contradicts existing evidence that increased risk perception is associated with greater protective action (Brewer et al, 2004) – in this case, adhering to lockdown measures. However, their reckless behaviour instead indicated a lack of perceived risk. It seems logical that this belief would be positively correlated with both worry and perceived risk – what makes this factor stand out? Although the authors noted that this seemed unusual, they did not offer any alternatives. Though perhaps it may have been an issue with the power of the study – but it was overpowered, meaning it would have been in the opposite direction. Another potential factor may be the difference in scales (worry was measured on a scale of 1-5 whereas perceived risk on a scale of 1-4), or maybe an error in the self-report and perception of the ‘perceived risk’ questions themselves.
It seems that the next step in this area of research would be to investigate the sources of misinformation and miscommunication that led to such a high proportion of those believing they had COVID-19 to believe they were immune. It would be worth developing ways to minimise or mitigate these sources as well as investigating ways to effectively communicate with this subset of the population surrounding the importance of adhering with lockdown measures. Since the publication of this paper, Gonzalez-Castro et al (2021) have also studied the impact of perceived vulnerability predicting adherence to measures and found that high perceived vulnerability was linked to higher adherence. Although using the opposing measure of vulnerability rather than immunity, this lines up with Smith et al’s results. This could also be further explored by investigating which subsets of the population (eg. Age groups, education level, those with children) are most affected by the phenomenon, allowing targeted interventions. Finally, the factor of ‘perceived risk’ should be further investigated – why was it not correlated with the other measures? Is it an internal or external issue?
We obtained the open data from the OSF repository. Our goal was to reproduce the following 4 figures:
Figure 1: Table of Participant Characteristics and Self-Reported Beliefs on COVID-19
Figure 2: Plot of Results
Figure 3: Table of Results
Figure 4: Table of Adherence to Lockdown Measures
We attempted to reproduce Figure 1 first as published in the study.
We first loaded the packages to be used throughout our entire reproducibility journey.
library(tidyverse) #for most of the important functions
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.7 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(dplyr) # more important functions
library(gt) #making tables
library(gtExtras) #more tables
library(janitor) #wrangling
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(patchwork) #plots
library(data.table) #renaming
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
## The following object is masked from 'package:purrr':
##
## transpose
library(gtExtras) #for merging data
library(readr)
library(glue)
library(data.table) #table formatting
library(scales) #more table formatting
##
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
##
## discard
## The following object is masked from 'package:readr':
##
## col_factor
library(tidyr) #to tidy messy data
library(knitr) #to knit to word/pdf later
The data in the OSF repo was in the .sav format from SPSS. Though it
was possible to work in this format, we found it easier to use the .csv
as it told us what each of the variables meant - therefore, we used
read_csv() to achieve this. This is from the readr package,
contained within the tidyverse package. We named the data “data1”.
setwd('/Users/junekam/downloads')
data <- read_csv("Dataset - impact of believing you've had covid-19.csv")
## Rows: 6149 Columns: 19
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (19): Ever_covid, q7beentested, q8haveimmunity, q9worry, q10arisk, q10br...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
After reading the data, we had to clean it up so it was easier for us to work with. This included (all from the tidyverse package):
rename() to rename the variablesselect() to select the relevant variablesmutate() to change the response codes from
numbers to their meaningsFinally, we determined which code corresponded to which response by
using the tabyl() function from the janitor package, which
counts the frequency of respondents, and matching accordingly.
data1 <- data %>%
rename(think_covid = Ever_covid,
Age = age_categories,
Gender = gender) %>%
select(think_covid, Age, Gender, Age, degree, Has_child, Working, Key_worker, region) %>%
mutate(Gender = case_when(Gender == 1 ~ "Male",
Gender == 2 ~ "Female")) %>%
mutate(think_covid = case_when(think_covid == 1 ~ "Have COVID-19",
think_covid == 0 ~ "Have Not Had COVID-19")) %>%
mutate(degree = case_when(degree == 0 ~ "GSCE/vocational/A-level/No formal qualifications",
degree == 1 ~ "Degree or higher (Bachelors, Masters, PhD)")) %>%
mutate(Age = case_when(Age == 1 ~ "18 to 24 years",
Age == 2 ~ "25 to 34 years",
Age == 3 ~ "35 to 44 years",
Age == 4 ~ "45 to 54 years",
Age == 5 ~ "55+")) %>%
mutate(Has_child = case_when(Has_child == 0 ~ "No",
Has_child == 1 ~ "Yes")) %>%
mutate(Working = case_when(Working == 0 ~ "Not Working",
Working == 1 ~ "Working")) %>%
mutate(Key_worker = case_when(Key_worker == 0 ~ "No",
Key_worker == 1 ~ "Yes")) %>%
mutate(region = case_when(region == 1 ~ "Midlands",
region == 2 ~ "South & East",
region == 3 ~ "North",
region == 4 ~ "London",
region == 5 ~ "Wales, Scotland and Northern Ireland"))
We were unsure what the ‘condition_antibodies’ variable was referring to, considering that it sounded like it was related to participant belief in their own immunity - however, there was already another immunity variable. Although the authors made the small note of ‘assigned to experimental condition’, this didn’t give us much information, and it’s worth noting that it definitely could have been described in more detail. As a result of this, we didn’t include it in our reproducibility journey.
The next (painstaking) step was to create a data frame for each demographic characteristic against the ‘think_covid’ variable that was the primary focus of the study and hence the main DV, referring to whether or not the participant believed they had caught COVID-19 before. Before we knew how to deal with R efficiently, we made this easy task seem much harder than it was - here’s my first attempt at accomplishing this.
data2 <- data1 %>%
tabyl(Age, think_covid)
tab_2 <- data2 %>% gt()
tab_2
| Age | Have COVID-19 | Have Not Had COVID-19 |
|---|---|---|
| 18 to 24 years | 419 | 1003 |
| 25 to 34 years | 400 | 823 |
| 35 to 44 years | 294 | 751 |
| 45 to 54 years | 164 | 554 |
| 55+ | 216 | 1525 |
That seemed fine… except for the fact that at the time, I didn’t know
how to combine all these small gt tables I had. But the use of
tabyl() definitely set my teammates and I on the right
track!
An example of this with the “gender” variable:
datagender2 <- data1 %>%
select(Gender, think_covid) #creating the data frame and selecting the relevant variables
tgender <- datagender2 %>%
tabyl(Gender, think_covid) #used to count no. of respondents in these variables, to be included later in the final table
gtgender <- gt(tgender) #making it into a nice-looking table
datagender <- gtgender$'_data' %>% #allows us to merge gt tables
arrange(factor(Gender, levels = c('Male','Female'))) #reordering the categories to recreate the figure
gtgender
| Gender | Have COVID-19 | Have Not Had COVID-19 |
|---|---|---|
| Female | 796 | 2459 |
| Male | 697 | 2197 |
As you may have noticed, there was a line in that chunk that is labeled with ‘allows us to merge gt tables’. This was a struggle for us for a long time, but thankfully with a lot of Googling and trying to apply our situation to that of the Stackoverflow questions, we found the solution.
Let’s compare this to the original figure:
Great, all the numbers looked right - it was just the formatting that
was different, but we could fix that later. We repeated this process for
the age, child, employment, key sector, education, and region variables.
As we only included in our data set those that were eligible and
provided responses, we excluded those that provided an “NA” response -
for this, we used filter != "NA" to filter out these
participants. We did this for the child, employment, and education
variables. We also used arrange(factor()) to arrange the
levels of factors that appeared in the wrong order, changing them to
appear as they did in the figure.
#participants by age
data_Age <- data1 %>%
select(Age, think_covid)
tage <- data_Age %>%
tabyl(Age, think_covid)
gtage <- gt(tage)
dataage<- gtage$'_data'
#by child
datachild <- data1 %>%
select(Has_child, think_covid) %>%
filter(Has_child != "NA")
tchild <- datachild %>%
tabyl(Has_child, think_covid)
gtchild <- gt(tchild)
datachild <- gtchild$'_data'
#employment status
dataworking <- data1 %>%
select(Working, think_covid) %>%
filter(Working != "NA")
tworking <- dataworking %>%
tabyl(Working, think_covid)
gtworking <- gt(tworking)
dataworking <- gtworking$'_data'
#working key sector
tkey <- data1 %>%
tabyl(Key_worker, think_covid)
gtkey <- gt(tkey)
datakey <- gtkey$'_data'
#highest educational or professional qualification
datadegree <- data1 %>%
select(degree, think_covid) %>%
filter(degree != "NA")
tdegree <- datadegree %>%
tabyl(degree, think_covid)
gtdegree <- gt(tdegree)
datadegree <- gtdegree$'_data' %>%
arrange(factor(degree, levels = c('GSCE/vocational/A-level/No formal qualifications','Degree or higher (Bachelors, Masters, PhD)')))
#region
tregion <- data1 %>%
tabyl(region, think_covid)
gtregion <- gt(tregion)
dataregion <- gtregion$'_data' %>%
arrange(factor(region, levels = c('Midlands', 'South & East', 'North', 'London', 'Wales, Scotland and Northern Ireland')))
It definitely wasn’t easy for us to get to this point - very common was the error “Column - Gender (or whichever variable was deciding to act up that day) doesn’t exist”. Though this was extremely frustrating, since we knew that it definitely did exist, it was one of the first lessons we learnt about troubleshooting: it’s usually something small, and could be as small as a typo, or a lowercase letter where a capital should be, or that we started a new data set or rscript and forgot to rename the variable. We learnt to be very careful with reading through our code because there were a million small errors that could easily sneak past.
Next, since the table also displayed percentages, we had to calculate these for each variable as well. These are in-built functions in R.
perGender <- with(data1, table(Gender, think_covid)) #recreates the data frame in a table that allows us to evaluate expressions - in this case percentage
pGender <- prop.table(perGender, 1)*100 #turning the table of counts into a table of percentages
prop.table() created proportion counts, so by
calculating these and multiplying them by 100, we were able to create
percentage values that could be used later in the final table.
After we had all the data we needed - i.e., the counts and
percentages for each variable as an expression of ‘think_covid’, it was
finally time to combine everything. This was tricky considering it was
the first figure we attempted after learning R, and hence were unaware
of the various formats it could take - but luckily we figured it out
relatively quickly. First we combined the data frames with the counts
using list(), then used rbindlist() to combine
them into one big data table - from the data.table package.
listfull <- list(datagender, dataage, datachild, dataworking, datakey, datadegree, dataregion) #combined all the data frames into a list comprised of the 7 different data tables
fulltibble <- rbindlist(listfull, use.name=FALSE) #converted the list into 1 big combined data table as rows
fulltibble <- as_tibble(fulltibble) #converted it into a tibble
fulltibble1 <- fulltibble %>% #create column names
rename(ycovid = "Have COVID-19",
ncovid = "Have Not Had COVID-19")
We also renamed the think_covid columns to have appropriate headings
again using rename().
But it wasn’t always so easy. The first approach we tried was
creating data frames for each variable, like we did in the end, but only
combining 2 at a time and then using inner_join() to
combine them all. Here’s an excerpt of our previous code:
#merge tables
data_Gender <- data1 %>%
select(Gender, think_covid)
data_Age <- data1 %>%
select(Age, think_covid)
#set tables as data.frames
t1 = data.frame(data_Gender)
t2 = data.frame(data_Age)
dfgender <- data.frame(data_Gender)
dfage <- data.frame(data_Age)
#merge based on row names
gender_age = merge(t1, t2, by=0)
gt_gender_age <- gt(gender_age)
combined <- inner_join(data_Age, data_Gender, by = "think_covid")
Not only would it have been a painstakingly long process, this is what the output for our supposed ‘merged’ table would have looked like:
While we thought inner_join() was the solution to our
problems, it definitely wasn’t! Luckily we managed to get through the
process. Once we understood it, it was easy to repeat this process for
the data frames with the percentages. In this case, we used
rbind() from the dpylr package rather than
rbindlist().
ptibble <- rbind(pGender,pAge,pChild,pEmp,pKey,pEdu,pRegion) #as before, but with percent tables
ptibble <- as_tibble(ptibble)
ptibble <- data.frame(lapply(ptibble, function(x) if(is.numeric(x)) round(x,2) else x)) #to ensure they all remained at 2 decimal places
ptibble1 <- ptibble %>% #match column names
rename(ypercent = "Have.COVID.19",
npercent = "Have.Not.Had.COVID.19")
Next was the trickiest part: combining the full tables for counts and
percentages in the correct format. We ran into many problems with this,
most prominently that attempting the unite() and
paste() would work in combining them, but with many, many
decimal points that even our previous mitigating code wouldn’t be able
to counter. We also tried using gt_merge_stack(), which we
were so convinced was the answer! Until…
An example of our table using gt_merge_stack() to merge the counts and percentages
While it did technically show the information, it was definitely not
what we were looking for. There were too many decimal places, and it
‘stacked’ instead of putting the numbers side by side. Plus for some
reason, the percentages refused to remain at 2 decimal places, despite
our previous code ensuring this! After playing around with more merging
functions, and lots of Googling, we stumbled across paste0
allowed us to proceed. Note that the following output, in the code,
appears as one large table, but was unable to fit width-wise into this
format.
pfull <- cbind(fulltibble1, ptibble1) #combine tables as columns
pfull$havenocovid <- paste0(pfull$ncovid, " (", pfull$npercent, ")") #format percentages to be next to counts in brackets
pfull$havecovid <- paste0(pfull$ycovid, " (", pfull$ypercent, ")") #for both think_covid variables
pfull
## Gender ycovid ncovid ypercent
## 1 Male 697 2197 24.45
## 2 Female 796 2459 24.08
## 3 18 to 24 years 419 1003 29.47
## 4 25 to 34 years 400 823 32.71
## 5 35 to 44 years 294 751 28.13
## 6 45 to 54 years 164 554 22.84
## 7 55+ 216 1525 12.41
## 8 No 621 2005 23.65
## 9 Yes 776 2386 24.54
## 10 Not Working 357 1714 17.24
## 11 Working 1124 2871 28.14
## 12 No 753 3105 19.52
## 13 Yes 740 1551 32.30
## 14 GSCE/vocational/A-level/No formal qualifications 1060 3382 25.70
## 15 Degree or higher (Bachelors, Masters, PhD) 415 1200 23.86
## 16 Midlands 251 781 29.90
## 17 South & East 416 1369 24.32
## 18 North 335 1120 23.02
## 19 London 299 701 23.31
## 20 Wales, Scotland and Northern Ireland 192 685 21.89
## npercent havenocovid havecovid
## 1 75.55 2197 (75.55) 697 (24.45)
## 2 75.92 2459 (75.92) 796 (24.08)
## 3 70.53 1003 (70.53) 419 (29.47)
## 4 67.29 823 (67.29) 400 (32.71)
## 5 71.87 751 (71.87) 294 (28.13)
## 6 77.16 554 (77.16) 164 (22.84)
## 7 87.59 1525 (87.59) 216 (12.41)
## 8 76.35 2005 (76.35) 621 (23.65)
## 9 75.46 2386 (75.46) 776 (24.54)
## 10 82.76 1714 (82.76) 357 (17.24)
## 11 71.86 2871 (71.86) 1124 (28.14)
## 12 80.48 3105 (80.48) 753 (19.52)
## 13 67.70 1551 (67.7) 740 (32.3)
## 14 74.30 3382 (74.3) 1060 (25.7)
## 15 76.14 1200 (76.14) 415 (23.86)
## 16 70.10 781 (70.1) 251 (29.9)
## 17 75.68 1369 (75.68) 416 (24.32)
## 18 76.98 1120 (76.98) 335 (23.02)
## 19 76.69 701 (76.69) 299 (23.31)
## 20 78.11 685 (78.11) 192 (21.89)
Then the final step before making it into a full table - selecting
and renaming the relevant variables in their respective categories using
subset(), from base R.
pfull <- subset(pfull, select = c(Gender,havenocovid, havecovid)) %>%
rename(Level = Gender)
However, for a long time we were unable to create a stub and could only create row group headers. We almost considered just giving up and just leaving it like this…
An example piece of our table where we were unable to create the stub and instead had the participant characteristics as row group headers
While it would work and admittedly did serve the purpose of clearly
displaying the data, we were determined to make it work and achieve our
goal exactly how we set out to. We eventually realised that it could be
best to create a whole new vector for the participant characteristics,
separate from the features of the actual table, and just format them
based on how we saw appropriate. This allowed us to group all the
relevant categories together. We used c() to do this;
however, as this function works by naming each row, we had to account
for the spaces as well by creating blank rows, like so:
pchar <- c("Gender","","Age","","","","","Have a child","","Employment status","",
"Working in key sector","","Highest educational or professional qualification","",
"Region","","","","")
From there, it was easy to combine this column with the rest of the
table using cbind().
pfull1 <- cbind(pchar, pfull) #combining the "participant characteristics" column with the rest of the table
Now the final step - making it into a nice-looking table, so all the important information could be easily displayed! This was done mainly by using the gt package - after some experimentation, we were able to reproduce the figure exactly as was shown in the paper!
gtfull2 <- gt(pfull1) %>% #creating a gt table!
tab_spanner(
label = "Had COVID-19",
columns = c(3:4) #creating a header for the think_covid columns
) %>%
cols_label(
pchar = "Participant characteristics" #labeling the columns as in the original figure
) %>%
cols_label(
"havenocovid" = "Think have not had COVID-19 n = 4656 (%)"
) %>%
cols_label(
"havecovid" = "Think have had COVID-19 n = 1493 (%)"
) %>%
tab_style(cell_text(weight = "bold"), #bolding relevant text
locations = cells_column_labels(columns = everything())) %>%
tab_style(cell_text(weight = "bold"),
locations = cells_column_spanners())
gtfull2
| Participant characteristics | Level | Had COVID-19 | |
|---|---|---|---|
| Think have not had COVID-19 n = 4656 (%) | Think have had COVID-19 n = 1493 (%) | ||
| Gender | Male | 2197 (75.55) | 697 (24.45) |
| Female | 2459 (75.92) | 796 (24.08) | |
| Age | 18 to 24 years | 1003 (70.53) | 419 (29.47) |
| 25 to 34 years | 823 (67.29) | 400 (32.71) | |
| 35 to 44 years | 751 (71.87) | 294 (28.13) | |
| 45 to 54 years | 554 (77.16) | 164 (22.84) | |
| 55+ | 1525 (87.59) | 216 (12.41) | |
| Have a child | No | 2005 (76.35) | 621 (23.65) |
| Yes | 2386 (75.46) | 776 (24.54) | |
| Employment status | Not Working | 1714 (82.76) | 357 (17.24) |
| Working | 2871 (71.86) | 1124 (28.14) | |
| Working in key sector | No | 3105 (80.48) | 753 (19.52) |
| Yes | 1551 (67.7) | 740 (32.3) | |
| Highest educational or professional qualification | GSCE/vocational/A-level/No formal qualifications | 3382 (74.3) | 1060 (25.7) |
| Degree or higher (Bachelors, Masters, PhD) | 1200 (76.14) | 415 (23.86) | |
| Region | Midlands | 781 (70.1) | 251 (29.9) |
| South & East | 1369 (75.68) | 416 (24.32) | |
| North | 1120 (76.98) | 335 (23.02) | |
| London | 701 (76.69) | 299 (23.31) | |
| Wales, Scotland and Northern Ireland | 685 (78.11) | 192 (21.89) | |
To compare it to the original, here it is again:
As you can see, the tables are nearly identical. However, through this comparison, we did notice a discrepancy in the percentage values of male and female - however, as the numbers of the count matched, we believe this was the authors’ mistake. Overall, although it took a significant amount of experimentation and lots of trial and error, we had our first triumph!
Secondly, we attempted to reproduce Figure 2, the plot of results.
Much like in Figure 1, our first task was to clean the data using:
rename() to rename the variables in a way that was
easier to work with - such as
rename(immunity = q8haveimmunity)select() to select the relevant variablesmutate() to label the coded responsesdata2 <- data %>%
rename(immunity = q8haveimmunity,
gshop = Adhere_shop_groceries,
oshop = Adhere_shop_other,
friends = Adhere_meet_friends,
outings = Going_out_total,
noworry = q9worry,
rself = q10arisk,
rothers = q10brisk,
covidknow = Sx_covid_nomissing,
think_covid = Ever_covid) %>%
select(immunity,
gshop,
oshop,
friends,
outings,
noworry,
rself,
rothers,
covidknow,
think_covid) %>%
mutate(think_covid = case_when(think_covid == 0 ~ "Think have not had COVID-19",
think_covid == 1 ~ "Think have had COVID-19"))
Like Figure 1, Figure 2 was also made up of various different data sets - therefore we used the same approach in again creating individual data sets. However, there were 2 parts to this:
As the whole plot was measuring these variables across the person’s
belief that they have had COVID, we had to create an overall count for
this factor that could be used across all the other data sets. We did
this using group_by() to ensure it was counting the correct
variable, then using the count() function to (as the name
suggests) count. Since we were grouping all the following variables by
‘think_covid’, the main variable in the study, we didn’t bother adding
ungroup() as it would just be redundant.
covid <- data2 %>%
select(immunity, think_covid) %>% #selecting the relevant variables - immunity as a placeholder. The real variable of interest is think_covid
group_by(think_covid) %>% #so it knows to sort it by this
count() %>% #count no. of people in each category
rename(covidtotal = n) #to avoid confusion later
This was the same process as the step above, however consisted of
filtering specific instances of each measure. For example, the
‘immunity’ column in the graph was labeled “strongly agree that they are
immune”, suggesting that they only included participants that scored 5
out of 5 on the scale in the plotted proportion. Therefore, we had to
use filter() to filter only responses that were 5, ensuring
that all other responses were excluded. Then we used mutate
to rename according to the labels in the graph. Again, we used
count() to count the number of participants in this
category, and group_by() to select the relevant variables,
but since each variable would only be counted once, we had to use
ungroup() so that this filter wouldn’t apply to the
following data sets of other variables. Finally, we used
rename() to rename the ‘n’ count to a variable-relevant
name so we wouldn’t get confused, since all of the other variable-counts
would also be called ‘n’! Here is our data for the immunity variable -
we followed the same process for each variable in this section.
dimmunity <- data2 %>%
select(immunity, think_covid) %>%
filter(immunity == "5") %>% #filter so only those who responded with 5 are included
mutate(immunity = case_when(immunity == 5 ~ "Strongly agree that they are immune")) %>% #rename
group_by(think_covid, immunity) %>%
count() %>%
ungroup() %>% #so this filter doesn't apply to following sets
rename("stronglyagree" = "n")
As percentages were the y-axis component on this plot, for each variable, we had to:
merge()mutate() and calculate this
value by manually inputting the formula. This formula would be the ‘n’
value for each variable divided by the total number of included
participants, creating a proportion, then multiplying by 100 to create a
percentage value that could be used in the plotIn the case of immunity:
dimmunity <- merge(x = dimmunity, y = covid, by = "think_covid") %>% #where 'covid' is the think_covid count
mutate(strongpercent = stronglyagree*100/covidtotal) #creating the percentage column
We repeated this for all other variables as well.
After we had the percentages for each relevant variable, we had to combine them all into one tibble. We used the same strategy as in Figure 1, i.e.:
list() function to combine them into a listrbindlist() to convert it into one big data frameLike so:
listdata2 <- list(dcovidknow,
drothers,
drself,
dnoworry,
doutings,
dfriends,
doshop,
dgshop,
dimmunity)
tdata2 <- rbindlist(listdata2, use.name=FALSE)
tdata2<- subset(tdata2, select = -c(idcovidsym, covidtotal)) %>%
rename("groups"=covidknow) %>%
rename(Percentage = idcovidsympercent) %>%
group_by(think_covid)
Finally the exciting part of converting the data into a visually
appealing plot! We had all the information - our first step in plotting
was to manually reorder the variables to appear as they do in the study.
We did this by using mutate() to create a new column, then
within this, using fct_relevel()to reorder them:
plotdata2 <- tdata2 %>% mutate(groups = fct_relevel(groups,
"Strongly agree that they are immune",
"Shopping for groceries/pharmacy two or more times",
"Shopping for items other than groceries/pharmacy",
"Met up with friends/family they do not live with",
"Out-of-home activity, 8 or more outings",
"Not worried at all",
"Perceive no risk at all to themselves",
"Perceive no risk at all to others",
"Did not identify common symptoms"))
It was here we realised that there was actually a mistake in this figure. The columns for “shopping for groceries/pharmacy two or more times” showed roughly 35% and 45% correspondingly - but cross-referenced with Figure 4, those percentage values actually matched up with the data for “shopping for groceries/pharmacy on one or fewer days”. This suggested that either Figure 2 or Figure 4 was incorrect! This was likely due to a typo in the code as in the original data, ‘0’ was coded as “one or fewer days” whereas ‘1’ was coded as “two or more times”. But moving on..
Then, like in Figure 1, we created a vector for the 2 levels of categories (think_covid) and named them accordingly:
plotdata2_new <- plotdata2
plotdata2_new$think_covid <- factor(plotdata2_new$think_covid,
levels = c("Think have not had COVID-19", "Think have had COVID-19" ))
Now we could finally see the plot! For this, the ggplot2 package was
essential. We used ggplot() for this - a super handy tool
and very easy to use once you’ve got all your data in order. We
formatted the plot until it replicated the original Figure 2.
plot2 <- ggplot(plotdata2_new, aes(fill=think_covid,
y= Percentage,
x= groups)) + #plotting the data
geom_bar(position ='dodge', stat='identity', width=0.5) + #making it a bar chart, with bars next to each other for the think_covid factor
scale_fill_grey()+ # scale_color_manual(labels = c("Think have not had COVID-19", "Think have had COVID-19"), #changing colours of the bars
#values = c("black", "grey"))
theme(axis.text.x = element_text(angle = 45, vjust = 0.9, hjust = 1), #changing the text settings
panel.background = element_rect(fill = "white", colour="white"), #general aesthetics
panel.grid.major.y = element_line(size = 0.5, colour="grey"),
panel.grid.minor.y = element_line(size = 0.5, colour="grey"),
panel.grid.minor.x = element_blank(),
panel.grid.major.x = element_blank(),
legend.position="bottom", #designing the legend
legend.title=element_blank(),
axis.title.x = element_blank(), #axis titles
axis.title.y = element_text(margin = margin(t = 0, r = 50, b = 0, l = 0))) +
scale_y_continuous(limits = c(0,60), breaks = c(0,10,20,30,40,50,60)) #scale lengths and intervals
Leading to an output looking like this:
plot2
Again, let’s compare to the original figure:
Yes! Figure 2 complete - halfway there. Figure 2 seemed easier than Figure 1 - firstly, because we were a bit more used to using R and definitely knew a lot more. But also because we had the guidance of Danielle’s video modules, which didn’t touch on tables - but covered plots!
Next we moved on to Figure 3. At this point, we had gotten a handle of using R and had split up into pairs to work on the last 2 figures - but Figure 3 was different because we had to learn how to include descriptives in our table. However, at this point, we had learnt so much about R and troubleshooting, and our ability to Google using the right search terms (we actually knew what the people on the forums were talking about!) and going to the right websites (such as beginner-friendly blog posts rather than official websites) had advanced so much, we were able to move on swiftly.
Again, using rename(), select(), and
mutate() for ease of working:
data3 <- data %>%
rename(immunity = q8haveimmunity, #rename variables appropriately
outings = Going_out_total,
noworry = q9worry,
rself = q10arisk,
rothers = q10brisk,
think_covid = Ever_covid) %>%
select(immunity, #select relevant variables
outings,
noworry,
rself,
rothers,
think_covid) %>%
mutate(think_covid = case_when(think_covid == 1 ~ "Have COVID-19", #rename coded responses
think_covid == 0 ~ "Have Not Had COVID-19"))
Since this table includes the means and SDs for each
measure/characteristic, we had to calculate them for each data set -
meaning we had to create a data set for each variable, as in the
previous 2 figures. However, this time, we used summarise()
to calculate these values.
d3immune <- data3 %>%
select(immunity, think_covid) %>% #select relevant variable
group_by(think_covid) %>% #the main variable of interest
summarise(mean(immunity), #to calculate mean
sd(immunity)) %>% #to calculate SD
rename(mean = 'mean(immunity)', #rename for ease and for categorising later
sd = 'sd(immunity)')
I mentioned earlier that we tried using unite() to
combine the mean and SD cells - here’s an example of what this looked
like.
This persisted even though we had options(digits=3) to
keep values at 3 significant figures. No one - to this day - knows why
this wasn’t working, but we did determine that it was unite that was
causing it. Which meant that we had to try a different approach.
Hence:
d3immune <- data.frame(lapply(d3immune, function(x) if(is.numeric(x)) round(x,2) else x)) #to ensure they would remain at 2 decimal places
d3immune$desc <- paste("M = ",d3immune$mean, sep="") %>% #paste mean and SD together in the same cell
paste(", SD = ",d3immune$sd, sep="")
Then subset() the columns we didn’t need - and there we
had our first data frame, for the immunity variable.
d3immune <- subset(d3immune, select = -c(mean,sd))
d3immune
## think_covid desc
## 1 Have COVID-19 M = 3.33, SD = 1
## 2 Have Not Had COVID-19 M = 2.38, SD = 1.01
Let’s quickly double-check that everything is correct by comparing to the original figure:
Yep, perfect.
And again, we repeated that for each variable. But we had a problem - we needed the think_covid variable to be column headings, but our output looked like this:
d3immune
## think_covid desc
## 1 Have COVID-19 M = 3.33, SD = 1
## 2 Have Not Had COVID-19 M = 2.38, SD = 1.01
So we needed to switch them. We ran into lots of trouble with this, and ended up having to create a whole new data frame for each, using the following function:
switchI <- as.data.frame(t(as.matrix(d3immune)))
But this ended up creating a matrix with the headings ‘V1’ and ‘V2’, with far too many decimal places, like so:
This took us such a long time to get rid of! We couldn’t even rename them, since they were not ‘officially’ part of the table, and we already had the desired headings. Worst of all, the ‘Have had COVID-19’ and ‘Have not had COVID-19’ headings still existed for each table. They were just pushed down, which was fine - if you just cropped the bottom half of the table, that is.
It was clearly a disaster, but we managed to mitigate this by using
subset(). Phew! Now our tibbles were perfect:
switchI
## 1 2
## think_covid Have COVID-19 Have Not Had COVID-19
## desc M = 3.33, SD = 1 M = 2.38, SD = 1.01
Like before, now we just had to combine all the data into a list
using list(), and use this list to create a table with
rbindlist().
fulllist3 <- list(switchI, switchO, switchW, switchS, switchR)
fulltable3 <- rbindlist(fulllist3, use.names = FALSE)
fulltable3 <- fulltable3[-c(1, 3, 5, 7, 9)] %>%
select(2, 1) #to clean
fulltable3
## 2 1
## 1: M = 2.38, SD = 1.01 M = 3.33, SD = 1
## 2: M = 6.69, SD = 5.63 M = 9.35, SD = 7.69
## 3: M = 3.59, SD = 1.01 M = 3.38, SD = 1.12
## 4: M = 2.81, SD = 0.76 M = 2.81, SD = 0.76
## 5: M = 3.39, SD = 0.67 M = 3.3, SD = 0.7
Next - the figure’s first two columns were completely comprised of
text. As can be seen in the last screenshot of the old code, at that
point, we were still only able to include the participant
characteristics as row group headers. But once we figured out what to do
in Figure 1, it was easy to apply it here. So to create this, we created
vectors using the c() function and manually typed the
relevant text, like so:
level3 <- c("1 = strongly disagree to 5 = strongly agree",
"Range = 0 to 42",
"1 = not at all worried to 5 = extremely worried",
"1 = no risk at all to 4 = major risk",
"1 = no risk at all to 4 = major risk")
pchar3 <- c("I think I have some immunity to COVID-19",
"Total out-of-home activity in the last seven days",
"Worry about COVID-19",
"Perceived risk of COVID-19 to oneself",
"Perceived risk of COVID-19 to people in the UK")
Lastly, using cbind() to create the table itself:
fulltable4 <- cbind(pchar3, level3, fulltable3)
Finally! Using the gt() function, we are able to
visualise the whole table.
gtfulltable4 <- gt(fulltable4) %>%
tab_spanner( #column headers spanning columns 3 and 4
label = "Had COVID-19",
columns = c(3:4)
) %>%
cols_label( #column labels
pchar3 = "Participant characteristics"
) %>%
cols_label(
"level3" = "Level"
) %>%
cols_label(
"2" = "Think have not had COVID-19 n = 4656"
) %>%
cols_label(
"1" = "Think have had COVID-19 n = 1493"
) %>%
tab_style(cell_text(weight = "bold"), #formatting
locations = cells_column_labels(columns = everything())) %>%
tab_style(cell_text(weight = "bold"),
locations = cells_column_spanners())
This leads to a table output looking like this: exactly as the paper demonstrated.
gtfulltable4
| Participant characteristics | Level | Had COVID-19 | |
|---|---|---|---|
| Think have not had COVID-19 n = 4656 | Think have had COVID-19 n = 1493 | ||
| I think I have some immunity to COVID-19 | 1 = strongly disagree to 5 = strongly agree | M = 2.38, SD = 1.01 | M = 3.33, SD = 1 |
| Total out-of-home activity in the last seven days | Range = 0 to 42 | M = 6.69, SD = 5.63 | M = 9.35, SD = 7.69 |
| Worry about COVID-19 | 1 = not at all worried to 5 = extremely worried | M = 3.59, SD = 1.01 | M = 3.38, SD = 1.12 |
| Perceived risk of COVID-19 to oneself | 1 = no risk at all to 4 = major risk | M = 2.81, SD = 0.76 | M = 2.81, SD = 0.76 |
| Perceived risk of COVID-19 to people in the UK | 1 = no risk at all to 4 = major risk | M = 3.39, SD = 0.67 | M = 3.3, SD = 0.7 |
Again, let’s compare to the actual figure:
Yay! We did it!
We were finally here at the last figure. This one was an interesting case - as it was the last one we attempted, we had significantly more R knowledge than we did when we started, and could hence apply that to various features, as will be described below. While we didn’t produce the exact figure the paper did, we believe we found a way that presented the data even more clearly. But let’s get into exactly how we did that.
We cleaned the data almost exactly as we did in Figure 1, using the
rename() and mutate() functions.
data4 <- data %>%
rename(Gender = gender,
Age = age_categories,
Has_child = Has_child,
Employment_status = Working,
Key_worker = Key_worker,
Education = degree,
Region = region,
Total_outings = Going_out_total,
groceries = Adhere_shop_groceries,
shop_other = Adhere_shop_other,
meetup = Adhere_meet_friends,
symptoms = Sx_covid_nomissing,
think_covid = Ever_covid) %>%
mutate(think_covid = case_when(think_covid == 0 ~ "No",
think_covid == 1 ~ "Yes")) %>%
mutate(groceries = case_when(groceries == 0 ~ "On one or fewer days in the last week n = 2389",
groceries == 1 ~ "On two or more days in the last week n = 3760")) %>%
mutate(shop_other = case_when(shop_other == 0 ~ "Not at all in the last week n = 1833",
shop_other == 1 ~ "On one or more days in the last week n = 4316")) %>%
mutate(meetup = case_when(meetup == 0 ~ "Not at all in the last week n = 5271",
meetup == 1 ~ "On one or more days in the last week n = 878")) %>%
mutate(symptoms = case_when(symptoms == 0 ~ "Did not correctly identify common symptoms n = 2390",
symptoms == 1 ~ "Correctly identified common symptoms n = 3632")) %>%
select(think_covid, groceries, shop_other, meetup, symptoms)
Like before, we had to create individual data frames for each
variable. This again followed a very similar format as the previous
figures - whilst repetitive, it was rewarding and almost satisfying to
be able to apply the knowledge we had gained. In other words, we used
the select(), group_by(), and
arrange() functions to clean, and tabyl() to
count.
Here’s an example of the one created for the “Shopping for groceries/pharmacy” variable:
datagroceries <- data4 %>%
select(groceries,think_covid) %>% #selecting the relevant variables
group_by(think_covid) %>% #the main variable we were sorting by
arrange(factor(groceries, levels = c("On one or fewer days in the last week n = 2389", "On two or more days in the last week n = 3760"))) #arranging the factors to appear in the same order as they do in the original figure
tgroceries <- datagroceries %>%
tabyl(groceries, think_covid) #creating a count
gtgroceries <- gt(tgroceries) #creating a table to display and more easily combine later
datagroceries <- gtgroceries$"_data"
This was repeated for each of the other variables.
As this figure also included percentages, we had to create percentage
columns for each variable based on the previously created counts.
Luckily, we could apply the knowledge gained whilst making Figure 1 to
do this, since they used the same principles and the with()
and prop.table() functions! Again, we did this for every
variable.
perGroc <- with(data4, table(think_covid,groceries)) #creating a tibble to use to calculate %
pGroc <- prop.table(perGroc, 1)*100 #calculating the %
perShop <- with(data4, table(think_covid, shop_other))
pShop <- prop.table(perShop, 1)*100
perMeet <- with(data4, table(think_covid, meetup))
pMeet <- prop.table(perMeet, 1)*100
perSymp <- with(data4, table(think_covid,symptoms))
pSymp <- prop.table(perSymp, 1)*100
Like before, we used the list() and
rbindlist() functions to combine all the previously-made
individual count data sets, and as_tibble() to make it into
a tibble.
listfull3 <- list(datagroceries,
datashop,
datameetup,
datasymptoms)
listfull3 <- rbindlist(listfull3, use.name=FALSE)
listfull3 <- as_tibble(listfull3)
listfull3 <- listfull3
We repeated this for the percentage data sets.
ptibble4 <- rbind(pGroc, pShop, pMeet, pSymp)
ptibble4 <- as_tibble(ptibble4)
ptibble4 <- data.frame(lapply(ptibble4, function(x) if(is.numeric(x)) round(x,1) else x))
And the final step: combining the now combined count and percentage
tables! Since we did this in Figure 1, it wasn’t too difficult to apply
it here, with the relevant data sets, columns, and names. It felt good
to be able to go straight to paste0() this time. We also
ensured to add the brackets as features, to separate the values as seen
in the original figure - just like in Figure 1.
tibble4 <- cbind(listfull3, ptibble4)
tibble4$nocovid <- paste0(tibble4$No, " (", tibble4$On.one.or.fewer.days.in.the.last.week.n...2389, ")")
tibble4$yescovid <- paste0(tibble4$Yes, " (", tibble4$On.two.or.more.days.in.the.last.week.n...3760, ")")
tibble4 <- subset(tibble4, select = c(groceries, nocovid, yescovid))
Now we could finally see the output, and do formatting of the actual
table itself again using the gt() function - our best
friend.
gtfull4 <- gt(tibble4) %>%
tab_spanner(
label = "Think COVID-19",
columns = c(2:3)
) %>%
cols_label(
"nocovid" = "Think have not had COVID-19"
) %>%
cols_label(
"yescovid" = "Think have had COVID-19"
) %>%
cols_label(
"groceries" = "Self-Reported Behaviour n (%)"
) %>%
tab_row_group(
label = "Shopping for groceries/pharmacy",
rows = 1:2
) %>%
tab_row_group(
label = "Shopping for items other than groceries/pharmacy",
rows = 3:4
) %>%
tab_row_group(
label = "Meeting up with friends or family",
rows = 5:6
) %>%
tab_row_group(
label = "Correct identification of cough and fever",
rows = 7:8
) %>%
row_group_order(groups = c("Shopping for groceries/pharmacy",
"Shopping for items other than groceries/pharmacy",
"Meeting up with friends or family",
"Correct identification of cough and fever"
)) %>%
tab_style(cell_text(weight = "bold"),
locations = cells_column_labels(columns = everything())) %>%
tab_style(cell_text(weight = "bold"),
locations = cells_column_spanners()) %>%
tab_style(
style = list(
cell_text(weight = "bold")),
locations = cells_row_groups())
gtfull4
| Self-Reported Behaviour n (%) | Think COVID-19 | |
|---|---|---|
| Think have not had COVID-19 | Think have had COVID-19 | |
| Shopping for groceries/pharmacy | ||
| On one or fewer days in the last week n = 2389 | 1701 (36.5) | 688 (63.5) |
| On two or more days in the last week n = 3760 | 2955 (46.1) | 805 (53.9) |
| Shopping for items other than groceries/pharmacy | ||
| Not at all in the last week n = 1833 | 1156 (24.8) | 677 (75.2) |
| On one or more days in the last week n = 4316 | 3500 (45.3) | 816 (54.7) |
| Meeting up with friends or family | ||
| Not at all in the last week n = 5271 | 456 (9.8) | 422 (90.2) |
| On one or more days in the last week n = 878 | 4200 (28.3) | 1071 (71.7) |
| Correct identification of cough and fever | ||
| Correctly identified common symptoms n = 3632 | 2927 (62.9) | 705 (37.1) |
| Did not correctly identify common symptoms n = 2390 | 1729 (47.2) | 788 (52.8) |
Let’s compare to the original figure:
Although it didn’t look exactly as the figure in the paper, we believed that the rearrangement both presented the data more clearly, as well as allowed for less repetition. Bonus points that it looked more like the other figures in the paper - thus we opted for swapping the positions of the ‘self-reported behaviour’ and ‘think COVID-19’ columns. Most importantly, all the values and percentages matched up. We were happy with it, concluding our reproduction journey!
I generally used the same packages as above to carry out my exploratory analyses, but there were a few extra that came in handy:
library(RColorBrewer) # to change colours of plots
library(rcompanion) # an easy way to create a plot of means
library(kableExtra) # for displaying simple tables
## Warning in !is.null(rmarkdown::metadata$output) && rmarkdown::metadata$output
## %in% : 'length(x) = 2 > 1' in coercion to 'logical(1)'
##
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
##
## group_rows
library(report) #to help interpret statistical tests
library(corrplot) #to create correlation plots
## corrplot 0.92 loaded
A simple question, yet particularly relevant. The authors reported that those with children were more likely to think that they had previously caught coronavirus, likely due to greater exposure or perceived exposure (for instance, through schools). Was this linked with greater worry? It could be argued that greater worry would be expected, as the consequences would be worse considering children would be more susceptible to a harsher experience. As child-carers, this would be a cause for concern.
To start exploring this, I used the existing data set and cleaned it. I took the following steps:
select() to filter out my only 2 variables of
interestfilter() to filter out the ‘NA’ responses that
would exclude participants that were not eligiblemutate() to re-code the responses of ‘0’ and ‘1’
to ‘no’ and ‘yes’, in response to whether they had a childtabyl() to count, just out of curiosity and to
ensure I was on the right trackas.data.frame() to turn the data set into a data
frame, so it could be more easily merged laterI found it useful (which will be elaborated later on) to have 2 copies of this cleaned data - one with the count and one long-form, with just the raw data. I called the counted version ‘child’, and it looked like this:
child <- data %>%
select(Has_child, q9worry) %>%
mutate(Has_child = case_when(Has_child == 0 ~ "No",
Has_child == 1 ~ "Yes")) %>%
filter(Has_child != "NA") %>%
tabyl(Has_child, q9worry) %>%
as.data.frame()
child
## Has_child 1 2 3 4 5
## 1 No 96 347 1003 770 410
## 2 Yes 101 285 1027 962 787
And I called the long-form version ‘sig’. The only difference was
that I included the group_by() function. This version will
later allow me to create a plot. It looked like this:
sig <- data %>%
select(Has_child, q9worry) %>%
filter(Has_child != "NA") %>%
group_by(Has_child) %>%
as.data.frame() %>%
mutate(Has_child = case_when(Has_child == 0 ~ "No",
Has_child == 1 ~ "Yes"))
First, to gain a primary understanding of the data, I decided to
create a simple table illustrating the descriptives. As this time I was
only interested in the child variable across the worry variable, I only
needed to deal with these two variables. To do this, a simple
summarise() would suffice, and using
kable_styling() from the kableExtra package to illustrate
it.
mean1 <- data %>%
filter(Has_child != "NA") %>%
group_by(Has_child) %>%
mutate(Has_child = ifelse(Has_child, "yes", "no")) %>%
summarise(n(),
mean(q9worry),
sd(q9worry)) %>%
rename(avg = "mean(q9worry)") %>%
rename(n = "n()") %>%
rename(sd = "sd(q9worry)") %>%
kbl() %>%
kable_styling()
Let’s see what the values were:
mean1
| Has_child | n | avg | sd |
|---|---|---|---|
| no | 2626 | 3.400228 | 1.018002 |
| yes | 3162 | 3.648008 | 1.047924 |
Looks like for now, those with a child appear to have higher average
worry. But we must also take into account the fact that there were more
participants overall that did have a child. Out of curiosity, I also
wanted to see the number of participants on each child factor across
levels of worry. Obviously, since the child variable is not continuous,
I only calculated ‘n’ and not the mean and SD, so tabyl()
would suffice. This is demonstrated here, whereby the values 1-5
correspond to level of worry in increasing order:
mean2 <- data %>%
filter(Has_child != "NA") %>%
mutate(Has_child = ifelse(Has_child, "yes", "no")) %>%
tabyl(Has_child, q9worry) %>%
kbl() %>%
kable_styling()
What are the results?
mean2
| Has_child | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| no | 96 | 347 | 1003 | 770 | 410 |
| yes | 101 | 285 | 1027 | 962 | 787 |
Interesting data, but doesn’t actually tell us much. It would be much more helpful to see a graph or plot of it…
Next, I wanted a clear way to visualise the data and be able to see a pattern. I decided that creating a column graph would be the best way to achieve this - with levels of worry on the x-axis, percentage of people on the y-axis, and 2 bars each, corresponding with a legend, to represent whether they had a child.
To do this, I obviously had to create a percentage count. Luckily, I
had practice doing that in the reproducibility section! My first step
was to create a count of the total number of participants I would be
including. I did this using count(), and then renamed ‘n’
to ‘covidtotal’ using the rename() function, so I wouldn’t
get confused.
total <- data %>%
filter(Has_child != "NA") %>%
group_by(Has_child) %>%
count() %>%
rename(covidtotal = n) %>% as.data.frame()
I also had to create a long-form version of this count - however,
this wasn’t possible using count(), as this requires
group_by() to precede it. However, I found that I could use
tally() instead, which performed the same function but
without this barrier! While I was doing it, I thought I may as well use
mutate() to rename the coded variables on the worry scale
to appropriate names that I could later use in my plot. Ironically, I
called this ‘counting’.
counting <- data %>%
filter(Has_child != "NA") %>%
group_by(Has_child, q9worry) %>%
tally() %>%
mutate(q9worry = case_when(q9worry == 1 ~ "Not at all worried",
q9worry == 2 ~ "A little bit worried",
q9worry == 3 ~ "Moderately worried",
q9worry == 4 ~ "Very worried",
q9worry == 5 ~ "Extremely worried")) %>%
as.data.frame()
Since there were differing quantities of participants in each level
of the child factor, I felt as if percentages would work better as they
would be standardised. Since multiple of the figures in Part 2 used
percentages, I was able to apply this knowledge here. I could use the
previous counts to create the percentages that I could use in my plot!
Unfortunately, I couldn’t find a way to include the ‘total’ data set
directly, but luckily our previous count gave me the numbers I needed,
so I manually added it - it came to a total of 5788. This was done
primarily using the mutate() function:
pattempt <- counting %>% mutate(perc = n*100/5788) #calculating percentages for each worry factor by mutating a formula
counting <- counting %>% mutate(pattempt) #mutating the above 'pattempt' as a new column in the 'counting' dataframe
I was almost ready to see the plot itself - but once I renamed the
columns, they rearranged themselves in alphabetical order, which made no
logical sense. So first, I needed to reorder the columns in a way that
made sense now using factor() like I’ve demonstrated
multiple times in Part 2:
counting$q9worry <- factor(counting$q9worry, levels = c("Not at all worried", "A little bit worried", "Moderately worried", "Very worried", "Extremely worried"))
And now the graph itself! Using simple ggplot():
graph1 <- counting %>% mutate(Has_child = ifelse(Has_child, "Yes", "No")) %>% #renaming the legend
ggplot(aes(factor(q9worry), perc, fill = Has_child)) + #assigning variables to the axes/legend
geom_col(position = position_dodge()) + #bars side by side
ggtitle("Impact of Children on Level of COVID-worry") + #title
labs(x="Worry about COVID", y="% of people", fill="Have a child") + #axis labels
coord_flip() + #realised halfway it would be easier to read the 'worry' labels if the axes were flipped
theme_bw() #background theme
graph1
Which was great! … for visualising the proportions. But after seeing the output, I quickly realised that it didn’t tell me anything of statistically significant value. I might be better off creating a means box plot. Although it was annoying having to start over, I felt like it would be best.
summ() was the function I used to calculate the
descriptives of the worry variable - this came from the rcompanion
package. I found that summarise(), while succinct, just
didn’t have enough information to create a box plot.
summ = groupwiseMean(q9worry ~ Has_child,
data = sig, #the dataframe I was calculating from
conf = 0.95, #95% confidence interval
digits = 3) #3 significant figures
summ
## Has_child n Mean Conf.level Trad.lower Trad.upper
## 1 No 2626 3.40 0.95 3.36 3.44
## 2 Yes 3162 3.65 0.95 3.61 3.68
Which then allowed me to create the box plot easily. I called this ‘bplot’.
bplot <- ggplot(sig, aes(x=Has_child, y=q9worry, fill=Has_child)) + #assigning variables
geom_boxplot()+
stat_summary(fun=mean, geom="point", shape=23, size=2) + #mean marker
labs(title = "Mean Worry of Child vs No Child", x = "Have child", y = "Mean Worry about COVID") + #titles
theme_minimal() + #background
scale_fill_brewer(palette="Dark2") #colour palette
bplot
From looking at this box plot, though the other statistics appeared identical, there was definitely a visible difference between the two means.
But the final question remained - was this a statistically
significant difference? I used t.test() to check this:
t.test(q9worry ~ Has_child, data = sig, var.equal = TRUE)
##
## Two Sample t-test
##
## data: q9worry by Has_child
## t = -9.0723, df = 5786, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group No and group Yes is not equal to 0
## 95 percent confidence interval:
## -0.3013202 -0.1942381
## sample estimates:
## mean in group No mean in group Yes
## 3.400228 3.648008
Looks like p<0.05, or it was significant! Now that we had these results, I was curious as to whether there was a correlation between having a child and the other measures relating to COVID-worry, namely immunity and perceived risk to self and others. I decided to create a correlation matrix to investigate this.
cors <- data %>% select(Has_child, q10arisk, q10brisk, q9worry, q8haveimmunity) %>%
filter(Has_child != "NA") %>%
cor()
print(cors)
## Has_child q10arisk q10brisk q9worry q8haveimmunity
## Has_child 1.000000000 0.1639095 0.03458871 0.1184296 0.003529561
## q10arisk 0.163909539 1.0000000 0.39187157 0.4721884 -0.162017663
## q10brisk 0.034588709 0.3918716 1.00000000 0.4418158 -0.189576707
## q9worry 0.118429627 0.4721884 0.44181579 1.0000000 -0.262237210
## q8haveimmunity 0.003529561 -0.1620177 -0.18957671 -0.2622372 1.000000000
From this, it looked like none of the other measures showed very high
correlations (<0.1) while worry and self-perceived risk had the
highest correlations… which was surprising, considering the rationale
that those with children would be worried on behalf of their children
leading to higher perceived risk towards others. I wondered whether the
correlation between self-perceived risk and having a child was
significant, so I tested this using report(), from the
report package.
cor2 <- cor.test(data$Has_child, data$q10arisk, method = "pearson")
report(cor2)
## Effect sizes were labelled following Funder's (2019) recommendations.
##
## The Pearson's product-moment correlation between data$Has_child and data$q10arisk is positive, statistically significant, and small (r = 0.16, 95% CI [0.14, 0.19], t(5786) = 12.64, p < .001)
It was statistically significant! To determine the direction this
correlation was in, I quickly used summarise() to calculate
the means.
childriskmean <- data %>%
group_by(Has_child) %>%
filter(Has_child != "NA") %>%
mutate(Has_child = case_when(Has_child == 0 ~ "No",
Has_child == 1 ~ "Yes")) %>%
rename(SelfRisk = q10arisk) %>%
summarise(mean(SelfRisk)) %>%
rename(MeanSelfRisk = "mean(SelfRisk)") %>%
kbl() %>%
kable_styling()
childriskmean
| Has_child | MeanSelfRisk |
|---|---|
| No | 2.668317 |
| Yes | 2.918722 |
Having a child also significantly increases self-perceived risk!
Those with children do indeed have significantly higher average worry about COVID-19, as well as self-perceived risk to COVID-19. The former appears to be intuitive, as one would assume child-carers would be more worried for their children as they are a vulnerable population. However, higher self-perceived risk is interesting - would this be due to higher perceived exposure from their children (eg. from schools)? This could be the topic of a later study.
The study showed that older participants were less likely to think that they had previously had coronavirus. I was curious as to whether this would be consistent with their behaviour. To do this, I thought that creating a column graph illustrating each age group’s mean total number of outings.
I first cleaned the data:
data1 <- data %>%
rename(Age = age_categories) %>%
select(Age, Going_out_total) %>% #the 2 variables of interest in this question
mutate(Age = case_when(Age == 1 ~ "18 to 24 years",
Age == 2 ~ "25 to 34 years",
Age == 3 ~ "35 to 44 years",
Age == 4 ~ "45 to 54 years",
Age == 5 ~ "55+"))
To get a general idea of the pattern I would be looking for, I
created a simple means table again using summarise():
summ1 <- data1 %>%
group_by(Age) %>%
summarise(mean(Going_out_total)) %>%
rename(MeanGoingOut = "mean(Going_out_total)") %>%
kbl() %>%
kable_styling()
summ1
| Age | MeanGoingOut |
|---|---|
| 18 to 24 years | 8.287623 |
| 25 to 34 years | 8.074407 |
| 35 to 44 years | 7.379904 |
| 45 to 54 years | 7.062674 |
| 55+ | 6.115451 |
Looks like there was a downward linear trend, with the youngest people going out the most frequently and the oldest people the least.
I wanted my column graph to include error bars - to do this, I needed
to use the summ() function from Q1 to include the other
needed statistics, such as the boundaries of the confidence
interval.
summ = groupwiseMean(Going_out_total ~ Age,
data = data1,
conf = 0.95,
digits = 3)
From there, it was relatively simple to create the graph, again using
ggplot(). I called the graph ‘cp’.
cp <- ggplot(summ, aes(x=Age, y=Mean, fill=Age)) +
geom_col() +
geom_errorbar(aes(ymin = Trad.lower, ymax = Trad.upper, width = 0.15)) + #creating error bars
ggtitle("Mean Total No. of Outings Across Age Groups") +
labs(x="Age", y="Mean Total No. of Outings") +
coord_flip() +
theme_bw() + #background aesthetic
scale_fill_brewer(palette="Accent") #colour aesthetic
cp
Hmm, there doesn’t look like there’s a big difference. I decided to
do a significance check with a one-way ANOVA significance test. I used
the aov() function to compute the analysis of variance, and
summary() to determine whether it was significant. A
significant result would mean that at least one pair had a significant
difference.
res.aov <- aov(Going_out_total ~ Age, data = data1)
summary(res.aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## Age 4 4604 1151.0 29.55 <2e-16 ***
## Residuals 6144 239286 38.9
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Meanwhile, report() would help me interpret my
results.
report(res.aov)
## For one-way between subjects designs, partial eta squared is equivalent to eta squared.
## Returning eta squared.
## Warning: 'data_findcols()' is deprecated and will be removed in a future update.
## Its usage is discouraged. Please use 'data_find()' instead.
## The ANOVA (formula: Going_out_total ~ Age) suggests that:
##
## - The main effect of Age is statistically significant and small (F(4, 6144) = 29.55, p < .001; Eta2 = 0.02, 95% CI [0.01, 1.00])
##
## Effect sizes were labelled following Field's (2013) recommendations.
Ah, so there is a statistically significant (albeit small) difference! To determine which pair/s differed, I used TukeyHSD.
TukeyHSD(res.aov)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = Going_out_total ~ Age, data = data1)
##
## $Age
## diff lwr upr p adj
## 25 to 34 years-18 to 24 years -0.2132159 -0.8772954 0.45086371 0.9057639
## 35 to 44 years-18 to 24 years -0.9077188 -1.6015390 -0.21389855 0.0033115
## 45 to 54 years-18 to 24 years -1.2249490 -2.0045370 -0.44536091 0.0001783
## 55+-18 to 24 years -2.1721722 -2.7808265 -1.56351783 0.0000000
## 35 to 44 years-25 to 34 years -0.6945029 -1.4118353 0.02282954 0.0631660
## 45 to 54 years-25 to 34 years -1.0117331 -1.8123184 -0.21114777 0.0051472
## 55+-25 to 34 years -1.9589563 -2.5942826 -1.32363006 0.0000000
## 45 to 54 years-35 to 44 years -0.3172302 -1.1426523 0.50819191 0.8325583
## 55+-35 to 44 years -1.2644534 -1.9308049 -0.59810198 0.0000023
## 55+-45 to 54 years -0.9472232 -1.7024684 -0.19197801 0.0056457
Based on basic statistical inference, the following pairs significantly differed:
And the biggest difference was between 18-24 and 55+, with 55+ year olds having the least number of outings! Looks like age does have an effect… although it’s difficult to determine the direction of the relationship.
A limitation noted in the original study was that in their analyses, the authors did not differentiate between government-mandated outings and illegal outings. They included the following ‘going-out’ variables in the study:
- Shopping for groceries/pharmacy
- Shopping for items other than groceries/pharmacy
- Meeting up with friends/family they do not live with
Government-mandated outings would only include shopping for groceries/pharmacy. Would this pattern maintain if we separated outings into the 3 different categories Smith et al studied?
First, I cleaned the data again under some different conditions, selecting the variables that would be relevant in this section of the analysis, but using the same functions I’ve been practicing throughout this whole journey.
gov <- data %>%
rename(Age = age_categories) %>%
select(Age, Adhere_shop_groceries, Adhere_shop_other, Adhere_meet_friends) %>%
rename(GroceryPharmacy = Adhere_shop_groceries,
ShopOther = Adhere_shop_other,
FriendsFamily = Adhere_meet_friends) %>%
mutate(Age = case_when(Age == 1 ~ "18 to 24 years",
Age == 2 ~ "25 to 34 years",
Age == 3 ~ "35 to 44 years",
Age == 4 ~ "45 to 54 years",
Age == 5 ~ "55+")) %>%
group_by(Age, GroceryPharmacy, ShopOther, FriendsFamily)
As there were so many variables to consider (5 age levels, 3 types of outings, and 2 levels of each outing being yes and no, with different combinations of the 3 outings), I decided to filter the data to only include instances where participants answered ‘yes’ to each respective question. This was coded as ‘1’ in all three - in the ‘GroceryPharmacy’ condition, it referred to visiting the supermarket or pharmacy more than twice in the past week, and for the remaining conditions it referred to engaging more than once in the past week.
I first wanted to do some simple counts - i.e., how many respondents there were in each age group, and how many had responded ‘yes’ to each type of outing - to give me an estimate of what I would be looking for. The former is demonstrated in this table:
agecount <- gov %>%
group_by(Age) %>%
count() %>% #counting how many respondents there were across each age group
kbl() %>%
kable_styling()
agecount
| Age | n |
|---|---|
| 18 to 24 years | 1422 |
| 25 to 34 years | 1223 |
| 35 to 44 years | 1045 |
| 45 to 54 years | 718 |
| 55+ | 1741 |
For the latter, I created individual data frames using
tabyl() for each type of outing across the age categories,
then combined them using list() and
rbindlist(). I also filtered out the ‘no’ responses using
filter().
This was the result:
tableG <- rbindlist(tibblist, use.names = FALSE) %>%
rename(Type = GroceryPharmacy) %>%
kbl %>%
kable_styling()
tableG
| Type | 18 to 24 years | 25 to 34 years | 35 to 44 years | 45 to 54 years | 55+ |
|---|---|---|---|---|---|
| GroceryPharmacy | 828 | 706 | 618 | 417 | 1191 |
| ShopOther | 791 | 764 | 735 | 558 | 1468 |
| FriendsFamily | 1092 | 966 | 918 | 661 | 1634 |
I wondered how this pattern could be visualised. Considering that
there were differing numbers of responses for each age category, it
wouldn’t make sense to plot the raw number of responses - the age group
with the most number of respondents would always show the highest number
in each outing group. Therefore, plotting based on percentage seemed to
make more sense. Since each age category had a different total, I would
have to calculate percentages based on each category. Also, since
creating a graph would require a long-form count, I couldn’t use the
previous table and had to utilise count() instead. Which
meant a long and arduous process of creating a data frame for each age
group, for each outing type, then combining them using
list() and rbindlist() again. Here is an
example of the process I went through for the 18-24 year-old group:
countag <- data %>%
filter(age_categories == 1, Adhere_shop_groceries == 1) %>% # age category 1 refers to 18-24y/os
group_by(age_categories, Adhere_shop_groceries) %>%
mutate(Adhere_shop_groceries = case_when(Adhere_shop_groceries == 1 ~ "Groceries")) %>% #to be able to differentiate between the 3, since they would otherwise all be labeled '1'
count() %>%
as.data.frame()
countao <- data %>%
filter(age_categories == 1, Adhere_shop_other == 1) %>%
group_by(age_categories, Adhere_shop_other) %>%
mutate(Adhere_shop_other = case_when(Adhere_shop_other == 1 ~ "Shop Other")) %>%
count() %>%
as.data.frame()
countaf<- data %>%
filter(age_categories == 1, Adhere_meet_friends == 1) %>%
group_by(age_categories, Adhere_meet_friends) %>%
mutate(Adhere_meet_friends = case_when(Adhere_meet_friends == 1 ~ "Meet Friends")) %>%
count() %>%
as.data.frame()
combine <- list(countag, countao, countaf)
combines <- rbindlist(combine, use.name=FALSE) %>%
rename(Type = Adhere_shop_groceries) %>% #creating a heading
mutate(age_categories = case_when(age_categories == 1 ~ "18-24y/o")) %>% #to be able to differentiate between the age categories later, when multiple would come into play
mutate(perc = (n/2711)*100) #mutating a new column to add percentage values - 2711 refers to the total number of respondents in this age category, which I manually added using the previous 'agecount' table
I was aware that this seemed to be an extremely inefficient way to
create this table, but after days thinking about this, it appeared to be
the only solution. I tried many other methods - such as creating
separate graphs for each age across the types of outings, separate graphs for each
outing across the age groups (I almost ended up using this method,
but concluded that it would be too difficult to visualise the difference
in outings between ages), graphing only the raw numbers… but this was
the only method I could think of to achieve the plot I wanted. Before
stumbling on this, I found it difficult to create a column for all the
types of outings as in the original data, they appeared as three
separate variables. But using group_by() and
count() automatically created a column detailing outing
type - and since I was only counting the ‘yes’ responses, I could rename
this response to the name of the outing. The column I was searching for,
with the different outing types, was created simply by virtue of me
merging the different data frames of the different outings. After that,
I simply renamed the top variable to ‘type’ - and suddenly my problem
was solved. I concluded that an inefficient solution is better than no
solution at all, so I proceeded to repeat this process for the remaining
4 age groups.
After this, I tried using merge() to combine all the
combined data frames from each age category - before realising that
obviously, merge() is only able to merge 2 data frames at a
time. So I merged them in pairs, then all together.
mulbine <- merge(combines, combines2, all = TRUE) #first 2
mulbine2 <- merge(combines3, combines4, all = TRUE) #second 2
mulbine3 <- merge(mulbine, mulbine2, all = TRUE) #4
mulbine4 <- merge(mulbine3, combines5, all = TRUE) #all!
Thankfully, it seemed to work.
print(mulbine4)
## age_categories Type n perc
## 1: 18-24y/o Groceries 828 30.54224
## 2: 18-24y/o Meet Friends 1092 40.28034
## 3: 18-24y/o Shop Other 791 29.17743
## 4: 25-34y/o Groceries 706 28.98194
## 5: 25-34y/o Meet Friends 966 39.65517
## 6: 25-34y/o Shop Other 764 31.36289
## 7: 35-44y/o Groceries 618 27.21268
## 8: 35-44y/o Meet Friends 918 40.42272
## 9: 35-44y/o Shop Other 735 32.36460
## 10: 45-54y/o Groceries 417 25.48900
## 11: 45-54y/o Meet Friends 661 40.40342
## 12: 45-54y/o Shop Other 558 34.10758
## 13: 55+y/o Groceries 1191 27.74284
## 14: 55+y/o Meet Friends 1634 38.06196
## 15: 55+y/o Shop Other 1468 34.19520
I finally had a data frame that was ready to be plotted. I kept it
simple with a geom_bar():
mulbinep <- mulbine4 %>%
ggplot(aes(age_categories, perc, fill = Type)) +
geom_bar(position = 'dodge', stat='identity', width=0.5) +
ggtitle("Differences in Types of Outings across Age") +
labs(x="Age Categories", y="% of people engaging in frequent outings", fill="Type of Outing") +
theme_bw()
print(mulbinep)
We determined earlier that the youngest age group of 18-24 year olds had the highest number of total outings and the oldest age group of 55+ year olds had the lowest number. However, upon separating the types of outings, it can be seen that for those that answered ‘yes’ to at least one type of outing, all age categories showed the highest proportion of outings were to visit friends and family they didn’t live with. In fact, the pattern of ‘meet friends’ having the highest proportion and ‘groceries’ (to be noted, the only legal outing) having the lowest was true across all ages apart from the 18-24 group, where ‘groceries’ was very slightly higher than ‘shop other’. Overall, the graph demonstrates that despite younger participants going out more, the types of outings irrespective of legality don’t differ across age groups. In fact, the only legal outing was frequently the least engaged-upon outing!
However, there are a few limitations to note whilst considering these results. The authors defined the ‘yes’ response to the ‘groceries/pharmacy’ question as “2 or more times in the last week”, but the ‘yes’ response to the other two types of outings was “1 or more times in the past week”. This may have skewed results - there may be a smaller likelihood of going to the grocery store twice a week if one could simply get everything they needed in one trip. Secondly, since there were varying combinations of responses (eg. yes to groceries and friends but no to other, yes to friends but no to groceries and other etc.), it was difficult to operationalise and proportion the responses correctly. I calculated this based on at least one yes response, which meant that there were many double-ups on responses for participants that answered yes to more than one type of outing - which again may have affected results. Finally, the results of this section differed from the main question in that the total number of outings was a continuous variable - as these responses were categorical, classed as ‘yes’ or ‘no’, it is impossible to know the extent of the differences between outings. This was measured with proportion of occurrence, rather than mean. Nevertheless, it is an interesting observation, though it would have been helpful if the authors included information by participants regarding how many of each type of outing they engaged in.
As discussed in Part 1, I found it surprising that there wasn’t a significant association between belief of having had COVID-19 before and perceived risk of it both towards oneself as well as others. I decided to investigate this first by trying to determine if there was an issue with the measures themselves - did they even correlate with each other?
First, I cleaned the data as usual:
data2 <- data %>%
rename(selfrisk = q10arisk,
otherrisk = q10brisk) %>%
mutate(Ever_covid = case_when(Ever_covid == 1 ~ "Yes",
Ever_covid == 0 ~ "No")) %>%
select(selfrisk, otherrisk, Ever_covid)
I ran into the same problem as I did in Question 1 in which R assigned “Has_child” as a continuous rather than a categorical variable - but this time with “Ever_covid”. And this time I knew what to do:
data2$Ever_covid <- as.factor(data2$Ever_covid) #changing into categorical
I decided to check out the descriptives first - I decided to look at
the mean and SD of the ‘self-risk’ variable across levels of the
‘other-risk’ factor under the assumption that if they both increased, it
would imply a positive correlation. I used summarise() to
calculate the values and kable_styling() to create a simple
table to display the values.
d4 <- data %>% group_by(q10brisk) %>%
summarise(
n = n(),
mean = mean(q10arisk),
sd = sd(q10arisk)) %>%
rename(OtherRisk = q10brisk) %>%
kbl %>% kable_styling()
d4
| OtherRisk | n | mean | sd |
|---|---|---|---|
| 1 | 64 | 1.812500 | 0.9900297 |
| 2 | 509 | 2.267191 | 0.7149446 |
| 3 | 2683 | 2.620201 | 0.6599169 |
| 4 | 2893 | 3.098514 | 0.7223281 |
Looks like the means of one group do steadily increase with the levels of the other.
Now, let’s visualise this using ggplot() again. Since I
was looking for a correlation this time, I decided to use
geom_smooth() rather than another column graph or box plot.
Also, since the authors measured this across levels of ‘ever_covid’, I
decided to do the same.
plot3 <- ggplot(data2, aes(selfrisk, otherrisk, group = Ever_covid, colour = Ever_covid)) +
geom_smooth(method = "lm", se = FALSE) +
scale_y_continuous(name = "Perceived Risk to Others") +
scale_x_continuous(name = "Perceived Risk to Self") +
theme_bw() +
ggtitle(label = "Perceived Risk of COVID-19 to Self vs Others")
plot3
## `geom_smooth()` using formula 'y ~ x'
Okay - there is definitely a positive trend there, for those who do and don’t believe they’ve caught COVID-19 before. A good sign… but let’s see if this is a statistically significant correlation.
To test whether there was a significant correlation, I used
cor.test():
res <- cor.test(data2$selfrisk, data2$otherrisk, method = "pearson")
res
##
## Pearson's product-moment correlation
##
## data: data2$selfrisk and data2$otherrisk
## t = 33.458, df = 6147, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.3711432 0.4134369
## sample estimates:
## cor
## 0.3924975
Looks like there is a weak, but significant positive correlation!
I was curious about the correlation between these two ‘perceived
risk’ variables and behaviour - or the total number of outings. Previous
research has shown that increased risk perception is associated with
greater protective action - did the findings of this study support this
hypothesis? I thought I would plot this information on another
geom_smooth() plot - but, first, I had to convert the
self-risk variable into a discrete variable:
data$q10arisk <- as.factor(data$q10arisk)
And now I could plot!
plot4 <- ggplot(data, aes(Going_out_total, q10brisk, group = q10arisk, colour = q10arisk)) +
geom_smooth(method = "lm", se = FALSE) +
theme_bw() +
scale_colour_discrete("Rating of Perceived Risk to Self") +
ggtitle(label = "Total Outings across Levels of Perceived Risk") +
scale_x_continuous(name = "Total No. of Outings") +
scale_y_continuous(name = "Rating of Perceived Risk to Others")
plot4
## `geom_smooth()` using formula 'y ~ x'
Though this was far from perfect, considering I was forced to convert the ‘self-risk’ variable into a discrete variable but leave the ‘other-risk’ variable as continuous, I believed it was good enough to illustrate the pattern - perceived risk to others decreased as number of outings increased. This relationship was consistent for perceived risk to self, with this rating also being positively correlated with risk to others.
I then created a correlation matrix based on these observations using
the cor() function - but to do this, I first had to convert
the perceived self-risk ratings back to a numeric variable:
data$q10arisk <- as.numeric(data$q10arisk)
And here is the correlation:
data4 <- data %>% select(Going_out_total, q10arisk, q10brisk) %>% cor()
print(data4)
## Going_out_total q10arisk q10brisk
## Going_out_total 1.00000000 -0.08565588 -0.1842952
## q10arisk -0.08565588 1.00000000 0.3924975
## q10brisk -0.18429519 0.39249749 1.0000000
Looks like while there are small negative correlations between outings and perceived risk, the strongest correlation is between the 2 forms of perceived risks themselves.
I wanted to create a visual representation of the strengths of the
correlations using corrplot() - however, as I only had 3
variables, this didn’t appear to be possible. Since the question stemmed
from the worry variable, I thought I may as well include it and find out
the correlation.
data5 <- data %>% select(Going_out_total, q10arisk, q10brisk, q9worry) %>% cor()
print(data5)
## Going_out_total q10arisk q10brisk q9worry
## Going_out_total 1.00000000 -0.08565588 -0.1842952 -0.1781434
## q10arisk -0.08565588 1.00000000 0.3924975 0.4702685
## q10brisk -0.18429519 0.39249749 1.0000000 0.4385310
## q9worry -0.17814340 0.47026850 0.4385310 1.0000000
Same results as before, but with the new ‘worry’ variable. But let’s see the correlation matrix:
corrplot(data5)
Looks like there is a relatively strong relationship between worry and self-perceived risk. Apart from this relationship, there is also a positive correlation between the two forms of perceived risk, and a weak negative correlation between number of outings and the other three factors - which is to be expected, as staying home would be a protective measure/form of behaviour consistent with those attitudes.
So it looks there there isn’t much wrong with the perceived risk variables.
There is a correlation between perceived of COVID-19 to oneself vs others! This is also weakly correlated with behaviour, as operationalised by total number of outings, as well as general worry.
My first recommendation to Smith et al would be to include a README file. This refers to a text file that contains an introduction and information about the data set, providing instructions on how the authors used the code and how one might go about reusing it, especially aimed at novices to the data. Typically, this would include the purpose of the file, a description of the methodology and tools/technology, and a list of all the conventions and abbreviations used in the data set to allow for easier understanding. Considering the authors did not provide this file, nor a coding script, it was difficult to know where to start, particularly as we were complete novices. They failed to consider the readers’ point of view – guidance on the steps they took would provide a clearer starting point. Although it should be noted that Smith et al’s variables were well-labelled and hence it was easy to infer which variables were relevant to analysis, a README would increase the ease and smoothness of the reproducibility process. This document provides some basic information on the conventions of a typical README file and an exemplar, and this article explains the benefits of a good README file.
My second recommendation to the authors would be to include a data dictionary, or code book, in their OSF repositories. This is essentially a document that describes the data set, detailing their meanings and labeling the variables measured including details such as how the variable was measured, and the units used. All the categorical data that was provided in the OSF repositories was coded in 0s and 1s, which made it difficult for us (and anyone else trying to interpret it) to determine the meanings. For example, gender was coded in this way, and we were unsure which number corresponded to which gender. This was exacerbated in variables such as age category and region, where there were 5 possible responses. Though it was possible for us to determine this by counting the number of participants in each category and cross referencing it with the figures published in the paper, a simple, coherent code book detailing what each response corresponded to. Further, although the .sav file did provide a brief description of what each variable measured eg. For ‘q7beentested’, a line stating, “Have you been tested for coronavirus?”, there were some variables that were unclear. Most prominently, the variable ‘condition_antibodies’ was labelled “assigned to experimental condition” – but was not featured in any of the figures, and no other information was provided. To our knowledge after reading the paper, there was no experimental condition, leaving us extremely confused as to what this variable could be measuring. Further, as all the responses were coded in 0s and 1s, it was also impossible to determine from there.
That being said, I do believe Smith et al should be praised for using readable variable names; even when labeling descriptions were vague, it was possible to make an educated guess based on its name eg. ‘q8haveimmunity’ corresponding to the extent to which the participant believed they had immunity. On the other hand, it would help to remove the question number in each of the variables to increase human-readability. A code book would also reduce confusion regarding measurement units on the numerical variables – for example, the worry and immunity variables were rated on a scale of 5 whereas the perceived worry variables were rated on a scale of 4. Though it was possible to infer this by reading the paper, this path could have been simplified through the code book detailing the measurement units for every variable. Finally, it was noted in Part 2 that the authors made a mistake in the labeling of the ‘shopping for groceries/pharmacy’ variable, confusing the meanings of 0 and 1 (and we determined this by noticing a mismatch in the percentage values of Figures 2 and 3). A code book may have helped them in avoiding this.
Overall, providing human-readable details to support clear and easy interpretation and analysis of one’s data should be best practice. This website provides an excellent starting point, but there is definitely a wealth of resources online!
My final recommendation to Smith et al would be to provide their coding script and their analysis code. While coding script by itself may have been overwhelming, complemented by the previous two recommendations of a README file and codebook, it would have been very helpful. Particularly as we were very new to R, we had no idea where to start, what functions they used nor how they managed their data and hence ran into issues when trying to reproduce the descriptives and plots. We couldn’t be sure if the way we were producing the figures was the same way the authors did, as we were not provided any information on how they do it. We didn’t have a choice but to write our own code – this was extremely time-consuming as the entire process consisted of trial and error. It was essentially attempting an exam with no answer sheet. For example, we created data frames for each variable – was there a more efficient way to count all the values without doing this? When we stumbled upon a solution, like this, we couldn’t be certain if it was correct or if there was a more efficient way to go about it. Secondly, as the file in the OSF repository was a .sav file, it could be assumed that the authors conducted their analyses on SPSS. As this is not an open-source software, it is difficult to determine what steps the authors took to conduct their analysis, which was particularly problematic when attempting reproducibility. For example, the authors reported ‘little evidence of an association between perceived worry and belief of having had COVID-19 before’ – but it was difficult to recreate this test of significance without any instructions, guidance, or even a starting point. Whilst this was definitely a great learning experience (forcing us to learn how to code!), albeit the steep learning curve, I believe that the authors including their coding script, accompanied by the former recommendations, would increase the efficiency of the reproducibility process and hence the quality of open science.