HTML Link: https://rpubs.com/junekam17/verificationreport

Paper: Smith L., Mottershaw A., Egan M., Waller J., Marteau T., et al. (2020) The impact of believing you have had COVID-19 on self-reported behaviour: Cross-sectional survey. PLOS ONE 15(11): e0240399. https://doi.org/10.1371/journal.pone.0240399 pmid:33147219

Directory

Part 1: Summary and Reaction
- Summary
- Reaction
Part 2: Verification
- Figure 1
- Figure 2
- Figure 3
- Figure 4
Part 3: Exploration
Part 4: Recommendations

Part 1: Summary and Reaction

Summary

COVID-19 was, and remains, a great cause for concern in public health, and the primary method we have used to combat its effects is the presence of lockdown laws and government mandates. Adherence to these measures is essential in decreasing transmission rates, saving countless lives. However, those who believe they have already had COVID-19 may not adhere to these measures as they believe they have immunity against the virus. This may be problematic due to potential false test results and inaccurate self-diagnosis, but primarily because our knowledge regarding immunity is limited; it remains unknown whether we can be infectious more than once (Waller et al (2020)). For this reason, it is important to know the factors that may affect adherence to lockdown laws could inform us on targeted interventions to reduce them; however, at time of publication, there was no literature about whether belief of already having COVID-19 affects adherence to lockdown measures or attitudes towards it.

Smith et al (2020) investigated whether this belief affected self-reported behaviour, in terms adherence to lockdown measures, and beliefs early in the pandemic by employing an online cross-sectional survey of 6149 participants. Participants were asked how worried they were about COVID-19, whether they believed they had already had it, whether they believed they had immunity to it, and the extent to which they thought COVID-19 posed a risk to themselves and to others. They were also asked to indicate the number of outings they had for various reasons, both government-mandated as well as not, as well as total number of outings. They also recorded participants’ social and demographic characteristics.

They found that those who believed they already had COVID-19 were more likely to believe they had immunity, and also less likely to adhere to social distancing in that they had a higher total number of outings. They were also less likely to be able to identify symptoms. Surprisingly, there was no correlation between belief of having it and perceived risk. There were several limitations, such as complete reliance on self-report measures and the cross-sectional design prevented researchers from determining the direction of associations. They also noted that the study may have been overpowered, in that the large sample size meant even small differences were statistically significant. However, it may have important implications in guiding educational communications targeted at those who believe they have had COVID-19 regarding immunity (or lack thereof) and the importance of continuing to adhere to lockdown measures.

Reaction

I can see that this relates to real world public health and safety implications, particularly relevant in this case as public attitudes guide social norms regarding compliance, which in turn affect the rate of transmission. It could also have important implications in public health policy, which could impact the lives of the population in terms of health as well as economic and social wellbeing. As Smith et al discuss, it may lead to a vicious cycle in which belief in having had coronavirus leads to lower adherence, leading to lowered social expectations, further decreased adherence and an even larger proportion of the population believing they have caught the disease due to increased perceived exposure. It reminds me of a paper by Waller et al (2020), which investigated the impact of ‘immunity passports’ and found that participants who underwent an ‘immunity’ test instead of an ‘antibody’ test were more likely to perceive no risk of catching coronavirus. This forces consideration of the connotations and public perceptions of the concept of ‘immunity’. Following from this would be consideration of the Health Belief Model (Acosta & Wolfe, 2008), typically used to guide disease prevention programs. The model predicts individual health-related behaviours based on the drivers of cues to action and perceived risk, severity, benefits, self-efficacy, and barriers to action. This study could definitely be used in conjunction with the Model.
I was surprised that there was little evidence for the association between belief of having had COVID-19 before and perceived risk of it, particularly as it contradicts existing evidence that increased risk perception is associated with greater protective action (Brewer et al, 2004) – in this case, adhering to lockdown measures. However, their reckless behaviour instead indicated a lack of perceived risk. It seems logical that this belief would be positively correlated with both worry and perceived risk – what makes this factor stand out? Although the authors noted that this seemed unusual, they did not offer any alternatives. Though perhaps it may have been an issue with the power of the study – but it was overpowered, meaning it would have been in the opposite direction. Another potential factor may be the difference in scales (worry was measured on a scale of 1-5 whereas perceived risk on a scale of 1-4), or maybe an error in the self-report and perception of the ‘perceived risk’ questions themselves.
It seems that the next step in this area of research would be to investigate the sources of misinformation and miscommunication that led to such a high proportion of those believing they had COVID-19 to believe they were immune. It would be worth developing ways to minimise or mitigate these sources as well as investigating ways to effectively communicate with this subset of the population surrounding the importance of adhering with lockdown measures. Since the publication of this paper, Gonzalez-Castro et al (2021) have also studied the impact of perceived vulnerability predicting adherence to measures and found that high perceived vulnerability was linked to higher adherence. Although using the opposing measure of vulnerability rather than immunity, this lines up with Smith et al’s results. This could also be further explored by investigating which subsets of the population (eg. Age groups, education level, those with children) are most affected by the phenomenon, allowing targeted interventions. Finally, the factor of ‘perceived risk’ should be further investigated – why was it not correlated with the other measures? Is it an internal or external issue?

Part 2: Verification

We obtained the open data from the OSF repository. Our goal was to reproduce the following 4 figures:

Figure 1: Table of Participant Characteristics and Self-Reported Beliefs on COVID-19
Figure 2: Plot of Results
Figure 3: Table of Results
Figure 4: Table of Adherence to Lockdown Measures

Figure 1

We attempted to reproduce Figure 1 first as published in the study.

Load Packages

We first loaded the packages to be used throughout our entire reproducibility journey.

library(tidyverse) #for most of the important functions

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.7     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(dplyr) # more important functions
library(gt) #making tables
library(gtExtras) #more tables
library(janitor) #wrangling

## 
## Attaching package: 'janitor'

## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

library(patchwork) #plots
library(data.table) #renaming

## 
## Attaching package: 'data.table'

## The following objects are masked from 'package:dplyr':
## 
##     between, first, last

## The following object is masked from 'package:purrr':
## 
##     transpose

library(gtExtras) #for merging data 
library(readr)
library(glue)
library(data.table) #table formatting
library(scales) #more table formatting

## 
## Attaching package: 'scales'

## The following object is masked from 'package:purrr':
## 
##     discard

## The following object is masked from 'package:readr':
## 
##     col_factor

library(tidyr) #to tidy messy data
library(knitr) #to knit to word/pdf later

Read Data

The data in the OSF repo was in the .sav format from SPSS. Though it was possible to work in this format, we found it easier to use the .csv as it told us what each of the variables meant - therefore, we used read_csv() to achieve this. This is from the readr package, contained within the tidyverse package. We named the data “data1”.

setwd('/Users/junekam/downloads')
data <- read_csv("Dataset - impact of believing you've had covid-19.csv")

## Rows: 6149 Columns: 19
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (19): Ever_covid, q7beentested, q8haveimmunity, q9worry, q10arisk, q10br...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Clean Data

After reading the data, we had to clean it up so it was easier for us to work with. This included (all from the tidyverse package):

Using rename() to rename the variables
Using select() to select the relevant variables
Using mutate() to change the response codes from numbers to their meanings

Finally, we determined which code corresponded to which response by using the tabyl() function from the janitor package, which counts the frequency of respondents, and matching accordingly.

data1 <- data %>% 
  rename(think_covid = Ever_covid,
         Age = age_categories,
         Gender = gender) %>% 
  select(think_covid, Age, Gender, Age, degree, Has_child, Working, Key_worker, region) %>% 
  mutate(Gender = case_when(Gender == 1 ~ "Male",
                            Gender == 2 ~ "Female")) %>% 
  mutate(think_covid = case_when(think_covid == 1 ~ "Have COVID-19",
                                 think_covid == 0 ~ "Have Not Had COVID-19")) %>% 
  mutate(degree = case_when(degree == 0 ~ "GSCE/vocational/A-level/No formal qualifications",
                            degree == 1 ~ "Degree or higher (Bachelors, Masters, PhD)")) %>% 
  mutate(Age = case_when(Age == 1 ~ "18 to 24 years",
                         Age == 2 ~ "25 to 34 years",
                         Age == 3 ~ "35 to 44 years",
                         Age == 4 ~ "45 to 54 years",
                         Age == 5 ~ "55+")) %>% 
  mutate(Has_child = case_when(Has_child == 0 ~ "No",
                               Has_child == 1 ~ "Yes")) %>% 
  mutate(Working = case_when(Working == 0 ~ "Not Working",
                             Working == 1 ~ "Working")) %>% 
  mutate(Key_worker = case_when(Key_worker == 0 ~ "No",
                                Key_worker == 1 ~ "Yes")) %>% 
  mutate(region = case_when(region == 1 ~ "Midlands",
                            region == 2 ~ "South & East",
                            region == 3 ~ "North",
                            region == 4 ~ "London",
                            region == 5 ~ "Wales, Scotland and Northern Ireland"))

We were unsure what the ‘condition_antibodies’ variable was referring to, considering that it sounded like it was related to participant belief in their own immunity - however, there was already another immunity variable. Although the authors made the small note of ‘assigned to experimental condition’, this didn’t give us much information, and it’s worth noting that it definitely could have been described in more detail. As a result of this, we didn’t include it in our reproducibility journey.

Creating Data Frames

The next (painstaking) step was to create a data frame for each demographic characteristic against the ‘think_covid’ variable that was the primary focus of the study and hence the main DV, referring to whether or not the participant believed they had caught COVID-19 before. Before we knew how to deal with R efficiently, we made this easy task seem much harder than it was - here’s my first attempt at accomplishing this.

data2 <- data1 %>% 
  tabyl(Age, think_covid)
tab_2 <- data2 %>% gt() 

tab_2

Age	Have COVID-19	Have Not Had COVID-19
18 to 24 years	419	1003
25 to 34 years	400	823
35 to 44 years	294	751
45 to 54 years	164	554
55+	216	1525

That seemed fine… except for the fact that at the time, I didn’t know how to combine all these small gt tables I had. But the use of tabyl() definitely set my teammates and I on the right track!

An example of this with the “gender” variable:

datagender2 <- data1 %>% 
  select(Gender, think_covid) #creating the data frame and selecting the relevant variables

tgender <- datagender2 %>% 
  tabyl(Gender, think_covid) #used to count no. of respondents in these variables, to be included later in the final table

gtgender <- gt(tgender) #making it into a nice-looking table
datagender <- gtgender$'_data' %>% #allows us to merge gt tables
  arrange(factor(Gender, levels = c('Male','Female'))) #reordering the categories to recreate the figure


gtgender

Gender	Have COVID-19	Have Not Had COVID-19
Female	796	2459
Male	697	2197

As you may have noticed, there was a line in that chunk that is labeled with ‘allows us to merge gt tables’. This was a struggle for us for a long time, but thankfully with a lot of Googling and trying to apply our situation to that of the Stackoverflow questions, we found the solution.

Let’s compare this to the original figure:

Great, all the numbers looked right - it was just the formatting that was different, but we could fix that later. We repeated this process for the age, child, employment, key sector, education, and region variables. As we only included in our data set those that were eligible and provided responses, we excluded those that provided an “NA” response - for this, we used filter != "NA" to filter out these participants. We did this for the child, employment, and education variables. We also used arrange(factor()) to arrange the levels of factors that appeared in the wrong order, changing them to appear as they did in the figure.

#participants by age
data_Age <- data1 %>% 
  select(Age, think_covid) 
tage <- data_Age %>% 
  tabyl(Age, think_covid)
gtage <- gt(tage)
dataage<- gtage$'_data'

#by child
datachild <- data1 %>% 
  select(Has_child, think_covid) %>% 
  filter(Has_child != "NA")
tchild <- datachild %>% 
  tabyl(Has_child, think_covid)
gtchild <- gt(tchild)
datachild <- gtchild$'_data'

#employment status
dataworking <- data1 %>% 
  select(Working, think_covid) %>% 
  filter(Working != "NA")
tworking <- dataworking %>% 
  tabyl(Working, think_covid)
gtworking <- gt(tworking)
dataworking <- gtworking$'_data'

#working key sector
tkey <- data1 %>% 
  tabyl(Key_worker, think_covid)
gtkey <- gt(tkey)
datakey <- gtkey$'_data'

#highest educational or professional qualification
datadegree <- data1 %>% 
  select(degree, think_covid) %>% 
  filter(degree != "NA") 
tdegree <- datadegree %>% 
  tabyl(degree, think_covid)
gtdegree <- gt(tdegree)
datadegree <- gtdegree$'_data' %>% 
  arrange(factor(degree, levels = c('GSCE/vocational/A-level/No formal qualifications','Degree or higher (Bachelors, Masters, PhD)')))

#region
tregion <- data1 %>% 
  tabyl(region, think_covid)
gtregion <- gt(tregion)
dataregion <- gtregion$'_data' %>% 
  arrange(factor(region, levels = c('Midlands', 'South & East', 'North', 'London', 'Wales, Scotland and Northern Ireland')))

It definitely wasn’t easy for us to get to this point - very common was the error “Column - Gender (or whichever variable was deciding to act up that day) doesn’t exist”. Though this was extremely frustrating, since we knew that it definitely did exist, it was one of the first lessons we learnt about troubleshooting: it’s usually something small, and could be as small as a typo, or a lowercase letter where a capital should be, or that we started a new data set or rscript and forgot to rename the variable. We learnt to be very careful with reading through our code because there were a million small errors that could easily sneak past.

Creating Percentages

Next, since the table also displayed percentages, we had to calculate these for each variable as well. These are in-built functions in R.

perGender <- with(data1, table(Gender, think_covid)) #recreates the data frame in a table that allows us to evaluate expressions - in this case percentage
pGender <- prop.table(perGender, 1)*100  #turning the table of counts into a table of percentages

prop.table() created proportion counts, so by calculating these and multiplying them by 100, we were able to create percentage values that could be used later in the final table.

Combining the Data

After we had all the data we needed - i.e., the counts and percentages for each variable as an expression of ‘think_covid’, it was finally time to combine everything. This was tricky considering it was the first figure we attempted after learning R, and hence were unaware of the various formats it could take - but luckily we figured it out relatively quickly. First we combined the data frames with the counts using list(), then used rbindlist() to combine them into one big data table - from the data.table package.

listfull <- list(datagender, dataage, datachild, dataworking, datakey, datadegree, dataregion) #combined all the data frames into a list comprised of the 7 different data tables
fulltibble <- rbindlist(listfull, use.name=FALSE) #converted the list into 1 big combined data table as rows
fulltibble <- as_tibble(fulltibble) #converted it into a tibble
fulltibble1 <- fulltibble %>% #create column names
  rename(ycovid = "Have COVID-19",
         ncovid = "Have Not Had COVID-19")

We also renamed the think_covid columns to have appropriate headings again using rename().

But it wasn’t always so easy. The first approach we tried was creating data frames for each variable, like we did in the end, but only combining 2 at a time and then using inner_join() to combine them all. Here’s an excerpt of our previous code:

#merge tables

data_Gender <- data1 %>% 
  select(Gender, think_covid)

data_Age <- data1 %>% 
  select(Age, think_covid)

#set tables as data.frames

t1 = data.frame(data_Gender)
t2 = data.frame(data_Age)

dfgender <- data.frame(data_Gender)
dfage <- data.frame(data_Age)

#merge based on row names

gender_age = merge(t1, t2, by=0)
gt_gender_age <- gt(gender_age)

combined <- inner_join(data_Age, data_Gender, by = "think_covid")

Not only would it have been a painstakingly long process, this is what the output for our supposed ‘merged’ table would have looked like:

While we thought inner_join() was the solution to our problems, it definitely wasn’t! Luckily we managed to get through the process. Once we understood it, it was easy to repeat this process for the data frames with the percentages. In this case, we used rbind() from the dpylr package rather than rbindlist().

ptibble <- rbind(pGender,pAge,pChild,pEmp,pKey,pEdu,pRegion) #as before, but with percent tables
ptibble <- as_tibble(ptibble)
ptibble <- data.frame(lapply(ptibble, function(x) if(is.numeric(x)) round(x,2) else x)) #to ensure they all remained at 2 decimal places
ptibble1 <- ptibble %>% #match column names
  rename(ypercent = "Have.COVID.19",
         npercent = "Have.Not.Had.COVID.19")

Next was the trickiest part: combining the full tables for counts and percentages in the correct format. We ran into many problems with this, most prominently that attempting the unite() and paste() would work in combining them, but with many, many decimal points that even our previous mitigating code wouldn’t be able to counter. We also tried using gt_merge_stack(), which we were so convinced was the answer! Until…

An example of our table using gt_merge_stack() to merge the counts and percentages

While it did technically show the information, it was definitely not what we were looking for. There were too many decimal places, and it ‘stacked’ instead of putting the numbers side by side. Plus for some reason, the percentages refused to remain at 2 decimal places, despite our previous code ensuring this! After playing around with more merging functions, and lots of Googling, we stumbled across paste0 allowed us to proceed. Note that the following output, in the code, appears as one large table, but was unable to fit width-wise into this format.

pfull <- cbind(fulltibble1, ptibble1) #combine tables as columns 
pfull$havenocovid <- paste0(pfull$ncovid, " (", pfull$npercent, ")") #format percentages to be next to counts in brackets
pfull$havecovid <- paste0(pfull$ycovid, " (", pfull$ypercent, ")") #for both think_covid variables

pfull

##                                              Gender ycovid ncovid ypercent
## 1                                              Male    697   2197    24.45
## 2                                            Female    796   2459    24.08
## 3                                    18 to 24 years    419   1003    29.47
## 4                                    25 to 34 years    400    823    32.71
## 5                                    35 to 44 years    294    751    28.13
## 6                                    45 to 54 years    164    554    22.84
## 7                                               55+    216   1525    12.41
## 8                                                No    621   2005    23.65
## 9                                               Yes    776   2386    24.54
## 10                                      Not Working    357   1714    17.24
## 11                                          Working   1124   2871    28.14
## 12                                               No    753   3105    19.52
## 13                                              Yes    740   1551    32.30
## 14 GSCE/vocational/A-level/No formal qualifications   1060   3382    25.70
## 15       Degree or higher (Bachelors, Masters, PhD)    415   1200    23.86
## 16                                         Midlands    251    781    29.90
## 17                                     South & East    416   1369    24.32
## 18                                            North    335   1120    23.02
## 19                                           London    299    701    23.31
## 20             Wales, Scotland and Northern Ireland    192    685    21.89
##    npercent  havenocovid    havecovid
## 1     75.55 2197 (75.55)  697 (24.45)
## 2     75.92 2459 (75.92)  796 (24.08)
## 3     70.53 1003 (70.53)  419 (29.47)
## 4     67.29  823 (67.29)  400 (32.71)
## 5     71.87  751 (71.87)  294 (28.13)
## 6     77.16  554 (77.16)  164 (22.84)
## 7     87.59 1525 (87.59)  216 (12.41)
## 8     76.35 2005 (76.35)  621 (23.65)
## 9     75.46 2386 (75.46)  776 (24.54)
## 10    82.76 1714 (82.76)  357 (17.24)
## 11    71.86 2871 (71.86) 1124 (28.14)
## 12    80.48 3105 (80.48)  753 (19.52)
## 13    67.70  1551 (67.7)   740 (32.3)
## 14    74.30  3382 (74.3)  1060 (25.7)
## 15    76.14 1200 (76.14)  415 (23.86)
## 16    70.10   781 (70.1)   251 (29.9)
## 17    75.68 1369 (75.68)  416 (24.32)
## 18    76.98 1120 (76.98)  335 (23.02)
## 19    76.69  701 (76.69)  299 (23.31)
## 20    78.11  685 (78.11)  192 (21.89)

Creating the Table

Then the final step before making it into a full table - selecting and renaming the relevant variables in their respective categories using subset(), from base R.

pfull <- subset(pfull, select = c(Gender,havenocovid, havecovid)) %>% 
  rename(Level = Gender)

However, for a long time we were unable to create a stub and could only create row group headers. We almost considered just giving up and just leaving it like this…

An example piece of our table where we were unable to create the stub and instead had the participant characteristics as row group headers

While it would work and admittedly did serve the purpose of clearly displaying the data, we were determined to make it work and achieve our goal exactly how we set out to. We eventually realised that it could be best to create a whole new vector for the participant characteristics, separate from the features of the actual table, and just format them based on how we saw appropriate. This allowed us to group all the relevant categories together. We used c() to do this; however, as this function works by naming each row, we had to account for the spaces as well by creating blank rows, like so:

pchar <- c("Gender","","Age","","","","","Have a child","","Employment status","",
           "Working in key sector","","Highest educational or professional qualification","",
           "Region","","","","")

From there, it was easy to combine this column with the rest of the table using cbind().

pfull1 <- cbind(pchar, pfull) #combining the "participant characteristics" column with the rest of the table

Now the final step - making it into a nice-looking table, so all the important information could be easily displayed! This was done mainly by using the gt package - after some experimentation, we were able to reproduce the figure exactly as was shown in the paper!

gtfull2 <- gt(pfull1) %>% #creating a gt table!
  tab_spanner( 
    label = "Had COVID-19",
    columns = c(3:4) #creating a header for the think_covid columns
  ) %>%
  cols_label(
    pchar = "Participant characteristics" #labeling the columns as in the original figure
  ) %>% 
  cols_label(
    "havenocovid" = "Think have not had COVID-19 n = 4656 (%)"
  ) %>% 
  cols_label(
    "havecovid" = "Think have had COVID-19 n = 1493 (%)" 
  ) %>% 
  tab_style(cell_text(weight = "bold"), #bolding relevant text
            locations = cells_column_labels(columns = everything())) %>% 
  tab_style(cell_text(weight = "bold"),
            locations = cells_column_spanners())

gtfull2

Participant characteristics	Level	Had COVID-19
Participant characteristics	Level	Think have not had COVID-19 n = 4656 (%)	Think have had COVID-19 n = 1493 (%)
Gender	Male	2197 (75.55)	697 (24.45)
	Female	2459 (75.92)	796 (24.08)
Age	18 to 24 years	1003 (70.53)	419 (29.47)
	25 to 34 years	823 (67.29)	400 (32.71)
	35 to 44 years	751 (71.87)	294 (28.13)
	45 to 54 years	554 (77.16)	164 (22.84)
	55+	1525 (87.59)	216 (12.41)
Have a child	No	2005 (76.35)	621 (23.65)
	Yes	2386 (75.46)	776 (24.54)
Employment status	Not Working	1714 (82.76)	357 (17.24)
	Working	2871 (71.86)	1124 (28.14)
Working in key sector	No	3105 (80.48)	753 (19.52)
	Yes	1551 (67.7)	740 (32.3)
Highest educational or professional qualification	GSCE/vocational/A-level/No formal qualifications	3382 (74.3)	1060 (25.7)
	Degree or higher (Bachelors, Masters, PhD)	1200 (76.14)	415 (23.86)
Region	Midlands	781 (70.1)	251 (29.9)
	South & East	1369 (75.68)	416 (24.32)
	North	1120 (76.98)	335 (23.02)
	London	701 (76.69)	299 (23.31)
	Wales, Scotland and Northern Ireland	685 (78.11)	192 (21.89)

To compare it to the original, here it is again:

As you can see, the tables are nearly identical. However, through this comparison, we did notice a discrepancy in the percentage values of male and female - however, as the numbers of the count matched, we believe this was the authors’ mistake. Overall, although it took a significant amount of experimentation and lots of trial and error, we had our first triumph!

Figure 2

Secondly, we attempted to reproduce Figure 2, the plot of results.

Clean Data

Much like in Figure 1, our first task was to clean the data using:

rename() to rename the variables in a way that was easier to work with - such as rename(immunity = q8haveimmunity)
select() to select the relevant variables
mutate() to label the coded responses

data2 <- data %>% 
  rename(immunity = q8haveimmunity,
         gshop = Adhere_shop_groceries,
         oshop = Adhere_shop_other,
         friends = Adhere_meet_friends,
         outings = Going_out_total,
         noworry = q9worry,
         rself = q10arisk,
         rothers = q10brisk,
         covidknow = Sx_covid_nomissing,
         think_covid = Ever_covid) %>% 
  select(immunity, 
         gshop, 
         oshop, 
         friends, 
         outings,
         noworry,
         rself,
         rothers,
         covidknow,
         think_covid) %>% 
  mutate(think_covid = case_when(think_covid == 0 ~ "Think have not had COVID-19",
                                 think_covid == 1 ~ "Think have had COVID-19"))

Creating individual data sets

Like Figure 1, Figure 2 was also made up of various different data sets - therefore we used the same approach in again creating individual data sets. However, there were 2 parts to this:

Creating a data set to count for think_covid

As the whole plot was measuring these variables across the person’s belief that they have had COVID, we had to create an overall count for this factor that could be used across all the other data sets. We did this using group_by() to ensure it was counting the correct variable, then using the count() function to (as the name suggests) count. Since we were grouping all the following variables by ‘think_covid’, the main variable in the study, we didn’t bother adding ungroup() as it would just be redundant.

covid <- data2 %>% 
  select(immunity, think_covid) %>% #selecting the relevant variables - immunity as a placeholder. The real variable of interest is think_covid
  group_by(think_covid) %>% #so it knows to sort it by this
  count() %>% #count no. of people in each category
  rename(covidtotal = n) #to avoid confusion later

Creating a data set to count for all the other variables

This was the same process as the step above, however consisted of filtering specific instances of each measure. For example, the ‘immunity’ column in the graph was labeled “strongly agree that they are immune”, suggesting that they only included participants that scored 5 out of 5 on the scale in the plotted proportion. Therefore, we had to use filter() to filter only responses that were 5, ensuring that all other responses were excluded. Then we used mutate to rename according to the labels in the graph. Again, we used count() to count the number of participants in this category, and group_by() to select the relevant variables, but since each variable would only be counted once, we had to use ungroup() so that this filter wouldn’t apply to the following data sets of other variables. Finally, we used rename() to rename the ‘n’ count to a variable-relevant name so we wouldn’t get confused, since all of the other variable-counts would also be called ‘n’! Here is our data for the immunity variable - we followed the same process for each variable in this section.

dimmunity <- data2 %>% 
  select(immunity, think_covid) %>% 
  filter(immunity == "5") %>% #filter so only those who responded with 5 are included
  mutate(immunity = case_when(immunity == 5 ~ "Strongly agree that they are immune")) %>% #rename
  group_by(think_covid, immunity) %>% 
  count() %>% 
  ungroup() %>% #so this filter doesn't apply to following sets
  rename("stronglyagree" = "n")

Calculating percentages and creating a new column

As percentages were the y-axis component on this plot, for each variable, we had to:

Merge the counts of think_covid and the variable using merge()
Create a new column using mutate() and calculate this value by manually inputting the formula. This formula would be the ‘n’ value for each variable divided by the total number of included participants, creating a proportion, then multiplying by 100 to create a percentage value that could be used in the plot

In the case of immunity:

dimmunity <- merge(x = dimmunity, y = covid, by = "think_covid") %>% #where 'covid' is the think_covid count
  mutate(strongpercent = stronglyagree*100/covidtotal) #creating the percentage column

We repeated this for all other variables as well.

Merging the data

After we had the percentages for each relevant variable, we had to combine them all into one tibble. We used the same strategy as in Figure 1, i.e.:

list() function to combine them into a list
rbindlist() to convert it into one big data frame

Like so:

listdata2 <- list(dcovidknow,
                  drothers,
                  drself,
                  dnoworry,
                  doutings,
                  dfriends,
                  doshop,
                  dgshop,
                  dimmunity)

tdata2 <- rbindlist(listdata2, use.name=FALSE) 
tdata2<- subset(tdata2, select = -c(idcovidsym, covidtotal)) %>% 
  rename("groups"=covidknow) %>% 
  rename(Percentage = idcovidsympercent) %>% 
  group_by(think_covid)

Plotting the data

Finally the exciting part of converting the data into a visually appealing plot! We had all the information - our first step in plotting was to manually reorder the variables to appear as they do in the study. We did this by using mutate() to create a new column, then within this, using fct_relevel()to reorder them:

plotdata2 <- tdata2 %>% mutate(groups = fct_relevel(groups,
                                                    "Strongly agree that they are immune",
                                                    "Shopping for groceries/pharmacy two or more times",
                                                    "Shopping for items other than groceries/pharmacy",
                                                    "Met up with friends/family they do not live with",
                                                    "Out-of-home activity, 8 or more outings",
                                                    "Not worried at all",
                                                    "Perceive no risk at all to themselves",
                                                    "Perceive no risk at all to others",
                                                    "Did not identify common symptoms"))

It was here we realised that there was actually a mistake in this figure. The columns for “shopping for groceries/pharmacy two or more times” showed roughly 35% and 45% correspondingly - but cross-referenced with Figure 4, those percentage values actually matched up with the data for “shopping for groceries/pharmacy on one or fewer days”. This suggested that either Figure 2 or Figure 4 was incorrect! This was likely due to a typo in the code as in the original data, ‘0’ was coded as “one or fewer days” whereas ‘1’ was coded as “two or more times”. But moving on..

Then, like in Figure 1, we created a vector for the 2 levels of categories (think_covid) and named them accordingly:

plotdata2_new <- plotdata2
plotdata2_new$think_covid <- factor(plotdata2_new$think_covid,
                                    levels = c("Think have not had COVID-19", "Think have had COVID-19" ))

Now we could finally see the plot! For this, the ggplot2 package was essential. We used ggplot() for this - a super handy tool and very easy to use once you’ve got all your data in order. We formatted the plot until it replicated the original Figure 2.

plot2 <- ggplot(plotdata2_new, aes(fill=think_covid, 
                                   y= Percentage,
                                   x= groups)) +  #plotting the data
  geom_bar(position ='dodge', stat='identity', width=0.5) + #making it a bar chart, with bars next to each other for the think_covid factor
  scale_fill_grey()+ # scale_color_manual(labels = c("Think have not had COVID-19", "Think have had COVID-19"), #changing colours of the bars
  #values = c("black", "grey")) 
  theme(axis.text.x = element_text(angle = 45, vjust = 0.9, hjust = 1), #changing the text settings
        panel.background = element_rect(fill = "white", colour="white"), #general aesthetics
        panel.grid.major.y = element_line(size = 0.5, colour="grey"),
        panel.grid.minor.y = element_line(size = 0.5, colour="grey"),
        panel.grid.minor.x = element_blank(),
        panel.grid.major.x = element_blank(), 
        legend.position="bottom", #designing the legend
        legend.title=element_blank(),
        axis.title.x = element_blank(), #axis titles
        axis.title.y = element_text(margin = margin(t = 0, r = 50, b = 0, l = 0))) +
  scale_y_continuous(limits = c(0,60), breaks = c(0,10,20,30,40,50,60)) #scale lengths and intervals

Leading to an output looking like this:

plot2

Again, let’s compare to the original figure:

Yes! Figure 2 complete - halfway there. Figure 2 seemed easier than Figure 1 - firstly, because we were a bit more used to using R and definitely knew a lot more. But also because we had the guidance of Danielle’s video modules, which didn’t touch on tables - but covered plots!

Figure 3

Next we moved on to Figure 3. At this point, we had gotten a handle of using R and had split up into pairs to work on the last 2 figures - but Figure 3 was different because we had to learn how to include descriptives in our table. However, at this point, we had learnt so much about R and troubleshooting, and our ability to Google using the right search terms (we actually knew what the people on the forums were talking about!) and going to the right websites (such as beginner-friendly blog posts rather than official websites) had advanced so much, we were able to move on swiftly.

Clean data

Again, using rename(), select(), and mutate() for ease of working:

data3 <- data %>% 
  rename(immunity = q8haveimmunity, #rename variables appropriately
         outings = Going_out_total,
         noworry = q9worry,
         rself = q10arisk,
         rothers = q10brisk,
         think_covid = Ever_covid) %>% 
  select(immunity, #select relevant variables
         outings,
         noworry,
         rself,
         rothers,
         think_covid) %>% 
  mutate(think_covid = case_when(think_covid == 1 ~ "Have COVID-19", #rename coded responses
                                 think_covid == 0 ~ "Have Not Had COVID-19"))

Creating individual data sets to calculate mean and SD

Since this table includes the means and SDs for each measure/characteristic, we had to calculate them for each data set - meaning we had to create a data set for each variable, as in the previous 2 figures. However, this time, we used summarise() to calculate these values.

d3immune <- data3 %>% 
  select(immunity, think_covid) %>% #select relevant variable
  group_by(think_covid) %>% #the main variable of interest
  summarise(mean(immunity), #to calculate mean
            sd(immunity)) %>% #to calculate SD
  rename(mean = 'mean(immunity)', #rename for ease and for categorising later
         sd = 'sd(immunity)')

I mentioned earlier that we tried using unite() to combine the mean and SD cells - here’s an example of what this looked like.

This persisted even though we had options(digits=3) to keep values at 3 significant figures. No one - to this day - knows why this wasn’t working, but we did determine that it was unite that was causing it. Which meant that we had to try a different approach. Hence:

d3immune <- data.frame(lapply(d3immune, function(x) if(is.numeric(x)) round(x,2) else x)) #to ensure they would remain at 2 decimal places

d3immune$desc <- paste("M = ",d3immune$mean, sep="") %>% #paste mean and SD together in the same cell
  paste(", SD = ",d3immune$sd, sep="")

Then subset() the columns we didn’t need - and there we had our first data frame, for the immunity variable.

d3immune <- subset(d3immune, select = -c(mean,sd)) 

d3immune

##             think_covid                desc
## 1         Have COVID-19    M = 3.33, SD = 1
## 2 Have Not Had COVID-19 M = 2.38, SD = 1.01

Let’s quickly double-check that everything is correct by comparing to the original figure:

Yep, perfect.

And again, we repeated that for each variable. But we had a problem - we needed the think_covid variable to be column headings, but our output looked like this:

d3immune

##             think_covid                desc
## 1         Have COVID-19    M = 3.33, SD = 1
## 2 Have Not Had COVID-19 M = 2.38, SD = 1.01

So we needed to switch them. We ran into lots of trouble with this, and ended up having to create a whole new data frame for each, using the following function:

switchI <- as.data.frame(t(as.matrix(d3immune)))

But this ended up creating a matrix with the headings ‘V1’ and ‘V2’, with far too many decimal places, like so:

This took us such a long time to get rid of! We couldn’t even rename them, since they were not ‘officially’ part of the table, and we already had the desired headings. Worst of all, the ‘Have had COVID-19’ and ‘Have not had COVID-19’ headings still existed for each table. They were just pushed down, which was fine - if you just cropped the bottom half of the table, that is.

It was clearly a disaster, but we managed to mitigate this by using subset(). Phew! Now our tibbles were perfect:

switchI

##                            1                     2
## think_covid    Have COVID-19 Have Not Had COVID-19
## desc        M = 3.33, SD = 1   M = 2.38, SD = 1.01

Turning the data into a table

Like before, now we just had to combine all the data into a list using list(), and use this list to create a table with rbindlist().

fulllist3 <- list(switchI, switchO, switchW, switchS, switchR)
fulltable3 <- rbindlist(fulllist3, use.names = FALSE)
fulltable3 <- fulltable3[-c(1, 3, 5, 7, 9)] %>% 
  select(2, 1) #to clean

fulltable3

##                      2                   1
## 1: M = 2.38, SD = 1.01    M = 3.33, SD = 1
## 2: M = 6.69, SD = 5.63 M = 9.35, SD = 7.69
## 3: M = 3.59, SD = 1.01 M = 3.38, SD = 1.12
## 4: M = 2.81, SD = 0.76 M = 2.81, SD = 0.76
## 5: M = 3.39, SD = 0.67   M = 3.3, SD = 0.7

Next - the figure’s first two columns were completely comprised of text. As can be seen in the last screenshot of the old code, at that point, we were still only able to include the participant characteristics as row group headers. But once we figured out what to do in Figure 1, it was easy to apply it here. So to create this, we created vectors using the c() function and manually typed the relevant text, like so:

level3 <- c("1 = strongly disagree to 5 = strongly agree",
            "Range = 0 to 42",
            "1 = not at all worried to 5 = extremely worried",
            "1 = no risk at all to 4 = major risk",
            "1 = no risk at all to 4 = major risk")

pchar3 <- c("I think I have some immunity to COVID-19",
            "Total out-of-home activity in the last seven days",
            "Worry about COVID-19",
            "Perceived risk of COVID-19 to oneself",
            "Perceived risk of COVID-19 to people in the UK")

Lastly, using cbind() to create the table itself:

fulltable4 <- cbind(pchar3, level3, fulltable3)

Finally! Using the gt() function, we are able to visualise the whole table.

gtfulltable4 <- gt(fulltable4) %>% 
  tab_spanner( #column headers spanning columns 3 and 4
    label = "Had COVID-19",
    columns = c(3:4)
  ) %>%
  cols_label( #column labels
    pchar3 = "Participant characteristics"
  ) %>% 
  cols_label(
    "level3" = "Level"
  ) %>% 
  cols_label(
    "2" = "Think have not had COVID-19 n = 4656"
  ) %>% 
  cols_label(
    "1" = "Think have had COVID-19 n = 1493"
  ) %>% 
  tab_style(cell_text(weight = "bold"), #formatting
            locations = cells_column_labels(columns = everything())) %>% 
  tab_style(cell_text(weight = "bold"),
            locations = cells_column_spanners())

This leads to a table output looking like this: exactly as the paper demonstrated.

gtfulltable4

Participant characteristics	Level	Had COVID-19
Participant characteristics	Level	Think have not had COVID-19 n = 4656	Think have had COVID-19 n = 1493
I think I have some immunity to COVID-19	1 = strongly disagree to 5 = strongly agree	M = 2.38, SD = 1.01	M = 3.33, SD = 1
Total out-of-home activity in the last seven days	Range = 0 to 42	M = 6.69, SD = 5.63	M = 9.35, SD = 7.69
Worry about COVID-19	1 = not at all worried to 5 = extremely worried	M = 3.59, SD = 1.01	M = 3.38, SD = 1.12
Perceived risk of COVID-19 to oneself	1 = no risk at all to 4 = major risk	M = 2.81, SD = 0.76	M = 2.81, SD = 0.76
Perceived risk of COVID-19 to people in the UK	1 = no risk at all to 4 = major risk	M = 3.39, SD = 0.67	M = 3.3, SD = 0.7

Again, let’s compare to the actual figure:

Yay! We did it!

Figure 4

We were finally here at the last figure. This one was an interesting case - as it was the last one we attempted, we had significantly more R knowledge than we did when we started, and could hence apply that to various features, as will be described below. While we didn’t produce the exact figure the paper did, we believe we found a way that presented the data even more clearly. But let’s get into exactly how we did that.

Cleaning data

We cleaned the data almost exactly as we did in Figure 1, using the rename() and mutate() functions.

data4 <- data %>% 
  rename(Gender = gender, 
         Age = age_categories,
         Has_child = Has_child,
         Employment_status = Working,
         Key_worker = Key_worker,
         Education = degree,
         Region = region,
         Total_outings = Going_out_total,
         groceries = Adhere_shop_groceries,
         shop_other = Adhere_shop_other,
         meetup = Adhere_meet_friends,
         symptoms = Sx_covid_nomissing,
         think_covid = Ever_covid) %>% 
  mutate(think_covid = case_when(think_covid == 0 ~ "No",
                                 think_covid == 1 ~ "Yes")) %>% 
  mutate(groceries = case_when(groceries == 0 ~ "On one or fewer days in the last week n = 2389",
                               groceries == 1 ~ "On two or more days in the last week n = 3760")) %>% 
  mutate(shop_other = case_when(shop_other == 0 ~ "Not at all in the last week n = 1833",
                                shop_other == 1 ~ "On one or more days in the last week n = 4316")) %>% 
  mutate(meetup = case_when(meetup == 0 ~ "Not at all in the last week n = 5271",
                            meetup == 1 ~ "On one or more days in the last week n = 878")) %>% 
  mutate(symptoms = case_when(symptoms == 0 ~ "Did not correctly identify common symptoms n = 2390",
                              symptoms == 1 ~ "Correctly identified common symptoms n = 3632")) %>% 
  select(think_covid, groceries, shop_other, meetup, symptoms)

Creating individual data frames

Like before, we had to create individual data frames for each variable. This again followed a very similar format as the previous figures - whilst repetitive, it was rewarding and almost satisfying to be able to apply the knowledge we had gained. In other words, we used the select(), group_by(), and arrange() functions to clean, and tabyl() to count.

Here’s an example of the one created for the “Shopping for groceries/pharmacy” variable:

datagroceries <- data4 %>% 
  select(groceries,think_covid) %>% #selecting the relevant variables
  group_by(think_covid) %>% #the main variable we were sorting by
  arrange(factor(groceries, levels = c("On one or fewer days in the last week n = 2389", "On two or more days in the last week n = 3760"))) #arranging the factors to appear in the same order as they do in the original figure

tgroceries <- datagroceries %>% 
  tabyl(groceries, think_covid) #creating a count

gtgroceries <- gt(tgroceries) #creating a table to display and more easily combine later
datagroceries <- gtgroceries$"_data"

This was repeated for each of the other variables.

Creating percentages

As this figure also included percentages, we had to create percentage columns for each variable based on the previously created counts. Luckily, we could apply the knowledge gained whilst making Figure 1 to do this, since they used the same principles and the with() and prop.table() functions! Again, we did this for every variable.

perGroc <- with(data4, table(think_covid,groceries)) #creating a tibble to use to calculate %
pGroc <- prop.table(perGroc, 1)*100 #calculating the %

perShop <- with(data4, table(think_covid, shop_other))
pShop <- prop.table(perShop, 1)*100

perMeet <- with(data4, table(think_covid, meetup))
pMeet <- prop.table(perMeet, 1)*100

perSymp <- with(data4, table(think_covid,symptoms))
pSymp <- prop.table(perSymp, 1)*100

Combining the data

Like before, we used the list() and rbindlist() functions to combine all the previously-made individual count data sets, and as_tibble() to make it into a tibble.

listfull3 <- list(datagroceries,
                  datashop,
                  datameetup,
                  datasymptoms)
listfull3 <- rbindlist(listfull3, use.name=FALSE)
listfull3 <- as_tibble(listfull3)
listfull3 <- listfull3

We repeated this for the percentage data sets.

ptibble4 <- rbind(pGroc, pShop, pMeet, pSymp)
ptibble4 <- as_tibble(ptibble4)
ptibble4 <- data.frame(lapply(ptibble4, function(x) if(is.numeric(x)) round(x,1) else x))

And the final step: combining the now combined count and percentage tables! Since we did this in Figure 1, it wasn’t too difficult to apply it here, with the relevant data sets, columns, and names. It felt good to be able to go straight to paste0() this time. We also ensured to add the brackets as features, to separate the values as seen in the original figure - just like in Figure 1.

tibble4 <- cbind(listfull3, ptibble4)
tibble4$nocovid <- paste0(tibble4$No, " (", tibble4$On.one.or.fewer.days.in.the.last.week.n...2389, ")")
tibble4$yescovid <- paste0(tibble4$Yes, " (", tibble4$On.two.or.more.days.in.the.last.week.n...3760, ")")
tibble4 <- subset(tibble4, select = c(groceries, nocovid, yescovid))

Creating the table

Now we could finally see the output, and do formatting of the actual table itself again using the gt() function - our best friend.

gtfull4 <- gt(tibble4) %>% 
  tab_spanner(
    label = "Think COVID-19",
    columns = c(2:3)
  ) %>%
  cols_label(
    "nocovid" = "Think have not had COVID-19"
  ) %>% 
  cols_label(
    "yescovid" = "Think have had COVID-19"
  ) %>% 
    cols_label(
    "groceries" = "Self-Reported Behaviour n (%)"
  ) %>% 
  tab_row_group(
    label = "Shopping for groceries/pharmacy",
    rows = 1:2
  ) %>% 
  tab_row_group(
    label = "Shopping for items other than groceries/pharmacy",
    rows = 3:4
  ) %>%
  tab_row_group(
    label = "Meeting up with friends or family",
    rows = 5:6
  ) %>%
  tab_row_group(
    label = "Correct identification of cough and fever",
    rows = 7:8
  ) %>%
  row_group_order(groups = c("Shopping for groceries/pharmacy", 
                             "Shopping for items other than groceries/pharmacy",
                             "Meeting up with friends or family",
                             "Correct identification of cough and fever"
  )) %>% 
  tab_style(cell_text(weight = "bold"), 
            locations = cells_column_labels(columns = everything())) %>% 
  tab_style(cell_text(weight = "bold"),
            locations = cells_column_spanners()) %>% 
  tab_style(
    style = list(
      cell_text(weight = "bold")),
    locations = cells_row_groups()) 

gtfull4

Self-Reported Behaviour n (%)	Think COVID-19
Self-Reported Behaviour n (%)	Think have not had COVID-19	Think have had COVID-19
Shopping for groceries/pharmacy
On one or fewer days in the last week n = 2389	1701 (36.5)	688 (63.5)
On two or more days in the last week n = 3760	2955 (46.1)	805 (53.9)
Shopping for items other than groceries/pharmacy
Not at all in the last week n = 1833	1156 (24.8)	677 (75.2)
On one or more days in the last week n = 4316	3500 (45.3)	816 (54.7)
Meeting up with friends or family
Not at all in the last week n = 5271	456 (9.8)	422 (90.2)
On one or more days in the last week n = 878	4200 (28.3)	1071 (71.7)
Correct identification of cough and fever
Correctly identified common symptoms n = 3632	2927 (62.9)	705 (37.1)
Did not correctly identify common symptoms n = 2390	1729 (47.2)	788 (52.8)

Let’s compare to the original figure:

Although it didn’t look exactly as the figure in the paper, we believed that the rearrangement both presented the data more clearly, as well as allowed for less repetition. Bonus points that it looked more like the other figures in the paper - thus we opted for swapping the positions of the ‘self-reported behaviour’ and ‘think COVID-19’ columns. Most importantly, all the values and percentages matched up. We were happy with it, concluding our reproduction journey!

Part 3: Exploration

I generally used the same packages as above to carry out my exploratory analyses, but there were a few extra that came in handy:

library(RColorBrewer) # to change colours of plots
library(rcompanion) # an easy way to create a plot of means
library(kableExtra) # for displaying simple tables

## Warning in !is.null(rmarkdown::metadata$output) && rmarkdown::metadata$output
## %in% : 'length(x) = 2 > 1' in coercion to 'logical(1)'

## 
## Attaching package: 'kableExtra'

## The following object is masked from 'package:dplyr':
## 
##     group_rows

library(report) #to help interpret statistical tests
library(corrplot) #to create correlation plots

## corrplot 0.92 loaded

Q1: Does having a child increase worry about COVID-19?

A simple question, yet particularly relevant. The authors reported that those with children were more likely to think that they had previously caught coronavirus, likely due to greater exposure or perceived exposure (for instance, through schools). Was this linked with greater worry? It could be argued that greater worry would be expected, as the consequences would be worse considering children would be more susceptible to a harsher experience. As child-carers, this would be a cause for concern.

Clean the data

To start exploring this, I used the existing data set and cleaned it. I took the following steps:

Used select() to filter out my only 2 variables of interest
Used filter() to filter out the ‘NA’ responses that would exclude participants that were not eligible
Used mutate() to re-code the responses of ‘0’ and ‘1’ to ‘no’ and ‘yes’, in response to whether they had a child
Used tabyl() to count, just out of curiosity and to ensure I was on the right track
Used as.data.frame() to turn the data set into a data frame, so it could be more easily merged later

I found it useful (which will be elaborated later on) to have 2 copies of this cleaned data - one with the count and one long-form, with just the raw data. I called the counted version ‘child’, and it looked like this:

child <- data %>% 
  select(Has_child, q9worry) %>% 
  mutate(Has_child = case_when(Has_child == 0 ~ "No",
                               Has_child == 1 ~ "Yes")) %>% 
  filter(Has_child != "NA") %>% 
  tabyl(Has_child, q9worry) %>%
  as.data.frame() 

child

##   Has_child   1   2    3   4   5
## 1        No  96 347 1003 770 410
## 2       Yes 101 285 1027 962 787

And I called the long-form version ‘sig’. The only difference was that I included the group_by() function. This version will later allow me to create a plot. It looked like this:

sig <- data %>% 
  select(Has_child, q9worry) %>% 
  filter(Has_child != "NA") %>% 
  group_by(Has_child) %>% 
  as.data.frame() %>% 
  mutate(Has_child = case_when(Has_child == 0 ~ "No",
                               Has_child == 1 ~ "Yes"))

1. Creating descriptives

First, to gain a primary understanding of the data, I decided to create a simple table illustrating the descriptives. As this time I was only interested in the child variable across the worry variable, I only needed to deal with these two variables. To do this, a simple summarise() would suffice, and using kable_styling() from the kableExtra package to illustrate it.

mean1 <- data %>% 
  filter(Has_child != "NA") %>% 
  group_by(Has_child) %>% 
  mutate(Has_child = ifelse(Has_child, "yes", "no")) %>%
  summarise(n(),
    mean(q9worry), 
            sd(q9worry)) %>% 
  rename(avg = "mean(q9worry)") %>% 
  rename(n = "n()") %>% 
  rename(sd = "sd(q9worry)") %>% 
  kbl() %>% 
  kable_styling()

Let’s see what the values were:

mean1

Has_child	n	avg	sd
no	2626	3.400228	1.018002
yes	3162	3.648008	1.047924

Looks like for now, those with a child appear to have higher average worry. But we must also take into account the fact that there were more participants overall that did have a child. Out of curiosity, I also wanted to see the number of participants on each child factor across levels of worry. Obviously, since the child variable is not continuous, I only calculated ‘n’ and not the mean and SD, so tabyl() would suffice. This is demonstrated here, whereby the values 1-5 correspond to level of worry in increasing order:

mean2 <- data %>% 
  filter(Has_child != "NA") %>% 
  mutate(Has_child = ifelse(Has_child, "yes", "no")) %>%
  tabyl(Has_child, q9worry) %>% 
  kbl() %>% 
  kable_styling()

What are the results?

mean2

Has_child	1	2	3	4	5
no	96	347	1003	770	410
yes	101	285	1027	962	787

Interesting data, but doesn’t actually tell us much. It would be much more helpful to see a graph or plot of it…

2. Visualising the data

Next, I wanted a clear way to visualise the data and be able to see a pattern. I decided that creating a column graph would be the best way to achieve this - with levels of worry on the x-axis, percentage of people on the y-axis, and 2 bars each, corresponding with a legend, to represent whether they had a child.

To do this, I obviously had to create a percentage count. Luckily, I had practice doing that in the reproducibility section! My first step was to create a count of the total number of participants I would be including. I did this using count(), and then renamed ‘n’ to ‘covidtotal’ using the rename() function, so I wouldn’t get confused.

total <- data %>% 
  filter(Has_child != "NA") %>% 
  group_by(Has_child) %>% 
  count() %>% 
  rename(covidtotal = n) %>% as.data.frame()

I also had to create a long-form version of this count - however, this wasn’t possible using count(), as this requires group_by() to precede it. However, I found that I could use tally() instead, which performed the same function but without this barrier! While I was doing it, I thought I may as well use mutate() to rename the coded variables on the worry scale to appropriate names that I could later use in my plot. Ironically, I called this ‘counting’.

counting <- data %>% 
  filter(Has_child != "NA") %>% 
  group_by(Has_child, q9worry) %>% 
  tally() %>% 
  mutate(q9worry = case_when(q9worry == 1 ~ "Not at all worried",
                             q9worry == 2 ~ "A little bit worried",
                             q9worry == 3 ~ "Moderately worried",
                             q9worry == 4 ~ "Very worried",
                             q9worry == 5 ~ "Extremely worried")) %>% 
  as.data.frame()

Since there were differing quantities of participants in each level of the child factor, I felt as if percentages would work better as they would be standardised. Since multiple of the figures in Part 2 used percentages, I was able to apply this knowledge here. I could use the previous counts to create the percentages that I could use in my plot! Unfortunately, I couldn’t find a way to include the ‘total’ data set directly, but luckily our previous count gave me the numbers I needed, so I manually added it - it came to a total of 5788. This was done primarily using the mutate() function:

pattempt <- counting %>% mutate(perc = n*100/5788) #calculating percentages for each worry factor by mutating a formula 
counting <- counting %>% mutate(pattempt) #mutating the above 'pattempt' as a new column in the 'counting' dataframe

I was almost ready to see the plot itself - but once I renamed the columns, they rearranged themselves in alphabetical order, which made no logical sense. So first, I needed to reorder the columns in a way that made sense now using factor() like I’ve demonstrated multiple times in Part 2:

counting$q9worry <- factor(counting$q9worry, levels = c("Not at all worried", "A little bit worried", "Moderately worried", "Very worried", "Extremely worried"))

And now the graph itself! Using simple ggplot():

graph1 <- counting %>% mutate(Has_child = ifelse(Has_child, "Yes", "No")) %>% #renaming the legend
  ggplot(aes(factor(q9worry), perc, fill = Has_child)) + #assigning variables to the axes/legend
  geom_col(position = position_dodge()) + #bars side by side
  ggtitle("Impact of Children on Level of COVID-worry") + #title
  labs(x="Worry about COVID", y="% of people", fill="Have a child") + #axis labels
  coord_flip() + #realised halfway it would be easier to read the 'worry' labels if the axes were flipped
  theme_bw() #background theme

graph1

Which was great! … for visualising the proportions. But after seeing the output, I quickly realised that it didn’t tell me anything of statistically significant value. I might be better off creating a means box plot. Although it was annoying having to start over, I felt like it would be best.

summ() was the function I used to calculate the descriptives of the worry variable - this came from the rcompanion package. I found that summarise(), while succinct, just didn’t have enough information to create a box plot.

summ = groupwiseMean(q9worry ~ Has_child,
                     data = sig, #the dataframe I was calculating from
                     conf = 0.95, #95% confidence interval
                     digits = 3) #3 significant figures
summ

##   Has_child    n Mean Conf.level Trad.lower Trad.upper
## 1        No 2626 3.40       0.95       3.36       3.44
## 2       Yes 3162 3.65       0.95       3.61       3.68

Which then allowed me to create the box plot easily. I called this ‘bplot’.

bplot <- ggplot(sig, aes(x=Has_child, y=q9worry, fill=Has_child)) + #assigning variables
  geom_boxplot()+ 
  stat_summary(fun=mean, geom="point", shape=23, size=2) + #mean marker
  labs(title = "Mean Worry of Child vs No Child", x = "Have child", y = "Mean Worry about COVID") + #titles
  theme_minimal() + #background
  scale_fill_brewer(palette="Dark2") #colour palette

bplot

From looking at this box plot, though the other statistics appeared identical, there was definitely a visible difference between the two means.

3. Statistics

But the final question remained - was this a statistically significant difference? I used t.test() to check this:

t.test(q9worry ~ Has_child, data = sig, var.equal = TRUE)

## 
##  Two Sample t-test
## 
## data:  q9worry by Has_child
## t = -9.0723, df = 5786, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group No and group Yes is not equal to 0
## 95 percent confidence interval:
##  -0.3013202 -0.1942381
## sample estimates:
##  mean in group No mean in group Yes 
##          3.400228          3.648008

Looks like p<0.05, or it was significant! Now that we had these results, I was curious as to whether there was a correlation between having a child and the other measures relating to COVID-worry, namely immunity and perceived risk to self and others. I decided to create a correlation matrix to investigate this.

cors <- data %>% select(Has_child, q10arisk, q10brisk, q9worry, q8haveimmunity) %>% 
    filter(Has_child != "NA") %>% 
  cor()

print(cors)

##                  Has_child   q10arisk    q10brisk    q9worry q8haveimmunity
## Has_child      1.000000000  0.1639095  0.03458871  0.1184296    0.003529561
## q10arisk       0.163909539  1.0000000  0.39187157  0.4721884   -0.162017663
## q10brisk       0.034588709  0.3918716  1.00000000  0.4418158   -0.189576707
## q9worry        0.118429627  0.4721884  0.44181579  1.0000000   -0.262237210
## q8haveimmunity 0.003529561 -0.1620177 -0.18957671 -0.2622372    1.000000000

From this, it looked like none of the other measures showed very high correlations (<0.1) while worry and self-perceived risk had the highest correlations… which was surprising, considering the rationale that those with children would be worried on behalf of their children leading to higher perceived risk towards others. I wondered whether the correlation between self-perceived risk and having a child was significant, so I tested this using report(), from the report package.

cor2 <- cor.test(data$Has_child, data$q10arisk, method = "pearson") 
report(cor2)

## Effect sizes were labelled following Funder's (2019) recommendations.
## 
## The Pearson's product-moment correlation between data$Has_child and data$q10arisk is positive, statistically significant, and small (r = 0.16, 95% CI [0.14, 0.19], t(5786) = 12.64, p < .001)

It was statistically significant! To determine the direction this correlation was in, I quickly used summarise() to calculate the means.

childriskmean <- data %>% 
  group_by(Has_child) %>% 
  filter(Has_child != "NA") %>% 
  mutate(Has_child = case_when(Has_child == 0 ~ "No",
                               Has_child == 1 ~ "Yes")) %>% 
  rename(SelfRisk = q10arisk) %>% 
  summarise(mean(SelfRisk)) %>% 
  rename(MeanSelfRisk = "mean(SelfRisk)") %>% 
  kbl() %>% 
  kable_styling()

childriskmean

Has_child	MeanSelfRisk
No	2.668317
Yes	2.918722

Having a child also significantly increases self-perceived risk!

Results

Those with children do indeed have significantly higher average worry about COVID-19, as well as self-perceived risk to COVID-19. The former appears to be intuitive, as one would assume child-carers would be more worried for their children as they are a vulnerable population. However, higher self-perceived risk is interesting - would this be due to higher perceived exposure from their children (eg. from schools)? This could be the topic of a later study.

Q2: Is there an effect of age on number of outings?

The study showed that older participants were less likely to think that they had previously had coronavirus. I was curious as to whether this would be consistent with their behaviour. To do this, I thought that creating a column graph illustrating each age group’s mean total number of outings.

Cleaning the data

I first cleaned the data:

data1 <- data %>% 
  rename(Age = age_categories) %>% 
  select(Age, Going_out_total) %>% #the 2 variables of interest in this question
  mutate(Age = case_when(Age == 1 ~ "18 to 24 years",
                                Age == 2 ~ "25 to 34 years",
                                Age == 3 ~ "35 to 44 years",
                                Age == 4 ~ "45 to 54 years",
                                Age == 5 ~ "55+"))

1. Descriptives

To get a general idea of the pattern I would be looking for, I created a simple means table again using summarise():

summ1 <- data1 %>% 
  group_by(Age) %>% 
  summarise(mean(Going_out_total)) %>% 
  rename(MeanGoingOut = "mean(Going_out_total)") %>% 
  kbl() %>% 
  kable_styling()

summ1

Age	MeanGoingOut
18 to 24 years	8.287623
25 to 34 years	8.074407
35 to 44 years	7.379904
45 to 54 years	7.062674
55+	6.115451

Looks like there was a downward linear trend, with the youngest people going out the most frequently and the oldest people the least.

I wanted my column graph to include error bars - to do this, I needed to use the summ() function from Q1 to include the other needed statistics, such as the boundaries of the confidence interval.

summ = groupwiseMean(Going_out_total ~ Age,
                     data = data1, 
                     conf = 0.95,
                     digits = 3)

2. Visualising the data

From there, it was relatively simple to create the graph, again using ggplot(). I called the graph ‘cp’.

cp <- ggplot(summ, aes(x=Age, y=Mean, fill=Age)) + 
  geom_col() + 
  geom_errorbar(aes(ymin = Trad.lower, ymax = Trad.upper, width = 0.15)) + #creating error bars
  ggtitle("Mean Total No. of Outings Across Age Groups") +
  labs(x="Age", y="Mean Total No. of Outings") + 
  coord_flip() + 
  theme_bw() + #background aesthetic
  scale_fill_brewer(palette="Accent") #colour aesthetic

cp

Hmm, there doesn’t look like there’s a big difference. I decided to do a significance check with a one-way ANOVA significance test. I used the aov() function to compute the analysis of variance, and summary() to determine whether it was significant. A significant result would mean that at least one pair had a significant difference.

res.aov <- aov(Going_out_total ~ Age, data = data1)
summary(res.aov)

##               Df Sum Sq Mean Sq F value Pr(>F)    
## Age            4   4604  1151.0   29.55 <2e-16 ***
## Residuals   6144 239286    38.9                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Meanwhile, report() would help me interpret my results.

report(res.aov)

## For one-way between subjects designs, partial eta squared is equivalent to eta squared.
## Returning eta squared.

## Warning: 'data_findcols()' is deprecated and will be removed in a future update.
##   Its usage is discouraged. Please use 'data_find()' instead.

## The ANOVA (formula: Going_out_total ~ Age) suggests that:
## 
##   - The main effect of Age is statistically significant and small (F(4, 6144) = 29.55, p < .001; Eta2 = 0.02, 95% CI [0.01, 1.00])
## 
## Effect sizes were labelled following Field's (2013) recommendations.

Ah, so there is a statistically significant (albeit small) difference! To determine which pair/s differed, I used TukeyHSD.

TukeyHSD(res.aov)

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = Going_out_total ~ Age, data = data1)
## 
## $Age
##                                     diff        lwr         upr     p adj
## 25 to 34 years-18 to 24 years -0.2132159 -0.8772954  0.45086371 0.9057639
## 35 to 44 years-18 to 24 years -0.9077188 -1.6015390 -0.21389855 0.0033115
## 45 to 54 years-18 to 24 years -1.2249490 -2.0045370 -0.44536091 0.0001783
## 55+-18 to 24 years            -2.1721722 -2.7808265 -1.56351783 0.0000000
## 35 to 44 years-25 to 34 years -0.6945029 -1.4118353  0.02282954 0.0631660
## 45 to 54 years-25 to 34 years -1.0117331 -1.8123184 -0.21114777 0.0051472
## 55+-25 to 34 years            -1.9589563 -2.5942826 -1.32363006 0.0000000
## 45 to 54 years-35 to 44 years -0.3172302 -1.1426523  0.50819191 0.8325583
## 55+-35 to 44 years            -1.2644534 -1.9308049 -0.59810198 0.0000023
## 55+-45 to 54 years            -0.9472232 -1.7024684 -0.19197801 0.0056457

Based on basic statistical inference, the following pairs significantly differed:

35-44 and 18-24
45-54 and 18-24
55+ and 18-24
45-54 and 25-34
55+ and 25-34
55+ and 35-44
55+ to 45-54

And the biggest difference was between 18-24 and 55+, with 55+ year olds having the least number of outings! Looks like age does have an effect… although it’s difficult to determine the direction of the relationship.

A limitation noted in the original study was that in their analyses, the authors did not differentiate between government-mandated outings and illegal outings. They included the following ‘going-out’ variables in the study:

Shopping for groceries/pharmacy

Shopping for items other than groceries/pharmacy

Meeting up with friends/family they do not live with

Government-mandated outings would only include shopping for groceries/pharmacy. Would this pattern maintain if we separated outings into the 3 different categories Smith et al studied?

Cleaning the data: pt 2

First, I cleaned the data again under some different conditions, selecting the variables that would be relevant in this section of the analysis, but using the same functions I’ve been practicing throughout this whole journey.

gov <- data %>% 
  rename(Age = age_categories) %>% 
  select(Age, Adhere_shop_groceries, Adhere_shop_other, Adhere_meet_friends) %>% 
  rename(GroceryPharmacy = Adhere_shop_groceries,
         ShopOther = Adhere_shop_other,
         FriendsFamily = Adhere_meet_friends) %>% 
  mutate(Age = case_when(Age == 1 ~ "18 to 24 years",
                         Age == 2 ~ "25 to 34 years",
                         Age == 3 ~ "35 to 44 years",
                         Age == 4 ~ "45 to 54 years",
                         Age == 5 ~ "55+")) %>% 
  group_by(Age, GroceryPharmacy, ShopOther, FriendsFamily)

As there were so many variables to consider (5 age levels, 3 types of outings, and 2 levels of each outing being yes and no, with different combinations of the 3 outings), I decided to filter the data to only include instances where participants answered ‘yes’ to each respective question. This was coded as ‘1’ in all three - in the ‘GroceryPharmacy’ condition, it referred to visiting the supermarket or pharmacy more than twice in the past week, and for the remaining conditions it referred to engaging more than once in the past week.

1. Counts: pt 2

I first wanted to do some simple counts - i.e., how many respondents there were in each age group, and how many had responded ‘yes’ to each type of outing - to give me an estimate of what I would be looking for. The former is demonstrated in this table:

agecount <- gov %>% 
  group_by(Age) %>% 
  count() %>% #counting how many respondents there were across each age group
  kbl() %>% 
  kable_styling()

agecount

Age	n
18 to 24 years	1422
25 to 34 years	1223
35 to 44 years	1045
45 to 54 years	718
55+	1741

For the latter, I created individual data frames using tabyl() for each type of outing across the age categories, then combined them using list() and rbindlist(). I also filtered out the ‘no’ responses using filter().

This was the result:

tableG <- rbindlist(tibblist, use.names = FALSE) %>% 
  rename(Type = GroceryPharmacy) %>% 
  kbl %>% 
  kable_styling()

tableG

Type	18 to 24 years	25 to 34 years	35 to 44 years	45 to 54 years	55+
GroceryPharmacy	828	706	618	417	1191
ShopOther	791	764	735	558	1468
FriendsFamily	1092	966	918	661	1634

2. Visualising the data: pt 2

I wondered how this pattern could be visualised. Considering that there were differing numbers of responses for each age category, it wouldn’t make sense to plot the raw number of responses - the age group with the most number of respondents would always show the highest number in each outing group. Therefore, plotting based on percentage seemed to make more sense. Since each age category had a different total, I would have to calculate percentages based on each category. Also, since creating a graph would require a long-form count, I couldn’t use the previous table and had to utilise count() instead. Which meant a long and arduous process of creating a data frame for each age group, for each outing type, then combining them using list() and rbindlist() again. Here is an example of the process I went through for the 18-24 year-old group:

countag <- data %>%   
  filter(age_categories == 1, Adhere_shop_groceries == 1) %>% # age category 1 refers to 18-24y/os
  group_by(age_categories, Adhere_shop_groceries) %>% 
  mutate(Adhere_shop_groceries = case_when(Adhere_shop_groceries == 1 ~ "Groceries")) %>% #to be able to differentiate between the 3, since they would otherwise all be labeled '1'
  count() %>% 
  as.data.frame()

countao <- data %>% 
  filter(age_categories == 1, Adhere_shop_other == 1) %>% 
  group_by(age_categories, Adhere_shop_other) %>% 
  mutate(Adhere_shop_other = case_when(Adhere_shop_other == 1 ~ "Shop Other")) %>% 
  count() %>% 
  as.data.frame()

countaf<- data %>% 
  filter(age_categories == 1, Adhere_meet_friends == 1) %>% 
  group_by(age_categories, Adhere_meet_friends) %>% 
  mutate(Adhere_meet_friends = case_when(Adhere_meet_friends == 1 ~ "Meet Friends")) %>% 
  count() %>% 
  as.data.frame()

combine <- list(countag, countao, countaf)
combines <- rbindlist(combine, use.name=FALSE) %>% 
  rename(Type = Adhere_shop_groceries) %>% #creating a heading
  mutate(age_categories = case_when(age_categories == 1 ~ "18-24y/o")) %>% #to be able to differentiate between the age categories later, when multiple would come into play
  mutate(perc = (n/2711)*100) #mutating a new column to add percentage values - 2711 refers to the total number of respondents in this age category, which I manually added using the previous 'agecount' table

I was aware that this seemed to be an extremely inefficient way to create this table, but after days thinking about this, it appeared to be the only solution. I tried many other methods - such as creating separate graphs for each age across the types of outings, separate graphs for each outing across the age groups (I almost ended up using this method, but concluded that it would be too difficult to visualise the difference in outings between ages), graphing only the raw numbers… but this was the only method I could think of to achieve the plot I wanted. Before stumbling on this, I found it difficult to create a column for all the types of outings as in the original data, they appeared as three separate variables. But using group_by() and count() automatically created a column detailing outing type - and since I was only counting the ‘yes’ responses, I could rename this response to the name of the outing. The column I was searching for, with the different outing types, was created simply by virtue of me merging the different data frames of the different outings. After that, I simply renamed the top variable to ‘type’ - and suddenly my problem was solved. I concluded that an inefficient solution is better than no solution at all, so I proceeded to repeat this process for the remaining 4 age groups.

After this, I tried using merge() to combine all the combined data frames from each age category - before realising that obviously, merge() is only able to merge 2 data frames at a time. So I merged them in pairs, then all together.

mulbine <- merge(combines, combines2, all = TRUE) #first 2
mulbine2 <- merge(combines3, combines4, all = TRUE) #second 2
mulbine3 <- merge(mulbine, mulbine2, all = TRUE) #4
mulbine4 <- merge(mulbine3, combines5, all = TRUE) #all!

Thankfully, it seemed to work.

print(mulbine4)

##     age_categories         Type    n     perc
##  1:       18-24y/o    Groceries  828 30.54224
##  2:       18-24y/o Meet Friends 1092 40.28034
##  3:       18-24y/o   Shop Other  791 29.17743
##  4:       25-34y/o    Groceries  706 28.98194
##  5:       25-34y/o Meet Friends  966 39.65517
##  6:       25-34y/o   Shop Other  764 31.36289
##  7:       35-44y/o    Groceries  618 27.21268
##  8:       35-44y/o Meet Friends  918 40.42272
##  9:       35-44y/o   Shop Other  735 32.36460
## 10:       45-54y/o    Groceries  417 25.48900
## 11:       45-54y/o Meet Friends  661 40.40342
## 12:       45-54y/o   Shop Other  558 34.10758
## 13:         55+y/o    Groceries 1191 27.74284
## 14:         55+y/o Meet Friends 1634 38.06196
## 15:         55+y/o   Shop Other 1468 34.19520

I finally had a data frame that was ready to be plotted. I kept it simple with a geom_bar():

mulbinep <- mulbine4 %>% 
  ggplot(aes(age_categories, perc, fill = Type)) +
  geom_bar(position = 'dodge', stat='identity', width=0.5) +
  ggtitle("Differences in Types of Outings across Age") +
  labs(x="Age Categories", y="% of people engaging in frequent outings", fill="Type of Outing") +
  theme_bw()

print(mulbinep)

Results

We determined earlier that the youngest age group of 18-24 year olds had the highest number of total outings and the oldest age group of 55+ year olds had the lowest number. However, upon separating the types of outings, it can be seen that for those that answered ‘yes’ to at least one type of outing, all age categories showed the highest proportion of outings were to visit friends and family they didn’t live with. In fact, the pattern of ‘meet friends’ having the highest proportion and ‘groceries’ (to be noted, the only legal outing) having the lowest was true across all ages apart from the 18-24 group, where ‘groceries’ was very slightly higher than ‘shop other’. Overall, the graph demonstrates that despite younger participants going out more, the types of outings irrespective of legality don’t differ across age groups. In fact, the only legal outing was frequently the least engaged-upon outing!

However, there are a few limitations to note whilst considering these results. The authors defined the ‘yes’ response to the ‘groceries/pharmacy’ question as “2 or more times in the last week”, but the ‘yes’ response to the other two types of outings was “1 or more times in the past week”. This may have skewed results - there may be a smaller likelihood of going to the grocery store twice a week if one could simply get everything they needed in one trip. Secondly, since there were varying combinations of responses (eg. yes to groceries and friends but no to other, yes to friends but no to groceries and other etc.), it was difficult to operationalise and proportion the responses correctly. I calculated this based on at least one yes response, which meant that there were many double-ups on responses for participants that answered yes to more than one type of outing - which again may have affected results. Finally, the results of this section differed from the main question in that the total number of outings was a continuous variable - as these responses were categorical, classed as ‘yes’ or ‘no’, it is impossible to know the extent of the differences between outings. This was measured with proportion of occurrence, rather than mean. Nevertheless, it is an interesting observation, though it would have been helpful if the authors included information by participants regarding how many of each type of outing they engaged in.

Q3: Is there a correlation between perceived risk of COVID-19 to oneself vs others?

As discussed in Part 1, I found it surprising that there wasn’t a significant association between belief of having had COVID-19 before and perceived risk of it both towards oneself as well as others. I decided to investigate this first by trying to determine if there was an issue with the measures themselves - did they even correlate with each other?

Cleaning the data

First, I cleaned the data as usual:

data2 <- data %>% 
  rename(selfrisk = q10arisk,
         otherrisk = q10brisk) %>% 
  mutate(Ever_covid = case_when(Ever_covid == 1 ~ "Yes",
                                Ever_covid == 0 ~ "No")) %>% 
  select(selfrisk, otherrisk, Ever_covid)

I ran into the same problem as I did in Question 1 in which R assigned “Has_child” as a continuous rather than a categorical variable - but this time with “Ever_covid”. And this time I knew what to do:

data2$Ever_covid <- as.factor(data2$Ever_covid) #changing into categorical

1. Descriptives

I decided to check out the descriptives first - I decided to look at the mean and SD of the ‘self-risk’ variable across levels of the ‘other-risk’ factor under the assumption that if they both increased, it would imply a positive correlation. I used summarise() to calculate the values and kable_styling() to create a simple table to display the values.

d4 <- data %>% group_by(q10brisk) %>% 
  summarise(
    n = n(), 
    mean = mean(q10arisk),
    sd = sd(q10arisk)) %>% 
  rename(OtherRisk = q10brisk) %>% 
  kbl %>% kable_styling()

d4

OtherRisk	n	mean	sd
1	64	1.812500	0.9900297
2	509	2.267191	0.7149446
3	2683	2.620201	0.6599169
4	2893	3.098514	0.7223281

Looks like the means of one group do steadily increase with the levels of the other.

2. Visualising the data

Now, let’s visualise this using ggplot() again. Since I was looking for a correlation this time, I decided to use geom_smooth() rather than another column graph or box plot. Also, since the authors measured this across levels of ‘ever_covid’, I decided to do the same.

plot3 <- ggplot(data2, aes(selfrisk, otherrisk, group = Ever_covid, colour = Ever_covid)) + 
  geom_smooth(method = "lm", se = FALSE) + 
  scale_y_continuous(name = "Perceived Risk to Others") + 
  scale_x_continuous(name = "Perceived Risk to Self") + 
  theme_bw() +
  ggtitle(label = "Perceived Risk of COVID-19 to Self vs Others") 

plot3

## `geom_smooth()` using formula 'y ~ x'

Okay - there is definitely a positive trend there, for those who do and don’t believe they’ve caught COVID-19 before. A good sign… but let’s see if this is a statistically significant correlation.

3. Statistics

To test whether there was a significant correlation, I used cor.test():

res <- cor.test(data2$selfrisk, data2$otherrisk, method = "pearson") 

res

## 
##  Pearson's product-moment correlation
## 
## data:  data2$selfrisk and data2$otherrisk
## t = 33.458, df = 6147, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3711432 0.4134369
## sample estimates:
##       cor 
## 0.3924975

Looks like there is a weak, but significant positive correlation!

I was curious about the correlation between these two ‘perceived risk’ variables and behaviour - or the total number of outings. Previous research has shown that increased risk perception is associated with greater protective action - did the findings of this study support this hypothesis? I thought I would plot this information on another geom_smooth() plot - but, first, I had to convert the self-risk variable into a discrete variable:

data$q10arisk <- as.factor(data$q10arisk)

And now I could plot!

plot4 <- ggplot(data, aes(Going_out_total, q10brisk, group = q10arisk, colour = q10arisk)) + 
  geom_smooth(method = "lm", se = FALSE) + 
  theme_bw() + 
  scale_colour_discrete("Rating of Perceived Risk to Self") +
  ggtitle(label = "Total Outings across Levels of Perceived Risk") + 
  scale_x_continuous(name = "Total No. of Outings") +
  scale_y_continuous(name = "Rating of Perceived Risk to Others")

plot4

## `geom_smooth()` using formula 'y ~ x'

Though this was far from perfect, considering I was forced to convert the ‘self-risk’ variable into a discrete variable but leave the ‘other-risk’ variable as continuous, I believed it was good enough to illustrate the pattern - perceived risk to others decreased as number of outings increased. This relationship was consistent for perceived risk to self, with this rating also being positively correlated with risk to others.

I then created a correlation matrix based on these observations using the cor() function - but to do this, I first had to convert the perceived self-risk ratings back to a numeric variable:

data$q10arisk <- as.numeric(data$q10arisk)

And here is the correlation:

data4 <- data %>% select(Going_out_total, q10arisk, q10brisk) %>% cor()
print(data4)

##                 Going_out_total    q10arisk   q10brisk
## Going_out_total      1.00000000 -0.08565588 -0.1842952
## q10arisk            -0.08565588  1.00000000  0.3924975
## q10brisk            -0.18429519  0.39249749  1.0000000

Looks like while there are small negative correlations between outings and perceived risk, the strongest correlation is between the 2 forms of perceived risks themselves.

I wanted to create a visual representation of the strengths of the correlations using corrplot() - however, as I only had 3 variables, this didn’t appear to be possible. Since the question stemmed from the worry variable, I thought I may as well include it and find out the correlation.

data5 <- data %>% select(Going_out_total, q10arisk, q10brisk, q9worry) %>% cor()
print(data5)

##                 Going_out_total    q10arisk   q10brisk    q9worry
## Going_out_total      1.00000000 -0.08565588 -0.1842952 -0.1781434
## q10arisk            -0.08565588  1.00000000  0.3924975  0.4702685
## q10brisk            -0.18429519  0.39249749  1.0000000  0.4385310
## q9worry             -0.17814340  0.47026850  0.4385310  1.0000000

Same results as before, but with the new ‘worry’ variable. But let’s see the correlation matrix:

corrplot(data5)

Looks like there is a relatively strong relationship between worry and self-perceived risk. Apart from this relationship, there is also a positive correlation between the two forms of perceived risk, and a weak negative correlation between number of outings and the other three factors - which is to be expected, as staying home would be a protective measure/form of behaviour consistent with those attitudes.

So it looks there there isn’t much wrong with the perceived risk variables.

Results

There is a correlation between perceived of COVID-19 to oneself vs others! This is also weakly correlated with behaviour, as operationalised by total number of outings, as well as general worry.

Part 4: Recommendations

1. Including a README file

My first recommendation to Smith et al would be to include a README file. This refers to a text file that contains an introduction and information about the data set, providing instructions on how the authors used the code and how one might go about reusing it, especially aimed at novices to the data. Typically, this would include the purpose of the file, a description of the methodology and tools/technology, and a list of all the conventions and abbreviations used in the data set to allow for easier understanding. Considering the authors did not provide this file, nor a coding script, it was difficult to know where to start, particularly as we were complete novices. They failed to consider the readers’ point of view – guidance on the steps they took would provide a clearer starting point. Although it should be noted that Smith et al’s variables were well-labelled and hence it was easy to infer which variables were relevant to analysis, a README would increase the ease and smoothness of the reproducibility process. This document provides some basic information on the conventions of a typical README file and an exemplar, and this article explains the benefits of a good README file.

2. Including a Codebook/Data Dictionary

My second recommendation to the authors would be to include a data dictionary, or code book, in their OSF repositories. This is essentially a document that describes the data set, detailing their meanings and labeling the variables measured including details such as how the variable was measured, and the units used. All the categorical data that was provided in the OSF repositories was coded in 0s and 1s, which made it difficult for us (and anyone else trying to interpret it) to determine the meanings. For example, gender was coded in this way, and we were unsure which number corresponded to which gender. This was exacerbated in variables such as age category and region, where there were 5 possible responses. Though it was possible for us to determine this by counting the number of participants in each category and cross referencing it with the figures published in the paper, a simple, coherent code book detailing what each response corresponded to. Further, although the .sav file did provide a brief description of what each variable measured eg. For ‘q7beentested’, a line stating, “Have you been tested for coronavirus?”, there were some variables that were unclear. Most prominently, the variable ‘condition_antibodies’ was labelled “assigned to experimental condition” – but was not featured in any of the figures, and no other information was provided. To our knowledge after reading the paper, there was no experimental condition, leaving us extremely confused as to what this variable could be measuring. Further, as all the responses were coded in 0s and 1s, it was also impossible to determine from there.

That being said, I do believe Smith et al should be praised for using readable variable names; even when labeling descriptions were vague, it was possible to make an educated guess based on its name eg. ‘q8haveimmunity’ corresponding to the extent to which the participant believed they had immunity. On the other hand, it would help to remove the question number in each of the variables to increase human-readability. A code book would also reduce confusion regarding measurement units on the numerical variables – for example, the worry and immunity variables were rated on a scale of 5 whereas the perceived worry variables were rated on a scale of 4. Though it was possible to infer this by reading the paper, this path could have been simplified through the code book detailing the measurement units for every variable. Finally, it was noted in Part 2 that the authors made a mistake in the labeling of the ‘shopping for groceries/pharmacy’ variable, confusing the meanings of 0 and 1 (and we determined this by noticing a mismatch in the percentage values of Figures 2 and 3). A code book may have helped them in avoiding this.

Overall, providing human-readable details to support clear and easy interpretation and analysis of one’s data should be best practice. This website provides an excellent starting point, but there is definitely a wealth of resources online!

3. Including coding script and analysis code

My final recommendation to Smith et al would be to provide their coding script and their analysis code. While coding script by itself may have been overwhelming, complemented by the previous two recommendations of a README file and codebook, it would have been very helpful. Particularly as we were very new to R, we had no idea where to start, what functions they used nor how they managed their data and hence ran into issues when trying to reproduce the descriptives and plots. We couldn’t be sure if the way we were producing the figures was the same way the authors did, as we were not provided any information on how they do it. We didn’t have a choice but to write our own code – this was extremely time-consuming as the entire process consisted of trial and error. It was essentially attempting an exam with no answer sheet. For example, we created data frames for each variable – was there a more efficient way to count all the values without doing this? When we stumbled upon a solution, like this, we couldn’t be certain if it was correct or if there was a more efficient way to go about it. Secondly, as the file in the OSF repository was a .sav file, it could be assumed that the authors conducted their analyses on SPSS. As this is not an open-source software, it is difficult to determine what steps the authors took to conduct their analysis, which was particularly problematic when attempting reproducibility. For example, the authors reported ‘little evidence of an association between perceived worry and belief of having had COVID-19 before’ – but it was difficult to recreate this test of significance without any instructions, guidance, or even a starting point. Whilst this was definitely a great learning experience (forcing us to learn how to code!), albeit the steep learning curve, I believe that the authors including their coding script, accompanied by the former recommendations, would increase the efficiency of the reproducibility process and hence the quality of open science.

Verification Report

June Kam

2022-07-28

Part 1: Summary and Reaction

Summary

Reaction

Part 2: Verification

Figure 1

Load Packages

Read Data

Clean Data

Creating Data Frames

Creating Percentages

Combining the Data

Creating the Table

Figure 2

Clean Data

Creating individual data sets

Calculating percentages and creating a new column

Merging the data

Plotting the data

Figure 3

Clean data

Creating individual data sets to calculate mean and SD

Turning the data into a table

Figure 4

Cleaning data

Creating individual data frames

Creating percentages

Combining the data

Creating the table

Part 3: Exploration

Q1: Does having a child increase worry about COVID-19?

Clean the data

1. Creating descriptives

2. Visualising the data

3. Statistics

Results

Q2: Is there an effect of age on number of outings?

Cleaning the data

1. Descriptives

2. Visualising the data

Cleaning the data: pt 2

1. Counts: pt 2

2. Visualising the data: pt 2

Results

Q3: Is there a correlation between perceived risk of COVID-19 to oneself vs others?

Cleaning the data

1. Descriptives

2. Visualising the data

3. Statistics

Results

Part 4: Recommendations

1. Including a README file

2. Including a Codebook/Data Dictionary

3. Including coding script and analysis code

Thanks for reading!