The link to my HTML for marking can be found here:

Part 1: Summary and Reaction

Smith, L. E., Mottershaw, A. L., Egan, M., Waller, J., Marteau, T. M., & Rubin, G. J. (2020). The impact of believing you have had COVID-19 on self-reported behaviour: Cross-sectional survey. PloS One, 15(11), e0240399–e0240399. https://doi.org/10.1371/journal.pone.0240399

Summary

Numerous countries enforced “lockdowns” that limited individual movements during the COVID-19 pandemic to reduce the spread of the virus in the community. However, not everyone adhered to the lockdown restrictions that were in place. Speculation that individuals’ lack of adherence to these protective measures was due to thinking they had previously had COVID-19 prompted Smith et al. (2020) to investigate whether this was true of adults from the UK. Specifically, Smith et al. (2020) hypothesised that thinking one has had COVID-19 is associated with being less likely to be worried about COVID-19, more likely to believe you have immunity, less likely to adhere to social distancing rules and perceive a lower risk of COVID-19 to oneself compared to those who do not believe they have had COVID-19. This study adds to the body of research looking into adherence to lockdown measures that were an important part of reducing the spread of COVID-19. This research is significant, as it fills a gap in the literature by addressing the question of whether the personal belief that you have had COVID-19, as opposed to having tested positive for COVID-19, is correlated with different levels of adherence to lockdown rules.

In Smith et al.’s (2020) study, 6149 participants from the UK aged 18 years and older completed an online cross-sectional self-report questionnaire to measure whether the belief that they have had COVID-19 previously is correlated with various factors, as mentioned above. Participants self-reported information to binary outcomes, such as yes and no questions, and continuous scales, such as a 5-point Likert scale. For instance, participants answered questions such as “To what extent do you agree with the statement ‘I think I have some immunity to coronavirus’” on a five-point Likert scale from ‘strongly agree’ to ‘strongly disagree’. Participants received points for undertaking this study which they could redeem for cash, gift vouchers or donations to charity.

Smith et al. (2020) conducted a 95% confidence interval, one-way ANOVA tests and logistic regression analysis to test whether their hypothesis was supported. Specifically, the data collected revealed that 24.3% of participants thought they had previously had COVID-19, however, only 9.4% of participants had gotten tested, and 4% of participants had received a positive test. Despite not having tested positive for COVID-19, participants who believed that they had previously had COVID-19 reported lower average levels of worry about COVID-19, higher average levels of perceived immunity to COVID-19 and higher average levels of breaking lockdown rules compared to participants who did not think they had previously had COVID-19. Concerningly, those who believed they had previously had COVID-19 were less likely on average to correctly identify the most common symptoms of COVID-19. Additionally, Smith et al. (2020) found that some demographics, including young people, parents, employed individuals and those who worked in a key sector, were more likely to believe they had caught COVID-19. However, there was no correlation between believing you previously had COVID-19 and the perceived risk of COVID-19 or average levels of leaving the house.

The use of online cross-sectional self-report questionnaires allowed Smith et al. (2020) to quickly and conveniently collect data from a vast demographic of individuals, allowing for a broad and diverse sample group, thus increasing the validity of the results. However, the self-report data may reduce the validity of the results due to people incorrectly reporting because of social desirability bias.

Nevertheless, these results are interesting, as they raise the question of whether the majority of these individuals who self-reported thinking they have had COVID-19 disobeying lockdown rules contributed to community transmission of COVID-19 in the UK. Furthermore, these results highlight the importance of targeting education towards these individuals to stop the spread of the virus in the UK, as at the time this study was conducted, there was no media coverage targeting people who believed they had already had COVID-19. This has real-world implications for targeting the identified groups with increased media exposure about the importance of adhering to lockdown measures, thus potentially reducing the spread of the virus in the community and reducing the risk of COVID-19 to vulnerable people in the community.

Reaction

The most interesting part of this paper was the expansive sample of 6149 participants that Smith et al. (2020) collected data from. The participants were adults from different regions across the UK with various education levels, ages, employment status’s and employment types. This large and broad sample increases the validity of the results and allows for further analysis into commonalities between the behaviour of various demographics during the pandemic. For instance, the UK is made up of various regions and it is likely that individuals’ adherence to lockdown measures differed between these regions, due to the different restrictions in the place, how they were enforced and the media coverage about the importance of lockdown measures in that region. Smith et al. (2020) did not look at whether adherence levels to lockdown restrictions differed in each region of the UK which would have been an interesting point of analysis to explore.

I was surprised that Smith et al. (2020) did not consider third variables that may be contributing to whether individuals break lockdown rules. For instance, I wondered whether participants’ adherence to the lockdown rule of not meeting up with friends or family differs according to their age, as younger people often spend more time socialising with friends and family. Additionally, individuals who are more educated may be more aware of how immunity works which makes me wonder whether there is a correlation between individuals’ education level and their perceived immunity to COVID-19. These questions would provide interesting points for further exploration.

I wonder whether participants were impacted by social desirability bias when completing their self-reported answers. For instance, participants may have been aware that the researchers conducting this study hypothesized that people who believed they have had COVID-19 are more likely to engage in risky behaviours, such as breaking lockdown rules. In turn, participants’ self-reported results may be biased towards answering the questions in a manner that aligns with what they expect the researchers to want to find. If social desirability bias did impact participants’ responses this would reduce the validity of Smith et al.’s (2020) findings.

Part 2: Verification

Verification Goals

The goal of the verification section of this report is to reproduce Table 1, Table 2, Table 3 and Figure 1 from Smith et al.’s (2020) study that have been included below. Smith et al. (2020) did not report the demographic descriptives of their participants, such as mean age or age range, thus we could not produce these.

Table 1

Table 2

Table 3

Figure 1

Locate the open data

First off, we obtained the raw data from the OSF repository which can be found here

Load packages

The base R package does not include all of the useful functions we used, rather they come from libraries that we must load before using these functions. Thus, our next step was to load the relevant packages needed to complete this verification report using the $library()$ function, as seen in below.

Note: If error message occurs run $install.package()$ and insert the package name in the bracket. This will download any packages that are not already installed, then try loading it again using the $library()$ function and inserting the package name in the brackets. For instance, if $library(tidyverse)$ fails try the $install.packages(tidyverse)$ function first.

library(tidyverse) # This package contains the majority of the functions we used
library(haven) # Reads data from .sav file which our raw dataset was (tidyverse)
library(dplyr)  # Allows R to manipulate the data more easily 
library(kableExtra) # Make the tables display the information in an aesthetic way
library(tibble) # Helps construct data frames
library(knitr) # Helps with knitting this report to PDF
library(stats) # calculates basic statistics including standard deviation
library(tinytex) # to help knit this report to PDF

Read the Data

Next, we set the working directory to my PSYC3361 folder that has the raw data on my computer using the $setwd()$ from the base package and putting the name of my working directory “~/PSYC3361” in the brackets. This function sets the working directory so that I wouldn’t have to keep reloading the data every time that I went onto the RMD file.

Then I loaded the data into my RMD file using the ‘read_sav()’ function from the haven package which is part of the tidyverse. The raw data from Smith et al.’s (2020) study was saved as a sav file in my working directory, as it was an SPSS data file. The raw data was called “smith_data.sav” in my working directory, so I inserted this name into the brackets of the ‘read_sav’ function for it to be read, as seen in the chunk below. I also assigned this raw data file with the name ‘data’ using the assignment operator ‘$<-$’.

Next I wanted to make ‘data’ into a data frame and view this as a table, so I used the $as.data.frame()$ function from the base package to organises the tibble rows into a table of rows and columns, as seen in the last line of this chunk.

Glimpse the data to look at variables

I then looked at every column in the data frame by using the glimpse() function from dplyr that provides a preview of the data Smith et al. (2020) collected from the questionnaires. The OSF repository for Smith et al.’s (2020) study did not include a codebook, so this function helped us to understand the data type for each variable, work out what each column contains and establish that there are 19 variables in the dataset.

glimpse(data)

## Rows: 6,149
## Columns: 19
## $ Ever_covid            <dbl+lbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ q7beentested          <dbl+lbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ q8haveimmunity        <dbl+lbl> 3, 2, 3, 3, 3, 1, 1, 2, 1, 2, 1, 2, 3, 1, 2,…
## $ q9worry               <dbl+lbl> 4, 4, 3, 2, 3, 4, 5, 3, 4, 3, 4, 4, 4, 3, 4,…
## $ q10arisk              <dbl+lbl> 3, 3, 4, 3, 3, 2, 3, 3, 2, 2, 4, 3, 2, 2, 2,…
## $ q10brisk              <dbl+lbl> 4, 4, 4, 3, 3, 4, 3, 3, 3, 3, 3, 4, 4, 3, 3,…
## $ Going_out_total       <dbl> 0, 5, 0, 10, 4, 6, 13, 8, 1, 7, 0, 2, 5, 10, 11,…
## $ Adhere_shop_groceries <dbl+lbl> 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1,…
## $ Adhere_shop_other     <dbl+lbl> 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1,…
## $ Adhere_meet_friends   <dbl+lbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ Sx_covid_nomissing    <dbl+lbl> 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1,…
## $ condition_antibodies  <dbl> 7, 1, 7, 1, 5, 1, 1, 6, 2, 2, 6, 2, 2, 5, 8, 4, …
## $ gender                <dbl+lbl> 2, 1, 2, 2, 1, 1, 1, 2, 2, 1, 2, 1, 2, 2, 1,…
## $ age_categories        <dbl+lbl> 3, 5, 5, 4, 1, 1, 4, 3, 1, 5, 5, 2, 5, 3, 2,…
## $ Has_child             <dbl+lbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ Working               <dbl+lbl> 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1,…
## $ Key_worker            <dbl+lbl> 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,…
## $ degree                <dbl+lbl> 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1,…
## $ region                <dbl+lbl> 1, 5, 5, 2, 2, 1, 5, 5, 2, 1, 5, 3, 3, 2, 3,…

Rename Variables

Once again I used the assignment operator ‘$<-$’ and this lets R know where to get the information from, and in this case it is getting it from “data” which is what I called the raw data file in the chunk above. I used the pipe function (%>%) from the tidyverse package at the end of this first line to pipe it forward into the next function. I like to think of the pipe function as saying “and then” when writing my code. The ‘<-’ and ‘%>%’ functions used in this first line are used regularly throughout my code and for the sake of brevity I will not explain these again.

I changed the variable names to be more clear using the $rename()$ function from the dplyr package. Inside the brackets the format is (“new name” = original name). When renaming then I used all lower case and made the names informative and clear. Renaming these variables to be clear and consistent helped me to avoid confusion when reproducing the results later on.

data <- data %>% 
  rename("ever_covid" = Ever_covid,
         "covid_test" = q7beentested,
         "immunity" = q8haveimmunity,
         "worry" = q9worry,
         "risk_self" = q10arisk,
         "risk_other" = q10brisk,
         "go_out_total" = Going_out_total,
         "adhere_groceries" = Adhere_shop_groceries,
         "adhere_other" = Adhere_shop_other,
         "adhere_friends" = Adhere_meet_friends,
         "identify_symptoms" = Sx_covid_nomissing,
         "antibodies" = condition_antibodies,
         "gender" = gender,
         "age" = age_categories,
         "child" = Has_child,
         "work" = Working,
         "key_work" = Key_worker,
         "degree" = degree,
         "region" = region)

Identifying the variables

We had to work out which numerical value is assigned to which factor for the variables, as the authors did not have a data dictionary outlining this which another job on our journey to reproducing Smith et al.’s (2020) results.

In the data from Smith et al.’s (2020) OSF file, gender was coded as 1 and 2, although there was no indication as to which referred to male or female. To identify which numeric responded to which gender, I found out how many participants were labelled as 1 and 2 using $table(data['gender'])$, as seen in the chunk below. Then, I matched these numbers up with how many males and females Smith et al. (2020) reported in their article to find out that ‘male’ = 1 and ‘female’ = 2 in the data from the OSF.

table(data['gender'])

## gender
##    1    2 
## 2894 3255

In this chunk, I followed the same process as above and simply changed the variable name in the code to find that participants having a qualification was coded as 1 and not having a qualification was coded as 2.

table(data['degree'])

## degree
##    0    1 
## 4442 1615

Table 1

In this section we set out to reproduce Table 1 of Smith et al.’s (2020) study.

Reproducing Table 1 proved difficult for our team with little coding experience. It was time-consuming and frustrating with lots of trial and error, until we had a breakthrough of finding the basic framework that we used to build our tables, as revealed in the coding journey below. This involved computing the variables 1 by 1, binding the rows together, then making them look good. While we later discovered better ways of doing things this initial triumph gave us the confidence to persevere which was extremely satisfying.

First I selected the columns from the data that are in Table 1 using the $select()$ function from dplyr which falls under tidyverse.

dataT1 <- data %>%
  select(ever_covid,
         gender,
         age,
         child,
         work,
         key_work,
         degree,
         region)

The code used for the chunk below was chosen at the beginning of our learning R journey, based on Danielle’s modules and Google. We wanted to calculate the number and proportion of participants according to their gender and whether or not they believed they had previously had COVID-19. We found someone doing this on stack overflow, so we thought we would try the same functions for our data and it worked! The process for creating this chunk went as follows:

First, the filtered data tibble ‘dataT1’ is saved as a new entity for the gender variable called ‘T1_gender’ using the assignment operator $<-$ as seen previously.

The next line uses the $filter()$ function from the dplyr package to subset rows of data to filter out redundant values from the gender variable. Smith et al. (2020) must have had an exclusion criteria they did not report, because some of the variables did not have data for all of the participants. We navigated this by using this filter function to assist in reproducing this table.

Then put the participants into subgroups according to whether or not they believed they previously had COVID-19 by using the ‘group_by()’ function from the dplyr package. Putting ‘gender’ first in the brackets, then ‘ever_covid’ is important, because participants must be categorised into male or female before they are categorised into the ‘ever_covid’ variable, as it further categorises them into the subgroups: Male-NoCovid, Male-YesCovid, Female-NoCovid, Female-YesCovid.

Next, we used the $summarise(n = n())$ function from the dplyr package to calculate the total number of participants that are in each subgroup.

Then, find the proportion of each gender within the subgroups by using the $mutate()$ function from the dplyr package. Put “percentage = n/sum(n) * 100” inside the brackets, as this tells R to take the total number of participants that are in each group that was found in the previous function, divide that by the total number of people, and then multiply this by 100 to create a new column for the percentages.

To round the numerical values to 1 d.p as Table 1 does we used the ‘mutate_if()’ function and put “is.numeric, ~ round(.,1)” into the brackets, as this tells r to round to 1 decimal place.

Finally, check whether the labels and descriptive that the functions used earlier in this chunk produced are right by adding ‘T1_gender’ which is essentially what I like to think of as the name of this chunk.

T1_gender <- dataT1 %>%
  filter(!is.na(gender)) %>% 
  group_by(gender, ever_covid) %>% 
  summarise(n = n()) %>%
  mutate(percentage = n / sum(n) * 100) %>%
  mutate_if(is.numeric, ~ round(.,1))

## `summarise()` has grouped output by 'gender'. You can override using the
## `.groups` argument.
## `mutate_if()` ignored the following grouping variables:

T1_gender

## # A tibble: 4 × 4
## # Groups:   gender [2]
##       gender ever_covid     n percentage
##    <dbl+lbl>      <dbl> <dbl>      <dbl>
## 1 1 [Male]            0  2197       75.9
## 2 1 [Male]            1   697       24.1
## 3 2 [Female]          0  2459       75.5
## 4 2 [Female]          1   796       24.5

To get the proportion of participants for the rest of the variables in table 1, repeat the same functions that were outlined above and swap their labels to the new variable name. For instance, change the variable name “gender” from the above chunk to “age” to calculate the numerical values for the age variable, as seen in the chunk below.

Out of interest, I added up the number of participants for each variable when looking at the tables of data produced by this chunk. I noticed that the total number of participants for some variables, including “Have a child” and “Employment Status”, did not equal the total number of participants in the study (n = 6149). This was a key moment on my reproducibility journey, as I realised that Smith et al. (2020) having an exclusion criteria that they did not document reduces the reproducibility of their research and made our journey of reproducing Table 1 more difficult than if it had been documented. Here, we realised that using the $filter()$ function from the dplyr package first makes our output match Table 1.

# Age
T1_age <- dataT1 %>%
  filter(!is.na(age)) %>%
  group_by(age, ever_covid) %>%
  summarise(n=n()) %>%
  mutate(percentage = n/sum(n) * 100) %>%
  mutate_if(is.numeric, ~round(.,1)) %>%
  ungroup()

## `summarise()` has grouped output by 'age'. You can override using the `.groups`
## argument.
## `mutate_if()` ignored the following grouping variables:

# Have a Child
T1_child <- dataT1 %>%
  filter(!is.na(child)) %>%
  group_by(child, ever_covid) %>%
  summarise(n = n()) %>%
  mutate(percentage = n/sum(n) * 100) %>%
  mutate_if(is.numeric, ~round(.,1)) %>%
  ungroup()

## `summarise()` has grouped output by 'child'. You can override using the
## `.groups` argument.
## `mutate_if()` ignored the following grouping variables:

# Employment Status
T1_work <- dataT1 %>%
  filter(!is.na(work)) %>%
  group_by(work, ever_covid) %>%
  summarise(n = n()) %>%
  mutate(percentage = n/sum(n) * 100) %>%
  mutate_if(is.numeric, ~round(.,1)) %>%
  ungroup()

## `summarise()` has grouped output by 'work'. You can override using the
## `.groups` argument.
## `mutate_if()` ignored the following grouping variables:

# Working in Key Sector
T1_keywork <- dataT1 %>%
  filter(!is.na(key_work)) %>%
  group_by(key_work, ever_covid) %>%
  summarise(n = n()) %>%
  mutate(percentage = n/sum(n) * 100) %>%
  mutate_if(is.numeric, ~round(.,1)) %>%
  ungroup()

## `summarise()` has grouped output by 'key_work'. You can override using the
## `.groups` argument.
## `mutate_if()` ignored the following grouping variables:

# Qualification
T1_degree <- dataT1 %>%
  filter(!is.na(degree)) %>%
  group_by(degree, ever_covid) %>%
  summarise(n = n()) %>%
  mutate(percentage = n/sum(n) * 100) %>%
  mutate_if(is.numeric, ~round(.,1)) %>%
  ungroup()

## `summarise()` has grouped output by 'degree'. You can override using the
## `.groups` argument.
## `mutate_if()` ignored the following grouping variables:

# Region
T1_region <- dataT1 %>%
  filter(!is.na(region)) %>%
  group_by(region, ever_covid) %>%
  summarise(n = n()) %>%
  mutate(percentage = n/sum(n) * 100) %>%
  mutate_if(is.numeric, ~round(.,1)) %>%
  ungroup()

## `summarise()` has grouped output by 'region'. You can override using the
## `.groups` argument.
## `mutate_if()` ignored the following grouping variables:

Now that the numeric parts of Table 1 have been established, we can put this all together in a clear and cohesive table. Our first attempt involved saving each individual value into a new entity using the assignment operator $=$ from the base package to organise the labels we made before making it into a tibble. The chunk below provides an example of how we would do this with the first characteristic ‘gender’.

In the next section of this chunk we create a custom tibble from scratch using the $c()$ function to combine the values using the format: “column name” = c(INSERT VALUES), as seen below.

Note: This is named “T1_failedattempt” because it was out first attempt at this table although we did not run with this as we realised it is not the most efficient way to code. If we were to have continued this for the rest of the other 6 variables, there would have been over 300+ lines of codes for this table alone. This method is simplistic and repetitive, yet not very precise. Thus, we later chose to find a different function that does this more efficiently as demonstrated in the journey outlined below.

n_gender_male_no = T1_gender$n[1]
n_gender_male_yes = T1_gender$n[2]
n_gender_female_no = T1_gender$n[3]
n_gender_female_yes = T1_gender$n[4]
per_gender_male_no = T1_gender$percentage[1]
per_gender_male_yes = T1_gender$percentage[2]
per_gender_female_no = T1_gender$percentage[3]
per_gender_female_yes = T1_gender$percentage[4]

T1_failedattempt <- tibble(
  " " = c(
    "Gender",
    " "),
  "  " = c(
    "Male", 
    "Female"),
  "Not had n" = c(
    n_gender_male_no, 
    n_gender_female_no),
  "Not had %"= c(
    per_gender_male_no, 
    per_gender_female_no),
  "Had n" = c(
    n_gender_male_yes, 
    n_gender_female_yes),
  "Had %" = c(
    per_gender_male_yes, 
    per_gender_female_yes))
T1_failedattempt

## # A tibble: 2 × 6
##   ` `      `  `   `Not had n` `Not had %` `Had n` `Had %`
##   <chr>    <chr>        <dbl>       <dbl>   <dbl>   <dbl>
## 1 "Gender" Male          2197        75.9     697    24.1
## 2 " "      Female        2459        75.5     796    24.5

Next, we tried using the kableextra package to produce Table 1. While the code below runs, this was still a tedious process that requires lots of lines of code. We used the $kable()$ function from the kable extra package to generate the table. Then we used the ‘kable_paper()’ function to style this table with the ‘latex_options’ function inside the brackets so that it can be knitted to a pdf and the ‘add_header_above(c()’ function to insert headers into the table.

T1_failedattempt_kbl <- T1_failedattempt %>%
  kable("html") %>%

  kable_paper(
    latex_options = c(
      "striped", # adds striped lines    
      "hover", # highlight the hovered row
      "condensed", # makes the row height slightly shorter
      "hold_position", # puts the table into position
      "scale_down")) %>% # makes the table fit on the page
  
  add_header_above(c(
    "Participant characteristic" = 1,
    "Level" = 1,
    "Think have not had COVID-19 n = 4656" = 2,
    "Think have had COVID-19 n = 1493" = 2
  )) %>%
  
  add_header_above(c(
    " " = 2,
    "Had COVID-19" = 4
    ))

T1_failedattempt_kbl

		Had COVID-19
Participant characteristic	Level	Think have not had COVID-19 n = 4656		Think have had COVID-19 n = 1493
		Not had n	Not had %	Had n	Had %
Gender	Male	2197	75.9	697	24.1
	Female	2459	75.5	796	24.5

The above chunk still seems inefficient, and we thought surely there is a faster way to produce this. So, we tried did some Googling and found out about the ‘bind_rows()’ function from the dplyr package that is used to combine the tibbles vertically. We inserted the previously created variable names inside the brackets as seen in the chunk below.

T1_bind_1_failed <- bind_rows(T1_gender,
                              T1_age,
                              T1_child,
                              T1_work,
                              T1_keywork,
                              T1_degree,
                              T1_region)
T1_bind_1_failed

## # A tibble: 40 × 10
## # Groups:   gender [3]
##         gender ever_covid     n percentage       age child  work key_work degree
##      <dbl+lbl>      <dbl> <dbl>      <dbl> <dbl+lbl> <dbl> <dbl> <dbl+lb> <dbl+>
##  1  1 [Male]            0  2197       75.9 NA           NA    NA       NA     NA
##  2  1 [Male]            1   697       24.1 NA           NA    NA       NA     NA
##  3  2 [Female]          0  2459       75.5 NA           NA    NA       NA     NA
##  4  2 [Female]          1   796       24.5 NA           NA    NA       NA     NA
##  5 NA                   0  1003       70.5  1 [18-2…    NA    NA       NA     NA
##  6 NA                   1   419       29.5  1 [18-2…    NA    NA       NA     NA
##  7 NA                   0   823       67.3  2 [25 t…    NA    NA       NA     NA
##  8 NA                   1   400       32.7  2 [25 t…    NA    NA       NA     NA
##  9 NA                   0   751       71.9  3 [35 t…    NA    NA       NA     NA
## 10 NA                   1   294       28.1  3 [35 t…    NA    NA       NA     NA
## # … with 30 more rows, and 1 more variable: region <dbl+lbl>

We realised that this produces a table with all of the ‘ever_covid’ variables instead of splitting it into participants who have and have not previously thought they had COVID-19 as table 1 does. In the next chunk, we troubleshooted and did the same code as above, but removed the first column of each tibble by adding ‘[-1]’ after each variable to prevent the N/A values from the data being in the table.

T1_bind_1 <- bind_rows(T1_gender[-1],
                       T1_age[-1],
                       T1_child[-1],
                       T1_work[-1],
                       T1_keywork[-1],
                       T1_degree[-1],
                       T1_region[-1])
T1_bind_1

## # A tibble: 40 × 3
##    ever_covid     n percentage
##         <dbl> <dbl>      <dbl>
##  1          0  2197       75.9
##  2          1   697       24.1
##  3          0  2459       75.5
##  4          1   796       24.5
##  5          0  1003       70.5
##  6          1   419       29.5
##  7          0   823       67.3
##  8          1   400       32.7
##  9          0   751       71.9
## 10          1   294       28.1
## # … with 30 more rows

Smith et al.’s (2020) paper separated columns according to whether or not participants believed they had previously had COVID-19. Thus, the next step involved splitting the table we produced above into two sections: one with descriptive statistics for participants who thought they had not previously had COVID-19 and the other for participants who thought they had previously had COVID-19. In the chunk below, we grouped participants based on the ‘ever_covid’ variable using the $filter()$ function from the dplyr package. The part inside the brackets that says “ever_covid == 0” tells R to include participants who indicated that they do not think they’ve had COVID-19 in the first tibble, and “ever_covid == 1” inside the brackets tells R to produce output only for participants who believe that they had previously had COVID-19 in the second tibble.

# table for participants who had not had Covid
T1_nocovid <- T1_bind_1 %>%
  filter(ever_covid == 0)

# table for participants who had had Covid
T1_yescovid <- T1_bind_1 %>%
  filter(ever_covid == 1)

Next, the ‘bind_cols()’ function from the dplyr package is used to horizontally bind these two tables from the previous chunk together into a new table. Since this new table has specific headings, the columns with the numerical values were needed instead of the ‘ever_covid’ columns. Thus, the code inside the brackets is “T1_nocovid, T1_yescovid”, as these data frames have the numerical values, as seen above.

T1_bind_2 <- bind_cols(
  T1_nocovid,
  T1_yescovid)

## New names:
## • `ever_covid` -> `ever_covid...1`
## • `n` -> `n...2`
## • `percentage` -> `percentage...3`
## • `ever_covid` -> `ever_covid...4`
## • `n` -> `n...5`
## • `percentage` -> `percentage...6`

T1_bind_2

## # A tibble: 20 × 6
##    ever_covid...1 n...2 percentage...3 ever_covid...4 n...5 percentage...6
##             <dbl> <dbl>          <dbl>          <dbl> <dbl>          <dbl>
##  1              0  2197           75.9              1   697           24.1
##  2              0  2459           75.5              1   796           24.5
##  3              0  1003           70.5              1   419           29.5
##  4              0   823           67.3              1   400           32.7
##  5              0   751           71.9              1   294           28.1
##  6              0   554           77.2              1   164           22.8
##  7              0  1525           87.6              1   216           12.4
##  8              0  2005           76.4              1   621           23.6
##  9              0  2386           75.5              1   776           24.5
## 10              0  1714           82.8              1   357           17.2
## 11              0  2871           71.9              1  1124           28.1
## 12              0  3105           80.5              1   753           19.5
## 13              0  1551           67.7              1   740           32.3
## 14              0  3382           76.1              1  1060           23.9
## 15              0  1200           74.3              1   415           25.7
## 16              0   781           75.7              1   251           24.3
## 17              0  1369           76.7              1   416           23.3
## 18              0  1120           77                1   335           23  
## 19              0   701           70.1              1   299           29.9
## 20              0   685           78.1              1   192           21.9

In the table produced above, RStudio has renamed some of the column labels that were the same, so we made a note of this to remember later to rename these columns later.

Next we made a demographic characteristic list by creating a tibble with the characteristics in Table 1 by first specifying the necessary variable names using the $c()$ function from the base package to combine the characteristics. Inside the brackets here are the variables for Table 1 and blank entries (” “) that create blank space where necessary. Then, we put the characteristics list we just made into the tibble in the line”T1_bind_2$Characteristics <- Characteristics”. This manually creates a new column in ‘T1_bind_2’ to make the row labels for the characteristics. Then, we include the levels within each demographic characteristic by making a new list called “Level” and using the $c()$ function was to specify the levels within these individual characteristics. Next, we put the “Level” list we just made into the tibble using “T1_bind_2$Level <- Level” .

In other words, this chunk is making the ‘Gender’ variable have the levels Male and Female in the ‘Gender’ column. This process for all other variables including the Age characteristic with the levels “18 to 24 years”, “25 to 34 years”, “35 to 44 years”, “45 to 54 years”, “55 years and over” etc.

Characteristics = c(
  "Gender", " ",
  "Age"," ", " ", " ", " ",
  "Have a child"," ",
  "Employment status"," ",
  "Working in key sector", " ",
  "Highest educational or professional qualification", " ",
  "Region", " "," "," "," ")  

T1_bind_2$Characteristics <- Characteristics

Level <- c("Male",
           "Female",
           "18 to 24 years",
           "25 to 34 years",
           "35 to 44 years",
           "45 to 54 years",
           "55 years and over",
           "No",
           "Yes",
           "Not working",
           "Working",
           "No ",
           "Yes ",
           "GCSE/vocational/A-level/No formal qualifications",
           "Degree or higher (Bachelors, Masters, PhD",
           "Midlands",
           "South & East",
           "North",
           "London",
           "Wales, Scotland and Northern Ireland")

T1_bind_2$Level <- Level

In the chunk below we are checking that the data looks right and everything is in order (e.g characteristics and their levels match correctly) before we start formatting the tale. To do this we used the $select()$ function from the dplyr package to select the columns from the data based on the condition of the variable not starting with the word “ever”, as we are making a completely new table.

T1_bind_2_woevercovid <- T1_bind_2 %>% 
  select(-starts_with("ever"))

T1_bind_2_woevercovid

## # A tibble: 20 × 6
##    n...2 percentage...3 n...5 percentage...6 Characteristics               Level
##    <dbl>          <dbl> <dbl>          <dbl> <chr>                         <chr>
##  1  2197           75.9   697           24.1 "Gender"                      "Mal…
##  2  2459           75.5   796           24.5 " "                           "Fem…
##  3  1003           70.5   419           29.5 "Age"                         "18 …
##  4   823           67.3   400           32.7 " "                           "25 …
##  5   751           71.9   294           28.1 " "                           "35 …
##  6   554           77.2   164           22.8 " "                           "45 …
##  7  1525           87.6   216           12.4 " "                           "55 …
##  8  2005           76.4   621           23.6 "Have a child"                "No" 
##  9  2386           75.5   776           24.5 " "                           "Yes"
## 10  1714           82.8   357           17.2 "Employment status"           "Not…
## 11  2871           71.9  1124           28.1 " "                           "Wor…
## 12  3105           80.5   753           19.5 "Working in key sector"       "No "
## 13  1551           67.7   740           32.3 " "                           "Yes…
## 14  3382           76.1  1060           23.9 "Highest educational or prof… "GCS…
## 15  1200           74.3   415           25.7 " "                           "Deg…
## 16   781           75.7   251           24.3 "Region"                      "Mid…
## 17  1369           76.7   416           23.3 " "                           "Sou…
## 18  1120           77     335           23   " "                           "Nor…
## 19   701           70.1   299           29.9 " "                           "Lon…
## 20   685           78.1   192           21.9 " "                           "Wal…

From the table we produced in the chunk above we remembered that the column labels need renaming to match Table 1, so we used the $rename()$ function from the dplyr package to do this. This function was chosen as we had previously used it to rename variables when the raw dataset was first uploaded at the start and knew it would work to rename the variables. From the output above it was clear to us that Column 1 is how many participants think they have not previously had COVID-19, Column 2 is the percentage of participants that think they have not had COVID-19, Column 3 is how many participants think they have previously had COVID-19 and Column 4 is the percentage of participants who think they have previously had COVID-19. We used this knowledge to rename the variables correctly to the same names used in Table 1 by Smith et al. (2020).

T1_bind_2_woevercovid_renamed <- 
  T1_bind_2_woevercovid %>% 
  rename(
  "n(NotHad)" = n...2,          # n for P w/o COVID
  "%(NotHad)" = percentage...3, # % for P w/o COVID
  "n(Had)"    = n...5,          # n for P w COVID
  "%(Had)"    = percentage...6  # % for P w COVID
)

T1_bind_2_woevercovid_renamed

## # A tibble: 20 × 6
##    `n(NotHad)` `%(NotHad)` `n(Had)` `%(Had)` Characteristics               Level
##          <dbl>       <dbl>    <dbl>    <dbl> <chr>                         <chr>
##  1        2197        75.9      697     24.1 "Gender"                      "Mal…
##  2        2459        75.5      796     24.5 " "                           "Fem…
##  3        1003        70.5      419     29.5 "Age"                         "18 …
##  4         823        67.3      400     32.7 " "                           "25 …
##  5         751        71.9      294     28.1 " "                           "35 …
##  6         554        77.2      164     22.8 " "                           "45 …
##  7        1525        87.6      216     12.4 " "                           "55 …
##  8        2005        76.4      621     23.6 "Have a child"                "No" 
##  9        2386        75.5      776     24.5 " "                           "Yes"
## 10        1714        82.8      357     17.2 "Employment status"           "Not…
## 11        2871        71.9     1124     28.1 " "                           "Wor…
## 12        3105        80.5      753     19.5 "Working in key sector"       "No "
## 13        1551        67.7      740     32.3 " "                           "Yes…
## 14        3382        76.1     1060     23.9 "Highest educational or prof… "GCS…
## 15        1200        74.3      415     25.7 " "                           "Deg…
## 16         781        75.7      251     24.3 "Region"                      "Mid…
## 17        1369        76.7      416     23.3 " "                           "Sou…
## 18        1120        77        335     23   " "                           "Nor…
## 19         701        70.1      299     29.9 " "                           "Lon…
## 20         685        78.1      192     21.9 " "                           "Wal…

In the table produced by the chunk above the ‘Characteristics’ and ‘Level’ columns are at the end of the table in column 5 and 6. To make our table like Smith et al.’s (2020) Table 1, we rearranged these two columns to be in column 1 and 2 instead using the $select()$ function from the dplyr package. We chose this function as it is the most simple way to do this and we were following the KISS method.

T1_bind_2_woevercovid_renamed_rearrange <- T1_bind_2_woevercovid_renamed %>%
  select(5, 6, 1, 2, 3, 4)

T1_bind_2_woevercovid_renamed_rearrange

## # A tibble: 20 × 6
##    Characteristics               Level `n(NotHad)` `%(NotHad)` `n(Had)` `%(Had)`
##    <chr>                         <chr>       <dbl>       <dbl>    <dbl>    <dbl>
##  1 "Gender"                      "Mal…        2197        75.9      697     24.1
##  2 " "                           "Fem…        2459        75.5      796     24.5
##  3 "Age"                         "18 …        1003        70.5      419     29.5
##  4 " "                           "25 …         823        67.3      400     32.7
##  5 " "                           "35 …         751        71.9      294     28.1
##  6 " "                           "45 …         554        77.2      164     22.8
##  7 " "                           "55 …        1525        87.6      216     12.4
##  8 "Have a child"                "No"         2005        76.4      621     23.6
##  9 " "                           "Yes"        2386        75.5      776     24.5
## 10 "Employment status"           "Not…        1714        82.8      357     17.2
## 11 " "                           "Wor…        2871        71.9     1124     28.1
## 12 "Working in key sector"       "No "        3105        80.5      753     19.5
## 13 " "                           "Yes…        1551        67.7      740     32.3
## 14 "Highest educational or prof… "GCS…        3382        76.1     1060     23.9
## 15 " "                           "Deg…        1200        74.3      415     25.7
## 16 "Region"                      "Mid…         781        75.7      251     24.3
## 17 " "                           "Sou…        1369        76.7      416     23.3
## 18 " "                           "Nor…        1120        77        335     23  
## 19 " "                           "Lon…         701        70.1      299     29.9
## 20 " "                           "Wal…         685        78.1      192     21.9

Now our table is ready to be formatted using the kableExtra package. First, the $kbl()$ function tells R to make a table using the kableExtra package, and “T1_bind_2_woevercovid_renamed_rearrange’ is put in the brackets to tell R to use the data from this dataframe. Then, the table is styled to look like Smith et al.’s (2020) Table 1 using the ‘kable_styling()’ function and ‘bootstrap_option =c()’ with”striped” inside the brackets to add striped lines to the table and “hover” to highlight the hovered row, as seen below. Then, we add headings using the ‘add_header_above()’ function, as seen below.

T1_final_fail <- 
kbl(T1_bind_2_woevercovid_renamed_rearrange) %>%
  kable_styling(bootstrap_options = 
                  c("striped", 
                    "hover")) %>% 
  
  add_header_above(c(
    " " = 2,
    "Think have not had COVID-19 n = 4656" = 2, 
    "Think have had COVID-19 n = 1493" = 2)) %>%
  add_header_above(c(
    " " = 2,
    "Had COVID-19" = 4))


T1_final_fail

		Had COVID-19
		Think have not had COVID-19 n = 4656		Think have had COVID-19 n = 1493
Characteristics	Level	n(NotHad)	%(NotHad)	n(Had)	%(Had)
Gender	Male	2197	75.9	697	24.1
	Female	2459	75.5	796	24.5
Age	18 to 24 years	1003	70.5	419	29.5
	25 to 34 years	823	67.3	400	32.7
	35 to 44 years	751	71.9	294	28.1
	45 to 54 years	554	77.2	164	22.8
	55 years and over	1525	87.6	216	12.4
Have a child	No	2005	76.4	621	23.6
	Yes	2386	75.5	776	24.5
Employment status	Not working	1714	82.8	357	17.2
	Working	2871	71.9	1124	28.1
Working in key sector	No	3105	80.5	753	19.5
	Yes	1551	67.7	740	32.3
Highest educational or professional qualification	GCSE/vocational/A-level/No formal qualifications	3382	76.1	1060	23.9
	Degree or higher (Bachelors, Masters, PhD	1200	74.3	415	25.7
Region	Midlands	781	75.7	251	24.3
	South & East	1369	76.7	416	23.3
	North	1120	77.0	335	23.0
	London	701	70.1	299	29.9
	Wales, Scotland and Northern Ireland	685	78.1	192	21.9

We were happy with the table we produced above, however, we had to change the code to knit the table into a PDF for our final submission. We initially thought that we might have to use a different package such as Huxtable or flextable which we weren’t as familiar with. However, after some Googling we found out that we could continue using the kable Extra package and did not have to change the code that much. As seen in the chunk below, where we successfully reproduced Table 1 that could be knitted to a PDF. This was one of our biggest triumphs! Woohoo!

First, the $kbl(booktabs = T)$ function was chosen to enable the booktabs format for the table. Then, the ‘kable_paper()’ function is used to style this table and choose a theme to look like Smith et al.’s (2020) Table 1. Next, the ‘latex_options’ function is used to allow this table to be produced so that it can be knitted to PDF and formatting options are inserted in the brackets as in the chunk above. Then, the ‘add_header_above()’ function is used to add headings above the first row and then on top of the first headers created in the previous function, as seen below.

T1_finalT <- T1_bind_2_woevercovid_renamed_rearrange %>%
  kbl(booktabs = T) %>%
  
  kable_paper(
    latex_options = c(
      "striped", # adds striped lines    
      "hover", # highlight the hovered row
      "condensed", # makes the row height slightly shorter
      "hold_position", # puts the table into position
      "scale_down")) %>% # makes the table fit on the page

  add_header_above(c(
    " " = 2,
    "Think have not had COVID-19 n = 4656" = 2, 
    "Think have had COVID-19 n = 1493" = 2)) %>%
  
  add_header_above(c(
    " " = 2,
    "Had COVID-19" = 4))

T1_finalT

		Had COVID-19
		Think have not had COVID-19 n = 4656		Think have had COVID-19 n = 1493
Characteristics	Level	n(NotHad)	%(NotHad)	n(Had)	%(Had)
Gender	Male	2197	75.9	697	24.1
	Female	2459	75.5	796	24.5
Age	18 to 24 years	1003	70.5	419	29.5
	25 to 34 years	823	67.3	400	32.7
	35 to 44 years	751	71.9	294	28.1
	45 to 54 years	554	77.2	164	22.8
	55 years and over	1525	87.6	216	12.4
Have a child	No	2005	76.4	621	23.6
	Yes	2386	75.5	776	24.5
Employment status	Not working	1714	82.8	357	17.2
	Working	2871	71.9	1124	28.1
Working in key sector	No	3105	80.5	753	19.5
	Yes	1551	67.7	740	32.3
Highest educational or professional qualification	GCSE/vocational/A-level/No formal qualifications	3382	76.1	1060	23.9
	Degree or higher (Bachelors, Masters, PhD	1200	74.3	415	25.7
Region	Midlands	781	75.7	251	24.3
	South & East	1369	76.7	416	23.3
	North	1120	77.0	335	23.0
	London	701	70.1	299	29.9
	Wales, Scotland and Northern Ireland	685	78.1	192	21.9

Finishing this table was a great triumph for our group and what we learnt along the way greatly helped in producing the other figures later on! We were beginning to learn that while troubleshooting when learning R can be frustrating, it is actually quite satisfying when you finally get the code to run and produce a beautiful table!

Table 2

Next we aimed to reproduce Table 2 from Smith et al.’s (2020) paper. First, we selected the relevant variables that are in Table 2 by using the $select()$ function from the dplyr package. Next, we used the ‘group_by()’ function from dplyr to group these into subsections according to the ‘ever_covid’ variable i.e. participants who believed that they had previously had COVID-9 and those who didn’t.

dataT2 <- data %>%
  select(ever_covid,
         immunity,
         go_out_total,
         worry,
         risk_self,
         risk_other
         ) %>%
  group_by(ever_covid)

Next we use the $summarise()$ function from the dplyr package to create a new data frame. Inserting “mean(immunity)” and “sd(immunity)” within this function calculates the mean and standard deviation for the immunity variable, hence why ‘immunity’ is in the brackets. The “Mean =” and “SD =” before this names them. The we check whether the values produced are the same as those in Smith et al.’s paper by using the last line of code, ‘T2_immun’, and they are, meaning we can continue!

T2_immun <- dataT2 %>%
  summarise(
    Mean = mean(immunity), 
    SD   = sd(immunity) 
  )
T2_immun

## # A tibble: 2 × 3
##                           ever_covid  Mean    SD
##                            <dbl+lbl> <dbl> <dbl>
## 1 0 [Think have not had coronavirus]  2.38  1.01
## 2 1 [Think have had coronavirus]      3.33  1.00

Now we repeated the process seen in the chunk above for all of the other variables in Table 2 using equivalent code. That is, all we have to do here was copy the same code and change the variable names, as seen below. Refer to the comments above.

T2_goout <- dataT2 %>%
  summarise(
    Mean = mean(go_out_total),
    SD = sd(go_out_total)
  )

T2_worry <- dataT2 %>%
  summarise(
    Mean = mean(worry),
    SD = sd(worry)
  )

T2_riskself <- dataT2 %>%
  summarise(
    Mean = mean(risk_self),
    SD = sd(risk_self)
  )

T2_riskother <- dataT2 %>%
  summarise(
    Mean = mean(risk_other),
    SD = sd(risk_self)
  )

Next, we saved the individual values of each variable as a new entity to produce numeric variables. The code below uses the same “<-’ function as when we did this in table 1. The number in the square bracket represents which row you are referring to, nc = no COVID-19 (do not believe they have had COVID-19) and yc = yes COVID-19 (believe they have had COVID-19) and SD = standard deviation.

Here we learnt that R studio rounds .5 values down instead of up. We Googled why and found out that this is to prevent researchers from being biased towards over-estimating and inflating values.

# Immunity
immun_mean_nc <- T2_immun$Mean[1] 
immun_sd_nc   <- T2_immun$SD[1]
immun_mean_yc <- T2_immun$Mean[2]
immun_sd_yc   <- T2_immun$SD[2] 

# Going Out Total
goout_mean_nc <- T2_goout$Mean[1]
goout_sd_nc   <- T2_goout$SD[1]
goout_mean_yc <- T2_goout$Mean[2]
goout_sd_yc   <- T2_goout$SD[2]

# Worry
worry_mean_nc <- T2_worry$Mean[1]
worry_sd_nc   <- T2_worry$SD[1]
worry_mean_yc <- T2_worry$Mean[2]
worry_sd_yc   <- T2_worry$SD[2]

# Risk to Oneself
self_mean_nc <- T2_riskself$Mean[1]
self_sd_nc   <- T2_riskself$SD[1]
self_mean_yc <- T2_riskself$Mean[2]
self_sd_yc   <- T2_riskself$SD[2]

# For Risk to Others
other_mean_nc <- T2_riskother$Mean[1]
other_sd_nc   <- T2_riskother$SD[1]
other_mean_yc <- T2_riskother$Mean[2]
other_sd_yc   <- T2_riskother$SD[2]

Next, we created the tibble using the $tibble()$ function from the tibble package that combines the numerical values for the “Participant Characteristics” variables in Table 2 using the $c()$ function from the base package. Next, the levels of these variables for the new table are produced by using the $c()$ function and these are assigned the name ‘Level’ by the $=$ function, as seen below. Then, the numeric results produced are rounded to 2 decimal places using the ‘mutate_if()’ function and inserting “is.numeric, ~round(.,2)” in the brackets.

T2_final <- tibble(
  "Participant Characteristics" = c(
    "I think I have some immunity to COVID-19",
    "Total out-of-home activity in the last seven days",
    "Worry about COVID-19", 
    "Perceived risk of COVID-19 to oneself", 
    "Perceived risk of COVID-19 to people in the UK"),
  Level = c(
    "1 = strongly disagree to 5 = strongly agree",
    "Range = 0 to 42",
    "1 = not at all worried to 5 = extremely worried",
    "1 = no risk at all to 4 = major risk",
    "1 = no risk at all to 4 = major risk"),
  "Mean(Had not)" = c(
    immun_mean_nc, 
    goout_mean_nc, 
    worry_mean_nc, 
    self_mean_nc, 
    other_mean_nc),
  "SD(Had not)" = c(
    immun_sd_nc,
    goout_sd_nc, 
    worry_sd_nc, 
    self_sd_nc, 
    other_sd_nc),
  "Mean(Had)" = c(
    immun_mean_yc, 
    goout_mean_yc, 
    worry_mean_yc, 
    self_mean_yc, 
    other_mean_yc),
  "SD(Had)" =  c(
    immun_sd_yc, 
    goout_sd_yc, 
    worry_sd_yc, 
    self_sd_yc, 
    other_sd_yc)) %>%
  mutate_if(is.numeric, ~round(.,2))

kbl(T2_final)

Participant Characteristics	Level	Mean(Had not)	SD(Had not)	Mean(Had)	SD(Had)
I think I have some immunity to COVID-19	1 = strongly disagree to 5 = strongly agree	2.38	1.01	3.33	1.00
Total out-of-home activity in the last seven days	Range = 0 to 42	6.69	5.63	9.35	7.69
Worry about COVID-19	1 = not at all worried to 5 = extremely worried	3.59	1.01	3.38	1.12
Perceived risk of COVID-19 to oneself	1 = no risk at all to 4 = major risk	2.81	0.76	2.81	0.76
Perceived risk of COVID-19 to people in the UK	1 = no risk at all to 4 = major risk	3.39	0.76	3.30	0.76

Then, we use the kableExtra package to format the table. First, the ‘kable_styling()’ function is used to style this table and choose options to look like Smith et al.’s (2020) Table 1. Next, the ‘bootstrap_options’ function is used to allow this table to be produced so that it can be knitted to a HTML and formatting options are inserted in the brackets as in the chunk above. Then, the ‘add_header_above()’ function is used to add headings above the first row. For extra information about the functions used in the chunk below, refer to table 1 above. To print out this table I used the $print()$ function from the base package and put the name of the dataframe ‘T2_final_kbl’ in the brackets.

T2_final_kbl <- kbl(T2_final) %>%
  kable_styling(
    bootstrap_options = c(
      "striped", # adds striped lines    
      "hover", # highlight the hovered row
      "condensed", # makes the row height slightly shorter
      "hold_position", # puts the table into position
      "scale_down")) %>% # makes the table fit on the page
  
  add_header_above(c(
    " " = 2, 
    "Think have not had COVID-19 n = 4656" = 2, 
    "Think have had COVID-19 n = 1493" = 2)) 

T1_final_fail

		Had COVID-19
		Think have not had COVID-19 n = 4656		Think have had COVID-19 n = 1493
Characteristics	Level	n(NotHad)	%(NotHad)	n(Had)	%(Had)
Gender	Male	2197	75.9	697	24.1
	Female	2459	75.5	796	24.5
Age	18 to 24 years	1003	70.5	419	29.5
	25 to 34 years	823	67.3	400	32.7
	35 to 44 years	751	71.9	294	28.1
	45 to 54 years	554	77.2	164	22.8
	55 years and over	1525	87.6	216	12.4
Have a child	No	2005	76.4	621	23.6
	Yes	2386	75.5	776	24.5
Employment status	Not working	1714	82.8	357	17.2
	Working	2871	71.9	1124	28.1
Working in key sector	No	3105	80.5	753	19.5
	Yes	1551	67.7	740	32.3
Highest educational or professional qualification	GCSE/vocational/A-level/No formal qualifications	3382	76.1	1060	23.9
	Degree or higher (Bachelors, Masters, PhD	1200	74.3	415	25.7
Region	Midlands	781	75.7	251	24.3
	South & East	1369	76.7	416	23.3
	North	1120	77.0	335	23.0
	London	701	70.1	299	29.9
	Wales, Scotland and Northern Ireland	685	78.1	192	21.9

Table 3

Our next goal was to reproduce Table 3 from Smith et al’s (2020) paper. In the same way as the previous two tables, we first filtered the dataset to include the variables that are in Table 3 using the $select()$ function from the dplyr package. Next, we used the the ‘group_by()’ function from the dplyr package to group the data into subsections according whether participants believed that they had previously had COVID-9 or not (i.e. the ‘ever_covid’ variable). The last line of code, $dataT3$, allows us to look at what this chunk has produced in a table to see if it has done what we wanted and it has!

dataT3 <- data %>%
  select(
    ever_covid,
    adhere_groceries:identify_symptoms
  ) %>%
  group_by(ever_covid)
dataT3

## # A tibble: 6,149 × 5
## # Groups:   ever_covid [2]
##         ever_covid adhere_groceries adhere_other adhere_friends identify_sympto…
##          <dbl+lbl>        <dbl+lbl>    <dbl+lbl>      <dbl+lbl>        <dbl+lbl>
##  1 0 [Think have … 1 [Once or less] 1 [Reported… 1 [Reported n… 1 [Identified c…
##  2 0 [Think have … 1 [Once or less] 1 [Reported… 1 [Reported n… 1 [Identified c…
##  3 0 [Think have … 1 [Once or less] 1 [Reported… 1 [Reported n… 1 [Identified c…
##  4 0 [Think have … 1 [Once or less] 1 [Reported… 1 [Reported n… 1 [Identified c…
##  5 0 [Think have … 1 [Once or less] 1 [Reported… 1 [Reported n… 0 [Did not iden…
##  6 0 [Think have … 1 [Once or less] 0 [Reported… 1 [Reported n… 0 [Did not iden…
##  7 0 [Think have … 0 [Twice or mor… 0 [Reported… 1 [Reported n… 1 [Identified c…
##  8 0 [Think have … 1 [Once or less] 1 [Reported… 1 [Reported n… 1 [Identified c…
##  9 0 [Think have … 1 [Once or less] 1 [Reported… 1 [Reported n… 0 [Did not iden…
## 10 0 [Think have … 1 [Once or less] 1 [Reported… 1 [Reported n… 1 [Identified c…
## # … with 6,139 more rows

Now we can use the filtered dataset that we created in the chunk above to find the descriptive statistics for each variable.

We will start with the ‘groceries’ variable in the chunk below.

First, we find the number of observations for participants by using the $count()$ function and inserting the variable name, “adhere_groceries” in the brackets.

We chose the following functions, as the participants were already split into two groups according whether participants believed that they had previously had COVID-9 or not (the “ever_covid” variable) in the previous chunk. However, the “ever_covid” variable is on the left of Table 3 and the other variables at the top, which is opposite to Table 1 and Table 2. Due to this we used the ‘pivot_longer()’ function to put the “adhere_groceries” variable in the brackets which tells R to put this variable into the table instead of being a column label. This creates four distinct groups after also being separated by the factor “adhere_groceries”. The ‘names_to’ argument tells the new column to be created from the “characteristics” through the $=$ function. Then, the ‘values_to’ argument specifies the name of the column to create from the data stored in “factor” using the $=$ function. This is separated by adhere_groceries in these brackets using a comma, as seen below.

Then, to find the proportion of participants each group we used the $mutate$ function and put “percent = n/sum(n) * 100” inside the brackets to tell R to calculate these percentages.

Next, these percentages are rounded to 1 decimal place using the ‘mutate_if’ function with “is.numeric, ~round(.,1)” inside the brackets directing R to round these percentages. The percentage is then placed inside brackets using the $mutate$ function and putting “percent = paste0(”(“,percent,”)“))” inside the brackets.

The columns ‘n’ and percentages are then combined to form a new column using the $unite$ function from tidyr package. The grouping is then removed using the $ungroup()$ function just to prevent the possibility of the functions we used previously changing the dataset where we don’t want it to.

To move the levels from the “adhere_groceries” variable from row values to column labels I used the ‘pivot_wider()’ function to widen the data by increasing the number of columns and decreasing the number of rows to match Table 3. In this line, R is told that the output column “names_from” come from ‘factor’ and the cell values “values_from” come from ‘n(%)“)’ using the $=$ function.

Next, we select the values needed for Table 3 (i.e. n and %) using the $select()$ function. The ‘adhere_groceries’ variable’s the binary outcome ‘0’ refers to “on two or more days in the last week”, and ‘1’ refers to “on one or fewer days in the last week”. Thus, when using the $select()$ function, 1 goes before 0 to match Smith et al.’s (2020) Table 3 format where “on one or fewer days…” is before “one two or more days…”.

T3_groceries <- dataT3 %>%
  count(adhere_groceries) %>%
  pivot_longer(adhere_groceries,
               names_to = "characteristics",
               values_to = "factor") %>%
  mutate(percent = n/sum(n) * 100) %>%
  mutate_if(is.numeric, ~round(.,1)) %>% 
  mutate(percent = paste0("(",percent,")")) %>%
  unite('n(%)',c(n, percent), sep = " ") %>%
  ungroup() %>%
  pivot_wider(
    names_from = "factor",
    values_from = "n(%)") %>%
  select(`1`, `0`)

## `mutate_if()` ignored the following grouping variables:
## • Column `ever_covid`

Next, we repeat the same coding process using the same functions used in the chunk above for the other variables as we did for the ‘adhere_groceries’ variable above. The chunk below uses the same code, although changes the variable names to find the statistics for the other variable. For instance, replace where ‘adhere_groceries’ is in the code to say ‘adhere_other’ to find the statistics for the ‘adhere_other’ variable. Refer to the comment above for an explanation of the functions used, what they are doing and how they were chosen.

When producing these descriptives, we noticed that for the ‘identify_symptoms’ variable there is not data for all 6149 participants. Upon looking into this we also realised that Smith et al. (2020) made a mistake and switched these values around in their first publication that we were assigned to reproduce. We did some Googling and found out that this was corrected in a later publication. As our group aimed to reproduce the correct statistics, we inserted an extra line of code using the $rename("1" = '0', "0" = '1')$ function to swap the values 0 and 1 around to produce the correct result, in line with open science. This was a key example that threw us into why the replication crisis in psychology may be occurring. We learnt that researchers double checking the data that they upload is key to increasing the reproducibility of research.

# Descriptives for Other
T3_other <- dataT3 %>%
  count(adhere_other) %>% 
  pivot_longer(adhere_other,
               names_to = "characteristics",
               values_to = "factor") %>%
  mutate(percent = n/sum(n) * 100) %>%
  mutate_if(is.numeric, ~round(.,1)) %>%
  mutate(percent = paste0("(",percent,")")) %>% 
  unite('n(%)',c(n, percent), sep = " ") %>%
  ungroup() %>%
  pivot_wider(
    names_from = "factor",
    values_from = c("n(%)")) %>%
  select(`1`, `0`)

## `mutate_if()` ignored the following grouping variables:
## • Column `ever_covid`

# Descriptives for Friends
T3_friends <- dataT3 %>%
  count(adhere_friends) %>%  
  pivot_longer(adhere_friends,
               names_to = "characteristics",
               values_to = "factor") %>%
  mutate(percent = n/sum(n) * 100) %>%
  mutate_if(is.numeric, ~round(.,1)) %>%
  mutate(percent = paste0("(",percent,")")) %>%  
  unite('n(%)',c(n, percent), sep = " ") %>%
  ungroup() %>%
  pivot_wider(
    names_from = "factor",
    values_from = "n(%)") %>%
  select(`1`, `0`)

## `mutate_if()` ignored the following grouping variables:
## • Column `ever_covid`

# Descriptives for Symptoms ~ n does not = 6149 
T3_symptoms <- dataT3 %>%
  count(identify_symptoms) %>% 
  pivot_longer(identify_symptoms,
               names_to = "characteristics",
               values_to = "factor") %>%
  mutate(percent = n/sum(n) * 100) %>% 
  mutate_if(is.numeric, ~round(.,1)) %>%
  mutate(percent = paste0("(",percent,")")) %>% 
  unite('n(%)',c(n, percent), sep = " ") %>%
  ungroup() %>%
  pivot_wider(
    names_from = "factor",
    values_from = "n(%)") %>%
  rename("1" = '0', #Extra step for this variable to correct authors mistake
         "0" = '1') %>%              
  select('1', '0')

## `mutate_if()` ignored the following grouping variables:
## • Column `ever_covid`

Next, we used the numeric descriptives we produced in the chunk above and combined these into a single data frame using the ‘bind_rows’ function from dplyr and put the variables needed to create Table 3 in the brackets to bind their rows.

To check out the newly made ‘T3_bind’ dataframe we used the $as.data.frame$ function from the base package to organise the tibble rows into a table of rows and columns.

Then, in the last line we check if the output we produced matches the authors and it does for the most part… However, we observed that the last couple of rows of values does not…

When completing this table, our group chose a divide and conquer strategy and all attempted to produce this table individually then would support each other when we ran into issues. All four of us came to our Thursday meeting with the same issue, that being the last rows of data from the below output not matching Smith et al.’s (2020). However, it was in this meeting that we realised upon comparing our code and output with each other that our numbers all matched up, and it was actually the authors’ that was different to ours! We discussed this and came up with the hypothesis that the authors must have had an exclusion criteria that they did not report, as this also explains why the total participant count for the ‘identify_symptoms’ variable is lower than all of the other variables. We are unsure whether Smith et al. (2020) simply forgot to include details about their exclusion criteria or whether something shady was going on here. However, this experience taught us the importance of authors providing details in their paper, such as exclusion criteria, to make results reproducible.

T3_bind <- bind_rows(T3_groceries,
                     T3_other,
                     T3_friends,
                     T3_symptoms)

T3_bind <- as.data.frame(T3_bind)

T3_bind

##             1           0
## 1 2955 (63.5) 1701 (36.5)
## 2  805 (53.9)  688 (46.1)
## 3 3500 (75.2) 1156 (24.8)
## 4  816 (54.7)  677 (45.3)
## 5 4200 (90.2)   456 (9.8)
## 6 1071 (71.7)  422 (28.3)
## 7 1729 (37.1) 2927 (62.9)
## 8  788 (52.8)  705 (47.2)

Next, we created a dataframe using the $tibble()$ function from the tibble package that combines the numerical values for the “Think have had COVID-19?” and “Self-reported behaviour n(%)” variables in Table 3 using the $c()$ function. The numbers in the square brackets refers to the data from the ‘T3_bind’ table above, with the format ([row, column]) to indicate the numerical values that correspond to each level of self-reported behaviour.

T3_table <- tibble(
  "Think have had COVID-19?" = c(" ", "No", "Yes",
                                 " ", "No", "Yes",
                                 " ", "No", "Yes",
                                 " ", "No", "Yes"),
  "Self-reported behaviour n(%)" = c("On one or fewer days in the last week n = 3760",
                                     T3_bind[1,1], T3_bind[2,1],
                                     "Not at all in the last week n = 4316",
                                     T3_bind[3,1], T3_bind[4,1],
                                     "Not at all in the last week n = 5271",
                                     T3_bind[5,1], T3_bind[6,1],
                                     "Did not correctly identify common symptoms n = 2517",
                                     T3_bind[7,1], T3_bind[8,1]),
  " "                           = c("On two or more days in the last week n = 2389",
                                     T3_bind[1,2], T3_bind[2,2],
                                     "On one or more days in the last week n = 1833",
                                     T3_bind[3,2], T3_bind[4,2],
                                     "One one or more days in the last week n = 878",
                                     T3_bind[5,2], T3_bind[6,2],
                                     "Correctly identified common symptoms n = 3632",
                                     T3_bind[7,2], T3_bind[8,2]))

Then, we use the kableExtra package to format the table by grouping the rows together.

Then, the ‘kable_styling()’ function is used to style this table and choose options to look like Smith et al.’s (2020) Table 3, as seen in the chunk.

Next, the ‘bootstrap_options’ function is used to allow this table to be produced so that it can be knitted to HTML, as in our workshop Kate said that our tables will be marked using HTML.

Then, the ‘pack_rows()’ function is used to group the rows in the table together under different labels. The label heading is first thing in the brackets in quotations marks, for instance the first label is “Shopping for groceries/pharmacy”. The numbers (e.g. “1, 3” in the first example below) refer to the start and end rows for that heading. ‘label_row_css =’ is defining the Cascading Style Sheets (CSS) and telling R what colours to use in the table.

T3_kbl <- kbl(T3_table) %>%
  kable_styling(
    bootstrap_options = c(
      "striped", # adds striped lines    
      "hover", # highlight the hovered row
      "condensed", # makes the row height slightly shorter
      "hold_position", # puts the table into position
      "scale_down")) %>% # makes the table fit on the page
 
  pack_rows("Shopping for groceries/pharmacy", 1, 3,
            label_row_css = "background-color: #666; color: #fff;") %>%
  pack_rows("Shopping for items other than groceries/pharmacy", 4, 6,
            label_row_css = "background-color: #666; color: #fff;") %>%
  pack_rows("Meeting up with friends or family", 7, 9,
            label_row_css = "background-color: #666; color: #fff;") %>%
  pack_rows("Correct identification of cough and ferver", 10, 11,
            label_row_css = "background-color: #666; color: #fff;") 
T3_kbl

Think have had COVID-19?	Self-reported behaviour n(%)
Shopping for groceries/pharmacy
	On one or fewer days in the last week n = 3760	On two or more days in the last week n = 2389
No	2955 (63.5)	1701 (36.5)
Yes	805 (53.9)	688 (46.1)
Shopping for items other than groceries/pharmacy
	Not at all in the last week n = 4316	On one or more days in the last week n = 1833
No	3500 (75.2)	1156 (24.8)
Yes	816 (54.7)	677 (45.3)
Meeting up with friends or family
	Not at all in the last week n = 5271	One one or more days in the last week n = 878
No	4200 (90.2)	456 (9.8)
Yes	1071 (71.7)	422 (28.3)
Correct identification of cough and ferver
	Did not correctly identify common symptoms n = 2517	Correctly identified common symptoms n = 3632
No	1729 (37.1)	2927 (62.9)
Yes	788 (52.8)	705 (47.2)

Figure 1

Producing this graph was the hardest of all the figures we had to reproduce, as we had gotten used to producing tables and this was our first go at producing a graph which challenged us.

First, we wanted to find the numerical values of the descriptive statistics. To lengthen the data and make the tibble longer so that we can see everything we used the ‘pivot_longer()’ function from tidyr and put the “immunity: identify_symptoms” variable in the brackets which tells R to put this variable into the table instead of being a column label.

The ‘names_to’ argument tells the new column to be created from the “characteristics” through the $=$ function and then, the ‘values_to’ argument specifys the name of the column to create from the data stored in “level” using the $=$ function. This is separated by adhere_groceries in these brackets using a comma, as seen below.

The participants are then separated into sub-groups using the ‘group_by()’ function and inserting the groups in the brackets.

Next, we find out how many individual ‘treatment’ outcomes occurred using the $count()$ function. Then, we remove the previous grouping with the $ungroup()$ function so that we can then use the ‘group_by’ function to group the ‘ever_covid’ and ‘characteristics’ variables together, as Figure 1 is presented using these subgroups, as discussed previously.

Then, to find the proportion of participants each group we used the $mutate$ function and put “percent = n/sum(n) * 100” inside the brackets to tell R to calculate these percentages for these subgroups we created in the line above.

The grouping is then removed using the $ungroup()$ function just to prevent the possibility of the functions we used previously changing the dataset where we don’t want it to.

To check out the newly made ‘data_f’ dataframe we used the $as.data.frame$ function from the base package to organise the tibble rows into a table of rows and columns.

Finally, we view the first part of the table using the $head()$ function that displays the first rows in the ‘data_F’ data frame. The output we produced in this table provides us with the numeric statistics for the first rows in the ‘data_F’ data frame that we used to check it aligned with the graph we were attempting to reproduce.

data_F <- data %>%
  pivot_longer( 
    immunity:identify_symptoms,
    names_to = "characteristics", 
    values_to = "level") %>%
  group_by(ever_covid, characteristics, level) %>%
  count() %>%
  ungroup() %>%
  group_by(ever_covid, characteristics) %>%
  mutate(percentage = n/sum(n) * 100) %>%
  mutate_if(is.numeric, ~round(.,1)) %>%
  ungroup()

## `mutate_if()` ignored the following grouping variables:
## • Columns `ever_covid`, `characteristics`

data_F = as.data.frame(data_F)
head(data_F)

##   ever_covid  characteristics level    n percentage
## 1          0   adhere_friends     0  456        9.8
## 2          0   adhere_friends     1 4200       90.2
## 3          0 adhere_groceries     0 1701       36.5
## 4          0 adhere_groceries     1 2955       63.5
## 5          0     adhere_other     0 1156       24.8
## 6          0     adhere_other     1 3500       75.2

The above output we produced in the table above includes the numerical statistics for all of the factors in the data, however the graph we are attempting to reproduce only included observations from specific levels of each factor. For example, the authors only included participants who ‘strongly agree’ that they have immunity in the first bar, and did not mention the other levels of perceived immunity.

Therefore, we filtered out the irrelevant observations not needed for the graph using $filter()$ function from the base package to specify the levels we wanted to by putting the characteristic, such as immunity in quotations marks, then using the format ‘& level == INSERT NUMERICAL FACTOR LEVEL’, whereby the number after ‘==’ is the level of the factor we want. This is followed by the logical operator ‘$|$’ at the end of each line that can be thought of as an ‘or’ statement. For instance, the first line filters participants that answered “strongly agree” for the immunity variable.

Note that the ‘go_out_total’ characteristic is $>=8$ which means include this variable if the participants went out more than or equal to 8 times.

figure_filtered <- data_F %>%
  filter(characteristics == "immunity"          & level == 5 |
         characteristics == "adhere_groceries"  & level == 0 |
         characteristics == "adhere_other"      & level == 0 |
         characteristics == "adhere_friends"    & level == 0 |
         characteristics == "go_out_total"      & level >= 8 | 
         characteristics == "worry"             & level == 1 |
         characteristics == "risk_self"         & level == 1 |
         characteristics == "risk_others"       & level == 1 |
         characteristics == "identify symptoms" & level == 0)

figure_filtered

##    ever_covid  characteristics level    n percentage
## 1           0   adhere_friends     0  456        9.8
## 2           0 adhere_groceries     0 1701       36.5
## 3           0     adhere_other     0 1156       24.8
## 4           0     go_out_total     8  404        8.7
## 5           0     go_out_total     9  303        6.5
## 6           0     go_out_total    10  189        4.1
## 7           0     go_out_total    11  147        3.2
## 8           0     go_out_total    12  108        2.3
## 9           0     go_out_total    13  114        2.4
## 10          0     go_out_total    14   90        1.9
## 11          0     go_out_total    15   87        1.9
## 12          0     go_out_total    16   55        1.2
## 13          0     go_out_total    17   34        0.7
## 14          0     go_out_total    18   25        0.5
## 15          0     go_out_total    19   25        0.5
## 16          0     go_out_total    20   23        0.5
## 17          0     go_out_total    21   27        0.6
## 18          0     go_out_total    22   20        0.4
## 19          0     go_out_total    23   16        0.3
## 20          0     go_out_total    24   13        0.3
## 21          0     go_out_total    25   13        0.3
## 22          0     go_out_total    26   15        0.3
## 23          0     go_out_total    27    9        0.2
## 24          0     go_out_total    28    9        0.2
## 25          0     go_out_total    29    3        0.1
## 26          0     go_out_total    30    6        0.1
## 27          0     go_out_total    31    6        0.1
## 28          0     go_out_total    32    1        0.0
## 29          0     go_out_total    33    1        0.0
## 30          0     go_out_total    34    2        0.0
## 31          0     go_out_total    35    1        0.0
## 32          0     go_out_total    41    1        0.0
## 33          0     go_out_total    42    4        0.1
## 34          0         immunity     5  118        2.5
## 35          0        risk_self     1  148        3.2
## 36          0            worry     1  116        2.5
## 37          1   adhere_friends     0  422       28.3
## 38          1 adhere_groceries     0  688       46.1
## 39          1     adhere_other     0  677       45.3
## 40          1     go_out_total     8  112        7.5
## 41          1     go_out_total     9  100        6.7
## 42          1     go_out_total    10   56        3.8
## 43          1     go_out_total    11   57        3.8
## 44          1     go_out_total    12   46        3.1
## 45          1     go_out_total    13   39        2.6
## 46          1     go_out_total    14   41        2.7
## 47          1     go_out_total    15   44        2.9
## 48          1     go_out_total    16   32        2.1
## 49          1     go_out_total    17   29        1.9
## 50          1     go_out_total    18   22        1.5
## 51          1     go_out_total    19   23        1.5
## 52          1     go_out_total    20   21        1.4
## 53          1     go_out_total    21   19        1.3
## 54          1     go_out_total    22   15        1.0
## 55          1     go_out_total    23   12        0.8
## 56          1     go_out_total    24   14        0.9
## 57          1     go_out_total    25   11        0.7
## 58          1     go_out_total    26   15        1.0
## 59          1     go_out_total    27   12        0.8
## 60          1     go_out_total    28    5        0.3
## 61          1     go_out_total    29    9        0.6
## 62          1     go_out_total    30    5        0.3
## 63          1     go_out_total    31    3        0.2
## 64          1     go_out_total    32    3        0.2
## 65          1     go_out_total    34    2        0.1
## 66          1     go_out_total    35    4        0.3
## 67          1     go_out_total    36    2        0.1
## 68          1     go_out_total    37    1        0.1
## 69          1     go_out_total    38    1        0.1
## 70          1     go_out_total    40    1        0.1
## 71          1     go_out_total    42    8        0.5
## 72          1         immunity     5  181       12.1
## 73          1        risk_self     1   58        3.9
## 74          1            worry     1   90        6.0

We realised that it will be more logical and concise to split the ‘figure_filtered’ dataframe into smaller individual data frames with their own variables, and then combine them again. In the chunk below we tried this for the ‘immunity’ variable by first filtering the rows to only include those with the immunity characteristic using the $filter()$ function.

Then, we grouped the variables according to whether or not the participants believed that they had previously had COVID-19, by putting the “ever_covid” variable in the brackets.

Next, we selected the column ‘n’ and ‘percentage’ using the $select()$ function and putting “4, 5” in the brackets to indicate that we are selecting columns 4 and 5.

To summarise all of the columns for every variable we use the ‘summarise_all()’ function with “.funs = sum” in the brackets which tells R to summarise everything into a single value. For instance, this is important in the go out total variable in the next chunk down, as it puts people into a category of having gone out of the house 7+ times by telling the data to group everyone together whether they went out 8 times or 42 times.

Then, we add a new column with the heading “characteristics” and the variable “immunity” as its entries after the first column.

f_immun <- figure_filtered %>%
  filter(characteristics == "immunity") %>%
  group_by(ever_covid) %>%
  select(4,5) %>% 
  summarise_all(.funs = sum) %>%
  add_column(., characteristics = "immunity", .after = 1)

## Adding missing grouping variables: `ever_covid`

f_immun

## # A tibble: 2 × 4
##                           ever_covid characteristics     n percentage
##                            <dbl+lbl> <chr>           <dbl>      <dbl>
## 1 0 [Think have not had coronavirus] immunity          118        2.5
## 2 1 [Think have had coronavirus]     immunity          181       12.1

As always, used equivalent code as above and changed names of the variables to apply it to all of the other variables.

# adhere_groceries
f_groceries<- figure_filtered %>%
  filter(characteristics == "adhere_groceries") %>%
  group_by(ever_covid) %>%
  select(4,5) %>%           
  summarise_all(.funs = sum) %>%
  add_column(., characteristics = "adhere_groceries", .after = 1)

## Adding missing grouping variables: `ever_covid`

# adhere_other
f_other <- figure_filtered %>%
  filter(characteristics == "adhere_other") %>%
  group_by(ever_covid) %>%
  select(4,5) %>%         
  summarise_all(.funs = sum) %>%
  add_column(., characteristics = "adhere_other", .after = 1)

## Adding missing grouping variables: `ever_covid`

# adhere_friends
f_friends <- figure_filtered %>%
  filter(characteristics == "adhere_friends") %>%
  group_by(ever_covid) %>%
  select(4,5) %>%
  summarise_all(.funs = sum) %>%
  add_column(., characteristics = "adhere_friends", .after = 1)

## Adding missing grouping variables: `ever_covid`

# go_out_total
f_go.out.total <- figure_filtered %>%
  filter(characteristics == "go_out_total") %>%
  group_by(ever_covid) %>%
  select(4,5) %>%
  summarise_all(.funs = sum) %>%
  add_column(., characteristics = "go_out_total", .after = 1)

## Adding missing grouping variables: `ever_covid`

# worry
f_worry <- figure_filtered %>%
  filter(characteristics == "worry") %>%
  group_by(ever_covid) %>%
  select(4,5) %>%               
  summarise_all(.funs = sum) %>%
  add_column(., characteristics = "worry", .after = 1)

## Adding missing grouping variables: `ever_covid`

# risk_self
f_risk.self <- figure_filtered %>%
  filter(characteristics == "risk_self") %>%
  group_by(ever_covid) %>%
  select(4,5) %>%           
  summarise_all(.funs = sum) %>%  
  add_column(., characteristics = "risk_self", .after = 1)

## Adding missing grouping variables: `ever_covid`

# risk_other
f_risk.other <- figure_filtered %>%
  filter(characteristics == "risk_other") %>%
  group_by(ever_covid) %>%
  select(4,5) %>%       
  summarise_all(.funs = sum) %>%  
  add_column(., characteristics = "risk_other", .after = 1)

## Adding missing grouping variables: `ever_covid`

#identify_symptoms
f_symptoms <- figure_filtered %>%
  filter(characteristics == "identify_symptoms") %>%
  group_by(ever_covid) %>%
  select(4,5) %>% 
  summarise_all(.funs = sum) %>%
  add_column(., characteristics = "identify_symptoms", .after = 1)

## Adding missing grouping variables: `ever_covid`

The next step is to bind the tibbles together row by row by stacking them vertically using the rbind() function from the base package to combines the variables listed in the brackets. We chose this function, as it allowed us to rearrange the variables so that they are in the same order as Smith et al.’s (2020) we were reproducing.

figure_bind <- rbind(f_immun,
                     f_groceries,
                     f_other,
                     f_friends,
                     f_go.out.total,
                     f_worry,
                     f_risk.self,
                     f_risk.other,
                     f_symptoms)

Now that we have successfully binded the rows, we must define the columns values as either factors (categorical data) or numerics (numeric data). To do this, we used the as.factor() function from the base package to coerce “figure_bind$ever_covid” and “figure_bind$characteristics” to become factors as they represent categorical data.

Then, we define the percentage column label as numeric using the as.numeric() function and inserting “figure_bind$percentage” in the brackets.

Next, we recoded the factors in the “ever_covid” and “characteristics” variables using the ‘recode_factor’ function from dplyr to change the order of the levels to match the labels in the figure we are trying to reproduce.

figure_bind$ever_covid <- as.factor(figure_bind$ever_covid)

figure_bind$characteristics <- as.factor(figure_bind$characteristics)

figure_bind$percentage <- as.numeric(figure_bind$percentage)

figure_bind$Ever_covid = recode_factor(
  figure_bind$ever_covid,
  "0" = "Think have not had COVID-19",
  "1" = "Think have had COVID-19")

figure_bind$characteristics = recode_factor(
  figure_bind$characteristics,
  "immunity" = "Strongly agree that they are immune",
  "adhere_groceries" = "Shopping for groceries/pharmacy for two or more items",
  "adhere_other" = "Shopping for items other than groceries/pharmacy",
  "adhere_friends" = "Met up with friends/family they do not live with",
  "go_out_total" = "Out-of-home activity, 8 or more outings",
  "worry" = "Not worried at all",
  "risk_self" = "Perceive no risk at all to themselves",
  "risk_other" = "Perceive no risk at all to others",
  "identify_symptoms" = "Did not identify common symptoms")

figure_bind = as.data.frame(figure_bind)

figure_bind

##    ever_covid                                       characteristics    n
## 1           0                   Strongly agree that they are immune  118
## 2           1                   Strongly agree that they are immune  181
## 3           0 Shopping for groceries/pharmacy for two or more items 1701
## 4           1 Shopping for groceries/pharmacy for two or more items  688
## 5           0      Shopping for items other than groceries/pharmacy 1156
## 6           1      Shopping for items other than groceries/pharmacy  677
## 7           0      Met up with friends/family they do not live with  456
## 8           1      Met up with friends/family they do not live with  422
## 9           0               Out-of-home activity, 8 or more outings 1751
## 10          1               Out-of-home activity, 8 or more outings  764
## 11          0                                    Not worried at all  116
## 12          1                                    Not worried at all   90
## 13          0                 Perceive no risk at all to themselves  148
## 14          1                 Perceive no risk at all to themselves   58
##    percentage                  Ever_covid
## 1         2.5 Think have not had COVID-19
## 2        12.1     Think have had COVID-19
## 3        36.5 Think have not had COVID-19
## 4        46.1     Think have had COVID-19
## 5        24.8 Think have not had COVID-19
## 6        45.3     Think have had COVID-19
## 7         9.8 Think have not had COVID-19
## 8        28.3     Think have had COVID-19
## 9        37.4 Think have not had COVID-19
## 10       50.9     Think have had COVID-19
## 11        2.5 Think have not had COVID-19
## 12        6.0     Think have had COVID-19
## 13        3.2 Think have not had COVID-19
## 14        3.9     Think have had COVID-19

Finally we can now make these percentages that we previously calculated into a bar graph using the following functions from the ggplot2 package!

First, we construct the initial plot using the $ggplot()$ function. Then, we add $aes()$ inside the previous brackets which defines the aesthetics, specifically asking the plot to include the x axis, y axis and fill and assigning these to the correct variables using the format: “$x = value$”. This is followed by ’+' which is telling R to add another component to the plot.

Next, we create the bar chart using the ‘geom_bar()’ function and inside these brackets include the arguments ‘stat = identity’ to tell ggplot I will provide the y values, ‘width =’ to set the bar width, and ’position = position_dodge’ to make the bars dodged side-to-side. Again this is followed by ’+' which is telling R to add another component to the plot.

Now it is time to set the lables for this graph using the $labs()$ function. Inside the brackets include the arguments (title = “text for the title”, x = “text for x axis”, y = “text for y axis”). Then once again ’+' tells R to add another component to the plot.

Then we set the scale for the y axis using the ‘scale_y_continuous()’ function and inserting the limits of the y axis, specifying values in between y-axis (from 0-60, every 10). This function was chosen as we Googled how to set the scale and this came up as the default scale for continuous y aesthetics in the ggplot2 package and when we tried it, it did what we wanted!

We then wanted to pick themes for the bar graph to look like Smith et al.’s (2020) figure 1, so we chose to use the ‘theme_bw()’ as it is dark-on-light and ‘theme_minimal()’ to take out any background annotations. Then, to create the grey colour scale like Smith et al. (2020) I used the ‘scale_fill_grey()’ function and specified in the brackets “start = .6, end = 0”.

Then I adjusted the text on the x axis to be horizontal using the $theme()$ function. Inside the brackets I tell R to rotate the axis text labels. For example, specifing the argument angle for the x axis text label by putting axis.text.x = element_text(angle = 63, hjust = 1.0). ‘element_blank()’ is then used to take away the grid panel for the x axis and ‘element_line()’ changes the size and colour of the y axis’s line’s. The legend is placed at the bottom using ‘legend.position = “bottom”’ and the legend title is removed using the code ‘legend.title = element_blank()’.

Then we view the bar graph we created using the $print()$ function that prints the output produced when “figure_final” is placed in the brackets.

figure_final <- figure_bind %>% 
  ggplot(aes(x = characteristics,  
             y = percentage,              
             fill = Ever_covid)) +

  geom_bar(stat = "identity",
           width = 0.6,
           position = position_dodge(0.7)) + 

  labs(
    title = "Psychological and behavioural outcomes as a consequence COVID-19",
    x = " ",
    y = "Percentage") +

  scale_y_continuous(
    limits = c(0, 60),   
    breaks = seq(0, 60, 10)) +
  
  theme_bw() +
  theme_minimal() +
  scale_fill_grey(start = .6, end = 0) +
 
   theme(
    axis.text.x = element_text(angle = 30, hjust = 1),                 
    panel.grid.major.x = element_blank(),
    panel.grid.major.y = element_line(size=.5, color="gray"),
    legend.position = "bottom",
    legend.title = element_blank())

print(figure_final)

We had trouble with the format of this figure once it was knitted and I was unsure of whether I was allowed to include a screenshot so I left it as is just to be safe. This was really disappointing, because we did figure out how to format the figure perfectly to look like Smith et al.’s (2020) paper which made it more frustrating that it changed when knitted.

Part 3: Exploratory Analysis

Load Packages

First I loaded the extra packages needed for exploratory analysis that I had not yet loaded at the start of Part 2.

library(ggplot2) # Provides features for visualizing data
library(ggeasy) # Helper functions that make formatting figures in ggplot2 easier
library(rstatix) # Performs basic statistical tests

## 
## Attaching package: 'rstatix'

## The following object is masked from 'package:stats':
## 
##     filter

library(RColorBrewer) # Creates nice looking color palettes for figures
library(scales) # Automatically determines breaks and labels for axes and legends

## 
## Attaching package: 'scales'

## The following object is masked from 'package:purrr':
## 
##     discard

## The following object is masked from 'package:readr':
## 
##     col_factor

library(forcats) # Tools to change the order of categorical variables in figures

Does adherence to the lockdown rule of not meeting up with friends and family outside your home differ according to age?

When initially reading this paper I wondered whether adherence to following the lockdown rule of not meeting up with friends or family differed according to participants’ age, as it could be the case that younger people engage in these behaviours more often due to the risk-taking tendencies associated with adolecence. Hence, I hypothesised that adolescent participant’s aged 18-24 break lockdown regulations more than any other age groups.

To investigate this hypothesis I created a new dataframe called EA_1 and used the assignment operator $< -$ to tell R to get the data from “data” which is what I called the raw data file at the beginning. I then used the pipe function (%>%) from the tidyverse package at the end of this first line to pipe it forward into the next function like saying “and then” when writing my code. These two functions are used frequently throughout my exploratory analysis, so for the sake of brevity I will not explain them every single time I use them in this section.

I grouped this data according to the age ranges of participants using the ‘group_by’ function from the dplyr package and inserting ‘age’ into the brackets.

Then, I used the $summarise()$ function from the dplyr package to create a new data frame with rows for the Mean and SD of participants scores in each age range. I calculated the mean for the ‘adhere_friends’ variable using the $mean()$ function from the base package and the standard deviation for the ‘adhere_friends’ variable was calculated using the $sd()$ function from the stats package. Including “Mean =” and “SD =” before these functions called them ‘Mean’ and ‘SD’ using the $=$ assignment operator, as this means we can use these names in the code for later chunks.

Then, I want to change the numerical values the age ranges were coded from that were in the OSF file (1-5) by recoding them to be the actual age ranges (18-24, 25-34, 35-44, 55+). I chose to use the $mutate()$ and ‘case_when’ functions from dplyr, as they allow me to recode these age ranges in a single instruction using the format: (age == 1 ~ “18 to 24 years”, age == 2 ~ “25 to 34 years” etc).

Finally, I checked whether the values produced in this chunk and created a table with the information I was after by using the last line of code, ‘EA_1’.

EA_1 <- data %>%
  group_by(age) %>% 
  summarise(Mean = mean(adhere_friends), SD = sd(adhere_friends)) %>%
  mutate(age = case_when(age == 1 ~ "18 to 24 years", 
                         age == 2 ~ "25 to 34 years", 
                         age == 3 ~ "35 to 44 years", 
                         age == 4 ~ "45 to 54 years", 
                         age == 5 ~ "55 years and over"))
  EA_1

## # A tibble: 5 × 3
##   age                Mean    SD
##   <chr>             <dbl> <dbl>
## 1 18 to 24 years    0.768 0.422
## 2 25 to 34 years    0.790 0.408
## 3 35 to 44 years    0.878 0.327
## 4 45 to 54 years    0.921 0.271
## 5 55 years and over 0.939 0.240

Now that we have the descriptive statistics produced in the chunk above, the bar graph can be made using the following functions from the ggplot2 package!

First, we construct the initial graph using the $ggplot()$ function. Then, we add $aes()$ inside the previous brackets which defines the aesthetics, specifically telling the plot which values to include in the x axis, y axis and fill. For instance, the correct values are assigned in the x axis by the section “$x = age$” inside the brackets which is telling ggplot to include values from the variable ‘age’ in the x axis, and so on for the y axis and fill. This is followed by ‘$+$’ which tells ggplot2 to add another component to the plot.

Next, we create the bar chart using the ‘geom_bar()’ function and inside these brackets include the arguments: ‘stat=identity’ to tell ggplot2 to override the default behavior as I will provide the y value. Again this is followed by ’+' which is telling R to add another component to the plot.

Then, we add the error bars into the graph using the ‘geom_errorbar()’ function. These error bars are formatted using the $aes()$ function that sets the position of the error bars using the code “(ymin = (Mean - SD)*100, ymax = (Mean + SD)*100)”, position = position_dodge(0.9)” and the width of the error bars using the code “(width = 0.2)”. Then once again ’+' tells R to add another component to the plot.

To set the labels for this graph I used the $labs()$ function and followed the format (title = “text for the title”, x = “text for x axis”, y = “text for y axis”). This is followed by ‘$+$’ which tells ggplot2 to add another component to the plot.

Then, I modified the graph title to be more centered using the ‘plot.title = element_text()’ function and inserting “vjust = 0.5, hjust = 0.5” into the brackets. i chose these values after having a play around to see what numbers made the title the most centered to the plot.

Then I viewed the bar graph I created using the $print()$ function that prints the output from the chunk produced when “EA_1_graph” is placed in the brackets.

I chose the codes used in this chunk, as I had successfully used them to reproduce figure 1 with my group in Part 2 and they worked when I tried using them for this graph!

EA_1_graph <- EA_1 %>%
  ggplot(aes(x = age, y = Mean, fill = age)) + 
  geom_bar(stat = "identity") + 
  geom_errorbar(aes(ymin = (Mean - SD), 
                    ymax = (Mean + SD)), 
                width = 0.2, 
                position = position_dodge(0.9)) + 
  
 labs(title = "Relationship Between Age and Adherence to Not Meeting Friends", 
      x = "Age Range", 
      y = "Proportion of adherence to lockdown rule (%)") + 
  
    theme(plot.title = element_text(vjust = 0.5, hjust = 0.6))

  print(EA_1_graph)

I was happy with this graph that I had produced, except the scale for the y axis was in decimal form. I asked my tutor Anita how to change this from decimal form and she suggested that I add “*100” to the Mean in the first bracket. I tried this and it made the error bars become tiny and not match the bars, so she suggested that I also add “*100” after the ”(Mean - SD)” in the ‘geom_errorbar()’ function and it worked to rescale the error bars correctly! I am very impressed by how she thought to do that and I aspire to code at that level one day! This troubleshooting process was very satisfying, as it was such a quick fix! The code below is the exact same as the chunk above, just with some added ”*100” throughout as seen below. Refer to the comment above for a full breakdown of each function/package, how it was chosen and what the code is doing.

EA_1_graph <- EA_1 %>%
  ggplot(aes(x = age, y = Mean*100, fill = age)) + 
  geom_bar(stat = "identity") + 
  geom_errorbar(aes(ymin = (Mean - SD)*100, 
                    ymax = (Mean + SD)*100), 
                width = 0.2, 
                position = position_dodge(0.9)) + 
  
 labs(title = "Relationship Between Age and Adherence to Not Meeting Friends Outside Household", 
      x = "Age Range", 
      y = "Proportion of adherence to lockdown rule (%)") + 
  
    theme(plot.title = element_text(vjust = 0.5, hjust = 0.2))

  print(EA_1_graph)

From this graph it is clear that adherence to the rule of not meeting up with friends and family outside of your home does differ according to age, such that the younger an individual is the less likely they were to adhere to this rule. Thus, the hypothesis that adolescents aged 18-24 break this rule more than any other age groups is correct, as the graph reveals that they have the lowest percentage of adherence to this rule. However, it is important to remember that this data is correlational and not a causational, thus we can not conclude that younger people engage in these behaviours more often due to their risk-taking tendencies as hypothesised. Nevertheless, this highlights the need for media coverage targeted at adolescents at the importance of obeying this lockdown rule to stop the spread of COVID-19 in the community and provides an interesting point for future research.

Does perceived immunity to COVID-19 differ according to individuals’ education level?

When initially reading this paper I wondered whether individuals’ education level has any correlation with their perceived immunity to COVID-19. I hypothesised that individuals with a degree or higher level of education are more likely to have knowledge about immunity, and in turn they may have a higher average perceived immunity if they think they have previously had COVID-19.

To investigate this hypothesis I created a new dataframe called EA_2 and used the assignment operator $< -$ to tell R to get the data from “data” which is what I called the raw data file at the beginning. I then used the pipe function (%>%) from the tidyverse package at the end of this first line to pipe it forward into the next function like saying “and then” when writing my code. These two functions are used frequently throughout this exploratory analysis, so for the sake of brevity I will not explain them every single time I use them in this section.

I grouped this data according to the ‘degree’ variable which refers to whether or not participants have a university degree using the ‘group_by’ function from the dplyr package and inserting ‘degree’ in the brackets to tell R to group the data by this variable.

Then, I used the $summarise()$ function from the dplyr package to create a new dataframe with the calculated mean of perceived immunity scores for participants in the ‘Degree or higher’ group and the ‘No formal qualifications’ group using the $mean()$ function from the base package. This value is rounded to 2 decimal places by inserting the $round()$ function in the brackets and ‘2’ to indicate 2 decimal places. Having ‘immunity =’ in the brackets uses the $=$ assignment operator to rename this dataframe with the means for perceived immunity of the two groups ‘immunity’. I will use this new name when referring to this dataframe when creating the graph in the next chunk below when using the ggplot2 package.

In the next line of code I am renaming the levels of the degree variable to be the their name instead of a numerical value using the $mutate()$ function from dplyr that modifies these variable names in the columns (i.e. “0” renamed “No formal qualifications”). I told R to change the values in degree variable coded as “0” to be called “No formal qualifications” and change the values coded as “1” to be called “Degree or higher” by using the ‘case_when()’, ‘==’, ‘~’ and ‘TRUE’ functions as seen in the brackets of the $mutate()$ function below.

Then I used the $filter()$ function to filter the rows to only include the degree variable so that data that is N/A for my exploratory question is not in the table.

Finally, I checked whether the values produced in this chunk created a table with the information I was after by using the $print()$ function and putting ‘EA_2’ in the brackets.

These functions were chosen, as I they are the most concise and effective way to do what I wanted to achieve, as I was following the KISS and DRY methods.

EA_2 <- data %>%
  group_by(degree) %>%
  summarise(immunity = round(mean(immunity), 2)) %>%
  mutate(degree = case_when(degree == 0 ~ "No formal qualifications", 
                            degree == 1 ~ "Degree or higher", TRUE ~ "NA")) %>%
  filter(degree != "NA") 
print(EA_2)

## # A tibble: 2 × 2
##   degree                   immunity
##   <chr>                       <dbl>
## 1 No formal qualifications     2.6 
## 2 Degree or higher             2.65

Now that we have the descriptive statistics produced in the chunk above, we can make a graph using some functions from the ggplot2 package!

First, we construct the initial graph using the $ggplot()$ function. Then, we add $aes()$ inside the previous brackets which defines the aesthetics by inserting “(x = degree, y = immunity)” inside the brackets which tells the plot to include values from the ‘degree’ variable in the x axis and values from the ‘immunity’ variable in the y axis. Then insert a ‘$+$’ which tells ggplot2 to add another component.

Next, we create the bars for this graph using the ‘geom_bar()’ function and inside these brackets include the arguments: ‘stat=identity’ to tell ggplot2 to override the default behavior as I have provided the y value. Then add another ‘+’ the same as the line above.

Then we set the scale limits for the y axis using the ‘scale_y_continuous()’ function and inserting the limits of the y axis. I specifed the limits by including “limits = c(0, 5)” in the brackets. I found this function on Stackoverflow when Googling “how to set the y axis scale?” and it worked previously so I chose to use it again here.

I set the titles using the $labs()$ function and inserting the text I wanted for each title in the brackets using the format (title = “text for the title”, x = “text for x axis”, y = “text for y axis”). Then another ‘$+$’ as above.

I then added the mean for the perceived immunity score for each group above the bars using the ‘geom_text()’ function and putting “aes(label = immunity)” inside the brackets which tells R where to get this value from. Next, I specified in the brackets “vjust = -0.9” to set the position of these labels to be above the bars.

Then I viewed the bar graph I created using the $print()$ function that prints the output from the chunk produced when “GRAPH_EA_2” is placed in the brackets.

GRAPH_EA_2_attempt1 <- EA_2 %>%
  ggplot(aes(x = degree, y = immunity)) +
  geom_bar(stat = "identity") +
  scale_y_continuous(limits = c(0, 5)) +
  labs(x = "Level of Education",
       y = "Average Perceived Immunity to COVID-19 (0-5)",
       title = "Perceived Immunity to COVID-19 According to Level of Education") +
  geom_text(aes(label = immunity), vjust = -1) 
  
  print(GRAPH_EA_2_attempt1)

While I was happy that I could do produce this figure relatively quickly, with concise code and no errors I felt as though this bar graph does not answer my initial question, “Is there a correlation between perceived immunity to COVID-19 and participants’ education level?” as best as it could. Usualyl my next step would be to format this graph further by changing size of titles/colours etc. However, I thought a different type of graph would answer my question more effectively, so I asked my tutor Anita and she agreed and suggested that I Google which type of graph is best for data from a Likert scale. After some research I decided that I will abandon this bar chart and make a diverging bar chart instead. I set the goal of making this with a center point at neither agree nor disagree and bars going in each direction to display the data better by showing what proportion of individuals responded to each level of the Likert scale.

I started this process by Googling “How to create a diverging bar chart in R” and I found a super helpful website that can be accessed here. While this website went through some code and methods that could be used to produce a diverging bar chart, I still had a big task ahead of me. I initially started by using their code and changing the variable names to try and match my variables, but it just wouldn’t run. I realised that I need to stop, take my time, and understand what each line of code they have produced is doing and then trial it for myself once I have this knowledge. Doing this was a huge turning point in my learning R journey, because in doing so I gained a new level of understanding as opposed to simply copying code which increased my confidence. I also started to enjoy the process more when I didn’t have to keep looking back at the code I was referring to all the time.

I began making the graph by creating a summary of the data in a new data frame for this second attempt called ‘GRAPH_EA_2’ using the ‘<-’ function which tell R to get the data from ‘data’. Then I pipe this line, as explained previously.

We grouped the data into subsections according to the ‘degree’ variable (ie whether or not participants had a formal qualification) and their perceived ‘immunity’ to COVID-19 (from strongly disagree-strongly agree) using the ‘group_by()’ function from the dplyr package.

Next, I worked how many participants were in each of these subsections (i.e how many answered each factor on the likert scale), using the $count()$ function from dplyr. I called this calculation “number_answers” by putting (name = “number_answers”) inside the brackets of the $count()$ function.

Then I use the ‘group_by()’ function as explained previously, although only grouping by the ‘degree’ variable this time which allows R to find the proportion of individuals who answered each factor from strongly disagree-strongly agree in the ‘degree’ and ‘no formal qualification’ groups.

Then, I used the $mutate$ function and put “percent_answers = number_answers / sum(number_answers)” inside the brackets to tell R to find the proportion of participants each subgroup we created above and call this calculation ‘percent_answers’ using $=$.

Then, we remove the previous grouping with the $ungroup()$ function before calculating the percentage of participants in each group to prevent the possibility of the functions we used previously changing the data set where we don’t want it to.

Next, I use the $mutate()$ function again to turn the decimals found in ‘percent_answers’ into a percentage by putting the $percent()$ function from the scales package in the brackets and I call these calculated percentages “percent_answers_label” using $=$.

GRAPH_EA_2 <- data %>% 
  group_by(degree, immunity) %>% 
  count(name = "number_answers") %>% 
  group_by(degree) %>% 
  mutate(percent_answers = number_answers / sum(number_answers)) %>% 
  ungroup() %>% 
  mutate(percent_answers_label = percent(percent_answers, accuracy = 1)) 

GRAPH_EA_2

## # A tibble: 15 × 5
##                  degree immunity number_answers percent_answers percent_answers…
##               <dbl+lbl> <dbl+lb>          <int>           <dbl> <chr>           
##  1  0 [GCSE/vocational… 1 [Stro…            850          0.191  19%             
##  2  0 [GCSE/vocational… 2 [Disa…           1058          0.238  24%             
##  3  0 [GCSE/vocational… 3 [Neit…           1747          0.393  39%             
##  4  0 [GCSE/vocational… 4 [Agre…            596          0.134  13%             
##  5  0 [GCSE/vocational… 5 [Stro…            191          0.0430 4%              
##  6  1 [Degree or highe… 1 [Stro…            316          0.196  20%             
##  7  1 [Degree or highe… 2 [Disa…            373          0.231  23%             
##  8  1 [Degree or highe… 3 [Neit…            589          0.365  36%             
##  9  1 [Degree or highe… 4 [Agre…            238          0.147  15%             
## 10  1 [Degree or highe… 5 [Stro…             99          0.0613 6%              
## 11 NA                   1 [Stro…             20          0.217  22%             
## 12 NA                   2 [Disa…             13          0.141  14%             
## 13 NA                   3 [Neit…             43          0.467  47%             
## 14 NA                   4 [Agre…              7          0.0761 8%              
## 15 NA                   5 [Stro…              9          0.0978 10%

In the table that I produced above I can see the values for the ‘degree’ and ‘immunity’ variables have been coded numerically and I want to rename these.

To do this I used the $mutate()$ function from the dplyr package and inside the brackets I followed this format: (VARIABLE NAME = case_when(VARIABLE == 0 ~ “RENAME”, VARIABLE == 1 ~ “RENAME”, TRUE ~ “NA”). For instance, change VARIABLE to ‘degree’ and add in the new names for the values where it says “RENAME”, as demonstrated in the chunk below. Please refer to exploratory analysis 1 when I recoded the age ranges using these functions for further explanation.

Then I used the $filter()$ function to filter the rows to only include the degree variable and filter out N/A values from the data. These functions were chosen, as I remembered previously using them and when I tried it this code it ran!

GRAPH_EA_2 <- GRAPH_EA_2 %>% 
mutate(degree = case_when(
  degree == 0 ~ "No formal qualifications", 
  degree == 1 ~ "Degree or higher", TRUE ~ "NA")) %>%
  filter(degree != "NA")

Now I am going to use the same functions commented above for the immunity variable to change ‘1-5’ to ‘strongly disagree - strongly disagree’.

GRAPH_EA_2 <- GRAPH_EA_2 %>% 
mutate(immunity = case_when(
  immunity == 1 ~ "strongly disagree", 
  immunity == 2 ~ "disagree", 
  immunity == 3 ~ "neither agree nor disagree", 
  immunity == 4 ~ "agree", 
  immunity == 5 ~ "strongly agree", TRUE ~ "NA")) %>%
    filter(immunity != "NA")

GRAPH_EA_2

## # A tibble: 10 × 5
##    degree               immunity number_answers percent_answers percent_answers…
##    <chr>                <chr>             <int>           <dbl> <chr>           
##  1 No formal qualifica… strongl…            850          0.191  19%             
##  2 No formal qualifica… disagree           1058          0.238  24%             
##  3 No formal qualifica… neither…           1747          0.393  39%             
##  4 No formal qualifica… agree               596          0.134  13%             
##  5 No formal qualifica… strongl…            191          0.0430 4%              
##  6 Degree or higher     strongl…            316          0.196  20%             
##  7 Degree or higher     disagree            373          0.231  23%             
##  8 Degree or higher     neither…            589          0.365  36%             
##  9 Degree or higher     agree               238          0.147  15%             
## 10 Degree or higher     strongl…             99          0.0613 6%

Now that the table produced above is looking how I want it, I am going to start making a diverging bar chart. Stacked bar charts in R are automatically made so that the bars for each level are directly on top of each other. However, because I am trying to make a diverging bar chart to display this data I want the bars to share a center point at ‘neither agree nor disagree’ and go in each direction from this point. Doing this proved to be one of the most challenging, time-consuming and fiddly parts of the code I used in this report, however very satisfying when I got it right!

I begin this chunk by making a new dataframe called ‘GRAPH_EA_2_diverging’.

I want to make sure that the bars are diverging $mutate()$ function and ‘if_else’ functions from dplyr. Specifically, this second line of code in this chunk is telling R that if the participants response for the ‘immunity’ variable is strongly agree or agree then leave the ‘percent_answers’ value as it is. However, if it is not one of these responses in (i.e. neither agree nor disagree, disagree, strongly disagree) then take the inverse of this which is ‘-percent_answers’.

The final line ‘GRAPH_EA_2_diverging’ produces the table applying the code above so that I can check the output has made the correct levels negative and it is yay!

GRAPH_EA_2_diverging <- GRAPH_EA_2 %>% 
  mutate(percent_answers = if_else(immunity %in% c("strongly agree", "agree"), percent_answers, -percent_answers)) 
GRAPH_EA_2_diverging

## # A tibble: 10 × 5
##    degree               immunity number_answers percent_answers percent_answers…
##    <chr>                <chr>             <int>           <dbl> <chr>           
##  1 No formal qualifica… strongl…            850         -0.191  19%             
##  2 No formal qualifica… disagree           1058         -0.238  24%             
##  3 No formal qualifica… neither…           1747         -0.393  39%             
##  4 No formal qualifica… agree               596          0.134  13%             
##  5 No formal qualifica… strongl…            191          0.0430 4%              
##  6 Degree or higher     strongl…            316         -0.196  20%             
##  7 Degree or higher     disagree            373         -0.231  23%             
##  8 Degree or higher     neither…            589         -0.365  36%             
##  9 Degree or higher     agree               238          0.147  15%             
## 10 Degree or higher     strongl…             99          0.0613 6%

Next, in a new data frame that I called “GRAPH_EA_2_diverging_good_labels”, I took the absolute value of the ‘percent_answers’ values to make the values positive numbers (e.g -0.19 will become 0.19). To do this I used the $mutate()$ function again and the $abs()$ function from the base package in the brackets to calculate the absolute value for the ‘percent_answers’ values.

Then I found the percentage of the ‘percent_answers_label’ values using the $percent()$ function from scales package in the same way as the code above, as these will go inside the bars in the final graph.

GRAPH_EA_2_diverging_good_labels <- GRAPH_EA_2_diverging %>%
  mutate(percent_answers_label = abs(percent_answers)) %>%
  mutate(percent_answers_label = percent(percent_answers_label, accuracy = 1))

The bars of this graph are put in alphabetical order as the default setting, so next I made a new data frame called GRAPH_EA_2_diverging_right_order where I put them in order from ‘strongly disagree’ to ‘strongly agree’.

To reorder these bars, I used the $mutate()$ function and the ‘immunity’ variable in the table a factor variable instead of a character variable using the ‘fct_relevel’ function from the forcats package. I also manually relevel the order of the bars using this function so that the ‘immunity’ variable is in order from strongly agree-strongly disagree instead of alphabetical order through the text I put in the brackets. Note that the order that I put this levels that are in quotation marks is important, as this is telling R how I want them reordered.

Then I reversed the order of the factor levels in the ‘immunity’ variable, to ensure these bars stay in the right order when I use the ‘coord_flip()’ function from ggplot2 to make the bars horizontal later in the code.

GRAPH_EA_2_diverging_right_order <- GRAPH_EA_2_diverging_good_labels %>% 
  mutate(immunity = fct_relevel(immunity,
                                "neither agree nor disagree", 
                                "disagree", 
                                "strongly disagree", 
                                "agree", 
                                "strongly agree"),
         immunity = fct_rev(immunity))

GRAPH_EA_2_diverging_right_order Now the dataframe in the table above is ready to be made into a plot using the ggplot2 package!

In the first line of code in this chunk below I tell R we will be using this dataframe and pipe it using the ‘%>%’ function explained previously.

To construct the graph I use the $ggplot()$ function and add $aes()$ inside these brackets to define the aesthetics by telling R where to get the data from for each axis and the bars by inserting “(x = degree, y = percent_answers, fill = immunity)”. Then, ‘$+$’ tells ggplot2 to add another component which occurs frequently in this chunk as seen in the chunk below.

Next, I create the bars in this graph using the ‘geom_col()’ function.

To put the percentage labels for each level inside the bars I use the ‘geom_text’ function and $aes()$ to indicate that these labels are from the ‘percent_answers_label’ using “(label = percent_answers_label)”. I format the position of these labels to be inside the middle of each section of the bar by indicating “position = position_stack(vjust = 0.5)”. Then, I make these labels white using ‘color = “white”’ and bold using ‘fontface = “bold”’.

To make the bars horizontal instead of vertical for the diverging bar chart I use ‘coord_flip()’ function.

Then, I make the order of the legend match the order of the bars by manually setting the ‘breaks’ to be in order from strongly disagree to strongly agree using the ‘scale_fill_viridis_d’ function.

I then create the title for this graph using the $labs()$ function and putting the text I want for the title in quotation marks after ‘title =’. I do not need a label for the x and y axis for this diverging bar graph, so I put ’x = NULL,

Next, I remove the background fill using ‘theme_minimal()’.

Then, I formatted the graph using the ‘theme()’ function. Specifically, I removed the x axis test using ‘axis.text.x = element_text()’, removed the x axis title using axis.title.x = element_blank(), removed the panel grid using panel.grid = element_blank() and positioned the legend at the top using legend.position = “top”.

I chose the codes used in this chunk as this was the most concise way to format the diverging bar chart how I wanted it to present the data and I was guided by the principles of KISS and DRY.

GRAPH_EA_2_diverging_right_order %>%
  ggplot(aes(x = degree, y = percent_answers, fill = immunity)) +
  geom_col() +
geom_text(aes(label = percent_answers_label),
            position = position_stack(vjust = 0.5),
            color = "white",
            fontface = "bold") +
  coord_flip() +
  scale_fill_viridis_d(breaks = c("strongly disagree", "disagree", "neither agree nor disagree", "agree", "strongly agree")) +
  labs(title = "Perceived Immunity to COVID-19 according to Education Level", 
       x = NULL,
       fill = NULL) +
  theme_minimal() +
  theme(axis.text.x = element_blank(),
        axis.title.x = element_blank(),
        panel.grid = element_blank(),
        legend.position = "top")

I am super happy with how this has diverging bar graph has turned out! It displays the data in a much better way then my first bar chart I created for this exploratory analysis.

I was also proud of the fact that my code ended up being way more concise than the code I was initally referring to, yet it still managed to produce the output I wanted. I made sure that the code did not repeat lines unnecessarily as the code I was referring to did, however this was time-consuming as I had to make sure I really understood what each line was doing so that I didn’t delete sections that I needed. Although, this process of making my code more condensed further built my confidence and understanding in my learning R journey.

Creating this graph taught me a lot about the functions of R, as with the previous exploratory analysis I was using functions that I had previously used in part 2. However, to make a diverging bar chart I used new functions and packages, such as ‘fct_rev’ from the forcats package. I found that the best way for me to really understand what each line of code was doing was to delete it and produce the output without it to see what changed from the graph.

This graph reveals that participants’ perceived immunity to COVID-19 is not majorly differ between the different education levels. When initially reading paper I hypothesised that individuals with a degree or higher level of education would have a higher average perceived immunity if they think they have previously had COVID-19. While this hypothesis is true, this graph reveals that there is not a major difference between perceived immunity to COVID-19 across different education levels.

Does adherence to lockdown measures differ across the different regions of the UK?

I focused my third exploratory analysis on looking into whether participants adherence to lockdown rules was different according to the region of the UK that they resided in.

When initially reading paper I was blown away by the vast sample group that included participants from all of the regions across the UK. I hypothesise that the extent to which individuals followed the rules would differ across these regions, due to the different restrictions in the place, how they were enforced and the media coverage about the importance of lockdown measures in that region.

First I selected the columns from the data I was interested in (i.e. region, adhere_groceries, adhere_other, adhere_friends) using the $select()$ function from dplyr which falls under tidyverse. Then I wanted to create a new variable called ‘mean_adhere’ that combines the three variables related to adherence to different lockdown rules in the data by using the $mutate()$ function from the dplyr package. I told R to find the average for the row of variables that begin with the word “adhere” by inserting “mean_adhere = rowMeans(select(data, starts_with(”adhere”)), na.rm = TRUE))” into the $mutate()$ function’s brackets.

Then, I used the $select()$ function as we did at the start of this chunk to select region and this newly made variable “mean_adhere”. Now that I had selected the data I wanted to focus on I made subgroups in this data according to which region the participants were from using the ‘group_by()’ function and putting ‘region’ in the brackets.

I was chatting to a friend who is proficient in R about this report and they told me about the ‘get_summary_stats()’ function from the rstatix package that can be used to compute summary statistics. I thought that I would give it a try and used it to calculate statistics for the “mean_adhere” variable. I chose to use this variable as I have learnt through this process that trying new random functions is key on the journey of learning R. I was excited to try this, as I had never used the ‘rstatix’ package before. I was bracing myself for an error the first time I ran the code as this tends to be the way and was thrilled when it ran and very impressed by all of the different statistics that this function calculated, even though I only needed the means and SD.

EA_3 <- data %>%
  select(region, adhere_groceries, adhere_other, adhere_friends) %>%
  mutate(data, 
         mean_adhere = rowMeans(select(data, starts_with("adhere")), 
                                na.rm = TRUE)) %>%
  select(region, mean_adhere) %>%
  group_by(region) %>%
  get_summary_stats(mean_adhere, type = 'full')
EA_3

## # A tibble: 5 × 14
##     region variable     n   min   max median    q1    q3   iqr   mad  mean    sd
##   <dbl+lb> <chr>    <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 [Midl… mean_ad…  1032     0     1  0.667 0.667     1 0.333 0.494 0.724 0.316
## 2 2 [Sout… mean_ad…  1785     0     1  1     0.667     1 0.333 0     0.754 0.308
## 3 3 [Nort… mean_ad…  1455     0     1  0.667 0.667     1 0.333 0.494 0.74  0.306
## 4 4 [Lond… mean_ad…  1000     0     1  0.667 0.333     1 0.667 0.494 0.641 0.362
## 5 5 [Wale… mean_ad…   877     0     1  0.667 0.667     1 0.333 0.494 0.73  0.304
## # … with 2 more variables: se <dbl>, ci <dbl>

In the table produced by the chunk above, the regions are labelled by their numerical value as opposed to their name (i.e. ‘1’ instead of ‘Midlands’). I wanted to recode these to the region names, so I used the $mutate()$ function from the dplyr package and inside the brackets I put ‘region = case when()’. I chose this code as I had used it successfully in my first exploratory analysis to recode the age ranges and I knew that it would work to achieve what I wanted here. Then I used the $filter()$ function to filter the rows to only include the region variable. These functions were chosen, as I remembered previously using them and when I tried it this code it ran!

EA_3_rename <- EA_3 %>%
  mutate(region = case_when(
  region == 1 ~ "Midlands",
  region == 2 ~ "South & East",
  region == 3 ~ "North",
  region == 4 ~ "London",
  region == 5 ~ "Wales, Scotland & Northern Ireland", TRUE ~ "NA")) %>%
  filter(region != "NA") 
EA_3_rename

## # A tibble: 5 × 14
##   region   variable     n   min   max median    q1    q3   iqr   mad  mean    sd
##   <chr>    <chr>    <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Midlands mean_ad…  1032     0     1  0.667 0.667     1 0.333 0.494 0.724 0.316
## 2 South &… mean_ad…  1785     0     1  1     0.667     1 0.333 0     0.754 0.308
## 3 North    mean_ad…  1455     0     1  0.667 0.667     1 0.333 0.494 0.74  0.306
## 4 London   mean_ad…  1000     0     1  0.667 0.333     1 0.667 0.494 0.641 0.362
## 5 Wales, … mean_ad…   877     0     1  0.667 0.667     1 0.333 0.494 0.73  0.304
## # … with 2 more variables: se <dbl>, ci <dbl>

Now that I had the descriptive statistics that I needed all ready to go I created the bar graph using the following functions from the ggplot2 package!

First, we construct the initial plot using the $ggplot()$ function. Then, I defines the aesthetics with the $aes()$ function inside the previous brackets where I told the plot which values to include in the x axis, y axis and fill. For instance, the correct values are assigned in the x axis by the section “$x = region$” inside the brackets which is telling ggplot to include values from the variable ‘region’ in the x axis, and so on for the y axis and fill. This is followed by ‘$+$’ which tells ggplot2 to add another component to the plot.

Next, we create the bar chart using the ‘geom_col()’ function and inside these brackets include the arguments: ‘stat=identity’ to tell ggplot2 to override the default behavior as I will provide the y value. Again this is followed by ’+' which is telling R to add another component to the plot.

Then, we add the error bars into the graph using the ‘geom_errorbar()’ function. These error bars are formatted using the $aes()$ function that sets the minimum and maximum values of the error bars using calculations from the ‘mean’ and ‘sd’ that were produced by the ‘get_summary_stats’ function above. Then once again ’+' tells R to add another component to the plot.

Then, I adjusted the text for the x axis labels using the ‘theme()’ function and specifying ‘axis.text.x = element_text()’. Inside the brackets I tell R to rotate the x axis text labels by 40 degrees by specifying the argument “angle=40” and I set how close the x axis labels are to the graph using the argument “vjust = 1, hjust = 1”. Next, I modified the graph title’s position using the ‘plot.title = element_text()’ function and inserting “(vjust = 0.5, hjust = 0.2)” into the brackets.

Next, I set the scale limits for the y axis using the $ylim()$ function and specifying the scale to be (0, 1.1) so that the error bars can be seen.

Finally, I use the $print()$ function to view the graph produced from this chunk when “EA_1_graph” is placed in the brackets.

I chose the codes used in this chunk from memory based on code I had used in Part 2 with my group and in my other exploratory analysis questions. I was thrilled when this code ran, as I it was the first time that I did not have to use Google once. This proved to myself that I am really starting to understand the R language. Honestly I feel sad that I have finished the coding for this report, as the better I got the more rewarding it was. Not to forget the frustrating errors that took a lot to solve, although this pain almost made the reward of solving the error so much greater!

graph_EA_3 <- EA_3_rename %>%
  ggplot(mapping = aes(x = region, 
                       y = mean, 
                       fill = region)) +
  geom_col() +
  labs(title = "Adherence to lockdown measures across the regions of the UK",
       x = "Region",
       y = "Proportion of Adherence") +
  theme(axis.text.x = element_text(angle = 40, 
                                   vjust = 1, 
                                   hjust = 1), 
        plot.title = element_text(vjust = 0.5, hjust = 0.2)) + 
  ylim(0, 1.1) 
print(graph_EA_3)

This graph reveals that adherence to lockdown measures does differ between the regions, with London having the lowest adherence level, followed by Midlands, then Wales, Scotland and Northern Ireland, then North and South & East with the highest adherence to lockdown measures. However, the extent to which individuals followed the rules or not was not that much different, as they all adhered to the rules about 60-70% of the time.

When initially reading paper I was blown away by the vast sample group that included participants from all of the regions across the UK. I hypothesise that would differ across these regions, due to the different restrictions in the place, how they were enforced and the media coverage about the importance of lockdown measures in that region.

Part 4: Recommendations

There are numerous ways in which Smith et al. (2020) could improve the reproducibility of their results. My main suggestions are to include a codebook, document the exclusion criteria they used, define their variables in a data dictionary and double-check that the data uploaded to the OSF is correct and up to date.

Include a Codebook with Rubber Duck Debugging

The authors could have included a codebook with a clear guide of the code that they used to produce the descriptives and figures in their paper as a readme file in the OSF repository to make reproducing their results easier. As a group, we encountered the challenge of attempting to reproduce the analysis, including the demographic descriptives (means and standard deviations) and figures in this paper, by coming up with code from scratch. For instance, we had to look at the original graph in the paper and conduct trial and error until we found the correct function to recreate the figures correctly. While we did eventually reproduce the figures, finding the best way to do so made the process more tedious and time-consuming than if a codebook with rubber duck debugging had been included. Rubber duck debugging refers to including comments explaining the packages and functions that are used, how they were chosen, and what the code is doing. Without a codebook, we were not clear that what we were doing to recreate the descriptive analysis was right. I recommend that the authors include a codebook with rubber duck debugging created in R Markdown to allow for ease of comments to improve the reproducibility of their results. The author could have gone about including a codebook with rubber duck debugging by consulting Arslan’s (2018) paper that teaches how to automatically document data with the ‘codebook package’. Arlan (2018) teaches the importance of including interpretable and informative comments to increase reproducibility, although notes that creating such codebooks requires effort. This learning resource proposes that the ‘codebook package’ makes creating a codebook with comments easier. This would be useful to Smith et al. (2020), as it provides them with the skills necessary to create a codebook with comments and thus make their work more easily reproducible.

Document Exclusion Criteria

The authors could have clarified their exclusion criteria and accounted for the data that is not present to make the job of reproducing their results easier. Our group encountered challenges when trying to reproduce Table 3, as the authors did not provide detail about missing data and the exclusion criteria that they used. For instance, the total number of participants’ responses recorded for the correct identification of cough and fever as the main Covid-19 symptoms variable was 6022 compared to 6149 participants that responses were recorded for all of the other variables. As a team we initially all questioned ourselves and thought that we may have made a mistake and somehow filtered the participants in the data set, so we reloaded the data to check this although this was when we realised that it turned out our team all had the same coding and the authors differed from ours. It seems as though we couldn’t correctly reproduce this figure, as Smith et al. (2020) excluded participants from their analysis without reporting this. The authors could have provided an explanation for the exclusion criteria that they used (e.g talking about why certain participants were excluded in their analysis) which would have allowed us to correctly reproduce the descriptive analysis in the same way that the authors did. Smith et al., (2020) could consult Patino and Ferreira’s (2018) paper that discusses the importance of documenting exclusion criteria and provides a step-by-step guide on how to do so. This learning resource would assist Smith et al. (2020) to acquire the skills of documenting their exclusion criteria correctly to allow their results to be reproducible.

Define Variables using a Data Dictionary

To make the job of reproducing their results easier, Smith et al. (2020) could have defined what each variable is referring to in a data dictionary. It was time-consuming for our group to match the data with what variable it was referring to due to the confusing names that the authors used. For instance, when initially looking at the data under the heading “Ever_covid” we thought this variable referred to if participants had ever tested positive for COVID-19. However, after undertaking the time-consuming process of looking at the results presented in the paper and aligning this to the data in the OSF file we realised that “Ever_covid” corresponded to the variable about whether participants believed they had previously had COVID-19. Smith et al. (2020) could have made their variable names more descriptive, for instance in the previous example including the word “believe” i.e. making the variable name “Believe_ever_covid” would have saved confusion. Furthermore, our group initially had difficulty working out which number corresponds to which variable in the data. For instance, there were two levels of the gender variable, female and male. However, to work out whether females were coded as ‘1’ or ‘2’ in the data, we found out how many participants were in each of these levels and matched this up with how many males and females Smith et al. (2020) reported in the paper. We followed a similar process to work out how all of our variables, including levels of each different scale, were coded which was time-consuming. However, the authors could have outlined which number corresponded to which level of the categorical variable in a data dictionary to simplify this process. OSF support provides a guide called “How to Make a Data Dictionary” that can be accessed at https://help.osf.io/article/217-how-to-make-a-data-dictionary. This learning resource could be used to teach Smith et al. (2020) to make a data dictionary that defines the variables and states why and how they were measured which would allow others to understand their data more easily, thus improving the reproducibility of their research.

Double-check the Data uploaded to the OSF

Smith et al. (2020) could have uploaded the correct data to reproduce their figures into the OSF repository to make reproducing their results easier. Our group encountered the challenge of not being able to reproduce the numeric statistics (counts and percentages) for the ‘identify_symptoms’ variable to match Smith et al.’s Table 3, as the data they uploaded into the OSF was from before their exclusion criteria. Additionally, Smith et al. (2020) accidentally switched the levels of this variable in the dataset, meaning that our group had to undertake the tedious and time-consuming process of eyeballing the data and using trial and error to work this out and then manipulate the data to reproduce Table 3. Eventually, we worked out that using the ‘rename()’ function to correct this mistake and swapping the values ‘0’ and ‘1’ around for the ‘identify_symptoms’ variable produced the table correctly. Smith et al. (2020) could have double checked this data before uploading it, by using this data to recreate their figures which would have made them realise that this is an old dataset from before they excluded participants and they had accidentally swapped labels for the ‘identify_symptoms’ variable. The authors correcting these mistakes and updating the data in the OSF repository would have made reproducing these results much easier. To acquire the skills necessary to double check the data they upload into the repository to make their work reproducible, I suggest Smith et al. (2020) consult the “Depositing Data on the OSF” article which can be accessed via https://osf.io/a5imq/wiki/How%20to%20Upload%20Data%20to%20the%20OSF/. This learning resource is a step-by-step guide on how to ensure the data you upload onto the OSF repository is correct and it includes methods of double checking, such as attempting to reproduce your figures which help to identify mistakes before the data is uploaded.

References

Arslan, R. C. (in press). How to automatically document data with the codebook package to facilitate data re-use. Advances in Methods and Practices in Psychological Science. doi:10.1177/2515245919838783 Open Access Preprint

Centre for Open Science (2022). How to make a data dictionary. OSF Support. https://help.osf.io/article/217-how-to-make-a-data-dictionary

Centre for Open Science. (2022). How to deposit data on the OSF. OSF Home. https://osf.io/a5imq/wiki/How%20to%20Upload%20Data%20to%20the%20OSF/

Keyes, D. (2021). How to make a diverging bar chart in r. R For the Rest of Us. https://rfortherestofus.com/2021/10/diverging-bar-chart/

Patino, C. M., & Ferreira, J. C. (2018). Inclusion and exclusion criteria in research studies: definitions and why they matter. Jornal brasileiro de pneumologia : publicacao oficial da Sociedade Brasileira de Pneumologia e Tisilogia, 44(2), 84. https://doi.org/10.1590/s1806-37562018000000088

Verification Report

Annabelle Hudson

2022-08-07

Part 1: Summary and Reaction

Summary

Reaction

Part 2: Verification

Verification Goals

Table 1

Table 2

Table 3

Figure 1

Locate the open data

Load packages

Read the Data

Glimpse the data to look at variables

Rename Variables

Identifying the variables

Table 1

Table 2

Table 3

Figure 1

Part 3: Exploratory Analysis

Load Packages

Does adherence to the lockdown rule of not meeting up with friends and family outside your home differ according to age?

Does perceived immunity to COVID-19 differ according to individuals’ education level?

Does adherence to lockdown measures differ across the different regions of the UK?

Part 4: Recommendations

Include a Codebook with Rubber Duck Debugging

Document Exclusion Criteria

Define Variables using a Data Dictionary

Double-check the Data uploaded to the OSF

References