Part 1 Summary

Smith, Mottershaw, Egan, Waller, Marteau and Rubin’s (2020) paper investigates the impact of believing you have had covid-19 on self-reported behavior and more specifically adherence to lockdown measures. This study was unique in its investigation of participants’ belief as to whether they have had Covid-19, rather than only measuring those who tested positive. It was important to study this population to determine priority population groups for adherence to lockdown measures.

This study measured 6149 UK participants, via an online cross-sectional survey in their response to whether they have had the virus, whether they got tested, the results of their test, their perceived immunity to the virus, their adherence to lockdown measures, how worried they are about the virus and finally what the most common symptoms are (to assess the likelihood of incorrect self-diagnosis). They measured participants answers on “yes/no”, 4 and 5 point likhert scales and continuous scales E.g how many times have you left the house.

They found that 24.3% of participants thought they have had Covid-19, however only 9.4% reported being tested. They also found that those who thought they have had Covid-19 were more likely to believe they were immune to it, disobey lockdown measures and incorrectly identify the common symptoms of the virus.

The results prompt the discussion that perhaps many people who thought they have had the virus didn’t actually have it, believed they were immune, and disobeyed lockdown. This is concerning because as Covid-19 continues, the number of people who believe they have had the virus will increase, and as this study suggests so will a lack of adherence to lockdown measures and potentially greater spreading of the virus. An important implication of this study is to target those who believe they have had Covid-19 in promoting adherence to lockdown measures.

Reaction

The most interesting part of this paper was the idea of perceived immunity. I didn’t know such a large portion of individuals believed they could be immune to the virus. This study found that those who believed they have had covid-19 were more likely to believe they were immune, which prompted me to consider whether the majority of these individuals were those who were tested or those who were not tested for covid-19. I would assume the majority of the individuals believing they were immune were those who had received a negative covid-19 test, as they thought they might have the virus but were told they didn’t, which could result in this idea of immunity. I am very interested in what triggers perceived immunity.

This paper reminded me of another paper I read on covid-19 by Beaunoyer et al. “COVID-19 and digital inequalities: Reciprocal impacts and mitigation strategies” (2020). An interesting idea from Beaunoyers paper identified the elderly as a social group vulnerable to digital literacy inequalities which results in reduced health literacy (2020). This made me consider the amount of participants who could not correctly identify cough and fever as the two main covid-10 symptoms in Smith et al’s paper (2020). I wonder whether the majority of these individuals are older adults, due to the suggestion made by Beaunoyers (2020). Smith et al’s (2020) paper found that those aged 55 or above were least likely to think they have had covid-19, despite being most vulnerable. This makes me wonder whether they didn’t think they have had the virus because they didn’t know the main symptoms.

I was surprised by the lack of adherence to lockdown measures found in this paper. The large proportion of participants who disobeyed lockdown measures, regardless of thinking they have had covid or not was very surprising to me. This prompted me to consider our lockdown in Sydney and perhaps this finding surprised me because everyone I know and spoke to, (largely) adhered to lockdown measures. I wonder whether adherence to lockdown is a bigger issue in the UK, and if adherence varies between different countries, and regions.

Part 2 Verificiation Report

The first step on my reproduction journey was to get the osf data file into R studio. First I located the osf file which was linked to the paper. I clicked download and saved the file.

I saved the file to a folder I had created labeled .Rdata, opened R studio, clicked the “files” tab, located the data, double clicked on it, and it loaded into my global environment (a place in Rstudio where all your data is stored).

Before I could try and reproduce any descriptive statistics I had to install the packages “tidyverse”, “dplyr” and “haven”. Tidyverse has a collection of packages which allowed me to do basic data analysis and graphing. The dplyr package allows for changing and analysing data and the haven package is for statistical analysis.

To install the packages I typed “install.packages(”package name“) into the console, and”library(package name)" into the R code. I then loaded in the data set using the “read.sav” function. I called the data file, titled “data_data.sav” “assignment_data”, using “<-” to tell R to rename the data set.

library(haven)
library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

## ✓ ggplot2 3.3.3     ✓ purrr   0.3.4
## ✓ tibble  3.1.1     ✓ dplyr   1.0.5
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(dplyr)
library(ggplot2)
library(gt)

assignment_data <- read_sav("data_data.sav")

Table 1

I then started to reproduce table 1. I aimed to reproduce the total count (“n”) for those who thought they have and had not had covid, as well as the total count between those who thought they have and haven’t had covid-19 for each of the demographic groups (gender, have a child, employment etc). To determine what names in the data file correspond to what variables I opened the data file and under each name was a description of the variable it was describing. The names were fairly self-explanatory which made the process a lot easier.

I first set out to find the total number of individuals who thought they did and did not have covid-19. I created a name for the output counting the Ever_covid variable called “count_covid”, which uses data from “assignment_data”. This was followed by a pipe “%>%” which puts all of the functions in your code together. I then added the “group_by” function, to make the count distinguish between those who think they have or haven’t had covid-19. Finally I added the “count” function to get a total count for the variable (Ever_covid). I learnt these basic functions in Narvarro’s (2020) you tube series.

assignment_data <- read_sav("data_data.sav")

count_covid <- assignment_data %>%
  group_by(Ever_covid) %>% count(Ever_covid)

print(count_covid)

## # A tibble: 2 x 2
## # Groups:   Ever_covid [2]
##                           Ever_covid     n
##                            <dbl+lbl> <int>
## 1 0 [Think have not had coronavirus]  4656
## 2 1 [Think have had coronavirus]      1493

The labels for Ever_covid were 0 and 1. I cross-checked what corresponds to these values by finding this variable “Ever_covid” in the data file and it told me not having covid was 0 and having covid as 1. My output shows that these statistics were reproduced.

For the variables such as gender, age, have a child in table 1, I started by finding the count for each of them individually. I did this using the same code as before but changing the variable inside the “count()” function. This process of using similar code and substituting different variables was very helpful while learning R, and meant I didnt have to repeat myself (applying the principles of DRY).

amount_covid <- assignment_data %>%
  group_by(Ever_covid) %>% 
  count(gender)
print(amount_covid)

## # A tibble: 4 x 3
## # Groups:   Ever_covid [2]
##                           Ever_covid     gender     n
##                            <dbl+lbl>  <dbl+lbl> <int>
## 1 0 [Think have not had coronavirus] 1 [Male]    2197
## 2 0 [Think have not had coronavirus] 2 [Female]  2459
## 3 1 [Think have had coronavirus]     1 [Male]     697
## 4 1 [Think have had coronavirus]     2 [Female]   796

I was able to reproduced these statistics also. I soon realized that copying this code for each variable was redundant, so I tried listing multiple variables into the count function. I realised, to make it easier for myself while learning code, using one chunk of code to get a larger amount of output is much quicker and easier to turn into a table or graph. However I was getting various error messages, including this one.

Jenny Sloan providing us with the starting point of using the function “zap_labels”, which removed all of the labels from the data. I added the function “pivot_longer” which tells R to move my output which was really wide to a long format (Navarro, 2020). This made it much easier to check whether or not the statistics had been reproduced. I used the “:” symbol to tell R to select the variables from gender to region (as they were next to each other in the data file). This meant I didn’t have to type them all out (applying the KISS principles). Within “pivot_longer” I told R where to move numerical and categorical data using the functions “names_to” and “values_to”. Again I was grouping by Ever_covid, however I also wanted to group_by each variable and value so I could get all the output for the demographic variables at once. Finally, I added the count function with “name” = “number” which tells R to count these variables and call the count output “number”.

amount_covid <- assignment_data %>%
  zap_labels() %>% 
  pivot_longer(gender:region, names_to = "vars", values_to = "values") %>%
  group_by(Ever_covid, vars, values) %>% 
  count(name = "number") %>%
  
print(amount_covid)

I noticed missing data in this output which I needed to remove from my analysis. I asked in my learning log and received the answer of the function na.omit(), which does exactly that. I also needed to add percentages for all the “counts” i had found. I googled “how to find percentages in tidyverse” and I found the code (percentage = round(number/sum(number)* 100, 1)))" from Stack overflow which goes within the “mutate” function (2015). The “mutate” function allows you to change things about the variables. This gave me the correct percentages for all of the descriptive statistics in table 1.

  covid_1 <- assignment_data %>%
  zap_labels() %>% 
  pivot_longer(gender:region, names_to = "vars", values_to = "values") %>%
  group_by(Ever_covid, vars,values) %>% 
  count(name = "number") %>% 
  na.omit %>% 
  group_by(vars,values) %>% 
  mutate(percentage = round(number/sum(number) * 100, 1))
  
print(covid_1)

## # A tibble: 40 x 5
## # Groups:   vars, values [20]
##    Ever_covid vars           values number percentage
##         <dbl> <chr>           <dbl>  <int>      <dbl>
##  1          0 age_categories      1   1003       70.5
##  2          0 age_categories      2    823       67.3
##  3          0 age_categories      3    751       71.9
##  4          0 age_categories      4    554       77.2
##  5          0 age_categories      5   1525       87.6
##  6          0 degree              0   3382       76.1
##  7          0 degree              1   1200       74.3
##  8          0 gender              1   2197       75.9
##  9          0 gender              2   2459       75.5
## 10          0 Has_child           0   2005       76.4
## # … with 30 more rows

Next I needed to make this tibble into a table which looked something like that in the paper (Smith et al. 2020). I searched “different packages for creating tables in R studio”. As a group we tried the different pacakges, starting with “gtsummary”. We added the function “tbl_summary()” within the gtsummary package and got the following output.

Whilst this was a definite improvement from the tibble, there was limited functions in this package that would allow me to format the table to look closer to table 1. Arunima (a group member) then found the gt package. After installing gt I added the function “gt()” to my code, inserting the data within this function.

covid_1 <- assignment_data %>%
  zap_labels() %>% 
  pivot_longer(gender:region, names_to = "vars", values_to = "values") %>%
  group_by(Ever_covid, vars,values) %>% 
  count(name = "number") %>% 
  na.omit %>%
  group_by(vars,values) %>% 
  mutate(percentage = round(number/sum(number) * 100, 1))

gt(covid_1)

Ever_covid	number	percentage
age_categories - 1
0	1003	70.5
1	419	29.5
age_categories - 2
0	823	67.3
1	400	32.7
age_categories - 3
0	751	71.9
1	294	28.1
age_categories - 4
0	554	77.2
1	164	22.8
age_categories - 5
0	1525	87.6
1	216	12.4
degree - 0
0	3382	76.1
1	1060	23.9
degree - 1
0	1200	74.3
1	415	25.7
gender - 1
0	2197	75.9
1	697	24.1
gender - 2
0	2459	75.5
1	796	24.5
Has_child - 0
0	2005	76.4
1	621	23.6
Has_child - 1
0	2386	75.5
1	776	24.5
Key_worker - 0
0	3105	80.5
1	753	19.5
Key_worker - 1
0	1551	67.7
1	740	32.3
region - 1
0	781	75.7
1	251	24.3
region - 2
0	1369	76.7
1	416	23.3
region - 3
0	1120	77.0
1	335	23.0
region - 4
0	701	70.1
1	299	29.9
region - 5
0	685	78.1
1	192	21.9
Working - 0
0	1714	82.8
1	357	17.2
Working - 1
0	2871	71.9
1	1124	28.1

This was much closer to what I was looking for and after looking at a blog on gt I found many functions I could use within this package (Iannone, 2020). I needed to change the variable names for many of the variables, for example so Ever_covid = doesnt read as 0 and 1 by “think they haven’t had covid” and “think they have had covid”. I did this by first making the variables a factor. I wanted to change something within the data set (hence data <- data) then inside the function “factor” (which I learnt from tutorials) I used the “$” symbol which tells R which data set the variable is within. A group member told me that I could change the variable names within the factor function, simply by adding “labels = c()” and typing the new labels in order.

table1$Ever_covid <- factor(table1$Ever_covid, 
                            labels = c("Think have not had COVID-19 n = 4656", 
                                       "Think have had COVID-19 n = 1493"))
table1$gender <-factor(table1$gender, 
                       labels = c("Male", "Female"))
table1$age_categories <- factor(table1$age_categories, 
                                labels = c("18 to 24 years", 
                                           "25 to 34 years", 
                                           "35 to 44 years", 
                                           "45 to 54 years", 
                                           "55 years and over"))
table1$Has_child <- factor(table1$Has_child, 
                           labels = c("No", "Yes"))
table1$Working <- factor(table1$Working, labels = c("Not working", "Working"))
table1$Key_worker <- factor(table1$Key_worker, 
                            labels = c("No", "Yes"))
table1$degree <-factor(table1$degree, 
                       labels = c("GCSE/vocational/A-level/No formal qualifications", "Degree or higher (Bachelors, Masters, PhD"))
table1$region <- factor(table1$region, 
                        labels = c("Midlands", 
                                   "South and East", 
                                   "North", "London", 
                                   "Wales, Scotland and Northern Ireland"))

Now i had to reorder the variables so they would appear the same as table 1 (Smith et al. 2020). I had to first use the pivot_longer function for the demographic variables, so they would appear as rows. I then used the pivot wider function to tell R to move the values and Ever_Covid into individual columns (“id_cols”). I learned about these functions in Jenny’s tutorials. Within pivot wider I had to use the mutate function, to make a new column for the “real” percentage values that were calculated.

table1 <- table1 %>% 
  pivot_longer(age_categories:Working, 
               names_to = "Participant Characteristics", 
               values_to = "Levels") %>% 
  na.omit %>% 
  mutate(Real = paste0(number,"(",percentage,"%)")) %>%
  select(-percentage, -number) %>% 
  pivot_wider(id_cols = -Real, names_from = Ever_covid, values_from = Real)

Finally I created some aesthetics to the table, renaming the groups, rows and spanners (which are bars that go along the columns) all within the gt function. I specified the names of the rows and groups using the functions “rename_col” and “groupname_col”. I created a spanner for the Ever_covid variable using the “Tab_spanner” function, specifying the name of the spanner using “label =” and the columns within the spanner using “columns = vars()”. The function “row_group_order” allowed me to simply enter the variables in order. Finally in “tab_style” I specified text style and fill colours. I found these functions from Iannone’s (2021) website. Google was a very important troubleshooting source especially as I became familiar with certain websites that were more helpful, E.g Stackoverflow, and STHDA.

table1 <- table1 %>% 
  gt(rowname_col = "Levels",
     groupname_col = "Participant Characteristics") %>% 
  tab_spanner(label = md("**Had COVID-19**"), 
              columns = vars("Think have not had COVID-19 n = 4656",
                             "Think have had COVID-19 n = 1493")) %>% 
    row_group_order(c("Gender", "Age", "Have a child",
                      "Employment Status", "Working in Key Sector",
                      "Highest Education or Professional Qualification",
                      "Region")) %>% 
    cols_label(
      'Think have not had COVID-19 n = 4656' = md("**Think have not had COVID-19 n = 4656**"),
      'Think have had COVID-19 n = 1493' = md( "**Think have had COVID-19 n = 1493**")
    ) %>% 
    tab_style(
      style = list(
        cell_text(weight = "bold"),
        cell_fill()
      ),
  locations = cells_row_groups())

To make my code reproducible, I put all of the code used for table 1 into one chunk. Applying the Keep it super simple principles, this would allow someone else to simply run this one chunk of code.

table1 <- covid_1 %>% 
  pivot_wider(names_from = vars, values_from = values) %>% 
zap_labels() %>% mutate_if(is.numeric, as.factor)
  
  table1$Ever_covid <- factor(table1$Ever_covid, 
                              labels = c("Think have not had COVID-19 n = 4656", 
                                         "Think have had COVID-19 n = 1493"))
table1$gender <-factor(table1$gender, labels = c("Male", "Female"))
table1$age_categories <- factor(table1$age_categories, 
                                labels = c("18 to 24 years", "25 to 34 years", 
                                           "35 to 44 years", "45 to 54 years", 
                                           "55 years and over"))
table1$Has_child <- factor(table1$Has_child, labels = c("No", "Yes"))
table1$Working <- factor(table1$Working, labels = c("Not working", "Working"))
table1$Key_worker <- factor(table1$Key_worker, labels = c("No", "Yes"))
table1$degree <-factor(table1$degree, 
                       labels = c("GCSE/vocational/A-level/No formal qualifications", 
                                  "Degree or higher (Bachelors, Masters, PhD"))
table1$region <- factor(table1$region, 
                        labels = c("Midlands", "South and East", "North", "London", 
                                   "Wales, Scotland and Northern Ireland"))
    
table1 <- table1 %>% 
  pivot_longer(age_categories:Working, 
               names_to = "Participant Characteristics", values_to = "Levels") %>% 
  na.omit %>% 
  mutate(Real = paste0(number,"(",percentage,"%)")) %>%
  select(-percentage, -number) %>% 
  pivot_wider(id_cols = -Real, names_from = Ever_covid, values_from = Real) 
  
table1$`Participant Characteristics` <- 
  as.factor(table1$`Participant Characteristics`)
table1$`Participant Characteristics` <- 
  factor(table1$`Participant Characteristics`,
         levels = c ( "gender",  "age_categories", 
                      "Has_child", "Working", "Key_worker",
                      "degree", "region"),
         labels = c ( "Gender", "Age", "Have a child",
                      "Employment Status", "Working in Key Sector",
                      "Highest Education or Professional Qualification", "Region"))
                                                    
  table1 <- table1 %>% 
  gt(rowname_col = "Levels",
     groupname_col = "Participant Characteristics") %>% 
  tab_spanner(label = md("**Had COVID-19**"), 
              columns = vars("Think have not had COVID-19 n = 4656",
                             "Think have had COVID-19 n = 1493")) %>% 
    row_group_order(c("Gender", "Age", "Have a child",
                      "Employment Status", "Working in Key Sector",
                      "Highest Education or Professional Qualification",
                      "Region")) %>% 
    cols_label(
      'Think have not had COVID-19 n = 4656' = md("**Think have not had COVID-19 n = 4656**"),
      'Think have had COVID-19 n = 1493' = md( "**Think have had COVID-19 n = 1493**")
    ) %>% 
    tab_style(
      style = list(
        cell_text(weight = "bold"),
        cell_fill()
      ),
  locations = cells_row_groups())

## Warning: `columns = vars(...)` is deprecated since gt 0.2.3:
## * please use `columns = c(...)` instead

  print(table1)

I was able to reproduce all statistics for table 1. For reference here is table 1 (Smith et al. 2020).

Table 2

Table 2 was the easiest to find the statistics for, only requiring the mean and standard deviation, which I knew how to find using the summarise function I learned in Navarro’s (2020) you tube series. Here is table 2:

I had already found the count for Ever_covid in table 1 using this code:

count_covid <- assignment_data %>%
  group_by(Ever_covid) %>% count(Ever_covid)

print(count_covid)

## # A tibble: 2 x 2
## # Groups:   Ever_covid [2]
##                           Ever_covid     n
##                            <dbl+lbl> <int>
## 1 0 [Think have not had coronavirus]  4656
## 2 1 [Think have had coronavirus]      1493

To find the mean and standard deviation I created a new data set for these stats called MSD. Another group member used this name, so for easy handover we kept the names consistent. After telling R “MSD” was the new name for “assignment_data”, I grouped by Ever_covid and used the summarise function to tell R the “mean” equals the mean of the following variables (mean = mean(vars)).

MSD <- assignment_data %>% 
   group_by(Ever_covid) %>% 
   summarise(mean = mean(q8haveimmunity,Going_out_total,q9worry,q10arisk,q10brisk))

print(MSD)

## # A tibble: 2 x 2
##                           Ever_covid  mean
##                            <dbl+lbl> <dbl>
## 1 0 [Think have not had coronavirus]  2.38
## 2 1 [Think have had coronavirus]      3.33

This code found the mean across all of the variables by Ever_covid, which is not what I wanted. I tried using “:” to list the variables - which worked for table 1 however this gave me the same result. A group member asked about this in their learning log and was given the hint to use summarise “across” for a group of variables. I tried adding “across(mean = mean(variable1,variable2,variable3))” however I was getting an error that you cant average across labelled variables. I googled the error and found that you have to include “across(c(then list variables here))”, which tells R to summarise across columns (“c”). Now the summarise function was filled with the variables, so I used the “list” function to tell R what to calculate (Wickam, Romain, Henry & Muller, 2021).

MSD <- assignment_data %>% 
   group_by(Ever_covid) %>% 
   summarise(across(c(q8haveimmunity,Going_out_total,q9worry,q10arisk,q10brisk), 
                    list(mean = mean), standard_deviation = sd)) %>%
  mutate(across(everything(), round, 2))

print(MSD)

## # A tibble: 2 x 6
##   Ever_covid q8haveimmunity_mean Going_out_total_mean q9worry_mean q10arisk_mean
##        <dbl>               <dbl>                <dbl>        <dbl>         <dbl>
## 1          0                2.38                 6.69         3.59          2.81
## 2          1                3.33                 9.35         3.38          2.81
## # … with 1 more variable: q10brisk_mean <dbl>

All statistics were able to be reproduced from table 2, except for a small rounding error for perceived immunity (q8haveimmunity_standard_deviation). My output found the sd of 1.004532 which rounds 2dp to 1.00 however in the paper they rounded incorrectly to 1.01 (Smith et al. 2020).

I then needed to make this output longer, as it appears in the paper, so I used pivot longer for all of the varaibles. I used the “ends_with” function within pivot longer to make my code more simple (applying the KISS principles).

table2mean <- MSD %>% 
  select(ends_with("mean", "standard_deviation"), Ever_covid) %>%
  pivot_longer(ends_with("mean", "standard_deviation"), 
               names_to = "vars", values_to = "values")

This was causing errors in the ordering of the variables. Bart (a teammate) took over and tried pivoting the data twice and this made two different long columns of mean and sd but the variable order was still wrong He came up with the idea (with the help of Jenny’s tutorials) to make two separate data sets foor the mean and sd’s which could be merged later in the correct order. First I made two new names for the separate data sets. I used table2mean and table2SD to keep it simple. I then selected the variables I wanted and used the same code as above.

table2mean <- MSD %>% 
  select(ends_with("mean"), Ever_covid) %>% 
  pivot_longer(ends_with("mean") , 
               names_to = "Mean_vars", values_to = "Mean_values")
print(table2mean)

## # A tibble: 10 x 3
##    Ever_covid Mean_vars            Mean_values
##         <dbl> <chr>                      <dbl>
##  1          0 q8haveimmunity_mean         2.38
##  2          0 Going_out_total_mean        6.69
##  3          0 q9worry_mean                3.59
##  4          0 q10arisk_mean               2.81
##  5          0 q10brisk_mean               3.39
##  6          1 q8haveimmunity_mean         3.33
##  7          1 Going_out_total_mean        9.35
##  8          1 q9worry_mean                3.38
##  9          1 q10arisk_mean               2.81
## 10          1 q10brisk_mean               3.3

table2SD <- MSD %>% 
  select(ends_with("deviation"), Ever_covid) %>% 
  pivot_longer(ends_with("deviation") ,  
               names_to = "SD_vars", values_to = "SD_values")
print(table2SD)

I used a website to figure out how to bind these two data sets to create a nice table (Wickam, Romain, Henry & Muller, 2021). I found the “bind_cols” function and simply put the two data sets within this function. Inside mutate, “real = paste0” tells R to paste the Mean and SD within the binded column. I then decided to mutate the variables so variables such as perceived immunity wouldn’t read as 1,2,3,4,5 in the data, but instead 1 = strongly disagree etc. I did this by using the “mutate(level = case_when)” function which I learned in Navarro (2020)’s you tube series. This tells R when the variable within the data set starts with “Mean_vars” or “q8” to mutate it the following way.

table2 <- bind_cols(table2SD,table2mean) %>% 
  mutate(real = paste0("M = ", Mean_values,",", " SD = ", SD_values)) %>% 
  select(Ever_covid...1, Mean_vars, real) %>% 
  mutate(Level = 
           case_when(startsWith(Mean_vars,"q8") ~ 
                       "1 = Strongly disagree to  5 = Strongly agree", 
                           startsWith(Mean_vars,"q9") ~ 
                       "1 = not worried at all to  5 = extremely worried", 
                           startsWith(Mean_vars, "Going") ~
                       "Range = 0 to 42", 
                           TRUE ~"1 = not risk at all to  4 = major risk")) %>% 
  rename("Participant characteristics" = "Mean_vars")

I tried to make this data straight into a gt however I had to make Ever_covid, and the rest of the variables factors first, so they could be used within the function “names_from” and “values_from” in the gt package. I used the “as.factor” function with the “labels” function within it, to rename these factors.

 table2$Ever_covid...1 <- as.factor(table2$Ever_covid...1)
  table2$`Participant characteristics` <- as.factor(table2$`Participant characteristics`)
  table2$Ever_covid...1 <- 
    factor(table2$Ever_covid...1,
           labels = c("Think have not had COVID-19 n = 4656", 
                      "Think have had COVID-19 n = 1493"))
  table2$`Participant characteristics` <- 
    factor(table2$`Participant characteristics`,
           labels = c("I think I have some immunity to COVID-19",
                      "Total out-of-home activity in the last seven days",
                      "Worry about COVID-19",
                      "Perceived risk of COVID-19 to onself",
                      "Perceived risk of COVID-19 to people in the UK"))

Finally I pivoted these columns. This required telling R the individual columns are the “real” data that I found above, followed by where to move the categorical and which are numerical data using the “names_from” and “values_from” functions. Then i added the gt() function and used the “tab_spanner” function and relabeled the spanner columns within this function. Finally the “print” function tells R to display the output.

library(gt)

table2 <- table2 %>% 
  pivot_wider(id_cols = -real, names_from = Ever_covid...1, values_from = real) %>% 
    gt() %>% 
    tab_spanner(label = md("**Had COVID-19**"), 
                columns = vars("Think have not had COVID-19 n = 4656",
                               "Think have had COVID-19 n = 1493")) %>%
  cols_label('Participant characteristics' = md('**Participant Characteristics**'),
    Level = md('**Level**'),
    'Think have not had COVID-19 n = 4656' = md("**Think have not had COVID-19 n = 4656**"),
    'Think have had COVID-19 n = 1493' = md("**Think have had COVID-19 n = 1493**")) 
  
print(table2)

So others can reproduce my findings I have put all of the code needed for table 2 into one chunk.

MSD <- assignment_data %>% 
   group_by(Ever_covid) %>% 
   summarise(across(c(q8haveimmunity,Going_out_total,q9worry,q10arisk,q10brisk),
                    list(mean = mean, standard_deviation = sd))) %>%
  mutate(across(everything(), round, 2))

table2mean <- MSD %>% 
  select(ends_with("mean"), Ever_covid) %>% 
  pivot_longer(ends_with("mean") ,  names_to = "Mean_vars", values_to = "Mean_values")

table2SD <- MSD %>% 
  select(ends_with("deviation"), Ever_covid) %>% 
  pivot_longer(ends_with("deviation") ,  names_to = "SD_vars", values_to = "SD_values")

table2 <- bind_cols(table2SD,table2mean) %>% 
  mutate(real = paste0("M = ", Mean_values,",", " SD = ", SD_values)) %>% 
  select(Ever_covid...1, Mean_vars, real) %>% 
  mutate(Level = case_when(startsWith(Mean_vars,"q8") ~ "1 = 
                           Strongly disagree to  5 = Strongly agree", 
                           startsWith(Mean_vars,"q9") ~"1 = 
                           not worried at all to  5 = extremely worried", 
                           startsWith(Mean_vars, "Going") ~"Range = 0 to 42", 
                           TRUE ~"1 = not risk at all to  4 = major risk ")) %>% 
  rename("Participant characteristics" = "Mean_vars")

## New names:
## * Ever_covid -> Ever_covid...1
## * Ever_covid -> Ever_covid...4

table2$Ever_covid...1 <- as.factor(table2$Ever_covid...1)
  table2$`Participant characteristics` <- as.factor(table2$`Participant characteristics`)
  table2$Ever_covid...1 <- factor(table2$Ever_covid...1, 
                                  labels = c("Think have not had COVID-19 n = 4656", 
                                             "Think have had COVID-19 n = 1493"))
  table2$`Participant characteristics` <- 
    factor(table2$`Participant characteristics`, 
           labels = c("I think I have some immunity to COVID-19",
                      "Total out-of-home activity in the last seven days",
                      "Worry about COVID-19", "Perceived risk of COVID-19 to onself",
                      "Perceived risk of COVID-19 to people in the UK")) 
  
table2 <- table2 %>% 
    pivot_wider(id_cols = -real, names_from = Ever_covid...1, values_from = real) %>% 
    gt() %>%
    tab_spanner(label = md("**Had COVID-19**"), 
                columns = vars("Think have not had COVID-19 n = 4656",
                               "Think have had COVID-19 n = 1493")) %>%
  cols_label('Participant characteristics' = md('**Participant Characteristics**'),
    Level = md('**Level**'),
    'Think have not had COVID-19 n = 4656' = md("**Think have not had COVID-19 n = 4656**"),
    'Think have had COVID-19 n = 1493' = md("**Think have had COVID-19 n = 1493**"))

## Warning: `columns = vars(...)` is deprecated since gt 0.2.3:
## * please use `columns = c(...)` instead

print(table2)

For reference, here is table 2 and the descriptives we were trying to reproduce

Table 3

For table three I set out to find the count and percentages for the adherence variables and ability to identify covid symptoms, between those who think they have have and haven’t had covid-19.

I started by trying to reproduce the descriptive statistics for just one of the variables: shopping for groceries. I used the same code as for table 1, just substituting the data set and variables to keep it simple.

t3_top <- assignment_data %>%
  group_by(Ever_covid) %>%
  count(Adhere_shop_groceries)

print(t3_top)

## # A tibble: 4 x 3
## # Groups:   Ever_covid [2]
##                           Ever_covid Adhere_shop_groceries     n
##                            <dbl+lbl>             <dbl+lbl> <int>
## 1 0 [Think have not had coronavirus]     0 [Twice or more]  1701
## 2 0 [Think have not had coronavirus]     1 [Once or less]   2955
## 3 1 [Think have had coronavirus]         0 [Twice or more]   688
## 4 1 [Think have had coronavirus]         1 [Once or less]    805

This reproduced the count in the table so I knew this code was correct. I then tried to create a code which would do this for all of the variables in table 3 by adding all the variables into the count function.

t3_top <- assignment_data %>%
  group_by(Ever_covid) %>%
  count(Adhere_meet_friends, Adhere_shop_groceries, Adhere_shop_other, Sx_covid_nomissing)

print(t3_top)

## # A tibble: 32 x 6
## # Groups:   Ever_covid [2]
##       Ever_covid      Adhere_meet_friends Adhere_shop_gro…     Adhere_shop_other
##        <dbl+lbl>                <dbl+lbl>        <dbl+lbl>             <dbl+lbl>
##  1 0 [Think hav… 0 [Reported meeting fri… 0 [Twice or mor… 0 [Reported shopping…
##  2 0 [Think hav… 0 [Reported meeting fri… 0 [Twice or mor… 0 [Reported shopping…
##  3 0 [Think hav… 0 [Reported meeting fri… 0 [Twice or mor… 1 [Reported not shop…
##  4 0 [Think hav… 0 [Reported meeting fri… 0 [Twice or mor… 1 [Reported not shop…
##  5 0 [Think hav… 0 [Reported meeting fri… 1 [Once or less] 0 [Reported shopping…
##  6 0 [Think hav… 0 [Reported meeting fri… 1 [Once or less] 0 [Reported shopping…
##  7 0 [Think hav… 0 [Reported meeting fri… 1 [Once or less] 1 [Reported not shop…
##  8 0 [Think hav… 0 [Reported meeting fri… 1 [Once or less] 1 [Reported not shop…
##  9 0 [Think hav… 1 [Reported not meeing … 0 [Twice or mor… 0 [Reported shopping…
## 10 0 [Think hav… 1 [Reported not meeing … 0 [Twice or mor… 0 [Reported shopping…
## # … with 22 more rows, and 2 more variables: Sx_covid_nomissing <dbl+lbl>,
## #   n <int>

This was not what I wanted and the data was very confusing to analyse, being so long. I decided to first pivot_longer this data to understand what had happened. To make sure I wasn’t repeated myself (applying the principles of DRY) within the pivot_longer function I used the “:” symbol instead of listing all variables. I then told R where to pivot what data and added the “count()” function.

 t3_top <- assignment_data %>% 
   pivot_longer(Adhere_shop_groceries:Sx_covid_nomissing, names_to = "vars" , 
                values_to = "values") %>%
   group_by(vars, values) %>% 
   count(name = "number")
  print(t3_top)

## # A tibble: 8 x 3
## # Groups:   vars, values [8]
##   vars                  values number
##   <chr>                  <dbl>  <int>
## 1 Adhere_meet_friends        0    878
## 2 Adhere_meet_friends        1   5271
## 3 Adhere_shop_groceries      0   2389
## 4 Adhere_shop_groceries      1   3760
## 5 Adhere_shop_other          0   1833
## 6 Adhere_shop_other          1   4316
## 7 Sx_covid_nomissing         0   2517
## 8 Sx_covid_nomissing         1   3632

This found the total count for each of the variables. All data was reproduced except the variable measuring participants ability to identify the common covid symptoms. In this data, this is the “Sx_covid_nomissing” variable. We double checked our code, and Jenny also got the same output. Then as a team we brainstormed reasons for this. In the paper they did mention excluding participants, so we hypothesised that they excluded participants after publishing this data set. Hence, why our count is larger than theirs (Smith et al. 2020). This was a breakthrough moment in regards to my idea of what causes errors in reproducibility. I previously thought papers that couldn’t be reproduced were due to experimenters trying to make us believe conclusions that were obviously not true, whereas actually things such as unclear exclusion criteria and rounding errors cause errors in reproducibility.

I wanted to remove all missing values using “na.omit”, in-case this was causing the difference in count. I added the “mutate” function to mutate my count into a percentage. The percentage code is quite self-explanatory which is one thing I really like about the tidyverse and dplyr package.

 t3_top <- assignment_data %>% 
   pivot_longer(Adhere_shop_groceries:Sx_covid_nomissing, names_to = "vars" , 
                values_to = "values") %>%
   group_by(vars, values) %>% 
   count(name = "number") %>%
  na.omit %>%
  mutate(percentage = round(number/sum(number) * 100,1))

  print(t3_top)

## # A tibble: 8 x 4
## # Groups:   vars, values [8]
##   vars                  values number percentage
##   <chr>                  <dbl>  <int>      <dbl>
## 1 Adhere_meet_friends        0    878        100
## 2 Adhere_meet_friends        1   5271        100
## 3 Adhere_shop_groceries      0   2389        100
## 4 Adhere_shop_groceries      1   3760        100
## 5 Adhere_shop_other          0   1833        100
## 6 Adhere_shop_other          1   4316        100
## 7 Sx_covid_nomissing         0   2517        100
## 8 Sx_covid_nomissing         1   3632        100

Now I needed to turn this tibble into a gt table. First this involved turning variables into “factors” so they could be pivoted. I used the function “zap_lables” to remove all data labels as they would need to be renamed. All numeric data was mutated to a factor, using the functions “is.numeric” which means if the data is numerical, make it a factor “as.factor”. I then changed the names of variables, using the “labels = c()” function.

 table3 <- covid_3 %>% 
    pivot_wider(names_from = vars, values_from = values) %>% 
    zap_labels() %>% mutate_if(is.numeric, as.factor) 

   table3$Ever_covid <- as.factor(table3$Ever_covid)
  table3$Ever_covid <- factor(table3$Ever_covid, labels = c("Think have not had COVID-19", 
                                                            "Think have had COVID-19"))
  table3$Adhere_shop_groceries <- factor(table3$Adhere_shop_groceries, 
                                         labels = c("On one or fewer days in the last week, n = 2389",
                                                    "On two or more days in the last week, n = 3760" ))
  table3$Adhere_meet_friends <- factor(table3$Adhere_meet_friends,
                                       levels = c(1,0),
                                       labels = c("Not at all in the last week, n = 5271" ,
                                                  "On one or more days in the last week, n = 878"))
  table3$Adhere_shop_other <- factor(table3$Adhere_shop_other, 
                                     labels = c("Not at all in the last week, n = 1833",
                                                "On one or more days in the last week, n = 4316" 
                                                 ))
  table3$Sx_covid_nomissing <- factor(table3$Sx_covid_nomissing, 
                                      labels = c("Did not correctly identify symptoms, n = 2157", 
                                                 "Correctly identified common symptoms, n = 3632"))

table3$`Participant Characteristics` <- as.factor(table3$`Participant Characteristics`)
  table3$`Participant Characteristics` <- factor(table3$`Participant Characteristics`, labels = c(
    "Meet up with family and friends",
    "Shopping for groceries/pharmacy", 
    "Shopping for items other than groceries or pharmacy",
    "Correct identification of main symptoms"))

Finally I pivoted the variables, telling R where the categorical and numerical data goes. I then selected the calculations I wanted R to produce (“-number” and “-percentage” as they were now called) and finally I added pivot wider so “Ever_covid” was a column with counts and percentages beneath - as they appear in table 3.

  table3 <- table3 %>% 
    pivot_longer(Adhere_meet_friends:Sx_covid_nomissing, 
                 names_to = "Participant Characteristics", 
                 values_to = "Levels") %>% 
    na.omit %>% 
    mutate(Real = paste0(number,"(",percentage,")")) %>% 
    select(-number,-percentage) %>%
    pivot_wider(id_cols = -Real, names_from = Ever_covid, values_from = Real)

We wanted to have multiple spanner columns for the overall count for variables such as shopping for groceries, meeting friends etc, however the gt package would not allow for this. We asked about this in tutorial and were given the very helpful information that rather than reproducing the format of the tables, we could improve them! So we added some aesthetics and left the table with one spanner column at the top. We added the count number to the variable names and used “tab_style” to specify the style, text size (cell_text), and the colour of cells (cell_fill).

 table3 <- table3 %>% 
    gt(rowname_col = "Levels",
       groupname_col = "Participant Characteristics") %>% 
    tab_spanner(label = md("**Had COVID-19**"), 
                columns = vars("Think have not had COVID-19",
                               "Think have had COVID-19")) %>% 
    row_group_order(c("Shopping for groceries/pharmacy",
                      "Shopping for items other than groceries or pharmacy",
                      "Meet up with family and friends",
                      "Correct identification of main symptoms" )) %>% 
    
    cols_label(
      'Think have not had COVID-19' = md("**Think have not had COVID-19**"),
      'Think have had COVID-19' = md( "**Think have had COVID-19**")) %>% 
    tab_style(
      style = list(
        cell_text(weight = "bold"),
        cell_fill()),
      locations = cells_row_groups())

Again, to make it super simple for anyone wanted to reproduce my reproduced statistics I have put all of the code in one chunk that can be ran all at once.

 table3 <- covid_3 %>% 
    pivot_wider(names_from = vars, values_from = values) %>% 
    zap_labels() %>% mutate_if(is.numeric, as.factor) 


  table3$Ever_covid <- as.factor(table3$Ever_covid)
  table3$Ever_covid <- factor(table3$Ever_covid, labels = c("Think have not had COVID-19", 
                                                            "Think have had COVID-19"))
  table3$Adhere_shop_groceries <- factor(table3$Adhere_shop_groceries, 
                                         labels = c("On one or fewer days in the last week, n = 2389",
                                                    "On two or more days in the last week, n = 3760" ))
  table3$Adhere_meet_friends <- factor(table3$Adhere_meet_friends,
                                       levels = c(1,0),
                                       labels = c("Not at all in the last week, n = 5271" ,
                                                  "On one or more days in the last week, n = 878"))
  table3$Adhere_shop_other <- factor(table3$Adhere_shop_other, 
                                     labels = c("Not at all in the last week, n = 1833",
                                                "On one or more days in the last week, n = 4316" 
                                                 ))
  table3$Sx_covid_nomissing <- factor(table3$Sx_covid_nomissing, 
                                      labels = c("Did not correctly identify symptoms, n = 2157", 
                                                 "Correctly identified common symptoms, n = 3632"))
  
  table3 <- table3 %>% 
    pivot_longer(Adhere_meet_friends:Sx_covid_nomissing, 
                 names_to = "Participant Characteristics", values_to = "Levels") %>% 
    na.omit %>% 
    mutate(Real = paste0(number,"(",percentage,")")) %>% 
    select(-number,-percentage) %>%
    pivot_wider(id_cols = -Real, names_from = Ever_covid, values_from = Real) 
    
    table3$`Participant Characteristics` <- as.factor(table3$`Participant Characteristics`)
  table3$`Participant Characteristics` <- factor(table3$`Participant Characteristics`, 
                                                 labels = c(
    "Meet up with family and friends","Shopping for groceries/pharmacy", 
    "Shopping for items other than groceries or pharmacy",
    "Correct identification of main symptoms"))
  
  table3 <- table3 %>% 
    gt(rowname_col = "Levels",
       groupname_col = "Participant Characteristics") %>% 
    tab_spanner(label = md("**Had COVID-19**"), 
                columns = vars("Think have not had COVID-19",
                               "Think have had COVID-19")) %>% 
    row_group_order(c("Shopping for groceries/pharmacy",
                      "Shopping for items other than groceries or pharmacy",
                      "Meet up with family and friends",
                      "Correct identification of main symptoms" )) %>% 
    
    cols_label(
      'Think have not had COVID-19' = md("**Think have not had COVID-19**"),
      'Think have had COVID-19' = md( "**Think have had COVID-19**")) %>% 
    tab_style(
      style = list(
        cell_text(weight = "bold"),
        cell_fill()),
      locations = cells_row_groups())
  
  print(table3)

For reference, here is table 3 and what we were trying to reproduce (Smith et al. 2020).

Graph

Finally I tried to reproduced the graph:

First I loaded ggplot2, which was a package for plotting graphs in R that I had used in Navarro’s (2020) tutorial. Ggplot2 has functions that are human readable, self explanatory and it allows for many aesthetic specifications. To find the percentages I used the “mutate(percentage)” function, grouping by “ever_covid” however I was getting lots of different percentages for each variables. When i looked closer, one percentage for each variable was correct, and after looking deeper into the paper I realised that the researchers only reported “extreme” answers on the scales (Smith et al. 2020). For example (see image below), for the variable q8 have immunity, the percentages reported in the graph were for answers of extreme agreeance which corresponds only to values “[5]” - therefore 12% and 2%. I checked the percentages by looking at the graph as they were not written anywhere in the paper (Smith et al. 2020).

I then used the mutate function to tell R which case each variable is corresponding to. I zapped the labels so I could label them within the “filter” function. Starting with the going out total variable, I used the mutate function, followed by telling R which variable to mutate “going_out_total” and to mutate it when (“case_when”), the variable was “>” greater than 7. I used this process for “q8 have immunity for extreme values (”== 5“) and for the”Sx_covid_nomissing" for extreme low values (“==0”). Finally I mutated the Adhere variables using the “starts with” function to not repeat myself. I was getting lots of errors with this code, so I passed it onto a teammate who found a solution on stack overflow to add TRUE ~ 0 to the code for for going_out_total and TRUE ~ values = 1 for the adhere variables (StackOverflow 2016).

 graph_values_real <- assignment_data %>% 
   zap_labels() %>% 
   mutate(Going_out_total = case_when(Going_out_total > 7 ~ 1, TRUE ~0)) %>% 
   mutate(Percentage = round(number/sum(number) * 100,1)) %>% 
   filter(case_when(vars == "q8haveimmunity" ~ values == 5,
                    vars == "Sx_covid_nomissing" ~ values == 0 , 
                    startsWith(vars,'Adhere') ~ values ==0,
                    TRUE ~ values == 1)) %>% 
  mutate(qcovid = case_when(Ever_covid==1 ~ "Think have had COVID-19", 
                            TRUE ~ "Think have not had COVID-19"))

print(graph_values_real)

Finally, I added the pivot longer function to create a long wise table that could be used to plot a bar graph. I grouped by Ever_covid again, put in my code that found the count and percentages and omitted missing values. Here is the code for finding the percentages for the graph

 graph_values_real <- assignment_data %>% 
   zap_labels() %>% 
   mutate(Going_out_total = case_when(Going_out_total > 7 ~ 1, TRUE ~0)) %>% 
   pivot_longer(q8haveimmunity:Sx_covid_nomissing, names_to = "vars", 
                values_to = "values") %>%
   group_by(Ever_covid,vars,values) %>% 
   count(name = "number") %>% 
   na.omit %>% 
   group_by(Ever_covid, vars) %>% 
   mutate(Percentage = round(number/sum(number) * 100,1)) %>% 
   filter(case_when(vars == "q8haveimmunity" ~ values == 5,
                    vars == "Sx_covid_nomissing" ~ values == 0 , 
                    startsWith(vars,'Adhere') ~ values ==0,
                    TRUE ~ values == 1)) %>% 
   select(vars, Ever_covid, Percentage) %>% 
  mutate(qcovid = case_when(Ever_covid==1 ~ "Think have had COVID-19", 
                            TRUE ~ "Think have not had COVID-19"))

print(graph_values_real)

## # A tibble: 18 x 4
## # Groups:   Ever_covid, vars [18]
##    vars                  Ever_covid Percentage qcovid                     
##    <chr>                      <dbl>      <dbl> <chr>                      
##  1 Adhere_meet_friends            0        9.8 Think have not had COVID-19
##  2 Adhere_shop_groceries          0       36.5 Think have not had COVID-19
##  3 Adhere_shop_other              0       24.8 Think have not had COVID-19
##  4 Going_out_total                0       37.6 Think have not had COVID-19
##  5 q10arisk                       0        3.2 Think have not had COVID-19
##  6 q10brisk                       0        1   Think have not had COVID-19
##  7 q8haveimmunity                 0        2.5 Think have not had COVID-19
##  8 q9worry                        0        2.5 Think have not had COVID-19
##  9 Sx_covid_nomissing             0       37.1 Think have not had COVID-19
## 10 Adhere_meet_friends            1       28.3 Think have had COVID-19    
## 11 Adhere_shop_groceries          1       46.1 Think have had COVID-19    
## 12 Adhere_shop_other              1       45.3 Think have had COVID-19    
## 13 Going_out_total                1       51.2 Think have had COVID-19    
## 14 q10arisk                       1        3.9 Think have had COVID-19    
## 15 q10brisk                       1        1.3 Think have had COVID-19    
## 16 q8haveimmunity                 1       12.1 Think have had COVID-19    
## 17 q9worry                        1        6   Think have had COVID-19    
## 18 Sx_covid_nomissing             1       52.8 Think have had COVID-19

Finally I created a new data set in my global environment called “graph_plot” and used the data above from “graph_values_real”. I used ggplot to define my aesthetics. In this case all categorical variables (except Ever_covid) will go on the x axis, percentage will go on the y axis and Ever_covid will fill the colours of the bars, using different colours for the different conditions.

graph_plot <- graph_values_real %>% 
  group_by(Ever_covid) %>%
  ggplot(aes(x=vars, y=Percentage, fill=factor(Ever_covid))) +
   geom_bar()

My team mate (Bart) handed me back some specifications he had added to my code, including “position = position dodge” which makes sure there is sufficient room between the bars and “stat =”identity" which tells R to not cluster the data and just use the points specifically asked for. I then added the “scale_y_continuous” function to define the limits of the y axis “(0-60)”. Scale_fill_grey and theme_bw() make the bars grey and everything else black and white. The “Theme(axis.text.x)” tells R to change the x axis, the angle of the text and position on vertical axis, which I specified using “vjust” and “hjust”. “Legend.position” is pretty self-explanatory and all of the “axis title” functions specify the titles with “= element_blank()” meaning noo title and “=element_text()” specifying a title. I learnt these functinos from a website (Wickham et al. 2021a).

library(ggplot2)

graph_plot <- graph_values_real %>% 
  ggplot(aes(fill=factor(Ever_covid), x=vars, y=Percentage)) +
   geom_bar(position = position_dodge(width = 0.8),
            stat = "identity", width = 0.6) + 
  scale_y_continuous(limits = c(0,60), breaks = c(0,10,20,30,40,50,60)) +
   theme(axis.text.x = element_text(angle = 70, vjust = 1, hjust = 1), 
         legend.position = "bottom",
         legend.title = element_blank(),
         axis.title.x =  element_blank(),
         axis.title.y = element_text(vjust = 4)) + 
  scale_x_discrete(labels = c('meet up with friends/family they do not live with',
                              'Shopping for groceries/pharmacy two or more times',
                              'Shopping for items other than grocery/pharmacy', 
                              "Out of home activity, 8 or more outings", 
                              "percieved no risk at all to self", 
                              "percieved no risk at all to others", 
                              "strongly agree they are immune", 
                              "Not worried at all", "Did not identify main symptoms")) + 
  scale_fill_discrete(labels = c("Think they have not had covid-19", 
                                 "Think they have had covid-19")) + scale_fill_grey()

## Scale for 'fill' is already present. Adding another scale for 'fill', which
## will replace the existing scale.

print(graph_plot)

For reference here is the graph the researchers created (Smith et al. 2020).

We identified in our learning logs that this graph, whilst it was reproduced, was not the graph we would have created if it was our research. Jenny Richmond then gave us the challenge to then improve this graph.

Some issues with this graph are that you can’t take anything away from this graph at first glance and it can be misleading. Considering the severity of covid-19 and the need for accurate information, figures need to be easy to read at a glance, and this one is not. Secondly, the graph displays so many different measures, which makes it difficult to read. Thirdly the graph only uses a percentage for extreme values, which is a large waste of data and doesn’t speak to the majority of individuals.

I had the idea to add the percentages to the bars of the graph so that it would be easier to read. A teammate coded this idea, however we had issues getting the numbers large enough to read but small enough to fit on the bars.

This first idea also didn’t solve the issues with the first graph. I then came up with the idea to create a line that showed the mean adherence, that went across the graph. I created this by adding “geom_abline(yvalue)” to the code however the line was misleading, in that it appeared to show the average across all of the variables, rather than just the adhere variables.

This idea lead to the solution of creating a graph plotting only the adherence values, against those who think they have and haven’t had covid-19 - as this was the main aim of this study. First I created this graph using a percentage as the y axis, as I had already calculated this.

I then considered that the use of percentage was one of those reasons the first graph was unsuccessful. What does overall percentage of participants who went grocery shopping on 2 or more occasions, met friends who are not within their household and went shopping on two or more occasions for something other than groceries, really tell us. I decided I would plot the average Adherence instead. First I created a new data set called “graph2” using the overall data. I grouped by Ever_covid, summarised across all “adhere” values and listed the calculations i wanted R to do, rounding by 2 decimal places within the mutate function.

graph2 <- assignment_data %>% 
   group_by(Ever_covid) %>% 
   summarise(across(c(starts_with("adhere")), list(mean = mean))) %>%
   mutate(across(everything(), round,2))

print(graph2)

## # A tibble: 2 x 4
##   Ever_covid Adhere_shop_groceries_m… Adhere_shop_other_m… Adhere_meet_friends_…
##        <dbl>                    <dbl>                <dbl>                 <dbl>
## 1          0                     0.63                 0.75                  0.9 
## 2          1                     0.54                 0.55                  0.72

I then changed this tibble to a long format so I could plot it easily using ggplot2. As usual, this meant selecting all of the variables, using the “end_with()” function. I then told R where to put categorical (“names_to =”) and numerical (“values_to =”) data.

graph2mean <- graph2 %>% 
   select(ends_with("mean"), Ever_covid) %>% 
   pivot_longer(ends_with("mean") ,  names_to = "Mean_vars", values_to = "Mean_values")
print(graph2mean)

## # A tibble: 6 x 3
##   Ever_covid Mean_vars                  Mean_values
##        <dbl> <chr>                            <dbl>
## 1          0 Adhere_shop_groceries_mean        0.63
## 2          0 Adhere_shop_other_mean            0.75
## 3          0 Adhere_meet_friends_mean          0.9 
## 4          1 Adhere_shop_groceries_mean        0.54
## 5          1 Adhere_shop_other_mean            0.55
## 6          1 Adhere_meet_friends_mean          0.72

Creating the graph I started with grouping by Ever_covid as always, then adding the ggplot function and defining the aesthetics. For this graph the x axis was the “Mean_vars” and the y axis was “mean_values”. Same as the graph provided in the paper I still wanted separate bars for those who have and havent had covid, so i used the fill = “ever_covid” function.

graph22 <- graph2mean %>% 
  group_by(Ever_covid) %>%
  ggplot(mapping = aes(x = Mean_vars, y = Mean_values, fill = (Ever_covid))) +
   geom_col()
   
print(graph22)

This bar graph was not what i was expecting. Rather than dividing the ever covid variable into two separate bars it created a stacked box plot. I googled how to unstack a ggplot however I couldn’t find any answers. I decided to google how to create one, and you had to turn scaled variables into factors. So to unstack my boxplot i tried to make my scaled variable; “Ever_covid”, a factor.

graph22$ever_covid <- as.factor(graph22$ever_covid)

I then passed this code over to a group member who added position = dodge2 to the geom_col, which tells R to leave a gap between the bars, and keep them the same size. I added the function “labs()” to define the title of the graph. I also added the “theme” function which allowed me to define the legend title, axis titles, size and position. To change the names of variables on the x axis I used the “scale_x_discrete” function. I found these functions from the sthda website (Statistical tools for high-throughput data analysis, 2020).

graph22 <- graph2mean %>% 
  group_by(Ever_covid) %>%
  ggplot(mapping = aes(x = Mean_vars, y = Mean_values, fill = factor(Ever_covid))) +
   geom_col(position = "dodge2") + 
  labs(title = "Whether perception of covid-19 impacts obedience to lockdown measures", 
       y = "Average self-reported adhereance") + 
  theme(legend.position = ("bottom"), axis.title.x = element_blank(), 
        legend.title = element_blank(), axis.text.x = element_text(vjust = 1, hjust = 1)) + 
  scale_x_discrete(labels = c("meeting friends", "grocery/pharmacy shopping", 
                              "shopping for not grocery/pharmacy")) + 
  scale_fill_discrete(labels = c("Think they have not had covid-19", 
                                 "Think they have had covid-19")) + 
  scale_y_continuous(limits = c(0,1))
   
print(graph22)

To make it easy for someone else to run my code for the improved graph I have provided all of the code in this chunk.

library(ggplot2)
library(tidyverse)

graph2 <- assignment_data %>% 
   group_by(Ever_covid) %>% 
   summarise(across(c(starts_with("adhere")),
                    list(mean = mean)))
 
 graph2 <- graph2 %>% 
   mutate(across(everything(), round,2))
 
 graph2mean <- graph2 %>% 
   select(ends_with("mean"), Ever_covid) %>% 
   pivot_longer(ends_with("mean") ,  
                names_to = "Mean_vars", values_to = "Mean_values")

graph22 <- graph2mean %>% 
  group_by(Ever_covid) %>%
  ggplot(mapping = aes(x = Mean_vars, y = Mean_values, fill = factor(Ever_covid))) +
   geom_col(position = "dodge2") + 
  labs(title = "Whether perception of covid-19 impacts obedience to lockdown measures", 
       y = "Average self-reported adhereance") + 
  theme(legend.position = ("bottom"), 
        axis.title.x = element_blank(), 
        legend.title = element_blank(), 
        axis.text.x = element_text(vjust = 1, hjust = 1)) + 
  scale_x_discrete(labels = c("meeting friends", "grocery/pharmacy shopping", 
                              "shopping for not grocery/pharmacy")) + 
  scale_fill_discrete(labels = c("Think they have not had covid-19", 
                                 "Think they have had covid-19")) + 
  scale_y_continuous(limits = c(0,1))
   
print(graph22)

Part 3 Exploration

Exploration 1

I have chosen to firstly investigate whether perceived immunity to covid-19 differs among those who have been tested for covid-19. I was prompted to consider this as the paper found that 24.3% of participants thought they have had COVID-19 but only 4.0% reported having received a positive test result (Smith et al. 2020). This made me wonder whether the greater perceived immunity was largely due to participants who had been given a negative result or who havent been tested at all. I hypothesise that those who have not gotten a negative test result will have the greatest perceived immunity, followed by those who have not been tested and finally those who received a positive result will have the lowest perceived immunity.

First I had to install and load some packages. The new packages used are library(rstatix) which allows for statistical analysis, library(RColourBrewer) whcich allows for colour specifications and library(corrplot) which allows for correlational analysis.

library(ggplot2)
library(tidyverse)
library(haven)
library(tidyverse)
library(dplyr)
library(ggplot2)
library(corrplot)

## corrplot 0.84 loaded

library(ggpubr)
library(rstatix)

## 
## Attaching package: 'rstatix'

## The following object is masked from 'package:stats':
## 
##     filter

library(gt)
library(RColorBrewer)

I first set out to find some descriptive statistics for this relationship. I used the “group_by” function, this time to get statistics that would determine between different levels of covid testing. I then selected perceived immunity (q8haveimmunity)" and used the “summarise” function to find the mean and sd. I learnt these functions in Jenny’s tutorials. I added the gt function, determined what variables would be columns and rounded by 2 decimal places.

investigate5 <- assignment_data %>%
  group_by(q7beentested) %>%
  select(q8haveimmunity) %>%
  summarise(mean = mean(q8haveimmunity), sd = sd(q8haveimmunity), n=n(),
            se = sd/sqrt(n))

## Adding missing grouping variables: `q7beentested`

investigate5 %>%
  gt() %>% fmt_number(columns = c(mean, sd, n, se), decimals = 2)

q7beentested	mean	sd	n	se
0	2.48	1.02	5,574.00	0.01
1	3.62	0.92	330.00	0.05
2	4.19	0.86	245.00	0.06

I mutated the variable q7beentested so these statistics were readable to someone who hasnt seen the data.

investigate5 <- investigate5 %>%
  mutate(q7beentested = case_when(q7beentested == 0 ~ "Not tested",
                            q7beentested == 1 ~ "Negative result",
                            q7beentested == 2 ~"Positive result"))

Next I ran a correlational analysis to see the direction and magnitude of the correlation. I also installed the “report” package which Jenny Richmond told us about. This package provides an explanation of your statistical analysis. First I created a new data set. I then told R that in this new data set I wanted “<-” a correlational test “cor.test” between q8haveimmunity and the q7beentested variable which are from the same dataset. Finally I told R to report on this analysis. I had to include previous code for this chunk of code to run.

investigate5 <- assignment_data %>%
  group_by(q7beentested) %>%
  select(q8haveimmunity) %>%
  summarise(mean = mean(q8haveimmunity), sd = sd(q8haveimmunity), n=n(),
            se = sd/sqrt(n))

## Adding missing grouping variables: `q7beentested`

library(report)

my_correlation <- cor.test(investigate5$mean, investigate5$q7beentested)
report(my_correlation)

## Effect sizes were labelled following Funder's (2019) recommendations.
## 
## The Pearson's product-moment correlation between investigate5$mean and investigate5$q7beentested is positive, statistically not significant, and very large (r = 0.98, t(1) = 5.34, p = 0.118)

From this analysis we can conclude that there is a positive correlation between the level of testing and perceived immunity (as level of testing increases, so does perceived immunity). This difference is large and statistically significant (r = 0.98, p = 0.118).

I chose to create a column graph as there was an interesting difference between categorical variables. For the graph I started by defining which variable I was grouping by, followed by defining the aesthetics in ggplot, with the different levels of testing on the x axis, and mean perceived immunity on the y. Finally I told R to make a column graph “geom_col”.

graph_invest5 <- investigate5 %>%
  group_by(q7beentested) %>%
  ggplot(mapping = aes(x = q7beentested, y = mean)) + geom_col()

print(graph_invest5)

## Don't know how to automatically pick scale for object of type haven_labelled/vctrs_vctr/double. Defaulting to continuous.

I then fixed the axis labels and added an error bar using “geom_errorbar(aes(ymin=meann-sd, ymax=meann+sd))” which i learned in Navarros you tube series (2020). However one of the error bars wasn’t showing:

I realised this was because my y axis was too small, so i fiddled with ylim function (starting with 0-10 and then adjusting until i could see how long the error bar was). I added the “labs” and “theme” functions for titles and text size. I wanted to add colour to my columns so i added the fill function to my mapping and a legend. I also found a package called RColourBrewer with the function “Scale_fill_brewer(Blues)” (Wickham Navarro & Pederson, 2021). My code wasn’t printing a graph and I was getting the error “continuous variable added to discrete scale”. I googled this error and found the solution to add “as.factor” before the variable “q7beentested” in the mapping (Stack Overflow, 2016).

investigate5 <- investigate5 %>%
  mutate(q7beentested = case_when(q7beentested == 0 ~ "Not tested",
                            q7beentested == 1 ~ "Negative result",
                            q7beentested == 2 ~"Positive result"))

graph_invest5 <- investigate5 %>%
  ggplot(mapping = aes(x = q7beentested, y = mean, fill= as.factor(q7beentested))) +
  geom_col() + geom_errorbar(aes(ymin=mean-sd, ymax=mean+sd)) +
  labs(title = "Percieved immunity to covid-19 for those that
       have and have not been tested for covid-19", 
                x = "covid-19 test result", 
       y = "Average percieved immunity to covid-19 (0-5)", 
       fill = "level of covid-19 testing") + 
           ylim(0,5.1) + scale_fill_brewer(palette="Blues") + 
           theme(plot.title = element_text(size = 10)) + 
  theme(legend.text = element_text(size = 8))

print(graph_invest5)

Whilst using a column graph is effective in displaying the data, it seemed to be a waste of the large amount of data I had. So i decided to create a boxplot. First I tried making a box plot by just adding “geom_boxplot()”.

graph_invest7 <- investigate5 %>%
  group_by(q7beentested) %>%
  ggplot(mapping = aes(x = q7beentested, y = q8haveimmunity, fill=q7beentested)) + geom_boxplot() + geom_errorbar(aes(ymin=mean-sd, ymax=mean+sd))

This result is not what I was looking for. In the tutorials I learned that to create a boxplot i had to remove the group_by variable and then plot the data. I used the same code just removing “group_by” and adding “geom_boxplot(alpha = 0.05)”. I used the stdha website to get code for the aesthetics such as the mean point shape (Statistical tools for high-throughput data analysis (2020).

library(RColorBrewer)
library(tidyverse)

investigate5 <- assignment_data %>%
  select(q8haveimmunity, q7beentested) %>%
  mutate(q7beentested = case_when(q7beentested == 0 ~ "Not tested",
                            q7beentested == 1 ~ "Negative result",
                            q7beentested == 2 ~"Positive result"))

graph_invest6 <- investigate5 %>%
  ggplot(mapping = aes(x = q7beentested, y = q8haveimmunity, fill = q7beentested)) + 
  geom_boxplot(alpha = 0.05) + 
  labs(title = "Percieved immunity to covid-19 for those that 
       have and have not been tested for covid-19", 
       x = "covid-19 test result", 
       y = "Average percieved immunity to covid-19 (0-5)") + 
  ylim(0,5.1) + theme(legend.title = element_blank()) + 
  theme(plot.title = element_text(size = 10)) + 
  theme(legend.text = element_text(size = 8)) + 
  stat_summary(fun=mean, geom="point", shape=8, size=2, color="black")

print(graph_invest6)

This was getting close, however the fill colours were too light. So i googled and found a function called “scale_Fill_hue” which allowed me to adjust the luminance and colour (Holtz, 2018).

library(RColorBrewer)
library(tidyverse)

investigate5 <- assignment_data %>%
  select(q8haveimmunity, q7beentested) %>%
  mutate(q7beentested = case_when(q7beentested == 0 ~ "Not tested",
                            q7beentested == 1 ~ "Negative result",
                            q7beentested == 2 ~"Positive result"))

graph_invest6 <- investigate5 %>%
  ggplot(mapping = aes(x = q7beentested, y = q8haveimmunity, fill = q7beentested)) + 
  geom_boxplot(alpha = 0.05) + scale_fill_hue(l=500, c=5000) + 
  labs(title = "Percieved immunity to covid-19 for those that 
       have and have not been tested for covid-19", 
       x = "covid-19 test result", 
       y = "Average percieved immunity to covid-19 (0-5)") + 
  ylim(0,5.1) + theme(legend.title = element_blank()) + 
  theme(plot.title = element_text(size = 10)) +
  stat_summary(fun=mean, geom="point", shape=8, size=2, color="black") +
  theme(legend.position = "none")

print(graph_invest6)

It is clear from this figure, and the correlation coefficient that my hypothesis was completely incorrect. I have found that those who have been tested for covid-19 have greater perceived immunity compared to those who have not been tested. And of those who have been tested, those who received a positive test result were more likely to believe they are immune to the virus. This difference is of statistical significance and there is a strong, positive correlation between level of covid-19 testing and perceived immunity.

Exploration 2

Next I chose to investigate whether individuals ability to identify cough and fever as the two main covid-19 symptoms differs across age groups. I wanted to investigate this after reading the paper “COVID-19 and digital inequalities: Reciprocal impacts and mitigation strategies” which discussed the idea of inequities in health literacy due to lack of technological abilities in older individuals (Beaunoyer, Dupere & Guitton, 2020). I wondered whether this idea would be demonstrated in this data set, with older individuals less aware of covid-symptoms.

I again started by obtaining a mean, sd, se and total count for ability to identify symptoms across age groups. The variable for ability to identify the main covid-19 symptoms is “sx_covid_nomissing” in this data set. I used a similar code as for exploration 1, just substituting the variables.

investigate2 <- assignment_data %>%
  group_by(age_categories) %>%
  select(Sx_covid_nomissing) %>%
  summarise(mean = mean(Sx_covid_nomissing), sd = sd(Sx_covid_nomissing), n=n(),
            se = sd/sqrt(n))

## Adding missing grouping variables: `age_categories`

print(investigate2)

## # A tibble: 5 x 5
##       age_categories  mean    sd     n     se
##            <dbl+lbl> <dbl> <dbl> <int>  <dbl>
## 1 1 [18-24 years]    0.470 0.499  1422 0.0132
## 2 2 [25 to 34 years] 0.529 0.499  1223 0.0143
## 3 3 [35 to 44 years] 0.622 0.485  1045 0.0150
## 4 4 [45 to 54 years] 0.636 0.481   718 0.0180
## 5 5 [55+ years]      0.694 0.461  1741 0.0110

I then ran a correlational analysis for these variables. I used the same process as my first investigation, just substituting different variables and data.

library(report)
my_correlation <- cor.test(investigate2$mean, investigate2$age_categories)
report(my_correlation)

## Effect sizes were labelled following Funder's (2019) recommendations.
## 
## The Pearson's product-moment correlation between investigate2$mean and investigate2$age_categories is positive, statistically significant, and very large (r = 0.98, 95% CI [0.73, 1.00], t(3) = 8.60, p < .01)

From this analysis we can conclude that there is a positive, very large and statistically significant correlation between age and ability to identify covid-19 symptoms (r = 0.98, p < 0.01). This means that as age increases so does ones ability to identify cough and fever as the two main covid-19 symptoms.

I then had to mutate the age categories to make the data readable.

investigate2 <- investigate2 %>%
  mutate(age_categories = case_when(age_categories == 1 ~ "18-24 years", 
         age_categories == 2 ~ "25-34 years",
         age_categories == 3 ~ "35-44 years",
         age_categories == 4 ~ "45-54 years",
         age_categories == 5 ~ "55+ years"))

To make the output easier to read I used the gt() function, specifying the columns and decimal places.

investigate2 <- assignment_data %>%
  group_by(age_categories) %>%
  select(Sx_covid_nomissing) %>%
  summarise(mean = mean(Sx_covid_nomissing), sd = sd(Sx_covid_nomissing), n=n(),
            se = sd/sqrt(n))

## Adding missing grouping variables: `age_categories`

investigate22 <- investigate2 %>%
  gt() %>% fmt_number(columns = c(mean, sd, n, se), decimals = 2)
print(investigate22)

Finally I made a line graph for this data, using ggplot2. I decided to use a line graph so the positive correlation that I found could be easily seen. I started with ggplot, mapping age categories on the x axis and average ability to identify symptoms as the y axis. I then added “geom_point()” which plots the means. The function “labs” allowed me to create titles for the graph as a whole, as well as the x and y axis “x =” “y =”. I also added “geom_line” to create a line through these points. I tried adding an error bar the same way as I did for my column graph, using geom_errorbar(aes(ymin=meann-sd, ymax=meann+sd)) however this didn’t add anything to my graph. After some googling I found that for line graphs you have to use the function geom_pointrange, and identify ymin and ymax as the error bar limits.

investigate2 <- investigate2 %>%
  mutate(age_categories = case_when(age_categories == 1 ~ "18-24 years", 
         age_categories == 2 ~ "25-34 years",
         age_categories == 3 ~ "35-44 years",
         age_categories == 4 ~ "45-54 years",
         age_categories == 5 ~ "55+ years"))

graph_investiagte2 <- investigate2 %>%
  ggplot(mapping = aes(x = age_categories, y = mean)) + 
  geom_point() + geom_line() + labs(title = "Ability to identify main symptoms of COVID-19 across age groups", 
       x = "age group", y = "Average ability to identify COVID-19 symptoms") + geom_pointrange(aes(ymin=mean-sd, ymax=mean+sd))

print(graph_investiagte2)

## geom_path: Each group consists of only one observation. Do you need to adjust
## the group aesthetic?

Aesthetically this was still a little boring. In Jenny’s tutorial we plotted popular babynames and used the actual names of the baby as the dots. I thought I would try this for the age categories. I did this by using the “geom_text” function, defining the label as age_categories. I then played around with the nudge function to move the labels so they were next to the point. Whilst they didn’t replace the point like I thought they would, I was actually happier with this outcome. I also added a caption using “labs(caption)” to make sure the audience knows these lines are representing standard deviations.

library(ggplot2)

graph_investiagte2 <- investigate2 %>%
  group_by(age_categories) %>%
  ggplot(mapping = aes(x = age_categories, y = mean, group = "age_categories")) + 
  geom_point() + geom_line(colour = "blue", linetype = 2) +
  geom_pointrange(aes(ymin=mean-sd, ymax=mean+sd)) + 
  labs(title = "Ability to identify main symptoms of COVID-19 across age groups", 
       x = "age group", 
       y = "Average ability to identify COVID-19 symptoms", 
       caption = "Figure. Error bars represent one standard deviation") + 
  ylim(-0.1,1.25) + 
  geom_text(label = investigate2$age_categories, size=3, nudge_y = 0.1, 
            nudge_x = 0.3) + 
  theme(axis.text.x = element_blank())

After some feedback from Jenny and Kate, I was prompted to see what error bars of standard error would look like, because standard error is more informative, allowing me to see which levels are more different to each other and which points we are more certain of. I changed the sd, to se in the errorbar function as I had already calculated se earlier. I adjusted things such as point size and added tops and bottoms to the bars, to make them easier to see. I also added “linetype = 2” for aesthetics. Because the standard error is so small due to the large sample size, I adjusted ylim so the difference in standard error could be clearly seen.

library(ggplot2)

graph_investiagte23 <- investigate2 %>%
  group_by(age_categories) %>%
  ggplot(mapping = aes(x = age_categories, y = mean, group = "age_categories")) + 
  geom_point() + geom_line(colour = "blue", linetype = 2) + 
  geom_errorbar(aes(ymin=mean-se, ymax=mean+se, width = 0.1)) + 
  labs(title = "Ability to identify main symptoms of COVID-19 across age groups", x = "age group", 
       y = "Average ability to identify COVID-19 symptoms (0-1)", 
       caption = "Figure. Error bars represent standard error") + ylim(0.4,0.75) + 
  geom_text(label = investigate2$age_categories, size=3, nudge_y = -0.01, nudge_x = 0.3) + 
  theme(axis.text.x = element_blank())

print(graph_investiagte23)

It is clear from this graph and correlation analysis that Beaunoyer’s (2020) hypothesis was not replicated in this data. I found that older individuals were more able to identifying covid-19 symptoms. The correlation between age and ability to identify symptoms is very large, positive and statistically significant.

Exploration 3

Finally, I chose to investigate whether adherence to lockdown measures differed across UK regions. I hypothesise that adherence will differ across UK regions, with Wales/Scotland/Northern Ireland having the greatest adherence, followed by North, South and East and finally London having the lowest adherence.

First I found the mean and sd for the adhere variables all together. I used the group_by function for region, the select function for the variables and the summarise “across” function for a group of variables. This meant I had to use the “list” function to tell R the what I wanted it to calculate (mean and sd).

## Adding missing grouping variables: `region`

However this gave me a separate sd for each adherence variable and I wanted one for the adherence overall. I tried using “:” to group the variables however that didn’t work either so I asked for help in the tutorials. Jenny directed me in installing “rstatix” package. I used the function from the rstatix package called “get_summary_stats”, and told R I wanted the “mean_adhereance”, and all of the summary statistics - using the code "type=full.

library(haven)
library(dplyr)
library(rstatix)
assignment_data2 <- assignment_data %>%
  select(region, Adhere_shop_groceries, Adhere_shop_other, Adhere_meet_friends) %>%
  mutate(assignment_data, 
         mean_adhere = rowMeans(select(assignment_data, starts_with("Adhere")), 
                                na.rm = TRUE)) %>%
  select(region, mean_adhere) %>%
  group_by(region) %>%
  get_summary_stats(mean_adhere, type = "full") 

print(assignment_data2)

## # A tibble: 5 x 14
##     region variable     n   min   max median    q1    q3   iqr   mad  mean    sd
##   <dbl+lb> <chr>    <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 [Midl… mean_ad…  1032     0     1  0.667 0.667     1 0.333 0.494 0.724 0.316
## 2 2 [Sout… mean_ad…  1785     0     1  1     0.667     1 0.333 0     0.754 0.308
## 3 3 [Nort… mean_ad…  1455     0     1  0.667 0.667     1 0.333 0.494 0.74  0.306
## 4 4 [Lond… mean_ad…  1000     0     1  0.667 0.333     1 0.667 0.494 0.641 0.362
## 5 5 [Wale… mean_ad…   877     0     1  0.667 0.667     1 0.333 0.494 0.73  0.304
## # … with 2 more variables: se <dbl>, ci <dbl>

Next i had to mutate the region variable so it would be easier to read using the “mutate” and “case_when” functions.

assignment_data2 <- assignment_data2 %>%
  mutate(region = case_when(region == 1 ~ "Midlands",
                            region == 2 ~ "South and East",
                            region == 3 ~ "North",
                            region == 4 ~ "London",
                            region == 5 ~ "Wales/Scot/NI"))

I then tried to plot the data using gpplot, specifying region on the x axis, mean on the y and filling by region. I chose to use a boxplot for this data because it involves a categorical variable and because I got summary statistics for a group of variables I wasnt able to plot individual data points. I used “labs” to define titles and “ylim” to specify the length of the y axis so the error bars could be seen. I added the same “scale_fill_brewer” function as before for consistency among my plots. Finally i used the function “geom_errorbar”. I was getting an error regarding the fill function. I had seen this error before and remembered I had to specify that region was a factor, adding “factor(region)” to the mapping.

graph_invest1 <- assignment_data2 %>%
  ggplot(mapping = aes(x = region, y = mean, fill = factor(region))) +
  geom_col() +
  labs(title = "Adhereance to lockdown measures in different UK Regionos", 
       y = "average adhereance (0-1)") + 
  ylim(0,1.1) + scale_fill_brewer(palette = "Blues") + 
  geom_errorbar(aes(ymin=mean-sd, ymax=mean+sd))

print(graph_invest1)

I put all of the code to create this bar chart in one chunk to keep it simple if someone were to reproduce my analysis.

assignment_data2 <- assignment_data %>%
  select(region, Adhere_shop_groceries, Adhere_shop_other, Adhere_meet_friends) %>%
  mutate(assignment_data, 
         mean_adhere = rowMeans(select(assignment_data, starts_with("Adhere")),
                                na.rm = TRUE)) %>%
  select(region, mean_adhere) %>%
  group_by(region) %>%
  get_summary_stats(mean_adhere, type = "full") 


assignment_data2 <- assignment_data2 %>%
  mutate(region = case_when(region == 1 ~ "Midlands",
                            region == 2 ~ "South and East",
                            region == 3 ~ "North",
                            region == 4 ~ "London",
                            region == 5 ~ "Wales/Scot/NI"))

assignment_data$region <- as.factor(assignment_data$region)

graph_invest1 <- assignment_data2 %>%
  na.omit %>%
  ggplot(mapping = aes(x = region, y = mean, fill = factor(region))) +
  geom_col() +
  labs(title = "Adhereance to lockdown measures in different UK Regions", 
       y = "average adhereance (0-1)") + 
  scale_fill_brewer(palette = "Blues") + 
  geom_errorbar(aes(ymin=mean-sd, ymax=mean+sd)) + ylim(0, 1.1) +
  theme(legend.position = "none")

print(graph_invest1)

It is clear from this graph that adherence does differ across UK regions. From the bar chart we can see that Wales/Scotland/Northern Ireland have the greatest adherence, followed by South and East then North, Midlands and London has the lowest adherence.

Part 4 Recommendation

There are numerous ways in which the researchers of this study could improve the ability for their results to be reproduced. My main suggestions for the researchers are to include a code book, have a more detailed method of their analysis and finally, clearly explain and account for missing data.

If the researchers had included a code book, which had accurate and clear code for obtaining all figures and descriptive in their paper it would be much easier to reproduce. Without a code book, we had to start with a blank canvas and create code from scratch to obtain statistics that we saw in figures. We therefore had to come up with each function for mutating, filtering and formatting the figures that we assumed would give us the desired statistics – without any guidance from the researchers that this was the correct procedure. This meant our process involved lots of trial and error, which is both time consuming and error provoking. It is not clear whether the researchers coded their statistics in the first place so I would suggest they create a code that obtained their statistics. I recommend watching Danny Navarro’s lectures for coding in tidyverse (Navarro, 2020), and coding in R studio, which has built in “help” drop downs which tell you about certain packages and functions. Tidyverse is a very useful package for beginners and it can be used for obtaining all of their descriptive statistics. They should use the dplyr package for getting the descriptive statistics in their tables and ggplot2 for creating the graph as both of these packages are compatible with Tidyverse and use functions that are self-explanatory E.g pivot longer for changing tables to long ways, and group_by, for grouping variables. I would also suggest using the following two websites for trouble shooting (GitHub.Inc, 2021; Wickham, Francois, Henry & Muller, 2021b). I would then suggest they check this code is human and computer readable and works on all machines. Finally, they should attach this code to their research, allowing for their results to be reproduced in a much more efficient and precise way.

Secondly, I would recommend the researchers include a more detailed method of their analysis for each statistic. If the researchers had clearly explained how they obtained each percentage, count, and average statistic, how they filtered each variable, missing data and mutated variables, the reproduction would have been much more efficient and accurate. This analysis would clearly operationalise dependent variables – specifically explaining that 0 = strongly disagree and 5 = strongly agree, and the different scales used for each variable. Without this, I spent lots of time trying to understand the different scales for each variable, filtering each variable, mutating variables into factors and deciding which code to use to get different kinds of statistics (percentages versus count). This was something we had to figure out through trial and error of summarising these variables. If the length of the research paper was an issue for more detail into the analysis of data, I would suggest researchers include this information in the preregistered document or in an extra information file that goes with their paper or data file. I would suggest they do this by repeating their analysis and documenting each step in detail.

Finally, I would suggest the researchers include clear detail of their exclusion criteria and account for any missing data. The total number of participants for the variable of whether participants could identify cough and fever as the main Covid-19 symptoms was not able to be reproduced. Their lack of detail in exclusion criteria created an issue in reproducing Table 3, as the researchers counted 239 and 2632 participants for the variable, however when I went to reproduce this statistic our dataset only had 2157 and 2632 participants. I double checked our coding and even reloaded the data set to check we hadn’t filtered any participants out. I hypothesise the researchers excluded certain participants later in their analysis and did not account for this in their data file. If they had clearly explained when they excluded certain participants and for what reasons, reproducing the statistics for this variable would have been possible. This issue was made even worse when I attempted to create the graph which plotted this variable. Therefore, the fact that I could not reproduce the statistics for this variable also meant the graph could not be completely reproduced. If a clear explanation for exclusion criteria and missing data was included, the reproduction of these figures could have been possible.

References

Beaunoyer, Elisabeth., Dupéré, Sophie., & Guitton, Matthieu J. (2020). COVID-19 and digital inequalities: Reciprocal impacts and mitigation strategies. Computers in Human Behavior, 111, 106424–106424. https://doi.org/10.1016/j.chb.2020.106424

GitHub.Inc (2021). GitHub Blog. Retrieved from https://github.blog/

Holtz, Yan. (2018). R Graph Gallery. Retrieved from https://www.r-graph-gallery.com/index.html

Iannone, Richard. (2020). Great Looking Tables: gt (v0.2). Retrieved from https://blog.rstudio.com/2020/04/08/great-looking-tables-gt-0-2/

Iannone, Richard (2021). gt0.2.2 reference. Retrieved from https://gt.rstudio.com/reference/index.html

Navarro, (2020). Starting R Markdown [video file]. Retrieved from https://www.youtube.com/watch?v=tuFn-sFSVgk&t=1s

Smith, Louise E, Mottershaw, Abigail L, Egan, Mark, Waller, Jo, Marteau, Theresa M, & Rubin, G James. (2020). Correction: The impact of believing you have had COVID-19 on self-reported behaviour: Cross-sectional survey. PloS One, 16(2), e0248076–e0248076. https://doi.org/10.1371/journal.pone.0248076

Stack overflow (2016). Error: Continuous value supplied to discrete scale” in default data set example mtcars and ggplot2. Retrieved from https://stackoverflow.com/questions/43359050/error-continuous-value-supplied-to-discrete-scale-in-default-data-set-example

Stack overflow (2015). Finding percentage in a sub-group using group_by and summarise. Retrieved from https://stackoverflow.com/questions/29549731/finding-percentage-in-a-sub-group-using-group-by-and-summarise/29549927

Statistical tools for high-throughput data analysis, (2020). ggplot2 axis ticks : A guide to customize tick marks and labels Retrieved from https://student.unsw.edu.au/how-do-i-cite-electronic-sources-using-apa-style

Statistical tools for high-throughput data analysis (2020). ggplot point shapes.Retried from http://www.sthda.com/english/wiki/ggplot2-point-shapes

Wickham, Hadley, Chang, Winston. Henry, Lionel., Pederson, Thomas,Lin., Takahashi, Kohske., Woo, Kara, Yutani, Hiroaki., Dunnington, Hiroaki. (2021). Aesthetic specifications. Retrieved from https://ggplot2.tidyverse.org/articles/ggplot2-specs.html

Wickham, Hadley., Romain, François., Henry, Lionel,. & Müller, Kirill. (2021a). Efficiently bind multiple data frames by row and column. Retrieved from https://dplyr.tidyverse.org/reference/bind.htm

Wickham, Hadley., Francois, Romain., Henry, Lionel., & Muller, Kirill. (2021b). Retrieved from https://dplyr.tidyverse.org/

Wickham, Hadley., Navarro, Danielle., & Pederson, Thomas. Lin. ggplot2: Elegant Graphics for Data Analysis. Retrieved from https://ggplot2-book.org/scale-colour.html

Verification Report

Jessica Riley - z5258121