Part 1: Summary and reaction

Summary

“The Emotional Path to Action: Empathy Promotes Physical Distancing and Wearing of Face Masks During the COVID-19 Pandemic” by Pfattheicher et al. (2020) is the first study of its kind to explore the relationship between empathy and adherence to health measures such as physical distancing and wearing of face masks in a COVID-19 specific context. COVID-19 as “an epidemic crisis” (Pfattheicher et al., 2020) necessitates rapid alterations in one’s core social habits in order to manage the uncontrolled growth in infections and reduce strain on the health care system. However, research examining the factors which can promote these changes in a COVID-19 pandemic environment is limited due to the contemporary nature of the issue. Thus, Pfattheicher et al.’s (2020) research in this area addresses an important gap in the literature.

Methods

The experimenters conducted 4 studies to address their research question.

Study 1 recruited 3 non-representative samples from 3 Western countries (USA, UK and Germany) to investigate the correlation between participants’ self-reported affective empathy for those most vulnerable to the virus and their self-reported adherence to physical distancing. To quantify empathy, participants were asked to respond to 3 prompts such as “I feel compassion for those most vulnerable to coronavirus (COVID-19)” on a 5-point Likert scale with 1 being “strongly disagree” and 5 being “strongly agree”. Physical distancing was also measured through responses on a similar 5-point Likert scale to the following question: “Because of coronavirus (COVID-19), I am massively curtailing social contact (so-called ‘social distancing’)”.

Study 2 was a correlational design similar to Study 1, except that it was conducted in a single German sample and participants’ adherence to physical distancing was measured in more concrete terms: rather than asking about “curtailing social contact” in general, items referred to specific circumstances, such as visiting the elderly or visiting places where other people would be (e.g., cafes and church). Participants self-reported their likelihood of visiting these places on a 5-point Likert scale with 1 being “very unlikely” and 5 being “very likely”. Responses were later recoded such that higher numbers represented greater motivation to adhere to physical distancing.

Study 3 tested whether manipulations aimed at promoting empathy can increase motivation to physically distance in a German sample. There were 3 conditions:

  • Information only, where participants were provided with a passage informing them of the prosocial utility of physical distancing in preventing serious illness and strain on the healthcare system
  • Information + empathy, where in addition to this information, participants watched a 1-minute video in which a 91-year-old man describes being unable to visit his chronically ill wife due to the pandemic (this video was pretested and found to elicit high levels of affective empathy)
  • Control, in which neither information nor video was provided to participants

Adherence to physical distancing was measured using the same items as in Study 2.

Finally, to rule out the alternative explanation that affective empathy increases motivation to physically distance by increasing people’s sensitivity to their own vulnerability, Study 4 measured adherence to face-mask wearing instead of motivation to physically distance, given that wearing masks (specifically non-surgical-grade masks) has a greater protective effect for others than for the wearer. There were 3 conditions:

  • Information Only where participants read a text detailing how the virus is transmitted and that face masks can prevent the spread of COVID-19
  • Empathy where participants read a text of similar length in which a woman with a rare autoimmune disease detailed how she was seriously impacted by an infection with the virus, stating that she did not like it when people met others without wearing a face mask
  • Control, where neither text nor information was given

As for measures:

  • State empathy was measured after the condition manipulations using three items found to be effective in previous research - this was quantified using a 5-point Likert scale
  • Motivation to wear a face mask was measured on a 5-point Likert scale where participants indicated how likely they were to engage in the following behaviour: “During the coming days, I will wear a face mask as often as possible when I meet other people.” A higher number on the scale indicated a higher likelihood.
  • Objective vulnerability (whether a participant was high-risk or not) was measured according to the criteria of the Robert Koch Institute
  • Subjective vulnerability was measured on a 5-point Likert scale with 1 representing lowest perceived risk of harm and 5 representing the highest perceived risk of harm
  • Acceptance of policy measures was also measured on a 5-point Likert scale asking to what extent the participant supported various measures mandating physical distancing and the wearing of face masks

Results

Although the paper has no explicitly stated hypotheses, it does note that affective empathy has previously been found to promote altruism and caring within a healthcare context. Thus, it appears Pfattheicher et al.’s (2020) stance is that high levels of empathy should increase prosocial behaviour in the context of the pandemic – i.e., increased physical distancing and wearing of face masks.

This aligns with all 4 studies’ findings:

  • Both Studies 1 and 2 found that higher levels of empathy correlated with greater motivation to physically distance across all 4 samples.
  • Additionally, Study 3 found that increasing state empathy also increased motivation to adhere to physical distancing.
  • Study 4 found that increasing state empathy also increased motivation to adhere to wearing of face masks, and that this effect could not be attributed to participants being made more sensitive to their own vulnerability; i.e., empathy most likely increases prosocial adherence to COVID-19 measures.

Implications

This paper adds to the growing literature on how the COVID-19 pandemic has impacted our perceptions of health and our personal roles in preventing the spread of disease. It also establishes a positive correlation between empathy and adherence to COVID-19 measures – importantly, that interventions appealing to an individual’s empathy can increase motivation to engage in the prosocial protection of others in a pandemic environment.

Given that increasing adherence to COVID-19 health measures is of great interest to politicians, this paper has important real-world implications. While more studies will need to be conducted in more representative and diverse samples to support these initial findings, this paper provides some impetus for politicians to appeal to affective empathy when promoting adherence to physical distancing and wearing of face masks.

Reaction

When I was reading, I was excited to learn that… empathy was consistently associated with adherence to COVID-19 health measures. Across all 4 studies, findings relating empathy to motivation to physically distance or to wear a face mask were significant. This indicates to me that the relationship is somewhat robust, and the finding is groundbreaking because little is known about what influences health-related behaviours specifically in a COVID-19 context. The issue remains highly topical even though the official policy is to continue business as usual and “live with COVID”: the more people who continue to engage in these health practices, the better we can protect each other and reduce strain on the healthcare system. As such, it is important that we identify effective ways to encourage altruistic health behaviours so that we can continue on our path of “living with COVID” in a sustainable manner. I would be excited to see more research verifying these findings, as it could have promising real-world implications.

The most interesting part of this paper was… that while subjective vulnerability did increase in both the empathy condition and the information-only condition, this didn’t significantly moderate the effect of empathy on one’s motivation to wear a face mask. Personally, I find myself wearing face masks out of a desire to protect myself from illness and avoid the inconvenience of having to isolate - my initial thought isn’t to wear a face mask to protect others around me in case I’m carrying the virus. So I find it interesting that the results run somewhat counter to my personal experience. I also wonder how, specifically, appeals to empathy work to encourage face-mask wearing - if it’s not sensitivity to our own vulnerability, nor to others’ vulnerability, what does empathy increase our sensitivity to? And how does this relate back to our motivation to wear a face mask? I’m also curious about the relationship between subjective vulnerability and objective vulnerability. Is there any correlation between the two? Does either significantly predict one’s motivation to wear a face mask? This measure really interests me because I love learning about how our perception and experience of the environment moderate our behaviour, regardless of the realities of that environment.

I wonder whether… having self-report measures that were quite transparent about the construct they were tapping could have enhanced demand characteristics. Could participants have wanted to appear as empathetic as possible? How valid are these findings when other experiments with similar manipulations yielded a null effect, as the experimenters raised in the discussion? By asking participants for their explicit opinions on mask-wearing/physical distancing in an online testing environment, the experimenters can’t really operationalise real-world behaviour effectively. Furthermore, it is interesting to consider the real-world implications, since these measurements and manipulations occurred over a short period of time whilst mask wearing is most effective as a consistent habit. I would love to see future research investigate how this study’s findings might meaningfully be applied in the real world - i.e., how might this empathy manipulation translate into practice? As an advertisement on the street, on TV or perhaps in high-risk settings? It would definitely be interesting to see these kinds of experiments being run in the field.

Part 2: Verification

Verification goals

Demographic statistics

Study 1:

  • “The final samples consisted of 322 participants from the US (45.7% female; age: M = 33.33 years, SD = 13.00), 317 from the UK (59.3% female; age: M = 38.05 years, SD = 12.20), and 326 from Germany (46.6% female; age: M = 29.44 years, SD = 9.31).”

Study 2:

  • “We collected data of 359 participants from Germany (48.5% female; age: M = 29.75 years, SD = 9.40).”

Study 3:

  • “Study 3 (N = 868; 43.8% female; age: M = 35.09 years, SD = 12.44) was run in Germany”

Study 4:

  • “Study 4 (final N = 1,526; 47.2% female; age: M = 34.71 years, SD = 12.09) was run in Germany”

Means/SD

Study 1:

  • Empathy (“We observed relatively high levels of empathy, with some variability (US: M = 4.46, SD = 0.74; UK: M = 4.56, SD = 0.61; Germany: M = 4.02, SD = 0.93).”)
  • Physical distancing motivation (“We observed relatively high levels of physical distancing in each country, with some variability (US: M = 4.30, SD = 0.99; UK: M = 4.12, SD = 1.01; Germany: M = 4.04, SD = 1.11)”)

Study 2:

  • Physical distancing motivation (“…but assessed physical distancing in a more concrete and prospective way (five items; overall M = 4.56, SD = 0.65, α = .76)”)

Study 3:

  • Physical distancing motivation (“…the motivation to adhere to physical distancing … (control condition: M = 4.30, SD = 0.76, information-only condition: M = 4.39, SD = 0.74), …the information + empathy condition had a significantly higher mean (4.51, SD = 0.66) compared with the control condition…”)

Study 4:

  • Empathy (“Participants in the empathy condition…(M = 4.03, SD = 0.90)…the information-only condition (M = 2.14, SD = 1.00),…the control condition (M = 2.10, SD = 1.01)…”)
  • Motivation to wear a face mask (“…the motivation to wear a mask…empathy condition (M = 4.00, SD = 1.12)…the control condition (M = 3.69, SD = 1.24)…the information-only condition, (M = 3.83, SD = 1.20)…”)

Figures

Installing and loading the relevant packages

Here we installed and loaded all the packages necessary to successfully meet our verification goals. Because we were starting from a blank slate, our group had to first install all the relevant packages using the install.packages() function. Some were installed and loaded from the outset, while others (like the ggpubr package) were recruited as we problem-solved how to accurately reproduce particular figures.
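For completeness, here is a minimal sketch of that one-off setup (assuming a fresh R installation; the package names simply mirror the library() calls below, and install.packages() only needs to be run once per machine):

install.packages(c("tidyverse", "readxl", "here", "ggplot2", "ggpubr")) # one-off installation; the library() calls below then load the packages each session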

library(tidyverse) # for most of the functions you'll need
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.7     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(readxl) # to read excel files into RStudio
library(here) # to specify the file path, organise files appropriately and work collaboratively through OneDrive
## here() starts at /Users/jennywang/Desktop/University/Courses/2022/Term 2/PSYC3361/Assessments/RStudio files
library(ggplot2) # for most of the functions needed to reproduce the figures
library(ggpubr) # for additional customisation of figures (explained in further detail later)

Initially, we struggled to understand the utility of the here() function and didn’t know how we should implement it to improve our verification report workflow. However, after some discussion with Anita (our tutor) and consulting Jenny Richmond’s blog (http://jenrichmond.rbind.io/post/how-to-use-the-here-package/), we figured out that the here package facilitates file organisation by letting you map specific file paths to the datasets you wish to read. It turned out to be quite useful when organising the shared OneDrive so that it didn’t get too messy.

I ran the here() function by itself to verify where my file path started.

here()
## [1] "/Users/jennywang/Desktop/University/Courses/2022/Term 2/PSYC3361/Assessments/RStudio files"

Study 3

We began with Study 3 as it looked like one of the simplest to reproduce of the 4 studies presented in our paper (it involved a single sample, unlike Study 1; Study 2 built on Study 1; and Study 4 built on Study 3).

Reading and understanding the data

The relevant data was saved as “Study3” in the “data&syntax” folder on the OSF. There were three formats available (sps, sav and xlsx). Wanting to limit the number of unknowns in an area already unfamiliar to us, we decided to read and interpret the xlsx file.

  • After some investigating through Google, we downloaded and loaded the readxl package and used the read_excel() function to read the data into RStudio.
  • Then, I assigned the name Study3 to this dataset using the assignment operator (<-)
  • I specified the path to the relevant file by using the here() function and indicating the file’s relative position to the here() starting point indicated earlier.
Study3 <- read_excel(here("data&syntax", "Study3.xlsx")) 

Then, I used the glimpse() function from the pillar package (re-exported by dplyr) to gain a brief understanding of what variables were recorded in Study 3. This function makes it possible to see every column in the data frame which is helpful for our purposes.

glimpse(Study3)
## Rows: 868
## Columns: 17
## $ Q22_1          <dbl> 1, 1, 1, 1, 1, 1, 1, 4, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1…
## $ Q22_2          <dbl> 1, 1, 2, 1, 1, 1, 1, 4, 1, 1, 1, 4, 1, 5, 1, 4, 2, 4, 1…
## $ Q22_3          <dbl> 1, 1, 1, 1, 1, 1, 1, 2, 1, 5, 1, 5, 1, 1, 1, 3, 1, 1, 1…
## $ Q22_4          <dbl> 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1…
## $ Q22_5          <dbl> 2, 1, 1, 1, 1, 1, 2, 4, 4, 1, 3, 1, 1, 1, 2, 2, 1, 1, 1…
## $ Q22_6          <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5…
## $ Q20            <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ Age            <dbl> 32, 57, 63, 50, 52, 44, 47, 24, 43, 57, 31, 50, 19, 53,…
## $ Gender         <dbl> 1, 2, 1, 1, 2, 2, 1, 1, 1, 1, 2, 1, 1, 1, 2, 2, 2, 2, 1…
## $ Household_size <dbl> 1, 1, 2, 2, 1, 3, 3, 3, 3, 2, 3, 3, 2, 3, 1, 4, 1, 1, 2…
## $ Q28            <chr> "Toast", "Bratkartoffeln", "Hühnchen", "Ich esse mittag…
## $ bed            <dbl> 2, 2, 0, 0, 2, 2, 0, 0, 0, 2, 0, 1, 2, 1, 1, 2, 2, 0, 0…
## $ sd             <dbl> 1.2, 1.0, 1.2, 1.0, 1.0, 1.0, 1.2, 3.4, 1.6, 1.8, 1.4, …
## $ sdR            <dbl> 4.8, 5.0, 4.8, 5.0, 5.0, 5.0, 4.8, 2.6, 4.4, 4.2, 4.6, …
## $ sd01           <dbl> 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1…
## $ d1             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0…
## $ d2             <dbl> 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0…

This glimpse caused considerable confusion among our group. The variable labels made little intuitive sense and we were unsure why the variable labelled Q28 had various food items listed underneath it. We decided to investigate this further by exploring the other file formats as well as the supplementary materials included in the paper’s OSF.

The supplementary materials listed “bed” as corresponding with the condition participants were assigned to for Study 4. We verified that this was also the case for Study 3 by consulting its corresponding sav file (Study3.sav) through Jamovi.

Jamovi: Identifying conditions

To limit later confusion, I renamed the variable to “condition” using the rename() function from the dplyr package and assigned this new dataset to “Study3_Rename” to avoid overwriting the original dataset, in case we needed it later on. The pipe operator (%>%) strings functions together such that the rename() function acts upon the “Study3” object.

Study3_Rename <- Study3 %>% 
  rename(condition = "bed") # using the rename() function to rename "bed" to "condition"

By consulting the sav file in Jamovi we discovered the following:

  • Most of the questions were included in the description of each variable. However, because this study was conducted in Germany, most of the questions and the levels were in German. Thus, Google Translate would be our best friend.
  • As indicated in the screenshot above, “0”, “1” and “2” were used to represent Control, Info + Empathy and Info Only conditions respectively.
  • Q22_1 - Q22_5 were likely the questions that comprised physical distancing motivation since they each translated to:
    • Q22_1: Please indicate to what extent you consider the statements to be probable. - Because the weather is good, I’m going to meet up with friends today or tomorrow.
    • Q22_2: Please indicate to what extent you consider the statements to be probable. - I will be meeting with family members I don’t live with in the next few days.
    • Q22_3: Please indicate to what extent you consider the statements to be probable. - In my free time in the next few days I will visit older people (e.g. parents, grandparents, friends of advanced age)
    • Q22_4: Please indicate to what extent you consider the statements to be probable. - I will be meeting friends outside of my apartment in the next few days.
    • Q22_5: Please indicate to what extent you consider the statements to be probable. - In my free time, I will certainly spend the next few days in places where other people are (e.g. ice cream parlor, café, church).
  • Q28 (the weird food variable) was used as a bot check and was therefore not really of any consequence to our verification goals.

However, it was still not absolutely clear which variable was used to operationalise physical distancing motivation. Our initial idea was to compute a new variable that was the average of all the questions asked as part of Q22 (except Q22_6, which served as an attention check).

Replicating means: Attempt One

The reported means and standard deviations for motivation to physically distance in Study 3 were:

  • Control (M = 4.30, SD = 0.76)
  • Information + Empathy (M = 4.51, SD = 0.66)
  • Information Only (M = 4.39, SD = 0.74)

To do this:

  • First, I created a new variable for the average of the 5 relevant questions using the mutate() function from the dplyr package and named it “physdist_motiv”.
  • Then, I used the filter() function (from the dplyr package) 3 times to generate summary tables (using the summarise() function) with the means and standard deviations that were specific to each condition. The double equals sign serves as a test - for example, if the variable “condition” = 0 THEN we want to keep that portion (i.e. filter it in).
  • Finally, the print() function produces those tibbles for us to see.
Study3_RenameM <- Study3_Rename %>% 
  mutate(physdist_motiv = (Q22_1 + Q22_2 + Q22_3 + Q22_4 + Q22_5)/5) # creating a new variable for physical distancing motivation called "physdist_motiv" which is calculated by averaging participants' responses to the survey questions

Study3_MControl <- Study3_RenameM %>% 
  filter(condition == "0") %>% # generates the descriptives for the control condition only
  summarise(controlM = mean(physdist_motiv), # using the mean() function to calculate the mean
            controlsd = sd (physdist_motiv)) # using the sd() function to calculate the standard deviation

Study3_MInfEmp <- Study3_RenameM %>% 
  filter(condition == "1") %>% # generates the descriptives for the information + empathy condition only
  summarise(infempM = mean (physdist_motiv),
            infempsd = sd(physdist_motiv)) 

Study3_MInf <- Study3_RenameM %>% 
  filter(condition == "2") %>% # generates the descriptives for the information only condition only
  summarise(infM = mean(physdist_motiv),
            infsd = sd(physdist_motiv))

print(Study3_MControl)
## # A tibble: 1 × 2
##   controlM controlsd
##      <dbl>     <dbl>
## 1     1.70     0.760
print(Study3_MInfEmp)
## # A tibble: 1 × 2
##   infempM infempsd
##     <dbl>    <dbl>
## 1    1.49    0.657
print(Study3_MInf)
## # A tibble: 1 × 2
##    infM infsd
##   <dbl> <dbl>
## 1  1.61 0.739

The means did not match. But I also realised that the code was very inefficient. It would be easier to compare means if all the means were in one table. We looked through our notes from Danielle’s modules again, found the group_by() function from the dplyr package and tried it out to see if we could use it later in our coding journey.

Study3_M1 <- Study3_RenameM %>% 
  group_by(condition) %>% # group_by() function uses "condition" as the organising variable
  summarise(mean_resp = mean(physdist_motiv), 
            sd_resp = sd(physdist_motiv)) %>% 
  ungroup() # it's good practice to ungroup() the variables so that we can regroup them later if need be

print(Study3_M1)
## # A tibble: 3 × 3
##   condition mean_resp sd_resp
##       <dbl>     <dbl>   <dbl>
## 1         0      1.70   0.760
## 2         1      1.49   0.657
## 3         2      1.61   0.739

A minor triumph! However, matching the means was still an issue.

We decided to adopt a trial and error approach and test the means for a number of variables in the dataset.

Replicating means: Attempt Two

Study3_M2 <- Study3_Rename %>% 
  group_by(condition) %>% 
  summarise (mean_resp = mean(sd),
             sd_resp = sd(sd)) %>% # this generated a table identical to the one above; this means that "sd" is the average of participants' responses to each of the questions
  ungroup() # "sd" is not the variable for motivation to physically distance

Study3_M2 <- Study3_Rename %>% 
  group_by(condition) %>% 
  summarise (mean_resp = mean(sdR),
             sd_resp = sd(sdR)) %>% # the descriptive statistics in this table match the reported ones!
  ungroup()

print(Study3_M2)
## # A tibble: 3 × 3
##   condition mean_resp sd_resp
##       <dbl>     <dbl>   <dbl>
## 1         0      4.30   0.760
## 2         1      4.51   0.657
## 3         2      4.39   0.739

After some trial and error, we realised that “sdR” was the variable that the experimenters had used to operationalise physical distancing motivation. I revisited the glimpse() output from above and realised that we could have recognised this earlier, given that the numbers listed under the variable name “sdR” were closest to the reported means.

It appears that the experimenters calculated this variable by subtracting “sd” from 6, which inverts the scale such that 5 represents the greatest motivation to physically distance and 1 the lowest.
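A quick way to check this interpretation (our own sanity check, not taken from the authors’ syntax) is to recompute the variable and compare it row by row; near() from the dplyr package guards against floating-point rounding:

Study3_Rename %>% 
  mutate(sdR_check = 6 - sd) %>% # recompute the reversed scale from the raw "sd" average
  summarise(all_match = all(near(sdR_check, sdR))) # TRUE if the 6 - sd interpretation holds for every row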

To reduce later confusion and improve the clarity of subsequent code, we renamed the variable titled “sdR” to “physdist_motiv” using the rename() function in the dplyr package.

Study3_Rename <- Study3_Rename %>% 
  rename(physdist_motiv = "sdR")

Replicating the demographic statistics

Now that we had replicated the descriptive statistics, we moved on to replicating the demographic statistics (i.e., the number of participants, the age means and standard deviations, as well as the gender distribution).

The number of participants

First, the number of participants. I used the count() function from the dplyr package to verify the number of participants in Study 3.

The paper reported 868 participants.

count(Study3)
## # A tibble: 1 × 1
##       n
##   <int>
## 1   868

Nice.

The age demographics

Now onto the age demographics. I used the summarise() function from the dplyr package to verify these numbers.

The paper reported the mean age as 35.09 years with the standard deviation being 12.44.

Study3_Age <- Study3_Rename %>%
  summarise(mean_age = mean(Age),
            sd_age = sd(Age))

print(Study3_Age)
## # A tibble: 1 × 2
##   mean_age sd_age
##      <dbl>  <dbl>
## 1     35.1   12.4

Very nice. Going suspiciously smoothly so far…

The gender distribution

The paper reported the gender distribution as being 43.8% female.

table(Study3_Rename$Gender)
## 
##   1   2 
## 484 380

However, which gender did “1” and “2” correspond with? The information in Jamovi was written in German (see the image below) but, knowing a bit of German myself, I recognised that “1” was male and “2” was female. There was also a level coded as “3” which was labelled “Keine Angabe”. I had no idea what this meant so I popped it into Google Translate - turns out it means “No Information”.

Jamovi: Variables labelled in German

380/868 #trying to replicate the percentage female distribution through simple mathematics
## [1] 0.437788

This matches the reported gender distribution!
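As an aside, count() from the dplyr package offers a less manual route to the same percentage (our own convenience check; because count() keeps unreported rows as their own group, sum(n) still covers all 868 participants):

Study3_Rename %>% 
  count(Gender) %>% # one row per gender code, plus any unreported rows
  mutate(percent = 100 * n / sum(n)) # 380 female / 868 total = ~43.8%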

Replicating Figure 2

Attempt One

Initially we tried to plot the summary table “Study3_M2”.

  • We used the ggplot2 package (specifically the ggplot() function) which allows us to create and edit graphs.
  • The input data was specified to be “Study3_M2” as above by using the “data =” statement.
  • We specified the aesthetic mappings using the aes() function such that the “condition” was plotted along the X-axis and the mean response (“mean_resp”) was plotted along the Y-axis.
  • geom_violin() was used to replicate the violin shape seen in the paper
ggplot(data = Study3_M2, mapping = aes(x = condition, y = mean_resp)) + 
  geom_violin()

This does not look right at all. The condition needed to be a factor variable so that a separate violin was drawn for each condition (it was currently being treated as continuous).

  • We tried to use the factor() function to achieve this
ggplot(data = Study3_M2, mapping = aes(x = factor(condition), y = mean_resp)) +
  geom_violin()
## Warning: Groups with fewer than two data points have been dropped.
## Groups with fewer than two data points have been dropped.
## Groups with fewer than two data points have been dropped.
## Warning in max(data$density): no non-missing arguments to max; returning -Inf
## Warning: Computation failed in `stat_ydensity()`:
## replacement has 1 row, data has 0

This returned warnings and a failed computation. Perhaps this is because there is only one number per “factor”, which cannot be used to generate a violin plot. Upon further thought, we realised that by only plotting the means of physical distancing motivation, we would also lose the whiskers in the box plot shown in the paper. We needed to try an alternative method.

Attempt Two

We realised we needed to graph the whole dataset (i.e., “Study3_Rename”), not just the summary table, in order to capture the full distribution.

  • We coded the aesthetic mappings of the plot (as specified by aes()) such that “condition” was plotted along the X-axis and participants’ motivation to physically distance (“physdist_motiv”) was plotted along the Y-axis. We kept “condition” as a factor variable by using the factor() function.
ggplot(data = Study3_Rename, mapping = aes(x = factor(condition), y = physdist_motiv))+
  geom_violin()

This looks more correct!

However, I noticed that the condition labels were numeric on the generated figure and the one presented in the paper had the condition specified non-numerically. Thus, I wanted to create a new variable called “conditionlabel” that would non-numerically describe the condition (i.e., as “Control”, “Information + Empathy” and “Information Only”). After some Googling, I discovered how to do this via this website: https://www.statology.org/conditional-mutating-r/.

Accordingly, I used the case_when() function from the dplyr package such that a specified label (e.g. “Control”) corresponded with a specified number in the “condition” column (e.g. “0”).

Study3Conditions <- Study3_Rename %>% 
  mutate (conditionlabel = case_when(condition == 0 ~ "Control", 
                                     condition == 1 ~ "Information + Empathy",
                                     condition == 2 ~ "Information Only"))

Furthermore, Figure 2 had a box plot overlaid on top of the violin plot. We decided to try geom_boxplot() from the ggplot2 package to see if it had all the features we needed (spoiler: fortunately, it did).

We also spent some time perfecting the aesthetics of the plot, often using Google as a tool, as you will see in the code below.

  • First, we created a new object called “Figure2” and used the assignment operator (<-) to assign functions to this object. This would allow us to subsequently make aesthetic changes to the plot using an additive function (+).
  • The reorder() function was used to ensure that the graph was arranged left to right in order of ascending means, as per the paper
  • The fill aesthetic was used to colour the components of the graph according to the “conditionlabel”, ensuring that each violin was filled in with a different colour, which we would later specify using scale_fill_manual() from the ggplot2 package and hexadecimal colour codes (see code for further elaboration)
  • I have also elaborated on the utility of each function in the code below.
Figure2 <- ggplot(data = Study3Conditions,  mapping = aes(x = reorder(conditionlabel, physdist_motiv), y = physdist_motiv)) +
  geom_violin(aes(fill = conditionlabel)) +
  geom_boxplot(width = 0.1) # this replicated the box plot overlaid on top of the violin plot; the width argument adjusts the size of the box plot to match the one presented in the paper

print(Figure2) # I printed the plot first to see if the basic graph was the right shape before I adjusted the aesthetics; it was! yay!

# Superficial Aesthetics #
Figure2 +
  theme_bw()+ # from ggplot2 pckg: this gets rid of the grey background in a default ggplot graph
  theme(panel.grid.major = element_blank(), # from ggplot2 pckg: this gets rid of the horizontal grid lines at each of the y-axis numbers
        panel.grid.minor = element_blank(), # this gets rid of the horizontal grid lines halfway between the y-axis numbers
        legend.position = "none") + # this removes the legend 
  scale_fill_manual(values = c("#A7CFE5", "#B3D88B", "#1779B6"))+ # Google Chrome's "ColourPick Eyedropper" extension tool was used to determine the exact hexadecimal codes for the colours used in the paper
  scale_y_continuous(name = "Physical Distancing Motivation") + # from ggplot2 pckg: this specifies the y-axis label, given that it is a continuous variable
  scale_x_discrete(name = NULL) + # from ggplot2 pckg: this removes the x-axis label, given that it is a discrete variable
  stat_summary(fun = "mean", shape = 9) # from ggplot2 pckg: this plots the mean onto the graph as a specified shape ("9") which was discovered on this website: https://www.oreilly.com/library/view/r-data-visualization/9781788398312/74f79f51-f6ee-45e8-9f7c-8df78c825cf6.xhtml
## Warning: Removed 3 rows containing missing values (geom_segment).

Success! We’ve completed our verification goals for Study 3! We felt it would be faster to divide and conquer for the subsequent studies since all four experiments were largely similar. Study 1 was assigned to myself and Daphne, Study 2 to Kopal, and Study 4 to Rukshali and Jasper: we were to attempt the verification goals for our assigned studies and regroup in a week’s time to present our efforts so that we were all on the same page.

Study 1

This study had 3 samples from different countries and thus, there were 3 datasets that we needed to work with.

Reading and understanding the data

Study1USA <- read_excel(here("data&syntax", "Study1USA.xlsx"))
Study1Ger <- read_excel(here("data&syntax","Study1Germany.xlsx"))
Study1UK <- read_excel(here("data&syntax","Study1UK.xlsx"))

glimpse(Study1USA)
## Rows: 322
## Columns: 27
## $ Q22_1          <dbl> 4, 4, 5, 5, 5, 2, 5, 1, 5, 4, 5, 5, 4, 4, 5, 4, 5, 5, 4…
## $ Q22_2          <dbl> 4, 3, 5, 5, 4, 2, 5, 5, 5, 5, 5, 4, 4, 3, 5, 3, 4, 4, 2…
## $ Q22_3          <dbl> 4, 4, 5, 5, 5, 2, 5, 3, 5, 4, 5, 5, 4, 4, 5, 5, 5, 5, 4…
## $ Q22_4          <dbl> 1, 1, 4, 4, 3, 2, 2, 1, 4, 3, 5, 2, 4, 1, 5, 1, 4, 4, 1…
## $ Q22_5          <dbl> 4, 3, 5, 4, 4, 2, 5, 1, 5, 4, 5, 3, 4, 3, 5, 5, 5, 5, 4…
## $ Q22_6          <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5…
## $ Q22_7          <dbl> 5, 2, 4, 4, 4, 2, 2, 4, 5, 4, 4, 4, 3, 3, 5, 5, 4, 3, 4…
## $ Q24_1          <dbl> 4, 4, 5, 4, 5, 2, 4, 5, 5, 4, 4, 4, 5, 2, 5, 4, 4, 5, 4…
## $ Q24_2          <dbl> 4, 5, 5, 5, 5, 2, 5, 5, 5, 4, 5, 4, 5, 3, 5, 5, 5, 5, 4…
## $ Q24_3          <dbl> 5, 5, 5, 5, 5, 2, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5…
## $ Q24_4          <dbl> 5, 5, 5, 5, 5, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5…
## $ Q25_1          <dbl> 5, 5, 5, 5, 5, 2, 5, 5, 5, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5…
## $ Q25_2          <dbl> 5, 5, 5, 5, 5, 3, 5, 5, 5, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5…
## $ Q25_3          <dbl> 5, 5, 5, 5, 4, 2, 4, 3, 5, 4, 5, 4, 5, 4, 5, 5, 5, 5, 4…
## $ Q25_4          <dbl> 5, 5, 5, 5, 4, 4, 5, 5, 5, 4, 5, 4, 5, 4, 5, 5, 5, 5, 4…
## $ Q32            <dbl> 3, 4, 1, 7, 1, 8, 3, 5, 2, 6, 3, 2, 5, 3, 5, 3, 1, 2, 4…
## $ Q20            <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ Age            <dbl> 20, 19, 22, 46, 32, 19, 49, 31, 51, 21, 18, 21, 25, 34,…
## $ Gender         <dbl> 1, 1, 2, 2, 1, 1, 1, 1, 2, 1, 2, 1, 2, 1, 2, 2, 2, 2, 2…
## $ Household_size <dbl> 4, 4, 4, 3, 3, 3, 1, 5, 3, 3, 4, 3, 2, 4, 5, 1, 2, 1, 2…
## $ Education      <dbl> 3, 2, 4, 5, 4, 2, 4, 4, 4, 3, 3, 3, 4, 4, 4, 4, 4, 5, 3…
## $ Income         <dbl> 12, 3, 12, 6, 9, 1, 3, 7, 2, 6, 10, 8, 6, 8, 5, 2, 3, 5…
## $ ve             <dbl> 4.000000, 3.666667, 5.000000, 4.666667, 4.666667, 2.000…
## $ sd             <dbl> 4.0, 4.5, 5.0, 4.5, 5.0, 2.0, 4.5, 5.0, 5.0, 4.0, 4.5, …
## $ q              <dbl> 5, 5, 5, 5, 5, 3, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5…
## $ i              <dbl> 5.0, 5.0, 5.0, 5.0, 5.0, 2.5, 5.0, 5.0, 5.0, 4.0, 5.0, …
## $ gov            <dbl> 5.0, 5.0, 5.0, 5.0, 4.5, 3.0, 4.5, 4.0, 5.0, 4.5, 5.0, …
glimpse(Study1Ger)
## Rows: 326
## Columns: 29
## $ Q22_1          <dbl> 4, 5, 3, 2, 5, 4, 4, 4, 5, 2, 2, 4, 5, 5, 5, 4, 3, 4, 3…
## $ Q22_2          <dbl> 29, 32, 32, 32, 30, 32, 30, 29, 31, 29, 30, 30, 32, 31,…
## $ Q22_3          <dbl> 4, 5, 4, 3, 3, 3, 5, 5, 5, 3, 3, 4, 2, 5, 4, 2, 3, 4, 3…
## $ Q22_4          <dbl> 29, 30, 28, 32, 29, 30, 30, 29, 29, 28, 28, 29, 29, 31,…
## $ Q22_5          <dbl> 4, 5, 3, 3, 5, 2, 4, 4, 5, 3, 2, 4, 5, 5, 5, 2, 4, 5, 3…
## $ Q22_6          <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ Q22_7          <dbl> 29, 30, 32, 30, 32, 29, 29, 30, 31, 30, 32, 30, 29, 31,…
## $ Q77_1          <dbl> 5, 4, 3, 5, 5, 1, 3, 5, 5, 1, 4, 4, 5, 4, 3, 5, 3, 5, 5…
## $ Q77_2          <dbl> 4, 4, 4, 5, 5, 4, 3, 5, 5, 2, 4, 4, 5, 5, 5, 5, 4, 5, 5…
## $ Q77_3          <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5…
## $ Q77_4          <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5…
## $ Q78_1          <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5…
## $ Q78_2          <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 3, 5, 5, 5…
## $ Q78_3          <dbl> 4, 5, 4, 5, 5, 4, 4, 5, 5, 1, 4, 4, 5, 5, 4, 4, 4, 5, 4…
## $ Q78_4          <dbl> 4, 5, 4, 5, 5, 4, 4, 5, 5, 3, 4, 4, 5, 5, 4, 5, 4, 5, 5…
## $ Q20            <dbl> 1, 1, 4, 4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ Age            <dbl> 18, 29, 18, 32, 37, 39, 20, 22, 39, 22, 32, 29, 26, 24,…
## $ Gender         <dbl> 1, 2, 1, 2, 2, 1, 1, 2, 2, 1, 1, 1, 2, 2, 1, 1, 1, 1, 2…
## $ Household_size <dbl> 2, 1, 4, 2, 1, 1, 3, 4, 2, 2, 2, 3, 2, 1, 3, 2, 1, 2, 1…
## $ Q21            <chr> "5b3d0c7477cb3a00019c1bb3", "5d9739389e5b9a0018820f68",…
## $ Q28            <chr> "Pizza", "Pizza", "Schweinefleisch mit Pilzen", "Brot u…
## $ `filter_$`     <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ ve             <dbl> 4.000000, 5.000000, 3.333333, 2.666667, 4.333333, 3.000…
## $ sd             <dbl> 4.5, 4.0, 3.5, 5.0, 5.0, 2.5, 3.0, 5.0, 5.0, 1.5, 4.0, …
## $ q              <dbl> 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, …
## $ i              <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 4, 5, 5, 5…
## $ Zve            <dbl> -0.01877901, 1.06156624, -0.73900918, -1.45923934, 0.34…
## $ Zsd            <dbl> -0.5627027, 0.8062606, -0.5627027, 0.8062606, 0.8062606…
## $ gov            <dbl> 4.0, 5.0, 4.0, 5.0, 5.0, 4.0, 4.0, 5.0, 5.0, 2.0, 4.0, …
glimpse(Study1UK)
## Rows: 317
## Columns: 28
## $ Q41            <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ Q42            <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ Q22_1          <dbl> 4, 4, 5, 5, 5, 5, 5, 4, 5, 5, 4, 5, 4, 5, 4, 5, 5, 2, 5…
## $ Q22_2          <dbl> 2, 4, 4, 3, 4, 5, 5, 5, 4, 4, 4, 4, 5, 3, 5, 5, 4, 4, 4…
## $ Q22_3          <dbl> 5, 4, 5, 5, 5, 5, 5, 5, 5, 5, 4, 5, 4, 5, 5, 5, 5, 4, 4…
## $ Q22_4          <dbl> 1, 4, 4, 1, 2, 4, 3, 3, 2, 3, 4, 4, 5, 3, 2, 5, 2, 3, 2…
## $ Q22_5          <dbl> 5, 4, 5, 4, 3, 4, 3, 4, 5, 5, 4, 4, 4, 5, 4, 5, 4, 2, 3…
## $ Q22_6          <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5…
## $ Q22_7          <dbl> 3, 5, 3, 3, 4, 2, 5, 3, 4, 5, 3, 4, 4, 4, 3, 5, 4, 1, 4…
## $ Q24_1          <dbl> 3, 5, 5, 4, 3, 5, 4, 4, 4, 3, 4, 4, 3, 2, 2, 5, 5, 4, 5…
## $ Q24_2          <dbl> 3, 5, 5, 4, 4, 5, 4, 5, 4, 4, 4, 4, 5, 2, 3, 5, 5, 4, 5…
## $ Q24_3          <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 4, 2, 5, 5, 5…
## $ Q24_4          <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 4, 4, 5, 5, 5…
## $ Q25_1          <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 4, 2, 5, 5, 5…
## $ Q25_2          <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 4, 4, 5, 5, 5…
## $ Q25_3          <dbl> 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 5, 4, 5, 4, 3, 4, 5, 5, 5…
## $ Q25_4          <dbl> 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 5, 4, 5, 4, 4, 4, 5, 5, 5…
## $ Q20            <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ Age            <dbl> 24, 31, 31, 33, 34, 21, 49, 27, 33, 45, 39, 39, 54, 53,…
## $ Gender         <dbl> 1, 1, 2, 1, 1, 2, 2, 2, 2, 1, 1, 2, 2, 2, 1, 1, 2, 2, 2…
## $ Household_size <dbl> 7, 4, 2, 3, 4, 2, 4, 3, 3, 4, 4, 2, 2, 3, 1, 2, 4, 2, 3…
## $ Education      <dbl> 4, 3, 5, 4, 4, 3, 2, 4, 4, 3, 4, 4, 3, 2, 5, 5, 3, 4, 3…
## $ Income         <dbl> 5, 5, 11, 5, 2, 2, 4, 4, 11, 4, 9, 3, 4, 5, 3, 3, 2, 5,…
## $ ve             <dbl> 4.666667, 4.000000, 5.000000, 4.666667, 4.333333, 4.666…
## $ sd             <dbl> 3.0, 5.0, 5.0, 4.0, 3.5, 5.0, 4.0, 4.5, 4.0, 3.5, 4.0, …
## $ q              <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 4, 3, 5, 5, 5…
## $ i              <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 4, 3, 5, 5, 5…
## $ gov            <dbl> 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 4.5, 4.5, 4.5, 5.0, …

The glimpse() function did not tell us much. These datasets had no variable called “sdR” as was used in Study 3 to represent physical distancing motivation. It wasn’t entirely clear which variable corresponded to empathy scores either.

Firstly, I checked the supplementary materials to see if there was anything of note there.

Supplementary materials: Studies 1 and 2

This included the translated items from the survey but didn’t specify which variable name corresponded with which question. To locate this information, I opened each of the datasets in Jamovi. The information was there, but I was still unsure whether the variables used to operationalise “empathy” and “physical distancing motivation” were averages of all these questions or could be quantified through a single test item (as it was in Study 3).

A trial and error approach was needed.

Finding empathy

Let’s begin by finding the variable for empathy. We ran multiple variables through the same summarise function used for Study 3 to find what the experimenters used to operationalise “empathy”. This time, however, we didn’t need the group_by() function given that there were no conditions that participants were assigned to.

The reported descriptive statistics were M = 4.46 and SD = 0.74.

First, I ran singular questions through the summarise function:

Study1USA %>% 
  summarise (mean_emp = mean(Q22_1), #Q: I am very concerned about those most vulnerable to coronavirus COVID-19. The questions in this dataset were in English.
             sd_emp = sd(Q22_1))
## # A tibble: 1 × 2
##   mean_emp sd_emp
##      <dbl>  <dbl>
## 1     4.49  0.836

This wasn’t right.

Study1USA %>% 
  summarise (mean_emp = mean(Q22_3), #Q: I feel compassion for those most vulnerable to coronavirus COVID-19.
             sd_emp = sd(Q22_3))
## # A tibble: 1 × 2
##   mean_emp sd_emp
##      <dbl>  <dbl>
## 1     4.61  0.733

This wasn’t right either.

Study1USA %>% 
  summarise (mean_emp = mean(Q22_5), #Q: I am quite moved by what can happen to those most vulnerable to coronavirus COVID-19.
             sd_emp = sd(Q22_5))
## # A tibble: 1 × 2
##   mean_emp sd_emp
##      <dbl>  <dbl>
## 1     4.28  0.926

Still incorrect.

However, I was sure I could find the variable in the dataset without having to calculate it myself, since that was the case in Study 3. I decided to run other variables through the summarise function to see what it could be. Revisiting the glimpse() output from above, I suspected it could be any one of “ve”, “sd”, “q”, “i” or “gov”, since those numbers looked most feasible given the reported mean.

Study1USA %>% 
  summarise (mean_emp = mean(ve),
             sd_emp = sd(ve))
## # A tibble: 1 × 2
##   mean_emp sd_emp
##      <dbl>  <dbl>
## 1     4.46  0.743

This matched the reported means! I also wanted to quickly check it wasn’t just a coincidence by checking the means and standard deviations of the other variables I thought it could be.

Study1USA %>% 
  summarise (mean(sd), mean(q), mean(i), mean(gov),
             sd(sd), sd(q), sd(i), sd(gov))
## # A tibble: 1 × 8
##   `mean(sd)` `mean(q)` `mean(i)` `mean(gov)` `sd(sd)` `sd(q)` `sd(i)` `sd(gov)`
##        <dbl>     <dbl>     <dbl>       <dbl>    <dbl>   <dbl>   <dbl>     <dbl>
## 1       4.44      4.88      4.89        4.75    0.802   0.434   0.406     0.471

None of the other means and standard deviations matched the reported ones. “ve” must be the variable used to operationalise “empathy” in Study 1!
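In hindsight, across() from the dplyr package could have screened all five candidates in a single call (a convenience we only discovered later, so the trial and error above is what we actually did):

Study1USA %>% 
  summarise(across(c(ve, sd, q, i, gov), list(mean = mean, sd = sd))) # one mean and sd column per candidate variable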

I hypothesised that this was calculated by averaging participants’ responses to the 3 empathy-related questions in the survey. To verify this hypothesis, I used the mutate() function from the dplyr package to create a new variable that was the average of participants’ three responses to Q22_1, Q22_3 and Q22_5 to see if this matched the variable “ve”. This process was similar to the one taken for Study 3 earlier.

Study1USA_Emp <- Study1USA %>% 
  mutate(empathy = ((Q22_1 + Q22_3 + Q22_5)/3))

Study1USA_Emp %>% 
  summarise(mean_emp = mean(empathy),
            sd_emp = sd(empathy))
## # A tibble: 1 × 2
##   mean_emp sd_emp
##      <dbl>  <dbl>
## 1     4.46  0.743

It’s the same (as suspected)!
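Because matching summary statistics could in principle be a coincidence, we also ran a stronger (self-imposed) check comparing the two columns row by row:

Study1USA_Emp %>% 
  summarise(matches_ve = all(near(empathy, ve))) # TRUE if our computed average equals "ve" in every row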

Let’s see if “ve” is also the variable for empathy in the German and UK samples.

The reported means and standard deviations for the German and UK sample were M = 4.02, SD = 0.93 and M = 4.56, SD = 0.61 respectively.

Study1Ger %>% 
  summarise (mean_emp = mean(ve),
             sd_emp = sd(ve))
## # A tibble: 1 × 2
##   mean_emp sd_emp
##      <dbl>  <dbl>
## 1     4.02  0.926
Study1UK %>% 
  summarise (mean_emp = mean(ve),
             sd_emp = sd(ve))
## # A tibble: 1 × 2
##   mean_emp sd_emp
##      <dbl>  <dbl>
## 1     4.56  0.612

They match! Perfect. We have now verified the means for “empathy” in Study 1.

We decided to use the rename() function from the dplyr package to rename the variables accordingly and minimise confusion later down the track.

Study1USA_Rename <-Study1USA %>% 
  rename(empathy = ve)

Study1Ger_Rename <- Study1Ger %>% 
  rename(empathy = ve) 

Study1UK_Rename <-Study1UK %>% 
  rename(empathy = ve)

What about physical distancing motivation?

Let’s start by seeing whether the item “Because of coronavirus COVID-19, I am massively curtailing social contact (so-called ‘social distancing’).” is the correct variable, as per the supplementary materials.

The reported descriptive statistics for the USA sample were M = 4.30, SD = 0.99.

Study1USA %>% 
  summarise(mean_physdist = mean(Q24_1), #Q: Because of coronavirus COVID-19, I am massively curtailing social contact (so-called "social distancing").
            sd_physdist = sd(Q24_1))
## # A tibble: 1 × 2
##   mean_physdist sd_physdist
##           <dbl>       <dbl>
## 1          4.30       0.991

They match! Nice.

Now, to replicate this process for the German and UK samples. The UK sample was relatively simple to figure out since the questions were listed in English. However, when opening up “Study1Germany.sav” in Jamovi, all the questions were written in German.

Jamovi: Study1Germany.sav in need of translation

Google Translate was once again our best friend.

The reported descriptive statistics for the UK sample were M = 4.12, SD = 1.01 and for the German sample were M = 4.04, SD = 1.11.

Study1UK %>% 
  summarise(mean_physdist = mean(Q24_1), #Q: Because of coronavirus COVID-19, I am massively curtailing social contact (so-called "social distancing").
            sd_physdist = sd(Q24_1))
## # A tibble: 1 × 2
##   mean_physdist sd_physdist
##           <dbl>       <dbl>
## 1          4.12        1.01
Study1Ger %>% 
  summarise(mean_physdist = mean(Q77_1), #This Q translated to: Due to the coronavirus COVID-19, I have massively reduced my direct social contacts (so-called "social distancing").
            sd_physdist = sd(Q77_1))
## # A tibble: 1 × 2
##   mean_physdist sd_physdist
##           <dbl>       <dbl>
## 1          4.04        1.11

They also match! Triple nice. We have now verified the means for “physical distancing motivation” in Study 1.

Time to update the “Study1???_Rename” objects to ensure the corresponding question items are renamed to something more intuitive: “physdist_motiv”.

Study1USA_Rename <-Study1USA_Rename %>% 
  rename(physdist_motiv = Q24_1)

Study1Ger_Rename <- Study1Ger_Rename %>% 
  rename(physdist_motiv = Q77_1) 

Study1UK_Rename <-Study1UK_Rename %>% 
  rename(physdist_motiv = Q24_1)

Replicating the demographic statistics

The number of participants

First, the number of participants in each sample. We used the same process used for Study 3 to reproduce these numbers.

The paper reported the following numbers of participants:

  • USA sample (n = 322)
  • German sample (n = 326)
  • UK sample (n = 317)
count(Study1USA_Rename)
## # A tibble: 1 × 1
##       n
##   <int>
## 1   322
count(Study1Ger_Rename) 
## # A tibble: 1 × 1
##       n
##   <int>
## 1   326
count(Study1UK_Rename)
## # A tibble: 1 × 1
##       n
##   <int>
## 1   317

All of the above match perfectly.

The age demographics

Now onto the age demographics.

The paper reported the following age demographics:

  • USA sample (M = 33.33, SD = 13.00)
  • German sample (M = 29.44, SD = 9.31)
  • UK sample (M = 38.05, SD = 12.20)
Study1USA_Age <- Study1USA_Rename %>% 
  summarise (mean_age = mean(Age),
             sd_age = sd(Age)) 

print(Study1USA_Age)
## # A tibble: 1 × 2
##   mean_age sd_age
##      <dbl>  <dbl>
## 1     33.3   13.0
Study1Ger_Age <- Study1Ger_Rename %>% 
  summarise (mean_age = mean(Age),
             sd_age = sd(Age))

print(Study1Ger_Age)
## # A tibble: 1 × 2
##   mean_age sd_age
##      <dbl>  <dbl>
## 1     29.4   9.31
Study1UK_Age <- Study1UK_Rename %>% 
  summarise (mean_age = mean(Age),
             sd_age = sd(Age)) 

print(Study1UK_Age)
## # A tibble: 1 × 2
##   mean_age sd_age
##      <dbl>  <dbl>
## 1     38.1   12.2

Perfect.

The gender distribution

Onto the gender distribution.

The paper reported the following gender distributions:

  • USA sample (45.7% female)
  • German sample (46.6% female)
  • UK sample (59.3% female)
#1 = Male; 2 = Female; 3 = No information (N.A)
table(Study1USA_Rename$Gender)
## 
##   1   2   3 
## 172 147   3
147/322
## [1] 0.4565217
table(Study1Ger_Rename$Gender)
## 
##   1   2   3 
## 172 152   2
152/326
## [1] 0.4662577
table(Study1UK_Rename$Gender)
## 
##   1   2   3 
## 128 188   1
188/317
## [1] 0.5930599

Smooth sailing.
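In hindsight, a small helper function would compute this split for all three samples at once (our own convenience sketch, assuming the coding 2 = female identified above; each call should reproduce the percentages just shown):

percent_female <- function(df) {
  100 * sum(df$Gender == 2, na.rm = TRUE) / nrow(df) # females as a share of all rows, including "no information"
}

percent_female(Study1USA_Rename) # ~45.7
percent_female(Study1Ger_Rename) # ~46.6
percent_female(Study1UK_Rename)  # ~59.3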

Study 2

Because Studies 1 and 2 are represented in a single figure, I thought it was appropriate to satisfy the verification goals for Study 2 first before attempting to reproduce Figure 1.

This process proved relatively straightforward given that there was only one sample to deal with.

Reading and understanding the data

Study2 <- read_excel(here("data&syntax", "Study2.xlsx"))

glimpse(Study2)
## Rows: 359
## Columns: 20
## $ Q36_1          <dbl> 5, 5, 5, 4, 2, 2, 4, 4, 5, 4, 3, 5, 4, 4, 5, 3, 5, 4, 5…
## $ Q36_2          <dbl> 3, 3, 5, 3, 3, 4, 5, 3, 4, 3, 5, 3, 4, 4, 5, 4, 3, 3, 5…
## $ Q36_3          <dbl> 5, 5, 5, 4, 2, 2, 4, 4, 5, 4, 3, 5, 4, 2, 5, 3, 5, 4, 5…
## $ Q36_4          <dbl> 3, 1, 5, 3, 2, 4, 2, 1, 2, 1, 5, 3, 4, 4, 5, 3, 3, 3, 4…
## $ Q36_5          <dbl> 5, 5, 5, 4, 2, 2, 5, 4, 5, 4, 3, 5, 4, 4, 5, 2, 5, 4, 5…
## $ Q36_6          <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ Q36_7          <dbl> 3, 4, 3, 3, 2, 3, 3, 1, 4, 3, 4, 4, 2, 4, 3, 4, 2, 4, 4…
## $ Q22_1          <dbl> 1, 1, 2, 1, 4, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 3, 1…
## $ Q22_2          <dbl> 1, 1, 2, 1, 3, 1, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 5, 1…
## $ Q22_3          <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ Q22_4          <dbl> 1, 1, 2, 1, 4, 1, 1, 1, 1, 5, 1, 1, 1, 1, 2, 1, 1, 2, 1…
## $ Q22_5          <dbl> 1, 1, 1, 1, 2, 1, 1, 2, 1, 1, 5, 1, 1, 1, 1, 1, 1, 1, 1…
## $ Q22_6          <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5…
## $ Q20            <dbl> 3, 1, 1, 1, 1, 4, 1, 1, 1, 1, 1, 4, 1, 1, 1, 1, 1, 1, 1…
## $ Age            <dbl> 27, 36, 26, 32, 29, 32, 37, 23, 27, 32, 34, 28, 57, 44,…
## $ Gender         <dbl> 1, 1, 1, 1, 1, 2, 1, 2, 1, 1, 2, 2, 2, 2, 1, 1, 2, 1, 1…
## $ Household_size <dbl> 1, 4, 1, 1, 1, 2, 3, 3, 3, 4, 3, 1, 1, 3, 1, 5, 2, 2, 4…
## $ ve             <dbl> 5.000000, 5.000000, 5.000000, 4.000000, 2.000000, 2.000…
## $ sd             <dbl> 1.0, 1.0, 1.6, 1.0, 2.8, 1.0, 1.0, 1.4, 1.0, 1.8, 2.0, …
## $ sdR            <dbl> 5.0, 5.0, 4.4, 5.0, 3.2, 5.0, 5.0, 4.6, 5.0, 4.2, 4.0, …

The glimpse revealed some variable names that we were quite familiar with by this point. Perhaps “ve” refers to empathy (as in Study 1) and “sdR” relates to the motivation to physically distance (as in Study 3). We tested these hypotheses by using the summarise() function from the dplyr package that we’ve been using thus far.

The paper reported the following descriptive statistics:

  • Empathy (M = 4.05, SD = 0.94)
  • Physical distancing motivation (M = 4.56, SD = 0.65)
Study2 %>% 
  summarise (mean_emp = mean(ve),
             sd_emp = sd(ve),
             mean_physdist = mean(sdR),
             sd_physdist = sd(sdR))
## # A tibble: 1 × 4
##   mean_emp sd_emp mean_physdist sd_physdist
##      <dbl>  <dbl>         <dbl>       <dbl>
## 1     4.05  0.941          4.56       0.648

Oh my! They match perfectly. We’re finally getting the hang of it. It feels very satisfying to identify the variables correctly on the first try. We were glad to know that we didn’t have to individually run all the items through trial and error.

As usual, we renamed the variables accordingly to minimise confusion later down the track.

Study2_Rename <-Study2 %>% 
  rename(empathy = ve, physdist_motiv = sdR)

Replicating the demographic statistics

The number of participants

This was where we noticed a mismatch between the paper’s reported number of participants and the number that was specified in the supplementary materials. The paper stated that there were 359 participants. In the supplementary materials, the number of participants was recorded as 361 (see image below).

Supplementary materials: Study 2 number of participants

count(Study2_Rename)
## # A tibble: 1 × 1
##       n
##   <int>
## 1   359

Looks like the number reported in the paper itself was the correct one.

The age demographics

The paper reported that the mean age was 29.75 years with a standard deviation of 9.40.

Study2_Age <- Study2_Rename %>% 
  summarise (mean_age = mean(Age),
             sd_age = sd(Age))

print(Study2_Age)
## # A tibble: 1 × 2
##   mean_age sd_age
##      <dbl>  <dbl>
## 1     29.8   9.40

Success!

The gender distribution

The experimenters reported that 48.5% of participants were female.

table(Study2_Rename$Gender)
## 
##   1   2   3 
## 184 174   1
174/359
## [1] 0.4846797

That rounds up to 48.5%. Perfect.

Figure 1

Figure 1 is a line graph that represents data from all 3 samples in Study 1 as well as the single sample in Study 2. As such, our thought as a team was to generate graphs for each of the samples in Studies 1 and 2 and then combine all the graphs together afterwards.

The aesthetic portions of the graph took a considerable amount of Googling to fine tune but since we were more familiar with the ggplot2 package now, we understood where to start and what to Google in order to get the result we desired.

  • We knew from Danielle’s modules that to create a smooth line graph as shown in the paper, geom_smooth() from the ggplot2 package was the appropriate function.
    • method = lm creates a straight line/line of best fit across the different points (i.e., method = linear model)
    • colour = “black” specifies the colour of the line
  • scale_x_continuous() from the ggplot2 package allows me to specify the characteristics of the x axis provided that it is a continuous variable being plotted there
    • name = NULL gets rid of the axis label as that is not present for each individual graph in Figure 1
    • limits = c(1,5) specifies the limits of the axis (starting at 1 and ending at 5)
  • scale_y_continuous() from the ggplot2 package allows me to specify the characteristics of the y axis
  • ggtitle() allows me to specify the characteristics of the plot’s title
    • label = "Study 1:\nSample" specifies the plot title (\n is universal code language for ‘enter’)
  • theme_bw() removes the grey shading that is default behind plots
  • theme() allows me to specify other aesthetic characteristics of the plot
    • panel.grid.major = element_blank() gets rid of the major horizontal grid lines
    • panel.grid.minor = element_blank() gets rid of the minor horizontal grid lines between the numbers on the y-axis
    • plot.title = element_text() allows me to specify the features of the plot’s title “Study 1: German Sample”
      • size = 10 allows me to specify size
      • hjust = 0.5 allows me to centre the heading

This process was repeated for each of the individual graphs.

Study1Ger_plot <- ggplot(data = Study1Ger_Rename, mapping = aes(x = empathy, y = physdist_motiv)) +
  geom_smooth(method = lm, colour = "black") +
  scale_x_continuous(name = NULL, limits = c(1,5)) +                                
  scale_y_continuous(name = NULL, limits = c(1,5)) +
  ggtitle(label = "Study 1:\nGerman Sample" ) + 
  theme_bw() + 
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(), 
        plot.title = element_text(size = 10, hjust = 0.5))

print(Study1Ger_plot)
## `geom_smooth()` using formula 'y ~ x'

Study1UK_plot <- ggplot(data = Study1UK_Rename, mapping = aes(x = empathy, y = physdist_motiv)) +
  geom_smooth(method = lm, colour = "black") +
  scale_x_continuous(name = NULL, limits = c(1,5)) +                                  
  scale_y_continuous(name = NULL, limits = c(1,5)) +
  ggtitle(label = "Study 1:\nUK Sample" ) + 
  theme_bw() + 
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        plot.title = element_text(size = 10, hjust = 0.5))

print(Study1UK_plot)
## `geom_smooth()` using formula 'y ~ x'

Study1USA_plot <- ggplot(data = Study1USA_Rename, mapping = aes(x = empathy, y = physdist_motiv)) +
  geom_smooth(method = lm,  colour = "black") +
  scale_x_continuous(name = NULL, limits = c(1,5)) +                                  
  scale_y_continuous(name = NULL, limits = c(1,5)) +
  ggtitle(label = "Study 1:\nU.S. Sample") + 
  theme_bw() + 
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(), 
        plot.title = element_text(size = 10, hjust = 0.5))

print(Study1USA_plot)
## `geom_smooth()` using formula 'y ~ x'

Study2_plot <- ggplot(data = Study2_Rename, mapping = aes(x = empathy, y = physdist_motiv)) +
  geom_smooth(method = lm, colour = "black") +
  scale_x_continuous(name = NULL, limits = c(1,5)) +                                  
  scale_y_continuous(name = NULL, limits = c(1,5)) +
  ggtitle(label = "Study 2:\nGerman Sample") + 
  theme_bw() + 
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(), 
        plot.title = element_text(size = 10, hjust = 0.5))

print(Study2_plot)
## `geom_smooth()` using formula 'y ~ x'

The difficult part was knowing how to combine the graphs. We did some more Googling and found that the ggarrange() function from the “ggpubr” package could help us satisfy this verification goal (http://www.sthda.com/english/articles/32-r-graphics-essentials/126-combine-multiple-ggplots-in-one-graph/). We also found some articles discussing the “patchwork” package as a useful tool. However, the advantage of the “ggpubr” package was that it contained an annotate_figure() function which allowed us to add shared axis labels once the graphs were combined. Because these were necessary for Figure 1, we decided to go with the “ggpubr” package.

#load package
library(ggpubr)

#Attempt 1
Figure1 <- ggarrange(Study1USA_plot, Study1UK_plot, Study1Ger_plot, Study2_plot, #ggarrange() helps combine multiple plots and arrange them in a specified configuration
          ncol=4) #this specifies how the plots are arranged (4 columns)
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'

Goodness me! It worked! I can finalise the code to include all the aesthetic adjustments now.

Figure1 <- ggarrange(Study1USA_plot, Study1UK_plot, Study1Ger_plot, Study2_plot,
                     ncol= 4) #4 columns
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
Figure1_Final <- annotate_figure(Figure1, #annotate_figure() from the ggpubr package to add aesthetics
                                 left = text_grob("Physical Distancing", #text_grob() function from the ggpubr package adds text into the graph on the left
                                                  rot = 90), #this rotates the title of the graph 90 degrees
                                 bottom = text_grob("Empathy")) #text_grob() function used to add text into the graph on the bottom

print(Figure1_Final)

And with that, we’ve completed the verification goals for Studies 1 and 2.

Study 4

Reading and understanding the data

Study4 <- read_excel(here("data&syntax","Study4.xlsx"))

glimpse(Study4)
## Rows: 1,526
## Columns: 19
## $ QID90_1        <dbl> 1, 3, 1, 2, 2, 1, 5, 1, 1, 5, 5, 3, 1, 5, 4, 3, 4, 3, 1…
## $ QID90_2        <dbl> 1, 3, 1, 2, 2, 1, 4, 1, 1, 4, 5, 3, 1, 4, 3, 3, 4, 3, 1…
## $ QID90_3        <dbl> 1, 3, 1, 2, 2, 1, 4, 1, 1, 4, 5, 3, 1, 4, 3, 3, 4, 3, 2…
## $ Q22_1          <dbl> 5, 4, 4, 3, 5, 1, 5, 4, 4, 5, 5, 2, 5, 4, 4, 2, 5, 4, 5…
## $ Q92_1          <dbl> 4, 3, 5, 2, 5, 3, 2, 4, 3, 5, 5, 2, 4, 4, 3, 3, 5, 3, 5…
## $ Q92_2          <dbl> 1, 3, 1, 3, 1, 5, 1, 2, 2, 1, 1, 3, 1, 4, 3, 3, 1, 3, 1…
## $ Q92_3          <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5…
## $ Q100_1         <dbl> 2, 2, 3, 2, 3, 1, 2, 3, 4, 2, 4, 3, 3, 4, 3, 2, 3, 2, 4…
## $ Q100_2         <dbl> 3, 3, 4, 3, 3, 1, 3, 3, 4, 3, 4, 3, 3, 3, 4, 4, 3, 3, 4…
## $ Q61            <dbl> 1, 1, 2, 2, 1, 1, 2, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 1, 2…
## $ Q20            <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ Age            <dbl> 30, 52, 39, 29, 46, 20, 31, 43, 40, 18, 29, 33, 27, 28,…
## $ Gender         <dbl> 2, 2, 2, 1, 2, 2, 1, 2, 1, 1, 2, 2, 1, 1, 1, 1, 2, 1, 2…
## $ Household_size <dbl> 1, 1, 3, 1, 1, 2, 3, 1, 4, 5, 2, 4, 2, 5, 2, 2, 2, 2, 5…
## $ Q28            <chr> "Gulasch", "Nudeln", "Nudelsalat", "Nudelgratin", "Sala…
## $ bed            <dbl> 1, 1, 1, 0, 1, 0, 2, 1, 1, 2, 2, 0, 1, 2, 0, 2, 2, 1, 0…
## $ empa           <dbl> 1.000000, 3.000000, 1.000000, 2.000000, 2.000000, 1.000…
## $ Q92_2r         <dbl> 5, 3, 5, 3, 5, 1, 5, 4, 4, 5, 5, 3, 5, 2, 3, 3, 5, 3, 5…
## $ maskePo        <dbl> 4.5, 3.0, 5.0, 2.5, 5.0, 2.0, 3.5, 4.0, 3.5, 5.0, 5.0, …

By this point in the verification process, we had established a consistent workflow such that we knew how to approach and achieve our verification goals.

  • First, we consulted the supplementary materials to see if there was any information that could assist us in determining which variable name corresponded with our variables of interest (motivation to wear a face mask and empathy scores). This showed us that “Q22_1” was used to operationalise “motivation to wear a face mask” and that “bed” represented the condition participants were assigned to (1 = Control condition, 2 = Empathy condition). However, the materials did not specify what “0” meant in the “bed” variable.
  • Second, we consulted Jamovi and Google Translate to clarify the information found in the supplementary materials and fill in gaps in our understanding. Because this study was conducted in a German sample, the questions were all in German, so Google Translate was once again our best friend.
    • “Q22_1” translated to “People behave differently in times of the corona virus. Please indicate how likely you think the following statement is. In the next few days I will … - … wear a mouth and nose protector as often as possible when I meet other people”
    • For the “bed” variable, Jamovi confirmed that “0” indicated Information Only, “1” was Control and “2” was Empathy.
    • We still weren’t sure exactly which variable empathy scores were coded under but from previous studies, we decided to test if it was “empa”.

The reported descriptive statistics for “empathy” in the paper were:

  • Information Only (M = 2.14, SD = 1.00)
  • Control (M = 2.10, SD = 1.01)
  • Empathy (M = 4.03, SD = 0.90)
#First I wanted to rename "bed" to "condition" so that the dataset made more intuitive sense:
Study4_Rename <- Study4 %>% 
  rename(condition = "bed")

#Testing "empa"
Study4_Rename %>% 
  group_by(condition) %>% 
  summarise (mean_resp = mean(empa), 
             sd_resp = sd(empa)) %>% 
  ungroup()
## # A tibble: 3 × 3
##   condition mean_resp sd_resp
##       <dbl>     <dbl>   <dbl>
## 1         0      2.14   1.00 
## 2         1      2.10   1.01 
## 3         2      4.03   0.900

It looks like “empa” was used to operationalise empathy scores. I also wanted to verify that “Q22_1” was indeed the motivation to wear a face mask given that the supplementary materials can be misleading at times. I used the same summarise() function to do this.

The reported descriptive statistics for “motivation to wear a face mask” in the paper were:

  • Information Only (M = 3.83, SD = 1.20)
  • Control (M = 3.69, SD = 1.24)
  • Empathy (M = 4.00, SD = 1.12)
Study4_Rename %>% 
  group_by(condition) %>% 
  summarise (mean_resp = mean(Q22_1), 
             sd_resp = sd(Q22_1)) %>% 
  ungroup()
## # A tibble: 3 × 3
##   condition mean_resp sd_resp
##       <dbl>     <dbl>   <dbl>
## 1         0      3.83    1.20
## 2         1      3.69    1.24
## 3         2      4.00    1.12

This looks right. We weren’t misled by the supplementary materials this time! A small win.

To clarify variable labels in the dataset, we used the rename() function from the dplyr package to rename these variables to something more intuitive.

Study4_Rename <- Study4_Rename %>% 
  rename(empathy = "empa",
         mask_motiv = "Q22_1")

Time to replicate the demographic statistics.

Replicating the demographic statistics

The number of participants

The paper reported n = 1526.

count(Study4_Rename)
## # A tibble: 1 × 1
##       n
##   <int>
## 1  1526

Perfect!

The age demographics

The mean age was reported to be 34.71 years old with a standard deviation of 12.09.

Study4_Age <- Study4_Rename %>% 
  summarise (mean_age = mean(Age),
             sd_age = sd(Age))

print(Study4_Age)
## # A tibble: 1 × 2
##   mean_age sd_age
##      <dbl>  <dbl>
## 1     34.7   12.1

Great!

The gender distribution

According to Jamovi, “1” corresponds with male and “2” with female (same as previous studies).

The sample was said to be 47.2% female.

table(Study4_Rename$Gender)
## 
##   1   2 
## 796 720
720/1526 #1526 being the total number of participants
## [1] 0.4718218

This checks out.

Now that we have completed verifying the demographic and descriptive statistics of this sample, it’s time to reproduce the last figure of the paper. What a journey we’ve been on so far…

Replicating Figure 3

Upon inspection, this figure is really similar to Figure 2, so the process for reproducing it should be much the same. Below you will see me replicate the process used for Figure 2 but with data from Study 4 in place of Study 3.

  • First, I used the mutate() function to create a new variable which textually represented which condition participants were assigned to. I named this variable “conditionlabel” and then used the case_when() function from the dplyr package to create a variable based on an existing variable (“condition”). I assigned this to a new object called “Study4_Conditions”.
  • Second, I used the ggplot2 package to generate a figure and specify the input data.
  • The rest is almost identical but I have annotated the code below to explain line-by-line.
#Creating a new variable with condition labels
Study4_Conditions <- Study4_Rename %>% 
  mutate (conditionlabel = case_when(condition == 0 ~ "Information Only",
                                     condition == 1 ~ "Control",
                                     condition == 2 ~ "Empathy"))

#Generating the plot
Study4_plot <- ggplot(data = Study4_Conditions, 
                     mapping = aes(x = reorder(conditionlabel, mask_motiv), y = mask_motiv)) + 
  geom_violin(aes(fill = conditionlabel)) + # from ggplot2 pckg: colour fills each violin plot differently according to the conditionlabel variable
  geom_boxplot(width = 0.1) + # from ggplot2 pckg: adds a boxplot layer and specifies the width of the plot as 0.1 of the violin plot
  theme_bw()+ # from ggplot2 pckg: this gets rid of the grey background in a default ggplot graph
  theme(panel.grid.major = element_blank(), # theme() from ggplot2 pckg: this gets rid of the horizontal grid lines at each of the y-axis numbers
        panel.grid.minor = element_blank(), # this gets rid of the horizontal grid lines halfway between the y-axis numbers
        legend.position = "none") + # this removes the legend
  scale_fill_manual(values = c("#A7CFE5", "#B3D88B", "#1779B6"))+ # from ggplot2 pckg: this allows me to manually specify the fill colours of each violin plot
  scale_y_continuous(name = "Physical Distancing Motivation") + # from ggplot2 pckg: this specifies the y-axis label as "Physical Distancing Motivation", given that it is a contunous variable
  scale_x_discrete(name = NULL) + # from ggplot2 pckg: this removes the x-axis label, given that it is a discrete variable
  stat_summary(fun = "mean", shape = 9) # from ggplot2 pckg: this plots the mean onto the graph as a specified shape ("9")

print(Study4_plot)
## Warning: Removed 3 rows containing missing values (geom_segment).

Success! After reproducing this figure on the first attempt, I could tangibly see how much I had improved at R. It was really about consulting as many resources as possible (including but not limited to Stack Overflow forums, ALL the available files in the OSF provided, Google Translate, various coding blogs) and trying out different functions until I came across one that consistently did what I wanted it to.

Part 3: Exploration

As part of my exploratory analyses I will be using the skills I have acquired over the verification process to ask questions of the data.

Q1: How does household size impact empathy and motivation to physically distance? (Study 2)

This question arose out of curiosity about whether living with more people would increase one’s sensitivity to the possibility of passing the illness on to others and thus impact one’s motivation to physically distance. Personally, my primary worry when it comes to the virus is passing it on to others, particularly in the household, rather than my personal vulnerability. Therefore, I thought it would be interesting to explore this relationship.

I also wondered if constantly being surrounded by more people in a large household would impact one’s affective empathy. An article by Promsri (2019) [“Does A Family Size Have an Effect on Empathy Level?: Filling the Gap in Empathy Literature”] indicated that children with larger family sizes exhibited higher levels of empathetic skill, possibly due to family socialisation. I was interested to see if this trend could be observed in the Study 2 sample.

First, I wanted to generate a frequency table to gauge the range of household sizes that were recorded. This was the same process we used to verify the gender distribution in Part 2 of this report.

table(Study2_Rename$Household_size) 
## 
##   1   2   3   4   5   6   7  44 
##  67 130  76  52  23   7   3   1

There was a household size of FORTY-FOUR?? Interesting…

I wanted to generate descriptive statistics to see if I could spot a pattern in the numbers.

Study2_HouseholdSummary <- Study2_Rename %>% 
  group_by(Household_size) %>% 
  summarise (mean_emp = mean(empathy),
             mean_physdist = mean(physdist_motiv))

print(Study2_HouseholdSummary)
## # A tibble: 8 × 3
##   Household_size mean_emp mean_physdist
##            <dbl>    <dbl>         <dbl>
## 1              1     3.95          4.44
## 2              2     4.08          4.57
## 3              3     4.12          4.59
## 4              4     4.04          4.56
## 5              5     4             4.63
## 6              6     3.81          4.83
## 7              7     4.11          4.6 
## 8             44     5             4.4

The numbers all seem pretty similar… Let’s try to represent this visually through a graph.

This is where I encountered some issues:

  • I wanted to graph both the empathy means and the physical distancing means on the same column graph but I had no idea how to do so. Then I remembered the pivot_longer() function (from the tidyr package) introduced in Danielle’s coding modules which allowed us to pivot a table longer such that the values were all in one column and the variable labels were all in another. This would allow me to graph all the means on the y-axis, all the household sizes on the x-axis and use the aesthetic mappings to specify the fill colour according to the variable label.
  • I wasn’t familiar with the geom_col() function from the ggplot2 package so I had to Google “how to graph different variables on geomcol()”. This website came up: https://stackoverflow.com/questions/42820677/ggplot-bar-plot-side-by-side-using-two-variables introducing position = “dodge” as a solution. This worked! However, I wanted error bars on my graph to give a clearer depiction of the differences between each column. After some Googling I realised I would need the standard deviations of each group as well.
  • Little did I know that this was going to be quite difficult.
    • First, I used the summarise() function to generate the means and standard deviations of each group (sorted by household size).
    • Then, I tried to use the pivot_longer() function to generate a table which had 5 columns: 1 for the household size, 1 for the mean values, 1 for the mean labels, 1 for the standard deviation values and 1 for the standard deviation labels. However, when I tried to pivot the means and standard deviations separately, the standard deviation values and labels didn’t correspond to the correct mean values and labels (see the second row of the tibble for household size = 1: the mean for empathy was listed in the same row as the standard deviation for physical distancing).
    • I realised that I would have to create the tibbles separately and join them together, using an organising variable that assigned a unique number to each row (similar to the ID variable Danielle had in her dataset).
#Graphing attempt 1
ggplot(data = Study2_HouseholdSummary,
       mapping = aes(x = factor(Household_size))) +
  geom_col(aes(y = mean_emp)) #I was a bit stuck as to how to superimpose two column graphs on top of each other. Then, I thought to use the pivot() function

Study2_HouseholdSummary2 <- Study2_HouseholdSummary %>% 
  pivot_longer(cols = c(mean_emp, mean_physdist),
             names_to = "category",
             values_to = "means")

print(Study2_HouseholdSummary2)
## # A tibble: 16 × 3
##    Household_size category      means
##             <dbl> <chr>         <dbl>
##  1              1 mean_emp       3.95
##  2              1 mean_physdist  4.44
##  3              2 mean_emp       4.08
##  4              2 mean_physdist  4.57
##  5              3 mean_emp       4.12
##  6              3 mean_physdist  4.59
##  7              4 mean_emp       4.04
##  8              4 mean_physdist  4.56
##  9              5 mean_emp       4   
## 10              5 mean_physdist  4.63
## 11              6 mean_emp       3.81
## 12              6 mean_physdist  4.83
## 13              7 mean_emp       4.11
## 14              7 mean_physdist  4.6 
## 15             44 mean_emp       5   
## 16             44 mean_physdist  4.4
#Graphing attempt 2
ggplot(data = Study2_HouseholdSummary2,
       mapping = aes(x = factor(Household_size))) +
  geom_col(aes(y = means, fill = category),
           position = "dodge") #I want to add error bars (this requires the sd)

Study2_HouseholdSummary2 <- Study2_Rename %>% 
  group_by(Household_size) %>% 
  summarise (mean_emp = mean(empathy),
             sd_emp = sd(empathy),
             mean_physdist = mean(physdist_motiv),
             sd_physdist = sd(physdist_motiv))

print(Study2_HouseholdSummary2)
## # A tibble: 8 × 5
##   Household_size mean_emp sd_emp mean_physdist sd_physdist
##            <dbl>    <dbl>  <dbl>         <dbl>       <dbl>
## 1              1     3.95  0.985          4.44       0.767
## 2              2     4.08  0.893          4.57       0.670
## 3              3     4.12  0.956          4.59       0.512
## 4              4     4.04  0.988          4.56       0.660
## 5              5     4     0.853          4.63       0.664
## 6              6     3.81  1.44           4.83       0.214
## 7              7     4.11  0.694          4.6        0.4  
## 8             44     5    NA              4.4       NA
Study2_HouseholdSummary3 <- Study2_HouseholdSummary2 %>% 
  pivot_longer(cols = c(mean_emp, mean_physdist),
               names_to = "category_mean",
               values_to = "means") %>% 
  pivot_longer(cols = c(sd_emp, sd_physdist),
               names_to = "category_sd",
               values_to = "sd") #this doesn't work numbers don't match up; will need to do these tables separately and then join them

print(Study2_HouseholdSummary3)
## # A tibble: 32 × 5
##    Household_size category_mean means category_sd    sd
##             <dbl> <chr>         <dbl> <chr>       <dbl>
##  1              1 mean_emp       3.95 sd_emp      0.985
##  2              1 mean_emp       3.95 sd_physdist 0.767
##  3              1 mean_physdist  4.44 sd_emp      0.985
##  4              1 mean_physdist  4.44 sd_physdist 0.767
##  5              2 mean_emp       4.08 sd_emp      0.893
##  6              2 mean_emp       4.08 sd_physdist 0.670
##  7              2 mean_physdist  4.57 sd_emp      0.893
##  8              2 mean_physdist  4.57 sd_physdist 0.670
##  9              3 mean_emp       4.12 sd_emp      0.956
## 10              3 mean_emp       4.12 sd_physdist 0.512
## # … with 22 more rows
  • I used the mutate() function to create a variable named “ID” which numbered each row from 1 to 16. This allowed me to use the left_join() function from the dplyr package to specify “ID” as the organising variable when joining the two tibbles together. This ensured that each mean and standard deviation corresponded with the correct measure (empathy/physical distancing motivation) and household size.

You can see my excitement in the code annotations below.

Study2_HouseholdSummary3 <- Study2_HouseholdSummary2 %>% 
  pivot_longer(cols = c(mean_emp, mean_physdist),
               names_to = "category_mean",
               values_to = "means") %>% 
  mutate(ID = 1:16) #https://statisticsglobe.com/assign-unique-id-for-each-group-in-r

Study2_HouseholdSummary4 <- Study2_HouseholdSummary2 %>% 
  pivot_longer(cols = c(sd_emp, sd_physdist),
               names_to = "category_sd",
               values_to = "sd") %>% 
  mutate(ID = 1:16)

Study2_HouseholdJoin <- Study2_HouseholdSummary4 %>% 
  left_join(Study2_HouseholdSummary3, by = "ID") 

print(Study2_HouseholdJoin) #omg it works ok lets see if i can make it into a graph
## # A tibble: 16 × 11
##    Household_size.x mean_emp mean_physdist category_sd     sd    ID
##               <dbl>    <dbl>         <dbl> <chr>        <dbl> <int>
##  1                1     3.95          4.44 sd_emp       0.985     1
##  2                1     3.95          4.44 sd_physdist  0.767     2
##  3                2     4.08          4.57 sd_emp       0.893     3
##  4                2     4.08          4.57 sd_physdist  0.670     4
##  5                3     4.12          4.59 sd_emp       0.956     5
##  6                3     4.12          4.59 sd_physdist  0.512     6
##  7                4     4.04          4.56 sd_emp       0.988     7
##  8                4     4.04          4.56 sd_physdist  0.660     8
##  9                5     4             4.63 sd_emp       0.853     9
## 10                5     4             4.63 sd_physdist  0.664    10
## 11                6     3.81          4.83 sd_emp       1.44     11
## 12                6     3.81          4.83 sd_physdist  0.214    12
## 13                7     4.11          4.6  sd_emp       0.694    13
## 14                7     4.11          4.6  sd_physdist  0.4      14
## 15               44     5             4.4  sd_emp      NA        15
## 16               44     5             4.4  sd_physdist NA        16
## # … with 5 more variables: Household_size.y <dbl>, sd_emp <dbl>,
## #   sd_physdist <dbl>, category_mean <chr>, means <dbl>
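
In hindsight, there is a tidyr shortcut that avoids the join entirely: pivot_longer() has a special “.value” sentinel that keeps the means and standard deviations correctly paired in a single call. A minimal sketch using the same summary tibble as above:

Study2_HouseholdLong <- Study2_HouseholdSummary2 %>% 
  pivot_longer(cols = -Household_size, #reshape every column except Household_size
               names_to = c(".value", "measure"), #".value" keeps "mean" and "sd" as their own columns
               names_sep = "_") #splits "mean_emp" into "mean" + "emp", "sd_physdist" into "sd" + "physdist", etc.

This yields one row per household size and measure (“emp”/“physdist”) with correctly matched mean and sd columns - the same pairing the mutate()/left_join() detour achieved.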

I was finally up to plotting this tibble. The basic components of the plot were specified as normal i.e., using <- to assign a name to the object, specifying the input data and the aesthetic mappings using the ggplot() function etc.

  • alpha = 0.9 was used to specify the opacity of the columns. By adding a slight transparency, I could see the horizontal grid lines through the columns, assisting the legibility of the graph.
  • geom_errorbar was used to generate the error bars
    • ymin = means - sd, ymax = means + sd specified the bounds of the error bar
    • group = category_mean specified the grouping variable for each pair of columns (this gave the position = “dodge” function something to “dodge” from)
    • width = 0.25 specifies the width of the error bar relative to the columns
  • scale_fill_manual was used, as before, to specify the colours of the columns (I wanted to stick to the theme of the paper)
    • name = “Measure” specifies “Measure” as the title for the legend
    • labels = c(“Empathy”, “Physical Distancing”) specifies “Empathy” and “Physical Distancing” as the labels for each colour in the legend respectively
  • The other functions were the same as those used previously in other figures
#Graphing attempt 3
expplot1_3 <- ggplot(data = Study2_HouseholdJoin,
                    mapping = aes(x = factor(Household_size.x))) +
  geom_col(aes(y = means, fill = category_mean),
           position = "dodge",
           alpha = 0.9) +
  geom_errorbar(aes(ymin = means - sd, ymax = means + sd, group = category_mean),
                position = position_dodge(0.9),
                width = 0.25) + # the position_dodge() function required some tinkering but group = category_mean solved it because it gave it a position to dodge from
  scale_fill_manual(values = c("#A7CFE5", "#1779B6"),
                    name = "Measure",
                    labels = c("Empathy", "Physical Distancing")) +
  scale_x_discrete(name = "Household Size") +
  scale_y_continuous(name = "Means") +
  theme_bw()

print(expplot1_3)

It looks like there is little association between household size and either self-reported empathy or motivation to physically distance - not really worth doing a statistical test. (The household of 44 is a single participant - hence the missing standard deviation - so its columns shouldn’t be read into.)

Q2: How do levels of empathy and the motivation to physically distance interact with age?

As we get older, perhaps there is a greater chance that empathy impacts our behaviour as we rely more on others to look after us and our wellbeing. Furthermore, older participants may have higher empathy levels due to the accumulation of social interactions over the years. Younger age groups might be more independent and self-motivated because they are just entering adulthood and are looking to forge their own paths. Alternatively, perhaps younger age groups are more emotionally driven in their behaviour and high levels of empathy might actually increase motivation to physically distance.

For this question, I wanted to create a graph which could visually represent these relationships before seeing if it is worth doing any statistical tests to verify if there are any differences that approach statistical significance.

To manage the large range of ages recorded, I decided to create a new variable that grouped age groups together.

  • Firstly, I wanted to generate a frequency table to see what age range I was working with.
  • Then, I wanted to create a new variable that yields a specific output if a particular condition is satisfied. I did some Googling and found a presentation (https://rstudio-pubs-static.s3.amazonaws.com/116317_e6922e81e72e4e3f83995485ce686c14.html#/6) which made the process quite clear.
    • The ifelse() function allows me to specify which variable I want the new variable to be based upon (“Age”) and the range of values I want to select from that variable (%in% 18:28 - %in% tests whether an element belongs to a vector). I can then assign a specific label “18-28” to the new variable “age_bracket” whenever that condition is satisfied. This was done for all age brackets. (Note that neighbouring brackets share a boundary age - e.g., 28 appears in both 18:28 and 28:38 - but because the nested ifelse() calls evaluate in order, a boundary age always lands in the earlier bracket.)
table(Study2_Rename$Age) #looks like 18 is the youngest and 67 is the oldest; I will generate new age categories (18-28), (28-38), (38-48), (48-58), (58-68)
## 
## 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 
##  9  8 13 14 27 28 20 27 19 21 15 20 16 13 14 10  5 11 10  4  1  7  3  6  2  4 
## 44 45 47 48 50 52 53 54 55 56 57 58 60 67 
##  4  2  1  1  2  1  3  2  2  6  3  1  3  1
Study2_Age <- Study2_Rename %>% 
  mutate(age_bracket = ifelse(Age %in% 18:28, "18-28", 
                              ifelse(Age %in% 28:38, "28-38", 
                                     ifelse(Age %in% 38:48, "38-48", 
                                            ifelse(Age %in% 48:58, "48-58",
                                                   ifelse(Age %in% 58:68, "58-68", "NA")))))) 

glimpse(Study2_Age) #looks like it worked!!
## Rows: 359
## Columns: 21
## $ Q36_1          <dbl> 5, 5, 5, 4, 2, 2, 4, 4, 5, 4, 3, 5, 4, 4, 5, 3, 5, 4, 5…
## $ Q36_2          <dbl> 3, 3, 5, 3, 3, 4, 5, 3, 4, 3, 5, 3, 4, 4, 5, 4, 3, 3, 5…
## $ Q36_3          <dbl> 5, 5, 5, 4, 2, 2, 4, 4, 5, 4, 3, 5, 4, 2, 5, 3, 5, 4, 5…
## $ Q36_4          <dbl> 3, 1, 5, 3, 2, 4, 2, 1, 2, 1, 5, 3, 4, 4, 5, 3, 3, 3, 4…
## $ Q36_5          <dbl> 5, 5, 5, 4, 2, 2, 5, 4, 5, 4, 3, 5, 4, 4, 5, 2, 5, 4, 5…
## $ Q36_6          <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ Q36_7          <dbl> 3, 4, 3, 3, 2, 3, 3, 1, 4, 3, 4, 4, 2, 4, 3, 4, 2, 4, 4…
## $ Q22_1          <dbl> 1, 1, 2, 1, 4, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 3, 1…
## $ Q22_2          <dbl> 1, 1, 2, 1, 3, 1, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 5, 1…
## $ Q22_3          <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ Q22_4          <dbl> 1, 1, 2, 1, 4, 1, 1, 1, 1, 5, 1, 1, 1, 1, 2, 1, 1, 2, 1…
## $ Q22_5          <dbl> 1, 1, 1, 1, 2, 1, 1, 2, 1, 1, 5, 1, 1, 1, 1, 1, 1, 1, 1…
## $ Q22_6          <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5…
## $ Q20            <dbl> 3, 1, 1, 1, 1, 4, 1, 1, 1, 1, 1, 4, 1, 1, 1, 1, 1, 1, 1…
## $ Age            <dbl> 27, 36, 26, 32, 29, 32, 37, 23, 27, 32, 34, 28, 57, 44,…
## $ Gender         <dbl> 1, 1, 1, 1, 1, 2, 1, 2, 1, 1, 2, 2, 2, 2, 1, 1, 2, 1, 1…
## $ Household_size <dbl> 1, 4, 1, 1, 1, 2, 3, 3, 3, 4, 3, 1, 1, 3, 1, 5, 2, 2, 4…
## $ empathy        <dbl> 5.000000, 5.000000, 5.000000, 4.000000, 2.000000, 2.000…
## $ sd             <dbl> 1.0, 1.0, 1.6, 1.0, 2.8, 1.0, 1.0, 1.4, 1.0, 1.8, 2.0, …
## $ physdist_motiv <dbl> 5.0, 5.0, 4.4, 5.0, 3.2, 5.0, 5.0, 4.6, 5.0, 4.2, 4.0, …
## $ age_bracket    <chr> "18-28", "28-38", "18-28", "28-38", "28-38", "28-38", "…

I was very excited to see that it worked!
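
As an aside, base R’s cut() could have built the same brackets without the nested ifelse() calls. A minimal sketch (cut() uses right-closed intervals by default, so boundary ages like 28 land in “18-28”, matching the ordering of the ifelse() chain above):

Study2_Age_alt <- Study2_Rename %>% 
  mutate(age_bracket = cut(Age,
                           breaks = c(17, 28, 38, 48, 58, 68), #six interval edges define five brackets
                           labels = c("18-28", "28-38", "38-48", "48-58", "58-68"))) #one label per bracket; cut() returns a factor rather than a character vector

Now, onto graphing this relationship.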

  • First, I generated summary tables, grouping by “age_bracket” and used the same process as above to pivot the table longer such that I could represent “empathy” and “physical distancing” as two different colours on a column graph.
  • The code for the column graph is also essentially the same as in Q1 except that there are no aesthetic adjustments and no error bars, as I wanted to gauge whether this was the appropriate graph for addressing the research question.
Study2_AgeSummary <- Study2_Age %>% 
  group_by(age_bracket) %>% 
  summarise(mean_emp = mean(empathy),
            mean_physdist = mean(physdist_motiv)) %>% 
  ungroup()

Study2_AgeSummary2 <- Study2_AgeSummary %>% 
  pivot_longer(cols = c(mean_emp, mean_physdist),
               names_to = "category_mean",
               values_to = "means")

print(Study2_AgeSummary2)
## # A tibble: 10 × 3
##    age_bracket category_mean means
##    <chr>       <chr>         <dbl>
##  1 18-28       mean_emp       4.03
##  2 18-28       mean_physdist  4.47
##  3 28-38       mean_emp       3.96
##  4 28-38       mean_physdist  4.67
##  5 38-48       mean_emp       4.37
##  6 38-48       mean_physdist  4.63
##  7 48-58       mean_emp       4.13
##  8 48-58       mean_physdist  4.83
##  9 58-68       mean_emp       4.75
## 10 58-68       mean_physdist  4.35
expplot2col <- ggplot(data = Study2_AgeSummary2,
                     mapping = aes(x = age_bracket, y = means)) +
  geom_col(aes(fill = category_mean),
           position = "dodge",
           alpha = 0.9) 

expplot2col 

I felt that the graph didn’t really say much about the specific interaction between physical distancing motivation and empathy across the different age groups.

At this point, I had been coding for several hours. My previous hiccups with R were always made better by a pair of fresh eyes, so I decided to leave this for the next day.

Refreshed, I decided to take a completely different angle. Rather than limiting the input data to the means only, I tried to plot the entire dataset (“Study2_Age”) to look at the relationship between physical distancing and empathy and how this might differ between age groups. Thus, I decided to plot physical distancing on the x-axis, empathy on the y-axis and use the age bracket as different coloured lines on the graph.

The following code is a mix of code used to reproduce Figure 1 and Figures 2 and 3 from the original paper.

  • The geom_smooth() function was used in Figure 1 as well to produce smooth line graphs. The rest of the code is almost identical to that used for Figure 1 except that the “colour” of the lines are specified according to the variable “age_bracket”.
  • The use of “colour” or “fill” is reminiscent of the aesthetic mappings used to make each violin a different colour in Figures 2 and 3.
expplot2smooth <- ggplot(data = Study2_Age,
                         mapping = aes(x = physdist_motiv)) +
  geom_smooth(aes(y = empathy, colour = age_bracket),
              method = lm,
              se = FALSE) +
  scale_colour_discrete(name = "Age (Years)") +
  scale_x_continuous(name = "Empathy", limits = c(1,5)) +
  scale_y_continuous(name = "Physical Distancing Motivation", limits = c(1,5)) +
  theme_bw()

expplot2smooth
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 13 rows containing missing values (geom_smooth).

The resultant graph was very interesting indeed. Before jumping to any conclusions, I wanted to revisit the sample sizes of each age-bracket (from memory, it was positively skewed).

table(Study2_Age$age_bracket)
## 
## 18-28 28-38 38-48 48-58 58-68 
##   201   104    30    20     4

Very much positively skewed. Because the sample sizes for the older age brackets were quite small, the trend observed in the graph may be attributed to individual differences rather than any real trend in the population. An interesting graph nonetheless and perhaps something to explore in future studies…

To check whether any of these differences reached statistical significance, I decided to conduct some statistical tests. Because I had never done this in R before, I needed to do some Googling. This website was helpful: https://www.scribbr.com/statistics/anova-in-r/

  • First, I needed to load an additional package.
library(broom) #this takes the messy output of built-in functions in R such as aov() and turns them into tidy and more easily interpretable tibbles
  • I decided to first run two one-way ANOVAs to determine whether there were any significant differences in empathy between the different age brackets or in physical distancing between the different age brackets. An ANOVA was appropriate as I needed to compare means between more than two groups.
  • Because the one-way ANOVA testing the effect of age on physical distancing yielded p < .05, I decided to run a post-hoc Tukey HSD test to work out which specific comparison was significant.
one.way_empa <- aov(empathy ~ age_bracket, data = Study2_Age) #aov() conducts an ANOVA testing the effect of age on empathy using the data from "Study2_Age"

summary(one.way_empa) #summary() prints the results in an interpretable table; since p>.05 can't reject the null hypothesis
##              Df Sum Sq Mean Sq F value Pr(>F)
## age_bracket   4   5.94  1.4854   1.689  0.152
## Residuals   354 311.39  0.8796
one.way_pd <- aov(physdist_motiv ~ age_bracket, data = Study2_Age) #aov() conducts an ANOVA testing the effect of age on physical distancing motivation using the data from "Study2_Age"

summary(one.way_pd) #since p<.05 can reject the null; let's run some post-hoc tests
##              Df Sum Sq Mean Sq F value Pr(>F)  
## age_bracket   4    4.6  1.1495   2.788 0.0264 *
## Residuals   354  146.0  0.4123                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
tukey_pd <- TukeyHSD(one.way_pd) #TukeyHSD() conducts a Tukey HSD test using the data from the "one.way_pd" ANOVA test

tukey_pd #this prints the results
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = physdist_motiv ~ age_bracket, data = Study2_Age)
## 
## $age_bracket
##                    diff         lwr       upr     p adj
## 28-38-18-28  0.19672790 -0.01593831 0.4093941 0.0850077
## 38-48-18-28  0.15800995 -0.18658635 0.5026062 0.7175504
## 48-58-18-28  0.36134328 -0.05146308 0.7741496 0.1176557
## 58-68-18-28 -0.11865672 -1.00767805 0.7703646 0.9961625
## 38-48-28-38 -0.03871795 -0.40358813 0.3261522 0.9984310
## 48-58-28-38  0.16461538 -0.26525987 0.5944906 0.8317528
## 58-68-28-38 -0.31538462 -1.21245907 0.5816898 0.8711085
## 48-58-38-48  0.20333333 -0.30491113 0.7115778 0.8080751
## 58-68-38-48 -0.27666667 -1.21382314 0.6604898 0.9276394
## 58-68-48-58 -0.48000000 -1.44432606 0.4843261 0.6506354
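
As an aside, since we loaded broom earlier, its tidy() function (which has methods for both aov and TukeyHSD objects) could turn these printouts into tibbles - a minimal sketch:

tidy(one.way_pd) #one row per term with df, sumsq, meansq, statistic and p.value columns
tidy(tukey_pd) #one row per pairwise contrast with the estimate, confidence bounds and adjusted p-value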

However, none of the individual comparisons had p < .05. This may be because the numbers of participants in each age bracket were drastically unequal - the older age brackets have less power due to their smaller samples.

I decided to run one last statistical test to check if there is a significant interaction between age bracket and empathy which ultimately impacts the motivation to physically distance as I speculated previously.

interaction_empd <- aov(physdist_motiv ~ age_bracket*empathy, data = Study2_Age) #aov() conducts an ANOVA testing the effect of the interaction between age and empathy on physical distancing motivation using the data from "Study2_Age"

summary(interaction_empd) 
##                      Df Sum Sq Mean Sq F value   Pr(>F)    
## age_bracket           4   4.60   1.149   3.186   0.0137 *  
## empathy               1  15.20  15.203  42.133 2.93e-10 ***
## age_bracket:empathy   4   4.82   1.206   3.343   0.0105 *  
## Residuals           349 125.93   0.361                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Given the statistically significant interaction (p < .05), it appears that the relationship between age and physical distancing differs by one’s level of empathy, and that the effect of age on physical distancing should not be considered without empathy scores in mind. This is in line with what the paper suggests: that empathy is the variable of interest when asking what influences motivation to physically distance.

Overall, the results are largely inconclusive. There are no significant comparisons between age brackets, but there do seem to be some promising findings to base future research on (namely, how age might interact with empathy to impact physical distancing).

Q3: Is there a difference in the relationship between one’s objective vulnerability and motivation to wear a face mask versus one’s subjective vulnerability and motivation to wear a face mask?

As a psychology student, one thing that is often reiterated is that it is often our perception of harm/risk/the world around us that influences our choices rather than the reality of our surroundings. I wanted to see if this was the case even when making important decisions regarding the health of ourselves and others. Although the experimenters reported that neither of these factors significantly moderated the relationship between empathy and motivation to wear a face mask, I thought it would be interesting to see whether there is a difference in their relationship.

To briefly recap, subjective vulnerability was coded on a 5-point Likert scale in which 1 represented the lowest and 5 represented the highest perceived personal harm that could come from infection with COVID-19. Objective vulnerability was a binary yes/no variable and was determined according to the criteria of the Robert Koch Institute. According to Jamovi, “High-risk” was coded as 1 and “Low-risk” was coded as 2.

To address this question, I wanted to produce a visual representation of the relationship before determining if any statistical tests are warranted.

  • First, I needed to identify which variable corresponded to one’s subjective vulnerability and which to one’s objective vulnerability. This was done using Jamovi and Google Translate, as they had proven useful resources during our verification process. Variables were then renamed accordingly to make more intuitive sense - this should aid the reproducibility of my outputs by enhancing the code’s readability.
Study4exp <- Study4_Rename %>% 
  rename(obj_vuln = "Q61",
         subj_vuln = "Q100_1")

glimpse(Study4exp)
## Rows: 1,526
## Columns: 19
## $ QID90_1        <dbl> 1, 3, 1, 2, 2, 1, 5, 1, 1, 5, 5, 3, 1, 5, 4, 3, 4, 3, 1…
## $ QID90_2        <dbl> 1, 3, 1, 2, 2, 1, 4, 1, 1, 4, 5, 3, 1, 4, 3, 3, 4, 3, 1…
## $ QID90_3        <dbl> 1, 3, 1, 2, 2, 1, 4, 1, 1, 4, 5, 3, 1, 4, 3, 3, 4, 3, 2…
## $ mask_motiv     <dbl> 5, 4, 4, 3, 5, 1, 5, 4, 4, 5, 5, 2, 5, 4, 4, 2, 5, 4, 5…
## $ Q92_1          <dbl> 4, 3, 5, 2, 5, 3, 2, 4, 3, 5, 5, 2, 4, 4, 3, 3, 5, 3, 5…
## $ Q92_2          <dbl> 1, 3, 1, 3, 1, 5, 1, 2, 2, 1, 1, 3, 1, 4, 3, 3, 1, 3, 1…
## $ Q92_3          <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5…
## $ subj_vuln      <dbl> 2, 2, 3, 2, 3, 1, 2, 3, 4, 2, 4, 3, 3, 4, 3, 2, 3, 2, 4…
## $ Q100_2         <dbl> 3, 3, 4, 3, 3, 1, 3, 3, 4, 3, 4, 3, 3, 3, 4, 4, 3, 3, 4…
## $ obj_vuln       <dbl> 1, 1, 2, 2, 1, 1, 2, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 1, 2…
## $ Q20            <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ Age            <dbl> 30, 52, 39, 29, 46, 20, 31, 43, 40, 18, 29, 33, 27, 28,…
## $ Gender         <dbl> 2, 2, 2, 1, 2, 2, 1, 2, 1, 1, 2, 2, 1, 1, 1, 1, 2, 1, 2…
## $ Household_size <dbl> 1, 1, 3, 1, 1, 2, 3, 1, 4, 5, 2, 4, 2, 5, 2, 2, 2, 2, 5…
## $ Q28            <chr> "Gulasch", "Nudeln", "Nudelsalat", "Nudelgratin", "Sala…
## $ condition      <dbl> 1, 1, 1, 0, 1, 0, 2, 1, 1, 2, 2, 0, 1, 2, 0, 2, 2, 1, 0…
## $ empathy        <dbl> 1.000000, 3.000000, 1.000000, 2.000000, 2.000000, 1.000…
## $ Q92_2r         <dbl> 5, 3, 5, 3, 5, 1, 5, 4, 4, 5, 5, 3, 5, 2, 3, 3, 5, 3, 5…
## $ maskePo        <dbl> 4.5, 3.0, 5.0, 2.5, 5.0, 2.0, 3.5, 4.0, 3.5, 5.0, 5.0, …

My next task was to create summary tables such that I could garner a quick snapshot of the descriptive statistics and later graph the means for motivation to wear a face mask with objective vulnerability and subjective vulnerability as grouping factors. This process was essentially identical to that for previous summary tables except that:

  • I needed to filter out non-responses to the objective vulnerability question in order to ensure I could later graph the responses as accurately as possible. I used the filter() function from the dplyr package to do so, requiring that objective vulnerability be a number less than 3 for a participant to be included in the dataset (comparisons against NA return NA, so filter() drops those rows).
Study4exp_o <- Study4exp %>% 
  group_by(obj_vuln) %>% 
  summarise(mean_masko = mean(mask_motiv),
            sd_masko = sd(mask_motiv)) %>% 
  ungroup() 

print(Study4exp_o) #looks like there are some entries that are blank
## # A tibble: 3 × 3
##   obj_vuln mean_masko sd_masko
##      <dbl>      <dbl>    <dbl>
## 1        1       3.98    1.17 
## 2        2       3.76    1.20 
## 3       NA       2.5     0.707
Study4expfilter <- Study4exp %>% 
  filter(obj_vuln < 3)

Study4exp_o <- Study4expfilter %>% 
  group_by(obj_vuln) %>% 
  summarise(mean_masko = mean(mask_motiv),
            sd_masko = sd(mask_motiv)) %>% 
  ungroup()

Study4exp_s <- Study4exp %>% 
  group_by(subj_vuln) %>% 
  summarise(mean_masks = mean(mask_motiv),
            sd_masks = sd(mask_motiv)) %>% 
  ungroup()

print(Study4exp_s)
## # A tibble: 5 × 3
##   subj_vuln mean_masks sd_masks
##       <dbl>      <dbl>    <dbl>
## 1         1       2.45    1.44 
## 2         2       3.61    1.20 
## 3         3       3.96    1.05 
## 4         4       4.28    0.992
## 5         5       4.82    0.644

Now to plot this information.

  • First, I wanted to see whether objective vulnerability had any effect on the relationship between subjective vulnerability and mask wearing. This would indicate whether it is actually perceived vulnerability that is important or objective vulnerability that impacts motivations to wear a face mask in a pandemic environment (or both!).
  • This process was very similar to that used to reproduce Figure 1 except I have noted some differences in the annotations below.
expplot4_1 <- ggplot(data = Study4expfilter,
              mapping = aes(x = subj_vuln, y = mask_motiv)) +
  geom_smooth(aes(colour = factor(obj_vuln)),
                          method = lm,
                          se = FALSE) + #this removes the confidence interval shading that was present in Figure 1
  scale_x_continuous(name = "Subjective Vulnerability", limits = c(1,5)) +
  scale_y_continuous(name = "Face-Mask Wearing Motivation", limits = c(1,5)) +
  theme_bw() +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        legend.position = "bottom", #this specifies the location of the legend as being at the bottom of the plot
        axis.title = element_text(size = 10),
        legend.title = element_text(size = 9), #this specifies the size of the legend title
        legend.text = element_text(size = 8)) + #this specifies the size of the legend labels
  scale_colour_manual(name = "Objective Vulnerability", #this specifies the legend title (a plain string; element_text() isn't needed here)
                      labels = c("High-risk", "Low-risk"), #this specifies the legend labels in factor-level order (1 = High-risk, 2 = Low-risk as per Jamovi)
                      values = c("#B3D88B", "#1779B6")) #this specifies the colours of each line (I wanted to match the aesthetics of the original paper)

print(expplot4_1)
## `geom_smooth()` using formula 'y ~ x'

From this graph, it looks like there is absolutely no effect of objective vulnerability on the relationship between perceived personal vulnerability and face mask wearing: the lines are almost completely overlapping! Because this graph was a little difficult to wrap my head around, I decided to break it down into two smaller graphs plotting objective vulnerability against motivation to wear a face mask and subjective vulnerability against motivation to wear a face mask individually.

This was done using geom_col() which I am quite familiar with by now. I have noted any differences in the code below to reduce repetition as much as possible.

expplot4_2 <- ggplot(data = Study4exp_o,
              mapping = aes(x = factor(obj_vuln), y = mean_masko)) +
  geom_col(fill = "#A7CFE5") +
  scale_x_discrete(name = "Objective Vulnerability", labels = c("Low-risk", "High-risk")) +
  scale_y_continuous(name = "Face-Mask Wearing Motivation", limits = c(0,5)) + # I set the limits to 0 and 5 to match the graph produced for subjective vulnerability
  theme_bw() +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        legend.position = "none",
        axis.title.x = element_text(size = 10),
        axis.title.y = element_text(size = 6))

print(expplot4_2)

expplot4_3 <- ggplot(data = Study4exp_s,
              mapping = aes(x = factor(subj_vuln), y = mean_masks)) +
  geom_col(fill = "#A7CFE5") +
  scale_x_discrete(name = "Subjective Vulnerability") +
  scale_y_continuous(name = "Face-Mask Wearing Motivation") +
  theme_bw() +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        legend.position = "none",
        axis.title.x = element_text(size = 10),
        axis.title.y = element_text(size = 6))

print(expplot4_3)

For the sake of completeness, I wanted to put all the graphs together such that they complemented each other, with the smaller graphs on the right-hand side breaking down what is depicted in the larger one on the left. However, I wasn’t sure how to use the ggarrange() function (as used to reproduce Figure 1) to arrange the graphs in a specified configuration. After some Googling, it looked like the “patchwork” package made this process very simple, with / placing the following graph on a new row below and | placing the following graph in a column to the right.

#https://ggplot2-book.org/arranging-plots.html
library(patchwork)

(expplot4_1|expplot4_2/expplot4_3)
## `geom_smooth()` using formula 'y ~ x'

This produced exactly the output I wanted: the main graph on the left-hand side and the two smaller graphs on the right. While the paper recognised that there were no significant interactions between either vulnerability measure and empathy’s effect on the wearing of face masks, it did not compare how subjective and objective vulnerability each related to the wearing of face masks. This figure indicates that subjective vulnerability may have a stronger impact on motivation to wear a face mask relative to objective vulnerability, which appears to have no effect at all. This supports my initial hypothesis regarding the importance of one’s subjective perception in influencing our behaviour.
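
If I were to take this one step further, a simple linear model could quantify each vulnerability measure’s unique association with mask wearing. A minimal sketch (not an analysis the paper itself reports):

vuln_model <- lm(mask_motiv ~ subj_vuln + factor(obj_vuln), #regress mask motivation on both vulnerability measures at once
                 data = Study4expfilter) #the filtered data excludes non-responses to the objective vulnerability question

summary(vuln_model) #each coefficient reflects that measure's unique association, holding the other constant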

Part 4: Recommendation

1. Ensure any supplementary materials are sufficiently detailed

My first recommendation is for authors to include a codebook or other document that is sufficiently detailed and descriptive. One of the first issues that we encountered was difficulty interpreting the variable labels which led to a tedious process of trial and error when reproducing the reported descriptive statistics. To begin remedying the reproducibility crisis and facilitate maximal transparency within the scientific community, experiments should be relatively easy to reproduce. In order to fulfil this goal, any accompanying codebooks or supplementary materials should be easy to understand whilst maintaining a high level of detail. They should also aim to be accurate since we found some discrepancies between the supplementary materials and the paper itself in our verification process.

When describing the data, there should be description at both the variable level and the collection level. Noting that it may be difficult for experimenters who have specific (and habitual) naming conventions to change the way they name variables, a comprehensive codebook which details exactly which measure belongs to which variable name would be very useful. Furthermore, specific information including the text of all questions, details on missing data, exclusion criteria and information on how the variables were constructed/calculated should all be included. This helps uphold the principles of reproducibility and accountability that are central to open science. Codebooks can be created in programs such as SPSS and Excel, although experimenters unfamiliar with those programs can also create supplementary materials using a word processor.
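
To illustrate, even a small machine-readable codebook would have saved us hours. A minimal sketch in R (the rows are illustrative, based on the codings we recovered during verification, not the paper’s actual codebook):

library(tibble)

codebook <- tribble(
  ~variable, ~label,                             ~values,
  "bed",     "Experimental condition (Study 4)", "0 = Information Only; 1 = Control; 2 = Empathy",
  "Q22_1",   "Motivation to wear a face mask",   "5-point Likert; higher = more likely to wear a mask"
)

write.csv(codebook, "codebook.csv", row.names = FALSE) #a plain CSV that can be shipped alongside the data files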

This is a useful resource that includes both examples and guidelines for writing an appropriately detailed codebook: https://www.icpsr.umich.edu/icpsrweb/content/shared/ICPSR/faqs/what-is-a-codebook.html. This website is also a useful resource for learning how to create a codebook in SPSS specifically: https://libguides.library.kent.edu/spss/codebooks.

2. Include the author(s)’ code with comprehensive annotations

One of the issues we encountered when reproducing the descriptive statistics and figures was the language barrier (none of us understood German). Thus, my second recommendation is for authors to include the code that they used to calculate their statistics and generate their figures, because code is a universal language. This ensures a level playing field for experimenters from all backgrounds without prescribing a universal spoken language that authors must publish their materials in, upholding principles of diversity and equity.

However, if code is included, it must also be comprehensively annotated as per rubber ducking protocols. This accounts for differences in experimenters’ literacy in various coding languages and ensures that code is easy to follow and understand. By doing so, experimenters can best ensure that their code and their reported statistics and figures are easily reproducible and that any shady practice is discouraged due to the additional accountability this practice provides. Ultimately, including one’s annotated code in the OSF should aid in remedying the reproducibility crisis.

This website provides an easily understandable overview of how to engage in rubber ducking and links to a virtual rubber duck for experimenters who can’t source a physical one: https://rubberduckdebugging.com/ (how cute!) I would also recommend R Markdown as an appropriate tool for annotating code, as it makes it exceptionally easy to explain code as well as include in-code annotations where necessary.

3. Ensure consistency across data entry

Another issue we encountered was always having to check the data files in order to determine which number in the variable values corresponded with which parameter. For example, while the control condition was coded as “0” in Study 3, it was coded as “1” in Study 4. Additionally, while empathy was labelled as “empa” in Study 4 (which is somewhat intuitive), it was labelled as “ve” in Study 1. This generates unnecessary confusion and can be quite inefficient if we need to re-consult the data for each study especially when there are so many studies to reproduce within the one paper. Thus, my third recommendation is to encourage experimenters to code their variables consistently to minimise such confusion and aid the ease of reproducing their data. By gradually eroding the barriers to reproducing studies, perhaps we as a scientific community can encourage more attempts at reproducing experimenters’ findings and therefore cement accountability as a core tenet in academia.
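
One lightweight way to enforce this consistency is to define each coding scheme once and apply it identically to every study’s data. A minimal sketch using the Study 4 condition labels we recovered from Jamovi:

#define the scheme once, at the top of the project
condition_labels <- c("0" = "Information Only", "1" = "Control", "2" = "Empathy")

#apply it the same way to every study that uses the scheme
Study4_Labelled <- Study4_Rename %>% 
  mutate(condition_label = condition_labels[as.character(condition)]) #look up each numeric code in the named vector

Had all four studies shared one such lookup, there would have been no need to re-consult the raw files for every analysis.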

To learn more about best practice when coding variables as well as other recommendations for how to uphold open science principles see Kathawalla, Silverstein and Syed’s (2021) article “Easing Into Open Science: A Guide for Graduate Students and Their Advisors.” It’s a very helpful article that details how and why certain practices are helpful as well as addresses common worries when engaging in these practices. Furthermore, there is a tiered system of difficulty that allows experimenters to ease into open science in a manageable way.

Thanks for reading!!