Link to RPubs: https://rpubs.com/Sad5/931136
Pfattheicher, S., Nockur, L., Böhm, R., Sassenrath, C., & Petersen, M. B. (2020). The Emotional Path to Action: Empathy Promotes Physical Distancing and Wearing of Face Masks During the COVID-19 Pandemic. Psychological Science, 31(11), 1363–1373. https://doi.org/10.1177/0956797620964422
The COVID-19 pandemic presented challenges to countries all over the world. Physical distancing and the wearing of face masks are two behavioural measures that were enforced during the pandemic in order to reduce the spread of the virus. Both of these behavioural changes came at considerable personal cost, as physical distancing reduced social contact with others and wearing a face mask altered people’s breathing and appearance. However, practising these behaviours protects not only oneself but also other individuals, particularly those most vulnerable to the virus. It may be that adhering to such crucial measures reflects, and is encouraged by, the activation of a prosocial emotional process, namely empathy for the individuals most vulnerable to the virus. Therefore, it is important to identify the emotional basis of the motivation to practise physical distancing and wear a face mask. This research would inform policymakers on how to better promote these two important behavioural measures to obtain greater public compliance and thus slow the spread of the virus.
Pfattheicher et al. (2020) assessed the psychological motivation to adhere to these behaviours. The researchers tested (1) whether empathy promotes the motivation to practise physical distancing and (2) whether both wearing a face mask and physical distancing are encouraged by inducing affective empathy for those most vulnerable to the virus, such as the elderly.
The researchers conducted four studies with samples from the US, the UK and Germany. In Study 1, they measured participants’ affective empathy for individuals most vulnerable to the virus via three items, each rated from 1 (strongly disagree) to 5 (strongly agree). Physical-distancing practice was measured via self-report on a 5-point Likert scale. The study observed a positive correlation between empathy and physical distancing in all three countries.
In Study 2, the researchers replicated the findings of Study 1 in a different German sample. They used the same measure of empathy but measured physical distancing in a more prospective way, using five items; one sample item read “In the coming days, I will visit elderly people such as parents, grandparents, elderly friends, during my leisure time”. The results were comparable to Study 1.
In Study 3, there were three conditions: information only, information + empathy, and a control group (which received neither the text nor the video). In the information-only group, participants read a short text highlighting the importance of physical distancing for slowing the spread of COVID-19. In the information + empathy group, participants read the same short text and viewed a one-minute video of a 91-year-old man reporting that he had stopped visiting his chronically ill wife because of the virus. Physical distancing was then measured in all participants with the same items used in Study 2. The results revealed that when empathy was induced, the motivation to adhere to physical distancing increased significantly.
In Study 4, participants were assigned to an empathy condition (they read a text about a woman with a rare immune disease), an information-only condition (they read an informative text about COVID-19) or a control condition. State empathy was then assessed via three items and motivation to wear a face mask via one item, each rated from 1 (strongly disagree) to 5 (strongly agree). The results indicated that with increased levels of state empathy, the motivation to wear a face mask also increased. The study also tested whether objective or subjective vulnerability to the virus affected people’s compliance with wearing a face mask after the empathy induction. Neither vulnerability measure had a significant effect, which ruled out the alternative interpretation that affective empathy makes people sensitive to their own vulnerability and that this sensitivity is what promotes their motivation to comply with COVID-19 measures.
These findings suggest that there is potential in combining background information with affective empathy to further increase the motivation to comply with COVID-19 measures. While this may be the case, the authors also suggested replications be conducted using different stimuli and including other, non-western samples.
I was surprised to learn about the conditions under which affective empathy motivates behavioural outcomes, in this case increasing the motivation to adhere to COVID-19 measures. Generally, research has shown that affective empathy does improve health outcomes; one example the paper notes is that physicians’ affective empathy has been positively correlated with better self-care in diabetic patients. However, the paper also noted that other studies examining the same question (i.e. whether empathy for those most vulnerable to the virus promotes adherence to certain behavioural measures) found null effects. The authors speculated that this was due to the strength of the experimental manipulations: those studies used short empathy-related messages or statements, whereas the present studies used stronger manipulations that induced a higher level of state empathy, which was found to significantly increase motivation.
The most interesting part of this paper was the findings from Study 4, where empathy increased the motivation to wear face masks beyond a person’s perceived vulnerability. These results provide evidence against the alternative explanation that affective empathy makes a person sensitive to their own vulnerability, which in turn increases their motivation to comply with COVID-19 measures, meaning that adherence would be driven by egoistic rather than prosocial reasons. While potential egoistic explanations cannot be completely ruled out, it was very interesting to learn that, particularly in western countries, it is empathy for others more than for oneself that increases people’s motivation.
I wonder whether variables such as gender would moderate the effect of empathy on compliance with physical distancing and mask wearing. I speculate this because research has found that women are more likely to view COVID-19 as a serious health issue, to agree with restrictive public-policy measures and to adhere to them (Galasso et al., 2020). In this study, while the researchers did assess certain demographics, including gender, age, household size and education level, they did not explore whether any of these variables moderated the results. I think this would be an important avenue to explore because, if significant effects were found, policymakers would be able to tailor their communication more effectively to achieve greater compliance with these two important behavioural measures.
Our team set out to reproduce the demographic descriptives and the means and SDs reported for the different measures across all four studies, along with three figures from the Pfattheicher et al. (2020) paper.
Study 1:
“The final samples consisted of 322 participants from the US (45.7% female; age: M = 33.33 years, SD = 13.00), 317 from the UK (59.3% female; age: M = 38.05 years, SD = 12.20), and 326 from Germany (46.6% female; age: M = 29.44 years, SD = 9.31).”
Study 2:
“We collected data of 359 participants from Germany (48.5% female; age: M = 29.75 years, SD = 9.40).”
Study 3:
N = 868; 43.8% female; age: M = 35.09 years, SD = 12.44.
Study 4:
N = 1,526; 47.2% female; age: M = 34.71 years, SD = 12.09.
Study 1:
Empathy - US: M = 4.46, SD = 0.74; UK: M = 4.56, SD = 0.61; Germany: M = 4.02, SD = 0.9.
Physical Distancing - US: M = 4.30, SD = 0.99; UK: M = 4.12, SD = 1.01; Germany: M = 4.04, SD = 1.11.
Study 2:
Empathy - M = 4.05, SD = 0.94.
Physical Distancing - M = 4.56, SD = 0.65.
Study 3:
Motivation to physical distance:
Control condition - M = 4.30, SD = 0.76.
Information-only condition - M = 4.39, SD = 0.74.
Information + empathy condition - M = 4.51, SD = 0.66.
Study 4:
Empathy:
Information-only condition - M = 2.14, SD = 1.00.
Empathy condition - M = 4.03, SD = 0.90.
Control condition - M = 2.10, SD = 1.01.
Motivation to wear a face mask:
Information-only condition - M = 3.83, SD = 1.20.
Empathy condition - M = 4.00, SD = 1.12.
Control condition - M = 3.69, SD = 1.24.
Figure 1: Physical distancing as predicted by empathy in the four samples of Studies 1 and 2.
Figure 2: Physical-distancing motivation in each condition of Study 3.
Figure 3: Motivation to wear a face mask in each condition of Study 4.
Here, we will attempt to reproduce the demographic descriptives for all 4 studies in the paper.
First, we load the package ‘tidyverse’ into our library, which contains most of the functions we will need, and then we load the package ‘haven’, which enables R to read data stored in various formats, including the SPSS .sav files used here.
library(tidyverse)
library(haven)
Next, we use the function read_sav to load in the data for all 4 studies as the data files are in .sav format. The files are located in a folder called ‘lib’.
For study 1, there are 3 separate data files, 1 for each country sample, thus I name and load each data set individually.
Study1USA <- read_sav("/cloud/lib/Study1USA.sav")
Study1UK <- read_sav("/cloud/lib/Study1UK.sav")
Study1Germany <- read_sav("/cloud/lib/Study1Germany.sav")
Next, we reproduce the demographic statistics for each sample separately.
For Study1USA:
The article reported the mean age = 33.33 and SD = 13.
Study1USA %>%
summarise(mean_age = mean(Age),
SD = sd(Age))
## # A tibble: 1 × 2
## mean_age SD
## <dbl> <dbl>
## 1 33.3 13.0
The article reported the sample consisted of 322 participants, with 45.7% of the participants being female.
count(Study1USA)
## # A tibble: 1 × 1
## n
## <int>
## 1 322
Study1USA %>%
group_by(Gender) %>%
summarise(percent = 100*n()/nrow(Study1USA))
## # A tibble: 3 × 2
## Gender percent
## <dbl+lbl> <dbl>
## 1 1 [Male] 53.4
## 2 2 [Female] 45.7
## 3 3 [N.A.] 0.932
The supplementary material was somewhat misleading here, as it specified that the final samples consisted only of male and female participants. The N.A. row was therefore a surprise: the supplementary material gave no indication that a third gender category existed, nor that these participants were included in the final analyses.
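To see how the reported percentage shifts if those respondents are excluded, one can simply filter out the third category first; a small sketch (assuming the N.A. category is coded 3, as in the output above):
Study1USA %>%
  filter(Gender != 3) %>%            # drop the N.A. respondents shown above
  count(Gender) %>%
  mutate(percent = 100 * n / sum(n)) # percentages among male/female respondents only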
For Study1UK:
The article reported the mean age = 38.05 and SD = 12.20.
Study1UK %>%
summarise(mean_age = mean(Age),
SD = sd(Age)) %>%
round(digits = 2) # Used to produce values to 2 decimal places.
## # A tibble: 1 × 2
## mean_age SD
## <dbl> <dbl>
## 1 38.0 12.2
The article reported the sample consisted of 317 participants, with 59.3% of the participants being female.
count(Study1UK)
## # A tibble: 1 × 1
## n
## <int>
## 1 317
Study1UK %>%
group_by(Gender) %>%
summarise(percent = 100*n()/nrow(Study1UK))
## # A tibble: 3 × 2
## Gender percent
## <dbl+lbl> <dbl>
## 1 1 [Male] 40.4
## 2 2 [Female] 59.3
## 3 3 [N.A.] 0.315
For Study1Germany:
The article reported the mean age = 29.44 and SD = 9.31.
Study1Germany %>%
summarise(mean_age = mean(Age),
SD = sd(Age)) %>%
round(digits = 2) # Used to produce values to 2 decimal places.
## # A tibble: 1 × 2
## mean_age SD
## <dbl> <dbl>
## 1 29.4 9.31
The article reported the sample consisted of 326 participants, with 46.6% of the participants being female.
count(Study1Germany)
## # A tibble: 1 × 1
## n
## <int>
## 1 326
Study1Germany %>%
group_by(Gender) %>%
summarise(percent = 100*n()/nrow(Study1Germany))
## # A tibble: 3 × 2
## Gender percent
## <dbl+lbl> <dbl>
## 1 1 [Männlich] 52.8
## 2 2 [Weiblich] 46.6
## 3 3 [Keine Angabe] 0.613
Now, for the remaining studies, we follow the same process using the same functions as before.
For study 2:
I first load and name the data set.
Study2 <- read_sav("/cloud/lib/Study2.sav")
The article reported the mean age = 29.75 and SD = 9.40.
Study2 %>%
summarise(mean_age = mean(Age), # To find mean and SD
SD = sd(Age)) %>%
round(digits = 2) # Used to produce values to 2 decimal places.
## # A tibble: 1 × 2
## mean_age SD
## <dbl> <dbl>
## 1 29.8 9.4
The article reported the sample consisted of 359 participants, with 48.5% of the participants being female.
count(Study2) # To determine number of final participants.
## # A tibble: 1 × 1
## n
## <int>
## 1 359
Study2 %>%
group_by(Gender) %>% # Groups factors of gender.
summarise(percent = 100*n()/nrow(Study2)) # To find percentage of each gender.
## # A tibble: 3 × 2
## Gender percent
## <dbl+lbl> <dbl>
## 1 1 [Männlich] 51.3
## 2 2 [Weiblich] 48.5
## 3 3 [Keine Angabe] 0.279
For study 3:
I first load and name the data set.
Study3 <- read_sav("/cloud/lib/Study3.sav")
The article reported the mean age = 35.09 and SD = 12.44.
Study3 %>%
summarise(mean_age = mean(Age), # To calculate mean and sd.
SD = sd(Age)) %>%
round(digits = 2) # Used to produce values to 2 decimal places.
## # A tibble: 1 × 2
## mean_age SD
## <dbl> <dbl>
## 1 35.1 12.4
The article reported the sample consisted of 868 participants, with 43.8% of the participants being female.
count(Study3) # To determine the number of final participants.
## # A tibble: 1 × 1
## n
## <int>
## 1 868
Study3 %>%
group_by(Gender) %>% # Groups factors of gender.
summarise(percent = 100*n()/nrow(Study3)) # To determine percentages of each gender.
## # A tibble: 3 × 2
## Gender percent
## <dbl+lbl> <dbl>
## 1 1 [Männlich] 55.8
## 2 2 [Weiblich] 43.8
## 3 NA 0.461
For study 4:
I first load and name the data set.
Study4 <- read_sav("/cloud/lib/Study4.sav")
The article reported the mean age = 34.71 and SD = 12.09.
Study4 %>%
summarise(mean_age = mean(Age), # To calculate mean and SD.
SD = sd(Age)) %>%
round(digits = 2) # Used to produce values to 2 decimal places.
## # A tibble: 1 × 2
## mean_age SD
## <dbl> <dbl>
## 1 34.7 12.1
The article reported the sample consisted of 1,526 participants with 47.2% of the participants being female.
count(Study4) # To determine the number of final participants.
## # A tibble: 1 × 1
## n
## <int>
## 1 1526
Study4 %>%
group_by(Gender) %>% # Groups factors of gender.
summarise(percent = 100*n()/nrow(Study4)) # To determine percentages of each gender.
## # A tibble: 3 × 2
## Gender percent
## <dbl+lbl> <dbl>
## 1 1 [Männlich] 52.2
## 2 2 [Weiblich] 47.2
## 3 NA 0.655
Now that we have reproduced the demographic statistics, we can move on to producing the means and SDs for all the measures in each study.
In study 1, participants were measured on their affective empathy and their physical distancing practice.
The article reported that the means and SDs for the empathy scores in each sample were as follows:
US: M = 4.46, SD = 0.74; UK: M = 4.56, SD = 0.61; Germany: M = 4.02, SD = 0.9.
First, we pipe the data set through the transmute() function and specify which columns measure affective empathy (i.e. Q22_1, Q22_3 and Q22_5). Doing this drops all existing variables in the data set except the selected columns. Next, we use the pivot_longer() function on the same columns, which reshapes the data from a wide format to a long format; the names_to and values_to arguments re-label the resulting columns. Then, with the summarise() function, we calculate the mean and SD of the empathy scores.
Study1USA %>%
transmute(Q22_1, Q22_3, Q22_5) %>%
pivot_longer(cols = c(Q22_1, Q22_3, Q22_5),
names_to = "Empathy",
values_to = "Empathy_scores") %>%
summarise(mean(Empathy_scores), sd(Empathy_scores))
## # A tibble: 1 × 2
## `mean(Empathy_scores)` `sd(Empathy_scores)`
## <dbl> <dbl>
## 1 4.46 0.846
This produced the correct mean but not the correct SD: pooling all the item responses together gives the SD across the 3 × 322 individual ratings, whereas the article reports the SD of each participant’s average empathy score. After much critical thinking and searching through Google, we discovered the rowwise() function, which makes the following operations run on each row individually. Thus, we pipe the data set through rowwise(), then use the mutate() function to create a new variable called “Empathy”, which is the sum of the selected columns measuring empathy. Next, we divide the Empathy variable by 3 (i.e. the number of columns measuring affective empathy) and save it under a new object. Then, we use the mean() and sd() functions to find the mean and SD of the per-participant empathy scores.
Correct Code:
For Study1USA:
Study1USA <- Study1USA %>%
rowwise() %>%
mutate(Empathy = sum(Q22_1, Q22_3, Q22_5))
Average_empathyUSA<-Study1USA$Empathy/3 # Save data under new object called "Average_empathyUSA".
mean(Average_empathyUSA)
## [1] 4.459627
sd(Average_empathyUSA)
## [1] 0.7427453
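For reference, the same per-participant average can also be computed without the separate division step; this is a minimal sketch using the same item columns, not the code we actually used above:
Study1USA %>%
  rowwise() %>%
  mutate(Empathy_avg = mean(c(Q22_1, Q22_3, Q22_5))) %>% # average the three empathy items per participant
  ungroup() %>%
  summarise(mean = mean(Empathy_avg), sd = sd(Empathy_avg))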
For Study1UK:
Study1UK<- Study1UK %>%
rowwise() %>%
mutate(Empathy= sum(Q22_1, Q22_3, Q22_5))
Average_empathyUK <- Study1UK$Empathy/3 # Save data under new object called "Average_empathyUK".
mean(Average_empathyUK)
## [1] 4.560463
sd(Average_empathyUK)
## [1] 0.6124644
For Study1Germany:
Study1Germany<- Study1Germany %>%
rowwise() %>%
mutate(Empathy= sum(Q22_1, Q22_3, Q22_5))
Average_empathyGermany <- Study1Germany$Empathy/3 # Save data under new object called "Average_empathyGermany".
mean(Average_empathyGermany)
## [1] 4.017382
sd(Average_empathyGermany)
## [1] 0.92563
The article reported that the mean and SD for the physical distance scores in each sample were as follows:
US: M = 4.30, SD = 0.99; UK: M = 4.12, SD = 1.01; Germany: M = 4.04, SD = 1.11.
The study used only one item (Q24_1, or Q77_1 in the German data set) to measure physical-distancing practice, hence all we need to do is pipe each data set through the summarise() function to calculate the mean and SD.
For Study1USA:
Study1USA %>%
summarise(mean(Q24_1), SD= sd(Q24_1))
For Study1UK:
Study1UK %>%
summarise(mean(Q24_1), SD= sd(Q24_1))
For Study1Germany:
In the data set for Germany, the item measuring physical distancing was labelled “Q77_1”.
Study1Germany %>%
summarise(mean(Q77_1), SD= sd(Q77_1))
For Study 2:
In study 2, participants were also measured on their affective empathy and their physical distancing practice.
The article reported that the mean and SD for the empathy scores were: M = 4.05, SD = 0.94.
Here, we apply the same process and functions as outlined in Study 1 to reproduce the mean and SD of the empathy scores.
Study2 <- Study2 %>%
rowwise() %>% # Sums across the selected columns for each row individually.
mutate(Empathy = sum(Q36_1, Q36_3, Q36_5)) # Creates a new variable.
Average_empathy2 <- Study2$Empathy/3 # Divide by the number of items used to measure empathy and save as a new object.
mean(Average_empathy2) # Calculate mean
## [1] 4.052925
sd(Average_empathy2) # Calculate SD
## [1] 0.9414831
The article reported that the mean and SD for physical distancing scores were: M = 4.56, SD = 0.65.
The study used five items (Q22_1, Q22_2, Q22_3, Q22_4 and Q22_5) to measure physical-distancing practice, hence we need to create a new variable combining these items, using the same process as for the empathy measure above.
Study2 <- Study2 %>%
rowwise() %>% # Sums across the selected columns for each row individually.
mutate(socialdistance = sum(Q22_1, Q22_2, Q22_3, Q22_4, Q22_5)) # Creates a new variable that is the sum of the relevant item columns.
Average_socialdistance2 <- Study2$socialdistance/5 # Divide by the number of items measuring physical-distancing practice and save as a new object.
mean(Average_socialdistance2) # To calculate mean.
## [1] 1.44234
sd(Average_socialdistance2) # To calculate SD.
## [1] 0.6484973
As can be seen, the mean produced is incorrect as it is not in line with the mean reported by the paper (the SD is unaffected by the re-coding). Thus, we went back and read the supplementary material and the article again. We realised that we had missed the one line in the article where the authors reported that the data were re-coded in Study 2. The physical-distancing items used rating scales ranging from 1 to 5, and these were re-coded in the analyses so that the values were inverted, i.e. 1 became 5 and 5 became 1. We were confused as to why there was no mention of this in the supplementary material and why the authors felt the need to re-code all the items.
After many discussions with our tutor (Yuki), we were able to re-code the items onto the flipped scale by subtracting each column value from 6 (so a 1 becomes a 5, a 2 becomes a 4, and so on). Hence, we pipe the data through rowwise() and then mutate().
Study2 <-Study2 %>%
rowwise() %>% # Ensures the following step is computed row by row.
mutate(RECODE1 = 6-Q22_1, # Creates a new column for each item which reflects the re-coded values.
RECODE2 = 6-Q22_2,
RECODE3 = 6-Q22_3,
RECODE4 = 6-Q22_4,
RECODE5 = 6-Q22_5)
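As an aside, the same inversion can be written more compactly with across() (available in dplyr 1.0 and later); this is just a sketch, and the "_rec" column names are illustrative rather than the ones used in the rest of this report:
Study2_recoded <- Study2 %>%
  mutate(across(c(Q22_1, Q22_2, Q22_3, Q22_4, Q22_5),
                ~ 6 - .x,               # invert the 1-5 scale so 1 becomes 5 and 5 becomes 1
                .names = "{.col}_rec")) # e.g. Q22_1_rec, Q22_2_rec, ...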
Now that we have the correct values, we can apply the same functions that were used prior to re-coding the data to find the mean and SD of the physical distancing practice scores.
Correct Code:
Study2<- Study2 %>%
rowwise() %>% # Sums across the selected columns for each row individually.
mutate(SRecodeDistance = sum(RECODE1, RECODE2, RECODE3, RECODE4, RECODE5)) # Creates a new variable that is the sum of all the re-coded columns.
Average_SRecodeDistanceStudy2 <- Study2$SRecodeDistance/5 # Divides the new variable by the number of items used to measure physical distancing, i.e. 5 and saves it under a new object called "Average_SRecodeDistanceStudy2".
mean(Average_SRecodeDistanceStudy2) # To calculate mean
## [1] 4.55766
sd(Average_SRecodeDistanceStudy2) # To calculate SD
## [1] 0.6484973
For Study 3:
In Study 3, there were three conditions: information-only, information + empathy and the control group. Participants were measured on their adherence to physical distancing with the same items used in Study 2.
First, we re-code the column items used to measure compliance to physical distancing in the data set. This is done in the same way as Study 2 - by piping the data through rowwise() and then, mutate().
Study3 <-Study3 %>%
rowwise() %>% # Ensures the following step is computed row by row.
mutate(RECODE1 = 6-Q22_1, RECODE2 = 6-Q22_2, RECODE3 = 6-Q22_3, RECODE4 = 6-Q22_4, RECODE5 = 6-Q22_5) # Creates new variables for the relevant columns which reflect the re-coded values.
Now, with the correct values, we reproduce the means and SD’s.
The article reported that the mean and SD for the physical distancing scores in each condition were as follows:
Control condition - M = 4.30, SD = 0.76. Information-only condition - M = 4.39, SD = 0.74. Information + empathy condition - M = 4.51, SD = 0.66.
In the data set, there is a variable named “bed” which contains values ranging from 0 to 2. These numbers code the condition that each participant was in. Thus, we need to filter the data to retain the relevant participants for each condition. To do this, we pipe the data through the filter() function and specify the column and value that we want to filter by. Then, we pipe it through select_all(), which simply keeps every column of the filtered data, and save the result under a new object.
Next, we pipe the new object through rowwise() and then the mutate() function to create another variable that is the sum of the re-coded columns measuring physical distancing. Doing this gives us the re-coded scores only for the participants in a specific condition. Then, we divide this variable by the number of items measuring physical distancing (i.e. 5) and save that under another object. Finally, we use the mean() and sd() functions to calculate the means and SDs.
Correct Code:
Control group:
Study3controlPD <- Study3 %>% # New variable called 'Study3controlPD'.
filter(bed == 0) %>% # Filters 'bed' variable and retains only participants who selected 0.
select_all() # Selects all 0 values in 'bed' subset.
Study3controlPD<- Study3controlPD %>%
rowwise() %>% # Sums across the selected columns for each row individually.
mutate(SStudy3controlPD = sum(RECODE1, RECODE2,
RECODE3, RECODE4,RECODE5)) # Creates a new variable labelled 'SStudy3controlPD' which is the sum of the re-coded columns.
Average_SStudy3controlPD <- Study3controlPD$SStudy3controlPD/5 # Dividing SStudy3controlPD by 5 and saving under new object labelled 'Average_SStudy3controlPD.
mean(Average_SStudy3controlPD) # Calculates mean.
## [1] 4.297049
sd(Average_SStudy3controlPD) # Calculates SD.
## [1] 0.7602098
Information-Only:
Study3infoonlyPD <- Study3 %>% # New variable called 'Study3infoonlyPD'.
filter(bed == 2) %>% # Filters 'bed' variable and retains only participants who selected 2.
select_all() # Selects all 2 values in 'bed' subset.
Study3infoonlyPD<- Study3infoonlyPD %>%
rowwise() %>% # Sums across the selected columns for each row individually.
mutate(SStudy3infoonlyPD = sum(RECODE1, RECODE2,
RECODE3, RECODE4, RECODE5)) # Creates a new variable labelled 'SStudy3infoonlyPD' which is the sum of the re-coded columns.
Average_SStudy3infoonlyPD <- Study3infoonlyPD$SStudy3infoonlyPD/5 # Dividing SStudy3infoonlyPD by 5 and saving under new object labelled 'Average_SStudy3infoonlyPD.
mean(Average_SStudy3infoonlyPD) # Calculates mean.
## [1] 4.392787
sd(Average_SStudy3infoonlyPD) # Calculates SD.
## [1] 0.7393635
Information + Empathy:
Study3infoempathyPD <- Study3 %>% # New variable called 'Study3infoempathyPD'.
filter(bed == 1) %>% # Filters 'bed' variable and retains only participants who selected 1.
select_all() # Selects all 1 values in 'bed' subset.
Study3infoempathyPD<- Study3infoempathyPD %>%
rowwise() %>% # Sums across the selected columns for each row individually.
mutate(SStudy3infoempathyPD = sum(RECODE1, RECODE2,
RECODE3, RECODE4, RECODE5)) # Creates a new variable labelled 'SStudy3infoempathyPD' which is the sum of the re-coded columns.
Average_SStudy3infoempathyPD <- Study3infoempathyPD$SStudy3infoempathyPD/5 # Dividing SStudy3infoempathyPD by 5 and saving under new object labelled 'Average_SStudy3infoempathyPD.
mean(Average_SStudy3infoempathyPD) # Calculates mean.
## [1] 4.510078
sd(Average_SStudy3infoempathyPD) # Calculates SD.
## [1] 0.6572961
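For reference, all three condition means can also be obtained in a single pass with group_by() and summarise(); a minimal sketch assuming the re-coded columns created above (0 = control, 2 = information only, 1 = information + empathy):
Study3 %>%
  rowwise() %>%
  mutate(AvgPD = mean(c(RECODE1, RECODE2, RECODE3, RECODE4, RECODE5))) %>% # per-participant average
  ungroup() %>%
  group_by(bed) %>%                             # 'bed' codes the condition
  summarise(mean = mean(AvgPD), sd = sd(AvgPD))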
For Study 4:
In Study 4, there were three conditions: information-only, empathy and the control group. Participants’ state empathy was assessed after the condition manipulations via three items. They were also measured on their motivation to wear a face mask with one item.
The article reported that the means and SDs for the measures in each condition were as follows:
Empathy: Information-only condition - M = 2.14, SD = 1.00. Empathy condition - M = 4.03, SD = 0.90. Control condition - M = 2.10, SD = 1.01.
Motivation to wear a face mask: Information-only condition - M = 3.83, SD = 1.20. Empathy condition - M = 4.00, SD = 1.12. Control condition - M = 3.69, SD = 1.24.
First, we begin with reproducing the means and SDs for the state empathy measure.
We follow the same process and use the same functions as described for Study 3: we pipe the data through the filter() function, specifying the column and value that we want to filter by, then pipe it through select_all(), which simply keeps every column of the filtered data, and save the result under a new object.
The study used three items (QID90_1, QID90_2 and QID90_3) to measure state empathy. So, we pipe the new object through rowwise() and then the mutate() function to create another variable that is the sum of the state-empathy items for the filtered participants. Doing this gives us the scores only for the participants in a specific condition. Then, we divide this variable by the number of items measuring state empathy (i.e. 3) and save that under another object. Finally, we use the mean() and sd() functions to calculate the means and SDs.
Info-Only Condition:
Study4infoonlySE <- Study4 %>% # New variable called "Study4infoonlySE"
filter(bed == 0) %>% # Filters 'bed' variable and retains only participants who were assigned 0.
select_all() # Selects all 0 values in 'bed' subset.
Study4infoonlySE<- Study4infoonlySE %>%
rowwise() %>% # Sums across the selected columns for each row individually.
mutate(SStudy4infoonlySE = sum(QID90_1, QID90_2, QID90_3)) # Creates a new variable labelled 'SStudy4infoonlySE' which is the sum of the state empathy scores for participants in the information-only group.
Average_SStudy4infoonlySE <- Study4infoonlySE$SStudy4infoonlySE/3 # Dividing SStudy4infoonlySE by 3 and saving under new object labelled 'Average_SStudy4infoonlySE'.
mean(Average_SStudy4infoonlySE) # Calculates mean.
## [1] 2.144309
sd(Average_SStudy4infoonlySE) # Calculates SD.
## [1] 1.000428
Control Condition:
Study4controlSE <- Study4 %>% # New variable called "Study4controlSE"
filter(bed == 1) %>% # Filters 'bed' variable and retains only participants who were assigned 1.
select_all() # Selects all 1 values in 'bed' subset.
Study4controlSE<- Study4controlSE %>%
rowwise() %>% # Sums across the selected columns for each row individually.
mutate(SStudy4controlSE = sum(QID90_1, QID90_2, QID90_3)) # Creates a new variable labelled 'SStudy4controlSE' which which is the sum of the state empathy scores for participants in the control group.
Average_SStudy4controlSE <- Study4controlSE$SStudy4controlSE/3 # Dividing SStudy4controlSE by 3 and saving under new object labelled 'Average_SStudy4controlSE'.
mean(Average_SStudy4controlSE) # Calculates mean.
## [1] 2.096754
sd(Average_SStudy4controlSE) # Calculates SD.
## [1] 1.010061
Empathy Condition:
Study4empathSE <- Study4 %>% # New variable called "Study4empathSE"
filter(bed == 2) %>% # Filters 'bed' variable and retains only participants who were assigned 2.
select_all() # Selects all 2 values in 'bed' subset.
Upon looking through the filtered data, we realised that one column contained a few NA values, which would affect the output when calculating the mean and SD for this condition. Thus, within filter(), we also specify that we want to retain only values greater than or equal to 1 for the column containing the NAs (i.e. QID90_2), which drops the rows with missing responses.
Correct Code:
Study4empathSE <- Study4 %>% # New variable called "Study4empathSE"
filter(bed == 2, QID90_2 >= 1 ) %>% # Filters 'bed' variable and retains only participants who were assigned 2 and who scored 1 or more for item in column QID90_2.
select_all() # Selects all 2 values in 'bed' subset.
Study4empathSE<- Study4empathSE %>%
rowwise() %>% # Sums across the selected columns for each row individually.
mutate(SStudy4empathSE = sum(QID90_1, QID90_2, QID90_3)) # Creates a new variable labelled 'SStudy4empathSE' which is the sum of the state empathy scores for participants in the empathy group.
Average_SStudy4empathSE <- Study4empathSE$SStudy4empathSE/3 # Dividing SStudy4empathSE by 3 and saving under new object labelled 'Average_SStudy4empathSE'.
mean(Average_SStudy4empathSE) # Calculates mean.
## [1] 4.032064
sd(Average_SStudy4empathSE) # Calculates SD.
## [1] 0.8998714
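An equivalent way to drop the rows with missing responses, for reference, is tidyr’s drop_na() (tidyr is loaded as part of the tidyverse); a small sketch under the same assumption that only QID90_2 contains NAs:
Study4 %>%
  filter(bed == 2) %>% # empathy condition
  drop_na(QID90_2) %>% # removes the same rows excluded by QID90_2 >= 1 above
  rowwise() %>%
  mutate(StateEmpathy = mean(c(QID90_1, QID90_2, QID90_3))) %>%
  ungroup() %>%
  summarise(mean = mean(StateEmpathy), sd = sd(StateEmpathy))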
Now, we reproduce the means and SDs for the motivation to wear a face mask.
The study used only one item (Q22_1) to measure motivation to wear a face mask. We pipe the data through the filter() function, specifying the column and value that we want to filter by, then pipe it through select_all(), which simply keeps every column of the filtered data, and save the result under a new object. Then, all we need to do is pipe the new object through the summarise() function to calculate the mean and SD.
Information-Only Condition:
Study4infoPD <- Study4 %>% # New variable called "Study4infoPD".
filter(bed == 0) %>% # Filters 'bed' variable and retains only participants who were assigned 0.
select_all() # Selects all 0 values in 'bed' subset
Study4infoPD %>%
summarise(mean(Q22_1), sd(Q22_1)) # Calculates mean and SD.
## # A tibble: 1 × 2
## `mean(Q22_1)` `sd(Q22_1)`
## <dbl> <dbl>
## 1 3.83 1.20
Empathy Condition:
Study4empathyPD <- Study4 %>% # New variable called "Study4empathyPD".
filter(bed == 2) %>% # Filters 'bed' variable and retains only participants who were assigned 2.
select_all() # Selects all 2 values in 'bed' subset.
Study4empathyPD %>%
summarise(mean(Q22_1), sd(Q22_1)) # Calculates mean and SD.
## # A tibble: 1 × 2
## `mean(Q22_1)` `sd(Q22_1)`
## <dbl> <dbl>
## 1 4.00 1.12
Control Condition:
Study4controlPD <- Study4 %>% # New variable called "Study4controlPD".
filter(bed == 1) %>% # Filters 'bed' variable and retains only participants who were assigned 1.
select_all() # Selects all 1 values in 'bed' subset.
Study4controlPD %>%
summarise(mean(Q22_1), sd(Q22_1)) # Calculates mean and SD.
## # A tibble: 1 × 2
## `mean(Q22_1)` `sd(Q22_1)`
## <dbl> <dbl>
## 1 3.69 1.24
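For reference, the three condition means for the face-mask item can also be produced in one step by grouping on ‘bed’ (0 = information only, 1 = control, 2 = empathy) rather than filtering each condition separately; a minimal sketch:
Study4 %>%
  group_by(bed) %>%                             # 'bed' codes the condition
  summarise(mean = mean(Q22_1), sd = sd(Q22_1))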
The process of reproducing the figures was definitely a challenging one. Figuring out the functions needed for Figure 1 took many hours, a great deal of trial and error, and several moments in which we wanted to give up.
Figure 1:
Figure 1 displays physical distancing as predicted by empathy in the four samples of Studies 1 and 2.
To reproduce Figure 1, we first need to produce four separate plots: three for Study 1 and one for Study 2. Then, we need to combine all the plots into one figure.
For each of the plots, the process was as follows:
For Study1UK:
Study1UK<- Study1UK %>%
rowwise() %>%
mutate(Empathy = sum(Q22_1, Q22_3, Q22_5))
Study1UK <- Study1UK %>%
rowwise() %>% mutate(AvgEmpathy = Empathy/3)
Figure1UK <-
ggplot(data = Study1UK) +
geom_smooth(mapping =
aes(x = AvgEmpathy, y = Q24_1),
method = lm) + scale_x_continuous(name = NULL, limits = c(1,5), breaks = c(1,2,3,4,5)) +
scale_y_continuous(name = NULL, limits = c(1,5), breaks = c(1,2,3,4,5)) + ggtitle('Study 1: UK Sample') +
theme(panel.background = element_blank(),
panel.border = element_rect(fill = NA, size = 0.5))
Figure1UK
## `geom_smooth()` using formula 'y ~ x'
For Study1USA:
Study1USA<- Study1USA %>%
rowwise() %>%
mutate(Empathy= sum(Q22_1, Q22_3, Q22_5))
Study1USA <- Study1USA %>%
rowwise() %>%
mutate(AvgEmpathy = Empathy/3)
Figure1USA <- ggplot(data = Study1USA) +
geom_smooth(mapping = aes(x= AvgEmpathy, y=Q24_1), method=lm) + scale_x_continuous(name = NULL, limits = c(1,5), breaks = c(1,2,3,4,5)) + scale_y_continuous(name = 'Physical Distancing', limits = c(1,5), breaks = c(1,2,3,4,5)) + ggtitle('Study 1: U.S. Sample') +
theme(panel.background = element_blank(),
panel.border = element_rect(fill = NA, size = 0.5))
Figure1USA
## `geom_smooth()` using formula 'y ~ x'
For Figure1Germany:
Study1Germany<- Study1Germany %>%
rowwise() %>%
mutate(Empathy= sum(Q22_1, Q22_3, Q22_5))
Study1Germany <- Study1Germany %>%
rowwise() %>% mutate(AvgEmpathy = Empathy/3)
Figure1Germany <- ggplot(data = Study1Germany) +
geom_smooth(mapping = aes(x= AvgEmpathy, y=Q77_1),
method=lm) + scale_x_continuous(name = NULL, limits = c(1,5), breaks = c(1,2,3,4,5)) +
scale_y_continuous(name = NULL, limits = c(1,5), breaks = c(1,2,3,4,5)) + ggtitle('Study 1: German Sample') +
theme(panel.background = element_blank(), panel.border = element_rect(fill = NA, size = 0.5))
Figure1Germany
## `geom_smooth()` using formula 'y ~ x'
For Study 2:
For Study 2, we first pipe the data through rowwise() and then mutate() to create a new variable called “Empathy”, which is the sum of the selected columns measuring empathy. Then, we use rowwise() and mutate() again to divide the Empathy variable by 3 and save it as a new variable called “AvgEmpathy”.
Next, we re-code the items measuring physical distancing by subtracting each relevant column value from 6. We pipe the data through rowwise() and mutate() to create a new variable called “RecodePhysicalDistance”, which is the sum of the re-coded physical-distancing scores. Then, we divide RecodePhysicalDistance by 5 (i.e. the number of items measuring physical distancing) and save it as a new variable called “AvgRecodePhysicalDistancing” using mutate().
Study2 <- Study2 %>%
rowwise() %>%
mutate(Empathy = sum(Q36_1, Q36_3, Q36_5))
For empathy:
Study2 <- Study2 %>% rowwise() %>% mutate(AvgEmpathy = Empathy/3)
For physical distancing:
Study2 <-Study2 %>% rowwise() %>% mutate(RECODE1 = 6-Q22_1,
RECODE2 = 6-Q22_2,
RECODE3 = 6-Q22_3,
RECODE4 = 6-Q22_4,
RECODE5 = 6-Q22_5)
Study2<- Study2 %>%
rowwise() %>%
mutate(RecodePhysicalDistance = sum(RECODE1, RECODE2, RECODE3, RECODE4, RECODE5))
Study2 <- Study2 %>%
rowwise() %>%
mutate(AvgRecodePhysicalDistancing = RecodePhysicalDistance/5)
To reproduce the graph, we use the same functions as mentioned for the previous plots.
Figure:
Figure1Study2 <- ggplot(data = Study2) + # ggplot is used to establish that the object is a plot and specify the data set.
geom_smooth(mapping =
aes(x= AvgEmpathy, y=AvgRecodePhysicalDistancing), method=lm) + # identify the x-axis and y-axis variables and method = lm is used to produce a fitted regression line.
scale_x_continuous(name = NULL, limits = c(1,5), breaks = c(1,2,3,4,5)) + # we remove the x-axis title, set the limits for the x-axis to be from 1-5 and use the 'breaks' argument to make the x-axis values go up by 1
scale_y_continuous(name = NULL, limits = c(1,5), breaks = c(1,2,3,4,5)) + ggtitle('Study 2: German Sample') + theme(panel.background = element_blank(), # Used to remove the default grey background of the plot.
panel.border = element_rect(fill = NA, size = 0.5)) # 'panel.border' is used to add panel borders around the plot and we use 'element_rect()' to set the border colour as the default (black) and the border size to 0.5.
Figure1Study2
## `geom_smooth()` using formula 'y ~ x'
Combining all the graphs:
Now, we need to combine all the plots. First, we install and load in the ‘cowplot’ package, which provides various features that help in aligning plots.
install.packages("cowplot") # Install package
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
library('cowplot') # Load in data
Then, we create a new variable called ‘panel1’, which comprises Figure1USA and Figure1UK. We create another variable called ‘panel2’, which contains the figures for Germany and Study 2. Next, we use the plot_grid() function to arrange the plots into a grid and use the ‘labels’ argument to add a label for the x-axis: ‘Empathy’. The vjust and hjust arguments are used to position this label at the centre of the graph.
panel1 <-plot_grid(Figure1USA, Figure1UK)
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
panel2 <-plot_grid(Figure1Germany, Figure1Study2)
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
plot_grid(panel1, panel2, labels = "Empathy",
vjust=47, hjust = -6)
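An alternative to cowplot, for anyone who prefers it, is the patchwork package, which combines ggplots with arithmetic-style operators; a minimal sketch (assuming patchwork is installed):
library(patchwork)
(Figure1USA | Figure1UK) /         # top row: USA and UK side by side
  (Figure1Germany | Figure1Study2) # bottom row: Germany and Study 2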
So, the attempt at reproducing figure 1 in the paper has been fairly successful! A reminder of what the original figure in the paper looks like is below.
Figure 2:
Figure 2 displays physical-distancing motivation in each condition of Study 3.
First, we re-code the column items used to measure compliance with physical distancing in the data set for study 3. To do this, we pipe the data through rowwise() and then, mutate().
Study3 <-Study3 %>%
rowwise() %>%
mutate(RECODE1 = 6-Q22_1,RECODE2 = 6-Q22_2,RECODE3 = 6-Q22_3,RECODE4 = 6-Q22_4, RECODE5 = 6-Q22_5)
Next, we pipe the data through rowwise() and then mutate() to create a new variable called “PD”, which is the sum of the re-coded columns measuring physical distancing. Then, we use rowwise() and mutate() again to divide the PD variable by 5 (i.e. the number of items measuring motivation to physically distance) and save it as a new variable called “AvgPD”.
Study3 <- Study3 %>%
rowwise() %>%
mutate(PD = sum(RECODE1, RECODE2, RECODE3, RECODE4, RECODE5))
Study3 <- Study3 %>% rowwise() %>% mutate(AvgPD = PD/5)
Then, we use the factor() function and the ‘levels’ argument to specify how the levels of the ‘bed’ variable are coded. Essentially, this re-orders the levels of ‘bed’ to match the order in which the article presents the conditions.
Study3$bed <- factor(Study3$bed, levels = c('0','2', '1'))
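As an aside, the same factor() call could attach the condition labels directly, which would make the relabelling inside scale_x_discrete() further down unnecessary; a sketch of that alternative (not the code used below):
Study3$bed <- factor(Study3$bed, levels = c('0', '2', '1'),
                     labels = c('Control', 'Information Only', 'Information + Empathy')) # labels attached at the factor step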
To create the plot, we follow a similar process to Figure1:
install.packages("ggeasy")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
library(ggeasy)
ggplot(data= Study3, mapping =
aes(x= bed, y= AvgPD, group=bed, fill=bed)) +
geom_violin()+
scale_x_discrete(name= NULL, breaks =
c('0','2','1'),
labels = c('Control', 'Information Only', 'Information + Empathy')) +
scale_y_continuous(name = 'Physical-Distancing Motivation') +
theme(panel.background = element_blank(), panel.border = element_rect(fill = NA, size = 0.5)) +
scale_fill_manual(values =
c('lightskyblue2','deepskyblue4',
'darkolivegreen2')) +
easy_remove_legend() +
geom_boxplot(width=0.1, color="black", fill = 'white') + stat_summary(fun= 'mean', geom='point', shape = 9,
colour = 'black', cex=3)
Our attempt at reproducing figure 2 has been successful! A reminder of what figure 2 in the paper looks like is below.
Figure 3:
Figure 3 displays the motivation to wear a face mask in each condition of Study 4.
First, we use the factor() function and the ‘levels’ argument to specify how the levels of the ‘bed’ variable are coded. Essentially, this re-orders the levels of ‘bed’ to match the order in which the article presents the conditions.
Study4$bed <- factor(Study4$bed, levels = c('1','0', '2'))
To reproduce the figure, we use the same functions required in re-creating figure 2.
ggplot(data= Study4, mapping = # Establishes that the object is a plot and specifies the data set.
aes(x= bed, y= Q22_1, group=bed, fill=bed)) + # identify the x-axis and y-axis variables as well as other aesthetics.
geom_violin() + # To make the violin plot.
scale_x_discrete(name= NULL,
breaks = c('1','0','2'),
labels = c('Control',
'Information Only', 'Empathy')) + # To remove the x-axis title, specify the values for the x-axis and name each condition.
scale_y_continuous(name = 'Motivation to Wear a Face Mask') + # To name the y-axis title.
theme(panel.background = element_blank(),panel.border = element_rect(fill = NA, size = 0.5)) + scale_fill_manual(values = c('lightskyblue2', 'deepskyblue4', 'darkolivegreen2')) +
easy_remove_legend() + # To remove the legend from the graph.
geom_boxplot(width=0.1, color="black", fill = 'white') + # To overlay a narrow box plot showing the summary statistics for each condition.
stat_summary(fun= 'mean', geom='point', shape = 9, colour = 'black', cex=3) # To mark the mean of each condition with a point.
Our attempt at reproducing figure 3 has also been successful! A reminder of what the original figure in the paper looks like is below.
Is there a relationship between an individual’s household size and their physical distancing practice in Study 1?
The authors assessed the participants on several demographics, including age, gender, household size and education level. My initial thought was to see whether any of these variables had an effect on people’s compliance with physical distancing, particularly household size. Study 1 comprised participant samples from the UK, the US and Germany, so if an effect is found, I can also see whether it is consistent across countries, as there was great variability in how different countries handled COVID-19. I was interested in household size because physical-distancing policies require individuals to limit social contact with other close and valued people; I assumed that those who live alone or with fewer people are more likely to feel lonely and may therefore be less likely to adhere to physical distancing.
To begin with, I calculated the necessary descriptive statistics, then created a scatter plot to visualise the question and lastly, carried out a correlation test. I followed this process for each sample in study 1 separately.
I first load in the package ‘tidyverse’, which contains most of the functions I will need for general coding, and then the package ‘gridExtra’, which includes the function grid.arrange(). I will need this function later on to combine the scatter plots into one graph.
library(tidyverse)
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
For Study 1 UK:
To calculate the descriptive statistics:
I used the transmute() function to retain only the selected variables, household size and the column measuring physical distancing (i.e. Q24_1), in the data set. Then, I used group_by() to group the physical-distancing scores by household size and found the mean, SD and number of participants for each household size using the summarise() function. I also renamed the headings for clarity.
Mean and SD:
Social.Dist <- Study1UK %>%
transmute(Household_size, Q24_1) %>%
group_by(Household_size) %>%
summarise(mean = mean(Q24_1),
sd = sd(Q24_1),
n = n()) %>%
rename(Mean_Social_Distance_Ratings = mean,
Standard_Deviation = sd,
Number_of_Participants = n)
The means in the output seem to run counter to my initial assumption, with higher means produced by the smaller household-size groups. However, the number of participants in each group is very variable, with only one person with a household size of 9 and one person with a household size of 33, hence the NAs in the SD column. The person with a household size of 33 looks like an outlier and, I think, may distort the scatter plot of the data.
Scatter Plot:
With Outlier:
First, I decided to create the scatter plot keeping the outlier in the data to see what the graph would look like and whether it had any effect on the output.
- I used ggplot(), specifying household size on the x-axis and the item measuring physical distancing on the y-axis.
- Then, I used geom_jitter(), which adds a small amount of random variation to the location of each score to create some noise.
- I used geom_smooth() to add a trend line over the data, specifying ‘method = lm’ to fit a straight (linear) line; ‘se = FALSE’ means the plot does not display a standard-error band.
- I used scale_x_continuous() to remove the x-axis title, set the x-axis limits to 1-33 and, with the ‘breaks’ argument, make the x-axis values go up by 5.
- I used scale_y_continuous() to remove the y-axis title, set the y-axis limits to 1-5 and, with the ‘breaks’ argument, make the y-axis values go up by 1.
- I used ggtitle() to add a plot title and theme_minimal() to apply a minimalistic theme.
Fig1UK <- ggplot(Study1UK, mapping = aes(Household_size, Q24_1)) +
geom_jitter() +
geom_smooth(method = "lm", se = FALSE) +
scale_x_continuous(name = NULL, limits = c(1,33),
breaks = c(5,10,15,20,25,30,35)) +
scale_y_continuous(name = NULL, limits = c(1,5),
breaks = c(1,2,3,4,5)) +
ggtitle('Study 1: UK Sample') + theme_minimal()
Fig1UK
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 90 rows containing missing values (geom_point).
It can be seen that with the outlier included in the data, the scores occupy only a third of the graph. As a result, not much of a relationship can be discerned between the variables. Thus, I created the scatter plot again following the same process as above, but changed the x-axis limits from 1-33 to 1-9 and set the values to go up by 1.
Without Outlier:
Fig1UK <- ggplot(Study1UK, mapping = aes(Household_size, Q24_1)) +
geom_jitter() +
geom_smooth(method = "lm", se = FALSE) +
scale_x_continuous(name = NULL, limits = c(1,9),
breaks = c(1,2,3,4,5,6,7,8,9)) +
scale_y_continuous(name = NULL, limits = c(1,5),
breaks = c(1,2,3,4,5)) +
ggtitle('Study 1: UK Sample') + theme_minimal()
Fig1UK
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 1 rows containing non-finite values (stat_smooth).
## Warning: Removed 84 rows containing missing values (geom_point).
Here, the scores are more spread out and the trend is easier to follow. The line of best fit shows a slight negative slope.
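Note that setting limits inside scale_x_continuous() drops the out-of-range observations before the regression line is fitted (hence the warnings about removed rows). An alternative, shown here only as a sketch, is to zoom in with coord_cartesian(), which keeps every observation in the fit and restricts only the visible area:
Fig1UK_zoom <- ggplot(Study1UK, mapping = aes(Household_size, Q24_1)) +
  geom_jitter() +
  geom_smooth(method = "lm", se = FALSE) +
  coord_cartesian(xlim = c(1, 9), ylim = c(1, 5)) + # zooms in without removing the outlier from the data
  ggtitle('Study 1: UK Sample') + theme_minimal()
Fig1UK_zoom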
Correlation Test:
I used cor.test() to run a correlation test between the two variables and specified that it was a pearson test.
cor.test(Study1UK$Household_size, Study1UK$Q24_1,
method = "pearson")
##
## Pearson's product-moment correlation
##
## data: Study1UK$Household_size and Study1UK$Q24_1
## t = -0.37958, df = 315, p-value = 0.7045
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.13123138 0.08898574
## sample estimates:
## cor
## -0.02138217
The results indicate that there is insufficient evidence of a correlation between an individual’s household size and their compliance with physical distancing, r = -0.02, p = 0.70.
For Study 1 USA:
To calculate the descriptive statistics, create a scatter plot and run a correlation test, I use the same functions and follow the same procedure as implemented for Study 1 UK.
Mean and SD:
Social.Dist2 <- Study1USA %>%
transmute(Household_size, Q24_1) %>% # Retains only selected variables.
group_by(Household_size) %>% # Groups physical distancing scores by household size.
summarise(mean = mean(Q24_1),
sd = sd(Q24_1),
n = n()) %>% # To find mean, sd and number of participants for each household size
rename(Mean_Social_Distance_Ratings = mean,
Standard_Deviation = sd,
Number_of_Participants = n) # To rename headings.
The table displays a similar pattern of means to that seen in Study 1 UK: mean scores are higher for the smaller household sizes. However, the number of participants in each group is again very variable, with the smaller household sizes having much larger numbers of participants.
Scatter Plot:
Fig1USA <- ggplot(Study1USA, mapping = aes(Household_size, Q24_1)) + # Specify x-axis and y-axis variables.
geom_jitter() + # Adds a small amount of random variation to the location of each score to create some noise.
geom_smooth(method = "lm", se = FALSE) + # To add a trend line.
scale_x_continuous(name = NULL, limits = c(1,11),
breaks = c(1,2,3,4,5,6,7,8,9,10,11)) + # To remove the x-axis title, set the limits for the x-axis and the values for the breaks.
scale_y_continuous(name = NULL, limits = c(1,5),
breaks = c(1,2,3,4,5)) +
ggtitle('Study 1: USA Sample') + # To add a plot title
theme_minimal() # To apply a minimalist theme.
Fig1USA
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 112 rows containing missing values (geom_point).
Here, the physical distancing scores are quite scattered and the line of best fit displays a slight negative slope.
Correlation Test:
cor.test(Study1USA$Household_size, Study1USA$Q24_1,
method = "pearson")
##
## Pearson's product-moment correlation
##
## data: Study1USA$Household_size and Study1USA$Q24_1
## t = -1.0038, df = 320, p-value = 0.3162
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.16432076 0.05359827
## sample estimates:
## cor
## -0.05602849
The results for Study 1 USA indicate that there is insufficient evidence of a correlation between an individual’s household size and their physical-distancing practice, r = -0.05, p = 0.32.
For Study 1 Germany:
To calculate the descriptive statistics, produce a scatter plot and perform a correlation test, I use the same functions and follow the same procedure as implemented for the studies above.
Mean and SD:
Social.Dist3 <- Study1Germany %>%
transmute(Household_size, Q77_1) %>% # Retains only the selected variables, household size and the column measuring physical distancing.
group_by(Household_size) %>% # Groups the physical distancing scores by household size.
summarise(mean = mean(Q77_1),
sd = sd(Q77_1),
n = n()) %>% # Finds the mean, SD and number of participants for each household size.
rename(Mean_Social_Distance_Ratings = mean,
Standard_Deviation = sd,
Number_of_Participants = n) # Renames headings.
The mean scores for each household size are quite variable, along with the number of participants in each group, with the smaller household sizes having much larger numbers of participants.
Scatter Plot:
Fig1Germany <- ggplot(Study1Germany, mapping = aes(Household_size, Q77_1)) + ## Specify x-axis and y-axis variables.
geom_jitter() + # Adds a small amount of random variation to the location of each score to create some noise.
geom_smooth(method = "lm", se = FALSE) + # To add a trend line.
scale_x_continuous(name = NULL, limits = c(1,7),
breaks = c(1,2,3,4,5,6,7)) + # To remove the x-axis title, set the limits for the x-axis and the values for the breaks.
scale_y_continuous(name = 'Social Distancing Practice Rating', limits = c(1,5), breaks = c(1,2,3,4,5)) +
ggtitle('Study 1: Germany Sample') + #To add a plot title
theme_minimal() #To apply a minimalist theme.
Fig1Germany
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 95 rows containing missing values (geom_point).
The physical distancing practice scores are quite scattered, with no obvious trend that can be perceived; however, the line of best fit seems to display a slight negative slope.
Correlation Test:
cor.test(Study1Germany$Household_size, Study1Germany$Q77_1,
method = "pearson")
##
## Pearson's product-moment correlation
##
## data: Study1Germany$Household_size and Study1Germany$Q77_1
## t = -0.22406, df = 324, p-value = 0.8229
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.1209082 0.0963086
## sample estimates:
## cor
## -0.01244664
The outcome of the correlation test indicates that there is insufficient evidence of a correlation between household size and adherence to physical distancing, r = -0.01, p = 0.82.
Combining the graphs:
I decided to arrange the scatter plots in a row to compare the trend across the different samples. To do this, I used the grid.arrange() function and specified the x-axis title using the ‘bottom’ argument as well as the overall title using the ‘top’ argument. ‘nrow’ specifies that I want the plots laid out in one row.
grid.arrange(Fig1Germany, Fig1UK, Fig1USA,
nrow = 1,
top = "Correlation Between Household Size and Social Distancing Practice",
bottom = "Household Size")
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 106 rows containing missing values (geom_point).
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 1 rows containing non-finite values (stat_smooth).
## Warning: Removed 89 rows containing missing values (geom_point).
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 102 rows containing missing values (geom_point).
Results:
Ultimately, there is insufficient evidence of a correlation between a person’s household size and physical-distancing compliance. The number of participants in each household-size category fluctuated greatly, with some groups containing only one participant.
Is there a relationship between an individual’s education level and their physical distancing practice in Study 1?
My thought process for question 2 was similar to that for question 1, except that now I want to explore whether an individual’s education level has any effect on their compliance with physical distancing. I am interested in this because research has shown that more educated individuals are more likely to comply with preventative measures than individuals with less or no education. It may be that people who have completed higher education also have greater general knowledge about the nature of viruses and how they are transmitted, and as a result may be more willing to comply with COVID-19 measures.
To begin, I calculated the necessary descriptive statistics, then created a scatter plot to visualise the question and lastly, carried out a correlation test. I applied this process for each sample in study 1 separately.
For Study 1 UK:
The authors did not provide any information in their supplementary material regarding the levels of education that were recorded or how each value was coded. Hence, I used the count() function to identify the range of education levels for the participants in this sample. Then, I used the factor() function to label the education levels accordingly.
Study1UK %>%
count(Education)
## # A tibble: 5 × 2
## # Rowwise:
## Education n
## <dbl+lbl> <int>
## 1 2 [Secondary School] 34
## 2 3 [College/A levels] 100
## 3 4 [Undergraduate degree (BA/Bsc/other)] 126
## 4 5 [Graduate degree (MA/MSc/MPhil/other)] 48
## 5 6 [Doctorate degree (PhD/MD/other)] 9
Study1UK$Education <- factor(Study1UK$Education,
levels = c(2,3,4,5,6),
labels = c("Secondary School",
"College/A Levels",
"Undergraduate Degree (BA/Bsc/Other)",
"Graduate Degree (MA/MSc/MPhil/Other)",
"Doctorate Degree (PhD/MD/Other)"))
Mean and SD:
To calculate the descriptive statistics:
I used the transmute() function to retain only the selected variables, education and the column measuring physical distancing (i.e. Q24_1) in the data set. Then, I used group_by() to group the physical distancing scores by education level and found the mean, SD and number of participants using the summarise() function. I also renamed the headings for more clarity of the data.
Edu.Lvl <- Study1UK %>%
transmute(Education, Q24_1) %>%
group_by(Education) %>%
summarise(mean = mean(Q24_1),
sd = sd(Q24_1),
n = n()) %>%
rename(Mean_Social_Distance_Practice = mean,
Standard_Deviation = sd,
Number_of_Participants = n)
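To display this summary more readably in the knitted report, the table could be passed to knitr::kable(); a minimal sketch, assuming the knitr package is available:
# Print the education-level summary as a formatted table
knitr::kable(Edu.Lvl, digits = 2,
             caption = "Physical distancing by education level (Study 1, UK sample)")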
The mean scores generally increase as education level increases. However, the number of participants in each group is quite variable, which may partly explain why some groups have higher means than others.
Scatter Plot:
Edu1.UK <- ggplot(Study1UK, mapping = aes(x = as.numeric(Education), Q24_1)) +
geom_jitter() +
geom_smooth(method = "lm", se = FALSE) +
scale_x_continuous(name = NULL, limits = c(1,6), breaks = c(1,2,3,4,5,6)) +
scale_y_continuous(name = 'Social Distancing Practice', limits = c(1,5),
breaks = c(1,2,3,4,5)) +
ggtitle('Study 1: UK Sample') + theme_minimal()
Edu1.UK
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 92 rows containing missing values (geom_point).
Here, the line of best fit shows a positive correlation between education level and physical distancing compliance.
Correlation Test:
I used cor.test() to run a correlation test between the two variables, specifying a Pearson test. Because Education was converted to a factor above, I passed it through as.numeric() so that cor.test() receives a numeric vector (Pearson’s r is unaffected by this linear recoding of the levels).
cor.test(as.numeric(Study1UK$Education), Study1UK$Q24_1,
method = "pearson")
The correlation between an individual’s education level and their adherence to physical distancing is small, positive and statistically significant, r = 0.149, p = 0.007. This suggests that as the level of education an individual has completed increases, so does their compliance with physical distancing.
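Since education level is an ordered categorical variable rather than a truly continuous one, a rank-based correlation could be used as a robustness check. A sketch, not part of the original analysis:
# Spearman correlation as a robustness check; exact = FALSE avoids the
# tied-ranks warning that an exact p-value would trigger
cor.test(as.numeric(Study1UK$Education), Study1UK$Q24_1,
         method = "spearman", exact = FALSE)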
For Study 1 USA:
To calculate the descriptive statistics, produce the scatter plot and perform the correlation test, I followed the same process as for the Study 1 UK sample.
First, I used the count() function to identify the range of education levels for the participants in this sample and then used the factor() function to label the education levels accordingly.
Study1USA %>%
count(Education)
## # A tibble: 6 × 2
## # Rowwise:
## Education n
## <dbl+lbl> <int>
## 1 1 [No formal qualifications] 7
## 2 2 [Secondary School] 79
## 3 3 [College/A levels] 67
## 4 4 [Undergraduate degree (BA/Bsc/other)] 117
## 5 5 [Graduate degree (MA/MSc/MPhil/other)] 44
## 6 6 [Doctorate degree (PhD/MD/other)] 8
Study1USA$Education <- factor(Study1USA$Education,
levels = c(1,2,3,4,5,6),
labels = c("No Formal Qualifications",
"Secondary School",
"College/A Levels",
"Undergraduate Degree (BA/Bsc/Other)",
"Graduate Degree (MA/MSc/MPhil/Other)",
"Doctorate Degree (PhD/MD/Other)"))
Mean and SD:
Edu.Lvl2 <- Study1USA %>%
transmute(Education, Q24_1) %>%
group_by(Education) %>%
summarise(mean = mean(Q24_1),
sd = sd(Q24_1),
n = n()) %>%
rename(Mean_Social_Distance_Practice = mean,
Standard_Deviation = sd,
Number_of_Participants = n)
The same general trend as in the Study 1 UK sample can also be seen here.
Scatter Plot:
Edu1.USA <- ggplot(Study1USA, mapping = aes(x = as.numeric(Education), Q24_1)) +
geom_jitter() +
geom_smooth(method = "lm", se = FALSE) +
scale_x_continuous(name = NULL, limits = c(1,6),
breaks = c(1,2,3,4,5,6)) +
scale_y_continuous(name = NULL, limits = c(1,5),
breaks = c(1,2,3,4,5)) +
ggtitle('Study 1: USA Sample') + theme_minimal()
Edu1.USA
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 100 rows containing missing values (geom_point).
The line of best fit shows a positive correlation between education level and physical distancing compliance for the USA sample.
Correlation Test:
cor.test(as.numeric(Study1USA$Education), Study1USA$Q24_1,
method = "pearson")
The correlation between the variables is small, positive and statistically significant, r = 0.148, p = 0.008. This suggests that as the level of education an individual has completed increases, so does their compliance with physical distancing.
For Study 1 Germany:
The authors did not assess the education level of participants in the German sample for Study 1, and there was no explanation as to why.
Combining the Plots:
I decided to arrange the scatter plots side by side to compare the trend across the two samples. To do this, I used the grid.arrange() function.
grid.arrange(Edu1.UK, Edu1.USA,
nrow = 1,
top = "Correlation Between Education Level Completed and Social Distancing Practice",
bottom = "Education Level Completed")
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 84 rows containing missing values (geom_point).
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 107 rows containing missing values (geom_point).
Results:
The analyses revealed a small, positive and statistically significant correlation between an individual’s education level and their compliance with physical distancing in both the UK and USA samples for Study 1, supporting our hypothesis and past research. However, the number of participants in each education group varied greatly, with some groups containing twice as many participants as others, which may have skewed the results. One caveat when comparing the panels: because as.numeric() on a factor returns the level index, the x-axis in the UK panel runs from 1 (Secondary School) to 5 (Doctorate), whereas in the USA panel it runs from 1 (No Formal Qualifications) to 6 (Doctorate), so the numeric axes are not directly comparable across panels.
Is there an effect of gender (male or female) on whether empathy promotes physical distancing in Study 3?
For question 3, I want to explore whether gender has any effect on empathy promoting physical distancing in Study 3. I am interested in this because, as mentioned in one of my reactions to the paper, research has shown that females tend to comply better with COVID-19 measures.
To begin, I calculated the necessary descriptive statistics, then created a bar graph to visualise the question and lastly, carried out a one-way ANOVA test.
First, I piped the data set through rowwise() and then mutate() to create a new variable called ‘Rating’, which contains the sum of the re-coded physical distancing scores.
Study3 <- Study3 %>%
rowwise() %>%
mutate(Rating = sum(RECODE1, RECODE2,
RECODE3, RECODE4, RECODE5))
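As a side note, the same row totals could be computed without rowwise(), which can be faster on larger data sets. A minimal sketch of this alternative, assuming the five RECODE columns are the only items to sum:
# Equivalent alternative: drop the rowwise grouping and use rowSums()
Study3 <- Study3 %>%
  ungroup() %>%
  mutate(Rating = rowSums(across(c(RECODE1, RECODE2, RECODE3,
                                   RECODE4, RECODE5))))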
Then, I used the factor() function to label the levels of the ‘Gender’ variable.
Study3$Gender <- factor(Study3$Gender,
levels = c(1,2),
labels = c("Male",
"Female"))
Mean and SD:
I used the transmute() function to retain only the selected variables, ‘Gender’ and ‘Rating’, in the data set. Then, I used group_by() to group the physical distancing scores by gender and na.omit() to remove any incomplete cases. I computed the mean, SD and number of participants with the summarise() function and renamed the headings for clarity.
Gender.Effect0 <- Study3 %>%
transmute(Gender, Rating) %>%
group_by(Gender) %>%
na.omit() %>%
summarise(mean = mean(Rating),
sd = sd(Rating),
n = n()) %>%
rename(Mean_Social_Distance_Practice = mean,
Standard_Deviation = sd,
Number_of_Participants = n)
The table shows that females have a higher mean score than males, which may foreshadow a significant test result below.
Column Graph:
The ‘Gender’ column in the table of means is already a factor after the earlier recoding, so the as.factor() call below simply ensures it is treated as a factor when plotting.
Gender.Effect0$Gender <- as.factor(Gender.Effect0$Gender)
To create the column graph:
Gender.Effect2 <- ggplot(Gender.Effect0, mapping = aes(Gender, Mean_Social_Distance_Practice)) +
geom_col(fill = "blue") +
scale_y_continuous(name = "Social Distance Practice") +
ggtitle('Gender Differences in Effect of Empathy on Social Distancing Practice') +
theme_classic()
Gender.Effect2
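Since the columns show group means only, error bars could be layered on using the standard deviations already computed in the summary table; a minimal sketch:
# Add +/- 1 SD error bars around each group mean
Gender.Effect2 +
  geom_errorbar(aes(ymin = Mean_Social_Distance_Practice - Standard_Deviation,
                    ymax = Mean_Social_Distance_Practice + Standard_Deviation),
                width = 0.2)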
One-way ANOVA Test:
I used the aov() function to conduct a one-way ANOVA, specifying ‘Rating’ as the outcome and ‘Gender’ as the grouping factor. Then, I used the summary() function to print a summary of the model.
anova_one_way <- aov(Rating ~ Gender, data = Study3)
summary(anova_one_way)
## Df Sum Sq Mean Sq F value Pr(>F)
## Gender 1 179 178.55 13.64 0.000235 ***
## Residuals 862 11282 13.09
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 4 observations deleted due to missingness
The results indicate a statistically significant difference in physical distancing compliance between the two genders.
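With only two groups, a one-way ANOVA addresses the same question as a t-test; as a cross-check (a sketch, not part of the original analysis), the comparison could also be run with t.test(), which applies Welch’s correction by default:
# Welch two-sample t-test comparing mean Rating across Gender
t.test(Rating ~ Gender, data = Study3)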
Results:
As expected, there is a statistically significant difference in the adherence to physical distancing between males and females.
The only supporting material the authors provided on their OSF page was a PDF document labelled ‘Supplementary Material’. This consisted of a full description of the items and materials used in the studies, along with information about excluded participants, additional measures they examined and their findings, and discussions about the relevance of their work. There was no outline of any code used to produce their statistics and graphs; therefore, the first problem encountered in the reproducibility process was not knowing how or where to begin. We had to navigate the process with only what we had learnt from Danielle’s coding modules and the help of Google. In addition, no explanation was given for many of the variables in the data sets, so it was unclear what certain variables were measuring and how they were relevant to the studies, e.g. some variables were labelled “ve”, “q” and “i”. As such, I would recommend that for future papers the authors provide a good codebook that allows other people to analyse and validly interpret their data without needing any information beyond what is given. This would include, as the name ‘codebook’ suggests, a clear layout of the code that was used and coherent information about the steps taken to run the analyses. It would also include a clear account of each variable in the data sets, including what it is, how it was measured, and why. A good codebook would therefore improve the efficiency of reproducibility attempts and make psychological data more transparent, interpretable and reusable. A great reference on how to write a good codebook can be found here; while that document refers to medicine, I believe the guide can be applied to psychological science as well.
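Even a simple machine-readable variable summary would be a useful starting point. A minimal sketch, assuming the data were imported with haven so that column label attributes exist (the data-set name Study1UK and the output file name are illustrative only):
# Build a basic variable summary: name, class, and any haven label attribute
var_summary <- data.frame(
  variable = names(Study1UK),
  class    = sapply(Study1UK, function(x) class(x)[1]),
  label    = sapply(Study1UK, function(x) {
    lab <- attr(x, "label")
    if (is.null(lab)) NA_character_ else lab
  })
)
# Write it out so it can be shared alongside the data
write.csv(var_summary, "study1_uk_codebook.csv", row.names = FALSE)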
For this paper, the authors provided only two (out of six) data sets in which the variable and column descriptions were translated from German into English. We only became aware of this language issue halfway through producing the descriptive statistics for Study 1, as it was not mentioned anywhere in the supplementary material. The first image below shows one of the data sets for Study 1: the first few columns are the statements used to measure empathy, physical-distancing practice or compliance with wearing a face mask, and these, along with the rest of the questions and variables, are in English. In the second image, which displays another data set for Study 1, the same statements are in German.
English Version:
German Version:
I was confused as to why the authors decided it was appropriate to leave the other data sets untranslated and to give no indication of which columns were relevant to which measures. I also found that Google Translate was inaccurate and unreliable for translating the German, so our team had to rely on key words that matched the statements described in the supplementary material to determine what each column was measuring. Therefore, I would recommend that for future papers the authors keep the translation consistent and ensure that their data sets are all in English to prevent language barriers, since English is the primary language of scientific publishing. This could be done by the researchers translating as they go along or by dedicating specific people in the team to ensuring descriptions are consistently translated; for accurate translation, the authors could also work with professional translation services. That said, while English is the primary language, it should be acknowledged that this puts non-English speakers at a disadvantage, and efforts must be made both ways to help non-native English speakers publish in English-language journals.
In this paper, for Studies 2 and 3, the authors re-coded the items measuring physical distancing. This was mentioned only briefly in the paper, and there was no guidance on how or why the data were re-coded. As far as I can tell there was no real reason for it, since all the re-coding did was flip the rating scales, which originally ranged from 1 = very unlikely to 5 = very likely, so that 5 = very unlikely and 1 = very likely. This caused quite a bit of confusion because the re-coding was not applied to all of the studies, and reversing it was a tedious process that could have been avoided had the researchers kept their rating scales consistent throughout and avoided re-coding that served no apparent purpose. Furthermore, the supplementary material provided a description of the sample characteristics, which made it clear that the samples consisted only of participants who identified as male or female. However, when reproducing the gender descriptives we found another category labelled ‘N.A.’, and upon reproducing other descriptive statistics we realised that the scores of these individuals were included in the analyses. This was confusing and inconsistent with what was stated in the supplementary material. Therefore, I would recommend that in future papers the authors provide a more thorough and precise explanation of their analyses, including the reasons behind any re-coding. I would also suggest that they apply re-coding consistently across their studies and ensure the information conveyed in the supplementary materials matches the raw data. This blog post is a great resource on the importance of maintaining high data quality and consistency and on ways this can be achieved.
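If reverse-scoring is genuinely needed, making it explicit in shared code would remove the ambiguity. A minimal sketch, assuming a 5-point scale and a hypothetical original item name Item1:
# Reverse-score a 1-5 item transparently: 1 <-> 5, 2 <-> 4, 3 unchanged
Study3 <- Study3 %>%
  mutate(RECODE1 = 6 - Item1)  # Item1 is a hypothetical original column name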