https://rpubs.com/sarahodaniels/930468

Part 1 - Summary and Reactions

Summary

There is a wealth of research indicating that older individuals have better emotional wellbeing, with two dominant theories attempting to explain this phenomenon. Socioemotional Selectivity Theory (SST) suggests that, due to changes in perceptions of remaining time, goals switch from being future-oriented in youth, to present experience enjoyment in old age. Alternatively, the Strength and Vulnerability Integration (SAVI) model argues that older people have better emotional wellbeing simply because they evade stress. There is a paucity of evidence providing direct support for one explanation over the other, due to ethical constraints on research; however, since this pandemic is an unavoidable chronic stressor that all people face regardless of age, it creates the perfect conditions for this investigation. Thus, the current study aims to leverage this situation and explore whether these emotional wellbeing differences with age depend on stress avoidance (as per the SAVI model).

945 adults living in the U.S. during the country’s first COVID-19 pandemic peak, in April 2020, completed a variety of self-reported questionnaires. When completing the survey, participants rated the frequency of which they experienced a list of 29 positive and negative emotions on a Likert scale from 0 to 4, and intensity from 1 to 5 (with higher scores indicating greater frequency and intensity). Perceptions of remaining time were measured by the Future Time Perspective (FTP) Scale, where 10 items were rated from 1 to 7 to produce an overall, extension, opportunity, and constraint score. Additionally, the extent of the risk of COVID-19 to themself and others was rated by participants from 0 (none) to 5 (high risk). Furthermore, occupational impact was rated from 0 (no impact) to 4 (great impact), and self-health perceptions were rated from 1 (good) to 5 (poor). Finally, the Big Five personality traits (openness, conscientiousness, extraversion, agreeableness, and neuroticism) were measured using the Ten-Item Personality Inventory (TIPI).

Results supported their hypothesis that the superior emotional wellbeing of older individuals would remain, regardless of the stress that the pandemic has produced. They found that older age was related to a lower frequency and intensity of negative emotions, and higher frequency and intensity of positive emotions. It should be noted that, when controlling for the effects of personality and demographic factors, the relationships between age and the frequency and intensity of negative emotions and frequency of positive emotions remained, but not the relationship between age and intensity of positive emotions. The researchers also found that risk perceptions increased with age, indicating that the improved emotional wellbeing in older individuals was not due to a lack of awareness of the risk that COVID-19 poses. Furthermore, older individuals have greater emotional wellbeing even though they are more likely to feel like they have less time remaining.

Overall, emotional wellbeing improvements across age remained even with the unavoidable stress that the pandemic has produced. Additionally, this greater emotional wellbeing is not due to older individuals failing to recognise the risk COVID-19 poses to them. Therefore, the researchers concluded that evasion of stress is not key to older individuals having better emotional wellbeing, and as such, counters the claims of the dominant SAVI model.

Reactions

The logic of the rationale was compelling because the global pandemic has provided a unique opportunity to investigate the two main theories of the mechanism behind the association between advanced age and greater emotional wellbeing. As it is unethical to experimentally manipulate participants to experience a prolonged stressor, previous research into the theories lacks the ability to directly determine whether stress evasion is the root of elderly individuals’ greater wellbeing. This pandemic has impacted everyone and it is not possible to escape the risks and stress it has created. Therefore, the situation has provided a basis by which the researchers can argue that if the age differences in emotional wellbeing remain during this chronic stress, avoidance of stressors (as proposed by the SAVI model) is not the best explanation.

I’m not sure that I understood why the researchers chose to binarise their participants’ characteristics. For example, despite collecting detailed information about the race distribution of their participants, the researchers chose to divide their sample into white and non-white individuals. They did the same thing with education levels (did or did not partake in at least some college) and employment (working or not working for pay). My theory as to why they did this is that some of the race groups may have had such a small number of participants identifying as them, thus potentially making it difficult to carry out analyses. However, this was not true of education levels or employment. There is a wealth of understanding that can be gained when you look into trends across different races, employment status, and education levels. As such, it is a shame that they did not consider this.

When I was reading I was excited to learn that the researchers pre-registered their plans, sample size, and exclusion criteria for this study, as mentioned in their paper. Doing so is beneficial because it prevents the researchers from increasing their sample size until they get a significant result, and biassing their samples by excluding participants who don’t fit the data trends that they are looking for. Furthermore, pre-registration ensures that any changes to the design and analyses would have to be thoroughly explained and justifiable. It is great to see that these researchers are committed to taking the reproducibility crisis into consideration, a product of which is that confidence and trust in the paper’s results and conclusions can be increased.

Part 2 - Verification

Reproducibility Goals

The three main parts that we aimed to verify were the following:

1. Demographic descriptive statistics

As seen in the following screenshot from the paper, the demographic descriptive statistics include the sample characteristics about sex, age, race, household income, employment status, education levels, and whether they live alone. We aimed to verify all of the highlighted values, along with the total number of participants in the analyses (945 participants).
(page 3)

2. Means and SDs

Additionally, we aimed to verify the means and standard deviations (SDs) of the measures that the researchers used:

  • Time horizons
    • Overall FTP scores → M = 4.09, SD = 1.35
    • Extension → M = 4.02, SD = 1.91
    • Opportunity → M = 4.73, SD = 1.62
    • Constraint → M = 3.74, SD = 1.76
  • Perceived risk (all from page 4)



  • Effect on employment
    • M = 1.41, SD = 1.46
  • Subjective health
    • M = 3.36; SD = 0.99
  • Personality (from page 4)

We also aimed to replicate the following means, SDs, and mean differences:

3. Tables and Figures

Another goal was to verify Table 1 (emotion frequency means, SDs, and 95% confidence interval limits).

Table 1 (from page 5)

Finally, we aimed to reproduce Figure 1 (a scatterplot of the relationship between age and the frequency and intensity of positive and negative emotions).

Figure 1 (from page 7)

Finding the paper’s data

Since the paper included the OSF link in its “Transparency” section after the discussion, finding their data was relatively easy. I will note that they had a CSV data file, and an excel one which included an extra 155 rows. Initially, I wasn’t sure which one they used in their paper, but upon further investigation into their OFS file “Variables name and meaning”, I noticed the following:

Since the excel data included “assisted_living.xlsx” in its name, and there is a variable called “sample” where participants are marked as Prolific or Eskaton, I was able to determine that the extra 155 participants were the “Eskaton” assisted living sample. It was very confusing when I found this because the published paper does not refer to it, nor is there an explanation for it in the OSF, which made me realise that just having open data is not enough to make the reproducibility process easy. Since the Eskaton sample was not analysed in the paper, we downloaded the csv data file to use in replicating the published results (though I will come back to the Eskaton sample in my exploratory analyses).

Loading the packages

Firstly, we install the packages needed that weren’t already installed, using install.packages(). Next we loaded the packages, using library() so that they could be utilised:

library(tidyverse) # really helpful package with a wide range of functions
library(plyr) # similar to dplyr (as below)
library(dplyr) # for making data easier to deal with (data wrangling)
library(ggplot2) # to make plots
library(patchwork) # to combine plots
library(readxl) # to import excel data files
library(stats) # for statistical analyses
library(knitr) # to display output in a nice table

Importing data into the Rmd file

read_csv() is a function from readr (within the tidyverse package) which we can use to load the csv data into the Rmd. Using the <- assignment operator from R base, we have created a new object named aaec which is a simpler name than “AgeAdvantagesEmotionCovid” and makes the data frames we produce easier to label and find later on.

aaec <- read_csv(file = "AgeAdvantagesEmotionCovid_Data.csv")

To check whether the data has all been loaded correctly and if we have the correct number of participants, we used the nrow() function (from R base) to ascertain the number of rows in the data set.

nrow(aaec)
## [1] 945

Evidently, there are 945 participants which is the same as the final sample size reported in the paper! This means that all the excluded participants have already been removed by the researchers; though it would have been useful to have the raw data so we can verify who they were removing and whether this matches their reported exclusion criteria (further explained in Part 4 - Recommendations).

Sample Characteristics

Since the original dataset has 107 variable columns, it is very difficult to locate the variables needed for verification. As such, using the <- assignment operator from base R, we assign the aaec data to a new object (aaec_samplechar). We then pipe (an operator from dplyr) this into the select() function (also from dplyr) which lets us specify the relevant variable columns we want to include in the new object, making it easier to find and deal with each one later on.

aaec_samplechar <- aaec %>% 
  select(age, gender_bin_f,educ, emp_bin, livealone, income, race_bin)

Since the researchers reported many sample characteristics in a binary fashion (i.e. 75% of participants are White), it was helpful that they had already created binarised variable columns in the dataset. The binarised variables relevant to the sample characteristics are gender_bin_f, emp_bin, and race_bin, where participants were coded as either Female or Non-female, Working or Not_working, and White or Non-white respectively.

Sex Distribution

We used two methods to calculate the proportion of females in the sample, both using the general equation below, just in slightly different ways:
Number of females / Total number of participants * 100

Gender Attempt 1

We created a new dataset (aaec_femalecount) and used the filter() function (from dplyr) to only keep participants who identified as “Female” in this new dataset.

aaec_femalecount <- aaec_samplechar %>%
  filter(gender_bin_f == "Female")

We then used nrow() to ascertain the number of participants in this new filtered dataset and used simple mathematical symbols to divide it by the total number of participants in the entire sample (also using the nrow() function). Then we multiplied it all by 100 to get a proportion.

nrow(aaec_femalecount) / nrow(aaec_samplechar) *100
## [1] 49.20635

We got 49.2% which is exactly what was reported in the paper. This was very exciting as it was the first sample characteristic we were able to verify. However, we thought that there must be a simpler and cleaner way to do so without creating an entirely new dataset.

Gender Attempt 2 - A more straightforward method

The ifelse() function (from R base) is used to create a new column in aaec_samplechar labelled female, which gives all participants who identified as “Female” (in the gender_bin_f variable) a value of 1, and everyone who did not identify as “Female” is given a value of 0.

aaec_samplechar$female <- ifelse(aaec_samplechar$gender_bin_f == "Female", 1,0) 

We calculate the percentage of females by dividing the number of participants with a value of 1 in this “female” column (determined by using the sum() function from R base) by the combined number of participants with a value of 1 or 0. Then, in normal percentage fashion, multiplying this all by 100.

sum(aaec_samplechar$female == "1") / (sum(aaec_samplechar$female == "1") + sum(aaec_samplechar$female == "0")) *100
## [1] 49.20635

This also gives us 49.2%! It was really interesting to discover that there are often many different ways to approach things in R.

Employment status

Since the gender attempt 2 (creating a new column within the existing dataset) was so successful and much less convoluted than attempt 1 (creating a new dataset), we decided to use the same method for the employment status verification. Here we assigned “Working” participants the value of 1, and “Non-working” participants the value of 0, then again used the sum() function to calculate the proportion of working participants.

# Assigning value
aaec_samplechar$working <- ifelse(aaec_samplechar$emp_bin == "Working", 1,0)
# Calculating percent
sum(aaec_samplechar$working == "1") / (sum(aaec_samplechar$working == "1") + sum(aaec_samplechar$working == "0")) * 100
## [1] 56.19048

This is the same as reported in the paper.

Living alone

We did the exact same thing to determine the proportion of participants who live alone.

# Assigning values
aaec_samplechar$alone <- ifelse(aaec_samplechar$livealone == "Yes", 1,0)
# Calculating percent
sum(aaec_samplechar$alone == "1") / (sum(aaec_samplechar$alone == "1") + sum(aaec_samplechar$alone == "0")) * 100
## [1] 22.96296

22.9% rounds up to the 23% that was reported in the paper!

Race

Despite collecting data about which specific races participants were, the researchers binarised race into white and non-white in the race_bin variable. We again used the ifelse() function to give White participants a value of 1 and Non-white participants a value of 0.

aaec_samplechar$white <- ifelse(aaec_samplechar$race_bin == "White", 1,0)

When calculating the percentage of white participants in the sample using the sum() function, we had to make a slight change to our code. This change was to add in na.rm = TRUE within each of the sum() functions, to remove the participants who did not report their race.

sum(aaec_samplechar$white == "1", na.rm = TRUE) / (sum(aaec_samplechar$white == "1", na.rm = TRUE) + sum(aaec_samplechar$white == "0", na.rm = TRUE)) * 100
## [1] 75.69002

It is important to note that the paper reported that 75% of participants were White, however we found that 75.69% are white, which would have rounded up to 76%. Since the researchers have not shared their code, I can only make assumptions about what happened here. My two theories are that either this was a simple rounding error, or the researchers have included participants who did not report their race in the total number of participants in this calculation. If we use nrow() to count the total number of participants (rather than just including those who reported race), we get 75.45% which would round down to the reported 75%.

sum(aaec_samplechar$white == "1", na.rm = TRUE) / nrow(aaec_samplechar) * 100
## [1] 75.44974

I personally think it would be more accurate to not include those who did not report race in this calculation since we really do not know which race they are.

Education levels

We quickly realised that when the paper reports the proportion of participants who attended “at least some college”, they have combined the 3 following responses to this question in the survey:

  • Completed graduate or professional degree
  • Completed 4-year college (BA, BS)
  • Some college or technical school

Since there was no binary form of the education variable in the dataset, we used the same general method as the gender distribution attempt 1, for simplicity.
Firstly, we tried to filter the data to include participants with the three responses as above; However, this method did not work as we got the following error message:

After some googling, I found this link explaining how to filter data to more than one category. We found that using the fourth suggested method (%in%) worked.

aaec_filteredcollegecount <- aaec_samplechar %>% 
  filter(educ %in% c(
    "Completed graduate or professional degree",
    "Completed 4-year college (BA, BS)",
    "Some college or technical school"))

Initially, I wasn’t really sure why this worked, but I did some more googling and found that this line of code creates a vector for the education level categories that you want to include, and that the %in% can be used to filter multiple categories in the vector. Once this filtered table had been created, we were able to use the same mathematical symbols to calculate the percent of participants with at least some college.

nrow(aaec_filteredcollegecount) / nrow(aaec_samplechar) *100
## [1] 87.72487

We again got the same answer as reported in the paper!

Age

For age, the paper reports the range (minimum and maximum ages), along with mean and SD. The summarise() function (from dplyr), along with mean() and sd() functions (from R base and stats respectively), can be used to calculate these four values for the age variable.

aaec_samplechar %>% 
  summarise(mean_age = mean(age),
            SD = sd(age),
            min_age = min(age),
            max_age = max(age))
##   mean_age       SD min_age max_age
## 1 45.15238 16.78756      18      76

These all match the paper’s values!

Median income

The first thing I noticed when figuring out how to verify the median was that participants did not report a specific numeric income value, rather they indicated which range of incomes they fell into. Since the income data collected is categorical rather than numerical, we cannot simply use the median() function with the data as is.
A few participants did not disclose their income, so the first step is to remove them. Since we do not want to remove them entirely from the aaec_samplechar dataset, we assign the data to a new object. We then pipe this into filter(), where != is used to keep all of the participants whose income is not equal to “Decline to answer”.

aaec_filteredincome <- aaec_samplechar %>%
  filter(income != "Decline to answer")

To determine where the midpoint for the median is, we need to know many participants there are left; Thus, we use nrow() to count the number of rows in the aaec_filteredincome data set.

nrow(aaec_filteredincome)
## [1] 923

In the survey, there were 16 income categories to choose from, each with a label that is not easy to numerically order from lowest to highest.

As such, we used the mutate() and recode() functions (from dplyr) to change the income labels in the income variable to be a number from 01 to 16. We then piped this into the arrange() function (also from dplyr) to organise this column in ascending order.

aaec_newfilteredincome <- aaec_filteredincome %>% 
  mutate(income = recode(income, "Less than $10,000" = "01", "$10,000 to $20,000" = "02", "$20,000 to $30,000" = "03", "$30,000 to $40,000" = "04", "$40,000 to $50,000" = "05", "$50,000 to $60,000" = "06", "$60,000 to $80,000" = "07", "$80,000 to $100,000" = "08", "$100,000 to $120,000" = "09", "$120,000 to $140,000" = "10", "$140,000 to $160,000" = "11", "$160,000 to $180,000" = "12", "$180,000 to $200,000" = "13", "$200,000 to $220,000" = "14", "$220,000 to $250,000" = "15", "Greater than $250,000" = "16")) %>% 
  arrange(income)

The median() function (from the stats package) can now be used to determine which row (from 1 to 923) the middle income is located.

median(1:923)
## [1] 462

To find the exact point of the median income we look at the recoded value of the income column (the 6th column in the aaec_newfilteredincome dataset) and the 462nd row (the position in the data at which the median income value is located).

aaec_newfilteredincome[462, 6]
## # A tibble: 1 × 1
##   income
##   <chr> 
## 1 06

This gives us a recoded value of 06, which corresponds to the $50,000 to $60,000 income bracket, and matches the median income that was reported in the paper!

Summary Statistics - Frequency and Intensity of Positive and Negative Emotions

Means and SDs

The researchers had already created new variables for each participant and averaged their frequency and intensity ratings of each specific emotion. As such, we created a new table and used the select() function to just keep these four “average” variable columns.

aaec_summarystatistics <- aaec %>%
  select(avg_f_pos, avg_f_neg, avg_i_pos, avg_i_neg)

We first used a very repetitive method to individually calculate the means and SDs for these four variables (using na.rm=TRUE to ignore participants who did not rate the intensity of one or more negative emotions).

# Means
mean(aaec_summarystatistics$avg_f_pos)
## [1] 1.972079
mean(aaec_summarystatistics$avg_f_neg)
## [1] 1.41762
mean(aaec_summarystatistics$avg_i_pos)
## [1] 1.929739
mean(aaec_summarystatistics$avg_i_neg, na.rm=TRUE)
## [1] 1.771569
# Standard deviations
sd(aaec_summarystatistics$avg_f_pos)
## [1] 0.5627299
sd(aaec_summarystatistics$avg_f_neg)
## [1] 0.6613335
sd(aaec_summarystatistics$avg_i_pos)
## [1] 0.5880139
sd(aaec_summarystatistics$avg_i_neg, na.rm=TRUE)
## [1] 0.6998724

Even though this gave us the correct values reported in the paper (except for mean intensity of positive emotions which I will come back to later), we wanted to make this code more concise.
Firstly, we used the colMean() function (from R base) which calculates the mean for each of the four columns in the aaec_summarystatistics table.

colMeans(aaec_summarystatistics, na.rm = TRUE)
## avg_f_pos avg_f_neg avg_i_pos avg_i_neg 
##  1.972079  1.417620  1.929739  1.771569

It is so nice that we only needed one line of code to get these four values, and it labels each of them as well! The paper reports that the average of the intensity of the positive emotions is 1.92, however we got 1.929 which would round up to 1.93. Since all of the other values match what was reported, I believe the researchers made a rounding error.

Next, to calculate the SDs we pipe the dataframe into the summarise_if() function (from dplyr). The summarise_if() function only performs it’s actions on the values in the table if they meet the specified conditions. Thus, this following line of code will calculate the SD for the numeric columns, ignoring the NAs.

aaec_summarystatistics %>% 
  summarise_if(is.numeric, sd, na.rm = TRUE)
## # A tibble: 1 × 4
##   avg_f_pos avg_f_neg avg_i_pos avg_i_neg
##       <dbl>     <dbl>     <dbl>     <dbl>
## 1     0.563     0.661     0.588     0.700

These SDs match what was reported in the paper. How neat is this function!

Mean Difference for Emotion Frequency

The researchers have reported the mean difference between the frequency of positive and negative emotions as per the following:

“t(944) = 15.41, p < .001, d = 0.53, 95% confidence interval (CI) for the mean difference = [0.48, 0.63]”

We found that the t.test() function (from the stats package) not only gives the degrees of freedom and t-statistic, but it also produces confidence intervals. Within this function we have specified which variables we want to calculate the mean difference between (avg_f_neg and avg_f_pos from the aaec dataset) and that we want it to be a paired samples t-test (because the data for these variables comes from the same participants).

t.test(aaec$avg_f_neg, aaec$avg_f_pos, paired = TRUE) 
## 
##  Paired t-test
## 
## data:  aaec$avg_f_neg and aaec$avg_f_pos
## t = -15.409, df = 944, p-value < 2.2e-16
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
##  -0.6250760 -0.4838417
## sample estimates:
## mean difference 
##      -0.5544588

The confidence intervals (0.48, 0.63), degrees of freedom (944), and t-statistic (-15.41) match the paper.

Mean Difference for Emotion Intensity

We then repeated this same process for the mean difference between the intensity of positive and negative emotions.
This mean difference was reported as “t(941) = 4.48, p < .001, d = 0.15, 95% CI for the mean difference = [0.09, 0.22]”

t.test(aaec$avg_i_pos, aaec$avg_i_neg, paired = TRUE)
## 
##  Paired t-test
## 
## data:  aaec$avg_i_pos and aaec$avg_i_neg
## t = 4.4788, df = 941, p-value = 8.43e-06
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
##  0.08571862 0.21942630
## sample estimates:
## mean difference 
##       0.1525725

The confidence intervals (0.09, 0.22), degrees of freedom (941), and t-statistic (4.48) also match the paper.

Summary Statistics - Measures

As done previously with the average frequency and intensity of positive and negative emotions, we first took the long and tedious path of using the mean() and sd() functions for each of the following variables: FTP_av, FTP_Ext, FTP_Opp, FTP_Con, risk_self, risk_comp, risk_pop, emp_affect, health, Extraversion, Conscientiousness.
To calculate the means and SDs much more efficiently, I used the colMeans() and summarise_if() functions.
Firstly, I create a new object and use select() to keep the relevant variables.

aaec_measures <- aaec %>%
  select(FTP_av, FTP_Ext, FTP_Opp, FTP_Con, risk_self, risk_comp, risk_pop, emp_affect, health, Extraversion, Conscientiousness)

Then, I calculate the means for the variables in this new dataset using colMeans().

colMeans(aaec_measures, na.rm = TRUE)
##            FTP_av           FTP_Ext           FTP_Opp           FTP_Con 
##          4.094168          4.018028          4.727466          3.740191 
##         risk_self         risk_comp          risk_pop        emp_affect 
##          2.362963          2.447619          2.965079          1.412698 
##            health      Extraversion Conscientiousness 
##          3.357672          3.440212          5.224868

Except for mean conscientiousness (which the paper incorrectly rounded to 5.23 instead of 5.22), all of these values match what was reported.

Since there are so many variables that I want to find the SDs of, using summarise_if() produces really unclear output and cuts of a few of the values. As such, I use the kable() function (from knitr) to display these SDs in a much nicer table.

measures_sd_table <- aaec_measures %>% 
  summarise_if(is.numeric, sd, na.rm = TRUE)

knitr::kable(measures_sd_table)
FTP_av FTP_Ext FTP_Opp FTP_Con risk_self risk_comp risk_pop emp_affect health Extraversion Conscientiousness
1.352868 1.906157 1.618943 1.755594 1.113526 1.343399 0.9387196 1.45521 0.9856528 1.64625 1.327658

These all match what was reported!

Table 1 - Means, standard deviations, and 95% confidence intervals for the frequencies of emotions

Table 1 has the following columns; Valence and emotion (list of all surveyed emotions), M (mean frequency of each emotion), SD (standard deviation for the frequency of each emotion), and 95% CI (upper and lower limits of the confidence intervals). In reproducing this table, the OSF codebook was helpful as it specifies what each variable name is and made it easy to find each emotion’s frequency ratings in the data.
There are 7 main steps to reproduce this table which I will break down.

1. Calculate the means and SDs of each individual emotion

We used the mean() function, ignoring all cases where participants didn’t rate an emotion by using na.rm = TRUE, to calculate the mean frequency rating of each emotion across all participants. To be able to combine these all into a dataframe later on, we assign each emotion mean to its own object using the <- assignment operator.

# Mean of positive emotions
mean_calm <- mean(aaec$f_calm, na.rm = TRUE)
mean_quiet <- mean(aaec$f_qui, na.rm = TRUE)
mean_appreciative <- mean(aaec$f_app, na.rm = TRUE)
mean_interested <- mean(aaec$f_int, na.rm = TRUE)
mean_content <- mean(aaec$f_cont, na.rm = TRUE)
mean_happy <- mean(aaec$f_hap, na.rm = TRUE)
mean_relaxed <- mean(aaec$f_rela, na.rm = TRUE)
mean_peaceful <- mean(aaec$f_pea, na.rm = TRUE)
mean_energetic <- mean(aaec$f_ener, na.rm = TRUE)
mean_affectionate <- mean(aaec$f_aff, na.rm = TRUE)
mean_amused <- mean(aaec$f_amu, na.rm = TRUE)
mean_accomplished <- mean(aaec$f_acc, na.rm = TRUE)
mean_joyful <- mean(aaec$f_joy, na.rm = TRUE)
mean_proud <- mean(aaec$f_pro, na.rm = TRUE)
mean_relieved <- mean(aaec$f_reli, na.rm = TRUE)
mean_excited <- mean(aaec$f_exc, na.rm = TRUE)

# Mean of negative emotions
mean_concerned <- mean(aaec$f_conc, na.rm = TRUE)
mean_anxious <- mean(aaec$f_anx, na.rm = TRUE)
mean_bored <- mean(aaec$f_bor, na.rm = TRUE)
mean_frustrated <- mean(aaec$f_fru, na.rm = TRUE)
mean_irritated <- mean(aaec$f_irr, na.rm = TRUE)
mean_sad <- mean(aaec$f_sad, na.rm = TRUE)
mean_lonely <- mean(aaec$f_lon, na.rm = TRUE)
mean_fearful <- mean(aaec$f_fear, na.rm = TRUE)
mean_angry <- mean(aaec$f_ang, na.rm = TRUE)
mean_disgusted <- mean(aaec$f_dis, na.rm = TRUE)
mean_guilty <- mean(aaec$f_gui, na.rm = TRUE)
mean_embarassed <- mean(aaec$f_emb, na.rm = TRUE)
mean_ashamed <- mean(aaec$f_sha, na.rm = TRUE)

We then do the exact same thing for standard deviation using the sd() function.

# SD of positive emotions
sd_calm <- sd(aaec$f_calm, na.rm = TRUE)
sd_quiet <- sd(aaec$f_qui, na.rm = TRUE)
sd_appreciative <- sd(aaec$f_app, na.rm = TRUE)
sd_interested <- sd(aaec$f_int, na.rm = TRUE)
sd_content <- sd(aaec$f_cont, na.rm = TRUE)
sd_happy <- sd(aaec$f_hap, na.rm = TRUE)
sd_relaxed <- sd(aaec$f_rela, na.rm = TRUE)
sd_peaceful <- sd(aaec$f_pea, na.rm = TRUE)
sd_energetic <- sd(aaec$f_ener, na.rm = TRUE)
sd_affectionate <- sd(aaec$f_aff, na.rm = TRUE)
sd_amused <- sd(aaec$f_amu, na.rm = TRUE)
sd_accomplished <- sd(aaec$f_acc, na.rm = TRUE)
sd_joyful <- sd(aaec$f_joy, na.rm = TRUE)
sd_proud <- sd(aaec$f_pro, na.rm = TRUE)
sd_relieved <- sd(aaec$f_reli, na.rm = TRUE)
sd_excited <- sd(aaec$f_exc, na.rm = TRUE)

# SD of negative emotions
sd_concerned <- sd(aaec$f_conc, na.rm = TRUE)
sd_anxious <- sd(aaec$f_anx, na.rm = TRUE)
sd_bored <- sd(aaec$f_bor, na.rm = TRUE)
sd_frustrated <- sd(aaec$f_fru, na.rm = TRUE)
sd_irritated <- sd(aaec$f_irr, na.rm = TRUE)
sd_sad <- sd(aaec$f_sad, na.rm = TRUE)
sd_lonely <- sd(aaec$f_lon, na.rm = TRUE)
sd_fearful <- sd(aaec$f_fear, na.rm = TRUE)
sd_angry <- sd(aaec$f_ang, na.rm = TRUE)
sd_disgusted <- sd(aaec$f_dis, na.rm = TRUE)
sd_guilty <- sd(aaec$f_gui, na.rm = TRUE)
sd_embarassed <- sd(aaec$f_emb, na.rm = TRUE)
sd_ashamed <- sd(aaec$f_sha, na.rm = TRUE)

2. Define the z-score and sample size (n)

We use qnorm() from the stats package to produce the z-score of 1.96. Note that 0.975 is the alpha level divided by 2. Additionally, so we don’t have to keep repeating 945 (the number of participants which we previously verified) in later calculations, we assigned it to n.

z_score <- qnorm(0.975)
n <- 945

3. Calculate standard error for each individual emotion

The equation for standard error is as follows:

Just like the means and SDs, we assigned each emotion standard error to its own term. Since we had already calculated the SDs of each emotion and defined n, we used these values in each standard error equation.

# Standard error for positive emotions
se_calm <- sd_calm/sqrt(n)
se_quiet <- sd_quiet/sqrt(n)
se_appreciative <- sd_appreciative/sqrt(n) 
se_interested <- sd_interested/sqrt(n)
se_content <- sd_content/sqrt(n)
se_happy <- sd_happy/sqrt(n)
se_relaxed <- sd_relaxed/sqrt(n)
se_peaceful <- sd_peaceful/sqrt(n)
se_energetic <- sd_energetic/sqrt(n)
se_affectionate <- sd_affectionate/sqrt(n)
se_amused <- sd_amused/sqrt(n)
se_accomplished <- sd_accomplished/sqrt(n)
se_joyful <- sd_joyful/sqrt(n)
se_proud <- sd_proud/sqrt(n)
se_relieved <- sd_relieved/sqrt(n)
se_excited <- sd_excited/sqrt(n)
  
# Standard error for negative emotions
se_concerned <- sd_concerned/sqrt(n)  
se_anxious <- sd_anxious/sqrt(n) 
se_bored <- sd_bored/sqrt(n)  
se_frustrated <- sd_frustrated/sqrt(n)  
se_irritated <- sd_irritated/sqrt(n)
se_sad <- sd_sad/sqrt(n)
se_lonely <- sd_lonely/sqrt(n)
se_fearful <- sd_fearful/sqrt(n)  
se_angry <- sd_angry/sqrt(n)  
se_disgusted <- sd_disgusted/sqrt(n)  
se_guilty <- sd_guilty/sqrt(n)  
se_embarassed <- sd_embarassed/sqrt(n) 
se_ashamed <- sd_ashamed/sqrt(n)

4. Calculate the upper and lower confidence interval limits for each individual emotion

Upper and lower confidence interval limits are calculated by the following:

  • Upper limit = mean + (z-score * standard error)
  • Lower limit = mean - (z-score * standard error)

Using mathematical symbols and the values we have previously determined (mean, z-score, and standard error), we calculated the upper and lower confidence interval limits for each emotion. We also assigned each value to its own object.

# Lower limits of positive emotions
ll_calm <- mean_calm - (z_score*se_calm)
ll_quiet <- mean_quiet - (z_score*se_quiet)
ll_appreciative <- mean_appreciative - (z_score*se_appreciative)
ll_interested <- mean_interested - (z_score*se_interested)
ll_content <- mean_content - (z_score*se_content)
ll_happy <- mean_happy - (z_score*se_happy)
ll_relaxed <- mean_relaxed - (z_score*se_relaxed)
ll_peaceful <- mean_peaceful - (z_score*se_peaceful)
ll_energetic <- mean_energetic - (z_score*se_energetic)
ll_affectionate <- mean_affectionate - (z_score*se_affectionate)
ll_amused <- mean_amused - (z_score*se_amused)
ll_accomplished <- mean_accomplished - (z_score*se_accomplished)
ll_joyful <- mean_joyful - (z_score*se_joyful)
ll_proud <- mean_proud - (z_score*se_proud)
ll_relieved <- mean_relieved - (z_score*se_relieved)
ll_excited <- mean_excited - (z_score*se_excited)

# Lower limits of negative emotions
ll_concerned <- mean_concerned - (z_score*se_concerned)
ll_anxious <- mean_anxious - (z_score*se_anxious)
ll_bored <- mean_bored - (z_score*se_bored)
ll_frustrated <- mean_frustrated - (z_score*se_frustrated)
ll_irritated <- mean_irritated - (z_score*se_irritated)
ll_sad <- mean_sad - (z_score*se_sad)
ll_lonely <- mean_lonely - (z_score*se_lonely)
ll_fearful <- mean_fearful - (z_score*se_calm)
ll_angry <- mean_angry - (z_score*se_angry)
ll_disgusted <- mean_disgusted - (z_score*se_disgusted)
ll_guilty <- mean_guilty - (z_score*se_guilty)
ll_embarassed <- mean_embarassed - (z_score*se_embarassed)
ll_ashamed <- mean_ashamed - (z_score*se_ashamed)

# Upper limits of positive emotions
ul_calm <- mean_calm + (z_score*se_calm)
ul_quiet <- mean_quiet + (z_score*se_quiet)
ul_appreciative <- mean_appreciative + (z_score*se_appreciative)
ul_interested <- mean_interested + (z_score*se_interested)
ul_content <- mean_content + (z_score*se_content)
ul_happy <- mean_happy + (z_score*se_happy)
ul_relaxed <- mean_relaxed + (z_score*se_relaxed)
ul_peaceful <- mean_peaceful + (z_score*se_peaceful)
ul_energetic <- mean_energetic + (z_score*se_energetic)
ul_affectionate <- mean_affectionate + (z_score*se_affectionate)
ul_amused <- mean_amused + (z_score*se_amused)
ul_accomplished <- mean_accomplished + (z_score*se_accomplished)
ul_joyful <- mean_joyful + (z_score*se_joyful)
ul_proud <- mean_proud + (z_score*se_proud)
ul_relieved <- mean_relieved + (z_score*se_relieved)
ul_excited <- mean_excited + (z_score*se_excited)

# Upper limits of negative emotions
ul_concerned <- mean_concerned + (z_score*se_concerned)
ul_anxious <- mean_anxious + (z_score*se_anxious)
ul_bored <- mean_bored + (z_score*se_bored)
ul_frustrated <- mean_frustrated + (z_score*se_frustrated)
ul_irritated <- mean_irritated + (z_score*se_irritated)
ul_sad <- mean_sad + (z_score*se_sad)
ul_lonely <- mean_lonely + (z_score*se_lonely)
ul_fearful <- mean_fearful + (z_score*se_calm)
ul_angry <- mean_angry + (z_score*se_angry)
ul_disgusted <- mean_disgusted + (z_score*se_disgusted)
ul_guilty <- mean_guilty + (z_score*se_guilty)
ul_embarassed <- mean_embarassed + (z_score*se_embarassed)
ul_ashamed <- mean_ashamed + (z_score*se_ashamed)

5. Creating each column and rounding values

Now that we have determined all of the values that are to go into the table, we need to actually insert them by creating a vector for each column.
The values in the first column are just the emotion names:

# Column 1 - Emotions
Valence_and_emotion <- c("Positive emotions", "Calm", "Quiet", "Appreciative", "Interested", "Content", "Happy", "Relaxed", "Peaceful", "Energetic", "Affectionate", "Amused", "Accomplished", "Joyful", "Proud", "Relieved", "Excited", "Negative emotions", "Concerned", "Anxious/worried", "Bored", "Frustrated", "Irritated", "Sad", "Lonely", "Fearful", "Angry", "Disgusted", "Guilty", "Embarassed", "Ashamed")

We used the same method to insert columns 2 through 5 (mean, SD, lower limit, and upper limit). It is important to note that Table 1 has two rows (“Positive emotions” and “Negative emotions”) which act as subheaders, and as such do not have any numerical values in the 4 columns. To account for this, we have used empty quotation marks (“ ”) so that these spaces are left blank in our table.

# Column 2 - Mean
Mean <- c(" ", mean_calm, mean_quiet, mean_appreciative, mean_interested, mean_content, mean_happy, mean_relaxed, mean_peaceful, mean_energetic, mean_affectionate, mean_amused, mean_accomplished, mean_joyful, mean_proud, mean_relieved, mean_excited, " ", mean_concerned, mean_anxious, mean_bored, mean_frustrated, mean_irritated, mean_sad, mean_lonely, mean_fearful, mean_angry, mean_disgusted, mean_guilty, mean_embarassed, mean_ashamed)

# Column 3 - SD
Standard_Deviation <- c(" ", sd_calm, sd_quiet, sd_appreciative, sd_interested, sd_content, sd_happy, sd_relaxed, sd_peaceful, sd_energetic, sd_affectionate, sd_amused, sd_accomplished, sd_joyful, sd_proud, sd_relieved, sd_excited, " ", sd_concerned, sd_anxious, sd_bored, sd_frustrated, sd_irritated, sd_sad, sd_lonely, sd_fearful, sd_angry, sd_disgusted, sd_guilty, sd_embarassed, sd_ashamed)

# Column 4 - 95% CI Lower limit
Lower_Limit <- c(" ", ll_calm, ll_quiet, ll_appreciative, ll_interested, ll_content, ll_happy, ll_relaxed, ll_peaceful, ll_energetic, ll_affectionate, ll_amused, ll_accomplished, ll_joyful, ll_proud, ll_relieved, ll_excited, " ", ll_concerned, ll_anxious, ll_bored, ll_frustrated, ll_irritated, ll_sad, ll_lonely, ll_fearful, ll_angry, ll_disgusted, ll_guilty, ll_embarassed, ll_ashamed)

# Column 5 - 95% CI Upper limit
Upper_Limit <- c(" ", ul_calm, ul_quiet, ul_appreciative, ul_interested, ul_content, ul_happy, ul_relaxed, ul_peaceful, ul_energetic, ul_affectionate, ul_amused, ul_accomplished, ul_joyful, ul_proud, ul_relieved, ul_excited, " ", ul_concerned, ul_anxious, ul_bored, ul_frustrated, ul_irritated, ul_sad, ul_lonely, ul_fearful, ul_angry, ul_disgusted, ul_guilty, ul_embarassed, ul_ashamed)

When we combined these column vectors into a table as is (I will explain the process in step 7), it produced a table where all of the values had 14 decimal places.

6. Rounding

To match Table 1 as reported in the paper, we needed to round everything to 2 decimal places. We thought this would be an easy task, but it showed us how time consuming learning R can be.
To do this, we used the as.numeric() function (from R base) to convert the columns from being categorical to numerical. Then we used the round() function (from R base) to select the column and how many decimal places we wanted the values to be rounded to, and simultaneously used the as.character() function (also from R base) to change these back into categorical columns such that the subheader rows (“Positive emotions” and “Negative emotions”) are given the value N/A.

Mean <- as.numeric(Mean)
Mean <- as.character(round(Mean,2))

Standard_Deviation <- as.numeric(Standard_Deviation)
Standard_Deviation <- as.character(round(Standard_Deviation,2))

Lower_Limit <- as.numeric(Lower_Limit)
Lower_Limit <- as.character(round(Lower_Limit,2))

Upper_Limit <- as.numeric(Upper_Limit)
Upper_Limit <- as.character(round(Upper_Limit,2))

7. Generating the final version of the table

Once these values were rounded, we combined them into a table using the data.frame() function, from R base.

Table1_MeanFreq_Emotions <- data.frame(Valence_and_emotion, Mean, Standard_Deviation, Lower_Limit, Upper_Limit)

This, very excitingly, produced a table that matches the original paper. I have included a screenshot below because the print() function does not do the table justice.

Figure 1 - Attempt 1

Surprisingly, our first attempt at re-creating Figure 1 was much less complicated than anticipated. This is because, once we had figured out the code for the first plot, we could copy it and just change the relevant parts to suit the other three. Additionally, having our group work together to Google solutions came in really handy when troubleshooting, because everyone had such a unique take on the issues.

Here is the code for the average frequency of negative emotions plot within Figure 1:

Freq_Neg_Plot <- ggplot(data = aaec, aes(x = aaec$age, y = aaec$avg_f_neg)) +
  geom_point(alpha = 0.1) +
  geom_smooth(method = "lm", color = "black") +
  labs(title = "Negative Emotions",
       x = element_blank(),
       y = "Frequency of 
    Emotional Experience") +
  theme_bw() +
  theme(plot.title = element_text(face = "bold", size = 10, hjust = 0.5), axis.title = element_text(face = "bold", size = 10), axis.text.x = element_blank(), axis.ticks.x = element_blank(), plot.background = element_blank(), panel.grid.minor = element_blank(), panel.grid.major = element_blank()) +
  coord_cartesian(ylim = c(0, 2.5))

To thoroughly explain each part of our code for this figure, I will describe what each function (all from the ggplot2 package) is accomplishing and what specifications we have made. I found this link incredibly useful in understanding how to make adjustments to the plot.

ggplot() → This is used to produce each of the four plots. We are storing each plot in their own object (using <-) so that we can combine the four later on using the Patchwork package.

  • Within this function, we specify that it should source the data from the “aaec” dataframe, and aes() indicates the features of the plot. More specifically, that age should be on the x-axis and avg_f_neg should be on the y-axis.

geom_point() → Since the figure is a scatter plot, geom_point() is used to add the data points.

  • We set alpha at 0.1 to make the points more transparent (which likely helps viewers see where points tend to cluster, as overlapping areas become darker).

geom_smooth() → This function inserts a line of best fit into the plot.

  • We specified the method as “lm” to make the line linear, and made the line colour to be “black”.

labs() → This adds a title and axis labels to the plot.

  • For the y-axis, we separated “Frequency of” and “Emotional Experience” into two lines of code to ensure the label fits. Since there is no x-axis label, this was left empty by using element_blank().

theme_bw() → This changes the background of the plot from grey to white

  • We initially tried theme_minimal() but found that this also removed the figure’s borders.

theme() → This helps the plot visually match Figure 1 characteristics. Within this function, we also used element_text() to alter text features, and element_blank() to leave spaces empty as explained below.

  • plot.title and axis.title → we made the main and axis titles bold and size 10, and also moved the main title to the middle using “hjust = 0.5”.
  • axis.text.x → used element_blank() so that the numbers on the x-axis would not show.
  • axis.ticks.x → used element_blank() to remove the x-axis ticks.
  • plot.background → to remove the background outside of the plot.
  • panel.grid.major and panel.grid.minor → removes the major and minor gridlines.
  • We originally also had axis.line (but realised that we had already made the regression line black in geom_smooth).

coord_cartesian() → Ensures that the y-axis only goes to 2.5.

  • We initially used scale_y_continuous(breaks = seq(from = 0, to = 2.5, by = 0.5)), but this left all of the data above 2.5 (which is not what we wanted since the original paper cut everything off above 2.5).
print(Freq_Neg_Plot)

First plot looks good!

The remaining three plots

For the other three plots, the same functions were used, just with slight changes to some of the aesthetics to match Figure 1.

  • For the frequency of positive emotions we have included axis.text.y and axis.ticks.y which serve the same purpose axis.text.x and axis.ticks.x, but for the y-axis.
  • For the intensity of negative emotions in labs() we removed the title and added an x-axis label and in theme() we no longer use axis.text.x and axis.ticks.x to remove the numbers and ticks from the x-axis (since these are present for this plot in Figure 1). In-order to get the x-axis label to be between the left and right plots, we just kept adding spaces before the “Age (Years)” text to move it to the right. This is not the cleanest code which is why we used a different method in our second attempt (explained later).
  • For the intensity of positive emotions we made similar changes to account for the fact that there is no title, y-axis label, y-axis numbers, or y-axis ticks.
# Frequency of Positive
Freq_Pos_Plot <- ggplot(data = aaec, aes(x = aaec$age, y = aaec$avg_f_pos)) +
  geom_point(alpha = 0.1) +
  geom_smooth(method = "lm", color = "black") +
  labs(title = "Positive Emotions",
       x = element_blank(),
       y = element_blank()) +
  theme_bw() + 
  theme(plot.title = element_text(face = "bold", size = 10, hjust = 0.5), axis.text.x = element_blank(), axis.ticks.x = element_blank(), axis.text.y = element_blank(), axis.ticks.y = element_blank(), plot.background = element_blank(), panel.grid.minor = element_blank(), panel.grid.major = element_blank()) +
  coord_cartesian(ylim = c(0, 2.5))

# Intensity of Negative
Int_Neg_Plot <- ggplot(data = aaec, aes(x = aaec$age, y = aaec$avg_i_neg)) +
  geom_point(alpha = 0.1) +
  geom_smooth(method = "lm", color = "black") +
  labs(title = element_blank(),
       x = "                                                                                      Age (Years)",
       y = "Intensity of 
      Emotional Experience") +
  theme_bw() + 
  theme(axis.title = element_text(face = "bold", size = 10), panel.grid.minor = element_blank(), panel.grid.major = element_blank()) +
  theme(plot.title = element_text(hjust = 0.5)) +
  coord_cartesian(ylim = c(0, 2.5))

# Intensity of Positive
Int_Pos_Plot <- ggplot(data = aaec, aes(x = aaec$age, y = aaec$avg_i_pos)) +
  geom_point(alpha = 0.1) +
  geom_smooth(method = "lm", color = "black") +
  labs(title = element_blank(),
       x = element_blank(),
       y = element_blank()) +
  theme_bw() + 
  theme(axis.text.y = element_blank(), axis.ticks.y = element_blank(), plot.background = element_blank(), panel.grid.minor = element_blank(), panel.grid.major = element_blank()) +
  theme(plot.title = element_text(hjust = 0.5)) +
  coord_cartesian(ylim = c(0, 2.5))

Though watching videos about how to code plots has been helpful, actually getting to apply these concepts and practice them for ourselves has been paramount to my journey of learning and understanding how to code in R.

Combining all four plots

By loading the Patchwork package we easily combined the four plots. We chose the order in the following code, where “+” was used to indicate which plots go on the left/right, and “/” was used to indicate which plots go on the top/bottom.

(Freq_Neg_Plot+Freq_Pos_Plot) / (Int_Neg_Plot+Int_Pos_Plot)

This looks almost exactly the same as Figure 1 in the paper!

Figure 1 - Attempt 2

As mentioned previously, adding a lot of spaces before the “Age (Years)” text in the intensity of negative emotions plot seems like a mildly questionable way to get the x-axis label in the middle. Additionally, this code is really repetitive and long since we are using almost the same code four times. To remedy this, we tried using the facet_grid() function from ggplot2.

Step 1

First we assigned the aaec data to a new object, piping to the select() function to keep the relevant variables (which we used starts_with(“avg”) to select the four averages). In hindsight, we could have used this from the start since it is a highly efficient way to select variables; However, writing out each of the four may be better for code readability since one doesn’t have to search for the variables when stated plainly. We then pipe this into pivot_longer() (from Tidyr in tidyverse) which allows us to move the values from the four “avg” columns into one column, such that each participant will have four rows. These are the four specifications that we are making within the pivot_longer() function:

  • cols → we create a vector with the four variables that we want to combine into one column (avg_f_pos, avg_f_neg, avg_i_pos, avg_i_neg)
  • names_to → moves the four column names into one column called “Variable_Type”
  • values_to → moves the values (average frequency and intensity ratings) into one column called “Score”
  • values_drop_na → removes all NAs

Then we pipe this into mutate(), which we use to create two new variable columns called measure_type and valence_type.
Using case_when() from dplyr and grepl() from R base:

  • Within the measure_type column, anything in the “Variable_Type” column with an “f” or an “i” will be assigned the value of “Frequency” or “Intensity” respectively.
  • Within the valence_type column, anything in the “Variable_Type” column with “pos” or “neg” will receive the value of “Positive” or “Negative” respectively.

As explained later on, doing this is important so facet_grid() can create four separate plots based on the four possible combinations of measure_type and valence_type (frequency of positive, frequency of negative, intensity of positive, and intensity of negative).

aaec_figure1_variables <- aaec %>%
  select(age, starts_with("avg")) %>%
  pivot_longer(
    cols = c(avg_f_pos, avg_f_neg, avg_i_pos, avg_i_neg),
    names_to = "Variable_Type",
    values_to = "Score",
    values_drop_na = TRUE) %>%
  mutate(
    measure_type = case_when(
      grepl("f", Variable_Type) ~ "Frequency",
      grepl("i", Variable_Type) ~ "Intensity"), 
    valence_type = case_when(
      grepl("pos", Variable_Type) ~ "Positive",
      grepl("neg", Variable_Type) ~ "Negative"))

Step 2

Next we created two vectors of the words we want in the y-axis label and assigned each to their respective frequency/intensity values in the measure_type and positive/negative values in the valence_type columns.
To replicate the y-axis label formatting in Figure 1, we used \n to split each into two rows instead of one long line (as seen in the y-axis of the figure that we produce below in step 3).

labelsM <- c(Frequency = "Frequency of\nEmotional Experience", Intensity = "Intensity of\nEmotional Experience")
labelsV <- c(Negative = "Negative Emotions", Positive = "Positive Emotions")

Step 3

A significant portion of this code is quite similar to our first attempt: Specifications within geom_point(), geom_smooth(), theme_bw(), panel.grid.minor, panel.grid.major, and coord_cartesian() have not changed.

Change A) Since we used pivot_longer(), the y-axis in the ggplot() function is now “Score” instead of each individual average variable (e.g. avg_f_pos).
Change B) We are now using facet_grid() to create the four plots from our pivoted dataframe. vars(), from dplyr, is used to specify that we want data for the rows to be split by Frequency and Intensity from the measure_type variable, and for the columns to be split by Positive and Negative from the valence_type variable. labeller() is used to give the rows the labels from the previously produced “labelsM” and give the columns labels from “labelsV”. These labels are automatically placed on the right hand side of the plots, so we move it to the left through switch = “y” (where y stands for yes, thus requesting the labels to switch sides).
Change C) Within theme(), strip.placement is used to move the y-axis label to the left of the y-axis numbers (outside rather than between these numbers and the plot) and strip.background is used to remove the grey boxes surrounding the titles and y-axis labels (as shown in the screenshot below).

aaec_figure1 <- ggplot(data = aaec_figure1_variables, aes(x = age, y = Score)) +
  geom_point(alpha = 0.1) +
  geom_smooth(method = "lm",
              color = "black") +
  facet_grid(rows = vars(measure_type),
             cols = vars(valence_type),
             labeller = labeller(measure_type = labelsM, valence_type = labelsV),
             switch = "y") +
  theme_bw() +
  theme(strip.placement = "outside",
        strip.background = element_blank(),
        panel.grid.minor = element_blank(),
        panel.grid.major = element_blank()) +
  coord_cartesian(ylim = c(0, 2.5)) +
labs(title = element_blank(),
       x = "Age (Years)",
       y = element_blank())
print(aaec_figure1)

It worked!

Part 3 - Exploration

Exploration 1 - Do the frequency and intensity of positive and negative emotions in the pandemic differ across races?

In one of my reactions to the paper, I mentioned my concern with the researchers binarising many sample characteristic variables despite collecting more specific information. Race particularly stood out to me because previous research has indicated diverging pandemic effects across races; Therefore, I am curious whether this is also true of frequency and intensity of emotions.
There are 8 races present in the race_all variable but only two (White and Non-white) in race_bin.

1. Data Wrangling

Firstly, since it is often said that you need at least 30 participants per condition to reveal differences between conditions, it is important to determine which race groups meet this criteria. To determine the number of participants per race, I pipe the aaec dataframe into the count() function, from the plyr package. This was quite humbling as I initially forgot to put quotation marks around the race_all variable and kept getting an error, which made me realise how specific you need to be when inputting commands into R. After figuring this out, the count() function produced a table with the frequency of each race type.

aaec %>% count("race_all")
##                          race_all freq
## 1         Asian or Asian-American   78
## 2       Black or African-American   55
## 3              Hispanic or Latinx   49
## 4 Middle Eastern or North African    2
## 5              more than one race   41
## 6                 Native American    4
## 7               prefer not to say    3
## 8      White or European-American  713

Evidently, “Middle Eastern or North African”, “Native American”, and “prefer not to say” all have fewer than 30 participant’s identifying as those races. Thus, I need to remove these participants by assigning the data to a new object and using filter() as previously explained. Additionally, since some participants did not have an average intensity of negative emotions value, I also removed them using ! and the is.na() function (from R base) within filter() to locate and remove those without an avg_i_neg value.
I then pipe this into select() to just choose the variables I want to keep (race_all, avg_f_pos, avg_f_neg, avg_i_pos, avg_i_neg). I finally pipe into the mutate() and recode() functions to change the label for the “more than one race” group (because the lack of capitalisation looked bad in the later plots).

aaec_race <- aaec %>% 
  filter(!is.na(avg_i_neg), race_all != "Middle Eastern or North African", race_all != "Native American", race_all != "prefer not to say") %>%
  select(race_all, avg_f_pos, avg_f_neg, avg_i_pos, avg_i_neg) %>%
  mutate(race_all = recode(race_all, "more than one race" = "More than one race"))

2. Descriptive Statistics

Since I wanted to make four separate graphs of the means later on, I assign the aaec_race data to four new objects. I use group_by() (from dplyr) to split participants into the different races and pipe this into summarise() (also from dplyr) to produce means and SDs for each race (as previously explained when verifying age of the sample).
Danielle’s coding modules taught me these functions and I was inputting the code exactly like she had it (as in the screenshot below):

However, it produced a table that did not separate the means and SDs by race, rather it calculated a grand mean:

After a bit of Googling, and looking at this webpage from one of my favourite R coding youtubers, I discovered that that the issue was because R was using the summarise() function from the plyr package rather than dplyr. Therefore, to ensure R knows to use the function from dplyr you just have to include dplyr:: before summarise().

aaec_fpos_race <- aaec_race %>%
  group_by(race_all) %>%
  dplyr::summarise(mean_fpos = mean(avg_f_pos), sd_fpos = sd(avg_f_pos))

Here is what that new table looks like using the print() function from R base:

print(aaec_fpos_race)
## # A tibble: 5 × 3
##   race_all                   mean_fpos sd_fpos
##   <chr>                          <dbl>   <dbl>
## 1 Asian or Asian-American         1.93   0.556
## 2 Black or African-American       2.05   0.600
## 3 Hispanic or Latinx              1.90   0.584
## 4 More than one race              1.91   0.546
## 5 White or European-American      1.97   0.547

Then I do the exact same thing for the avg_f_neg, avg_i_pos, and avg_i_neg variables.

# Frequency of negative emotions
aaec_fneg_race <- aaec_race %>%
  group_by(race_all) %>%
  dplyr::summarise(mean_fneg = mean(avg_f_neg), sd_fneg = sd(avg_f_neg))
print(aaec_fneg_race)
## # A tibble: 5 × 3
##   race_all                   mean_fneg sd_fneg
##   <chr>                          <dbl>   <dbl>
## 1 Asian or Asian-American         1.53   0.672
## 2 Black or African-American       1.38   0.689
## 3 Hispanic or Latinx              1.58   0.802
## 4 More than one race              1.63   0.529
## 5 White or European-American      1.39   0.646
# Intensity of positive emotions
aaec_ipos_race <- aaec_race %>%
  group_by(race_all) %>%
  dplyr::summarise(mean_ipos = mean(avg_i_pos), sd_ipos = sd(avg_i_pos))
print(aaec_ipos_race)
## # A tibble: 5 × 3
##   race_all                   mean_ipos sd_ipos
##   <chr>                          <dbl>   <dbl>
## 1 Asian or Asian-American         1.90   0.511
## 2 Black or African-American       1.95   0.613
## 3 Hispanic or Latinx              1.77   0.618
## 4 More than one race              1.99   0.602
## 5 White or European-American      1.93   0.580
# Intensity of negative emotions
aaec_ineg_race <- aaec_race %>%
  group_by(race_all) %>%
  dplyr::summarise(mean_ineg = mean(avg_i_neg), sd_ineg = sd(avg_i_neg))
print(aaec_ineg_race)
## # A tibble: 5 × 3
##   race_all                   mean_ineg sd_ineg
##   <chr>                          <dbl>   <dbl>
## 1 Asian or Asian-American         1.79   0.635
## 2 Black or African-American       1.68   0.712
## 3 Hispanic or Latinx              1.94   0.724
## 4 More than one race              2.01   0.616
## 5 White or European-American      1.75   0.707

3. Data Visualisation

To visualise and compare means between races, I want to create and combine four column graphs using the Patchwork package (like in Figure 1 Attempt 1).
Firstly, I create a new object for the graph of the frequency of positive emotions and use <- to assign the plot to it. As previously explained, I use ggplot() and specify that the data is coming from the aaec_fpos_race dataframe. Then I use geom_col() to create a column graph, specifying that I want the mean_fpos variable (that I created in the descriptive statistics section above) and race groups to be on the x-axis and y-axis respectively. The reason I put the races on the y-axis instead of the x-axis is because the labels would have been very crowded and unreadable.
I use fill = race_all to indicate that I want each race to have a different column colour. Since I wanted to pick my own colours to make each column easily distinguishable and encourage comparison across all four graphs, I used the scale_fill_manual() function from ggplot2, which I learned from Jenny Richmond’s blog!

I used this handy website which gives you the hexadecimal number corresponding to the colour you like, that you can input into scale_fill_manual(). I then used theme() and legend.position to remove the legend that indicated which colour corresponds to which race, since this is already listed on the y-axis.
Finally, I use labs() to give the graph a title and axis labels.

graph_fpos_race <- ggplot(data = aaec_fpos_race) + geom_col(mapping = aes(x = mean_fpos, y = race_all, fill = race_all)) +  
  scale_fill_manual(values = c("#c793f1", "#8ecfe6", "#aae68e", "#eda673", "#e68e8e")) +
  theme(legend.position = "none") +
  labs(title = "Positive Emotions",
       x = "Mean Frequency",
       y = "Race")

I then did exactly the same thing for the other three graphs, just with slight changes to some of the function specifications:

  • Changed the title and x-axis labels for each graph, within the labs() function.
  • Included the coord_cartesian() function in the frequency of negative emotions graph because it was the only graph that went up to 1.5 instead of 2.0 on the x-axis.
graph_fneg_race <- ggplot(data = aaec_fneg_race) + geom_col(mapping = aes(x = mean_fneg, y = race_all, fill = race_all)) + 
  scale_fill_manual(values = c("#c793f1", "#8ecfe6", "#aae68e", "#eda673", "#e68e8e")) +
  theme(legend.position = "none") +
  labs(title = "Negative Emotions",
       x = "Mean Frequency",
       y = "Race") +
  coord_cartesian(xlim = c(0, 2.0))

graph_ipos_race <- ggplot(data = aaec_ipos_race) + geom_col(mapping = aes(x = mean_ipos, y = race_all, fill = race_all)) +  
  scale_fill_manual(values = c("#c793f1", "#8ecfe6", "#aae68e", "#eda673", "#e68e8e")) +
  theme(legend.position = "none") +
  labs(title = element_blank(),
       x = "Mean Intensity",
       y = "Race")

graph_ineg_race <- ggplot(data = aaec_ineg_race) + geom_col(mapping = aes(x = mean_ineg, y = race_all, fill = race_all)) + 
  scale_fill_manual(values = c("#c793f1", "#8ecfe6", "#aae68e", "#eda673", "#e68e8e")) +
  theme(legend.position = "none") +
  labs(title = element_blank(),
       x = "Mean Intensity",
       y = "Race")

Finally, using Patchwork, I combine these four graphs. As previously explained in the verification portion of this report, “+” assigns plots from left to right and “/” assigns plots from top to bottom.

(graph_fpos_race+graph_fneg_race) / (graph_ipos_race+graph_ineg_race) 

It’s looking like there might be some differences between races for the mean frequency and intensity of negative emotions, though a statistical test is needed to verify this.

4. Statistical Analysis

Once I found this website, it was really easy to figure out a one-way between-subjects ANOVA test. I used the aov() function from the stats package, specifying the dependent and independent variables, and the dataset. Then I used the summary() function to show the output of the statistical tests.

Frequency of positive emotions

race_fpos_ANOVA <- aov(avg_f_pos ~ race_all, data = aaec_race)
summary(race_fpos_ANOVA)
##              Df Sum Sq Mean Sq F value Pr(>F)
## race_all      4   0.81  0.2021   0.662  0.619
## Residuals   928 283.32  0.3053

The difference in frequency of positive emotions between races is not significant at the 0.05 level.

Frequency of negative emotions

race_fneg_ANOVA <- aov(avg_f_neg ~ race_all, data = aaec_race)
summary(race_fneg_ANOVA)
##              Df Sum Sq Mean Sq F value Pr(>F)  
## race_all      4    4.8  1.1903   2.776  0.026 *
## Residuals   928  397.9  0.4287                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The “*” indicates that there is a significant difference in frequency of negative emotions between the races at the 0.05 level.

Intensity of positive emotions

race_ipos_ANOVA <- aov(avg_i_pos ~ race_all, data = aaec_race)
summary(race_ipos_ANOVA)
##              Df Sum Sq Mean Sq F value Pr(>F)
## race_all      4   1.39  0.3480   1.036  0.388
## Residuals   928 311.86  0.3361

The difference in intensity of positive emotions between races is not significant at the 0.05 level.

Intensity of negative emotions

race_ineg_ANOVA <- aov(avg_i_neg ~ race_all, data = aaec_race)
summary(race_ineg_ANOVA)
##              Df Sum Sq Mean Sq F value Pr(>F)  
## race_all      4    4.4  1.1009   2.256 0.0614 .
## Residuals   928  452.9  0.4881                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As indicated by the “.”, the difference between races in terms of intensity of negative emotions is significant at the 0.1 level but not the 0.05 level.

5. Conclusion

Though these statistical tests do not directly indicate which races differ from each other, they do indicate that frequency of negative emotions in the pandemic is not the same across all races (this difference is significant at the 0.05 level).

Exploration 2 - Does the relationship between age and better emotional wellbeing still remain when including older and more high risk individuals?

In the discussion section of the paper, the researchers state that “It is also possible that findings would be different in a much older sample or in populations at highest risk, such as frail nursing-home residents.”
Their original csv file does not allow for an investigation into this, as the maximum age was 76 years and they did not include individuals in assisted living facilities (whose emotional experiences may be impacted differently due to increased vulnerability to the virus). Fortunately, in the excel file, the researchers collected data from individuals aged 76-98 in an aged care facility called the “Eskaton” sample.

1. Descriptive Statistics

The first step is to get this new dataset into the Rmd file.
I found an easy way to get this excel data into the Rmd file, using the Environment panel:

However, I found that you have to do this every time you close and open R Studio, which can get quite tedious.
Instead, I used the read_excel() function from the readxl package to upload and assign the data to a new aaec_eskaton object. I then pipe this into the select() function to just keep the five variables that I am interested in.

aaec_eskaton <- read_excel("~/Downloads/aaec_Eskaton_full.xlsx") %>%
  select(age, avg_f_pos, avg_f_neg, avg_i_pos, avg_i_neg)

I then pipe into the dplyr summarise() function to ascertain the means and SDs (using the mean() and sd() functions) for each of the four frequency/intensity of positive/negative emotions variables. I again use the kable() function to display these values in a nicer format.

eskaton_descriptives <- aaec_eskaton %>% 
  dplyr::summarise(
    mean_fpos = mean(avg_f_pos), sd_fpos = sd(avg_f_pos),
    mean_fneg = mean(avg_f_neg), sd_fneg = sd(avg_f_neg),
    mean_ipos = mean(avg_i_pos), sd_ipos = sd(avg_i_pos), 
    mean_ineg = mean(avg_i_neg), sd_ineg = sd(avg_i_neg))

knitr::kable(eskaton_descriptives)
mean_fpos sd_fpos mean_fneg sd_fneg mean_ipos sd_ipos mean_ineg sd_ineg
2.019779 0.5781814 1.369817 0.6596563 1.950736 0.5848439 NA NA

Unfortunately, this does not give me the mean and SD for the average intensity of negative emotions. I decided to check whether the issue is because the variable is not numeric, by using the class() function (from R base).

class(aaec_eskaton$avg_i_neg)
## [1] "character"

Looks like it is categorical. This is likely because some participants had missing values, thus they need to be removed.

Attempt 1 - as.numeric(unlist())

Firstly, I assign the aaec_eskaton data frame to a new object. I pipe this into the select() function to just keep the age and avg_i_neg variables, and then into filter() to keep all participants who do not have a value of NA.

aaec_eskaton_A1 <- aaec_eskaton %>% 
  select(age, avg_i_neg) %>% 
  filter(avg_i_neg != "NA")

I then use class() to check whether it is now numeric.

class(aaec_eskaton_A1$avg_i_neg)
## [1] "character"

Sadly it is still categorical. In an attempt to convert this variable to be numeric, I used the as.numeric() function which did not work as I got an error telling me that the variable was a list. Thus, I used the unlist() function within as.numeric(), as advised in this link

aaec_eskaton_A1_numeric <- as.numeric(unlist(aaec_eskaton_A1))

I use class() again to check whether this made it numeric.

class(aaec_eskaton_A1_numeric)
## [1] "numeric"

It is! So I should be able to obtain the mean and SD values using the summarise() function now…

# aaec_eskaton_A1_numeric %>%
  # dplyr::summarise(mean_ineg = mean(avg_i_neg), sd_ineg = sd(avg_i_neg))

…which still doesn’t work.

Attempt 2 - Confirming if the NAs are really the issue

Since this didn’t work, I questioned whether the NAs were the real problem. As such, I removed them from the original excel file and re-read that into the Rmd to determine whether this would solve the problem. I did this using read_excel() to upload it, and select() to choose the variables I wanted to keep.

aaec_eskaton_removedNA <- read_excel("~/Downloads/AAEC_full_removedNA.xlsx") %>% 
  select(age, avg_i_neg)

Now, I use the summarise() function to calculate the mean and SD.

aaec_eskaton_removedNA %>% 
  dplyr::summarise(mean_ineg = mean(avg_i_neg), sd_ineg = sd(avg_i_neg))
## # A tibble: 1 × 2
##   mean_ineg sd_ineg
##       <dbl>   <dbl>
## 1      1.69   0.714

It works, which must mean that the NAs are the issue.

Attempt 3 - mutate_all(~as.numeric(as.character(.)))

After some googling, I discovered that subset(), from R base, accomplishes the same thing as filter() so I wanted to give that a go to only keep participants who don’t have a value of NA.

aaec_eskaton_A3 <- subset(aaec_eskaton, avg_i_neg != "NA")

I then did a lot of googling and found this link where someone suggested using mutate_all(), as.numeric(), and as.character() together to make all variables numeric.
Though I was sceptical, I gave it a go by creating a new object and piping it into the function.

aaec_eskaton_A3_numeric <- aaec_eskaton_A3 %>% 
  mutate_all(~as.numeric(as.character(.))) 

I then checked the class to see if it worked.

class(aaec_eskaton_A3_numeric$avg_i_neg)
## [1] "numeric"

Now the real test is to see if the summarise() function will work to calculate the mean and SD

aaec_eskaton_A3_numeric %>% 
  dplyr::summarise(mean_ineg = mean(avg_i_neg), sd_ineg = sd(avg_i_neg))
## # A tibble: 1 × 2
##   mean_ineg sd_ineg
##       <dbl>   <dbl>
## 1      1.69   0.714

Amazing, it finally works!

2. Data Visualisation

To see whether the correlations look different with the Eskaton sample compared to the original sample used in Figure 1 of the paper, I created a new plot using the same facet_grid() method (that I thoroughly explained in the replication portion of this report).

The vast majority of this code is the same as for Figure 1, though two changes have been made:

  • Firstly, instead of using the original reported dataset, I am assigning the numeric version of the Eskaton data to a new object to pivot (as the code won’t run if categorical).
  • Secondly, the researchers chose to end the y-axis at 2.5, but I didn’t want to cut off a lot of participants whose scores are above 2.5 (this is especially important for the frequency of positive emotions plot where some of the really high scorers were part of the Eskaton sample). As such, I used the coord_cartesian() function and extended the y-axis to 3.0
aaec_eskatonpivot <- aaec_eskaton_A3_numeric %>%
  select(age, starts_with("avg")) %>%
  pivot_longer(
    cols = c(avg_f_pos, avg_f_neg, avg_i_pos, avg_i_neg),
    names_to = "Variable_Type",
    values_to = "Score",
    values_drop_na = TRUE) %>%
  mutate(
    measure_type = case_when(
      grepl("f", Variable_Type) ~ "Frequency",
      grepl("i", Variable_Type) ~ "Intensity"), 
    valence_type = case_when(
      grepl("pos", Variable_Type) ~ "Positive",
      grepl("neg", Variable_Type) ~ "Negative"))

aaec_esaktonfigure <- ggplot(data = aaec_eskatonpivot, aes(x = age, y = Score)) +
  geom_point(alpha = 0.1) +
  geom_smooth(method = "lm",
              color = "black") +
  facet_grid(rows = vars(measure_type),
             cols = vars(valence_type),
             labeller = labeller(measure_type = labelsM, valence_type = labelsV),
             switch = "y") +
  theme_bw() +
  theme(strip.placement = "outside",
        strip.background = element_blank(),
        panel.grid.minor = element_blank(),
        panel.grid.major = element_blank()) +
  coord_cartesian(ylim = c(0, 3.0)) +
labs(title = element_blank(),
       x = "Age (Years)",
       y = element_blank())

This is the new plot including the Eskaton sample:

print(aaec_esaktonfigure)

Here is a picture of the Figure 1 plot (without the Eskaton sample) for reference:

Just looking at it visually, the line of best fit for the intensity of negative emotions plot seems potentially steeper when including the Eskaton sample.

3. Statistical Analysis

I then use the cor.test() function (from stats) to do a correlation test between age and the frequency/intensity of positive/negative emotions, specifying that I want to use the Pearson correlation method.
To interpret the correlation coefficients in terms of strength, I have followed this table:

Frequency of positive emotions

# Original (no Eskaton)
cor.test(aaec$age, aaec$avg_f_pos, method=c("pearson"))
## 
##  Pearson's product-moment correlation
## 
## data:  aaec$age and aaec$avg_f_pos
## t = 5.9703, df = 943, p-value = 3.35e-09
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1286406 0.2515581
## sample estimates:
##       cor 
## 0.1908474
# Eskaton
cor.test(aaec_eskaton$age, aaec_eskaton$avg_f_pos, method=c("pearson"))
## 
##  Pearson's product-moment correlation
## 
## data:  aaec_eskaton$age and aaec_eskaton$avg_f_pos
## t = 8.9683, df = 1099, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2052275 0.3153559
## sample estimates:
##       cor 
## 0.2611413

Both the original and Eskaton correlation coefficients indicate a small positive relationship between age and frequency of positive emotions.

Frequency of negative emotions

# Original (no Eskaton)
cor.test(aaec$age, aaec$avg_f_neg, method=c("pearson"))
## 
##  Pearson's product-moment correlation
## 
## data:  aaec$age and aaec$avg_f_neg
## t = -8.6621, df = 943, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.3295509 -0.2113712
## sample estimates:
##        cor 
## -0.2714841
# Eskaton
cor.test(aaec_eskaton$age, aaec_eskaton$avg_f_neg, method=c("pearson"))
## 
##  Pearson's product-moment correlation
## 
## data:  aaec_eskaton$age and aaec_eskaton$avg_f_neg
## t = -10.67, df = 1099, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.3589542 -0.2518499
## sample estimates:
##        cor 
## -0.3063714

Though the correlation coefficient for the original sample suggests a small negative relationship between age and frequency of negative emotions, strength is increased in the Eskaton sample to a medium negative relationship.

Intensity of positive emotions

# Original (no Eskaton)
cor.test(aaec$age, aaec$avg_i_pos, method=c("pearson"))
## 
##  Pearson's product-moment correlation
## 
## data:  aaec$age and aaec$avg_i_pos
## t = 2.7726, df = 943, p-value = 0.00567
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.0263025 0.1528201
## sample estimates:
##        cor 
## 0.08992408
# Eskaton
cor.test(aaec_eskaton$age, aaec_eskaton$avg_i_pos, method=c("pearson"))
## 
##  Pearson's product-moment correlation
## 
## data:  aaec_eskaton$age and aaec_eskaton$avg_i_pos
## t = 3.8789, df = 1099, p-value = 0.0001112
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.05752836 0.17409814
## sample estimates:
##       cor 
## 0.1162134

The very weak positive relationship between age and intensity of positive emotions in the original sample is increased to a small positive relationship in the Eskaton sample.

Intensity of negative emotions

# Original (no Eskaton)
cor.test(aaec$age, aaec$avg_i_neg, method=c("pearson"))
## 
##  Pearson's product-moment correlation
## 
## data:  aaec$age and aaec$avg_i_neg
## t = -7.6212, df = 940, p-value = 6.131e-14
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.3004785 -0.1801362
## sample estimates:
##        cor 
## -0.2412345
# Eskaton
cor.test(aaec_eskaton_A3_numeric$age, aaec_eskaton_A3_numeric$avg_i_neg, method=c("pearson"))
## 
##  Pearson's product-moment correlation
## 
## data:  aaec_eskaton_A3_numeric$age and aaec_eskaton_A3_numeric$avg_i_neg
## t = -12.951, df = 1093, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4150308 -0.3122633
## sample estimates:
##        cor 
## -0.3647574

The correlation coefficient for the original sample indicates a small negative relationship between age and intensitive of negative emotions, but there is a medium negative relationship for the Eskaton sample.

4. Conclusion

All correlations were statistically significant at the 0.05 level (as per the p-values above), indicating that the population correlation between age and the four variables is not 0. Additionally, both the original and Eskaton correlations indicate that as age increases: frequency and intensity of positive emotions increases, and frequency and intensity of negative emotions decreases.
To answer this exploratory question, not only does the relationships between age and the frequency and intensity of positive emotions remain when including older and more at-risk individuals, the relationships between age and the frequency and intensity of negative emotions actually gets stronger.

Exploration 3 - Do the same relationships between age and Future Time Perspectives (FTP) remain when including older and more high risk individuals?

The logic of the paper was that since better wellbeing with age remained during the pandemic, this refutes the main claim of the SAVI model that stress evasion is the mechanism behind this relationship. In the exploratory question 2, I determined that this trend in increased wellbeing across ages still remained and was stronger when including much older and more at-risk individuals, thus providing additional support to reject the SAVI model.
SST, the alternative theory, suggests that older individual’s views on reduced time left drives their better emotional wellbeing. The researchers argued that Extension in the Future Time Perspective (FTP) scale (which was most strongly correlated with age in their sample, r = -0.62) best represents this “perceived time left” factor as per the SST.
As such, in this third exploration, I want to investigate whether the relationship between age and future time perspectives remain when including the older Eskaton sample (as per the discussion section’s suggestion for future research), and whether extension is still the most strongly correlated with age.

1. Descriptive Statistics

First, I assign the Eskaton data to a new object using the read_excel() function and <- operator. I then pipe this into select() to keep the relevant variables (age, FTP_av, FTP_Opp, FTP_Ext, FTP_Con).

aaec_eskaton_FTP <- read_excel("~/Downloads/aaec_eskaton_FTP.xlsx") %>% 
  select(age, FTP_av, FTP_Opp, FTP_Ext, FTP_Con)

As done previously, I pipe this new object into the summarise() function and use the mean() and sd() functions to calculate the means and SDs for the four FTP variables.

aaec_eskaton_FTP %>% 
  dplyr::summarise(
    mean_av = mean(FTP_av), sd_av = sd(FTP_av),
    mean_opp = mean(FTP_Opp), sd_opp = sd(FTP_Opp),
    mean_ext = mean(FTP_Ext), sd_ext = sd(FTP_Ext), 
    mean_con = mean(FTP_Con), sd_con = sd(FTP_Con))
## # A tibble: 1 × 8
##   mean_av sd_av mean_opp sd_opp mean_ext sd_ext mean_con sd_con
##     <dbl> <dbl>    <dbl>  <dbl>    <dbl>  <dbl>    <dbl>  <dbl>
## 1    3.95  1.40     4.45   1.64     3.63   1.68     3.60   1.52

2. Data Visualisation

Since the four FTP variables are no longer a matrix (they don’t overlap like frequency and intensity, and positive and negative), I will be using the Patchwork package instead of facet_grid() to combine the four plots.
As explained previously in the first attempt at Figure 1 in the verification portion of this report, from the ggplot2 package I use ggplot(), geom_point(), geom_smooth(), labs(), theme_bw(), and theme(). The code remains largely the same for each of the four FTP plots, just with minor changes to the title, x-axis, and y-axis labels.

# Plot for Overall FTP
Scatterplot_FTP_av <- ggplot(data = aaec_eskaton_FTP, aes(x = age, y = FTP_av)) +
  geom_point(alpha = 0.1) +
  geom_smooth(method = "lm", color = "black") +
  labs(title = "FTP Overall",
       x = "Age (Years)",
       y = "FTP Score") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5), plot.background = element_blank(), panel.grid.minor = element_blank(), panel.grid.major = element_blank())

# Plot for Extension
Scatterplot_FTP_Ext <- ggplot(data = aaec_eskaton_FTP, aes(x = age, y = FTP_Ext)) +
  geom_point(alpha = 0.1) +
  geom_smooth(method = "lm", color = "black") +
  labs(title = "Extension",
       x = "Age (Years)",
       y = "Extension Score") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5), plot.background = element_blank(), panel.grid.minor = element_blank(), panel.grid.major = element_blank())

# Plot for Opportunity
Scatterplot_FTP_Opp <- ggplot(data = aaec_eskaton_FTP, aes(x = age, y = FTP_Opp)) +
  geom_point(alpha = 0.1) +
  geom_smooth(method = "lm", color = "black") +
  labs(title = "Opportunity",
       x = "Age (Years)",
       y = "Opportunity Score") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5), plot.background = element_blank(), panel.grid.minor = element_blank(), panel.grid.major = element_blank())

# Plot for Constraint
Scatterplot_FTP_Con <- ggplot(data = aaec_eskaton_FTP, aes(x = age, y = FTP_Con)) +
  geom_point(alpha = 0.1) +
  geom_smooth(method = "lm", color = "black") +
  labs(title = "Constraint",
       x = "Age (Years)",
       y = "Constraint Score") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5), plot.background = element_blank(), panel.grid.minor = element_blank(), panel.grid.major = element_blank())

Then I use the same method for combining the four plots using the Patchwork package.

(Scatterplot_FTP_av+Scatterplot_FTP_Ext) / (Scatterplot_FTP_Opp+Scatterplot_FTP_Con)

It looks like extension is still the most strongly correlated with age, though the opportunity correlation also looks strong.

3. Statistical Analysis

Similar to exploratory question 2, I use cor.test() to find the Pearson correlation coefficients that indicate the relationship between age and the FTP scores for both the original and Eskaton-included data.

Overall FTP - FTP_av

# Original (no Eskaton)
cor.test(aaec$age, aaec$FTP_av, method=c("pearson"))
## 
##  Pearson's product-moment correlation
## 
## data:  aaec$age and aaec$FTP_av
## t = -11.5, df = 941, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4057860 -0.2937836
## sample estimates:
##        cor 
## -0.3510398
# Eskaton
cor.test(aaec_eskaton_FTP$age, aaec_eskaton_FTP$FTP_av, method=c("pearson"))
## 
##  Pearson's product-moment correlation
## 
## data:  aaec_eskaton_FTP$age and aaec_eskaton_FTP$FTP_av
## t = -15.155, df = 1097, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4638102 -0.3659590
## sample estimates:
##        cor 
## -0.4160884

Both the original and Eskaton datasets exhibit a medium negative relationship between age and overall FTP.

Extension - FTP_Ext

# Original (no Eskaton)
cor.test(aaec$age, aaec$FTP_Ext, method=c("pearson"))
## 
##  Pearson's product-moment correlation
## 
## data:  aaec$age and aaec$FTP_Ext
## t = -24.37, df = 941, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.6596864 -0.5812869
## sample estimates:
##        cor 
## -0.6220433
# Eskaton
cor.test(aaec_eskaton_FTP$age, aaec_eskaton_FTP$FTP_Ext, method=c("pearson"))
## 
##  Pearson's product-moment correlation
## 
## data:  aaec_eskaton_FTP$age and aaec_eskaton_FTP$FTP_Ext
## t = -18.973, df = 1097, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.5403185 -0.4511949
## sample estimates:
##        cor 
## -0.4970665

What was a large negative relationship between age and extension in the original sample, is only a medium negative relationship with the Eskaton sample.

Opportunity - FTP_Opp

# Original (no Eskaton)
cor.test(aaec$age, aaec$FTP_Opp, method=c("pearson"))
## 
##  Pearson's product-moment correlation
## 
## data:  aaec$age and aaec$FTP_Opp
## t = -7.3067, df = 941, p-value = 5.837e-13
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2912418 -0.1703904
## sample estimates:
##        cor 
## -0.2317099
# Eskaton
cor.test(aaec_eskaton_FTP$age, aaec_eskaton_FTP$FTP_Opp, method=c("pearson"))
## 
##  Pearson's product-moment correlation
## 
## data:  aaec_eskaton_FTP$age and aaec_eskaton_FTP$FTP_Opp
## t = -15.386, df = 1097, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4687555 -0.3714192
## sample estimates:
##        cor 
## -0.4212998

The strength of the relationship between age and opportunity is greater when including the Eskaton participants (medium negative relationship) compared to the original sample (small negative relationship).

Constraint - FTP_Con

# Original (no Eskaton)
cor.test(aaec$age, aaec$FTP_Con, method=c("pearson"))
## 
##  Pearson's product-moment correlation
## 
## data:  aaec$age and aaec$FTP_Con
## t = -3.5477, df = 941, p-value = 0.0004077
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.17742615 -0.05142457
## sample estimates:
##        cor 
## -0.1148874
# Eskaton
cor.test(aaec_eskaton_FTP$age, aaec_eskaton_FTP$FTP_Con, method=c("pearson"))
## 
##  Pearson's product-moment correlation
## 
## data:  aaec_eskaton_FTP$age and aaec_eskaton_FTP$FTP_Con
## t = -4.1168, df = 1097, p-value = 4.13e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.1811582 -0.0646837
## sample estimates:
##        cor 
## -0.1233457

The relationship between age and constraint is small and negative for both the original and Eskaton datasets.

4. Conclusion

All p-values indicated significance at the 0.05 level, such that all negative correlations exist for both the original and Eskaton data.
Though the strength of the negative relationships between age and constraint and overall FTP remained similar, the relationship between age and opportunity is stronger when including the older participants. Surprisingly, even though extension is still the most strongly correlated FTP measure with age when including the Eskaton participants, the strength of the relationship substantially decreased (from large to medium).
Therefore, the same relationships between age and Future Time Perspectives (FTP) remain when including older and more high risk individuals, with extension still being most strongly correlated; However, the relationship between age and opportunity is stronger, while the relationship between age and extension is weaker when including high-risk individuals over 76.

Part 4 - Recommendations

1. Include comprehensively documented code in the OSF

My first recommendation is that the researchers should share their code with detailed explanations in the OSF archive. Since they did not share their code, reproducing their results was more difficult as we did not know where to start or which methods would be most efficient. Additionally, attempting to reproduce their models would have been impossible without the code. Furthermore, when we got different results to what was reported, we couldn’t look at their code to figure out the issue. For example, we could not determine whether they simply rounded incorrectly or if they included individuals who did not report their race in their proportion of white people calculation. I would suggest using R Markdown to write notes explaining what each section of code is doing, and why they have chosen certain packages and specific functions. When explaining their code, the researchers should aim to do so in the simplest way without excessive jargon use, such that even an amateur coder (or rubber duck) could understand and there is no confusion over their choices. This workshop from Utrecht University on how to write code in a way that supports reproducibility, is a great resource because it includes links to videos and practice exercises (which enhance understanding), and thoroughly explains the utility of each suggestion.

2. Share the original data without post-collection manipulation

We noticed quite early on that the data from the csv file in the OSF repository only included participants in the final sample, as the researchers had already removed the excluded participants. As such, we could only verify that the final sample size was equivalent to what was reported, but were unable to verify whether they started with 974 participants and excluded 29. I mentioned in reaction 3 that I thought it was great that they pre-registered their exclusions, which thus increases confidence in the paper’s integrity; however, not sharing their original collected data breaks transparency which is key to open science. Since we cannot verify whether they actually stuck to the pre-registered exclusion criteria rather than manipulating the data to reach a desirable outcome, confidence in the paper’s conclusions is reduced. To solve this issue, the simple fix is to share both the original data without removing any participants or variables, and a final version of the data that the researchers used in the analysis. The researchers should have a look at this link detailing what to consider when sharing raw data, and this link which includes a checklist of what researchers should be uploading to their OSF repository.

3. Have better OSF file organisation

The researchers attempted to make their files easy to understand by listing out their column variable names and explanations, and the possible options for scores within each variable; However, since they had two of these documents (Codebook and Variables Name and Meaning), I often found myself switching rapidly between them. I would also keep accidentally opening the wrong files from the OSF archive because they were poorly named. Additionally, these two documents did not entirely match the data, as 10 variables were missing in the csv. For example, we had to recode the income brackets and make them numerical ourselves, despite the Variables Name and Meaning document claiming they had done this in the (non-existent) “income_num” variable column. This all made the start of our reproducibility journey difficult because we weren’t sure which documents were the most applicable versions and which code names were correct. To improve, the researchers should rename files so they are clear and concise (keeping them short and to the point). Furthermore, they should have had just one codebook document with the final updated version of the variable names and meanings. Here is a link to some tips for file organisation specifically in the OSF archive, and here is a video which explains the best naming conventions and includes a variety of tips for more generally keeping files organised.