After adding some basic comments to my code last week, this week I hope to expand my explanation of what the code in my verification report does. It may take a few hours, but I aim to have much more detailed comments for each chunk of code by the end of the week.
The following packages were used to produce the verification report.
library(tidyverse)
library(ggplot2)
library(here)
library(dplyr)
library(papaja)
library(psych)
library(MOTE)
library(wesanderson)
library(ggpubr)
master <- read.csv("AgeAdvantagesEmotionCovid_Data.csv", header = TRUE)
The role of each package in this verification report:
tidyverse: Allows for easy data wrangling with shared language across packages.
ggplot2: Allows for detailed data visualisation using a layered approach.
here: Allows for construction of a path to project files instead of setting working directory.
dplyr: Allows for simple data manipulation.
papaja: Allows for APA formatting of results and tables.
psych: A toolbox for personality, psychometrics and experimental psychology. Includes “describe” function (M and SD).
MOTE: Easy-to-use effect size calculator compatible with “papaja” package.
wesanderson: Cool colour palettes for graphs.
ggpubr: Allows for quick creation of quantile plots and other publication-ready plots.
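Side note: the here package is loaded above, but the read.csv() call uses a bare file name, which only works if the working directory happens to contain the data. A minimal sketch of how the same import could be written with here(), assuming the CSV sits in the project root, would be:
### Hypothetical alternative: build the file path with here() instead of relying on the working directory
master <- read.csv(here("AgeAdvantagesEmotionCovid_Data.csv"), header = TRUE)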
Here are the demographics we are trying to reproduce:
For any demographic statistic in percentage form, a single process can be used. First, select the target variable from the master data set, then use filter() to keep only the response you are interested in. Next, calculate the percentage of participants who gave that response by counting the number of rows (participants) remaining and dividing it by the total number of rows in the master data set. This process is used to reproduce several of the demographics, so I won’t repeat the explanation each time.
# Demographics
## Percentage of females
### Filter female participants
n_female <- master %>%
select(gender_bin_f) %>%
filter(gender_bin_f == "Female")
### Calculate percentage of females
100 * nrow(n_female) / nrow(master)
## [1] 49.20635
To find the age range of the sample, select the age variable and apply the range() function. To find the age M and SD of the sample, the summarise() function from the dplyr package can be used, calling mean() and sd() from within it.
## Age range, mean and SD
### Use range function to find age range
age_range <- master$age %>% range()
### Use summarise function to find age mean and SD
age_mean_sd <- master %>%
summarise(mean_age = mean(age), sd_age = sd(age))
age_mean_sd; age_range
## mean_age sd_age
## 1 45.15238 16.78756
## [1] 18 76
## Percentage white
### Filter white participants
n_white <- master %>%
select(race_bin) %>%
filter(race_bin == "White")
### Calculate percentage white
100 * nrow(n_white) / nrow(master)
## [1] 75.44974
Reproducing the median household income is a difficult task, largely because of the way the data are coded: we must reorganise and rename the answers to the income variable. First, we use the select() function to pick the income variable and the filter() function to drop the responses of participants who declined to answer. Next, we separate each individual response into three columns, splitting on spaces, since each response contains three words/numbers. We then keep only the “low” and “high” columns, because the middle column only contains words like “to” and “than”.
The next part is a little more complicated. The ifelse() function lets us apply a condition element-wise to every value in a variable. Here we create a new variable called “low2” that is identical to the “low” variable, except that any response beginning with the word “Less” is replaced with 0 and any response beginning with the word “Greater” is replaced with 275,000. Lastly, we drop the old “low” column, since we now have our new and improved “low2” column. The “low2” and “high” columns now contain the lower and upper limits of each participant’s reported income range.
## Median household income
### Reorganise and rename income answers
dem_income <- master %>%
select(income) %>%
filter(income != "Decline to answer") %>%
separate(income, c("low", "to", "high"), sep = " ") %>%
select(low, high) %>%
mutate(low2 = ifelse(low == "Less", 0,
ifelse(low == "Greater", 275000, low))) %>%
select(-low)
Next, we use the gsub() function to substitute the ‘$’ in each value with nothing, effectively removing the dollar sign from each response. After that, we convert each variable to numeric using the as.numeric() function, again substituting the commas in each response with nothing (effectively removing them). Finally, we can calculate the median income by creating a new variable, midpoint, which is the midpoint of the upper and lower income limits for each participant, and then applying the summarise() function to find the median of these midpoints. Sure enough, the answer falls within the range described in the original paper.
dem_income$high <- gsub('[$]','',dem_income$high)
dem_income$low2 <- gsub('[$]','',dem_income$low2)
dem_income$high <- as.numeric(gsub(',','',dem_income$high))
dem_income$low2 <- as.numeric(gsub(',','',dem_income$low2))
### Calculate median income
dem_income %>%
mutate(midpoint = (high+low2)/2) %>%
summarise(med_income = median(midpoint))
## med_income
## 1 55000
The rest of the demographics can be found using the same method described earlier regarding how to find percentages.
## Percentage working for pay
### Filter participants that work for pay
n_pay <- master %>%
select(emp_bin) %>%
filter(emp_bin == "Working")
### Calculate percentage working for pay
100 * nrow(n_pay) / nrow(master)
## [1] 56.19048
## Percentage educated
### Filter educated participants
n_educated <- master %>%
select(educ) %>%
filter(educ != "Graduated high school (or GED)",
educ != "Less than high school")
### Calculate percentage educated
100 * nrow(n_educated) / nrow(master)
## [1] 87.72487
## Percentage living alone
### Filter participants living alone
n_alone <- master %>%
select(livealone) %>%
filter(livealone == "Yes")
### Calculate percentage living alone
100 * nrow(n_alone) / nrow(master)
## [1] 22.96296
Success! All of the demographic statistics were reproduced. This process was lengthy and required a lot of repetition. The tidyverse package is extremely valuable for data wrangling, particularly because of the ability to pipe between actions, allowing you to construct a hierarchy of steps (e.g. select, then filter, then mutate). The median household income was definitely the hardest to reproduce, because the original data was not in a format appropriate for finding the median. A little bit of re-coding in the original survey from the authors would have saved a lot of time here.
The main lesson we learned from our first step in the verification report was that there are multiple ways to achieve the same result in R. Our original code for reproducing the demographics was much longer and less efficient, but the ability to pipe between functions granted by the tidyverse package allowed us to “trim the fat” of our code significantly.
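To illustrate what we mean, a small helper function could trim it even further. This is just a sketch of a possible refactor (not the code we actually ran), wrapping the repeated percentage calculation from the blocks above:
### Hypothetical helper that wraps the repeated percentage calculation
pct_of_sample <- function(data, condition) {
  # keep the rows matching the condition, then express them as a percentage of all rows
  100 * nrow(filter(data, {{ condition }})) / nrow(data)
}
### Example calls - these should match the percentages reported above
pct_of_sample(master, gender_bin_f == "Female")
pct_of_sample(master, emp_bin == "Working")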
We intend to reproduce the summary statistics (means and SDs) for all of the following variables described in the paper:
After the lengthy process of reproducing the demographics, we decided to look for a more efficient way to reproduce the summary statistics. We found the psych package, whose describe() function produces the means and SDs for all of the variables you pass to it. As such, we first created a new data frame with all the variables of interest, then ran it through the describe() function. Printing the result gives a table of summary statistics. The describe() function has a lot of arguments. For our example, most are not necessary, but here is a description of each one:
na.rm: The default is to delete missing data. na.rm=FALSE will delete the case.
interp: Should the median be standard or interpolated
skew: Should the skew and kurtosis be calculated?
ranges: Should the range be calculated?
trim: trim=.1 – trim means by dropping the top and bottom trim fraction
type: Which estimate of skew and kurtosis should be used?
check: Should we check for non-numeric variables?
fast: if TRUE, will do n, means, sds, min, max, ranges for an improvement in speed. If NULL, will switch to fast mode for large (ncol * nrow > 10^7) problems, otherwise defaults to fast = FALSE
quant : if not NULL, will find the specified quantiles (e.g. quant=c(.25,.75) will find the 25th and 75th percentiles)
IQR: If TRUE, show the interquartile range
omit: Do not convert non-numerical variables to numeric, omit them instead
# Summary Statistics
### Select the appropriate variables from the data set
sum_stat <- master %>%
select(FTP_av, FTP_Ext, FTP_Opp, FTP_Con,
risk_self, risk_comp, risk_pop, emp_affect,
health, Extraversion, Conscientiousness, avg_f_pos,
avg_f_neg, avg_i_pos, avg_i_neg)
### Create a table of summary statistics using the describe function
sum_table <- describe(sum_stat, na.rm = TRUE, interp = FALSE,
skew = FALSE, ranges = FALSE, trim = .1,
type = 3, check = FALSE, fast = NULL, quant = NULL,
IQR = FALSE, omit = TRUE, data = NULL) %>%
select(-starts_with("vars"))
print(sum_table)
## n mean sd se
## FTP_av 943 4.09 1.35 0.04
## FTP_Ext 943 4.02 1.91 0.06
## FTP_Opp 943 4.73 1.62 0.05
## FTP_Con 943 3.74 1.76 0.06
## risk_self 945 2.36 1.11 0.04
## risk_comp 945 2.45 1.34 0.04
## risk_pop 945 2.97 0.94 0.03
## emp_affect 945 1.41 1.46 0.05
## health 945 3.36 0.99 0.03
## Extraversion 945 3.44 1.65 0.05
## Conscientiousness 945 5.22 1.33 0.04
## avg_f_pos 945 1.97 0.56 0.02
## avg_f_neg 945 1.42 0.66 0.02
## avg_i_pos 945 1.93 0.59 0.02
## avg_i_neg 942 1.77 0.70 0.02
Success! All of the summary statistics were reproduced. This was much quicker than reproducing the demographics. The describe() function from the psych package is extremely useful in creating the summary statistics for a range of numeric variables.
Once again, our original code for this section was much longer, and involved using the mean() and sd() functions on each individual variable. We could have used the summarise() and across() functions together to produce the same summary stats for all of the variables (see the sketch below), but the psych package offers a much simpler method and produces the results in a nicely formatted table.
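For illustration, here is a sketch of what that alternative could look like (not the code we ran) - it would return one mean and one SD column per variable rather than psych's row-per-variable table:
### Hypothetical alternative to describe() using summarise() and across()
sum_stat %>%
  summarise(across(everything(),
                   list(M = ~ mean(.x, na.rm = TRUE),
                        SD = ~ sd(.x, na.rm = TRUE))))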
Here is the table we are trying to reproduce:
Our plan to reproduce this table revolves around the apa_table() function within the papaja package. After reading about the papaja package and how its functions can be used to produce APA-formatted tables and statistical output, we found that Table 1 looks very similar to the tables produced by the apa_table() function. As such, we must reorganise the raw data into the appropriate long format, where the name, mean, SD and CI of each emotion frequency are stored as separate variables.
Firstly, we must select the relevant variables from the master data set.
# Table 1 - Mean Frequencies of Emotions
### Select the frequencies of each emotion from master data set.
emot <- master %>%
select(f_calm, f_qui, f_app, f_int, f_cont, f_hap, f_rela, f_pea,
f_ener, f_aff, f_amu, f_acc, f_joy, f_pro, f_reli, f_exc,
f_conc, f_anx, f_bor, f_fru, f_irr, f_sad, f_lon, f_fear,
f_ang, f_dis, f_gui, f_emb, f_sha)
Next, we can obtain the means and SDs for each emotion frequency using the summarise() function, in conjunction with the across() function, which allows us to summarise across all of our variables. The “na.rm = TRUE” argument tells R to remove the NA responses from the calculation of the means and SDs.
### Create a new data frame with the descriptives (M and SD) for the emotions.
emot_des <- emot %>%
summarise(
across(.cols = everything(), na.rm = TRUE, list(M = mean, SD = sd)))
Next, we can rename each mean variable according to the name of each respective emotion using the rename() function. The format is “new name” = “old name”.
### Rename the emotions to table-friendly names.
emot_des <- emot_des %>%
rename(Calm = f_calm_M, Quiet = f_qui_M, Appreciative = f_app_M,
Interested = f_int_M, Content = f_cont_M, Happy = f_hap_M,
Relaxed = f_rela_M, Peaceful = f_pea_M, Energetic = f_ener_M,
Affectionate = f_aff_M, Amused = f_amu_M, Accomplished = f_acc_M,
Joyful = f_joy_M, Proud = f_pro_M, Relieved = f_reli_M,
Excited = f_exc_M, Concerned = f_conc_M, "Anxious/worried" = f_anx_M,
Bored = f_bor_M, Frustrated = f_fru_M, Irritated = f_irr_M,
Sad = f_sad_M, Lonely = f_lon_M, Fearful = f_fear_M,
Angry = f_ang_M, Disgusted = f_dis_M, Guilty = f_gui_M,
Embarrassed = f_emb_M, Ashamed = f_sha_M)
Now we have to pivot the data into long form. First, we select the columns which we just renamed, by telling R to select all the columns that DON’T start with “f_”. The pivot_longer() function allows us to take all of the variables we just renamed, put their names into a new column called “Valence and emotion”, and put their values into another column called “M” (for mean).
After that, we can repeat this pivot process for the SD values, by selecting all columns that DO start with “f_” (which are the SD columns) and pivoting their values to a new column called “SD”.
We can combine our two pivoted data frames into one, using the bind_cols() function. Now we have a data frame with three columns which display the names, means and SDs.
### Pivot the descriptives twice - one pivot creates two variables with the name and mean of each emotion, the other creates the SD.
emot_pivot1 <- emot_des %>%
select(-starts_with("f_")) %>%
pivot_longer(cols = everything(),
names_to = "Valence and emotion",
values_to = "M")
emot_pivot2 <- emot_des %>%
select(starts_with("f_")) %>%
pivot_longer(cols = everything(),
names_to = NULL,
values_to = "SD")
### Bind them together to create table-friendly descriptive stats.
emot_des_table <- bind_cols(emot_pivot1, emot_pivot2)
Now it is time to find the confidence intervals for the frequency of each emotion. First, we create a new value for the number of participants and the standard error, which can be found by dividing the SD by the square-root of the number of participants.
Next, we can create two new variables: a lower CI limit and an upper CI limit, using the equation for t-based 95% CIs. The qt() function produces the t-value which has x% of the distribution to the left of it, given y degrees of freedom. As such, the call below gives the t-value which has 97.5% of the distribution to the left of it, given n-1 degrees of freedom. We can multiply that t-value by the newly created SE and add/subtract it from the means in the “M” column of the data frame. Consequently, we get two new variables in our data frame: one for our lower CI limits and another for our upper CI limits.
### Create lower and upper 95% CI limits
n <- 945; se <- emot_des_table$SD / sqrt(n)
emot_des_table <- emot_des_table %>%
mutate(lower_CI = M - qt(1 - (0.05 / 2), n - 1) * se,
upper_CI = M + qt(1 - (0.05 / 2), n - 1) * se)
Next, we must round the CI limit values to two decimal places using the format() and round() functions. We specify the number of decimals using nsmall = 2.
### Round CIs to 2 d.p. (can't do this above because of non-numeric variable)
emot_des_table$lower_CI <- format(round(emot_des_table$lower_CI, 2), nsmall = 2)
emot_des_table$upper_CI <- format(round(emot_des_table$upper_CI, 2), nsmall = 2)
Time to merge the CI limits into one column. This is a tricky task. After consulting with Jenny, she pointed us in the direction of the unite() function, which combines the values of two columns into one. The “col” argument specifies the name of the new column we are going to produce, then we specify which columns we want to merge and what we wish to separate the values with - in this case, a comma. Finally, to add square brackets around the CI limits, we can use the paste0() function to insert a square bracket on either side of the interval.
### Combine the CIs into one column
emot_des_table <- emot_des_table %>%
unite(col = "95% CI",
lower_CI:upper_CI,
sep = ", ") %>%
mutate("95% CI" = paste0("[", `95% CI`, "]"))
Table 1 groups each emotion under “Positive” and “Negative” in the original paper. Originally, we weren’t sure how we could replicate this using the apa_table() function, so we thought we would have to make a separate table for each group of emotions. However, after a little investigation we found a solution.
We must first split all of our variables so far into two new data frames. We can do this using the slice() function, which effectively acts like the select() function, but selects rows instead of columns. Since the emotions are already ordered according to the original table, we can take the first 16 emotions and put them under “Positive”, and put the remaining 13 emotions under “Negative”.
### Split the data into positive/negative emotions
pos_emot_table <- emot_des_table %>%
slice(1:16)
neg_emot_table <- emot_des_table %>%
slice(17:29)
Finally, it is time to put the data through the apa_table() function. All of the time and effort put into correctly formatting the data is balanced out by the time saved formatting the aesthetics of the table. Within the apa_table() function, we specify that we would like to list the emotions under “Positive emotions” and “Negative emotions”. We can also specify a caption, a note at the bottom of the table, how to align the values within each column, and the font size of the values within the table.
### Apply the apa_table function (NOTE: Only produces correct output in APA formatted markdown template which comes with the papaja package)
apa_table(list('Positive emotions' = pos_emot_table,
'Negative emotions' = neg_emot_table),
caption = "Mean Frequencies of Emotions",
note = "N = 945. CI = confidence interval",
align = c('l', 'c', 'c', 'c'),
font_size = "scriptsize")
| Valence and emotion | M | SD | 95% CI |
|---|---|---|---|
| Positive emotions | |||
| Calm | 2.44 | 0.87 | [2.39, 2.50] |
| Quiet | 2.43 | 0.87 | [2.38, 2.49] |
| Appreciative | 2.40 | 0.93 | [2.35, 2.46] |
| Interested | 2.28 | 0.83 | [2.23, 2.33] |
| Content | 2.15 | 0.94 | [2.09, 2.21] |
| Happy | 2.13 | 0.80 | [2.08, 2.19] |
| Relaxed | 2.13 | 0.89 | [2.07, 2.19] |
| Peaceful | 2.05 | 0.95 | [1.99, 2.11] |
| Energetic | 1.90 | 0.80 | [1.85, 1.95] |
| Affectionate | 1.89 | 0.86 | [1.83, 1.94] |
| Amused | 1.87 | 0.72 | [1.83, 1.92] |
| Accomplished | 1.84 | 0.87 | [1.78, 1.89] |
| Joyful | 1.71 | 0.90 | [1.65, 1.76] |
| Proud | 1.67 | 0.97 | [1.61, 1.73] |
| Relieved | 1.48 | 0.88 | [1.42, 1.53] |
| Excited | 1.46 | 0.79 | [1.41, 1.51] |
| Negative emotions | |||
| Concerned | 2.23 | 0.91 | [2.17, 2.29] |
| Anxious/worried | 2.00 | 1.06 | [1.94, 2.07] |
| Bored | 1.88 | 1.12 | [1.80, 1.95] |
| Frustrated | 1.85 | 0.93 | [1.79, 1.90] |
| Irritated | 1.75 | 0.89 | [1.70, 1.81] |
| Sad | 1.69 | 0.97 | [1.63, 1.75] |
| Lonely | 1.55 | 1.25 | [1.47, 1.63] |
| Fearful | 1.38 | 1.06 | [1.31, 1.44] |
| Angry | 1.35 | 0.89 | [1.29, 1.40] |
| Disgusted | 1.16 | 0.99 | [1.10, 1.22] |
| Guilty | 0.63 | 0.87 | [0.58, 0.69] |
| Embarrassed | 0.51 | 0.76 | [0.47, 0.56] |
| Ashamed | 0.44 | 0.76 | [0.40, 0.49] |
Note. N = 945. CI = confidence interval
Success! Table 1 was reproduced. It is important to note that the results from the apa_table() function look best when the markdown file is knitted to a Word document or PDF. In HTML form, the table still looks nice, but it is definitely not in APA format. This function (and package) is super cool. It is extremely useful for anybody who needs to produce APA formatted output for manuscripts or assignments. It does all the formatting for us, which is nice considering what is coming next!
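For anyone wanting to try the full workflow, papaja also supplies APA-styled R Markdown output formats (apa6_pdf() and apa6_docx()). A minimal sketch of rendering a report with one of them, where the file name is just a placeholder, would be:
### Hypothetical render call using papaja's APA PDF format (file name is a placeholder)
rmarkdown::render("my_verification_report.Rmd", output_format = papaja::apa6_pdf())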
Here are the scatter plots we are trying to reproduce:
This is a tough task, so we can break it down into two parts - preparing the data and plotting the data.
First we need to prepare the data for visualisation. We start by selecting the appropriate variables (age and emotion frequency and intensity), omitting the NA responses, and pivoting the emotion data into long form.
Next, we create a new variable called “freqint2” and use the ifelse() function again to apply two conditions to the values in the variable. The first condition specifies that all values that begin with “i” (signalling intensity) are to be replaced by “Intensity of Emotional Experience”. Secondly, everything that begins with “f” (signalling frequency) is replaced with “Frequency of Emotional Experience”. The “\n” tells R to push the rest of the words to the next line (akin to pressing “enter” on your keyboard).
We repeat this step for the valence of the emotions, making the values either “Positive Emotions” or “Negative Emotions”.
Finally, we use the select() function to select only the variables of interest, discarding the old variables which we have since mutated.
### Select relevant variables, omit missing data, reorganise data into long form, create new variables for intensity and valence.
plot_data <- master %>%
select(age, avg_i_pos, avg_i_neg, avg_f_pos, avg_f_neg) %>%
na.omit() %>%
pivot_longer(cols = -age,
names_to = "fivalence",
values_to = "means") %>%
separate(fivalence, c("average", "freqint", "valence"), sep = "_") %>%
select(age, freqint, valence, means) %>%
mutate(freqint2 = ifelse(freqint == "i", "Intensity of \n Emotional Experience",
ifelse(freqint == "f", "Frequency of \n Emotional Experience", freqint))) %>%
mutate(valence2 = ifelse(valence == "pos", "Positive Emotions",
ifelse(valence == "neg", "Negative Emotions", valence))) %>%
select(valence2, age, freqint2, means)
The second step is to plot the data. This is where we first used the ggplot2 package, which allows for effective visualisation of data by creating layered graphs. Each line of code below specifies an aesthetic component of the graph.
First, we create the plot using the ggplot() function, assigning the data and mapping the “age” variable to the x-axis and the emotion intensity/frequency to the y-axis.
The geom_point() function specifies that the data should be plotted as dots (scatter plot), and the arguments within specify the transparency (alpha) and size of the dots.
The coord_cartesian() function allows us to set the appropriate y-axis limits to reflect the scale of the original figure. The bulk of the rest of the code deals with the aesthetics of the figure text to help it resemble the original figure.
By layering the theme() function over and over, we can add multiple aesthetic changes. These changes include:
Change the colour scheme to black and white using the theme_bw() function.
Change the axis text to Arial Narrow size 12 using the element_text() function.
Remove the background grids from the plots by assigning panel.grid.major = element_blank() and panel.grid.minor = element_blank().
Add a thin black border around each panel using panel.background = element_rect(colour = "black").
Change the size and font of the facet labels and the x-axis title, and make them bold, using the same element_text() function as above.
Relocate the facet strip labels to outside the plot panels using strip.placement = "outside", and remove their grey background with strip.background = element_blank(). This becomes more important when we group the plots together.
Next, we add the regression lines using the geom_smooth() function, assigning “lm” (linear model) as the method and specifying the size and colour. After that, we can combine the four plots using the facet_grid() function, with the “~” symbol specifying which variable defines the rows and which defines the columns of the grid. Finally, we set the x and y axis labels using the xlab() and ylab() functions.
### Create scatter plots with ggplot and geom_point. geom_smooth adds regression lines and facet_grid arranges the four plots together. Rest of code makes aesthetic changes.
scatter_plot <- ggplot(plot_data, aes(x = age, y = means)) +
geom_point(alpha = 1/13, size = 2) +
coord_cartesian(ylim = c(0,2.5)) +
theme_bw() +
theme(axis.text = element_text(family="Arial Narrow", size = 12)) +
theme (panel.grid.major = element_blank(), panel.grid.minor = element_blank()) +
theme(panel.background = element_rect(colour = "black", size = 1/40),
strip.text.x = element_text(family="Arial Narrow", size = 14, face = "bold"),
strip.text.y = element_text(family="Arial Narrow", size = 13, face = "bold"),
strip.placement = "outside",
strip.background = element_blank(),
axis.title.x = element_text(family="Arial Narrow", size = 14, face = "bold")) +
geom_smooth(method = "lm", colour = "black", size = 1.5) +
facet_grid(freqint2 ~ valence2, switch = "y") +
xlab("Age (Years)") +
ylab("")
print(scatter_plot)
After all that, we get our scatter plots! Unlike Table 1, a lot of time went into formatting the aesthetics of this figure. Having said that, the ggplot2 package and its associated functions made it easy to layer on small aesthetic changes one after the other as we figured out how to alter every little detail of our plots.
The way we display data has a tremendous impact on how it is interpreted. Figures like these demonstrate why sharing code is so important when it comes to reproducibility. It was an arduous journey reproducing these plots, and although it wasn’t necessarily difficult to obtain the same relevant data, it was certainly difficult to visualise it in the same manner as the original authors.
Although I personally did not work on reproducing these plots, this part of the verification report was a real test of endurance and teamwork in a coding environment. Whilst we took a “divide and conquer” approach to the overall assignment, this particular reproducibility triumph was achieved through the combined effort of multiple group members. Rory, Josh and Lucas did a great job building on each other’s efforts and helping me understand what each line of code does.
One large challenge I faced this week was explaining the function of code written by my other group members. This helped me realise how important it is to accompany your code with explanatory comments. If I am having a hard time understanding the code written by my own group members, how hard must it be for others to understand my code? As such, I tried to make my comments sufficiently detailed that any reader with a basic knowledge of R could understand what each line of our code does. I also tried to follow the principles of DRY and KISS (although both are easy to violate in a coding context where there is a LOT of repetition and complexity haha).
I still think I have some way to go in this area, but I feel like I have improved significantly on my efforts from last week. Where possible, I have explained the role of each function and even some of the individual arguments. Perhaps more notably, I have included comments about what we had learned at different stages of the verification process. These lessons might be useful for others to know.
My next and final step is to complete my verification report. I hope to finish my summary and reactions, add the same detailed comments to my exploratory section, and provide some reasonable reproducibility recommendations to the authors of the original paper.