The Goal

After my error-laden attempt at reproducing the scatterplot, in the past week a fellow group member, Lucas, was very successful with his attempt where he managed to produce something very similar to the original plot

Plot to be reproduced

Lucas’ attempt

As you can see he’s done a marvellous job and has gotten extraordinarily close to fully reproducing it. My goal for the week was to see if I could expand on his code and get it that bit further towards a perfect replication. Main areas to address are:

Progress

# let's get this out of the way

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3     v purrr   0.3.4
## v tibble  3.0.6     v dplyr   1.0.4
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(janitor)
## 
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

From observing Lucas’ code, he achieved this result through the use of grid.arrange() where he made each plot separately and combined them all at the very end. I was having difficulty making adjustments to this code to address most of the areas listed above so I opted to continue my former approach of using pivot_longer() followed by facet_wrap() to see whether it could help make this minor tweaks.

Thanks to some much-needed help by Jenny, I managed to figure out why my pivot_longer() function wasn’t working as intended - and it was the simple, yet infuriating fact that I just did not run the code separately to make the data frame first then apply the pivot_longer()

# let's try this again

argh <- "Covid_Data.csv" %>% 
  read_csv() %>% 
  select(age, avg_i_pos, avg_i_neg, avg_f_pos, avg_f_neg) %>% 
  na.omit(argh) %>% 
  pivot_longer(
    cols = -age,
    names_to = "fivalence",
    values_to = "means")
## 
## -- Column specification --------------------------------------------------------
## cols(
##   .default = col_double(),
##   ResponseId = col_character(),
##   gender_all = col_character(),
##   race_all = col_character(),
##   educ = col_character(),
##   income = col_character(),
##   emp = col_character(),
##   livealone = col_character(),
##   gender_bin_f = col_character(),
##   race_bin = col_character(),
##   emp_bin = col_character()
## )
## i Use `spec()` for the full column specifications.
print(argh)
## # A tibble: 3,768 x 3
##      age fivalence means
##    <dbl> <chr>     <dbl>
##  1    18 avg_i_pos  2.25
##  2    18 avg_i_neg  2.33
##  3    18 avg_f_pos  1.81
##  4    18 avg_f_neg  1.33
##  5    18 avg_i_pos  1.15
##  6    18 avg_i_neg  2.36
##  7    18 avg_f_pos  1.45
##  8    18 avg_f_neg  2.31
##  9    18 avg_i_pos  1.81
## 10    18 avg_i_neg  1.62
## # ... with 3,758 more rows

Next is to make it resemble this lovely depiction below.

# forming an organised data set

argh <- "Covid_Data.csv" %>% 
  read_csv() %>% 
  select(age, avg_i_pos, avg_i_neg, avg_f_pos, avg_f_neg) %>% 
  na.omit(argh) %>% 
  pivot_longer(
    cols = -age,
    names_to = "fivalence",
    values_to = "means") %>% 
  separate(fivalence, c("average", "freqint", "valence", sep = "_")) %>% 
  select(age, freqint, valence, means) %>% 
  mutate(freqint2 = ifelse(freqint == "i", "Intensity of \n Emotional Experience",
                           ifelse(freqint == "f", "Frequency of \n Emotional Experience", freqint))) %>% 
  mutate(valence2 = ifelse(valence == "pos", "Positive Emotions", 
                           ifelse(valence == "neg", "Negative Emotions", valence))) %>%
  select(valence2, age, freqint2, means)
## 
## -- Column specification --------------------------------------------------------
## cols(
##   .default = col_double(),
##   ResponseId = col_character(),
##   gender_all = col_character(),
##   race_all = col_character(),
##   educ = col_character(),
##   income = col_character(),
##   emp = col_character(),
##   livealone = col_character(),
##   gender_bin_f = col_character(),
##   race_bin = col_character(),
##   emp_bin = col_character()
## )
## i Use `spec()` for the full column specifications.
## Warning: Expected 4 pieces. Missing pieces filled with `NA` in 3768 rows [1, 2,
## 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
print (argh)
## # A tibble: 3,768 x 4
##    valence2            age freqint2                               means
##    <chr>             <dbl> <chr>                                  <dbl>
##  1 Positive Emotions    18 "Intensity of \n Emotional Experience"  2.25
##  2 Negative Emotions    18 "Intensity of \n Emotional Experience"  2.33
##  3 Positive Emotions    18 "Frequency of \n Emotional Experience"  1.81
##  4 Negative Emotions    18 "Frequency of \n Emotional Experience"  1.33
##  5 Positive Emotions    18 "Intensity of \n Emotional Experience"  1.15
##  6 Negative Emotions    18 "Intensity of \n Emotional Experience"  2.36
##  7 Positive Emotions    18 "Frequency of \n Emotional Experience"  1.45
##  8 Negative Emotions    18 "Frequency of \n Emotional Experience"  2.31
##  9 Positive Emotions    18 "Intensity of \n Emotional Experience"  1.81
## 10 Negative Emotions    18 "Intensity of \n Emotional Experience"  1.62
## # ... with 3,758 more rows

Finally, things are going well

Some of the functions that I used above (namely separate() and ifelse()) I would not know how to use or if it even existed had Jenny not introduced them to me as part of a solution to the median household income code last week so quick shoutout to her.

Now, with this table formed, the actual plotting should be relatively easy since I’ll just “borrow” some code from Lucas’ attempt.

shouldwork <- ggplot(argh, aes(x = age, y = means)) +
  geom_point(alpha = 1/18) +
  ylim(0, 2.5) +
  theme_classic() +
  theme(panel.background = element_rect(colour = "black",
                                        size = 1/40),
        strip.text.x = element_text(size = 10, face = "bold"),
        strip.text.y = element_text(size = 10, face = "bold"),
        strip.background = element_blank(),
        axis.title.x = element_text(face = "bold")) +
  geom_smooth(method = "lm", colour = "black", size = 1.5) +
  facet_wrap(vars(freqint2, valence2), strip.position = "left") +
  xlab("Age (Years)") +
  ylab("")

print(shouldwork)
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 476 rows containing non-finite values (stat_smooth).
## Warning: Removed 476 rows containing missing values (geom_point).

What have I done

So, not quite. I can feel the frustration already starting to build so perhaps I should consult google before further butchering this.

And what does google suggest?

Use facet_grid

easypeasy <- ggplot(argh, aes(x = age, y = means)) +
  geom_point(alpha = 1/18) +
  ylim(0, 2.5) +
  theme_classic() +
  theme(panel.background = element_rect(colour = "black",
                                        size = 1/40),
        strip.text.x = element_text(size = 10, face = "bold"),
        strip.text.y = element_text(size = 10, face = "bold"),
        strip.placement = "outside",
        strip.background = element_blank(),
        axis.title.x = element_text(face = "bold")) +
  geom_smooth(method = "lm", colour = "black", size = 1.5) +
  facet_grid(freqint2 ~ valence2, switch = "y") +
  xlab("Age (Years)") +
  ylab("")

print(easypeasy)
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 476 rows containing non-finite values (stat_smooth).
## Warning: Removed 476 rows containing missing values (geom_point).

YESSSSSS

And it turned out beautifully, if I do say so myself. For the sake of comparison, here was the plot to be reproduced:

Bears quite the similarity

HOWEVER, to fit the recurring theme of the past couple of weeks, my brief moment of happiness is spoilt as I am now just realising that the actual plots themselves do not align with the ones in the study. For example, in the negative emotions/intensity graph, the 20 year old data line starts at around 2.0 units but in the one I produced, it starts below that at around 1.8 units.

Yet another premature celebration, I guess.

Challenges/Successes

To start off on a positive note, I managed to make the final few tweaks on Lucas’ plot to get a very similar graph to the one presented in the COVID paper. However, I later realised the data in the graph was off and now my day is ruined.

Some challenges I encountered for this week was figuring out the use of the facet_grid function. As I mentioned earlier, I intended to use facet_wrap and I actually did use that initially but it turned out awful and no matter how long and how much I continued fiddling around with it, it simply did not want to cooperate. Upon switching to facet_grid it started to sing for me but it did take some time to familiarise myself with the functions and arguments and such before reaching the graph above.

Next steps

First, I am going to pretend that I did not notice the misalignment with the data and celebrate with a long-awaited nap. Though I inevitably need to address this issue so I will ask my fellow group members for help and see if they can figure it out. Currently, my suspicions are of the geom_smooth and geom_point functions since their use elicited a short string of messages that make mention of omitting some rows of data that contain “non-finite” values.