After my error-laden attempt at reproducing the scatterplot, in the past week a fellow group member, Lucas, was very successful with his attempt where he managed to produce something very similar to the original plot
Plot to be reproduced
Lucas’ attempt
As you can see he’s done a marvellous job and has gotten extraordinarily close to fully reproducing it. My goal for the week was to see if I could expand on his code and get it that bit further towards a perfect replication. Main areas to address are:
# let's get this out of the way
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3 v purrr 0.3.4
## v tibble 3.0.6 v dplyr 1.0.4
## v tidyr 1.1.2 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(janitor)
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
From observing Lucas’ code, he achieved this result through the use of grid.arrange() where he made each plot separately and combined them all at the very end. I was having difficulty making adjustments to this code to address most of the areas listed above so I opted to continue my former approach of using pivot_longer() followed by facet_wrap() to see whether it could help make this minor tweaks.
Thanks to some much-needed help by Jenny, I managed to figure out why my pivot_longer() function wasn’t working as intended - and it was the simple, yet infuriating fact that I just did not run the code separately to make the data frame first then apply the pivot_longer()
# let's try this again
argh <- "Covid_Data.csv" %>%
read_csv() %>%
select(age, avg_i_pos, avg_i_neg, avg_f_pos, avg_f_neg) %>%
na.omit(argh) %>%
pivot_longer(
cols = -age,
names_to = "fivalence",
values_to = "means")
##
## -- Column specification --------------------------------------------------------
## cols(
## .default = col_double(),
## ResponseId = col_character(),
## gender_all = col_character(),
## race_all = col_character(),
## educ = col_character(),
## income = col_character(),
## emp = col_character(),
## livealone = col_character(),
## gender_bin_f = col_character(),
## race_bin = col_character(),
## emp_bin = col_character()
## )
## i Use `spec()` for the full column specifications.
print(argh)
## # A tibble: 3,768 x 3
## age fivalence means
## <dbl> <chr> <dbl>
## 1 18 avg_i_pos 2.25
## 2 18 avg_i_neg 2.33
## 3 18 avg_f_pos 1.81
## 4 18 avg_f_neg 1.33
## 5 18 avg_i_pos 1.15
## 6 18 avg_i_neg 2.36
## 7 18 avg_f_pos 1.45
## 8 18 avg_f_neg 2.31
## 9 18 avg_i_pos 1.81
## 10 18 avg_i_neg 1.62
## # ... with 3,758 more rows
Next is to make it resemble this lovely depiction below.
# forming an organised data set
argh <- "Covid_Data.csv" %>%
read_csv() %>%
select(age, avg_i_pos, avg_i_neg, avg_f_pos, avg_f_neg) %>%
na.omit(argh) %>%
pivot_longer(
cols = -age,
names_to = "fivalence",
values_to = "means") %>%
separate(fivalence, c("average", "freqint", "valence", sep = "_")) %>%
select(age, freqint, valence, means) %>%
mutate(freqint2 = ifelse(freqint == "i", "Intensity of \n Emotional Experience",
ifelse(freqint == "f", "Frequency of \n Emotional Experience", freqint))) %>%
mutate(valence2 = ifelse(valence == "pos", "Positive Emotions",
ifelse(valence == "neg", "Negative Emotions", valence))) %>%
select(valence2, age, freqint2, means)
##
## -- Column specification --------------------------------------------------------
## cols(
## .default = col_double(),
## ResponseId = col_character(),
## gender_all = col_character(),
## race_all = col_character(),
## educ = col_character(),
## income = col_character(),
## emp = col_character(),
## livealone = col_character(),
## gender_bin_f = col_character(),
## race_bin = col_character(),
## emp_bin = col_character()
## )
## i Use `spec()` for the full column specifications.
## Warning: Expected 4 pieces. Missing pieces filled with `NA` in 3768 rows [1, 2,
## 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
print (argh)
## # A tibble: 3,768 x 4
## valence2 age freqint2 means
## <chr> <dbl> <chr> <dbl>
## 1 Positive Emotions 18 "Intensity of \n Emotional Experience" 2.25
## 2 Negative Emotions 18 "Intensity of \n Emotional Experience" 2.33
## 3 Positive Emotions 18 "Frequency of \n Emotional Experience" 1.81
## 4 Negative Emotions 18 "Frequency of \n Emotional Experience" 1.33
## 5 Positive Emotions 18 "Intensity of \n Emotional Experience" 1.15
## 6 Negative Emotions 18 "Intensity of \n Emotional Experience" 2.36
## 7 Positive Emotions 18 "Frequency of \n Emotional Experience" 1.45
## 8 Negative Emotions 18 "Frequency of \n Emotional Experience" 2.31
## 9 Positive Emotions 18 "Intensity of \n Emotional Experience" 1.81
## 10 Negative Emotions 18 "Intensity of \n Emotional Experience" 1.62
## # ... with 3,758 more rows
Finally, things are going well
Some of the functions that I used above (namely separate() and ifelse()) I would not know how to use or if it even existed had Jenny not introduced them to me as part of a solution to the median household income code last week so quick shoutout to her.
Now, with this table formed, the actual plotting should be relatively easy since I’ll just “borrow” some code from Lucas’ attempt.
shouldwork <- ggplot(argh, aes(x = age, y = means)) +
geom_point(alpha = 1/18) +
ylim(0, 2.5) +
theme_classic() +
theme(panel.background = element_rect(colour = "black",
size = 1/40),
strip.text.x = element_text(size = 10, face = "bold"),
strip.text.y = element_text(size = 10, face = "bold"),
strip.background = element_blank(),
axis.title.x = element_text(face = "bold")) +
geom_smooth(method = "lm", colour = "black", size = 1.5) +
facet_wrap(vars(freqint2, valence2), strip.position = "left") +
xlab("Age (Years)") +
ylab("")
print(shouldwork)
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 476 rows containing non-finite values (stat_smooth).
## Warning: Removed 476 rows containing missing values (geom_point).
What have I done
So, not quite. I can feel the frustration already starting to build so perhaps I should consult google before further butchering this.
And what does google suggest?
Use facet_grid
easypeasy <- ggplot(argh, aes(x = age, y = means)) +
geom_point(alpha = 1/18) +
ylim(0, 2.5) +
theme_classic() +
theme(panel.background = element_rect(colour = "black",
size = 1/40),
strip.text.x = element_text(size = 10, face = "bold"),
strip.text.y = element_text(size = 10, face = "bold"),
strip.placement = "outside",
strip.background = element_blank(),
axis.title.x = element_text(face = "bold")) +
geom_smooth(method = "lm", colour = "black", size = 1.5) +
facet_grid(freqint2 ~ valence2, switch = "y") +
xlab("Age (Years)") +
ylab("")
print(easypeasy)
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 476 rows containing non-finite values (stat_smooth).
## Warning: Removed 476 rows containing missing values (geom_point).
YESSSSSS
And it turned out beautifully, if I do say so myself. For the sake of comparison, here was the plot to be reproduced:
Bears quite the similarity
HOWEVER, to fit the recurring theme of the past couple of weeks, my brief moment of happiness is spoilt as I am now just realising that the actual plots themselves do not align with the ones in the study. For example, in the negative emotions/intensity graph, the 20 year old data line starts at around 2.0 units but in the one I produced, it starts below that at around 1.8 units.
Yet another premature celebration, I guess.
To start off on a positive note, I managed to make the final few tweaks on Lucas’ plot to get a very similar graph to the one presented in the COVID paper. However, I later realised the data in the graph was off and now my day is ruined.
Some challenges I encountered for this week was figuring out the use of the facet_grid function. As I mentioned earlier, I intended to use facet_wrap and I actually did use that initially but it turned out awful and no matter how long and how much I continued fiddling around with it, it simply did not want to cooperate. Upon switching to facet_grid it started to sing for me but it did take some time to familiarise myself with the functions and arguments and such before reaching the graph above.
First, I am going to pretend that I did not notice the misalignment with the data and celebrate with a long-awaited nap. Though I inevitably need to address this issue so I will ask my fellow group members for help and see if they can figure it out. Currently, my suspicions are of the geom_smooth and geom_point functions since their use elicited a short string of messages that make mention of omitting some rows of data that contain “non-finite” values.