Goals

  1. Improve our code from last week
  2. Get started on plotting

set up

load packages

library(tidyverse) # for readr, dplyr and ggplot
library(here) # to help with specifying file paths
library(papaja) # for theme_apa()  
library(ggeasy) # for easy customisation of ggplots

read the data

sdmp <- read_csv(here("data", "ReplicationProcessedfinaldata04202018.csv")) 
mahaD <- read_csv(here("data", "MahaDandCIs.csv"))

improving our code from last week

rounding things

When calculating our participant demographics and means/SDs last week, we got many more decimal places than we need.

sdmp %>% 
  filter(sex == "0") %>% 
  summarise(count = n(), percent =  100 * n()/ nrow(sdmp)) 
## # A tibble: 1 x 2
##   count percent
##   <int>   <dbl>
## 1  7909    54.9

To round to just 2 decimal places, we can use round() from base R. To use this function, we write round(x, number of digits), where x is the value we want to round. It was a little difficult to first learn to use round() in an expression with multiple parts and brackets. At first I put the number of digits in the wrong place, like this, and I was stumped about why I was not getting the correct output.

sdmp %>% 
  filter(sex == "0") %>% 
  summarise(count = n(), percent =  round(100 * n()/ nrow(sdmp)), 2)
## # A tibble: 1 x 3
##   count percent   `2`
##   <int>   <dbl> <dbl>
## 1  7909      55     2

But I eventually got the right code by looking at the examples in the R Documentation and thinking carefully about what “x” is in my context - 100 * n()/ nrow(sdmp).

sdmp %>% 
  filter(sex == "0") %>% 
  summarise(count = n(), percent =  round(100 * n()/ nrow(sdmp), 2))
## # A tibble: 1 x 2
##   count percent
##   <int>   <dbl>
## 1  7909    54.9

Perhaps if I had tried applying round() in different and more simple contexts (e.g., rounding a single number), I would have been able to troubleshoot this issue more quickly.

recoding things

We were unsure about why we need backticks around the variable values when using recode(). We googled this and asked our instructors but were unable to find a clear answer. Therefore we followed Jenny S’s advice to use case_when() from the dplyr package instead of recode(). Here, we are telling R that if sex is exactly equal to 0, it should be recoded to female, and if sex is exactly equal to 1, it should be recoded to male.

sdmp <- sdmp %>% 
  mutate(sex = case_when(sex == 0 ~ "female", 
                         sex == 1 ~ "male"))

making figures

figure 2: actual partner age

Our goal is to reproduce the following plot from page 418 of our paper.

We are starting with figure 2 because we already have the variables we need (participant age & age difference compared to partner) in our dataframe.

make the age difference dataframe

We already did this last week, but here I am doing it again.

age_diff_sdmp <- sdmp %>%
  filter(mate_age > 10) %>%
  mutate(age_difference = mate_age - age) %>%
  relocate(age_difference, .after = mate_age)

attempt 1

We are using the ggplot2 package because it provides a simple way to create complex plots. We are specifying the data age_diff_sdmp because this is the dataframe that has our age_difference variable, which excludes participants who reported mate ages of 10 or under. The mapping argument, together with the function aes() refers to the aesthetic mappings of the plot. They tell R which characteristics of the data correspond to which visual characteristics of the plot. Here, we are specifying that the age variable goes on the x-axis, and the age_difference variable goes on the y-axis. We are also specifying that the colour of the plot should correspond to the sex variable (as the data for females and males are in different colours).

In ggplot2, we use a geom_ (geometric object) function to specify the type of plot that we want. geom_point() is used to create a scatterplot. We can use + to add components to a plot (+ is like the ggplot equivalent of the pipe %>%). Here we are adding another layer to our plot, geom_smooth(), which is a smoothing line that helps us more easily see the patterns in the data. We are also adding a horizontal reference line using geom_hline() specifying that it should lie at 0 on the y-axis with yintercept = 0. We are using theme_apa() from the papaja package to remove the gridlines and grey background, making the plot look cleaner.

scale_x_continuous() and scale_y_continuous() refer to the aesthetics of the x and y axes, which are both continuous variables. The name argument allows us to specify the label of the axes. The breaks argument allows us to adjust the tick marks on the plot. We learned that we can use the seq() function from base R within this argument. seq refers to a sequence. from and to are the starting and ending points of the sequence. They specify the minimum and maximum values of the tick marks. by is the increment of the sequence; here, we want each tick mark to go up by 10.

fig2 <- ggplot(data = age_diff_sdmp, 
  mapping = aes(
    x = age, 
    y = age_difference, 
    colour = sex 
  )) + 
  geom_point() + 
  geom_smooth() + 
  geom_hline(yintercept = 0) + 
  theme_apa() + 
  scale_y_continuous(name = "Age Difference (Partner - Self)", 
                     breaks = seq(from = -50, to = 30, by = 10)) + 
  scale_x_continuous(name = "Participant Age (Years)", 
                     breaks = seq(from = 20, to = 90, by = 10)) 

print(fig2)

The above plot is pretty good, but some parts look incorrect. In particular, the smoothing lines look quite different. We also know from the figure caption that the shaded areas of the smoothing lines should indicate 95% confidence intervals. Perhaps we need to calculate these confidence intervals.

attempt 2

We consulted the authors’ script to adjust the plot. Although the authors used qplot() which is a different and quicker way of writing ggplot code, we could still pick out the key elements they used. We are updating our code to reflect that they used size 3 and the loess method in their smoothing lines. Loess is a statistical method of drawing smoothing lines which uses a t-based approximation. We also realised that they used theme_classic() rather than theme_apa(). We made adjustments within scale_colour_manual(), which refers to the colour aesthetic. It is “manual” because we are setting the colours manually. The legend tells us about the colour aesthetic. We are making the legend title blank using the name argument. We are also capitalising the labels. We use the notation c() to depict a vector, or a list of elements of the same type. Finally, we are using the values argument to set the colours themselves to the ones the authors used.

fig2 <- ggplot(data = age_diff_sdmp, 
  mapping = aes(
    x = age, 
    y = age_difference, 
    colour = sex 
  )) + 
  geom_point() + 
  geom_smooth(
    size = 3, 
    method = "loess") +  
  geom_hline(yintercept = 0) + 
  theme_classic() + 
  scale_y_continuous(
    name = "Age Difference (Partner - Self)", 
    breaks = seq(from = -50, to = 30, by = 10)) + 
  scale_x_continuous(
    name = "Participant Age (Years)", 
    breaks = seq(from = 20, to = 90, by = 10)) + 
  scale_colour_manual(
    name="",  
    labels=c("Female","Male"),
    values = c("dodgerblue", "limegreen"))  


print(fig2)

The plot is looking much nicer. Both the smoothing lines and the shaded areas now look the same as the plot in the paper. We realised (via googling and looking at the R Documentation) that by default, the shaded areas of smoothing lines in R depict 95% confidence intervals. This is good news, because we do not need to calculate the confidence intervals ourselves, as we initially thought.

However, our plot doesn’t have some semi-transparent dots like in the paper. It would also be better to remove the legend title in a more elegant way, since keeping the name blank seems a little like cheating.

attempt 3

I saw the line alpha=I(.5) in the authors’ script, and I know that alpha is used to modify transparency. Alpha ranges from 0 to 1, with lower values corresponding to greater transparency. We ran alpha = 0.5 within the geom_point() object, since we only want the scatterplot to have transparency, and got the aesthetic we wanted.

It took me a while to figure out an alternative way to remove the legend title. I found an article that seemingly covered everything to do with legends, except what I wanted. However, a few days later, I googled “remove legend title ggeasy” because I know that the ggeasy package allows you to easily customise plots. I then got the (very simple) solution I wanted straightaway, which is easy_remove_legend_title().

To finish off this plot, we are putting the figure caption as text below the figure.

fig2 <- ggplot(data = age_diff_sdmp, 
  mapping = aes(
    x = age, 
    y = age_difference, 
    colour = sex 
  )) + 
  geom_point(alpha = 0.5) + 
  geom_smooth(
    size = 3, 
    method = "loess") +  
  geom_hline(yintercept = 0) + 
  theme_classic() + 
  scale_y_continuous(
    name = "Age Difference (Partner - Self)", 
    breaks = seq(from = -50, to = 30, by = 10)) + 
  scale_x_continuous(
    name = "Participant Age (Years)", 
    breaks = seq(from = 20, to = 90, by = 10)) + 
  scale_colour_manual(
    labels=c("Female","Male"),
    values = c("dodgerblue", "limegreen")) +   
  easy_remove_legend_title()


print(fig2)

Fig. 2 Age difference between participants and their partners as a function of participants’ ages, separately for female and male participants. Data are jittered to reduce overplotting. Trend lines were generated by locally estimated scatterplot smoothing to illustrate the pattern of the data. Shaded areas indicate 95% confidence intervals.

figure 3: maha D

Our next goal is to reproduce the following plot from page 419 of our paper.

We are attempting this plot next because looking at the mahaD dataframe, we can see that all the values we need have been given to us.

attempt 1

For this plot we are layering geom_point() (scatterplot) and geom_errorbarh() (horizontal error bars). We are setting the x axis to correspond to the D values, and the y axis to correspond to country. The colour also corresponds to country, as we can see in the graph that each country has a different colour. According to the figure caption, we are setting the errorbars to depict the confidence intervals. The lines go from the lower bound (DlowCI) to the upper bound (DhighCI).

Instead of a horizontal line, we are now adding a vertical line at x = 0 with geom_vline(). labs() allows us to change the x and y axis labels, and we used it because it’s shorter to type than the scale function we used above. Also, unlike before, we do not need to change anything else on the x or y axes, so it’s easier to use labs(). Finally we are using easy_remove_legend() from the ggeasy package to easily remove the legend.

ggplot(data = mahaD) +
  geom_point(mapping = aes(x = D, 
                           y = country, 
                           colour = country)) + 
  geom_vline(xintercept = 0, size = 1) +
  geom_errorbarh(aes(y = country, 
                     xmin = DlowCI, xmax = DhighCI,  
                     colour = country)) + 
  labs(x = "Mahalanobis D", y = "Country") + 
  theme_classic() + 
  easy_remove_legend() 

attempt 2

In our second attempt, we are making the dots bigger to better resemble the authors’ plot, italicising the “D” following the instructions here, and making the code more concise. Again, we are adding the figure caption as text below the plot.

ggplot(
  data = mahaD, 
  mapping = aes(
    x = D, 
    y = country, 
    colour = country 
  )
) +
  geom_point(size = 3) + 
  geom_vline(xintercept = 0, size = 1) + 
  geom_errorbarh(
    mapping = aes(xmin = DlowCI, xmax = DhighCI)) + 
  labs(
    x = expression(paste("Mahalanobis ", italic("D"))), 
    y = "Country") + 
  theme_classic() + 
  easy_remove_legend() 

Fig. 3. Mahalanobis distance (D) for each country in the current data set. Larger D values indicate more sex differentiation in the overall pattern of mate preferences. Error bars represent bootstrapped 95% confidence intervals (CIs). Ds and CIs for each country are color coded differently for easier readability.

Challenges/successes

With Jenny S’s help, we were able to address last week’s challenges of not being able to round the values in our summary statistics, and not really understanding recode(). We also successfully reproduced two out of our four plots quite quickly this week, using our ggplot knowledge and coding like caterpillars.

However, a major challenge this week was looking into the authors’ script to get the variables we need for figure 1 and figure 4. Their script had over a thousand lines of code, and I could not run the parts we needed for our plots as I kept getting an error message, even though Jenny S and my teammate Ayesha did not get this error message. At the very end of the week, I tried downloading a fresh copy from the OSF and used read.csv() to read in the data instead of here() and read_csv(). And the script ran! I am not sure what exactly did the trick but I am very relieved, because I spent A LOT of time this week just wrestling with this script.

Another challenge was that it took me a while to figure out how to make some of the dots transparent in figure 2. I first tried running the line alpha=I(.5) in our plot, which didn’t work.This is because I() is specifically used in qplot() to set aesthetics. I was also stumped because I thought that just putting alpha = 0.5 in our code would make all of the points half-transparent, and the authors only made some of their points transparent. However, after running the alpha = 0.5 line anyway, we succeeded. It turns out that for large datasets, the alpha argument will make the points transparent not according to certain dots, but in a way that just makes the plot easier to see. It took a few days for me to realise this, which taught me that I should try all the (reasonable) options that I can think of when troubleshooting in R, even if I feel that it won’t work.

Next steps

Next, we will try to reproduce figure 1 and figure 4. We will use the authors’ script for the data analysis, but write our own code for the plots themselves.