Our team managed to successfully reproduce the descriptive statistics reported in the original paper, as well the means and SDs. Therefore, our goal this week was to reproduce all (or most) of the plots/figures in the paper.
Initially we weren’t quite sure how to do this as our data set seemed to be missing variables that would be needed for the plots. After discussing this issue with Jenny Sloane in the coding Q&A session, she suggested that perhaps we could write a new csv file from their r script and use the data they had prepared for the plots. That seemed like a good fix and it simplified the process greatly, as all the variables were included and we could begin.
Well…coding is a journey indeed! Maybe my ego was getting the best of me because we managed to do so well thus far and I thought that it would stay that way, but this week has been SO MUCH MORE DIFFICULT.
As Janice from Friends would say…..
“OH MY GOD”
“We” (though really, credit goes to Lucy and Sakiko) were able to reproduce two figures so far! The maha D plot is more aesthetic so I’m including it here:
The ‘mlbd’ refers to the ‘MahaDandCIs.csv’ file we downloaded from OSF. Essentially, this code is using that data and mapping the countries by their Mahalanobis D value.
The function colour() is being used to differentiate the various countries by colour (ie. colour code the countries). Geom_point() is being used to create a scatterplot, with the xlim() shortcut used to specify the limits for the x axis. This was important as the scale began at 0 and we wanted it to start at 0.5. Geom_vline() was used to add a vertical line in the 0 intercept on the x axis, as seen in the figure in the paper. Geom_errorbarh() was used to add horizontal error bars to the plot, with the aesthetic mappings being used to add the CI limits as the error bars. Labs() refers to the labels we added to the plot, and the additional theme() related code was used to clean up the plot a bit and make it resemble the figure in the paper more.
Not entirely perfect but getting there! Lucy has a good base uploaded on our shared file and although further work is likely needed (maybe changing the font etc.) but at this point, it’s more of an aside.
# load relevant packages
library(tidyverse)
library(ggplot2)
library(lmerTest)
library(psych)
library(gridExtra)
library(ggpubr)
library(tidyr)
library(ggeasy)
library(extrafont)
library(patchwork)
# read in the file: MahaDandCIs.csv (available to be downloaded at the open science framework)
# Mahalanobis D on the basis of all five of the preference variables
mlbd <- read.csv("MahaDandCIs.csv")
# Code for figure 3
plot_three <- ggplot(
data = mlbd,
mapping = aes(
x = D,
y = country,
colour = country
)
) +
geom_point(size = 2) +
xlim(0,2) +
geom_vline(xintercept = 0, size = 1) +
geom_errorbarh(
mapping = aes(xmin = DlowCI, xmax = DhighCI)) +
labs(
x = expression(paste("Mahalanobis ", italic("D"))),
y = "Country") +
theme_classic() +
easy_remove_legend()
print(plot_three)
For the remaining plots, we were stumped. We weren’t sure where specific variables needed for the plots were, like ‘gepca’ scores.
As Jenny suggested, I created a new csv file with the prepared data from lines 1-71 from their R script. However, when I began working on Figure 1, I realised that certain variables hadn’t been included. After inspecting the data and then referring back to their r script, I realised that they had included new variables in their ‘analyses’ section of code. These variables were necessary for figure 1, so I had to repeat the process again to include all of the variables needed.
You can see the difference in the number of variables between the original data and the data I wrote from their r script.
# Read in original data file
sdmp <- read.csv("ReplicationProcessedfinaldata04202018.csv")
glimpse(sdmp)
## Rows: 14,399
## Columns: 35
## $ PIN <int> 12506, 1997, 1448, 10625, 6106, 4078, 3034, 5281, 1…
## $ CIN <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ continent <chr> "Africa", "Africa", "Africa", "Africa", "Africa", "…
## $ country <chr> "Algeria", "Algeria", "Algeria", "Algeria", "Algeri…
## $ city <chr> "Algiers", "Algiers", "Setif", "Setif", "Algiers", …
## $ countrycode <int> 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24,…
## $ partnum <chr> "85", "72", "277", "229", "23", "82", "86", "135", …
## $ partcode <chr> "A85", "A72", "SB277", "S229", "A23", "A82", "A86",…
## $ sample <int> 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 2, 1, 1, …
## $ sex <int> 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ age <int> 21, 22, 47, 20, 23, 20, 22, 27, 19, 19, 19, 28, 22,…
## $ religious <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ religion <chr> "islam", "islam", "Islam", "Islam", "islam", "islam…
## $ relstat <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ relstat2 <int> 2, 2, 4, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 2, 2, 2, 3, …
## $ rellength <dbl> 28, NA, 307, NA, NA, 14, 14, 9, NA, NA, NA, 14, NA,…
## $ ideal_intelligence <int> 3, 7, 5, 4, 7, 6, 6, 7, 5, 5, 4, 6, 6, 7, 6, 7, 4, …
## $ ideal_kindness <int> 7, 7, 5, 7, 7, 7, 7, 7, 7, 5, 7, 6, 7, 7, 7, 7, 5, …
## $ ideal_health <int> 6, 7, 5, 7, 7, 7, 7, 7, 5, 7, 7, 6, 7, 7, 6, 4, 5, …
## $ ideal_physatt <int> 4, 7, 5, 7, 7, 7, 7, 6, 7, 5, 5, 6, 6, 7, 6, 2, 6, …
## $ ideal_resources <int> 1, 6, 5, 4, 7, 7, 7, 6, 4, 5, 4, 6, 5, 7, 7, 1, 4, …
## $ mate_age <int> 16, 17, 17, 17, 18, 18, 18, 18, 18, 18, 18, 19, 19,…
## $ popsize <int> 39667, 39667, 39667, 39667, 39667, 39667, 39667, 39…
## $ country_religion <chr> "Muslim", "Muslim", "Muslim", "Muslim", "Muslim", "…
## $ lattitude <dbl> 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28,…
## $ gem1995 <dbl> 0.266, 0.266, 0.266, 0.266, 0.266, 0.266, 0.266, 0.…
## $ gdi1995 <dbl> 0.508, 0.508, 0.508, 0.508, 0.508, 0.508, 0.508, 0.…
## $ gii <dbl> 0.429, 0.429, 0.429, 0.429, 0.429, 0.429, 0.429, 0.…
## $ gdi2015 <dbl> 0.854, 0.854, 0.854, 0.854, 0.854, 0.854, 0.854, 0.…
## $ gggi <dbl> 0.642, 0.642, 0.642, 0.642, 0.642, 0.642, 0.642, 0.…
## $ gdp_percap <int> 15100, 15100, 15100, 15100, 15100, 15100, 15100, 15…
## $ infect_death <dbl> 7.8, 7.8, 7.8, 7.8, 7.8, 7.8, 7.8, 7.8, 7.8, 7.8, 7…
## $ infect_yll <dbl> 406.6, 406.6, 406.6, 406.6, 406.6, 406.6, 406.6, 40…
## $ cmc_yll <dbl> 2039.5, 2039.5, 2039.5, 2039.5, 2039.5, 2039.5, 203…
## $ gb_path <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#Read in prepared data file
data <- read_csv("walter_plot_data.csv")
glimpse(data)
## Rows: 14,399
## Columns: 46
## $ PIN <dbl> 12506, 1997, 1448, 10625, 6106, 4078, 3034, 5281, …
## $ CIN <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ continent <chr> "Africa", "Africa", "Africa", "Africa", "Africa", …
## $ country <chr> "Algeria", "Algeria", "Algeria", "Algeria", "Alger…
## $ city <chr> "Algiers", "Algiers", "Setif", "Setif", "Algiers",…
## $ countrycode <dbl> 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24…
## $ partnum <chr> "85", "72", "277", "229", "23", "82", "86", "135",…
## $ partcode <chr> "A85", "A72", "SB277", "S229", "A23", "A82", "A86"…
## $ sample <dbl> 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 2, 1, 1,…
## $ sex <dbl> 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ age <dbl> 21, 22, 47, 20, 23, 20, 22, 27, 19, 19, 19, 28, 22…
## $ religious <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ religion <chr> "islam", "islam", "Islam", "Islam", "islam", "isla…
## $ relstat <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ relstat2 <dbl> 2, 2, 4, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 2, 2, 2, 3,…
## $ rellength <dbl> 28, NA, 307, NA, NA, 14, 14, 9, NA, NA, NA, 14, NA…
## $ ideal_intelligence <dbl> 3, 7, 5, 4, 7, 6, 6, 7, 5, 5, 4, 6, 6, 7, 6, 7, 4,…
## $ ideal_kindness <dbl> 7, 7, 5, 7, 7, 7, 7, 7, 7, 5, 7, 6, 7, 7, 7, 7, 5,…
## $ ideal_health <dbl> 6, 7, 5, 7, 7, 7, 7, 7, 5, 7, 7, 6, 7, 7, 6, 4, 5,…
## $ ideal_physatt <dbl> 4, 7, 5, 7, 7, 7, 7, 6, 7, 5, 5, 6, 6, 7, 6, 2, 6,…
## $ ideal_resources <dbl> 1, 6, 5, 4, 7, 7, 7, 6, 4, 5, 4, 6, 5, 7, 7, 1, 4,…
## $ mate_age <dbl> 16, 17, 17, 17, 18, 18, 18, 18, 18, 18, 18, 19, 19…
## $ popsize <dbl> 39667, 39667, 39667, 39667, 39667, 39667, 39667, 3…
## $ country_religion <chr> "Muslim", "Muslim", "Muslim", "Muslim", "Muslim", …
## $ lattitude <dbl> 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28…
## $ gem1995 <dbl> 0.266, 0.266, 0.266, 0.266, 0.266, 0.266, 0.266, 0…
## $ gdi1995 <dbl> 0.508, 0.508, 0.508, 0.508, 0.508, 0.508, 0.508, 0…
## $ gii <dbl> 0.429, 0.429, 0.429, 0.429, 0.429, 0.429, 0.429, 0…
## $ gdi2015 <dbl> 0.854, 0.854, 0.854, 0.854, 0.854, 0.854, 0.854, 0…
## $ gggi <dbl> 0.642, 0.642, 0.642, 0.642, 0.642, 0.642, 0.642, 0…
## $ gdp_percap <dbl> 15100, 15100, 15100, 15100, 15100, 15100, 15100, 1…
## $ infect_death <dbl> 0.000196637, 0.000196637, 0.000196637, 0.000196637…
## $ infect_yll <dbl> 0.01025033, 0.01025033, 0.01025033, 0.01025033, 0.…
## $ cmc_yll <dbl> 0.05141553, 0.05141553, 0.05141553, 0.05141553, 0.…
## $ gb_path <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ agediff <dbl> -5, -5, -30, -3, -5, -2, -4, -9, -1, -1, -1, -9, -…
## $ pathindex <dbl> -0.1635533, -0.1635533, -0.1635533, -0.1635533, -0…
## $ gepcascores <dbl> -1.298351, -1.298351, -1.298351, -1.298351, -1.298…
## $ zagediff <dbl> -0.66236022, -0.66236022, -4.37646742, -0.36523164…
## $ zideal_resources <dbl> -3.5742767, 0.5722421, -0.2570617, -1.0863654, 1.4…
## $ zideal_intelligence <dbl> -3.07225356, 1.05405819, -1.00909768, -2.04067562,…
## $ zideal_kindness <dbl> 0.8156212, 0.8156212, -1.1846226, 0.8156212, 0.815…
## $ zideal_health <dbl> -0.0523192, 0.8896893, -0.9943277, 0.8896893, 0.88…
## $ zideal_physatt <dbl> -1.5139923, 1.1689482, -0.6196788, 1.1689482, 1.16…
## $ zcmc_yll <dbl> 0.09657417, 0.09657417, 0.09657417, 0.09657417, 0.…
## $ zpathindex <dbl> -0.163697, -0.163697, -0.163697, -0.163697, -0.163…
Having this data frame is helpful however, there remains issues…
This is really confusing me because like I said before, I haven’t made any edits to their code. I just ran it so it should produce the same plot but it isn’t. The only explanation I can think of is that perhaps they’ve edited their plot further and neglected to include that code? I’m not too sure however, that doesn’t seem like it would be in line with open science so there might be another explanation.
I’ll continue working on it and hopefully I can figure it out- if not, I’m sure someone from my group will be able to. After all, we have four minds working on it.
EXAMPLE OF FAULTY CODE:
recode_sdmp %>%
group_by(sex) %>%
summarise(count = n(), percent = 100 * n()/ nrow(sdmp)) %>%
round(c(54.92743, 45.07257), 2)
This prompted an error message to appear every time- "Error in recode_sdmp %>% group_by(sex) %>% summarise(count = n(), percent = 100 *: 3 arguments passed to ‘round’ which requires 1 or 2 arguments".
I wasn’t able to find much help online but after playing around with the code for a bit, I realised that perhaps piping it in with the other lines of code was the issue. This would explain the error message saying 3 arguments were passed to round.
After removing the pipe, the code worked. However, when knitting the code, the tibble with the rounded percentages appears seperately from the rest despite being in the same chunk. Not too sure how to resolve that but if there’s time in the end (hopefully after all the figures are produced), I can work on that.
As mentioned above, a key issue is that the plot doesn’t look the same despite the fact that the code is the exact same. The aesthetic features look different (font, size of the dots, perhaps spacing even?), and the plot itself seems to be different too. Certain dots/data points in the figure seem to be missing from the plot produced by the code. Is there any explanation you can think of for this?
Using ‘case_when’ (although Sakiko will be working with this more on Sunday!)