To continue working on plot 4 and (hopefully) reproduce it!!
To continue developing my verification report.
In terms of goal 1 (ie. reproducing plot 4), one of the key issues was removing the outlier. Initially, I thought that perhaps the data points not matching up with the figure was because we weren’t using a seed (like in figure 1). I added the line of code that included the seed but it didn’t seem to make a difference at all.
Loading the relevant packages and viewing the data set.
# Load relevant packages
library(tidyverse)
library(ggplot2)
library(lmerTest)
library(psych)
library(gridExtra)
library(ggpubr)
library(tidyr)
library(ggeasy)
library(here)
library(utils)
library(emo)
library(kableExtra)
# View new data frame
data <- read_csv("walter.csv")
glimpse(data)
## Rows: 14,399
## Columns: 46
## $ PIN <dbl> 12506, 1997, 1448, 10625, 6106, 4078, 3034, 5281, …
## $ CIN <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ continent <chr> "Africa", "Africa", "Africa", "Africa", "Africa", …
## $ country <chr> "Algeria", "Algeria", "Algeria", "Algeria", "Alger…
## $ city <chr> "Algiers", "Algiers", "Setif", "Setif", "Algiers",…
## $ countrycode <dbl> 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24…
## $ partnum <chr> "85", "72", "277", "229", "23", "82", "86", "135",…
## $ partcode <chr> "A85", "A72", "SB277", "S229", "A23", "A82", "A86"…
## $ sample <dbl> 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 2, 1, 1,…
## $ sex <dbl> 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ age <dbl> 21, 22, 47, 20, 23, 20, 22, 27, 19, 19, 19, 28, 22…
## $ religious <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ religion <chr> "islam", "islam", "Islam", "Islam", "islam", "isla…
## $ relstat <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ relstat2 <dbl> 2, 2, 4, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 2, 2, 2, 3,…
## $ rellength <dbl> 28, NA, 307, NA, NA, 14, 14, 9, NA, NA, NA, 14, NA…
## $ ideal_intelligence <dbl> 3, 7, 5, 4, 7, 6, 6, 7, 5, 5, 4, 6, 6, 7, 6, 7, 4,…
## $ ideal_kindness <dbl> 7, 7, 5, 7, 7, 7, 7, 7, 7, 5, 7, 6, 7, 7, 7, 7, 5,…
## $ ideal_health <dbl> 6, 7, 5, 7, 7, 7, 7, 7, 5, 7, 7, 6, 7, 7, 6, 4, 5,…
## $ ideal_physatt <dbl> 4, 7, 5, 7, 7, 7, 7, 6, 7, 5, 5, 6, 6, 7, 6, 2, 6,…
## $ ideal_resources <dbl> 1, 6, 5, 4, 7, 7, 7, 6, 4, 5, 4, 6, 5, 7, 7, 1, 4,…
## $ mate_age <dbl> 16, 17, 17, 17, 18, 18, 18, 18, 18, 18, 18, 19, 19…
## $ popsize <dbl> 39667, 39667, 39667, 39667, 39667, 39667, 39667, 3…
## $ country_religion <chr> "Muslim", "Muslim", "Muslim", "Muslim", "Muslim", …
## $ lattitude <dbl> 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28…
## $ gem1995 <dbl> 0.266, 0.266, 0.266, 0.266, 0.266, 0.266, 0.266, 0…
## $ gdi1995 <dbl> 0.508, 0.508, 0.508, 0.508, 0.508, 0.508, 0.508, 0…
## $ gii <dbl> 0.429, 0.429, 0.429, 0.429, 0.429, 0.429, 0.429, 0…
## $ gdi2015 <dbl> 0.854, 0.854, 0.854, 0.854, 0.854, 0.854, 0.854, 0…
## $ gggi <dbl> 0.642, 0.642, 0.642, 0.642, 0.642, 0.642, 0.642, 0…
## $ gdp_percap <dbl> 15100, 15100, 15100, 15100, 15100, 15100, 15100, 1…
## $ infect_death <dbl> 0.000196637, 0.000196637, 0.000196637, 0.000196637…
## $ infect_yll <dbl> 0.01025033, 0.01025033, 0.01025033, 0.01025033, 0.…
## $ cmc_yll <dbl> 0.05141553, 0.05141553, 0.05141553, 0.05141553, 0.…
## $ gb_path <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ agediff <dbl> -5, -5, -30, -3, -5, -2, -4, -9, -1, -1, -1, -9, -…
## $ pathindex <dbl> -0.1635533, -0.1635533, -0.1635533, -0.1635533, -0…
## $ gepcascores <dbl> -1.298351, -1.298351, -1.298351, -1.298351, -1.298…
## $ zagediff <dbl> -0.66236022, -0.66236022, -4.37646742, -0.36523164…
## $ zideal_resources <dbl> -3.5742767, 0.5722421, -0.2570617, -1.0863654, 1.4…
## $ zideal_intelligence <dbl> -3.07225356, 1.05405819, -1.00909768, -2.04067562,…
## $ zideal_kindness <dbl> 0.8156212, 0.8156212, -1.1846226, 0.8156212, 0.815…
## $ zideal_health <dbl> -0.0523192, 0.8896893, -0.9943277, 0.8896893, 0.88…
## $ zideal_physatt <dbl> -1.5139923, 1.1689482, -0.6196788, 1.1689482, 1.16…
## $ zcmc_yll <dbl> 0.09657417, 0.09657417, 0.09657417, 0.09657417, 0.…
## $ zpathindex <dbl> -0.163697, -0.163697, -0.163697, -0.163697, -0.163…
Creating a country list
countrylist <- unique(data$country)
Running lines (1174-1188) from the author’s script (see ‘gender equality analyses’ section in the author’s r script)
#### Gender Equality #####
#Find the value of each gender equality variable for each country
gem1995av<-tapply(data$gem1995,data$country,mean)
gdi1995av<-tapply(data$gdi1995,data$country,mean)
giiav<-tapply(data$gii,data$country,mean)
gdi2015av<-tapply(data$gdi2015,data$country,mean)
gggiav<-tapply(data$gggi,data$country,mean)
ge_compav<-tapply(data$gepcascores,data$country,mean)
#new dataframe with national level gender equality variables separated by sex
ge_plotdata<-data.frame("country"=rep(countrylist,2),"sex"=rep(c("Female","Male"),each=45),"gem1995"=rep(gem1995av,2),"gdi1995"=rep(gdi1995av,2),"gii"=rep(giiav,2),"gdi2015"=rep(gdi2015av,2),"gggi"=rep(gggiav,2),"ge_comp"=rep(ge_compav,2))
#make dataframe for age difference
age_ge_plotdata <- data.frame(ge_plotdata,"agedif"=c(tapply(data$agediff[data$sex==0],data$country[data$sex==0],function(x) mean(x,na.rm=T)),tapply(data$agediff[data$sex==1],data$country[data$sex==1],function(x) mean(x,na.rm=T))))
Creating the plot with the seed!
plot_four <- ggplot(data = age_ge_plotdata,
mapping = aes(
x = ge_comp,
y = agedif,
colour = sex
)) +
geom_point(position = position_jitter(seed = 1302019), size = 1.5) +
geom_smooth(
method = "lm",
size = 2) +
ylim(-5, 5) +
theme_classic() +
labs(
x = "Gender-Equality Composite",
y = "Age Difference (Partner-Self)") +
scale_colour_manual(
values = c("dodgerblue", "limegreen")) +
easy_remove_legend_title()
print(plot_four)
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 14 rows containing non-finite values (stat_smooth).
## Warning: Removed 14 rows containing missing values (geom_point).
Sakiko had mentioned that she made progress in terms of that and had documented that progress within her learning log. I took a quick look and the general gist of it seemed to be filtering out the outlier and then using that new filtered data frame to create the plot. With that knowledge, I tried to replicate her replication (I guess? hahaha)
To find the outlier, I’m starting off by filtering the females because the data point is blue and female scores are represented by the blue colour. I’m specifying that the data needs to be arranged according to ‘agedif’ as I’m looking for values with a large negative age difference. The select() function is being used to select all rows aside from gem1995 to ggi, as these rows contain irrelevant gender equality variables. Head() from the utils package is then being used to see the first few results from the output.
Unrelated but shout-out to Sakiko for finding head(), it’s definitely a useful function that I’ll use a lot in the future!
outlier <- age_ge_plotdata %>%
filter (sex == "Female") %>% # again, '==' refers to 'exactly equal to'
arrange(agedif) %>%
select(!gem1995:gggi)
head(outlier)
## country sex ge_comp agedif
## 1 Malaysia Female NA -8.93406593
## 2 Turkey Female -0.8874625 -4.37100737
## 3 Belgium Female 0.8974179 -0.27325581
## 4 Peru Female -0.3319754 -0.08235294
## 5 Norway Female 1.6578799 0.46666667
## 6 Lithuania Female 1.0676906 0.81521739
Malaysia is the top result however, it cannot be the outlier as it has a missing ge_comp value and therefore isn’t part of the scatterplot. The second result is Turkey and it has similar ge_comp and agedif values to the outlier so I’m filtering it out by creating a new dataframe without it. However, issues arise as this doesn’t work well for numeric variables.
The number ‘-4.37100737’ refers to the agedif value of the outlier.
filter_plot4_data <- age_ge_plotdata %>%
filter(agedif != -4.37100737)
Another approach would be to filter according to country and sex, as the outlier is the only female from Turkey. However, this doesn’t work very well as it removes anyone from Turkey and any females.
filter_plot4_data <- age_ge_plotdata %>%
filter(!country == "Turkey" & !sex == "Female")
The approach I finally settled on was to filter for people not from Turkey and people who are not female. This works well because it removes anyone from Turkey who is female, and can therefore sucessfully remove/filter out the outlier who IS from Turkey and female. In the code, ‘|’ means “or”.
filter_plot4_data <- age_ge_plotdata %>%
filter(!country == "Turkey" | !sex == "Female")
Now I’m trying out the new dataframe when creating the plot.
From the ggplot2 package, the aes() function is being used to specify how the variables will be mapped, with the ge_comp variable (ie. gender equality composite score) being on the x axis and the agedif variable (ie. age difference) on the y axis. Again, age difference refers to the difference between the participants age and their mates age. Colour is being used to differentiate between each sex.
The plot has two key elements: geom_point() is being used to create a scatterplot, whereas geom_smooth() adds a smoothing line and helps to see patterns within the plot. Method = “lm” is being used to fit a linear model (ie. a straight line), with size specifying the weighting of the smoothing line.
The theme is specified with theme_classic() and the labs() function is being used to rename the x and y axis labels. Scale_colour_manual() refers to the colour aesthetic which we’ve set manually to match the figure reported in the paper better. Easy_remove_legend_title from the ggeasy package is also being used to remove the legend title.
plot_four <- ggplot(data = filter_plot4_data,
mapping = aes(
x = ge_comp,
y = agedif,
colour = sex
)) +
geom_point() +
geom_smooth(
method = "lm",
size = 2) +
ylim(-5, 5) +
theme_classic() +
labs(
x = "Gender-Equality Composite",
y = "Age Difference (Partner-Self)") +
scale_colour_manual(
values = c("dodgerblue", "limegreen")) +
easy_remove_legend_title()
print(plot_four)
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 14 rows containing non-finite values (stat_smooth).
## Warning: Removed 14 rows containing missing values (geom_point).
Overall, it doesn’t look too bad!! I’m glad the outlier is gone however, it still isn’t exact. I’ve been sat here for about ten minute comparing this plot with the figure reported in the paper and there still seems to be some variability within the data points/overall scatterplot. I’m not too sure how to fix that or even if we have the time to fix that- it took us a solid three weeks to be able to get this far! Ultimately, I think this is the best we can do in terms of reproducing the figure. It’s a bit disheartening because ultimately, you want to create a plot that looks close to the one reported in the paper, but it just isn’t realistic sometimes.
In terms of goal 2 (ie. making progress in my vr report), I think I’ve done well! Sections 1 and 2 are now all done, with sections 3 and 4 left to go! I’ve made some progress in terms of developing questions I would like to be able to answer however, I might have to wait for next week’s q&a session! Jenny mentioned that we’d be learning more about how to conduct basic statistical tests etc., so I’m looking forward to that!
I worked on it this morning and I managed to successfully create the table:
The function read_csv() is being used to read the data file ‘ReplicationProcessedfinaldata04202018’ available from the OSF Repository. The data is being saved as ‘sdmp’ to represent the title of the paper (“Sex Differences in Mate Preferences”)
sdmp <- read_csv("ReplicationProcessedfinaldata04202018.csv")
In the data frame, the variable “sex” uses values ‘0’ and ‘1’ to represent ‘women’ and ‘men’ respectively. By using the mutate() and recode() function from dplyr, we can change the ‘0’ and ‘1’ values to ‘female’ and ‘male’. The good thing about the recode () function is that it preserves the existing order of levels and just changes the values. As such, we can avoid getting confused (and potentially making mistakes) with the original ‘0’ and ‘1’ variables.
I created a new data frame (ie. table_meanssd) so that the calculations would be saved under it. I used the across() function and the starts_with() function to simplify the process and apply the same transformations to each of the variables starting with “ideal”. The list() function was used to calculate the means and SDs for each variable and na.rm = TRUE was used to exlude missing values from calculations. The names() function from the dplyr package was used, with the ‘.names’ argument setting names for the column through {.col}.{.fn}. This expression essentially sets the output columns names to the selected column name and the name of the function applied eg. “ideal_resources.mean”.
The kbl() function from the ‘kableExtra’ package was used to create a table for the data. Kable_material () was used to change the aesthetic features, with “striped” being used to better differentiate between the scores for men and women.
recode_sdmp <- sdmp %>% mutate(sex=recode(sex,
`0`="female", # old name = new name
`1`="male"))
table_meanssd <- recode_sdmp %>%
group_by(sex) %>%
summarise(across(starts_with("ideal_"), list(mean = mean, sd = sd), na.rm = TRUE, .names = "{.col}.{.fn}"))
table_meanssd %>%
kbl() %>%
kable_material(c("striped", "hover"))
| sex | ideal_intelligence.mean | ideal_intelligence.sd | ideal_kindness.mean | ideal_kindness.sd | ideal_health.mean | ideal_health.sd | ideal_physatt.mean | ideal_physatt.sd | ideal_resources.mean | ideal_resources.sd |
|---|---|---|---|---|---|---|---|---|---|---|
| female | 6.028282 | 0.9324396 | 6.234720 | 0.9798563 | 6.099530 | 1.030372 | 5.556241 | 1.113455 | 5.480709 | 1.139569 |
| male | 5.917258 | 1.0092708 | 6.123222 | 1.0204945 | 6.002011 | 1.096022 | 5.854876 | 1.102100 | 5.105989 | 1.250394 |
HOWEVER; I’m not too sure how to round the mean and SD to two decimal places. Additionally, after posting this code to our shared hackmd and speaking to Sakiko about it, I’m curious if there’s a way to make the table vertically orientated, as opposed to horizontal. It’s not a big issue but it’s something I might work on more because it’s still ehhh and I would like it to be ooohh.
I’ve looked into it a little and apparently the gather() function from tidyr might be helpful in grouping calculations for the variables (ie. ideal_resources, ideal_health etc.). I’ve played around with it a bit but I’m not too sure how to use it as the key arguments are ‘value’ and ‘variable’. I’ve tried setting the means and SDs as ‘value’ and the “ideal_” as variable but thus far, I’ve had no success. That’s something I can continue working on!
emo::ji("smile")
## 😄
Record our group presentation and present it on this upcoming Thursday (I’m excited to see everyone’s work!!)
Try to make the output for the table of sex differences in mate preferences long (ie. vertical), as opposed to horizontal. Maybe play around with gather() more, or keep googling!
Continue working on the verification report. In terms of specific next steps, it would be to 1) finish up the exploratory analyses, 2) write up the recommendations, 3) check everything and then work on knitting the rmd file to both pdf and html formats (I think that was something Jenny mentioned as being a key problem).
Go have dinner and maybe stretch. I have been sitting at my desk all day and at this point, I’m sure I resemble this guy.
To the dinner table!