My goal for the week

  1. Finish coding the plots! We managed to reproduce both figures 2 and 3 however, we struggled with figures 1 and 4. Sakiko had implied that she was working on troubleshooting plot 4 so I opted to try reproducing plot 1 and see how far I could get.

How did I go?

It was definitely a bit of a challenge to reproduce plot 1.

I was honestly overwhlemed with assignments (I had two due) but I managed to take a quick look at the author’s code for plot 1 early in the week. From that, I tried working out what pieces of code were relevant in terms of being able to reproduce the figure. As a group, it was kind of understood that we would try to minimise the use of their code as much as possible- we need to use some of their code in the analysis section of their r script but we want to do as much of it as we can by ourselves.

After examining it, I came to the conclusion that ony the code under the ‘Create dataframe’ and ‘Plot super slopes’ was necessary. So using that, I tried to work on reproducing the figure.

Shweta and I were both interested in working on it so we called and tried to reproduce the figure. This was actually kind of nice because although we were working on it seperately, we still could ask each other questions and get suggestions/help in real time. The added factor of having someone there through the process was also nice, I guess- we both had a bit of a rant every now and again so that was bonding hahaha.

# Load relevant packages

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.4     ✓ purrr   0.3.4
## ✓ tibble  3.1.2     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(ggplot2)
library(lmerTest)
## Loading required package: lme4
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## 
## Attaching package: 'lmerTest'
## The following object is masked from 'package:lme4':
## 
##     lmer
## The following object is masked from 'package:stats':
## 
##     step
library(psych)
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
library(ggpubr)
library(tidyr)
library(ggeasy)
library(here)
## here() starts at /Users/ashadurrani/Documents/R
# View new data frame 

data <- read_csv("walter.csv")
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   .default = col_double(),
##   continent = col_character(),
##   country = col_character(),
##   city = col_character(),
##   partnum = col_character(),
##   partcode = col_character(),
##   religion = col_character(),
##   country_religion = col_character(),
##   gb_path = col_logical()
## )
## ℹ Use `spec()` for the full column specifications.
## Warning: 4946 parsing failures.
##  row     col           expected actual         file
## 1320 gb_path 1/0/T/F/TRUE/FALSE     10 'walter.csv'
## 1321 gb_path 1/0/T/F/TRUE/FALSE     10 'walter.csv'
## 1322 gb_path 1/0/T/F/TRUE/FALSE     10 'walter.csv'
## 1323 gb_path 1/0/T/F/TRUE/FALSE     10 'walter.csv'
## 1324 gb_path 1/0/T/F/TRUE/FALSE     10 'walter.csv'
## .... ....... .................. ...... ............
## See problems(...) for more details.
glimpse(data)
## Rows: 14,399
## Columns: 46
## $ PIN                 <dbl> 12506, 1997, 1448, 10625, 6106, 4078, 3034, 5281, …
## $ CIN                 <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ continent           <chr> "Africa", "Africa", "Africa", "Africa", "Africa", …
## $ country             <chr> "Algeria", "Algeria", "Algeria", "Algeria", "Alger…
## $ city                <chr> "Algiers", "Algiers", "Setif", "Setif", "Algiers",…
## $ countrycode         <dbl> 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24…
## $ partnum             <chr> "85", "72", "277", "229", "23", "82", "86", "135",…
## $ partcode            <chr> "A85", "A72", "SB277", "S229", "A23", "A82", "A86"…
## $ sample              <dbl> 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 2, 1, 1,…
## $ sex                 <dbl> 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ age                 <dbl> 21, 22, 47, 20, 23, 20, 22, 27, 19, 19, 19, 28, 22…
## $ religious           <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ religion            <chr> "islam", "islam", "Islam", "Islam", "islam", "isla…
## $ relstat             <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ relstat2            <dbl> 2, 2, 4, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 2, 2, 2, 3,…
## $ rellength           <dbl> 28, NA, 307, NA, NA, 14, 14, 9, NA, NA, NA, 14, NA…
## $ ideal_intelligence  <dbl> 3, 7, 5, 4, 7, 6, 6, 7, 5, 5, 4, 6, 6, 7, 6, 7, 4,…
## $ ideal_kindness      <dbl> 7, 7, 5, 7, 7, 7, 7, 7, 7, 5, 7, 6, 7, 7, 7, 7, 5,…
## $ ideal_health        <dbl> 6, 7, 5, 7, 7, 7, 7, 7, 5, 7, 7, 6, 7, 7, 6, 4, 5,…
## $ ideal_physatt       <dbl> 4, 7, 5, 7, 7, 7, 7, 6, 7, 5, 5, 6, 6, 7, 6, 2, 6,…
## $ ideal_resources     <dbl> 1, 6, 5, 4, 7, 7, 7, 6, 4, 5, 4, 6, 5, 7, 7, 1, 4,…
## $ mate_age            <dbl> 16, 17, 17, 17, 18, 18, 18, 18, 18, 18, 18, 19, 19…
## $ popsize             <dbl> 39667, 39667, 39667, 39667, 39667, 39667, 39667, 3…
## $ country_religion    <chr> "Muslim", "Muslim", "Muslim", "Muslim", "Muslim", …
## $ lattitude           <dbl> 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28…
## $ gem1995             <dbl> 0.266, 0.266, 0.266, 0.266, 0.266, 0.266, 0.266, 0…
## $ gdi1995             <dbl> 0.508, 0.508, 0.508, 0.508, 0.508, 0.508, 0.508, 0…
## $ gii                 <dbl> 0.429, 0.429, 0.429, 0.429, 0.429, 0.429, 0.429, 0…
## $ gdi2015             <dbl> 0.854, 0.854, 0.854, 0.854, 0.854, 0.854, 0.854, 0…
## $ gggi                <dbl> 0.642, 0.642, 0.642, 0.642, 0.642, 0.642, 0.642, 0…
## $ gdp_percap          <dbl> 15100, 15100, 15100, 15100, 15100, 15100, 15100, 1…
## $ infect_death        <dbl> 0.000196637, 0.000196637, 0.000196637, 0.000196637…
## $ infect_yll          <dbl> 0.01025033, 0.01025033, 0.01025033, 0.01025033, 0.…
## $ cmc_yll             <dbl> 0.05141553, 0.05141553, 0.05141553, 0.05141553, 0.…
## $ gb_path             <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ agediff             <dbl> -5, -5, -30, -3, -5, -2, -4, -9, -1, -1, -1, -9, -…
## $ pathindex           <dbl> -0.1635533, -0.1635533, -0.1635533, -0.1635533, -0…
## $ gepcascores         <dbl> -1.298351, -1.298351, -1.298351, -1.298351, -1.298…
## $ zagediff            <dbl> -0.66236022, -0.66236022, -4.37646742, -0.36523164…
## $ zideal_resources    <dbl> -3.5742767, 0.5722421, -0.2570617, -1.0863654, 1.4…
## $ zideal_intelligence <dbl> -3.07225356, 1.05405819, -1.00909768, -2.04067562,…
## $ zideal_kindness     <dbl> 0.8156212, 0.8156212, -1.1846226, 0.8156212, 0.815…
## $ zideal_health       <dbl> -0.0523192, 0.8896893, -0.9943277, 0.8896893, 0.88…
## $ zideal_physatt      <dbl> -1.5139923, 1.1689482, -0.6196788, 1.1689482, 1.16…
## $ zcmc_yll            <dbl> 0.09657417, 0.09657417, 0.09657417, 0.09657417, 0.…
## $ zpathindex          <dbl> -0.163697, -0.163697, -0.163697, -0.163697, -0.163…

Creating a country list

countrylist <- unique(data$country)

Figure 1

Running lines (82-107, 1122-1156) from the author’s script.

################ Sex Differences in Preferences ###################

#Good Financial Prospects
gfpdiff<-lmer(zideal_resources~sex+(1+sex|CIN),data=data)
gfpslopes<-coef(gfpdiff)$CIN[,2]

#Physical Attractiveness
padiff<-lmer(zideal_physatt~sex+(1+sex|CIN),data=data)
paslopes<-coef(padiff)$CIN[,2]

#Health
hdiff<-lmer(zideal_health~sex+(1+sex|CIN),data=data)
## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
## Model failed to converge with max|grad| = 0.00221845 (tol = 0.002, component 1)
hslopes<-coef(hdiff)$CIN[,2]

#Intelligence
intdiff<-lmer(zideal_intelligence~sex+(1+sex|CIN),data=data)
intslopes<-coef(intdiff)$CIN[,2]

#Kindness
kinddiff<-lmer(zideal_kindness~sex+(1+sex|CIN),data=data)
kindslopes<-coef(kinddiff)$CIN[,2]

#Age
#all ages
agediff<-lmer(zagediff~sex+(1+sex|CIN),data=data)
ageslopes<-coef(agediff)$CIN[,2]


##### Sex Differences in Preferences Data #####

#health
datahealth<-data.frame("country"=unique(data$country),"slopes"=coef(hdiff)$CIN[,2],"n"=tapply(data$zideal_health,data$country,length))
healthslopes<-qplot(slopes,country,color=country,data=datahealth,xlab="Sex Difference in Health Preference", ylab="Country",size=I(4))+theme_classic(base_size=25)+geom_vline(xintercept =0, size=2)+theme(legend.position="none")+coord_cartesian(xlim=c(-1.25,1.25))

#physical attractiveness
datapa<-data.frame("country"=unique(data$country),"slopes"=coef(padiff)$CIN[,2],"n"=tapply(data$zideal_physatt,data$country,length))
physattslopes<-qplot(slopes,country,color=country,data=datapa,xlab="Sex Difference in Physical Attractiveness Preference", ylab="",size=I(4))+theme_classic(base_size=20)+geom_vline(xintercept =0, size=2)+theme(legend.position="none")+coord_cartesian(xlim=c(-1.25,1.25))

#good financial prospects
datagfp<-data.frame("country"=unique(data$country),"slopes"=coef(gfpdiff)$CIN[,2],"n"=tapply(data$zideal_resources,data$country,length))
resourceslopes<-qplot(slopes,country,color=country,data=datagfp,xlab="Sex Difference in Good Financial Prospects Preference",ylab="Country",size=I(4))+theme_classic(base_size=20)+geom_vline(xintercept =0, size=2)+theme(legend.position="none")+coord_cartesian(xlim=c(-1.25,1.25))

#kindness
datakind<-data.frame("country"=unique(data$country),"slopes"=coef(kinddiff)$CIN[,2],"n"=tapply(data$zideal_kindness,data$country,length))
kindslopes<-qplot(slopes,country,color=country,data=datakind,xlab="Sex Difference in Kindness Preference", ylab="Country", size=I(4))+theme_classic(base_size=25)+geom_vline(xintercept =0, size=2)+theme(legend.position="none")+coord_cartesian(xlim=c(-1.25,1.25))

#intelligence
dataint<-data.frame("country"=unique(data$country),"slopes"=coef(intdiff)$CIN[,2],"n"=tapply(data$zideal_intelligence,data$country,length))
intslopes<-qplot(slopes,country,color=country,data=dataint,xlab="Sex Difference in Intelligence Preference",ylab="Country",size=I(4))+theme_classic(base_size=25)+geom_vline(xintercept =0, size=2)+theme(legend.position="none")+coord_cartesian(xlim=c(-1.25,1.25))

##### Megaplot with All Preferences Together #####

#need to adjust age variable to eliminate mate_age under 10
### subset data to only include those who reported a mate age ### n = 8920
data3<-data[complete.cases(data$mate_age),]
### subset data to include only those with mates older than 10 ### n = 8614
data3<-data3[data3$mate_age>10,]

#create age difference variable
data3$agediff<-(data3$mate_age-data3$age)
#standardize outcome variable
data3$zagediff<-scale(data3$agediff)

#get slopes
agediff2<-lmer(zagediff~sex+(1+sex|CIN),data=data3)
ageslopes2<-coef(agediff2)$CIN[,2]

#create plotting dataframe for age difference w/o mate age under 10
levels(data3$sex)<-c("Females","Males")
data3$country<-factor(data3$country)
dataage2<-data.frame("country"=unique(data3$country),"slopes"=coef(agediff2)$CIN[,2],"n"=tapply(data3$zagediff,data3$country,length))


#tag dataframe with preference variable they refer to
datahealth$variable<-"Health"
datapa$variable<-"Physical Attractiveness"
datagfp$variable<-"Good Financial Prospects"
dataage2$variable<-"Age Choice"
datakind$variable<-"Kindness"
dataint$variable<-"Intelligence"

#create dataframe
superdata<-rbind(datahealth,datapa,datagfp,dataage2,datakind,dataint)
superdata$variable<-as.factor(superdata$variable)
superdata$variable<-factor(superdata$variable,levels=c("Age Choice","Good Financial Prospects","Physical Attractiveness","Intelligence","Kindness","Health"))

Viewing the ‘superdata’ data frame

glimpse(superdata)
## Rows: 266
## Columns: 4
## $ country  <chr> "Algeria", "Australia", "Austria", "Belgium", "Brazil", "Bulg…
## $ slopes   <dbl> 0.01684815, -0.06542704, -0.04481616, -0.28715237, -0.0909391…
## $ n        <int> 614, 508, 197, 419, 296, 115, 203, 396, 156, 148, 396, 260, 8…
## $ variable <fct> Health, Health, Health, Health, Health, Health, Health, Healt…

The ggplot() function from the ggplot package was used to create a plot with the ‘superdata’ data frame. Geom_jitter() was used to jitter the data and reduce overplotting, as that was something the authors had mentioned and included in their script. Geom_vline() was also used to add a vertical line along the x axis, at the ‘0’ intercept, with size being used to increase the line’s thickness. The labs() function set labels for the x and y axes and theme_classic() from the ggeasy package was used to change the background colour to white. Scale_x_continuous was used to specify the increments of the ticks on the x-axis being 0.5, ranging from -2 to 2. Easy_remove_legend was also used to remove the legend.

plot_one <- ggplot(data = superdata,
                   mapping = aes(
                     x = slopes,
                     y = variable,
                     geom = "blank",
                     colour = variable)
) + 
  labs(x = "Sex Difference (Males - Females)", y = "Difference Variable")+
  geom_jitter(size = I(2.5), height = .30)+
  geom_vline(xintercept = 0, size = 2)+
  theme_classic(base_size = 25)+
  scale_x_continuous(breaks = seq(-2, 2, .5),
                     limits = c(-2,2))+
  easy_remove_legend()+
  ggtitle(label = "Scatterplot of sex differences in mate preference ratings")


print(plot_one)

The plot looked similar to the figure but it still felt *off somehow.

The authors had said that the data was jittered in order to reduce overplotting. When discussing the way the plot looked as a team, Lucy had a breakthrough and realised that the points were off because the jitter added random noise to the points. As a result, the scattering of data was kind of random every time we ran the code.

She pointed us to a resource (see link below) that highlighted how we could use a seed. https://stackoverflow.com/questions/48822524/how-to-make-scatterplot-with-geom-jitter-plot-reproducible/48822620

So we decided not to use geom_jitter() like the authors had in their script, using geom_point() instead to set the seed and make the points jitter. We used the seed number from the authors script (line 18- number: 1302019).

plot_one <- ggplot(data = superdata,
                   mapping = aes(
                     x = slopes,
                     y = variable,
                     colour = variable)) + 
  geom_point(position = position_jitter(seed = 1302019), size = 1.5)+
  labs(x = "Sex Difference (Males - Females)", y = "Difference Variable")+
  geom_vline(xintercept = 0, size = 2)+
  theme_classic(base_size = 25)+
  scale_x_continuous(breaks = seq(from = -2, to =2, by = 0.5),
                     limits = c(-2,2))+
  easy_remove_legend()+
  ggtitle(label = "Scatterplot of sex differences in mate preference ratings")


print(plot_one)

Challenges

  • A key challenge was trying to get the data points to be exact- we had been frazzled with this for weeks now. Luckily Lucy was able to figure it out and we were able to successfully reproduce figure 1. It’s still nowhere near perfect (in terms of font etc.) but it’s still a major win.

  • Plot 4 still remains a challenge. We haven’t been able to successfully reproduce it, as there’s still a random outlier dot present and we’re not sure how to fix it. At the moment, it looks like this:

Creating a country list

countrylist<-unique(data$country)

Running lines (1174-1188) from the author’s script (see ‘gender equality analyses’ section in the author’s r script)

#### Gender Equality #####

#Find the value of each gender equality variable for each country
gem1995av<-tapply(data$gem1995,data$country,mean)
gdi1995av<-tapply(data$gdi1995,data$country,mean)
giiav<-tapply(data$gii,data$country,mean)
gdi2015av<-tapply(data$gdi2015,data$country,mean)
gggiav<-tapply(data$gggi,data$country,mean)
ge_compav<-tapply(data$gepcascores,data$country,mean)

#new dataframe with national level gender equality variables separated by sex
ge_plotdata<-data.frame("country"=rep(countrylist,2),"sex"=rep(c("Female","Male"),each=45),"gem1995"=rep(gem1995av,2),"gdi1995"=rep(gdi1995av,2),"gii"=rep(giiav,2),"gdi2015"=rep(gdi2015av,2),"gggi"=rep(gggiav,2),"ge_comp"=rep(ge_compav,2))

#make dataframe for age difference
age_ge_plotdata<-data.frame(ge_plotdata,"agedif"=c(tapply(data$agediff[data$sex==0],data$country[data$sex==0],function(x) mean(x,na.rm=T)),tapply(data$agediff[data$sex==1],data$country[data$sex==1],function(x) mean(x,na.rm=T))))
# creating a plot 

plot_four <- ggplot(data = age_ge_plotdata, 
               mapping = aes(
                 x = ge_comp, 
                 y = agedif, 
                 colour = sex
               )) + 
  geom_point() + 
  geom_smooth(
    method = "lm", 
    size = 2) + 
  ylim(-5, 5) + 
  theme_classic() +
  labs(
    x = "Gender-Equality Composite", 
    y = "Age Difference (Partner-Self)") + 
  scale_colour_manual(
    values = c("dodgerblue", "limegreen")) + 
  easy_remove_legend_title()

print(plot_four)
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 14 rows containing non-finite values (stat_smooth).
## Warning: Removed 14 rows containing missing values (geom_point).

Let’s play a game!

Can you spot the outlier? You have five seconds.

Found it? Yay! It’s the single blue dot amongst the green dots!

Strengths

  • Aside from my (frustration disguised as) sarcasm above, I do think this week was quite successful because we were able to make progress in terms of reproducing the figures. We had been stuck for two weeks so being able to reproduce a figure successfully is a win.

  • Teamwork; I’m glad that everyone wants to do well and succeed, and that they’re willing to try- even when it is difficult.

Next Steps:

  • Continue to try and work on figure 4. Hopefully we can make it work!

  • Work on our group presentation. We’ve made some progress but we still have some work to do.