Mate preferences have been examined by many researchers, as they can provide greater insight into human reproduction throughout history and clarify some of the most pertinent questions of our time regarding sex differences. The Evolutionary Psychology Perspective (Buss, 1989) posits that when seeking a long-term partner, men and women place different value on traits that signal economic stability and fertility (e.g. physical attractiveness and age). These sex differences are thought to reflect evolutionary selection pressures. This contrasts with the Biosocial Role Perspective (Eagly & Wood, 1999), which instead argues that sex differences reflect behaviours adopted on the basis of societal expectations.
The evidence backing these two predominant perspectives was somewhat ambiguous: many studies had re-analysed the same data set, yet they lacked a common methodological framework and allowed considerable flexibility in their research designs. As a result, it was unclear whether the findings could generalise beyond the individual studies.
In their replication study, Walter et al. (2020) utilised a 45-country sample (N = 14,399) and collected participant questionnaire responses to examine sex differences in mate preferences. All of the competing hypotheses derived from the two perspectives were examined, and methodological issues were corrected for by utilising appropriate measures and a single analytic framework (i.e. the same outcome, predictor, and control variables).
Walter et al. (2020) found that universal sex differences in mate preferences remained robust across cultures. Women placed a higher value on financial prospects than men, whereas men placed a higher value on physical attractiveness. Additionally, men reported having increasingly younger partners relative to their own age, whereas women reported having consistently older partners. In line with Biosocial Role Theory, sex differences in partner age decreased as gender equality increased. However, there was no evidence of a relationship between gender equality and sex differences in mate preferences, and neither gender equality nor pathogen prevalence held up as robust predictors of cross-cultural variability.
The replication study by Walter et al. (2020) provided clarity on the accuracy of the hypotheses underlying the two perspectives, and in doing so, provided insight as to where further research may be necessary. Their work serves as an updated, accurate, and transparent foundation that can be used as a starting point for further research.
I was surprised that many of the studies examining sex differences in mate preferences had re-used the same sample data collected by Buss (1989). The fact that the same data set generated various different, and often conflicting, results highlights how sensitive conclusions can be to analytic choices. For example, Gangestad et al. (2006) re-examined cross-cultural variability using gender equality and pathogen prevalence as competing predictors and found that gender equality did not significantly predict sex differences in preferences. However, they analysed composites of ranked and rated preferences and controlled for variables that Eagly & Wood (1999) had not controlled for. Careful consideration of the data analysis is needed in order to dissect the strategic choices made by researchers, and how those choices may shape the results.
The evidence for the claim that gender equality and pathogen prevalence are not robust predictors of cross-cultural variability in mate preferences seemed potentially premature. Without control variables, pathogen prevalence predicted mate preferences for all measures. Similarly, gender equality predicted sex differences for every measure of gender equality. The researchers suggested that relatively abstract national-level predictors may not accurately reflect the ecological surroundings of the participants, and that measures tapping more directly into the information available to mate-preference psychology might have yielded different results. This may have merit, particularly as national-level predictors might not accurately reflect the experiences and understandings of participants.
I wonder whether the use of questionnaires was a potential limitation. The questionnaires included a 7-point scale for each of the five items, and there were no practice questions to gauge participant reactivity or how individuals use the scale. This is not a major issue, as these individual differences reflect a source of variability that can be averaged over and, in a sense, controlled. A slightly more serious issue is demand characteristics: responses were collected in person and, because the questionnaire relied on directive questions, participants might have been influenced to alter their stated preferences for various traits. This could be a source of error, as two individuals may value a given trait equally, yet due to variations in cultural norms only one may feel comfortable rating it highly.
I obtained the open data used within the paper from the OSF Repository. Our goal was to reproduce:
The demographic descriptives in the paper (as reported in the ‘Participants’ section of the paper).
The means and standard deviations in the paper (as reported in the ‘Results’ section of the paper).
The four figures within the paper.
Demographic Descriptives
Means/SD within the paper
Figures in the paper
Scatterplot of sex differences in mate preference ratings
Scatterplot of age difference (partner - self) as a function of participant age
Range plot of Mahalanobis distance (D) for each country
Scatterplot of age difference (partner - self) as a function of gender-equality composite score - with regression line
Reading the data into RStudio was definitely a challenge, and one that I hadn't expected. I initially used the 'Import Dataset' tool with the 'From Text (readr)' option, because I had seen online that .csv files could be imported that way. However, when I ran the code shown at the bottom of the pop-up window, I kept getting an error message saying that there was no such data file.
I reached out to my fellow team members and they mentioned that I could use a working directory to import the data file. I had no clue what a working directory was, so I googled it and found that it is the file path R uses as the default location for reading and writing files. I set up my working directory, used the getwd() function to confirm where it was, saved the data file there, and read it into RStudio.
However, after the Q&A session where Jenny spoke about how she used R Projects, I made a project for the reproducibility challenge. I definitely found it to be more convenient.
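To make that concrete, here is a minimal sketch of the workflow I ended up preferring: an RStudio Project combined with here::here(), which builds file paths relative to the project root. The 'data' subfolder below is a hypothetical layout rather than our actual folder structure.
library(here) # builds file paths relative to the project root
# read the processed data file from a hypothetical 'data' subfolder of the project
sdmp <- readr::read_csv(here("data", "ReplicationProcessedfinaldata04202018.csv"))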
Before diving into reproducibility, I found it helpful to learn a little more about coding best practices, just to make sure I was setting myself up well. One of the key things I read emphasised the importance of commenting your code concisely and effectively. With that in mind, I began with a conscious intention to consistently explain why I was doing what I was doing. This surprisingly helped with the KISS principle (i.e. keep it simple, stupid), as I would think about what I was doing and then come up with alternative methods that were easier and simpler.
Load the relevant packages needed
Loading these packages enabled me to process data using the various functions offered within the packages.
library(tidyverse) # used for dplyr, readr and ggplot.
library(papaja) # used to create APA tables.
library(janitor) # used for count() and to clean data.
library(here) # used to specify the path for files.
library(haven) # loaded for importing labelled data (note: as.factor(), used below, is from base R).
library(kableExtra) # used to build tables and manipulate the style.
library(ggeasy) # used to fix the aesthetic features of plots.
library(qwraps2) # used for creating summary tables.
library(lme4) # used for 'lmer' function used in analyses.
library(utils) # used for R utility functions like head().
library(gt) # used to create gt tables.
library(nortest) # used for the Anderson Darling test of normality.
library(ggplot2) # used for data visualisations.
library(ggpubr) # used for the ggqqplot() function and to customise plots.
library(jmv) # used for the ttestIS() function.
library(stats) # used for cor.test() function.
library(formatR) # used to fix the formatting when knitting to PDF.
Load the data file needed
The read_csv() function from the readr package was used to read the data file 'ReplicationProcessedfinaldata04202018.csv', available from the OSF Repository.
The '<-' symbol is known as the "assignment operator": it assigns the value produced by the code on the right to the name on the left. The data were saved as 'sdmp', an abbreviation of the paper's title ("Sex Differences in Mate Preferences").
sdmp <- read_csv("ReplicationProcessedfinaldata04202018.csv")
View variables in the data frame
glimpse() was used to present a more compact view of the data, allowing me to see each of the columns (i.e. variables) in the data frame and their data types. This was helpful, as I could then track which variables I needed when calculating specific descriptive statistics, means, and SDs.
glimpse(sdmp)
## Rows: 14,399
## Columns: 35
## $ PIN <dbl> 12506, 1997, 1448, 10625, 6106, 4078, 3034, 5281, 1…
## $ CIN <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ continent <chr> "Africa", "Africa", "Africa", "Africa", "Africa", "…
## $ country <chr> "Algeria", "Algeria", "Algeria", "Algeria", "Algeri…
## $ city <chr> "Algiers", "Algiers", "Setif", "Setif", "Algiers", …
## $ countrycode <dbl> 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24,…
## $ partnum <chr> "85", "72", "277", "229", "23", "82", "86", "135", …
## $ partcode <chr> "A85", "A72", "SB277", "S229", "A23", "A82", "A86",…
## $ sample <dbl> 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 2, 1, 1, …
## $ sex <dbl> 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ age <dbl> 21, 22, 47, 20, 23, 20, 22, 27, 19, 19, 19, 28, 22,…
## $ religious <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ religion <chr> "islam", "islam", "Islam", "Islam", "islam", "islam…
## $ relstat <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ relstat2 <dbl> 2, 2, 4, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 2, 2, 2, 3, …
## $ rellength <dbl> 28, NA, 307, NA, NA, 14, 14, 9, NA, NA, NA, 14, NA,…
## $ ideal_intelligence <dbl> 3, 7, 5, 4, 7, 6, 6, 7, 5, 5, 4, 6, 6, 7, 6, 7, 4, …
## $ ideal_kindness <dbl> 7, 7, 5, 7, 7, 7, 7, 7, 7, 5, 7, 6, 7, 7, 7, 7, 5, …
## $ ideal_health <dbl> 6, 7, 5, 7, 7, 7, 7, 7, 5, 7, 7, 6, 7, 7, 6, 4, 5, …
## $ ideal_physatt <dbl> 4, 7, 5, 7, 7, 7, 7, 6, 7, 5, 5, 6, 6, 7, 6, 2, 6, …
## $ ideal_resources <dbl> 1, 6, 5, 4, 7, 7, 7, 6, 4, 5, 4, 6, 5, 7, 7, 1, 4, …
## $ mate_age <dbl> 16, 17, 17, 17, 18, 18, 18, 18, 18, 18, 18, 19, 19,…
## $ popsize <dbl> 39667, 39667, 39667, 39667, 39667, 39667, 39667, 39…
## $ country_religion <chr> "Muslim", "Muslim", "Muslim", "Muslim", "Muslim", "…
## $ lattitude <dbl> 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28,…
## $ gem1995 <dbl> 0.266, 0.266, 0.266, 0.266, 0.266, 0.266, 0.266, 0.…
## $ gdi1995 <dbl> 0.508, 0.508, 0.508, 0.508, 0.508, 0.508, 0.508, 0.…
## $ gii <dbl> 0.429, 0.429, 0.429, 0.429, 0.429, 0.429, 0.429, 0.…
## $ gdi2015 <dbl> 0.854, 0.854, 0.854, 0.854, 0.854, 0.854, 0.854, 0.…
## $ gggi <dbl> 0.642, 0.642, 0.642, 0.642, 0.642, 0.642, 0.642, 0.…
## $ gdp_percap <dbl> 15100, 15100, 15100, 15100, 15100, 15100, 15100, 15…
## $ infect_death <dbl> 7.8, 7.8, 7.8, 7.8, 7.8, 7.8, 7.8, 7.8, 7.8, 7.8, 7…
## $ infect_yll <dbl> 406.6, 406.6, 406.6, 406.6, 406.6, 406.6, 406.6, 40…
## $ cmc_yll <dbl> 2039.5, 2039.5, 2039.5, 2039.5, 2039.5, 2039.5, 203…
## $ gb_path <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
Data were collected in 2016 from participants in 45 different countries (N = 14,399; 7,909 female, or 54.93%).
Number of countries
This code used the dplyr function n_distinct() to count the number of distinct values of the 'country' variable, with the $ operator specifying the 'country' column within the data frame. This gave the number of countries represented in the data.
n_distinct(sdmp$country)
## [1] 45
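An equivalent check (just a sketch, not what I originally ran) combines distinct() and nrow():
# keep one row per country, then count the rows; should also return 45
sdmp %>% distinct(country) %>% nrow()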
Number of participants
Each participant’s data is represented in a separate row. Therefore, to get N, I used the nrow() function to count the number of rows in the data frame.
nrow(sdmp)
## [1] 14399
Breakdown of participant sex
First, I used the base R function as.factor() to convert the 'sex' variable within the data frame from a double (a numeric type) to a factor. This was done because sex is a categorical variable.
sdmp$sex <- as.factor(sdmp$sex)
In the data frame, the variable “sex” used the values ‘0’ and ‘1’ to represent ‘women’ and ‘men’ respectively. By using the mutate() and recode() function from dplyr, I was able to change the ‘0’ and ‘1’ values to ‘female’ and ‘male’. The use of the recode() function enabled me to change the values whilst preserving the existing order of levels.
Resource which helped recode the labels to male and female:
The count() function was also used to count the total number and percentage of males and females in the overall sample.
Resource that helped figure out the group by, summarise, and n() functions:
sdmp <- sdmp %>% mutate(sex=recode(sex,
`0`="female", # old name = new name
`1`="male")) # old name = new name
sdmp %>%
group_by(sex) %>%
summarise(count = n(), percent = 100 * n()/ nrow(sdmp))
## # A tibble: 2 × 3
## sex count percent
## <fct> <int> <dbl>
## 1 female 7909 54.9
## 2 male 6490 45.1
round(c(54.92743, 45.07257), 2)
## [1] 54.93 45.07
A problem that I encountered for some time was using the round() function for percentages. Initially, the code looked like this:
EXAMPLE OF FAULTY CODE:
sdmp %>%
group_by(sex) %>%
summarise(count = n(), percent = 100 * n()/ nrow(sdmp)) %>%
round(c(54.92743, 45.07257), 2)
This prompted an error message every time: "Error in sdmp %>% group_by(sex) %>% summarise(count = n(), percent = 100 *: 3 arguments passed to 'round' which requires 1 or 2 arguments".
I wasn't able to find much help online, but after playing around with the code for a bit, I realised that piping the round() call in with the other lines was the issue. This would explain the error message saying that three arguments were passed to round().
After removing the pipe, the code worked. However, when knitting, the tibble with the rounded percentages appeared separately from the rest despite being in the same chunk. I wasn't sure how to resolve that; it may just be how multiple outputs from one chunk are rendered.
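In hindsight, one way to sidestep the issue entirely (a sketch, using the same data frame) is to do the rounding inside summarise() itself, so no separate round() call needs to be piped afterwards:
# round the percentage within summarise(), avoiding a second pipe into round()
sdmp %>%
  group_by(sex) %>%
  summarise(count = n(), percent = round(100 * n() / nrow(sdmp), 2))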
From the sites that did keep records (n = 6,604), 47.21% (n = 3,118) came from community samples. ‘1’= “non community sample”, ‘2’= “community sample”
Again, I converted the 'sample' variable to a factor using the base R function as.factor(). This was helpful, as 'sample' is a categorical variable, with the '1' value representing 'non-community sample' and the '2' value representing 'community sample'.
sdmp$sample <- as.factor(sdmp$sample)
Similarly to the ‘sex’ variable, the mutate() function and the recode() function were used to change the ‘1’ and ‘2’ values to ‘university’ and ‘community’.
sdmp <- sdmp %>%
mutate(sample = recode(sample,
'1' = "university", # old name = new name
                       '2' = "community")) # old name = new name
The filter() function from the dplyr package was used to remove the missing values from the 'sample' variable, and the count() function was then used to count the number of participants from sites that kept records of where their samples came from.
SDMP_filter_na <- sdmp %>%
  filter(sample != "NA") # comparisons involving missing values return NA, and filter() drops those rows
count(SDMP_filter_na)
## # A tibble: 1 × 1
## n
## <int>
## 1 6604
The filter() function was used to find the total number of records from the community sample. The summarise() function then calculated the number and percentage of community samples compared to overall samples, with n() representing the total number of observations from community samples, and nrow() representing the number of observations from sites that kept records.
SDMP_filter_na %>%
filter(sample == "community") %>% # '==' means 'exactly equal to'
summarise(count = n(), percent = round(100 * n()/ nrow(SDMP_filter_na), 2))
## # A tibble: 1 × 2
## count percent
## <int> <dbl>
## 1 3118 47.2
Of the total sample, most participants reported being in ongoing, committed relationships (n = 9,206, or 63.93%). ‘0’= “not ongoing, committed”, ‘1’= “ongoing, committed”
Again, I converted the ‘relstat’ variable to a factor variable using as.factor(). This was helpful, as relstat (i.e. relationship status) is a categorical variable, with ‘0’ representing ‘not ongoing, committed’ and ‘1’ representing ‘ongoing, committed’.
sdmp$relstat <- as.factor(sdmp$relstat)
Again, I opted to change the '0' and '1' labels. Instead of using mutate() with recode(), however, I tried case_when(), as that was something Jenny had mentioned during the Q&A sessions.
sdmp <- sdmp %>%
mutate(relstat = case_when(relstat == 0 ~ "not committed",
relstat == 1 ~ "committed"))
This code used the filter() function to include only those in ongoing, committed relationships. The summarise() function was then used to count them, with the percentage reported in the paper (63.935%) rounded to two decimal places.
sdmp %>%
filter(relstat == "committed") %>% # '==' means 'exactly equals to'
summarise(count = n(),
round(63.935, 2))
## # A tibble: 1 × 2
## count `round(63.935, 2)`
## <int> <dbl>
## 1 9206 63.9
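A small possible refinement (sketch only): the percentage could be computed from the data rather than typed in by hand, which avoids the hard-coded 63.935.
# compute the percentage of committed participants directly from the data
sdmp %>%
  filter(relstat == "committed") %>%
  summarise(count = n(), percent = round(100 * n() / nrow(sdmp), 2))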
First for all reported mate ages (n = 8,920)
The filter() function was used to exclude all missing values and include only the participants who reported the age of their mate. The count() function was then used to count those participants.
sdmp %>%
  filter(mate_age != "na") %>% # comparisons involving missing values return NA, so filter() drops those rows
count()
## # A tibble: 1 × 1
## n
## <int>
## 1 8920
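An equivalent and slightly more explicit filter (a sketch) uses is.na() directly, which makes the intention to drop missing values clearer:
# keep only rows where mate_age is not missing, then count them
sdmp %>%
  filter(!is.na(mate_age)) %>%
  count()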
Participants with reported mate ages older than 10 (n = 8,614)
The filter() function was used to include only those participants who reported their mate’s age as being above 10. The count() function was then used to count those participants.
sdmp %>%
filter(mate_age > 10) %>%
count()
## # A tibble: 1 × 1
## n
## <int>
## 1 8614
Participants ranged in age from 18 to 91 years (Mdn = 25, M = 28.78, SD = 10.62).
The summarise() function from dplyr was used to calculate summary statistics (i.e. mean, median, SD, min, and max) for the ‘age’ variable within the data. The apa_table() function from the papaja package was then used to create a table of descriptive statistics for participant age.
sdmp %>%
summarise(Mean = mean(age), Median = median(age), SD = sd(age),
    Min = min(age), Max = max(age)) %>%
apa_table(caption = "Descriptive statistics of participant age")
| Mean | Median | SD | Min | Max |
|---|---|---|---|---|
| 28.78 | 25.00 | 10.62 | 18.00 | 91.00 |
Continuing to work with the sdmp data set!
Sex Differences In Mate Preferences
The mate preferences examined by Walter et al. (2020) were: ideal financial prospects (i.e. ‘ideal_resources’), ideal intelligence, ideal kindness, ideal health, and ideal physical attractiveness (i.e. ‘ideal_physatt’).
Calculating the average preference ratings of men and women for their ideal mate’s financial prospects.
EXAMPLE OF FAULTY CODE:
sdmp %>%
group_by(sex) %>%
summarise(
Mean = mean(ideal_resources)
)
Initially, when I ran the code without "na.rm = TRUE", it did not produce a mean. Instead, an 'NA' value was produced, which meant that the data had missing values. To solve this, I looked online and found a few suggested functions. For example, the na.omit() function could be used to remove all incomplete cases within the data frame.
However, I was apprehensive about using that function, as I was worried it would alter the data and we would no longer be able to reproduce the original results exactly. I opted to use the "na.rm = TRUE" argument instead, which excludes missing values from mathematical operations.
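A tiny illustration of the difference, using made-up values:
x <- c(5, 7, NA)      # a vector with one missing value
mean(x)               # returns NA because of the missing value
mean(x, na.rm = TRUE) # returns 6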
This code calculated the mean rating for 'ideal_resources' (i.e. financial prospects) separately for each level of the 'sex' variable, i.e. for males and females.
group_by() was used to define the grouping variable as 'sex', with summarise() then calculating the mean preference rating for ideal resources by participant sex. The round() function from base R was used to specify the number of decimal places desired (i.e. 2). Additionally, as stated above, na.rm = TRUE was used to exclude missing values from the calculations.
sdmp %>%
group_by(sex) %>%
summarise(
Mean = round(mean(ideal_resources, na.rm = TRUE), 2))
## # A tibble: 2 × 2
## sex Mean
## <fct> <dbl>
## 1 female 5.48
## 2 male 5.11
There was nothing wrong with the code above per se. However, it could be improved. Instead of having the results presented in a tibble, the output could be produced in a table. Additionally, despite not being reported in the Walter et al. (2020) paper, the standard deviations (SD) could be calculated and included.
The code remained the same as above; however, the kbl() function from the kableExtra package was added to create a table. The kable_material() function was then used to set the aesthetic features, with the 'striped' option helping to differentiate the scores of men and women.
table_resources <- sdmp %>%
group_by(sex) %>%
summarise(Mean = round(mean(ideal_resources, na.rm = TRUE),
2), SD = sd(ideal_resources, na.rm = TRUE))
table_resources %>%
kbl() %>%
kable_material(c("striped", "hover"))
| sex | Mean | SD |
|---|---|---|
| female | 5.48 | 1.139569 |
| male | 5.11 | 1.250394 |
Calculating the average preference rating and SD of men and women for their ideal mate preferences
“Intelligence: The average for women was 6.03 and the average for men was 5.92. Kindness: the average for women was 6.23, and the average for men was 6.12. Health: The average for women was 6.10, and the average for men was 6.00. Physical Attractiveness: The average for women was 5.56, and the average for men was 5.85. Financial Prospects: The average for women was 5.48, and the average for men was 5.11.”
To better embody the DRY principle (i.e. don't repeat yourself), the across() and contains() functions were used to apply the same transformations to every variable containing "ideal". A named list() of functions (mean and sd) was passed to across() so that both statistics were calculated for each variable, and na.rm = TRUE was used to exclude missing values from the calculations.
The gt() function from the gt package was also used to create a table. Jenny had mentioned gt tables in the Q&A session and it seemed like a good option. A few really helpful resources that highlighted the functions and arguments were:
https://gt.rstudio.com/articles/intro-creating-gt-tables.html
https://aosmith16.github.io/spring-r-topics/slides/week04_gt_tables.html#3
The fmt_number() function was used to format the numbers in the table, with the ‘columns’ argument selecting the variables to be used. The contains() function was used to select all of the variables which contained “ideal”. The decimals argument was also used to set the number of decimal places to 2.
The cols_merge() function merges columns; it was used five times, once for each pair of mean and SD columns. The 'columns' argument with contains() selected the mate preference variable being merged (i.e. intelligence, kindness, health, etc.). The 'pattern' argument specified how the columns would be combined, with {1} and {2} acting as placeholders for the mean and SD respectively. Brackets were placed around the SD values so that, in the formatted table, the SD would appear in brackets alongside the mean.
The cols_width() and px() functions were used to change the column widths by pixel, with the width of 60 being selected through trial and error. The vars() and everything() functions were then used to select the rest of the columns to make them the same width.
The cols_label() function was used to rename the columns and make them more clear. The tab_header() was then used to give the table a title.
sex_diffs <- sdmp %>%
group_by(sex) %>%
summarise(across(contains("ideal"), list(mean = mean, SD = sd),
na.rm = TRUE))
sex_diffs %>%
gt() %>%
fmt_number(columns = contains("ideal"), decimals = 2) %>%
fmt_number(columns = contains("sd"), pattern = "({x})") %>%
cols_merge(columns = contains("intelligence")) %>%
cols_merge(columns = contains("kindness")) %>%
cols_merge(columns = contains("health")) %>%
cols_merge(columns = contains("physatt")) %>%
cols_merge(columns = contains("resources")) %>%
cols_label(ideal_intelligence_mean = "intelligence", ideal_kindness_mean = "kindness",
ideal_health_mean = "health", ideal_physatt_mean = "physical attractiveness",
ideal_resources_mean = "good financial prospects") %>%
cols_width(vars(sex) ~ px(60), everything() ~ px(110)) %>%
tab_header("Sex differences in mate preferences")
## Warning: `columns = vars(...)` has been deprecated in gt 0.3.0:
## * please use `columns = c(...)` instead
| Sex differences in mate preferences | |||||
|---|---|---|---|---|---|
| sex | intelligence | kindness | health | physical attractiveness | good financial prospects |
| female | 6.03 (0.93) | 6.23 (0.98) | 6.10 (1.03) | 5.56 (1.11) | 5.48 (1.14) |
| male | 5.92 (1.01) | 6.12 (1.02) | 6.00 (1.10) | 5.85 (1.10) | 5.11 (1.25) |
Actual Partner Age
The filter() function was used to include only cases where the mate’s age was over 10 years. The group_by() function was then used to define the variable of interest as ‘sex’, so that the mean mate age for both males and females could be calculated using the summarise() and mean() functions.
sdmp %>%
filter(mate_age > 10) %>%
group_by(sex) %>%
summarise(
Mate_Age_mean = round(mean(mate_age, na.rm = TRUE), 2))
## # A tibble: 2 × 2
## sex Mate_Age_mean
## <fct> <dbl>
## 1 female 32.6
## 2 male 29.2
The same functions were used for the same purposes. The only difference was that this code calculated the participants' own mean ages (rather than their mates') for males and females.
sdmp %>%
filter(mate_age > 10) %>%
group_by(sex) %>%
summarise(
Age_mean = round(mean(age, na.rm = TRUE), 2))
## # A tibble: 2 × 2
## sex Age_mean
## <fct> <dbl>
## 1 female 30.2
## 2 male 31.4
Women reported partners older than themselves, M = 2.43.
32.61 - 30.18 = 2.43
Men reported partners younger than themselves, M = −2.26.
29.19 - 31.45 = -2.26
Alternative Way
This code created a new data frame ‘age_diff_sdmp’ from the original ‘sdmp’ data set.
The filter() function was used to filter for mate ages over 10. The mutate() function was then used to create a new column called ‘age_difference’ (to represent the difference between participant age and their mate’s age), and this column was moved with the relocate() function so that it could come after mate_age. This was helpful as by default, R places new variables created through the mutate() function as the last column.
The new ‘age_diff_sdmp’ data frame was then used with the group_by() function to define the variable of interest as ‘sex’, with the summarise() function calculating the mean age difference for both males and females.
Women reported partners older than themselves, M = 2.43. Men reported partners younger than themselves, M = −2.26.
age_diff_sdmp <- sdmp %>%
filter(mate_age > 10) %>%
mutate(age_difference = mate_age - age) %>%
relocate(age_difference, .after = mate_age)
age_diff_sdmp %>% group_by(sex) %>%
summarise(
Mean = round(mean(age_difference, na.rm = TRUE), 2))
## # A tibble: 2 × 2
## sex Mean
## <fct> <dbl>
## 1 female 2.43
## 2 male -2.26
Multivariate Effect Sizes
Initially, I (and everyone else) did not understand what Mahalanobis D represented. After looking it up, we learned that Mahalanobis distance is essentially a way of expressing how far a point lies from the centre of a multivariate distribution in standard-deviation-like units; here it quantifies the overall multivariate difference between men's and women's preference profiles.
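To make the idea concrete, here is an illustrative sketch (not taken from the authors' script) of computing D between the male and female centroids of the five preference ratings using a pooled covariance matrix. 'prefs' and 'grp' are hypothetical stand-ins for a data frame of the five rating columns and the sex variable, assumed here to contain no missing values.
# Mahalanobis D between two group centroids, using the pooled covariance matrix
mahalanobis_d <- function(prefs, grp) {
  x1 <- as.matrix(prefs[grp == "male", ])
  x2 <- as.matrix(prefs[grp == "female", ])
  # pooled within-group covariance matrix
  pooled <- ((nrow(x1) - 1) * cov(x1) + (nrow(x2) - 1) * cov(x2)) /
    (nrow(x1) + nrow(x2) - 2)
  diff <- colMeans(x1) - colMeans(x2)             # difference between group means
  sqrt(drop(t(diff) %*% solve(pooled) %*% diff))  # D
}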
The paper’s OSF Repository included three separate data files related to Mahalanobis D. Initially we thought that the data files with ‘SD’ represented ‘standard deviation’ or perhaps ‘standardised’ however, we later realised that it referred to ‘sex differentiated’.
The read_csv() function from the readr package was used to read the data files available from the OSF Repository. As before, '<-' assigns the object on the right to the name on the left.
For reference:
The data file ‘MahaDandCIs.csv’ referred to the Mahalanobis D calculated on the basis of all five of the mate preference variables.
The data file ‘MahaDandCIs-SD.csv’ referred to the Mahalanobis D calculated for sex differentiated preferences (i.e. financial prospects, and physical attractiveness).
The data file ‘MahaDandCIs-noSD.csv’ referred to the Mahalanobis D calculated for non sex-differentiated preferences (i.e. intelligence, kindness, and health)
# read in the file: MahaDandCIs.csv and save as 'mlbd'
# Mahalanobis D on the basis of all five of the preference
# variables
mlbd <- read_csv("MahaDandCIs.csv")
# read in the file: MahaDandCIs-SD.csv and save as
# 'sd_mlbd' Mahalanobis D for sex differentiated
# preferences (i.e. financial prospects, and physical
# attractiveness)
sd_mlbd <- read_csv("MahaDandCIs-SD.csv")
# read in MahaDandCIs-noSD.csv and save as 'nosd_mlbd'
# Mahalanobis D for non sex-differentiated preferences
# (i.e. intelligence, kindness, and health)
nosd_mlbd <- read_csv("MahaDandCIs-noSD.csv")
“We found when calculating the Mahalanobis D between males and females that on the basis of all five preference variables within each country, the overall sex difference was relatively large, mean D = 0.73. These D values ranged from 1.42… in Georgia to 0.30… in Nigeria.”
Given that Walter et al. (2020) stated that “When calculating the Mahalanobis D…on the basis of all five preference variables”, the ‘mlbd’ data frame was used. The summarise() function was used to obtain the mean, minimum, and maximum values for the mahalanobis D, with the round() function specifying the desired number of decimal places as two.
mlbd %>%
summarise(
Mean = round(mean(mlbd$D),2),
Min = round(min(mlbd$D), 2),
Max = round(max(mlbd$D), 2)
)
## # A tibble: 1 × 3
## Mean Min Max
## <dbl> <dbl> <dbl>
## 1 0.73 0.3 1.42
To determine which countries had the highest and lowest D scores, the arrange() function was used to order rows by their D value. R arranges values by ascending order by default, so desc() was used to specify the order as descending.
The first row was Georgia (i.e. highest D value, 1.42)
mlbd %>%
arrange(desc(D))
## # A tibble: 45 × 13
## country D overlap overlap2 DlowCI DhighCI OlowCI OhighCI d_int d_kind
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Georgia 1.42 0.479 0.315 1.15 1.86 0.215 0.393 -0.444 -0.549
## 2 Iran 1.14 0.570 0.399 0.913 1.50 0.293 0.479 -0.414 -0.0186
## 3 Indone… 1.11 0.580 0.409 0.816 1.72 0.242 0.519 0 -0.203
## 4 El Sal… 1.10 0.582 0.410 0.740 1.92 0.202 0.552 0.131 -0.368
## 5 Germany 1.09 0.587 0.415 0.754 1.80 0.226 0.546 -0.618 -0.102
## 6 China 0.983 0.623 0.453 0.796 1.24 0.365 0.527 -0.508 0.00465
## 7 Poland 0.973 0.627 0.456 0.828 1.17 0.388 0.514 -0.337 -0.0468
## 8 Cuba 0.933 0.641 0.472 0.740 1.25 0.364 0.552 -0.142 -0.263
## 9 Peru 0.888 0.657 0.489 0.655 1.25 0.362 0.592 -0.281 -0.209
## 10 Austria 0.866 0.665 0.498 0.592 1.35 0.333 0.622 0.0569 -0.199
## # … with 35 more rows, and 3 more variables: d_health <dbl>, d_physatt <dbl>,
## # d_resources <dbl>
This code removed the desc() specification and relied on the default i.e. ascending order, in order to find the country with the lowest D score.
The first row was Nigeria (i.e. lowest D value, 0.30)
mlbd %>%
arrange(D)
## # A tibble: 45 × 13
## country D overlap overlap2 DlowCI DhighCI OlowCI OhighCI d_int d_kind
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Nigeria 0.300 0.881 0.787 0.193 0.619 0.609 0.858 -0.0734 -0.117
## 2 Uganda 0.337 0.866 0.764 0.210 0.702 0.569 0.846 0.0931 0.163
## 3 Romania 0.347 0.862 0.758 0.226 0.727 0.558 0.835 -0.0798 0.0152
## 4 Sweden 0.382 0.849 0.737 0.248 0.705 0.568 0.820 -0.152 -0.285
## 5 Slovenia 0.403 0.840 0.725 0.287 0.594 0.621 0.795 -0.0449 -0.229
## 6 Pakistan 0.472 0.813 0.685 0.337 0.680 0.580 0.764 -0.0767 -0.161
## 7 Costa Rica 0.478 0.811 0.682 0.316 0.935 0.471 0.777 -0.180 0.00981
## 8 Portugal 0.504 0.801 0.668 0.343 0.821 0.517 0.760 0.0215 -0.0343
## 9 Malaysia 0.508 0.800 0.666 0.338 0.812 0.520 0.764 -0.167 -0.190
## 10 Estonia 0.531 0.791 0.654 0.350 0.892 0.488 0.756 -0.214 0.0776
## # … with 35 more rows, and 3 more variables: d_health <dbl>, d_physatt <dbl>,
## # d_resources <dbl>
Alternative Way
An alternative method would be to use filter() with the D variable itself in order to obtain the minimum and maximum values.
Resource which highlighted how to do this:
mlbd %>%
filter(D == max(D)) # The use of '==' means 'exactly equals'
## # A tibble: 1 × 13
## country D overlap overlap2 DlowCI DhighCI OlowCI OhighCI d_int d_kind
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Georgia 1.42 0.479 0.315 1.15 1.86 0.215 0.393 -0.444 -0.549
## # … with 3 more variables: d_health <dbl>, d_physatt <dbl>, d_resources <dbl>
mlbd %>%
filter(D == min(D))
## # A tibble: 1 × 13
## country D overlap overlap2 DlowCI DhighCI OlowCI OhighCI d_int d_kind
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Nigeria 0.300 0.881 0.787 0.193 0.619 0.609 0.858 -0.0734 -0.117
## # … with 3 more variables: d_health <dbl>, d_physatt <dbl>, d_resources <dbl>
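Another option (a sketch) is dplyr's slice_max() and slice_min(), which return the rows with the largest and smallest D values directly:
mlbd %>% slice_max(D, n = 1) # country with the highest D (Georgia)
mlbd %>% slice_min(D, n = 1) # country with the lowest D (Nigeria)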
“Additionally, D was calculated separately for putatively sex-differentiated preferences (good financial prospects and physical attractiveness), resulting in an average D of 0.62, ranging from 0.26… in Sweden to 1.08… in Georgia.”
Essentially, this code repeated the process of 1) obtaining summary statistics and 2) filtering the D to obtain the minimum and maximum D values. The only difference was that the ‘sd_mlbd’ data frame was used, as Walter et al. (2020) stated “D was calculated separately for putatively sex-differentiated preferences…”
sd_mlbd %>%
summarise(
Mean = round(mean(sd_mlbd$D), 2),
Min = round(min(sd_mlbd$D), 2),
Max = round(max(sd_mlbd$D), 2)
)
## # A tibble: 1 × 3
## Mean Min Max
## <dbl> <dbl> <dbl>
## 1 0.62 0.26 1.08
sd_mlbd %>%
filter(D == max(D))
## # A tibble: 1 × 8
## country D overlap overlap2 DlowCI DhighCI OlowCI OhighCI
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Georgia 1.08 0.590 0.418 0.774 1.48 0.298 0.537
sd_mlbd %>%
filter(D == min(D))
## # A tibble: 1 × 8
## country D overlap overlap2 DlowCI DhighCI OlowCI OhighCI
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Sweden 0.256 0.898 0.815 0.0776 0.518 0.661 0.940
“For those preferences not expected to be as strongly sex differentiated (intelligence, kindness, health), the Mahalanobis D was comparatively small: 0.33, ranging from 0.05… in Italy to 0.73… in Germany.”
Again, the same process was repeated with the ‘nosd_mlbd’ data frame, as Walter et al. (2020) stated “Preferences not expected to be as strongly sex differentiated…”.
nosd_mlbd %>%
summarise(
Mean = round(mean(nosd_mlbd$D), 2),
Min = round(min(nosd_mlbd$D), 2),
Max = round(max(nosd_mlbd$D), 2)
)
## # A tibble: 1 × 3
## Mean Min Max
## <dbl> <dbl> <dbl>
## 1 0.33 0.05 0.73
nosd_mlbd %>%
filter(D == max(D))
## # A tibble: 1 × 8
## country D overlap overlap2 DlowCI DhighCI OlowCI OhighCI
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Germany 0.726 0.717 0.559 0.364 1.31 0.343 0.748
nosd_mlbd %>%
filter(D == min(D))
## # A tibble: 1 × 8
## country D overlap overlap2 DlowCI DhighCI OlowCI OhighCI
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Italy 0.0502 0.980 0.961 0.0505 0.336 0.765 0.961
Initially we weren't quite sure how to reproduce the figures, as our data set seemed to be missing variables needed for the plots (e.g. the 'gepcascores' variable). After discussing this issue with Jenny in a coding Q&A session, she suggested that we could write a new csv file from the authors' R script and use the data they had prepared for the plots. This was necessary because those variables were derived by running various analyses, many of which we weren't familiar with. Her suggestion to run the necessary parts of their code and save the result as a new data frame simplified the process greatly, as all the variables were then included.
We therefore used a new data frame containing the prepared data (i.e. lines 14-579) from the authors' R script. The R script is titled 'OSFallages08122019.R' and is available from the OSF Repository.
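The core of that fix was simply writing the prepared data frame out to a new csv once the relevant lines of their script had been run; something along these lines (a sketch, using the object name 'data' from their script):
# after running lines 14-579 of OSFallages08122019.R, which build the 'data' object
readr::write_csv(data, "walter.csv")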
The read_csv() function from the readr package was used to read the prepared data file 'walter.csv' that we created from Walter et al.'s R script.
The data set was saved as 'data', since Walter et al. (2020) had used a data frame called 'data' in their code. By keeping the same name, we could run the analyses in their code more easily, as we didn't need to go through and rename the data frame throughout their script.
data <- read_csv("walter.csv")
glimpse(data)
## Rows: 14,399
## Columns: 46
## $ PIN <dbl> 12506, 1997, 1448, 10625, 6106, 4078, 3034, 5281, …
## $ CIN <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ continent <chr> "Africa", "Africa", "Africa", "Africa", "Africa", …
## $ country <chr> "Algeria", "Algeria", "Algeria", "Algeria", "Alger…
## $ city <chr> "Algiers", "Algiers", "Setif", "Setif", "Algiers",…
## $ countrycode <dbl> 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24…
## $ partnum <chr> "85", "72", "277", "229", "23", "82", "86", "135",…
## $ partcode <chr> "A85", "A72", "SB277", "S229", "A23", "A82", "A86"…
## $ sample <dbl> 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 2, 1, 1,…
## $ sex <dbl> 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ age <dbl> 21, 22, 47, 20, 23, 20, 22, 27, 19, 19, 19, 28, 22…
## $ religious <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ religion <chr> "islam", "islam", "Islam", "Islam", "islam", "isla…
## $ relstat <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ relstat2 <dbl> 2, 2, 4, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 2, 2, 2, 3,…
## $ rellength <dbl> 28, NA, 307, NA, NA, 14, 14, 9, NA, NA, NA, 14, NA…
## $ ideal_intelligence <dbl> 3, 7, 5, 4, 7, 6, 6, 7, 5, 5, 4, 6, 6, 7, 6, 7, 4,…
## $ ideal_kindness <dbl> 7, 7, 5, 7, 7, 7, 7, 7, 7, 5, 7, 6, 7, 7, 7, 7, 5,…
## $ ideal_health <dbl> 6, 7, 5, 7, 7, 7, 7, 7, 5, 7, 7, 6, 7, 7, 6, 4, 5,…
## $ ideal_physatt <dbl> 4, 7, 5, 7, 7, 7, 7, 6, 7, 5, 5, 6, 6, 7, 6, 2, 6,…
## $ ideal_resources <dbl> 1, 6, 5, 4, 7, 7, 7, 6, 4, 5, 4, 6, 5, 7, 7, 1, 4,…
## $ mate_age <dbl> 16, 17, 17, 17, 18, 18, 18, 18, 18, 18, 18, 19, 19…
## $ popsize <dbl> 39667, 39667, 39667, 39667, 39667, 39667, 39667, 3…
## $ country_religion <chr> "Muslim", "Muslim", "Muslim", "Muslim", "Muslim", …
## $ lattitude <dbl> 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28…
## $ gem1995 <dbl> 0.266, 0.266, 0.266, 0.266, 0.266, 0.266, 0.266, 0…
## $ gdi1995 <dbl> 0.508, 0.508, 0.508, 0.508, 0.508, 0.508, 0.508, 0…
## $ gii <dbl> 0.429, 0.429, 0.429, 0.429, 0.429, 0.429, 0.429, 0…
## $ gdi2015 <dbl> 0.854, 0.854, 0.854, 0.854, 0.854, 0.854, 0.854, 0…
## $ gggi <dbl> 0.642, 0.642, 0.642, 0.642, 0.642, 0.642, 0.642, 0…
## $ gdp_percap <dbl> 15100, 15100, 15100, 15100, 15100, 15100, 15100, 1…
## $ infect_death <dbl> 0.000196637, 0.000196637, 0.000196637, 0.000196637…
## $ infect_yll <dbl> 0.01025033, 0.01025033, 0.01025033, 0.01025033, 0.…
## $ cmc_yll <dbl> 0.05141553, 0.05141553, 0.05141553, 0.05141553, 0.…
## $ gb_path <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ agediff <dbl> -5, -5, -30, -3, -5, -2, -4, -9, -1, -1, -1, -9, -…
## $ pathindex <dbl> -0.1635533, -0.1635533, -0.1635533, -0.1635533, -0…
## $ gepcascores <dbl> -1.298351, -1.298351, -1.298351, -1.298351, -1.298…
## $ zagediff <dbl> -0.66236022, -0.66236022, -4.37646742, -0.36523164…
## $ zideal_resources <dbl> -3.5742767, 0.5722421, -0.2570617, -1.0863654, 1.4…
## $ zideal_intelligence <dbl> -3.07225356, 1.05405819, -1.00909768, -2.04067562,…
## $ zideal_kindness <dbl> 0.8156212, 0.8156212, -1.1846226, 0.8156212, 0.815…
## $ zideal_health <dbl> -0.0523192, 0.8896893, -0.9943277, 0.8896893, 0.88…
## $ zideal_physatt <dbl> -1.5139923, 1.1689482, -0.6196788, 1.1689482, 1.16…
## $ zcmc_yll <dbl> 0.09657417, 0.09657417, 0.09657417, 0.09657417, 0.…
## $ zpathindex <dbl> -0.163697, -0.163697, -0.163697, -0.163697, -0.163…
Creating a country list
The unique() function from base r was used to return a vector for the ‘country’ variable with duplicate elements or rows removed. The $ was used to specify the variable ‘country’ within the ‘data’ data frame.
countrylist <- unique(data$country)
Figure 1
Running lines 82-107 and 1122-1156 from the Walter et al. (2020) R script. As mentioned before, a lot of this code involved analyses outside the scope of our knowledge. To extract lines of code, a process of trial and error was used (i.e. we took a few lines at a time and went back if error messages indicated that the relevant variables were not found). Overall, we tried our best to minimise our use of their R script.
################ Sex Differences in Preferences
################ ###################
# Good Financial Prospects
gfpdiff <- lmer(zideal_resources ~ sex + (1 + sex | CIN), data = data)
gfpslopes <- coef(gfpdiff)$CIN[, 2]
# Physical Attractiveness
padiff <- lmer(zideal_physatt ~ sex + (1 + sex | CIN), data = data)
paslopes <- coef(padiff)$CIN[, 2]
# Health
hdiff <- lmer(zideal_health ~ sex + (1 + sex | CIN), data = data)
## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
## Model failed to converge with max|grad| = 0.00221845 (tol = 0.002, component 1)
hslopes <- coef(hdiff)$CIN[, 2]
# Intelligence
intdiff <- lmer(zideal_intelligence ~ sex + (1 + sex | CIN),
data = data)
intslopes <- coef(intdiff)$CIN[, 2]
# Kindness
kinddiff <- lmer(zideal_kindness ~ sex + (1 + sex | CIN), data = data)
kindslopes <- coef(kinddiff)$CIN[, 2]
# Age all ages
agediff <- lmer(zagediff ~ sex + (1 + sex | CIN), data = data)
ageslopes <- coef(agediff)$CIN[, 2]
##### Sex Differences in Preferences Data #####
# health
datahealth <- data.frame(country = unique(data$country), slopes = coef(hdiff)$CIN[,
2], n = tapply(data$zideal_health, data$country, length))
healthslopes <- qplot(slopes, country, color = country, data = datahealth,
xlab = "Sex Difference in Health Preference", ylab = "Country",
size = I(4)) + theme_classic(base_size = 25) + geom_vline(xintercept = 0,
size = 2) + theme(legend.position = "none") + coord_cartesian(xlim = c(-1.25,
1.25))
# physical attractiveness
datapa <- data.frame(country = unique(data$country), slopes = coef(padiff)$CIN[,
2], n = tapply(data$zideal_physatt, data$country, length))
physattslopes <- qplot(slopes, country, color = country, data = datapa,
xlab = "Sex Difference in Physical Attractiveness Preference",
ylab = "", size = I(4)) + theme_classic(base_size = 20) +
geom_vline(xintercept = 0, size = 2) + theme(legend.position = "none") +
coord_cartesian(xlim = c(-1.25, 1.25))
# good financial prospects
datagfp <- data.frame(country = unique(data$country), slopes = coef(gfpdiff)$CIN[,
2], n = tapply(data$zideal_resources, data$country, length))
resourceslopes <- qplot(slopes, country, color = country, data = datagfp,
xlab = "Sex Difference in Good Financial Prospects Preference",
ylab = "Country", size = I(4)) + theme_classic(base_size = 20) +
geom_vline(xintercept = 0, size = 2) + theme(legend.position = "none") +
coord_cartesian(xlim = c(-1.25, 1.25))
# kindness
datakind <- data.frame(country = unique(data$country), slopes = coef(kinddiff)$CIN[,
2], n = tapply(data$zideal_kindness, data$country, length))
kindslopes <- qplot(slopes, country, color = country, data = datakind,
xlab = "Sex Difference in Kindness Preference", ylab = "Country",
size = I(4)) + theme_classic(base_size = 25) + geom_vline(xintercept = 0,
size = 2) + theme(legend.position = "none") + coord_cartesian(xlim = c(-1.25,
1.25))
# intelligence
dataint <- data.frame(country = unique(data$country), slopes = coef(intdiff)$CIN[,
2], n = tapply(data$zideal_intelligence, data$country, length))
intslopes <- qplot(slopes, country, color = country, data = dataint,
xlab = "Sex Difference in Intelligence Preference", ylab = "Country",
size = I(4)) + theme_classic(base_size = 25) + geom_vline(xintercept = 0,
size = 2) + theme(legend.position = "none") + coord_cartesian(xlim = c(-1.25,
1.25))
##### Megaplot with All Preferences Together #####
# need to adjust age variable to eliminate mate_age under
# 10 subset data to only include those who reported a mate
# age ### n = 8920
data3 <- data[complete.cases(data$mate_age), ]
### subset data to include only those with mates older than
### 10 ### n = 8614
data3 <- data3[data3$mate_age > 10, ]
# create age difference variable
data3$agediff <- (data3$mate_age - data3$age)
# standardize outcome variable
data3$zagediff <- scale(data3$agediff)
# get slopes
agediff2 <- lmer(zagediff ~ sex + (1 + sex | CIN), data = data3)
ageslopes2 <- coef(agediff2)$CIN[, 2]
# create plotting data frame for age difference w/o mate
# age under 10
levels(data3$sex) <- c("Females", "Males")
data3$country <- factor(data3$country)
dataage2 <- data.frame(country = unique(data3$country), slopes = coef(agediff2)$CIN[,
2], n = tapply(data3$zagediff, data3$country, length))
# tag data frame with preference variable they refer to
datahealth$variable <- "Health"
datapa$variable <- "Physical Attractiveness"
datagfp$variable <- "Good Financial Prospects"
dataage2$variable <- "Age Choice"
datakind$variable <- "Kindness"
dataint$variable <- "Intelligence"
# create data frame
superdata <- rbind(datahealth, datapa, datagfp, dataage2, datakind,
dataint)
superdata$variable <- as.factor(superdata$variable)
superdata$variable <- factor(superdata$variable, levels = c("Age Choice",
"Good Financial Prospects", "Physical Attractiveness", "Intelligence",
"Kindness", "Health"))
Viewing this 'superdata' data frame (see above; it was created through Walter et al.'s code)
glimpse() was used to present a more compact view of the data, allowing me to see each of the columns (i.e. variables) in the data frame, as well as see their data type.
glimpse(superdata)
## Rows: 266
## Columns: 4
## $ country <chr> "Algeria", "Australia", "Austria", "Belgium", "Brazil", "Bulg…
## $ slopes <dbl> 0.01684815, -0.06542704, -0.04481616, -0.28715237, -0.0909391…
## $ n <int> 614, 508, 197, 419, 296, 115, 203, 396, 156, 148, 396, 260, 8…
## $ variable <fct> Health, Health, Health, Health, Health, Health, Health, Healt…
After inspecting the new data frame, I saw that it had all the relevant variables mentioned in the authors' R script for plot 1, so I decided to try to reverse-engineer the figure. When I ran their R script for plot 1, however, the plot produced looked very different from the figure included in their paper. The aesthetic features differed (font, size of the dots, perhaps even spacing), and the plot itself seemed different too; for example, certain data points visible in the figure appeared to be missing from the plot produced by the code.
This really confused me, because I hadn't made any edits to their code; it should have produced the same plot, but it didn't. The only explanation we could think of was that further edits may have been made to the code that Walter et al. (2020) neglected to include. We weren't sure.
First Attempt:
The ggplot() function from the ggplot2 package was used to create a plot with the 'superdata' data frame.
geom_jitter() jittered the data to reduce overplotting, as that was something Walter et al. (2020) had mentioned and included in their R script. geom_vline() was also used to add a vertical reference line at x = 0, with the 'size' argument increasing the line's thickness.
The labs() function was used to add labels to the x and y axes, and theme_classic() from ggplot2 was used to change the background colour to white. The scale_x_continuous() function set the x-axis ticks in increments of 0.5, ranging from -2 to 2. The easy_remove_legend() function from ggeasy was also used to remove the legend.
plot_one <- ggplot(data = superdata,
mapping = aes(
x = slopes,
y = variable,
geom = "blank",
colour = variable)) +
labs(x = "Sex Difference (Males - Females)", y = "Difference Variable")+
geom_jitter(size = I(2.5), height = .30)+
geom_vline(xintercept = 0, size = 2)+
theme_classic(base_size = 25)+
scale_x_continuous(breaks = seq(-2, 2, .5),
limits = c(-2,2))+
easy_remove_legend()+
ggtitle(label = "Scatterplot of sex differences in mate preference ratings")
print(plot_one)
Final attempt: Adding a seed and using geom_point()
The first plot (above) looked similar to the figure but it still felt off somehow.
Within their figure caption, Walter et al. (2020) stated that the data were jittered to reduce overplotting, which is why we had used geom_jitter(). However, we noticed that the data points appeared to be scattered differently each time we ran the code. After many discussions and attempts at troubleshooting, we had a breakthrough and realised that the points were off because jittering adds random noise to the positions, so the data scattered differently every time the code was run.
The solution was to use a seed. Upon further inspection of Walter et al.'s (2020) R script, we were able to find a seed number. A helpful resource we found regarding how to use a seed was:
Rather than calling geom_jitter() as Walter et al. (2020) had in their script, we used geom_point() with position_jitter(), which allowed us to set the seed for the jitter. We used the seed number from their R script (line 18: 1302019).
plot_one <- ggplot(
data = superdata,
mapping = aes(
x = slopes,
y = variable,
colour = variable)) +
geom_point(position = position_jitter(seed = 1302019),
size = 1.5) +
geom_vline(xintercept = 0, size=1) +
theme_classic() +
labs( x="Sex Difference (Males - Females)", y="Difference Variable") +
scale_x_continuous(
breaks = seq(from= -2, to=2, by=0.5),
limits=c(-2,2)) +
easy_remove_legend()+
ggtitle(label = "Scatterplot of sex differences in mate preference ratings")
print(plot_one)
Great! The figure created above resembled the figure from the paper.
Figure 2
First attempt:
The 'age_diff_sdmp' data frame was used, as it had already excluded participants who reported mate ages of 10 or under. The aesthetic mappings placed 'age' on the x-axis and 'age_difference' on the y-axis, and the colour aesthetic differentiated the points by sex.
geom_point() was used to create a scatterplot and geom_smooth() was used to add a smoothing line, which enables patterns within the plot to be more clearly seen. A horizontal reference line was also included through the function geom_hline(), with the argument yintercept = 0 specifying that it should lie at 0 on the y-axis.
theme_apa() from the papaja package was used to remove the grey background and make the overall plot look clearer. scale_y_continuous() and scale_x_continuous() were then used to adjust the aesthetic features of the x and y axes, with the 'name' arguments specifying the axis titles. The 'breaks' argument was used to adjust the tick marks, with seq() specifying the minimum and maximum values and the increment of the sequence (10).
plot_two <- ggplot(data = age_diff_sdmp,
mapping = aes(
x = age,
y = age_difference,
colour = sex)) +
geom_point() +
geom_smooth() +
geom_hline(yintercept = 0) +
theme_apa()+
scale_y_continuous(name = "Age Difference (Partner - Self",
breaks = seq(from = -50, to = 30, by = 10))+
scale_x_continuous(name = "Participant Age (Years)",
breaks = seq(from = 20, to = 90, by = 10))
print(plot_two)
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
Second attempt: Adding theme_classic() instead, and using scale_colour_manual()
After glancing at Walter et al.'s R script, it was apparent that they had used theme_classic() as opposed to theme_apa(), so the theme was changed accordingly. The scale_colour_manual() function was used to map specific colours manually; through some trial and error, "limegreen" and "dodgerblue" were chosen, as they more closely resembled the colours within the figure. The c() function was used to combine the labels and colours into vectors that R could interpret.
plot_two <- ggplot(data = age_diff_sdmp,
mapping = aes(
x = age,
y = age_difference,
colour = sex
)) +
geom_point() +
geom_smooth(
size = 3,
method = "loess") +
geom_hline(yintercept = 0) +
theme_classic() +
scale_y_continuous(
name = "Age Difference (Partner - Self)",
breaks = seq(from = -50, to = 30, by = 10)) +
scale_x_continuous(
name = "Participant Age (Years)",
breaks = seq(from = 20, to = 90, by = 10)) +
scale_colour_manual(
name="",
labels=c("Female","Male"),
values = c("dodgerblue", "limegreen"))
print(plot_two)
## `geom_smooth()` using formula 'y ~ x'
Final attempt: Adjusting alpha to '0.2' instead of '1', and including easy_remove_legend_title()
The code was adjusted slightly to set alpha to 0.2, changing the opacity of the points within the plot. This was useful, as it visually reduced the weight of overlapping data points, much like the figure in the paper. The easy_remove_legend_title() function from the ggeasy package was also used to remove the legend title.
plot_two <- ggplot(data = age_diff_sdmp,
mapping = aes(
x = age,
y = age_difference,
colour = sex)) +
geom_point(alpha = 0.2) +
geom_smooth(
size = 3,
method = "loess") +
geom_hline(yintercept = 0) +
theme_classic() +
scale_y_continuous(
name = "Age Difference (Partner - Self)",
breaks = seq(from = -50, to = 30, by = 10)) +
scale_x_continuous(
name = "Participant Age (Years)",
breaks = seq(from = 20, to = 90, by = 10)) +
scale_colour_manual(
name="",
labels = c("Female","Male"),
values = c("dodgerblue", "limegreen")) +
easy_remove_legend_title()
print(plot_two)
## `geom_smooth()` using formula 'y ~ x'
Figure 2 was successfully reproduced!
Figure 3
I had already read in the file 'MahaDandCIs.csv' and saved it as 'mlbd'. As stated before, this Mahalanobis D was based on all five of the preference variables.
First attempt:
This code used the 'mlbd' data frame to plot each country by its Mahalanobis D value.
The colour aesthetic was used to differentiate the countries by colour (i.e. colour-code the countries). geom_point() was then used to create a scatterplot, with the xlim() helper setting the x-axis limits to 0 and 2 so that the axis matched the figure in the paper.
geom_vline() was used to add a vertical line at x = 0, as seen in the figure in the paper. geom_errorbarh() was then used to add horizontal error bars to the plot, with the aesthetic mappings supplying the CI limits. labs() was used to add axis labels, and theme_classic() cleaned up the plot to make it resemble the figure in the paper more closely. The easy_remove_legend() function from the ggeasy package was also used to remove the legend at the side of the plot.
plot_three <- ggplot(
data = mlbd,
mapping = aes(
x = D,
y = country,
colour = country
)
) +
geom_point(size = 2) +
xlim(0,2) +
geom_vline(xintercept = 0, size = 1) +
geom_errorbarh(
mapping = aes(xmin = DlowCI, xmax = DhighCI)) +
labs(
x = expression(paste("Mahalanobis D")),
y = "Country") +
theme_classic() +
easy_remove_legend()
print(plot_three)
Final attempt: Changing the size of the dots and italicising the ‘D’
There weren’t many changes from the code above. The size aesthetic within the geom_point() function was changed in order to make the dots bigger and slightly better matched to the figure. Additionally, the ‘D’ of Mahalanobis D was italicised.
plot_three <- ggplot(
data = mlbd,
mapping = aes(
x = D,
y = country,
colour = country
)
) +
geom_point(size = 3) +
xlim(0,2) +
geom_vline(xintercept = 0, size = 1) +
geom_errorbarh(
mapping = aes(xmin = DlowCI, xmax = DhighCI)) +
labs(
x = expression(paste("Mahalanobis ", italic("D"))),
y = "Country") +
theme_classic() +
easy_remove_legend()
print(plot_three)
Another one down! It was a pretty good match with the figure in the paper too (see below).
Figure 4
Running lines 1174-1188 from the Walter et al. (2020) R script (see the 'gender equality analyses' section of their script). As mentioned before, much of the code involved analyses outside the scope of our knowledge. To extract lines of code, a process of trial and error was used (i.e. we took a few lines at a time and went back for more if necessary). We tried our best to minimise our use of their R script though.
# Find the value of each gender equality variable for each
# country
gem1995av <- tapply(data$gem1995, data$country, mean)
gdi1995av <- tapply(data$gdi1995, data$country, mean)
giiav <- tapply(data$gii, data$country, mean)
gdi2015av <- tapply(data$gdi2015, data$country, mean)
gggiav <- tapply(data$gggi, data$country, mean)
ge_compav <- tapply(data$gepcascores, data$country, mean)
# new data frame with national level gender equality
# variables separated by sex
ge_plotdata <- data.frame(country = rep(countrylist, 2), sex = rep(c("Female",
"Male"), each = 45), gem1995 = rep(gem1995av, 2), gdi1995 = rep(gdi1995av,
2), gii = rep(giiav, 2), gdi2015 = rep(gdi2015av, 2), gggi = rep(gggiav,
2), ge_comp = rep(ge_compav, 2))
# make data frame for age difference
age_ge_plotdata <- data.frame(ge_plotdata, agedif = c(tapply(data$agediff[data$sex ==
0], data$country[data$sex == 0], function(x) mean(x, na.rm = T)),
tapply(data$agediff[data$sex == 1], data$country[data$sex ==
1], function(x) mean(x, na.rm = T))))
First attempt:
The data used was 'age_ge_plotdata' (see above in Walter et al.'s code). aes() was used to specify how the variables would be mapped, with the ge_comp variable (i.e. gender equality composite score) on the x axis and the agedif variable (i.e. age difference) on the y axis. Additionally, colour was used to differentiate between the sexes.
The plot had two key elements: geom_point() was used to create a scatterplot, and geom_smooth() added a smoothing line that helped to see patterns within the plot. The argument method = "lm" was used to fit a linear model (i.e. a straight line), with 'size' specifying the thickness of the smoothing line.
The theme was specified with theme_classic(), with the labs() function renaming the x and y axis labels. scale_colour_manual() was used to set the colour aesthetic manually, in order to better match the figure reported in the paper. easy_remove_legend_title() from the ggeasy package was also used to remove the legend title.
plot_four <- ggplot(data = age_ge_plotdata,
mapping = aes(
x = ge_comp,
y = agedif,
colour = sex)) +
geom_point() +
geom_smooth(method = "lm", size = 2) +
theme_classic() +
labs(
x = "Gender-Equality Composite",
y = "Age Difference (Partner-Self)") +
scale_colour_manual(
values = c("dodgerblue", "limegreen")) +
easy_remove_legend_title()
print(plot_four)
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 12 rows containing non-finite values (stat_smooth).
## Warning: Removed 12 rows containing missing values (geom_point).
The plot produced looked okay, but it was far from done. The y-axis limits weren't correct: they ranged from -15 to +5, instead of -5 to +5 as in the paper.
Second attempt: Using ylim()
To fix the y-axis limits, ylim() was used to set the limits to -5, +5.
plot_four <- ggplot(data = age_ge_plotdata,
mapping = aes(
x = ge_comp,
y = agedif,
colour = sex)) +
geom_point() +
geom_smooth(
method = "lm",
size = 2) +
ylim(-5, 5) +
theme_classic() +
labs(
x = "Gender-Equality Composite",
y = "Age Difference (Partner-Self)") +
scale_colour_manual(
values = c("dodgerblue", "limegreen")) +
easy_remove_legend_title()
print(plot_four)
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 14 rows containing non-finite values (stat_smooth).
## Warning: Removed 14 rows containing missing values (geom_point).
That problem was solved; however, another key issue remained, as the data points weren't entirely correct. We were stumped because on the lower end, there was an outlier that needed to be removed (see the blue dot among the sea of green). There may have been other discrepancies in the data points aside from that, but they were difficult to gauge, so we focused our efforts on that single outlier.
To find the data point, the filter() function was used to filter for females, as the outlier was blue and female scores were represented by the colour blue. The arrange() function was then used to sort the data by 'agedif', as we were trying to find a value with a large negative age difference. Then, the select() function was used to keep all columns aside from gem1995 through gggi, as those columns contained gender equality variables that weren't relevant here. head() from the utils package was then used to see the first few rows of the output.
outlier <- age_ge_plotdata %>%
filter (sex == "Female") %>% # '==' refers to 'exactly equal to'
arrange(agedif) %>%
select(!gem1995:gggi) # "!x' is used to denote 'not x'
head(outlier)
## country sex ge_comp agedif
## 1 Malaysia Female NA -8.93406593
## 2 Turkey Female -0.8874625 -4.37100737
## 3 Belgium Female 0.8974179 -0.27325581
## 4 Peru Female -0.3319754 -0.08235294
## 5 Norway Female 1.6578799 0.46666667
## 6 Lithuania Female 1.0676906 0.81521739
Malaysia was the top result; however, it couldn't be the outlier, as it had a missing ge_comp value and therefore wasn't part of the scatter plot. The second result was Turkey, and it had ge_comp and agedif values similar to the outlier.
Therefore, the filter() function was used to try to filter it out, with the number '-4.37100737' being the agedif value of the outlier. However, issues arose, as exact equality comparisons like this don't work reliably for floating-point (numeric) variables.
filter_plot4_data <- age_ge_plotdata %>%
filter(agedif != -4.37100737) # "!x' is used to denote 'not x'
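In hindsight, exact equality comparisons on decimal values are fragile; dplyr's near() function compares numbers within a small tolerance instead. A minimal sketch of that alternative (not the approach ultimately used here):
filter_plot4_data <- age_ge_plotdata %>%
  filter(!near(agedif, -4.37100737)) # near() allows for floating-point imprecision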
Another attempted approach was to filter according to country and sex, as the outlier was a female from Turkey. However, this didn't work either, as combining the two negated conditions with '&' removed everyone from Turkey and every female, not just the single outlier.
filter_plot4_data <- age_ge_plotdata %>%
filter(!country == "Turkey" & !sex == "Female") # "!x' is used to denote 'not x'
The approach finally settled on was to keep rows that were either not from Turkey or not female. This worked because the only rows excluded were those that were both from Turkey and female, which successfully filtered out the outlier, who WAS from Turkey and female. In the code, '|' was used, which means "or" in Boolean logic.
filter_plot4_data <- age_ge_plotdata %>%
filter(!country == "Turkey" | !sex == "Female") # "!x' is used to denote 'not x'
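For what it's worth, an equivalent way to express the same logic (by De Morgan's law) is to negate the combined condition directly; a quick sketch:
filter_plot4_data <- age_ge_plotdata %>%
  filter(!(country == "Turkey" & sex == "Female")) # drop rows that are BOTH Turkey AND female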
Third attempt: Trying out the new, filtered data frame when creating the plot
This code was the same; however, the 'filter_plot4_data' data frame created above was used.
plot_four <- ggplot(data = filter_plot4_data,
mapping = aes(
x = ge_comp,
y = agedif,
colour = sex)) +
geom_point() +
geom_smooth(
method = "lm",
size = 2) +
ylim(-5, 5) +
theme_classic() +
labs(
x = "Gender-Equality Composite",
y = "Age Difference (Partner-Self)") +
scale_colour_manual(
values = c("dodgerblue", "limegreen")) +
easy_remove_legend_title()
print(plot_four)
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 14 rows containing non-finite values (stat_smooth).
## Warning: Removed 14 rows containing missing values (geom_point).
We were happy that we were able to reproduce the figure as closely as we could, but it wasn't exact. We kept looking through the R script to see if we could figure it out and, four days before the VR report was due, after examining it for what felt like the billionth time, we found the relevant code!!
In their paper, Walter et al. (2020) had said that they were only reporting mate ages older than 10. This led us to wonder whether they had been doing the same for the plot. After checking the other R script (i.e. the over10 script), we found that the data frame we had been using was different from that of Walter et al. (2020). To remedy this, we added the relevant lines (lines 61-64 and 225-234) and then began to create the plot.
Running lines 61-64, 225-234, and 1174-1188 from the Walter et al. (2020) R script. As mentioned before, much of the code involved analyses outside the scope of our knowledge. To extract lines of code, a process of trial and error was used (i.e. we took a few lines at a time and went back for more if necessary). We tried our best to minimise our use of their R script though.
# subset data to only include those who reported a mate age ### n = 8920
data<-data[complete.cases(data$mate_age),]
# subset data to include only those with mates older than 10 ### n = 8614
data<-data[data$mate_age>10,]
### This is the same code as before but using the correct data (over10) ###
#create plotting dataset
data3<-data
# make country a factor
data3$country<-factor(data3$country)
#Find the value of each gender equality variable for each country
gem1995av<-tapply(data3$gem1995,data3$country,mean)
gdi1995av<-tapply(data3$gdi1995,data3$country,mean)
giiav<-tapply(data3$gii,data3$country,mean)
gdi2015av<-tapply(data3$gdi2015,data3$country,mean)
gggiav<-tapply(data3$gggi,data3$country,mean)
ge_compav<-tapply(data3$gepcascores,data3$country,mean)
# new dataframe with national level gender equality variables separated by sex
ge_plotdata<-data.frame("country"=rep(unique(data3$country),2),"sex"=rep(c("Female","Male"),each= 41),"gem1995"=rep(gem1995av,2),"gdi1995"=rep(gdi1995av,2),"gii"=rep(giiav,2),"gdi2015"=rep(gdi2015av,2),"gggi"=rep(gggiav,2),"ge_comp"=rep(ge_compav,2))
# create dataframe for plotting age of mate sex difference and gender equality
age_ge_plotdata<-data.frame(ge_plotdata,"agedif"=c(tapply(data3$agediff[data3$sex==0],data3$country[data3$sex==0],function(x) mean(x,na.rm=T)),tapply(data3$agediff[data3$sex==1],data3$country[data3$sex==1],function(x) mean(x,na.rm=T))))
Now creating the plot
Using the age_ge_plotdata (correct version!).
plot_four <- ggplot(data = age_ge_plotdata,
mapping = aes(
x = ge_comp,
y = agedif,
colour = sex
)) +
geom_point() +
geom_smooth(
method = "lm",
size = 2) +
ylim(-5, 5) +
theme_classic() +
labs(
x = "Gender-Equality Composite",
y = "Age Difference (Partner-Self)") +
scale_colour_manual(
values = c("dodgerblue", "limegreen")) +
easy_remove_legend_title()
print(plot_four)
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 4 rows containing non-finite values (stat_smooth).
## Warning: Removed 4 rows containing missing values (geom_point).
FINALLY!! Successfully reproduced the final figure!
Initially, I planned on first examining whether social desirability or demand characteristics may have influenced participant ratings for their ideal preferences. This was because Walter et al. (2020) collected all of their participant data in person, and the question format was quite directive (i.e. asking participants to rate their ideal preferences). The idea of participants possibly altering or adjusting their responses based on this didn’t seem too far outside the realm of reality.
A paper by Evans and Brase (2007) titled “Assessing sex differences and similarities in mate preferences: Above and beyond demand characteristics” addressed how many of the studies examining sex differences in mate preferences often rely on directive questions, and argued that having close-ended and forced-choice response formats could potentially guide respondents. This could lead to certain patterns of responses emerging. Additionally, they suggested that audience effects (i.e. a change in behaviour or performance due to being observed) could potentially influence participants as well.
In their study, Evans and Brase (2007) similarly examined sex differences in trait preferences for mates. The three sets of traits examined were: 1) ambitiousness (i.e. resources and income), 2) physical attractiveness, and 3) kindness and intelligence. However, unlike Walter et al. (2020), they utilised a minimally directive methodology, i.e. collecting data using open-ended questions and anonymous computer-based data collection. Additionally, they systematically controlled the target audience by having a same-sex or opposite-sex target.
Their results were similar to those of Walter et al. (2020) and provided no evidence of audience effects having an impact or of close-ended questions biasing responses. The fact that the results held even with these changes in methodology was insightful and demonstrated the robustness of the findings regarding sex differences in mate preferences.
Ultimately, their study provided me with the answers I needed.
In terms of my new research question….
In the paper, Walter et al. (2020) noted that participants completed the questionnaire on ideal mate preferences for a long-term romantic partner. These instructions appeared at the top of the questionnaire:
“For the following questions we are interested in what you desire in an ideal long-term mate (e.g. committed, romantic relationship). Each of the following is a trait that a potential mate might have. For each trait, please select the option that best represents your ideal long-term mate. Please remember we are interested in your preferences for ideal long-term (committed, romantic) mates.”
I was curious whether young people and older people differed in the qualities they desired in a long-term romantic partner. I had heard my parents say variations of phrases like "beauty fades" and "wisdom comes with age", and it led me to believe that as people get older, they develop greater maturity and start to place more weight on character-based qualities in place of more superficial ones.
This notion was supported by the findings of a recent study by Whyte, Brooks, Chan & Torgler (2021) in their article "Sex differences in sexual attraction for aesthetics, resources and personality across age". They conducted an online survey of over 7,000 Australians and found that both sexes placed greater importance on personality factors (i.e. openness, trust, emotional connection) as they aged. Surprisingly, however, male respondents aged 60 or over had higher preferences for personality factors than similarly aged female respondents.
However, the findings of the Walter et al. (2020) study were somewhat contradictory: men had higher preferences for physical attractiveness relative to women, and they increasingly reported younger mate age preferences relative to their own age. These results suggested that, in contrast to women, men valued more superficial qualities throughout their lifetime.
In any case, I was excited to dive into it!
My hypothesis:
For superficial qualities (i.e. physical attractiveness), young people will have higher mean preference ratings relative to older people. Men will tend to have higher preference ratings than women.
For more thoughtful qualities (i.e. kindness, intelligence), young people will have lower mean preference ratings relative to older people. Women will tend to have higher preferences than men.
In terms of how I intended to examine this, I thought grouping men and women by their age (i.e. young, middle-aged, or old) would be a good starting point. From there, I could see the differences between their preferences at various life stages.
I mainly looked at preferences for kindness and intelligence versus physical attractiveness, as the former two delve a bit more deeply into one's character. Preferences for ideal financial resources and health were not included, as they seemed less informative: ideally, everyone would like to be financially secure and to have a healthy partner with whom they can live a long life.
To answer the research question, the steps I planned to take were:
Group participants by gender (i.e. male or female), and then by age (i.e. young, middle-aged, or old).
Examine the ideal mate preferences for young men and women compared to their older counterparts.
Create a table of summary statistics, and potentially a plot of some form (perhaps a box plot?)
Let’s get cracking!
Creating a new, “clean” data frame with the variables of interest
I continued to work with the sdmp data set, as the variables were ‘cleaned’ (i.e. the variables had appropriate data types, the value labels were changed).
The select() function was used to select the variables I wanted to use. Additionally, the na.omit() function was used to exclude all incomplete cases/missing values from the data frame. I used it instead of 'na.rm = TRUE' as it eliminated the need to repeat that argument each time I calculated the means/SDs.
clean_data <- sdmp %>%
select(age, sex, starts_with("ideal_")) %>%
na.omit()
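For reference, tidyr's drop_na() is a pipe-friendly equivalent of na.omit(); a minimal sketch, assuming tidyr is installed:
library(tidyr)
clean_data_alt <- sdmp %>%
  select(age, sex, starts_with("ideal_")) %>%
  drop_na() # drops any row with a missing value in the selected columns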
Creating age groups for participants
The mutate() function was used to create a new variable, 'age_group', with the case_when() function from the 'dplyr' package specifying the age ranges that formed the groups. The as.factor() function was also used to change the 'age_group' variable into a factor.
one_data <- clean_data %>%
mutate(
age_group = dplyr::case_when(
age >= 18 & age <= 34 ~ "18-34",
age >= 35 & age <= 54 ~ "35-54",
age >= 55 ~ "> 55"
),
age_group = as.factor(
age_group))
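An alternative sketch of the same grouping uses cut(), which returns a factor directly; the breaks below simply restate the ranges used above, and 'one_data_alt' is an illustrative name:
one_data_alt <- clean_data %>%
  mutate(age_group = cut(age,
                         breaks = c(17, 34, 54, Inf),    # (17,34], (34,54], (54,Inf]
                         labels = c("18-34", "35-54", "> 55")))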
Viewing the new data frame
glimpse() was used to present a more compact view of the data, allowing me to see each of the columns (i.e. variables) in the data frame and their data type. This was helpful as I could check if the data frame only contained the variables I selected, as well as if the ‘age_group’ variable was a factor.
glimpse(one_data)
## Rows: 13,898
## Columns: 8
## $ age <dbl> 21, 22, 47, 20, 23, 20, 22, 27, 19, 19, 19, 28, 22,…
## $ sex <fct> male, male, female, male, male, male, male, female,…
## $ ideal_intelligence <dbl> 3, 7, 5, 4, 7, 6, 6, 7, 5, 5, 4, 6, 6, 7, 6, 7, 4, …
## $ ideal_kindness <dbl> 7, 7, 5, 7, 7, 7, 7, 7, 7, 5, 7, 6, 7, 7, 7, 7, 5, …
## $ ideal_health <dbl> 6, 7, 5, 7, 7, 7, 7, 7, 5, 7, 7, 6, 7, 7, 6, 4, 5, …
## $ ideal_physatt <dbl> 4, 7, 5, 7, 7, 7, 7, 6, 7, 5, 5, 6, 6, 7, 6, 2, 6, …
## $ ideal_resources <dbl> 1, 6, 5, 4, 7, 7, 7, 6, 4, 5, 4, 6, 5, 7, 7, 1, 4, …
## $ age_group <fct> 18-34, 18-34, 35-54, 18-34, 18-34, 18-34, 18-34, 18…
Just looking at the output produced, I noticed that a lot of participants seemed to belong to the 'young' age group. To check how balanced the groups were, I figured it was worth counting how many participants belonged to the young, middle-aged, and old groups.
Seeing the number of participants in each age group
The group_by() function was used to define the variable of interest as ‘age_group’ and the summarise() function was then used to count the total number of participants in each age group.
one_data %>%
group_by(age_group) %>%
summarise(n = n())
## # A tibble: 3 × 2
## age_group n
## <fct> <int>
## 1 > 55 331
## 2 18-34 10568
## 3 35-54 2999
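As a side note, dplyr's count() is shorthand for the same group_by() + summarise() pattern:
one_data %>%
  count(age_group) # equivalent to group_by(age_group) %>% summarise(n = n())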
I was right; there were far more young participants than middle-aged and old participants combined. This was an issue; however, it was good to be aware of the imbalance so I knew which age groups were over-represented and could factor that in when interpreting the results.
Calculating the mean, standard deviation, and standard error for physical attractiveness
This code calculated the mean preference rating for 'ideal_physatt' (i.e. physical attractiveness) for men and women in the age groups I had established.
The group_by() function was used to define the variables of interest as 'sex' and 'age_group', and summarise() was then used to calculate summary statistics (i.e. n, mean, SD and SE) for ideal physical attractiveness. The round() function from base R then specified the number of decimal places desired (i.e. 2).
The kbl() function from the 'kableExtra' package was used to create a table, with the 'caption' argument adding clarity about the results. The pack_rows() function was also used to group the rows for females and males together and make the table clearer.
table_physatt <- one_data %>%
group_by(sex, age_group) %>%
summarise(n = n(), mean = round(mean(ideal_physatt), 2),
sd = round(sd(ideal_physatt), 2), se = round(sd/sqrt(n),
2))
## `summarise()` has grouped output by 'sex'. You can override using the `.groups` argument.
table_physatt %>%
kbl(caption = "Preferences for Ideal Physical Attractiveness in a partner") %>%
kable_material("hover") %>%
pack_rows("Females", 1, 3) %>%
pack_rows("Males", 4, 6)
| sex | age_group | n | mean | sd | se |
|---|---|---|---|---|---|
| Females | |||||
| female | > 55 | 156 | 5.36 | 1.12 | 0.09 |
| female | 18-34 | 5772 | 5.56 | 1.10 | 0.01 |
| female | 35-54 | 1628 | 5.57 | 1.16 | 0.03 |
| Males | |||||
| male | > 55 | 175 | 5.91 | 1.21 | 0.09 |
| male | 18-34 | 4796 | 5.86 | 1.08 | 0.02 |
| male | 35-54 | 1371 | 5.84 | 1.09 | 0.03 |
Middle-aged women had the highest mean preference for physical attractiveness, though they also had the largest standard deviation. Young women had the second highest mean preference, followed by older women.
In contrast, older men had the highest mean preference for physical attractiveness and the greatest standard deviation. Young men had the second highest mean preference, followed by middle-aged men.
Does this finding support my hypothesis: Not quite.
Now onto the more character-based qualities…
Calculating the mean, standard deviation, and standard error for kindness
This code calculated the mean preference rating for ideal kindness for men and women in the same age groups. As before, group_by() and summarise() were used to calculate n, mean, SD and SE (rounded to 2 decimal places), and kbl() with pack_rows() was used to present the results.
table_kindness <- one_data %>%
group_by(sex, age_group) %>%
summarise(n = n(), mean = round(mean(ideal_kindness), 2),
sd = round(sd(ideal_kindness), 2), se = round(sd/sqrt(n),
2))
## `summarise()` has grouped output by 'sex'. You can override using the `.groups` argument.
table_kindness %>%
kbl(caption = "Preferences for Ideal Kindness in a partner") %>%
kable_material("hover") %>%
pack_rows("Females", 1, 3) %>%
pack_rows("Males", 4, 6)
| sex | age_group | n | mean | sd | se |
|---|---|---|---|---|---|
| Females | |||||
| female | > 55 | 156 | 6.35 | 0.97 | 0.08 |
| female | 18-34 | 5772 | 6.21 | 0.98 | 0.01 |
| female | 35-54 | 1628 | 6.31 | 0.98 | 0.02 |
| Males | |||||
| male | > 55 | 175 | 6.27 | 0.99 | 0.07 |
| male | 18-34 | 4796 | 6.09 | 1.02 | 0.01 |
| male | 35-54 | 1371 | 6.23 | 0.97 | 0.03 |
Older women had the highest mean preference for kindness, followed by middle-aged women. Young women had the lowest preference for kindness.
Older men had the highest mean preference for kindness. Middle-aged men had the second highest mean preference, followed by young men.
Does this finding support my hypothesis: It does.
Calculating the mean, standard deviation, and standard error for intelligence
This code calculated the mean preference rating for ideal intelligence for men and women in the same age groups, using the same group_by(), summarise(), kbl() and pack_rows() approach as above.
table_intel <- one_data %>%
group_by(sex, age_group) %>%
summarise(n = n(), mean = round(mean(ideal_intelligence),
2), sd = round(sd(ideal_intelligence), 2), se = round(sd/sqrt(n),
2))
## `summarise()` has grouped output by 'sex'. You can override using the `.groups` argument.
table_intel %>%
kbl(caption = "Preferences for Ideal Intelligence in a partner") %>%
kable_material("hover") %>%
pack_rows("Females", 1, 3) %>%
pack_rows("Males", 4, 6)
| sex | age_group | n | mean | sd | se |
|---|---|---|---|---|---|
| Females | |||||
| female | > 55 | 156 | 6.13 | 0.94 | 0.08 |
| female | 18-34 | 5772 | 6.01 | 0.93 | 0.01 |
| female | 35-54 | 1628 | 6.11 | 0.94 | 0.02 |
| Males | |||||
| male | > 55 | 175 | 5.95 | 1.03 | 0.08 |
| male | 18-34 | 4796 | 5.89 | 1.00 | 0.01 |
| male | 35-54 | 1371 | 6.00 | 1.00 | 0.03 |
Older women had the highest mean preference for intelligence, followed by middle-aged women, then young women.
Middle-aged men had the highest mean preference for intelligence, followed by older men, and then young men.
Does this finding support my hypothesis: Somewhat.
Creating a table of summary statistics
I created a larger table of the results so that it would be easier to view and compare the mean preference ratings across the age groups. I opted to use the summary_table() function from the qwraps2 package to create a data summary table, as it operates like a list-of-lists and presents the data in a clean way. The outer list defines the row groups (i.e. the variables), and the inner lists define the rows within each row group (i.e. means, SDs).
The list() function was used to define the desired summary for a specific variable (e.g. ideal intelligence), and this name was used to label the row group within the table. The inner lists were formulae used to calculate the means and SDs for each variable, with the round() function specifying the number of decimal places desired (i.e. 2).
The summary_table() function was then used to build the table, with the 'summaries' argument taking the lists created earlier under 'one_summary' and the 'by' argument specifying the groups.
Setting the option 'qwraps2_markup' to "markdown" indicated that markdown was being used as the markup language. The arguments 'rtitle' and 'cnames' were used to specify the title and rename the column labels.
In order to keep the formatting the same when knitted, results = 'asis' was added to the chunk options.
one_summary <-
list(
"Ideal Intelligence" =
list("mean" = ~ round(mean(ideal_intelligence), 2),
"SD" = ~ round(sd(ideal_intelligence), 2)),
"Ideal Kindness" =
list("mean" = ~ round(mean(ideal_kindness), 2),
"SD" = ~ round(sd(ideal_kindness), 2)),
"Ideal Physical Attractiveness" =
list("mean" = ~ round(mean(ideal_physatt), 2),
"SD" = ~ round(sd(ideal_physatt, 2))))
grouped_by_table <-
summary_table(one_data, summaries = one_summary, by = c("sex", "age_group"))
orig_opt <- options()$qwraps2_markup
options(qwraps2_markup = "markdown")
print(grouped_by_table,
rtitle = "Summary Statistics",
cnames = c("Young Women", "Young Men", "Middle-aged Women",
"Middle-aged Men", "Old Women", "Old Men"))
| Summary Statistics | Young Women | Young Men | Middle-aged Women | Middle-aged Men | Old Women | Old Men |
|---|---|---|---|---|---|---|
| Ideal Intelligence | ||||||
| mean | 6.13 | 5.95 | 6.01 | 5.89 | 6.11 | 6 |
| SD | 0.94 | 1.03 | 0.93 | 1 | 0.94 | 1 |
| Ideal Kindness | ||||||
| mean | 6.35 | 6.27 | 6.21 | 6.09 | 6.31 | 6.23 |
| SD | 0.97 | 0.99 | 0.98 | 1.02 | 0.98 | 0.97 |
| Ideal Physical Attractiveness | ||||||
| mean | 5.36 | 5.91 | 5.56 | 5.86 | 5.57 | 5.84 |
| SD | 1.12 | 1.21 | 1.1 | 1.08 | 1.16 | 1.09 |
Now let’s try plotting!
Initially, I chose a boxplot because the nature of the research question and the data types (i.e. categorical x continuous) seemed to favour such a plot. However, the boxplots didn't look the way I wanted them to, and I couldn't really figure out why (see below).
p1 <- ggplot(data = one_data, mapping = aes(
x = age_group,
y = ideal_kindness,
fill = sex)) +
geom_boxplot(ylim = range(0:10)) +
stat_summary(fun = mean,
geom = "point",
size = 2,
color = "steelblue") +
theme_classic()+
facet_wrap(~sex)+
labs(title = "Preferences for Ideal Kindness", x = "Age Group", y = "")+
coord_cartesian(ylim = c(0, 10))+
scale_fill_brewer("PRGn")
## Warning: Ignoring unknown parameters: ylim
print(p1)
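The warning at least has a simple explanation: ylim isn't a parameter of geom_boxplot(), so it was ignored; in ggplot2, axis limits are set on the coordinate system or the scales instead (as the coord_cartesian() call above already does). A stripped-down sketch, with 'p1_check' as an illustrative name:
p1_check <- ggplot(one_data, aes(x = age_group, y = ideal_kindness, fill = sex)) +
  geom_boxplot() +                   # no ylim argument here
  coord_cartesian(ylim = c(0, 10)) + # visible y range set via the coordinate system
  theme_classic()
print(p1_check)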
Instead, I opted to use violin plots. One resource I found quite compactly summarised what they are:
“Violin plots are used when you want to observe the distribution of numeric data, and are especially useful when you want to make a comparison of distributions between multiple groups. The peaks, valleys, and tails of each group’s density curve can be compared to see where groups are similar or different.”
From my research, I noticed that they were quite under-utilised despite many sources saying that they could actually convey more information than a box plot. Apparently, the shape of a violin plot represents the density estimate of the variable. Therefore, the more data points that exist in a specific range, the larger the violin for that range. Additionally, violin plots allow us to observe the distribution’s modalities (i.e. the number of peaks) and skew.
I had never made a violin plot before so it seemed like a nice challenge and a good way to learn more about them.
Ideal Physical Attractiveness
The 'one_data' data frame was used with the ggplot() function from the ggplot2 package. The aesthetic mappings specified the variables on the x and y axes, with 'age_group' on the x axis and 'ideal_physatt' on the y axis. Fill was also used so that the violins were coloured by sex.
geom_violin() was used to create a violin plot, and the facet_wrap() function was used to present the facets for each sex side by side. To add the mean and SD, stat_summary() used 'mean_sdl', which computes the mean plus or minus a constant times the standard deviation. The 'mult' argument was intended to set the constant to 1, and the 'geom' argument specified that the mean +/- SD should be drawn as a point range. The 'colour' argument specified that the point range should be black.
The scale_fill_brewer() function was used to set the colours, with the labs() function specifying the labels for the x and y axes. theme_classic() from ggplot2 was used to change the background colour to white. The theme() function adjusted the plot title's position with the 'hjust' (i.e. horizontal adjustment) and 'vjust' (i.e. vertical adjustment) arguments, as well as its text appearance. The strip.background and strip.text.x theme elements were set to element_blank() to remove the strip text above each facet.
p1 <- ggplot(one_data, mapping = aes(x = age_group, y = ideal_physatt, fill = sex)) +
geom_violin()+
facet_wrap(~sex)+
stat_summary(fun.data = mean_sdl, mult = 1, geom = "pointrange", colour = "black")+
scale_fill_brewer(palette = "Paired")+
theme_classic()+
labs(x = "Age Group", y = "")+
ggtitle("Preference Ratings for Physical Attractiveness") +
theme(
plot.title = element_text(hjust=0, vjust=0.5, face='bold', size = 20),
strip.background = element_blank(), strip.text.x = element_blank())
## Warning: Ignoring unknown parameters: mult
print(p1)
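One caveat: the 'Ignoring unknown parameters: mult' warning suggests the mult argument never actually reached mean_sdl(). In ggplot2, extra arguments for the summary function are normally passed via fun.args; a hedged sketch of how that layer could be written instead:
# drop-in replacement for the stat_summary() call used above
stat_summary(fun.data = mean_sdl,
             fun.args = list(mult = 1), # pass mult = 1 through to mean_sdl()
             geom = "pointrange", colour = "black")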
Ideal Kindness
The same approach was used as for the physical attractiveness plot, this time with 'ideal_kindness' on the y axis: geom_violin() with facet_wrap(~sex), stat_summary() with mean_sdl to add the mean +/- SD as a point range, scale_fill_brewer() for the colours, theme_classic(), and the theme() adjustments for the title and facet strips.
p2 <- ggplot(one_data, mapping = aes(x = age_group, y = ideal_kindness, fill = sex)) +
geom_violin()+
facet_wrap(~sex)+
stat_summary(fun.data = mean_sdl, mult = 1, geom = "pointrange", colour = "black")+
scale_fill_brewer(palette = "Paired")+
theme_classic()+
labs(x = "Age Group", y = "")+
ggtitle("Preference Ratings for Ideal Kindness") +
theme(
plot.title = element_text(hjust=1, vjust=0.5, face='bold', size = 20),
strip.background = element_blank(), strip.text.x = element_blank())
## Warning: Ignoring unknown parameters: mult
print(p2)
Ideal Intelligence
Again, the same approach was used, this time with 'ideal_intelligence' on the y axis.
p3 <- ggplot(one_data, mapping = aes(x = age_group, y = ideal_intelligence, fill = sex)) +
geom_violin()+
facet_wrap(~sex)+
stat_summary(fun.data = mean_sdl, mult = 1, geom = "pointrange", colour = "black")+
scale_fill_brewer(palette = "Paired")+
theme_classic()+
labs(x = "Age Group", y = "")+
ggtitle("Preference Ratings for Ideal Intelligence") +
theme(
plot.title = element_text(hjust=1, vjust=0.5, face='bold', size = 20),
strip.background = element_blank(), strip.text.x = element_blank())
## Warning: Ignoring unknown parameters: mult
print(p3)
Interpreting the findings
The research question was whether individuals place greater importance on character-based qualities (i.e. kindness, intelligence) than on superficial qualities (i.e. physical attractiveness) as they age. That is to say, do ideal mate preferences vary with age? Furthermore, does this differ for men and women?
The results provided mixed evidence.
I hypothesised that for superficial qualities (i.e. physical attractiveness), young people would have higher mean preference ratings relative to older people. This was not the case. Young men and women had the second highest mean preference ratings for physical attractiveness; middle-aged women had the highest ratings of all women, and old men had the highest ratings of all men, despite that age group having far fewer participants (331 in total aged 55+).
I hypothesised that for more thoughtful qualities (i.e. kindness, intelligence), young people would have lower mean preference ratings relative to older people. There was some support for this, as young men and women had the lowest preferences for both kindness and intelligence, and old men and women had the highest preferences for kindness.
In general, however, the findings did lend support to my hypotheses regarding sex differences in mean preference ratings for both superficial and character-based qualities.
In terms of limitations, the exact age limits for each age group were somewhat arbitrary and may not have accurately characterised the differences between individuals at different life stages. Additionally, there were more young participants (n = 10,568) than middle-aged (n = 2,999) and old (n = 331) participants combined. As a result, the sample may not have adequately reflected the mate preferences, and the variation in those preferences, of middle-aged or older individuals.
Eagly and Wood (1999) proposed Biosocial Role Theory, which posits that sex differences in behaviour are a result of societal expectations and gender roles. They hypothesised that societies with greater gender inequality would have greater sex differences in mate preferences, and they found some evidence to support their position: gender equality diminished sex differences in ranked, but not rated, preferences for financial prospects.
I found this perspective interesting because it could account for changes within society, and could therefore offer an explanation for declines in mate preferences for financial prospects.
This contrasts with the Evolutionary Psychology Perspective (Buss, 1989), which could not. That perspective's underlying theoretical basis is that women need access to resources in order to successfully raise offspring, and therefore tend to prefer partners who have the ability to invest resources (i.e. greater financial prospects). I was curious whether the preference for mates with good financial prospects would persist as strongly for females where greater gender parity exists.
A study by Moore, Cassidy, Law Smith & Perrett (2006) titled “The effects of female control of resources on sex-differentiated mate preferences” examined this further, looking at how mate preferences for financial prospects were influenced by female control of resources.
They defined control of resources somewhat abstractly, as a composite of the individual's participation in activities that would enable them to successfully provide for their offspring: for example, their educational attainment, their level of financial independence, and their control over their finances. They found that with increased female control of resources, there was a decrease in preferences for good financial prospects in a mate.
Furthermore, the relationship between access to resources and female preferences was explored by Kasser and Sharma (1999) in their paper “Reproductive Freedom, Educational Equality, and Females’ Preferences for Resource-Acquisition Characteristics in Mates”. They actually re-analysed Buss et al.’s (1990) data and found a negative relationship between educational equality (i.e. access to education) and female preferences for good financial prospects in a partner.
Taken together, the findings supported the notion that female preferences for resource-acquisition in a partner would be lower provided they had access to resources that would enable them to successfully raise their offspring.
In terms of what that meant for my research question, it seemed that gender equality may have potentially influenced female preferences for good financial prospects in a partner.
My hypothesis:
Higher gender equality should reflect more egalitarian views about gender norms at the societal level and less perpetuation of gendered social constructs. Females may not necessarily identify with traditional ideas of men being the provider, and may seek greater financial independence given more equal opportunity. Therefore, females in countries with higher gender equality should have lower mate preference ratings for financial prospects relative to females in countries with lower gender equality.
In terms of how I intended to examine this, I thought that filtering the data to find the countries with the lowest and highest gender equality would be a good start. From there, I could filter to include only female participants and directly compare the mean preference ratings for financial prospects in a mate.
To answer this question, the steps I planned to take were:
Examine the countries based on their gender equality score and find the countries with the highest and lowest scores.
Filter the data so that I can compare the female participant ratings directly.
Compare the mean preference ratings and then maybe plot it? Perhaps run a t-test if the difference looks like it could potentially be statistically significant.
Loading in the data and viewing it
As I was examining gender equality, it was important to have access to the gender equality composite score (i.e. the gepca score) that Walter et al. (2020) created based on the updated Gender-related Development Index (GDI), Global Gender Gap Index (GGGI), and Gender Inequality Index (GII).
The function read_csv() from the 'readr' package was used to read the data file 'walter.csv', which was created in the verification section of this report. It contained variables that Walter et al. (2020) did not include in their main data set, such as the gepca scores.
The '<-' is known as the assignment operator: it assigns the result of the expression on the right to the name on the left. The 'walter.csv' data frame was saved as 'data'.
data <- read_csv("walter.csv")
glimpse(data)
## Rows: 14,399
## Columns: 46
## $ PIN <dbl> 12506, 1997, 1448, 10625, 6106, 4078, 3034, 5281, …
## $ CIN <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ continent <chr> "Africa", "Africa", "Africa", "Africa", "Africa", …
## $ country <chr> "Algeria", "Algeria", "Algeria", "Algeria", "Alger…
## $ city <chr> "Algiers", "Algiers", "Setif", "Setif", "Algiers",…
## $ countrycode <dbl> 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24…
## $ partnum <chr> "85", "72", "277", "229", "23", "82", "86", "135",…
## $ partcode <chr> "A85", "A72", "SB277", "S229", "A23", "A82", "A86"…
## $ sample <dbl> 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 2, 1, 1,…
## $ sex <dbl> 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ age <dbl> 21, 22, 47, 20, 23, 20, 22, 27, 19, 19, 19, 28, 22…
## $ religious <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ religion <chr> "islam", "islam", "Islam", "Islam", "islam", "isla…
## $ relstat <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ relstat2 <dbl> 2, 2, 4, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 2, 2, 2, 3,…
## $ rellength <dbl> 28, NA, 307, NA, NA, 14, 14, 9, NA, NA, NA, 14, NA…
## $ ideal_intelligence <dbl> 3, 7, 5, 4, 7, 6, 6, 7, 5, 5, 4, 6, 6, 7, 6, 7, 4,…
## $ ideal_kindness <dbl> 7, 7, 5, 7, 7, 7, 7, 7, 7, 5, 7, 6, 7, 7, 7, 7, 5,…
## $ ideal_health <dbl> 6, 7, 5, 7, 7, 7, 7, 7, 5, 7, 7, 6, 7, 7, 6, 4, 5,…
## $ ideal_physatt <dbl> 4, 7, 5, 7, 7, 7, 7, 6, 7, 5, 5, 6, 6, 7, 6, 2, 6,…
## $ ideal_resources <dbl> 1, 6, 5, 4, 7, 7, 7, 6, 4, 5, 4, 6, 5, 7, 7, 1, 4,…
## $ mate_age <dbl> 16, 17, 17, 17, 18, 18, 18, 18, 18, 18, 18, 19, 19…
## $ popsize <dbl> 39667, 39667, 39667, 39667, 39667, 39667, 39667, 3…
## $ country_religion <chr> "Muslim", "Muslim", "Muslim", "Muslim", "Muslim", …
## $ lattitude <dbl> 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28…
## $ gem1995 <dbl> 0.266, 0.266, 0.266, 0.266, 0.266, 0.266, 0.266, 0…
## $ gdi1995 <dbl> 0.508, 0.508, 0.508, 0.508, 0.508, 0.508, 0.508, 0…
## $ gii <dbl> 0.429, 0.429, 0.429, 0.429, 0.429, 0.429, 0.429, 0…
## $ gdi2015 <dbl> 0.854, 0.854, 0.854, 0.854, 0.854, 0.854, 0.854, 0…
## $ gggi <dbl> 0.642, 0.642, 0.642, 0.642, 0.642, 0.642, 0.642, 0…
## $ gdp_percap <dbl> 15100, 15100, 15100, 15100, 15100, 15100, 15100, 1…
## $ infect_death <dbl> 0.000196637, 0.000196637, 0.000196637, 0.000196637…
## $ infect_yll <dbl> 0.01025033, 0.01025033, 0.01025033, 0.01025033, 0.…
## $ cmc_yll <dbl> 0.05141553, 0.05141553, 0.05141553, 0.05141553, 0.…
## $ gb_path <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ agediff <dbl> -5, -5, -30, -3, -5, -2, -4, -9, -1, -1, -1, -9, -…
## $ pathindex <dbl> -0.1635533, -0.1635533, -0.1635533, -0.1635533, -0…
## $ gepcascores <dbl> -1.298351, -1.298351, -1.298351, -1.298351, -1.298…
## $ zagediff <dbl> -0.66236022, -0.66236022, -4.37646742, -0.36523164…
## $ zideal_resources <dbl> -3.5742767, 0.5722421, -0.2570617, -1.0863654, 1.4…
## $ zideal_intelligence <dbl> -3.07225356, 1.05405819, -1.00909768, -2.04067562,…
## $ zideal_kindness <dbl> 0.8156212, 0.8156212, -1.1846226, 0.8156212, 0.815…
## $ zideal_health <dbl> -0.0523192, 0.8896893, -0.9943277, 0.8896893, 0.88…
## $ zideal_physatt <dbl> -1.5139923, 1.1689482, -0.6196788, 1.1689482, 1.16…
## $ zcmc_yll <dbl> 0.09657417, 0.09657417, 0.09657417, 0.09657417, 0.…
## $ zpathindex <dbl> -0.163697, -0.163697, -0.163697, -0.163697, -0.163…
Creating a new, “clean” data frame with the variables of interest
The select() function was used to select the variables I wanted to use. Additionally, the na.omit() function was used to exclude all incomplete cases/missing values from the data frame.
two_data <- data %>%
select(gepcascores, country, sex, ideal_resources) %>%
na.omit()
Recoding the ‘sex’ variable
In the data frame, the variable "sex" used the values '0' and '1' to represent women and men respectively. By using the mutate() and case_when() functions from dplyr, I could change the '0' and '1' values to 'Female' and 'Male'. This helped to avoid getting confused (and potentially making mistakes) with the original '0' and '1' codes.
two_data %>%
mutate(sex = case_when(sex == 0 ~ "Female",
sex == 1 ~ "Male"))
## # A tibble: 13,470 × 4
## gepcascores country sex ideal_resources
## <dbl> <chr> <chr> <dbl>
## 1 -1.30 Algeria Male 1
## 2 -1.30 Algeria Male 6
## 3 -1.30 Algeria Female 5
## 4 -1.30 Algeria Male 4
## 5 -1.30 Algeria Male 7
## 6 -1.30 Algeria Male 7
## 7 -1.30 Algeria Male 7
## 8 -1.30 Algeria Female 6
## 9 -1.30 Algeria Male 4
## 10 -1.30 Algeria Male 5
## # … with 13,460 more rows
Converting sex into a factor
As sex was a categorical variable, I used the as.factor() function to convert it from a double (i.e. a numeric type) to a factor.
two_data$sex <- as.factor(two_data$sex)
Viewing the new data frame created
glimpse() was used to present a more compact view of the data, allowing me to see each of the columns (i.e. variables) in the data frame and their data type. This was helpful as I could then check whether the code thus far had worked.
glimpse(two_data)
## Rows: 13,470
## Columns: 4
## $ gepcascores <dbl> -1.298351, -1.298351, -1.298351, -1.298351, -1.298351,…
## $ country <chr> "Algeria", "Algeria", "Algeria", "Algeria", "Algeria",…
## $ sex <fct> 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ ideal_resources <dbl> 1, 6, 5, 4, 7, 7, 7, 6, 4, 5, 4, 6, 5, 7, 7, 1, 4, 4, …
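Looking at this output, sex still shows as 0/1: the mutate()/case_when() step above was printed but never assigned back to two_data, which is why the filtering further down still uses the original '0' code. A sketch of how the recode could be made to stick (assignment added):
two_data <- two_data %>%
  mutate(sex = case_when(sex == "0" ~ "Female", # works whether sex is numeric or a factor
                         sex == "1" ~ "Male"),
         sex = as.factor(sex)) # if applied, later steps would filter on sex == "Female" instead of "0"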
Filtering countries to see which country had the highest gender equality score
The filter() function was used to obtain the maximum gepca value.
The country with the highest gender equality is Norway, with a score of 1.65788.
two_data %>%
filter(gepcascores == max(gepcascores))
## # A tibble: 265 × 4
## gepcascores country sex ideal_resources
## <dbl> <chr> <fct> <dbl>
## 1 1.66 Norway 0 5
## 2 1.66 Norway 0 4
## 3 1.66 Norway 1 6
## 4 1.66 Norway 1 5
## 5 1.66 Norway 0 5
## 6 1.66 Norway 0 5
## 7 1.66 Norway 1 6
## 8 1.66 Norway 0 6
## 9 1.66 Norway 0 6
## 10 1.66 Norway 1 6
## # … with 255 more rows
Filtering countries to see which country had the lowest gender equality score
The filter() function was used to obtain the minimum gepca value.
The country with the lowest gender equality is Pakistan, with a score of -2.750119.
two_data %>%
filter(gepcascores == min(gepcascores))
## # A tibble: 590 × 4
## gepcascores country sex ideal_resources
## <dbl> <chr> <fct> <dbl>
## 1 -2.75 Pakistan 1 5
## 2 -2.75 Pakistan 1 7
## 3 -2.75 Pakistan 1 6
## 4 -2.75 Pakistan 1 6
## 5 -2.75 Pakistan 1 4
## 6 -2.75 Pakistan 1 4
## 7 -2.75 Pakistan 1 6
## 8 -2.75 Pakistan 1 6
## 9 -2.75 Pakistan 1 4
## 10 -2.75 Pakistan 1 6
## # … with 580 more rows
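A more compact sketch of the same lookup (assuming each country has a single gepcascores value) collapses to one row per country first:
two_data %>%
  distinct(country, gepcascores) %>% # one row per country
  arrange(desc(gepcascores)) %>%     # highest gender equality first
  slice(c(1, n()))                   # keep the top and bottom country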
Filtering the data frame to only include the countries Norway and Pakistan
A new data frame called ‘country_data’ was created, with the filter() function being used to include only participants from the countries Pakistan and Norway.
country_data <- two_data %>%
filter(country == c("Norway", "Pakistan"))
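One thing to be careful of here: with '==', the two country names are recycled and compared row by row, so roughly half of the Norway and Pakistan rows end up silently dropped (the summaries below reflect the '==' version). %in% is usually what's intended for this kind of filter; a sketch:
country_data <- two_data %>%
  filter(country %in% c("Norway", "Pakistan")) # keeps every row from either country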
Filtering to include only female participants
The '[' operator (square-bracket subsetting) was used to include only observations from female respondents, with '$' specifying the variable of interest within the data frame as 'sex'.
Using subset() was another method I considered; however, a resource I came across mentioned some of the dangers associated with it and recommended the use of '[' instead.
new_data <- country_data[country_data$sex == "0",]
Viewing the new data frame created
glimpse() was used to present a somewhat more compact view of the data, allowing me to check if the previous code worked and whether the data frame contained only females from Norway and Pakistan.
glimpse(new_data)
## Rows: 233
## Columns: 4
## $ gepcascores <dbl> 1.65788, 1.65788, 1.65788, 1.65788, 1.65788, 1.65788, …
## $ country <chr> "Norway", "Norway", "Norway", "Norway", "Norway", "Nor…
## $ sex <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ ideal_resources <dbl> 4, 5, 6, 6, 5, 6, 6, 6, 5, 6, 4, 4, 4, 6, 6, 5, 5, 6, …
Calculating summary statistics
This code was used to calculate summary statistics for female ideal mate preferences for financial prospects according to country.
The group_by() function was used to group the data by country (i.e. Pakistan and Norway). The summarise() function from dplyr was then used to calculate summary statistics (i.e. n, mean, SD, and SE).
The kbl() function from the ‘kableExtra’ package was used to create a table, with the ‘caption’ argument clarifying what the table was presenting. The kable_material() function was then used to specify the aesthetic features and better differentiate the scores of women in Norway and Pakistan with the ‘striped’ option.
data_summary <- new_data %>%
group_by(country) %>%
summarise(n = n(), mean = round(mean(ideal_resources), 2),
sd = round(sd(ideal_resources), 2), se = round(sd/sqrt(n),
2))
data_summary %>%
kbl(caption = "Female Ideal Mate Preferences for Financial Prospects") %>%
kable_material(c("striped", "hover"))
| country | n | mean | sd | se |
|---|---|---|---|---|
| Norway | 66 | 5.41 | 0.80 | 0.1 |
| Pakistan | 167 | 5.85 | 1.32 | 0.1 |
Okay, so there definitely was a difference between the mean preference ratings for female participants in Pakistan and Norway. Females in Pakistan reported higher mean preference ratings for financial prospects, and showed greater variability.
Maybe plotting it with standard deviation error bars could help to display the data better.
A key thing to note, though, is that Pakistan had a higher n (i.e. there were more participants from Pakistan), so that's something to be conscious of.
Creating a visualisation to represent the data
The 'data_summary' data frame created above was used with the ggplot() function from the ggplot2 package. The aesthetic mappings specified the variables on the x and y axes, with 'country' on the x axis and 'mean' (i.e. the mean preference rating for ideal financial prospects) on the y axis.
geom_col() was used rather than geom_bar() because I wanted the height of the bars to represent the values in the data. The fill argument was then used to colour the bars by country. The geom_errorbar() function added error bars to the plot, with the aesthetic mappings supplying the mean +/- SD as the error-bar limits. theme_classic() from ggplot2 was used to change the background colour to white. labs() added the plot title and axis labels, with the additional theme() code altering the title's text appearance. The easy_remove_legend() function from the ggeasy package was also used to remove the legend.
plot <- data_summary %>%
ggplot(mapping = aes(x = country, y = mean, fill = country)) +
geom_col() + geom_errorbar(aes(ymin = mean - sd, ymax = mean +
sd), width = 0.4, colour = "black", alpha = 0.9, size = 1.3) +
theme_classic() + scale_fill_brewer(palette = "Paired") +
labs(title = "Mean Preference Ratings for Ideal Mate Financial Prospects",
x = "", y = "") + theme(plot.title = element_text(hjust = 0.5,
vjust = 0.5, face = "bold", size = 15)) + easy_remove_legend()
print(plot)
Running a t-test
I used a t-test in order to determine whether there was a statistically significant difference between Norway and Pakistan's female mean preference ratings for financial prospects. The null hypothesis was that the two means are equal, and the alternative hypothesis was that they are not.
Initially, I wanted to use the ttestIS() function from the jmv package to conduct an independent samples t-test. However, Student's independent t-test assumes that the data in each group come from normal distributions and have equal variances. I wasn't comfortable making the equal-variance assumption, so I used Welch's t-test instead, as it adjusts the degrees of freedom when the variances are not assumed to be equal.
The filter() function was used to create separate data frames for each country. Then, the t.test() function was used to run a Welch Two Sample t-test, looking at the differences between the countries in regards to their preference ratings for ideal financial prospects in a partner.
Pakistan <- new_data %>%
filter(country == "Pakistan")
Norway <- new_data %>%
filter(country == "Norway")
t.test(Pakistan$ideal_resources, Norway$ideal_resources)
##
## Welch Two Sample t-test
##
## data: Pakistan$ideal_resources and Norway$ideal_resources
## t = 3.1033, df = 192.08, p-value = 0.002203
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.160783 0.721634
## sample estimates:
## mean of x mean of y
## 5.850299 5.409091
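For reference, the same Welch test can be run with the formula interface, which avoids creating the two separate data frames (a sketch):
t.test(ideal_resources ~ country, data = new_data) # Welch's t-test by default (var.equal = FALSE)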
Interpreting the findings
It was hypothesised that females in countries with higher gender equality would have lower mate preference ratings for financial prospects relative to countries with lower gender equality.
Mate preference ratings for financial prospects differed significantly according to Welch's t-test, t(192.08) = 3.1033, p = .002. On average, female participants in the country with low gender equality (Pakistan) had a mean preference score of 5.85 (SD = 1.32), whereas female participants in the country with high gender equality (Norway) had a mean preference score of 5.41 (SD = 0.80). The 95% confidence interval for the difference in means was 0.16 to 0.72 scale points. The results supported the hypothesis.
In saying that however, it’s important to acknowledge the limitations of this examination. First, the number of female participants in Pakistan (n = 167) was greater than that of Norway (n = 66). Second, the underlying implication behind the hypothesis was that gender equality scores reflect more egalitarian social/cultural norms or greater access to resources that would enable women to have control over resources. However, gender equality scores may not always be a valid or accurate indicator of equal opportunity. Walter et al. (2020) themselves noted that relatively abstract national-level predictors may not accurately reflect the ecological surroundings of participants.
Evolutionary psychologists posit that physical attractiveness is valued because it reflects health, fertility (in women), and dominance (in men). However, socio-cultural accounts suggest that physical attractiveness is valued due to positive stereotyping and other biases. An example of such stereotyping is the "beautiful is good" effect noted by Dion, Berscheid & Walster (1974), where individuals automatically attribute positive characteristics to those who are attractive. Monin (2003) also noted a "beautiful is familiar" effect, whereby physically attractive individuals are perceived to be more familiar than physically unattractive individuals, regardless of whether there was any prior exposure.
All in all, studies have outlined the effects of physical attractiveness on perceptions of an individual across various social contexts. However, this raises a question: does physical attractiveness continue to predict outcomes over time?
The poet John Keats recognised the power of physical appearances and argued that it was something everlasting, something that would continue to bring joy to observers. This contrasts with the position of G. B. Shaw, a playwright who similarly recognised the power of attractiveness but suggested instead that beauty is fleeting and fades with familiarity.
I was interested in examining those competing trains of thought. It was a little abstract and difficult to test with the available data, but I figured I could at least attempt an approximate answer. I intended to look at individuals with high and low mate preferences for physical attractiveness and see whether their relationship lengths differed. I thought this could provide some indication of whether the biases associated with physical attraction, the biases that drive the power of physical appearances, last.
That is to say, is there a relationship between preferences for physical attractiveness and relationship length?
My hypothesis:
If John Keats was right, individuals who prioritised physical attractiveness more (i.e. had higher ratings) would be more likely to hold positive illusions of their partner and, therefore, more likely to have longer relationships than individuals with lower ratings for physical attractiveness.
However, if G. B. Shaw was right, individuals who prioritised physical attractiveness more (i.e. had higher ratings) would be more likely to initially hold positive illusions of their partner, but these would fade as their partner became more familiar. As such, they would be more likely to have shorter relationships than individuals with lower ratings, as they may prefer partners who exist in a somewhat 'idealised' state.
I'm inclined to agree with G. B. Shaw (that might just be the pessimist in me): individuals who prioritise physical attractiveness more (i.e. have higher mean preference ratings) are more likely to have shorter relationship lengths.
To answer this question, the steps I planned to take were:
Filter out relationship length to exclude all reported cases of 0.
Create a new data frame with groups (i.e. high and low preference ratings) and compare the mean relationship lengths. Maybe run a t-test?
Use a data frame with all participants and see if there’s a correlation between preference ratings and relationship length.
It wasn't immediately clear what unit the relationship length variable was measured in (i.e. days? years?). I looked at the supplemental material and found some indication that it referred to months.
"Participants in this sample were M = 49.86 years old on average (SD = 14.48) and were in their relationships for Mdn = 156 months" (lines 199-200 of Walter et al.'s (2020) supplemental materials)
Let’s get started!
Creating a new, “clean” data frame with the variables of interest
I worked with the sdmp data set read in earlier.
The select() function was used to select the variables I wanted to use. Additionally, the na.omit() function was used to exclude all incomplete cases/missing values from the data frame.
three_data <- sdmp %>%
select(ideal_physatt, rellength, sex) %>%
na.omit()
Viewing the new data frame created
glimpse() was used to present a more compact view of the data, allowing me to see each of the columns (i.e. variables) in the data frame and their data types. This was helpful as I could then check whether my code had worked and only the selected variables were included.
glimpse(three_data)
## Rows: 7,135
## Columns: 3
## $ ideal_physatt <dbl> 4, 5, 7, 7, 6, 6, 6, 7, 7, 7, 7, 7, 7, 5, 5, 7, 7, 7, 7,…
## $ rellength <dbl> 28, 307, 14, 14, 9, 14, 4, 51, 64, 36, 27, 41, 50, 34, 3…
## $ sex <fct> male, female, male, male, female, male, male, male, male…
Just glancing at the data, I noticed that some individuals reported '0' for their relationship length. I wasn't sure whether this reflected a mistake on the participant's part or whether it meant that they weren't in a relationship. In any case, I thought it would be wise to remove all cases of '0'.
Filtering the data to exclude all ‘0’ values for relationship length
The filter() function from the dplyr package was used to remove all '0' values from the relationship length variable. The glimpse() function was then used to view the data and double-check that it had worked.
three_data <- three_data %>%
  dplyr::filter(rellength != 0) # '!=' means 'not equal to', so only non-zero relationship lengths are kept
glimpse(three_data)
## Rows: 7,040
## Columns: 3
## $ ideal_physatt <dbl> 4, 5, 7, 7, 6, 6, 6, 7, 7, 7, 7, 7, 7, 5, 5, 7, 7, 7, 7,…
## $ rellength <dbl> 28, 307, 14, 14, 9, 14, 4, 51, 64, 36, 27, 41, 50, 34, 3…
## $ sex <fct> male, female, male, male, female, male, male, male, male…
Creating high and low preference groups
The mutate() function was used to create a new variable, 'pref_group', with the case_when() function from the 'dplyr' package mapping ranges of ratings onto the preference groups (ratings of 4 or below were classed as 'Low'; ratings of 5 or above as 'High'). The as.factor() function was then used to convert 'pref_group' into a factor.
group_pref <- three_data %>%
mutate(
pref_group = dplyr::case_when(
ideal_physatt <= 4 ~ "Low",
ideal_physatt >= 5 ~ "High"))
group_pref$pref_group <- as.factor(group_pref$pref_group)
Viewing the new data frame created
glimpse() was used to present a more compact view of the data, allowing me to see each of the columns (i.e. variables) in the data frame and their data type. This was helpful as I could then check to see whether my code worked and if the new ‘pref_group’ variable was there, as well as whether it had been converted into a factor.
glimpse(group_pref)
## Rows: 7,040
## Columns: 4
## $ ideal_physatt <dbl> 4, 5, 7, 7, 6, 6, 6, 7, 7, 7, 7, 7, 7, 5, 5, 7, 7, 7, 7,…
## $ rellength <dbl> 28, 307, 14, 14, 9, 14, 4, 51, 64, 36, 27, 41, 50, 34, 3…
## $ sex <fct> male, female, male, male, female, male, male, male, male…
## $ pref_group <fct> Low, High, High, High, High, High, High, High, High, Hig…
Calculating the mean relationship length (in months) for high vs low pref group
This code calculated the mean ‘rellength’ (i.e. relationship length) for individuals who had high or low preference ratings for physical attractiveness.
The group_by() function was used to group the data by 'pref_group', and summarise() was then used to calculate summary statistics (i.e. n, mean, SD, min, and max) for relationship length. The round() function from base R specified the number of decimal places desired (i.e. 2).
The kbl() function from the 'kableExtra' package was used to create a table, with the 'caption' argument clarifying what the table presents. The kable_material() function was used to specify the aesthetic features, with the 'striped' option making it easier to distinguish the relationship lengths of individuals with high and low preference scores.
table_grouppref <- group_pref %>%
group_by(pref_group) %>%
summarise(N = n(), Mean = round(mean(rellength), 2), SD = round(sd(rellength),
2), Min = min(rellength), Max = max(rellength))
table_grouppref %>%
kbl(caption = "Relationship lengths for individuals with high and low preference scores") %>%
kable_material(c("striped", "hover")) # lightable options passed as a single vector
| pref_group | N | Mean | SD | Min | Max |
|---|---|---|---|---|---|
| High | 6132 | 88.08 | 103.28 | 0.50 | 725 |
| Low | 908 | 99.15 | 110.84 | 0.25 | 544 |
Running a t-test
I used a t-test to determine whether there was a statistically significant difference between the relationship lengths of those who had high or low preference ratings for physical attractiveness. The null hypothesis was that the mean relationship lengths of the two groups were equal, and the alternative hypothesis was that they were not.
The filter() function was used to create separate data frames for the high and low preference rating groups. Then I used the t.test() function to run a Welch two-sample t-test comparing the preference groups' relationship lengths.
highpref <- group_pref %>%
filter(pref_group == "High")
lowpref <- group_pref %>%
filter(pref_group == "Low")
t.test(highpref$rellength, lowpref$rellength)
##
## Welch Two Sample t-test
##
## data: highpref$rellength and lowpref$rellength
## t = -2.8314, df = 1152.4, p-value = 0.004715
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -18.731177 -3.397329
## sample estimates:
## mean of x mean of y
## 88.08195 99.14620
Interpreting the findings
Resources that helped to interpret and write-up the findings:
https://www.socscistatistics.com/tutorials/ttest/default.aspx
https://vault.hanover.edu/~altermattw/courses/220/R/ttest/ttest4.html
Relationship lengths differed significantly according to Welch's t-test, t(1152.40) = -2.83, p = .005. On average, individuals who had high preference ratings for physical attractiveness had a mean relationship length of 88.08 months (SD = 103.28), whereas individuals who had low preference ratings had a mean relationship length of 99.15 months (SD = 110.84). The 95% confidence interval for the difference in mean relationship length (high minus low) was -18.73 to -3.40 months. The results supported the hypothesis.
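To put the size of this difference in context, a standardised effect size could also be reported alongside the t-test. A minimal sketch using base R and the pooled-standard-deviation formula for Cohen's d:
# Cohen's d for the difference in relationship length between preference groups
m1 <- mean(highpref$rellength); s1 <- sd(highpref$rellength); n1 <- nrow(highpref)
m2 <- mean(lowpref$rellength);  s2 <- sd(lowpref$rellength);  n2 <- nrow(lowpref)
pooled_sd <- sqrt(((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2))
(m1 - m2) / pooled_sd  # Cohen's d (high minus low)
Based on the means and SDs reported in the table above, this works out to roughly d = -0.11, which is conventionally considered a small effect, so the statistically significant difference is modest in practical terms.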
The negative values above made me wonder whether there might be a negative correlation. To see whether a relationship existed, I created a scatterplot so I could inspect the distribution of data points.
Creating a scatterplot
The 'group_pref' data frame was used, with the aesthetic mappings specifying 'ideal_physatt' (i.e. physical attractiveness preferences) as the x-axis variable and 'rellength' (i.e. relationship length) as the y-axis variable. Colour was also mapped to preference group (i.e. high or low), making the differences in relationship length between the groups easier to see.
geom_point() was used to create a scatterplot, with the argument position = "jitter" added so that overlapping data points could be seen more easily. geom_smooth() was also included to add a smoothing line, which makes patterns within the plot clearer. The argument method = "lm" was used to fit a linear model (i.e. a straight line), with size setting the width of the smoothing line and colour setting it to black.
The theme was specified with theme_classic(), with the scale_colour_brewer() function from the 'ggplot2' package specifying the colour palette. The labs() function was then used to rename the x- and y-axis labels, and a legend title was added with the easy_add_legend_title() function from ggeasy.
p3 <- ggplot(data = group_pref, mapping = aes(
x = ideal_physatt,
y = rellength,
colour = pref_group))+
geom_point(position = "jitter", alpha=0.7)+
geom_smooth(method = "lm", color = "black", alpha = 0.3, size = 2)+
theme_classic()+
scale_colour_brewer(palette = "Paired")+
easy_add_legend_title("Preference Group")+
labs(x = "Preference Ratings for Physical Attractiveness",
y = "Relationship Length (months)",
title = "Correlation between relationship length and attractiveness preferences")
print(p3)
## `geom_smooth()` using formula 'y ~ x'
Taking it at face value, there appeared to be a very slight negative slope, which could indicate a weak negative correlation.
Running a correlation
The Pearson’s product-moment correlation coefficient (r) is essentially a measure of the strength and direction of a linear association between two variables. Before the method could be used to analyse the data, a few assumptions needed to be met.
A helpful resource I found online that outlined the assumptions was:
Now onto checking the assumptions…
1. Checking for normality using a QQ Plot
The normality of data can be tested in different ways. Initially, I opted to try the Shapiro-Wilk test of normality. However, the sample size was above 5,000, which produced an error message. A work-around would be to include '[0:5000]' within the code; however, I wasn't sure whether this would be valid or representative, and the Shapiro-Wilk test can be sensitive to even minor deviations from normality with large sample sizes.
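If a formal test were still wanted, one option (which I did not pursue) would be to run the test on a random subsample rather than the first 5,000 rows, since shapiro.test() accepts between 3 and 5,000 observations; a minimal sketch:
set.seed(1)  # illustrative seed so the subsample is reproducible
shapiro.test(sample(three_data$rellength, 5000))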
For my analysis, I instead used a normal QQ plot, which compares the quantiles of the data against those of a normal distribution.
ggqqplot(three_data, "rellength", facet.by = "ideal_physatt")
Taking it at face value, the data looked roughly normally distributed. Somewhat.
However, as George Box (1976) put it:
“In applying mathematics to subjects such as physics or statistics we make tentative assumptions about the real world which we know are false but which we believe may be useful nonetheless. The physicist knows that particles have mass and yet certain results, approximating what really happens, may be derived from the assumption that they do not. Equally, the statistician knows, for example, that in nature there never was a normal distribution, there never was a straight line, yet with normal and linear assumptions, known to be false, he can often derive results which match, to a useful approximation, those found in the real world.”
I chose to move ahead, whilst acknowledging the limitations of my method and the inaccuracies that could come along with it.
2. Running a Pearson’s product-moment correlation
cor.test(three_data$rellength, three_data$ideal_physatt)
##
## Pearson's product-moment correlation
##
## data: three_data$rellength and three_data$ideal_physatt
## t = -2.7878, df = 7038, p-value = 0.005321
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.05652850 -0.00985974
## sample estimates:
## cor
## -0.03321222
Interpreting the results
There was a very weak, negative correlation between relationship length and preferences for physical attractiveness, r(7038) = -.03, p = .005.
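Given the earlier doubts about normality, a rank-based correlation makes a reasonable robustness check; a minimal sketch using Spearman's rho, which does not assume normality:
cor.test(three_data$rellength, three_data$ideal_physatt,
         method = "spearman", exact = FALSE)  # exact p-values are not computed when ties are present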
My answer to the research question
The research question was whether the biases associated with physical attraction, the biases that drive the power of physical appearances, last. In other words, is there a relationship between preferences for physical attractiveness and relationship length?
The results provided mixed evidence.
I hypothesised that G. B. Shaw was correct (i.e. beauty is fleeting): individuals who prioritised physical attractiveness more (i.e. had higher ratings) would be more likely to initially hold positive illusions of their partner, but these would fade as their partner became more familiar. As such, they would be more likely to have shorter relationships than individuals with lower ratings.
The independent samples t-test found that on average, individuals with higher preference ratings for physical attractiveness had shorter relationship lengths than individuals with low preference ratings. Furthermore, the Pearson’s product-moment correlation found a very weak, negative correlation between relationship length and preferences for physical attractiveness.
This lent some support to G. B. Shaw's position and my hypothesis.
That said, a related body of research could shed some light on this finding. In their article "Sex similarities and differences in preferences for short-term mates: What, whether, and why", Li & Kenrick (2006) examined short-term mate preferences and found that physical attractiveness was valued most by both sexes.
Walter et al. (2020) asked participants to report their preferences for an ideal long-term mate, and it was this data that I used to examine my research question. However, perhaps participants who reported high preferences for physical attractiveness were not really envisaging a long-term relationship and were instead answering based on their preferences at that moment in time. If so, their high prioritisation of physical attractiveness may simply have coincided with their (then) shorter relationship lengths, which could offer an explanation for the pattern of results.
The first recommendation for the authors of the paper would be to include a README file in their OSF repository. This would improve reproducibility, as it would help others understand what the study is about and how to access and use the data, enabling them to reproduce the findings or otherwise contribute to the literature.
The purpose of a README file is to introduce and explain a project and convey information about the files in a directory. Speaking from my team's and my experience, access to a README file would have made it much easier to answer the questions we had and would have substantially helped us to reproduce the statistics and figures in the paper.
In terms of our challenges, we were unable to differentiate between the three CSV files pertaining to Mahalanobis' D. The main difference appeared to be the presence or absence of 'SD' in the file name; however, it wasn't clear what 'SD' represented. We had theories (e.g. standard deviation, standardised), but it was only after we read the 'multivariate effect size' section of the paper that we connected the dots and understood that it referred to 'sex differentiated'.
Unlike the file names, the authors used fairly self-explanatory variable names in the data sets. This enabled us to pinpoint which specific variables were relevant for the descriptives, plots, etcetera. However, certain variables (e.g. sex) lacked value labels, and it wasn't immediately clear what the values (i.e. '0' and '1') represented. To determine what each value stood for, we had to engage in a process of trial and error.
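A lightweight fix on the authors' side would be to ship such variables as labelled factors (or document the coding in a README). A minimal sketch; the data frame name and the 0/1-to-label mapping below are hypothetical, not the authors' actual coding:
# Hypothetical: convert a 0/1 sex variable into a labelled factor
# (the order of the labels below is illustrative only)
raw_data$sex <- factor(raw_data$sex, levels = c(0, 1), labels = c("male", "female"))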
Ultimately, many of the challenges we encountered stemmed from ambiguity. These challenges were not insurmountable; however, they did hinder our progress. The absence of a README file meant that we had to delve into the paper and the authors' R scripts to resolve this ambiguity and solve the issues we faced. A README file providing greater clarity and information would have made reproduction more effective and efficient, as we would not have had to go through a process of trial and error to determine what each value or data file represented.
In terms of how the authors could do this, a range of resources are available that provide more information about creating README files (a short R-based sketch also follows the list below). Some of the resources I found insightful/helpful were:
This resource outlines what a README file is, why and when it should be made, and who should make it. It provides suggestions for creating an effective README file, and features a FAQ section.
This resource is more focused on Node; however, it provides information about what README files do, why they're important, and how to create them. It outlines some good practices and has a README checklist at the bottom of the article.
This resource outlines the basic elements that define a README file, as well as other considerations e.g. including code samples.
This resource includes a template with instructions for creating an effective README.md file.
This resource is essentially a master-post of README examples. There are also relevant articles, tools, and resources (i.e. getting feedback, creating gifs) within.
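For R-based projects specifically, a README skeleton can also be generated from within R; a minimal sketch, assuming the usethis package is installed:
library(usethis)
use_readme_md()    # creates a README.md skeleton in the project directory
# use_readme_rmd() # alternative: a README.Rmd that can embed executable R chunks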
My second recommendation for the authors of the paper would be to provide a data file that contains all of the variables necessary for reproducibility. The data file 'ReplicationProcessedfinaldata04202018.csv' contained the variables needed to reproduce the descriptive statistics and the means and SDs; however, many of the specific variables necessary for producing the figures were absent (e.g. the gepca scores).
This complicated the process greatly. It was also quite disconcerting, as we hadn't considered the possibility of the data being 'incomplete', in a sense, and we weren't quite sure how to obtain the specific variables we needed. We examined the authors' R script and ran their code; however, this produced error messages. Ultimately, to obtain the relevant variables, we had to extract the lines of code necessary for each plot into a separate Rmd file and label them with the line numbers from the R script. This process was challenging, as it involved a lot of trial and error, and we frequently had to go back and find missing lines of code because of error messages.
Having access to a data file that contained all of the necessary variables would have significantly aided the process of reproducing the figures, as we would have been able to locate and use the variables directly. It's unclear why the authors didn't upload such a data file; one approach they could have taken would be to write out a new CSV file containing all of the variables necessary for reproducibility using the write.csv() function, as sketched below.
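A minimal sketch of that approach; the data frame and file names here are hypothetical:
# Hypothetical: a data frame containing every variable used in the analyses and figures
write.csv(full_analysis_data, "ReplicationFullAnalysisData.csv", row.names = FALSE)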
We used this method ourselves when creating a data set we could share as a group, and the resources that helped us were:
This resource provides information about the function write.csv() and its usage, arguments etcetera. It also provides an example.
This resource provides a concise summary of how to use the write.csv() function to save an R data frame as a CSV file.
This resource provides information about how to export a data frame to CSV in RStudio.
My third and final recommendation for the authors of the paper would be to make their code more transparent. This is important for reproducibility, as readers need to be able to understand what was done, why it was done, and how it led to the results. Speaking from my team's and my experience, we often lacked this understanding, and it significantly hindered our progress in reproducing the statistics and figures.
One of the key things we noted early on was that many of the issues we faced weren't addressed or documented in any form by the authors. A major complication we encountered when reproducing the paper was that the code did not produce plots that resembled the published figures. Despite running both our own ggplot code and the authors' code from their R script, this issue persisted.
This issue was especially apparent for Figure 4. Initially, the positions of the data points in the scatterplot did not entirely match and there was an outlier present. However, the authors hadn't indicated any issues with outliers or documented the removal of the outlier in their R script or the published paper. As a result, we had to work backwards and problem-solve with the limited information available to us. Only later did we realise (after weeks of troubleshooting and scouring the script) that they had created a new data frame, which they had then used for that plot.
There was another occasion where the authors had provided the information necessary for reproducing a figure within their R script, but without sufficient detail or explanation. We had noticed a seed number in the authors' R script; however, it wasn't included in their code for Figure 1, and there was no indication that it was important for reproducing the figure. Ultimately, we were only able to understand the purpose of the seed and resolve our issue after receiving a hint from Jenny.
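A little documentation around the seed would have gone a long way. As a sketch of the kind of comment that would have helped (using my own jittered scatterplot variables from earlier, with an illustrative seed value):
set.seed(2020)  # illustrative seed: fixes the random number generator so the jittered points land in the same place on every run
ggplot(group_pref, aes(x = ideal_physatt, y = rellength)) +
  geom_point(position = "jitter")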
It's difficult to say exactly what the authors could have done to be more transparent, because I do think there's an inherent disconnect between those involved in a research project and external observers. The authors may have intended to be transparent; however, when coding, they may have prioritised communication with fellow members of their team and may not have actively tried to explain their code in lay terms. As a team with little experience in coding, we definitely felt that we had a gap in knowledge, and some of the code got lost in translation.
A few suggestions for perhaps building this awareness and developing more effective communication:
This resource outlines the basic principles of commenting, as well as good practices and habits to have.
This resource isn't a "proper" resource per se, I just found it insightful! A software developer was seeking advice on how to better explain their code, and I think it sheds some light on the thinking behind the problem.
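Finally, as a small illustration of the kind of commenting that would have helped us, here is one of my own chunks from earlier, re-annotated; a sketch of the practice rather than a prescription:
# Split the data by preference group so each group can be passed to t.test()
highpref <- group_pref %>%
  filter(pref_group == "High")   # participants who rated physical attractiveness 5 or above
lowpref <- group_pref %>%
  filter(pref_group == "Low")    # participants who rated physical attractiveness 4 or below
# Welch two-sample t-test: does mean relationship length differ between the groups?
t.test(highpref$rellength, lowpref$rellength)
Even brief comments like these record what each step is for, which is exactly the information we were missing when working through the authors' scripts.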