Week 9 Goals
- To try and finish my exploratory analysis for the verification report…no such luck
Challenges and successes
- Well…I have code for all of my exploratory analysis…but I just feel like it’s all wrong and looks terrible. Feeling pretty hopeless this week! Here’s how I went…
# Load packages
library(tidyverse)
library(patchwork)
library(extrafont)
library(cowplot)
library(rstatix)
library(ggeasy)
library(ggpubr)
library(corrplot)
library(car)
library(psych)
library(BayesFactor)
library(gt)
library(jmv)
# Read data
# The data in the OSF repo is in .csv format, so I need to use the `read_csv()` function from the `reader` package.
data_1_raw = read_csv('WTR_Comfort_S1.csv')
data_2 = read_csv('WTR_Comfort_S2.csv')
data_3 = read_csv('WTR_Comfort_S3.csv')3a. Are there gender differences in the threshold for disgust?
My experience is that women tend to feel higher levels of disgust compared to men. This is supported in the literature where women have continually been found to have noticably higher levels of disgust compared to men, across many different types of disgust, such as insects, open sores, dirty clothing, and faeces (Curtis et al., 2004; Prokop & Fančovičová, 2010), and animals, death, hygiene, food, and sex (Haidt et al., 1994). In Tybur et al. (2020), disgust was measured on a likert scale and disgsut statements related to dog faeces, red sores, sweaty palms, mouldy food, body odor, cockroaches, and bloody cuts.
According to the codebook, the gender variable is sex (where 1 = male, 2 = female, 3 = other). The disgust scale for seeing a cockroach run across the floor is DS6. First, I checked that the data had these variables - it does. However, it’s spread across study 1-3. I wanted to see if I could somehow merge the data in these variables into one dataset by their participant number to answer this question. However, after trying some code, I realised this probably wouldn’t work because it was a between subjects design where the participants in each study were separate. I first tried merge(by=) but this threw up an error.
I used dim() to get some information about the 3 datasets so I could see if the merge worked.
dim(data_1_raw)## [1] 504 100
dim(data_2)## [1] 430 85
dim(data_3)## [1] 905 68
data_1_raw has 504 rows and 100 columns, data_2 has 430 rows and 85 columns and data_3 has 905 rows and 68 columns. I then merged the data using full_join() and joined by participant. This would supposedly return all rows and all columns from all datasets in a final dataset. I then used slice to show a tibble of the new dataset. I went from (1:905) because that was the range of rows I got from dim().
total <- full_join(data_1_raw, data_2, data_3, by="participant")
total %>% slice(1:905)## # A tibble: 504 x 184
## participant sex.x age.x relat.x income.x poli_soc.x poli_econ.x trust_gen.x
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 2 21 1 4 3 6 9
## 2 2 2 22 2 1 2 2 5
## 3 3 1 25 2 1 4 4 4
## 4 4 2 28 1 5 7 7 8
## 5 5 1 38 1 20 2 2 2
## 6 6 1 49 1 20 3 6 7
## 7 7 2 32 1 10 4 4 8
## 8 8 2 39 1 16 6 6 7
## 9 9 2 47 2 7 1 1 7
## 10 10 2 19 2 10 4 4 3
## # … with 494 more rows, and 176 more variables: DS1.x <dbl>, DS2.x <dbl>,
## # DS3.x <dbl>, DS4.x <dbl>, DS5.x <dbl>, DS6.x <dbl>, DS7.x <dbl>,
## # relationship_category <dbl>, part_leng <dbl>, part_sex <dbl>,
## # part_age <dbl>, HH1.x <dbl>, HH2.x <dbl>, HH3.x <dbl>, HH4.x <dbl>,
## # HH5.x <dbl>, HH6.x <dbl>, HH7.x <dbl>, HH8.x <dbl>, HH9.x <dbl>,
## # HH10.x <dbl>, comf1.x <dbl>, comf2.x <dbl>, comf3.x <dbl>, comf4.x <dbl>,
## # comf5.x <dbl>, comf6.x <dbl>, comf7.x <dbl>, comf8.x <dbl>, comf9.x <dbl>,
## # comf10.x <dbl>, 37_54 <dbl>, 37_46 <dbl>, 37_39 <dbl>, 37_31 <dbl>,
## # 37_24 <dbl>, 37_17 <dbl>, 37_9 <dbl>, 37_2 <dbl>, 37_-6 <dbl>,
## # 37_-13 <dbl>, 23_33 <dbl>, 23_29 <dbl>, 23_24 <dbl>, 23_20 <dbl>,
## # 23_15 <dbl>, 23_10 <dbl>, 23_6 <dbl>, 23_1 <dbl>, 23_-3 <dbl>, 23_-8 <dbl>,
## # 75_109.x <dbl>, 75_94.x <dbl>, 75_79.x <dbl>, 75_64.x <dbl>, 75_49.x <dbl>,
## # 75_34.x <dbl>, 75_19.x <dbl>, 75_4.x <dbl>, 75_-11.x <dbl>, 75_-26.x <dbl>,
## # 19_28.x <dbl>, 19_24.x <dbl>, 19_20.x <dbl>, 19_16.x <dbl>, 19_12.x <dbl>,
## # 19_9.x <dbl>, 19_5.x <dbl>, 19_1.x <dbl>, 19_-3.x <dbl>, 19_-7.x <dbl>,
## # 46_67.x <dbl>, 46_58.x <dbl>, 46_48.x <dbl>, 46_39.x <dbl>, 46_30.x <dbl>,
## # 46_21.x <dbl>, 46_12.x <dbl>, 46_2.x <dbl>, 46_-7.x <dbl>, 46_-16.x <dbl>,
## # 68_99 <dbl>, 68_85 <dbl>, 68_71 <dbl>, 68_58 <dbl>, 68_44 <dbl>,
## # 68_31 <dbl>, 68_17 <dbl>, 68_3 <dbl>, 68_-10 <dbl>, 68_-24 <dbl>,
## # English_exclude.x <dbl>, sex.y <dbl>, age.y <dbl>, relat.y <dbl>,
## # income.y <dbl>, poli_soc.y <dbl>, poli_econ.y <dbl>, trust_gen.y <dbl>,
## # HH1.y <dbl>, …
Hmmm… this wasn’t quite what I wanted. The datasets are side by side and I want them on top of each other. I’m wondering if pivot_longer() would help?
total %>% pivot_longer(participant, names_to = NULL, values_to = 'total_participants')## # A tibble: 504 x 184
## sex.x age.x relat.x income.x poli_soc.x poli_econ.x trust_gen.x DS1.x DS2.x
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2 21 1 4 3 6 9 7 5
## 2 2 22 2 1 2 2 5 5 3
## 3 1 25 2 1 4 4 4 3 2
## 4 2 28 1 5 7 7 8 7 5
## 5 1 38 1 20 2 2 2 6 6
## 6 1 49 1 20 3 6 7 4 3
## 7 2 32 1 10 4 4 8 7 7
## 8 2 39 1 16 6 6 7 6 4
## 9 2 47 2 7 1 1 7 6 3
## 10 2 19 2 10 4 4 3 7 2
## # … with 494 more rows, and 175 more variables: DS3.x <dbl>, DS4.x <dbl>,
## # DS5.x <dbl>, DS6.x <dbl>, DS7.x <dbl>, relationship_category <dbl>,
## # part_leng <dbl>, part_sex <dbl>, part_age <dbl>, HH1.x <dbl>, HH2.x <dbl>,
## # HH3.x <dbl>, HH4.x <dbl>, HH5.x <dbl>, HH6.x <dbl>, HH7.x <dbl>,
## # HH8.x <dbl>, HH9.x <dbl>, HH10.x <dbl>, comf1.x <dbl>, comf2.x <dbl>,
## # comf3.x <dbl>, comf4.x <dbl>, comf5.x <dbl>, comf6.x <dbl>, comf7.x <dbl>,
## # comf8.x <dbl>, comf9.x <dbl>, comf10.x <dbl>, 37_54 <dbl>, 37_46 <dbl>,
## # 37_39 <dbl>, 37_31 <dbl>, 37_24 <dbl>, 37_17 <dbl>, 37_9 <dbl>, 37_2 <dbl>,
## # 37_-6 <dbl>, 37_-13 <dbl>, 23_33 <dbl>, 23_29 <dbl>, 23_24 <dbl>,
## # 23_20 <dbl>, 23_15 <dbl>, 23_10 <dbl>, 23_6 <dbl>, 23_1 <dbl>, 23_-3 <dbl>,
## # 23_-8 <dbl>, 75_109.x <dbl>, 75_94.x <dbl>, 75_79.x <dbl>, 75_64.x <dbl>,
## # 75_49.x <dbl>, 75_34.x <dbl>, 75_19.x <dbl>, 75_4.x <dbl>, 75_-11.x <dbl>,
## # 75_-26.x <dbl>, 19_28.x <dbl>, 19_24.x <dbl>, 19_20.x <dbl>, 19_16.x <dbl>,
## # 19_12.x <dbl>, 19_9.x <dbl>, 19_5.x <dbl>, 19_1.x <dbl>, 19_-3.x <dbl>,
## # 19_-7.x <dbl>, 46_67.x <dbl>, 46_58.x <dbl>, 46_48.x <dbl>, 46_39.x <dbl>,
## # 46_30.x <dbl>, 46_21.x <dbl>, 46_12.x <dbl>, 46_2.x <dbl>, 46_-7.x <dbl>,
## # 46_-16.x <dbl>, 68_99 <dbl>, 68_85 <dbl>, 68_71 <dbl>, 68_58 <dbl>,
## # 68_44 <dbl>, 68_31 <dbl>, 68_17 <dbl>, 68_3 <dbl>, 68_-10 <dbl>,
## # 68_-24 <dbl>, English_exclude.x <dbl>, sex.y <dbl>, age.y <dbl>,
## # relat.y <dbl>, income.y <dbl>, poli_soc.y <dbl>, poli_econ.y <dbl>,
## # trust_gen.y <dbl>, HH1.y <dbl>, HH2.y <dbl>, HH3.y <dbl>, …
Well..that seemed to do something, but I just don’t know what I’m looking at… I think I’ll scratch merging the data. I will use the information I got from the dim() function though, and choose to do my further analyses on Study 3 because it has 905 rows (and therefore 905 participants) so should afford me the greatest statistical power. I chose the first variable to examine - dog poo!
I made a summary table of descriptive statistics instead. Working out the best way to do this took me hours and many meltdowns. I couldn’t work out how Jenny had done it despite being at the Q&A and re-watching her code it. Finally came up with what is quite a pretty descriptives table! I just started it with one variable (DS1) to start off with.
DS1summary <- data_3 %>%
filter(sex != '3') %>%
group_by(sex) %>%
summarise (mean = mean(DS1),
sd = sd(DS1),
n = n(),
se = sd/sqrt(n))
print(DS1summary)## # A tibble: 2 x 5
## sex mean sd n se
## <dbl> <dbl> <dbl> <int> <dbl>
## 1 1 5.15 1.40 452 0.0657
## 2 2 5.56 1.38 445 0.0656
DS1summary %>% gt() %>% fmt_number(columns = 2:5, decimals = 2)| sex | mean | sd | n | se |
|---|---|---|---|---|
| 1 | 5.15 | 1.40 | 452.00 | 0.07 |
| 2 | 5.56 | 1.38 | 445.00 | 0.07 |
Well… according to this table, females feel higher levels of disgust on average (5.15) compared to males (5.56).
I then plotted this to have a visual comparison, before doing some inferential statistical testing.
DS1summary %>% ggplot(aes(x=sex, y=mean, fill=sex)) +
geom_col() +
geom_errorbar(aes(ymin = mean - se, ymax = mean + se),
width = .2) +
scale_x_discrete(name = 'gender', breaks = c(1:2), labels = c('Males', 'Females')) +
scale_y_continuous(name = 'Average disgust - dog poo', expand = c(0,0), limits = c(0, 6)) +
easy_text_size(15) +
theme(legend.position = "none")Yep.. females are definitely still higher! Now let’s see if this is statistically significant.
data_3 %>%
filter(sex != '3')## # A tibble: 897 x 68
## participant sex age relat income DS_Inst DS1 DS2 DS3 DS4 DS5
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 599 2 28 1 10 1 7 6 7 5 7
## 2 542 2 37 1 7 1 5 7 5 6 5
## 3 685 1 37 1 6 1 7 7 7 7 7
## 4 778 2 77 1 4 1 7 7 5 7 7
## 5 484 2 73 2 4 1 4 6 4 4 4
## 6 23 1 70 1 7 1 3 3 2 5 4
## 7 643 1 70 1 5 1 4 5 3 6 5
## 8 276 2 69 2 2 1 7 5 4 7 4
## 9 360 1 69 1 10 1 7 6 4 6 5
## 10 33 2 68 2 4 1 6 7 4 4 7
## # … with 887 more rows, and 57 more variables: DS6 <dbl>, DS7 <dbl>,
## # Target <dbl>, Value <dbl>, Alex_inst <dbl>, Alex_height <dbl>,
## # Alex_weight <dbl>, Alex_age <dbl>, Alex_wealth <dbl>,
## # Alex_intelligent <dbl>, Alex_attractive <dbl>, Alex_kind <dbl>,
## # Alex_honest <dbl>, Alex_care <dbl>, comfort_inst <dbl>, comf1 <dbl>,
## # comf2 <dbl>, comf3 <dbl>, comf4 <dbl>, comf5 <dbl>, comf6 <dbl>,
## # comf7 <dbl>, comf8 <dbl>, comf9 <dbl>, comf10 <dbl>, WTR_INST <dbl>,
## # 75_109 <dbl>, 75_94 <dbl>, 75_79 <dbl>, 75_64 <dbl>, 75_49 <dbl>,
## # 75_34 <dbl>, 75_19 <dbl>, 75_4 <dbl>, 75_-11 <dbl>, 75_-26 <dbl>,
## # 19_28 <dbl>, 19_24 <dbl>, 19_20 <dbl>, 19_16 <dbl>, 19_12 <dbl>,
## # 19_9 <dbl>, 19_5 <dbl>, 19_1 <dbl>, 19_-3 <dbl>, 19_-7 <dbl>, 46_67 <dbl>,
## # 46_58 <dbl>, 46_48 <dbl>, 46_39 <dbl>, 46_30 <dbl>, 46_21 <dbl>,
## # 46_12 <dbl>, 46_2 <dbl>, 46_-7 <dbl>, 46_-16 <dbl>, English_Exclude <dbl>
data3sexxds1 <- select(data_3, sex, DS1)
female <- data3sexxds1 %>%
filter(sex == '2')
male <- data3sexxds1 %>%
filter(sex == '1')
t.test(female, male, var.equal = TRUE)##
## Two Sample t-test
##
## data: female and male
## t = 6.8712, df = 1792, p-value = 8.753e-12
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.5034617 0.9056812
## sample estimates:
## mean of x mean of y
## 3.780899 3.076327
Okay this doesn’t appear to be significant, as p = 8.75. I’m not actually sure if this is right…I can’t quite work out how to do a t-test in R because this seems wrong…
3c. Do people in a relationship have higher levels of contact comfort than those in platonic relationships?
I would assume that people in a relationship would have higher levels of contact comfort than those who aren’t…so let’s see! I’m going to use the comf2 variable which is ‘using target’s deodorant stick on yourself’ because that seems like something that people in relationships would do all the time, but people who aren’t as used to being close to someone, may find pretty gross… This variable is measured from 1= very uncomfortable to 7 = very comfortable.
I first made a summary table of descriptive statistics.
comf2summary <- data_1_raw %>%
group_by(relat) %>%
summarise (mean = mean(comf2),
sd = sd(comf2),
n = n(),
se = sd/sqrt(n))
print(comf2summary)## # A tibble: 2 x 5
## relat mean sd n se
## <dbl> <dbl> <dbl> <int> <dbl>
## 1 1 4.25 2.37 335 0.130
## 2 2 2.99 2.01 169 0.155
comf2summary %>% gt() %>% fmt_number(columns = 2:5, decimals = 2) | relat | mean | sd | n | se |
|---|---|---|---|---|
| 1 | 4.25 | 2.37 | 335.00 | 0.13 |
| 2 | 2.99 | 2.01 | 169.00 | 0.15 |
Okay! Definitely looks like there is a difference, with people in a romantic relationship about 1.25 points higher on the level of comfort. Let’s see if the plot supports this…
comf2summary %>% ggplot(aes(x=relat, y=mean, fill=relat)) +
geom_col() +
geom_errorbar(aes(ymin = mean - se, ymax = mean + se),
width = .2) +
scale_x_continuous(name = 'Relationship Category', breaks = c(1:2), labels = c('In a romantic relationship', 'Not in a romantic relationship')) +
scale_y_continuous(name = 'Average contact comfort - deoderant', expand = c(0,0), limits = c(0, 6)) +
easy_text_size(15) +
theme(legend.position = "none")Oh yeah, definitely some sort of difference! Wonder if it’s statistically significant?
data1relxcomf2 <- select(data_1_raw, relat, comf2)
yesrelationship <- data1relxcomf2 %>%
filter(relat == '1')
norelationship <- data1relxcomf2 %>%
filter(relat == '2')
t.test(yesrelationship, norelationship, var.equal = TRUE)##
## Two Sample t-test
##
## data: yesrelationship and norelationship
## t = 0.95002, df = 1006, p-value = 0.3423
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.1414894 0.4070551
## sample estimates:
## mean of x mean of y
## 2.626866 2.494083
Wow! Looks like it’s significant too! At least one of my questions came up with something interesting!
Week 10 Goals
- Get someone to check my statistical analyses
- Write up my recommendations
- Knit my report into word
- Don’t throw my computer at the wall