My main goal was to come up with three exploratory analyses and start coding the descriptives and plot for my first question.
The questions are subject to change, but I thought they were interesting and would show off different skills!
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.2 v dplyr 1.0.7
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(janitor) # data cleaning helpers (read_csv() itself comes from readr)
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(ggplot2) # plot boxplot
library(gt) # create summary table
library(ggpubr) # statistical analyses
data1 <- read_csv("~/RFiles/verification_report/Nichols_et_al_dataset_V2.0.csv")
##
## -- Column specification --------------------------------------------------------
## cols(
## .default = col_double(),
## moneyclaim = col_character(),
## `completion time (practice included)` = col_time(format = ""),
## `completion time (payments only)` = col_time(format = ""),
## Religion = col_character(),
## `Religion Text` = col_character()
## )
## i Use `spec()` for the full column specifications.
After keeping only the rows where include == 0, I used rename() to give each variable the name used in the original paper. Then I selected the relevant variables to make a cleaner dataset, and used na.omit() to drop any rows that still contained NA values.
data1 <- data1 %>%
filter(include == 0) %>%
rename(cond = con,
claimpercent = claim,
claimmoney = moneyclaim,
CT_practice = `completion time (practice included)`,
CT_payments = `completion time (payments only)`,
religiosity = relig,
religion = Religion) %>%
select(site, claimpercent, id, sex) %>%
as_tibble()
data1 <- na.omit(data1)
I had to recode the values of site to match the codebook.
data1$site[data1$site==1] <- 0
data1$site[data1$site==3] <- 1
# makes USA = 0, Japan = 1, Czech Republic = 2
# sex: 0 = female, 1 = male, NA = missing data
I then converted site and sex from double to factor variables and gave their levels descriptive labels.
data1$site <- factor(data1$site, labels = c("USA","Japan","Czech Rep."))
data1$sex <- factor(data1$sex, labels = c("Female", "Male"))
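The recode-then-label steps above can also be collapsed into a single factor() call per variable, which avoids depending on the order of the overwrites. This is only a minimal sketch on made-up vectors (`site_raw` and `sex_raw` are hypothetical names), and it assumes, based on the recoding above, that the raw site codes are 1 = USA, 3 = Japan, and 2 = Czech Republic.

```r
# Toy vectors standing in for the raw columns (made-up values, NOT the real data)
site_raw <- c(1, 3, 2, 1, 2, 3)
sex_raw  <- c(0, 1, 1, 0, NA, 1)

# Map each raw code straight to its label; `levels` also fixes the display order.
# Assumed coding: 1 = USA, 3 = Japan, 2 = Czech Republic.
site <- factor(site_raw,
               levels = c(1, 3, 2),
               labels = c("USA", "Japan", "Czech Rep."))

# 0 = female, 1 = male; NA stays NA
sex <- factor(sex_raw,
              levels = c(0, 1),
              labels = c("Female", "Male"))

table(site)
table(sex, useNA = "ifany")
```

Spelling out `levels` makes the code-to-label mapping explicit, so a value that is not in the codebook becomes NA instead of silently picking up the wrong label.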
I used the gt package to create a summary table. group_by() grouped the data by sex, and summarise() calculated the mean, sd, n, and se of claimpercent within each group. I then passed the result to gt() and used fmt_number() to round the mean, sd, and se columns to 2 decimal places.
gender_summary <- data1 %>%
group_by(sex) %>%
summarise(mean = mean(claimpercent),
sd = sd(claimpercent),
n = n(),
se = sd/sqrt(n))
gender_summary %>%
gt() %>%
fmt_number(
columns = c(mean, sd, se),
decimals = 2)
| sex | mean | sd | n | se |
|---|---|---|---|---|
| Female | 0.26 | 0.28 | 208 | 0.02 |
| Male | 0.29 | 0.30 | 195 | 0.02 |
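Because the question is about the sexes within each country, the same summary could be grouped by both site and sex. The sketch below shows the idea on a simulated stand-in for data1 (`toy` is a hypothetical data frame of random values, so the numbers it prints say nothing about the real dataset).

```r
library(dplyr)

# Simulated stand-in for data1 (random values, NOT the real dataset)
set.seed(1)
toy <- data.frame(
  site = factor(rep(c("USA", "Japan", "Czech Rep."), each = 20),
                levels = c("USA", "Japan", "Czech Rep.")),
  sex = factor(rep(c("Female", "Male"), times = 30)),
  claimpercent = runif(60)
)

# One row per site-by-sex cell instead of one row per sex
site_sex_summary <- toy %>%
  group_by(site, sex) %>%
  summarise(mean = mean(claimpercent),
            sd = sd(claimpercent),
            n = n(),
            se = sd / sqrt(n),
            .groups = "drop")

site_sex_summary
```

The result has the same columns as the table above plus site, so it should be possible to pipe it into gt() and fmt_number() in the same way.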
I used ggplot2 to create a box plot of claimpercent by site. facet_wrap() split the plot into separate panels for females and males, and scale_y_continuous() and scale_x_discrete() renamed the y and x axes.
I used the stat_compare_means() function from ggpubr to add an ANOVA p-value to the plot, but I am not entirely sure that was the correct way to go about it.
fig1_sex <- ggplot(data1) +
geom_boxplot(
aes(x = site, y = claimpercent, fill = site)) +
facet_wrap(vars(sex)) +
scale_y_continuous(name = "Percent Claimed") +
scale_x_discrete(name = "Country") +
theme_classic() +
theme(legend.position = "none") +
ggtitle(label = "Difference in Percent Claimed Between Sexes in Each Country") +
scale_fill_viridis_d(
alpha = .5,
name = NULL
) +
stat_compare_means(data = data1,
aes(x = site,
y = claimpercent),
method = "anova")
plot(fig1_sex)
I am not sure which statistical test to use: either I chose a difficult question, or I need to conduct a few different analyses. As a starting point, I ran a one-way ANOVA of claimpercent on site.
# anova - site
m1 <- aov(claimpercent ~ site, data = data1)
summary(m1)
## Df Sum Sq Mean Sq F value Pr(>F)
## site 2 5.259 2.6293 35.94 4.44e-15 ***
## Residuals 400 29.266 0.0732
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
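A significant F-test only says that the site means are not all equal; it does not say which sites differ. One common follow-up is Tukey's HSD via base R's TukeyHSD(). The sketch below is illustrative only: it runs on simulated values (`toy` and `m_toy` are hypothetical names), not on data1.

```r
# Simulated stand-in for data1 (random values, NOT the real dataset)
set.seed(7)
toy <- data.frame(
  site = factor(rep(c("USA", "Japan", "Czech Rep."), each = 30),
                levels = c("USA", "Japan", "Czech Rep.")),
  claimpercent = runif(90)
)

# Same model form as m1, fitted to the toy data
m_toy <- aov(claimpercent ~ site, data = toy)

# All pairwise site comparisons with Tukey-adjusted p-values
tukey_site <- TukeyHSD(m_toy)
tukey_site
```

With the real data, the equivalent call would presumably be TukeyHSD(m1).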
I ran into a few challenges using gt() because I had never used it before. For example, I used fmt() instead of fmt_number(), so I could not create the table. Overall though, the gt() function is very easy to use once you get the hang of it!
My main problem was figuring out which statistical analyses I should use. I was not sure how to test the difference between the sexes at each site. Would I have to conduct a few different t-tests or ANOVA tests? Or is there an easier way?
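One possible approach, offered as a suggestion rather than the definitive answer, is a two-way ANOVA with a site-by-sex interaction. The interaction term tests whether the sex difference varies across sites. It can then be followed up with one t-test per site, with the p-values adjusted for multiple comparisons. The sketch below uses simulated values (`toy`, `m2`, `p_raw`, and `p_holm` are hypothetical names), so its output is not a result for the real data.

```r
# Simulated stand-in for data1 (random values, NOT the real dataset)
set.seed(42)
toy <- data.frame(
  site = factor(rep(c("USA", "Japan", "Czech Rep."), each = 40),
                levels = c("USA", "Japan", "Czech Rep.")),
  sex = factor(rep(c("Female", "Male"), times = 60)),
  claimpercent = runif(120)
)

# Two-way ANOVA: main effects of site and sex plus their interaction
m2 <- aov(claimpercent ~ site * sex, data = toy)
summary(m2)

# Follow-up: one Welch t-test of sex within each site
p_raw <- sapply(split(toy, toy$site),
                function(d) t.test(claimpercent ~ sex, data = d)$p.value)

# Holm adjustment for running three tests
p_holm <- p.adjust(p_raw, method = "holm")
p_holm
```

One caveat: aov() uses sequential sums of squares, so with unequal group sizes (as in the real data, with 208 females and 195 males) the main-effect tests can depend on the order in which site and sex are entered.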