Goals

My main goal was to come up with the three exploratory analyses and start coding the descriptives and plot for my first question.

Questions

  1. Was there a significant difference between the sexes in each country in how much they lied?
  2. Was there a significant difference between younger (<=23) and older (>23) participants in each condition?
  3. Are there significant differences between religions (Christianity, Buddhism, Judaism, Islam)?

These questions are subject to change, but I thought they were interesting and would show off different skills!

Coding for Q1

load libraries

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.2     v dplyr   1.0.7
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(janitor) # data-cleaning helpers (read_csv() itself comes from readr, loaded with tidyverse)
## 
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
library(ggplot2) # plot boxplot
library(gt) # create summary table
library(ggpubr) # statistical analyses

read nichols csv

data1 <- read_csv("~/RFiles/verification_report/Nichols_et_al_dataset_V2.0.csv")
## 
## -- Column specification --------------------------------------------------------
## cols(
##   .default = col_double(),
##   moneyclaim = col_character(),
##   `completion time (practice included)` = col_time(format = ""),
##   `completion time (payments only)` = col_time(format = ""),
##   Religion = col_character(),
##   `Religion Text` = col_character()
## )
## i Use `spec()` for the full column specifications.

rename variables + make tibble

I used filter() to keep only the rows where include == 0, then rename() to give each variable the name used in the original paper. Then I selected the relevant variables to make a cleaner dataset to look at, and used na.omit() to drop any rows with NA values.

data1 <- data1 %>%  
  filter(include == 0) %>% 
  rename(cond = con,
         claimpercent = claim,
         claimmoney = moneyclaim,
         CT_practice = `completion time (practice included)`,
         CT_payments = `completion time (payments only)`,
         religiosity = relig,
         religion = Religion) %>% 
  select(site, claimpercent, id, sex) %>% 
  as_tibble()

data1 <- na.omit(data1)

rename each site

I had to recode the values of the site variable to match our codebook.

data1$site[data1$site==1] <- 0 
data1$site[data1$site==3] <- 1 

# makes USA = 0, Japan = 1, Czech Republic = 2

For sex: 0 = female, 1 = male, NA = missing data

make site and sex factor, and rename

I converted site and sex from double variables to factors and gave the levels readable labels.

data1$site <- factor(data1$site, labels = c("USA","Japan","Czech Rep."))

data1$sex <- factor(data1$sex, labels = c("Female", "Male"))
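
The recoding and the factor conversion could also be written as one explicit mapping, so the result does not depend on the order in which values are overwritten. This is only a minimal sketch on made-up data: `toy` is a hypothetical stand-in for data1, and the raw coding (1 = USA, 3 = Japan, 2 = Czech Republic) is an assumption inferred from the comment above, not checked against the codebook.

```r
# Made-up data standing in for data1 (values are for illustration only)
toy <- data.frame(
  site = c(1, 3, 2, 1, 3),
  sex  = c(0, 1, 1, 0, NA)
)

# `levels` lists the raw codes in the same order as `labels`,
# so each code is tied to its label explicitly.
# Assumed raw coding: 1 = USA, 3 = Japan, 2 = Czech Republic.
toy$site <- factor(toy$site,
                   levels = c(1, 3, 2),
                   labels = c("USA", "Japan", "Czech Rep."))

# 0 = female, 1 = male; NA stays NA
toy$sex <- factor(toy$sex,
                  levels = c(0, 1),
                  labels = c("Female", "Male"))

# Check the mapping, including any missing values
table(toy$site, toy$sex, useNA = "ifany")
```

Giving `levels` explicitly also means an unexpected code in the raw data becomes NA instead of silently shifting the labels.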

creating summary table

I used the gt package to create a summary table. group_by() grouped the data by sex, and summarise() calculated the mean, sd, n and se of claimpercent for each group. Then I used gt() to create the table and fmt_number() to round the mean, sd and se columns to 2 decimal places.

gender_summary <- data1 %>% 
  group_by(sex) %>% 
  summarise(mean = mean(claimpercent),
            sd = sd(claimpercent),
            n = n(),
            se = sd/sqrt(n))

gender_summary %>% 
  gt() %>% 
  fmt_number(
    columns = c(mean, sd, se),
    decimals = 2)
sex      mean   sd     n     se
Female   0.26   0.28   208   0.02
Male     0.29   0.30   195   0.02
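
Since Q1 asks about the sexes within each country, the same summary could be grouped by both site and sex. The sketch below uses made-up numbers in a hypothetical `toy` data frame, not the Nichols data, so the values mean nothing in themselves.

```r
library(dplyr)

# Made-up data standing in for data1
toy <- data.frame(
  site = rep(c("USA", "Japan", "Czech Rep."), each = 4),
  sex  = rep(c("Female", "Male"), times = 6),
  claimpercent = c(0.10, 0.20, 0.30, 0.40,
                   0.00, 0.10, 0.20, 0.30,
                   0.50, 0.60, 0.70, 0.80)
)

# One row per site x sex combination
site_sex_summary <- toy %>%
  group_by(site, sex) %>%
  summarise(mean = mean(claimpercent),
            sd   = sd(claimpercent),
            n    = n(),
            se   = sd / sqrt(n),
            .groups = "drop")  # return an ungrouped tibble

site_sex_summary
```

The result can be passed to gt() and fmt_number() in the same way as gender_summary.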

boxplot

I used ggplot2 to create a box plot. facet_wrap() split the plot into separate panels for female and male. scale_y_continuous() and scale_x_discrete() renamed the y and x axes.

I used the stat_compare_means() function from ggpubr to run an ANOVA and print its p-value on the plot, but I am not entirely sure if that was the correct way to go about it.

fig1_sex <- ggplot(data1) +
  geom_boxplot(
    aes(x = site, y = claimpercent, fill = site)) +
  facet_wrap(vars(sex)) +
  scale_y_continuous(name = "Percent Claimed") +
  scale_x_discrete(name = "Country") +
  theme_classic() +
  theme(legend.position = "none") +
  ggtitle(label = "Difference in Percent Claimed Between Sexes in Each Country") +
  scale_fill_viridis_d(
    alpha = .5,
    name = NULL
  ) +
  stat_compare_means(data = data1,
                     aes(x = site,
                         y = claimpercent),
                     method = "anova")

plot(fig1_sex)
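
The plot above compares countries within each sex. To compare the sexes within each country instead, one option is to swap the roles: put sex on the x axis and facet by site, so stat_compare_means() tests Female against Male in every panel. This is a sketch on simulated data (`toy` and `fig_sex_by_site` are hypothetical names), and the t-test is just one possible choice of method.

```r
library(ggplot2)
library(ggpubr)

# Simulated data standing in for data1
set.seed(1)
toy <- data.frame(
  site = factor(rep(c("USA", "Japan", "Czech Rep."), each = 40),
                levels = c("USA", "Japan", "Czech Rep.")),
  sex  = factor(rep(c("Female", "Male"), times = 60)),
  claimpercent = runif(120)
)

fig_sex_by_site <- ggplot(toy, aes(x = sex, y = claimpercent, fill = sex)) +
  geom_boxplot() +
  facet_wrap(vars(site)) +                   # one panel per country
  stat_compare_means(method = "t.test") +    # Female vs Male within each panel
  scale_fill_viridis_d(alpha = .5, name = NULL) +
  labs(x = "Sex", y = "Proportion Claimed") +
  theme_classic() +
  theme(legend.position = "none")

print(fig_sex_by_site)
```

Because the aesthetics are set once in ggplot(), stat_compare_means() inherits them and needs no data or aes() arguments of its own.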

statistical analysis (ANOVA?)

I am not quite sure what kind of statistical test to use - I am not sure if I just chose a difficult question, or if I should conduct a few different analyses.

# anova - site

m1 <- aov(claimpercent ~ site, data = data1)

summary(m1)
##              Df Sum Sq Mean Sq F value   Pr(>F)    
## site          2  5.259  2.6293   35.94 4.44e-15 ***
## Residuals   400 29.266  0.0732                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
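
The model above only tests site, so it does not yet address the sexes within each country. One candidate approach is a two-way ANOVA with a site by sex interaction, followed by a t-test of sex within each site. The sketch below runs on simulated data in a hypothetical `toy` data frame, so its output says nothing about the real dataset, and it is not necessarily the right analysis for Q1.

```r
# Simulated, balanced data standing in for data1
set.seed(42)
toy <- expand.grid(
  rep  = 1:20,
  site = c("USA", "Japan", "Czech Rep."),
  sex  = c("Female", "Male")
)
toy$claimpercent <- runif(nrow(toy))

# site * sex expands to site + sex + site:sex;
# the site:sex row tests whether the sex difference varies by country
m2 <- aov(claimpercent ~ site * sex, data = toy)
summary(m2)

# Follow-up: one Welch t-test of sex within each site
by_site <- lapply(split(toy, toy$site),
                  function(d) t.test(claimpercent ~ sex, data = d))
p_vals <- sapply(by_site, function(res) res$p.value)

# Adjust for running three tests
p.adjust(p_vals, method = "holm")
```

With unequal group sizes, as in the real data, the sums of squares from aov() depend on the order of the terms, so that would need checking before interpreting the result.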

Challenges

I ran into a few challenges using gt() because I had never used it before. For example, I used fmt() instead of fmt_number(), so I could not create the table. Overall though, the gt() function is very easy to use once you get the hang of it!

My main problem was figuring out which statistical analyses I should use. I was not sure how to test the difference between the sexes at each site. Would I have to conduct a few different t-tests or ANOVA tests? Or is there an easier way?

Next Steps

  • Start working on Q2 and Q3
  • Figure out which statistical analyses I should use for Q1