Inference_simulations

Tips before getting started

This is a document made to accompany some simulations from the Inference lecture in Psych 201a. The goal of this document is to continue learning in R/tidyverse but also to gain hands on experience simulating and manipulating data.

If you need to install something, you can run install.packages('tidyverse'), where you substitute the name of the library

You should have this repository cloned on your computer, ideally in a folder where you have all of your github repositories (e.g., /brialong/Documents/GitHub/in_class_excercises\)

To understand what a function does, type ? [function_name] where function_name refers to a function name in a loaded repository.

Setup

set.seed(123) # good practice to set a random seed, or else different runs get you different results

Import the functions & data that we need

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggplot2) # plotting
library(ggthemes) # optional, but nice

Define the simulation function

This makes “tea data”, a tibble (dataframe) where there are a certain number of people in each condition (default = 48, i.e., n_total, with n_total/2 in each half)

The averages of the two conditions are separated by a known effect (“delta”) with some variance (“sigma”). You can change these around since we’re simulating data!

make_tea_data <- function(n_total = 48, sigma = 1.25, delta = 1.5) {
  n_half <- n_total / 2
  tibble(condition = c(rep("milk first", n_half), rep("tea first", n_half)),
         rating = c(round(rnorm(n_half, mean = 3.5 + delta, sd = sigma)), 
                    round(rnorm(n_half, mean = 3.5, sd = sigma)))) |>
    mutate(rating = if_else(rating > 10, 10, rating), # truncate if greater than max/min of rating scale
           rating = if_else(rating < 1, 1, rating))
}

# Increase the sample size to 85
tea_data_large_sample <- make_tea_data(n_total = 85)

# Summarize the mean ratings by condition
tea_data_large_sample %>%
  group_by(condition) %>%
  summarise(mean_rating = mean(rating))

# A tibble: 2 × 2
  condition  mean_rating
  <chr>            <dbl>
1 milk first        5.10
2 tea first         3.55

# Run a t-test to compare the two conditions
t.test(rating ~ condition, data = tea_data_large_sample)


    Welch Two Sample t-test

data:  rating by condition
t = 6.237, df = 81.989, p-value = 1.848e-08
alternative hypothesis: true difference in means between group milk first and group tea first is not equal to 0
95 percent confidence interval:
 1.054002 2.041236
sample estimates:
mean in group milk first  mean in group tea first 
                5.095238                 3.547619

#Observation after changing the sample size

#After increasing the sample size to 85, the mean ratings for "milk first" and "tea first" were 5.2 and 3.5, respectively. The t-test revealed a significant difference between the two conditions (p-value = <.001), suggesting that with a larger sample size, we have more power to detect the difference in ratings.

# Decrease the difference in means to 0.5
tea_data_small_delta <- make_tea_data(delta = 0.5)

# Summarize the mean ratings by condition
tea_data_small_delta %>%
  group_by(condition) %>%
  summarise(mean_rating = mean(rating))

# A tibble: 2 × 2
  condition  mean_rating
  <chr>            <dbl>
1 milk first        4.17
2 tea first         3.38

# Run a t-test to compare the two conditions
t.test(rating ~ condition, data = tea_data_small_delta)


    Welch Two Sample t-test

data:  rating by condition
t = 2.3823, df = 45.964, p-value = 0.0214
alternative hypothesis: true difference in means between group milk first and group tea first is not equal to 0
95 percent confidence interval:
 0.1227416 1.4605917
sample estimates:
mean in group milk first  mean in group tea first 
                4.166667                 3.375000

#Observation changing after Delta

#When the delta (difference in means) was reduced to 0.5, the difference in ratings between "milk first" and "tea first" was smaller. The mean ratings were 4.2 and 3.5, respectively, and the t-test was significant (p-value = 0.03). even with a reduced delta (difference in means) of 0.5, there is still a statistically significant difference between the “milk first” and “tea first” groups.

Make data frames where we have small or larger samples of tea data for ONE experiment

# here's, we're calling our custom function, and specifying different inputs than the defaults (which are inside the parenthese up above)
tea_data <- make_tea_data(n_total = 18, delta=1.5)

tea_data_highn <- make_tea_data(n_total = 48, delta=1.5)

To do: OK, look at these data frames. How long are they, what are the column names? Look at them in your console and in the environment if you want.

To do: Write basic tidyverse code to calculate the mean of each condition (hint: use group_by and summarize)

by_condition <- tea_data %>%
  group_by(condition) %>%
  summarize(mean_rating = mean(rating), num_participants = length(rating))

by_condition_highn <- tea_data_highn %>%
  group_by(condition) %>%
  summarize(mean_rating = mean(rating), num_participants = length(rating))

In this first draw, was it significant with N=9 or N=24 per group?

OK, we can do t-tests already! I’ve done these within a pipe (whoops, I often use the old pipe operator because I have small hands)

To do: Run these and look at the outputs by posting out_low_n in the console

out_low_n <- tea_data %>%
  t.test(rating ~ condition, data = .,  var.equal = TRUE) 

out_high_n <- tea_data_highn %>%
  t.test(rating ~ condition, data = ., var.equal = TRUE)

Simulate 1000 experiments

To do: Run these code blocks

…where you have 18 participants per experiment with an average difference of 1.5 points in tea deliciousness on average

samps <- tibble(sim = 1:1000) |> # 
  mutate(data = map(sim, \(i) make_tea_data(n_total = 18, delta=1.5))) |>  # simulate
  unnest(cols = data) # wrangle

…where you have 48 participants per experiment

samps_highn <- tibble(sim = 1:1000) |> # 
  mutate(data = map(sim, \(i) make_tea_data(n_total = 48, delta=1.5))) |>  # simulate
  unnest(cols = data) # wrangle

Summarize both of these simulations

To do: Run these code blocks Do you understand what each line is doing here? (the map function above is hard, just focus here?)

tea_data_summary <- samps |>
  group_by(sim, condition) |> # group by simulation #, and condition
  summarise(mean_rating = mean(rating)) |> # summarize across ratings
  group_by(sim) |> # now get difference
  summarise(delta = mean_rating[condition == "milk first"] -
              mean_rating[condition == "tea first"])

`summarise()` has grouped output by 'sim'. You can override using the `.groups`
argument.

tea_data_highn_summary <- samps_highn |>
  group_by(sim, condition) |> # group by simulation #, and condition
  summarise(mean_rating = mean(rating)) |> # summarize across ratings
  group_by(sim) |> # now get difference
  summarise(delta = mean_rating[condition == "milk first"] -
              mean_rating[condition == "tea first"])

`summarise()` has grouped output by 'sim'. You can override using the `.groups`
argument.

Plot difference for low-n

Let’s make a plot to plot the differences in ratings across conditions To do: run these code blocks

ggplot(data=tea_data_summary, aes(x=delta)) +
  geom_histogram(alpha=.8, bins=20)

  # theme_few()

Or simply

hist(tea_data_summary$delta)

Plot difference for higher-n

To do: What’s different about this distribution vs the one we just plotted?

hist(tea_data_highn_summary$delta)

Bonus What happens if you run it again? Try varying the variance / mean of the effect when you change the “delta” and “sigma” values?

What happens if you vary the random seed?

Now let’s visualize what would happen under the null distribution in two ways

First, by simulating no differences between conditions. Remember, it’s the null model because DELTA (i.e., differences in conditions) is ZERO

To do: Where in the function is it specifying that there is no difference between conditions?

null_model <- tibble(sim = 1:1000) |>
  mutate(data = map(sim, \(i) make_tea_data(n_total = 18, delta = 0))) |>
  unnest(cols = data)

null_model_summary <- null_model |>
  group_by(sim, condition) |>
  summarise(mean_rating = mean(rating)) |>
  group_by(sim) |>
  summarise(delta = mean_rating[condition == "milk first"] -
              mean_rating[condition == "tea first"])

`summarise()` has grouped output by 'sim'. You can override using the `.groups`
argument.

To do: Explain what does this data look like and why

ggplot(data=null_model_summary, aes(x=delta)) +
  geom_histogram(alpha=.8, bins=20) +
  theme_few()

Permutations

We’re going to calculate the distribution of the difference between conditions when we’ve shuffled the condition labels

This is the empirical null hypothesis (our H0, since we’re comparing two conditions with a two-sample t-test)

First, let’s shuffle the labels ~within each experiment~

tea_data_highn_shuffled <- samps_highn %>%
  group_by(sim) %>% # for each experiment
  mutate(condition_shuffled = sample(condition)) # shuffle the condition labels

To do: check you understanding – what is “sim” here?

tea_data_highn_shuffled_summary <- tea_data_highn_shuffled %>%
  group_by(condition_shuffled, sim) %>% 
  summarize(mean = mean(rating),
            sd = sd(rating)) %>%
  ungroup() %>% 
  summarize(delta = diff(mean)) # get the difference in ratings between conditions for each experimental draw

`summarise()` has grouped output by 'condition_shuffled'. You can override
using the `.groups` argument.

Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
dplyr 1.1.0.
ℹ Please use `reframe()` instead.
ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
  always returns an ungrouped data frame and adjust accordingly.

To do: what does this histogram look like and why?

hist(tea_data_highn_shuffled_summary$delta) #what does this histogram look like?

Visualizing what happens when we shuffle

OK, now see what happens to our raw data – this is just from one simulation The color refers to the ORIGINAL label before we shuffled, but our condition difference is gone

(Try replotting it so the colors refer to the condition_shuffled, modifying line 176)

ggplot(data = tea_data_highn_shuffled %>% filter(sim==3),  # can change the actual simulation number here
       mapping = aes(x = condition_shuffled, y = rating))+
  geom_point(mapping = aes(color = condition), # color
             alpha=.8,
             position = position_jitter(height = .1,
                                        width = 0.1)) +
  stat_summary(fun.data = mean_cl_boot, # this boostraps the confidence interval
               geom = "linerange",
               size = 1) +
  stat_summary(fun = "mean", # this calculates the average
               geom = "point",
               shape = 21,
               color = "black",
               fill = "white") + 
  scale_y_continuous(breaks = 0:10,
                     labels = 0:10,
                     limits = c(0, 10))

Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.

The idea is now that we can get a sampling distribution of the difference in the means between the two conditions (assuming that the null hypothesis were true), by randomly shuffling the labels and calculating the difference in means (and doing this many times).

What we get is a distribution of the differences we would expect, if there was no effect of condition.

First calculate the actual difference in a simulated dataset

difference_actual = tea_data_highn %>%  # in ONE experiment
  group_by(condition) %>% 
  summarize(mean = mean(rating)) %>% 
  pull(mean) %>% 
  diff()

#plot the distribution of the differences 
ggplot(data = tea_data_highn_shuffled_summary, aes(x=delta)) +
  geom_histogram(aes(y = stat(density)),
                 color = "black",
                 fill = "lightblue",
                 binwidth = 0.05) + 
  stat_density(geom = "line",
               size = 1.5,
               bw = 0.2) +
  geom_vline(xintercept = difference_actual, color = "red", size = 2) +
  labs(x = "difference between means")

Warning: `stat(density)` was deprecated in ggplot2 3.4.0.
ℹ Please use `after_stat(density)` instead.

And we can then simply calculate the p-value by using some basic data wrangling (i.e. finding the proportion of differences that were as or more extreme than the one we observed).

tea_data_highn_shuffled_summary %>% 
  summarize(p_value = sum(delta <= difference_actual)/n())

# A tibble: 1 × 1
  p_value
    <dbl>
1       0

You can also see this if you plot the distributions of the null vs empirical simulations next to each other (blue = null)

ggplot(data=tea_data_highn_shuffled_summary, aes(x=delta)) +
  geom_histogram(alpha=.4, bins=20, color='blue', fill='blue') +
  geom_histogram(alpha=.8, bins=20, data=tea_data_highn_summary)

Confidence intervals

Done here with one experiment, you can choose which is “tea_dataset”

tea_dataset = tea_data
# tea_dataset = tea_data_highn

tea_ratings <- filter(tea_dataset, condition == "tea first")$rating
milk_ratings <- filter(tea_dataset, condition == "milk first")$rating

# could also do in a pipe like so, but then you have to grab the column below, as in tea_ratings$ratings; above is a vector
# tea_ratings <- tea_data_highn %>%
#   filter(condition=="tea first") %>%
#   select(rating)

Calculate a CI on the effect (difference between conditions)

Uses a pooled standard deviation We’re using the normal distribution here to calculate CIs since we know the population SD follows a normal

Note that this is different than the CI calculated by the two-sample t-tests, where

n_tea <- length(tea_ratings)
n_milk <- length(milk_ratings)
sd_tea <- sd(tea_ratings)
sd_milk <- sd(milk_ratings)

tea_sd_pooled <- sqrt(((n_tea - 1) * sd_tea ^ 2 + (n_milk - 1) * sd_milk ^ 2) / 
                        (n_tea + n_milk - 2))

tea_se <- tea_sd_pooled * sqrt((1 / n_tea) + (1 / n_milk))


delta_hat <- mean(milk_ratings) - mean(tea_ratings)
tea_ci_lower <- delta_hat - tea_se * qnorm(0.975)
tea_ci_upper <- delta_hat + tea_se * qnorm(0.975)

To get the 95% CI with the t-distribution

You need to get the appropriate t-statistic from the distribution, which incorporates information about the degrees of freedom

The t-distribution is more appropriate when you have smaller sample sizes and is what is used in t.tests

num_observations = length(tea_dataset$rating)
df = num_observations-2 # for two sample t.test
tea_ci_lower_ttest <- delta_hat - tea_se * qt(0.975,df)
tea_ci_upper_ttest <- delta_hat + tea_se * qt(0.975,df)

# Now the calculated CIs match those in the t-test outputs!
t.test(tea_ratings, milk_ratings, var.equal=TRUE)


    Two Sample t-test

data:  tea_ratings and milk_ratings
t = -2.4863, df = 16, p-value = 0.02433
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -3.4994038 -0.2783739
sample estimates:
mean of x mean of y 
 2.777778  4.666667

Going to plot CIs for each condition

as well as SEs, and visualizae how they’re different

confidence_level=.95 # you can change this
# this formula below gives the critical t-value (as opposed to simply taken from the normal distribution)
# qt(1 - (1 - confidence_level)/2, df = n - 1)


tea_data_highn_summary_cis <- tea_data_highn %>%
  group_by(condition) %>%
  summarize(cond_mean = mean(rating), cond_sd = sd(rating), n=length(rating)) %>%
  mutate(error = qt(1 - (1 - confidence_level)/2, df = n - 1)* (cond_sd/sqrt(n))) %>% # this calculates CIs WITHIN each condition
  mutate(ci_upper = cond_mean + error, ci_lower = cond_mean - error) %>%
  mutate(se_upper = cond_mean + cond_sd/sqrt(n), se_lower = cond_mean - cond_sd/sqrt(n))

Between subjects experiment – lots of variability! I like to visualize the raw datas as well as the mean and CIs

ggplot(data = tea_data_highn, aes(x=condition, y=rating, col=condition))  +
  geom_jitter(width=.1, height=0, alpha=.3) + # visualizes all the raw data, with no variation in y-axis jitter
  theme_few() +
  geom_pointrange(data = tea_data_highn_summary_cis, aes(x=condition, y = cond_mean, ymin = ci_lower, ymax = ci_upper)) +
  ylim(0,10) +
  ggtitle('Tea ratings across conditions with CIs')

ggplot(data = tea_data_highn, aes(x=condition, y=rating, col=condition))  +
  geom_jitter(width=.1, height=0, alpha=.5) + # visualizes all the raw data, with no variation in y-axis jitter
  theme_few() +
  geom_pointrange(data = tea_data_highn_summary_cis, aes(x=condition, y = cond_mean, ymin = se_lower, ymax = se_upper)) +
  ylim(0,10) +
  ggtitle('Tea ratings across condition with SE')

To do: What does each dot represnet? What does the range represent in each graph? What does the confidence interval indicate? What does the SE indicate?

To do: how does this change when you use the low-n experiment?

Simulating p-values across multiple experiments

To do: run this code

First for low-n experiments

all_results=tibble() 

for (this_sim in 1:1000) {
  this_experiment = null_model %>%
    filter(sim==this_sim) 
  
  tea_ratings <- filter(this_experiment, condition == "tea first")$rating
  milk_ratings <- filter(this_experiment, condition == "milk first")$rating
  
  output = t.test(tea_ratings, milk_ratings)
  
  this_exp_output = tibble(pvalue = output$p.value)
  all_results = bind_rows(all_results, this_exp_output)
    
}

To do: Look at the distribution of p-values (hint: in all_results$pvalue)

Make a histogram

What is the distribution of p-values when the null is true?

Calculate the proportion of p-values that are less than .05 What was our false positive rate?

hist(all_results$pvalue)

Now for an experiment when there is actually an effect

all_results_high_n=tibble() 

for (this_sim in 1:500) {
  this_experiment = samps_highn %>%
    filter(sim==this_sim) 
  
  tea_ratings <- filter(this_experiment, condition == "tea first")$rating
  milk_ratings <- filter(this_experiment, condition == "milk first")$rating
  
  output = t.test(tea_ratings, milk_ratings, paired = FALSE, var.equal = TRUE)
  
  this_exp_output = tibble(pvalue = output$p.value)
  all_results_high_n = bind_rows(all_results_high_n, this_exp_output)
    
}

How often did we fail to reject the null hypothesis? When was our p-value greater than p=.05? What does the distribution of p-values look like?

To finish up

Publish this to an rpubs (time to set up if you haven’t!)

Excercises

Now go back (earlier in the code) and modify the DELTA in the simulation functions to be smaller so that there is only a small difference between groups. Is it still significant?
Rewrite this code with the the smaller sample size simulations. What changes?
Save out plots to your computer using “ggsave”. You might need to query the function in
Try some different plotting functions! https://rstudio.github.io/cheatsheets/html/data-visualization.html