set.seed(123) # good practice to set a random seed, or else different runs get you different results
Inference_simulations
Tips before getting started
This is a document made to accompany some simulations from the Inference lecture in Psych 201a. The goal of this document is to continue learning in R/tidyverse but also to gain hands on experience simulating and manipulating data.
If you need to install something, you can run install.packages('tidyverse')
, where you substitute the name of the library
You should have this repository cloned on your computer, ideally in a folder where you have all of your github repositories (e.g., /brialong/Documents/GitHub/in_class_excercises\
)
To understand what a function does, type ? [function_name]
where function_name refers to a function name in a loaded repository.
Setup
Import the functions & data that we need
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2) # plotting
library(ggthemes) # optional, but nice
Define the simulation function
This makes “tea data”, a tibble (dataframe) where there are a certain number of people in each condition (default = 48, i.e., n_total, with n_total/2 in each half)
The averages of the two conditions are separated by a known effect (“delta”) with some variance (“sigma”). You can change these around since we’re simulating data!
<- function(n_total = 48, sigma = 1.25, delta = 1.5) {
make_tea_data <- n_total / 2
n_half tibble(condition = c(rep("milk first", n_half), rep("tea first", n_half)),
rating = c(round(rnorm(n_half, mean = 3.5 + delta, sd = sigma)),
round(rnorm(n_half, mean = 3.5, sd = sigma)))) |>
mutate(rating = if_else(rating > 10, 10, rating), # truncate if greater than max/min of rating scale
rating = if_else(rating < 1, 1, rating))
}
# Increase the sample size to 85
<- make_tea_data(n_total = 85)
tea_data_large_sample
# Summarize the mean ratings by condition
%>%
tea_data_large_sample group_by(condition) %>%
summarise(mean_rating = mean(rating))
# A tibble: 2 × 2
condition mean_rating
<chr> <dbl>
1 milk first 5.10
2 tea first 3.55
# Run a t-test to compare the two conditions
t.test(rating ~ condition, data = tea_data_large_sample)
Welch Two Sample t-test
data: rating by condition
t = 6.237, df = 81.989, p-value = 1.848e-08
alternative hypothesis: true difference in means between group milk first and group tea first is not equal to 0
95 percent confidence interval:
1.054002 2.041236
sample estimates:
mean in group milk first mean in group tea first
5.095238 3.547619
#Observation after changing the sample size
#After increasing the sample size to 85, the mean ratings for "milk first" and "tea first" were 5.2 and 3.5, respectively. The t-test revealed a significant difference between the two conditions (p-value = <.001), suggesting that with a larger sample size, we have more power to detect the difference in ratings.
# Decrease the difference in means to 0.5
<- make_tea_data(delta = 0.5)
tea_data_small_delta
# Summarize the mean ratings by condition
%>%
tea_data_small_delta group_by(condition) %>%
summarise(mean_rating = mean(rating))
# A tibble: 2 × 2
condition mean_rating
<chr> <dbl>
1 milk first 4.17
2 tea first 3.38
# Run a t-test to compare the two conditions
t.test(rating ~ condition, data = tea_data_small_delta)
Welch Two Sample t-test
data: rating by condition
t = 2.3823, df = 45.964, p-value = 0.0214
alternative hypothesis: true difference in means between group milk first and group tea first is not equal to 0
95 percent confidence interval:
0.1227416 1.4605917
sample estimates:
mean in group milk first mean in group tea first
4.166667 3.375000
#Observation changing after Delta
#When the delta (difference in means) was reduced to 0.5, the difference in ratings between "milk first" and "tea first" was smaller. The mean ratings were 4.2 and 3.5, respectively, and the t-test was significant (p-value = 0.03). even with a reduced delta (difference in means) of 0.5, there is still a statistically significant difference between the “milk first” and “tea first” groups.
Make data frames where we have small or larger samples of tea data for ONE experiment
# here's, we're calling our custom function, and specifying different inputs than the defaults (which are inside the parenthese up above)
<- make_tea_data(n_total = 18, delta=1.5) tea_data
<- make_tea_data(n_total = 48, delta=1.5) tea_data_highn
To do: OK, look at these data frames. How long are they, what are the column names? Look at them in your console and in the environment if you want.
To do: Write basic tidyverse code to calculate the mean of each condition (hint: use group_by
and summarize
)
<- tea_data %>%
by_condition group_by(condition) %>%
summarize(mean_rating = mean(rating), num_participants = length(rating))
<- tea_data_highn %>%
by_condition_highn group_by(condition) %>%
summarize(mean_rating = mean(rating), num_participants = length(rating))
In this first draw, was it significant with N=9 or N=24 per group?
OK, we can do t-tests already! I’ve done these within a pipe (whoops, I often use the old pipe operator because I have small hands)
To do: Run these and look at the outputs by posting out_low_n
in the console
<- tea_data %>%
out_low_n t.test(rating ~ condition, data = ., var.equal = TRUE)
<- tea_data_highn %>%
out_high_n t.test(rating ~ condition, data = ., var.equal = TRUE)
Simulate 1000 experiments
To do: Run these code blocks
…where you have 18 participants per experiment with an average difference of 1.5 points in tea deliciousness on average
<- tibble(sim = 1:1000) |> #
samps mutate(data = map(sim, \(i) make_tea_data(n_total = 18, delta=1.5))) |> # simulate
unnest(cols = data) # wrangle
…where you have 48 participants per experiment
<- tibble(sim = 1:1000) |> #
samps_highn mutate(data = map(sim, \(i) make_tea_data(n_total = 48, delta=1.5))) |> # simulate
unnest(cols = data) # wrangle
Summarize both of these simulations
To do: Run these code blocks Do you understand what each line is doing here? (the map function above is hard, just focus here?)
<- samps |>
tea_data_summary group_by(sim, condition) |> # group by simulation #, and condition
summarise(mean_rating = mean(rating)) |> # summarize across ratings
group_by(sim) |> # now get difference
summarise(delta = mean_rating[condition == "milk first"] -
== "tea first"]) mean_rating[condition
`summarise()` has grouped output by 'sim'. You can override using the `.groups`
argument.
<- samps_highn |>
tea_data_highn_summary group_by(sim, condition) |> # group by simulation #, and condition
summarise(mean_rating = mean(rating)) |> # summarize across ratings
group_by(sim) |> # now get difference
summarise(delta = mean_rating[condition == "milk first"] -
== "tea first"]) mean_rating[condition
`summarise()` has grouped output by 'sim'. You can override using the `.groups`
argument.
Plot difference for low-n
Let’s make a plot to plot the differences in ratings across conditions To do: run these code blocks
ggplot(data=tea_data_summary, aes(x=delta)) +
geom_histogram(alpha=.8, bins=20)
# theme_few()
Or simply
hist(tea_data_summary$delta)
Plot difference for higher-n
To do: What’s different about this distribution vs the one we just plotted?
hist(tea_data_highn_summary$delta)
Bonus What happens if you run it again? Try varying the variance / mean of the effect when you change the “delta” and “sigma” values?
What happens if you vary the random seed?
Now let’s visualize what would happen under the null distribution in two ways
First, by simulating no differences between conditions. Remember, it’s the null model because DELTA (i.e., differences in conditions) is ZERO
To do: Where in the function is it specifying that there is no difference between conditions?
<- tibble(sim = 1:1000) |>
null_model mutate(data = map(sim, \(i) make_tea_data(n_total = 18, delta = 0))) |>
unnest(cols = data)
<- null_model |>
null_model_summary group_by(sim, condition) |>
summarise(mean_rating = mean(rating)) |>
group_by(sim) |>
summarise(delta = mean_rating[condition == "milk first"] -
== "tea first"]) mean_rating[condition
`summarise()` has grouped output by 'sim'. You can override using the `.groups`
argument.
To do: Explain what does this data look like and why
ggplot(data=null_model_summary, aes(x=delta)) +
geom_histogram(alpha=.8, bins=20) +
theme_few()
Permutations
We’re going to calculate the distribution of the difference between conditions when we’ve shuffled the condition labels
This is the empirical null hypothesis (our H0, since we’re comparing two conditions with a two-sample t-test)
First, let’s shuffle the labels ~within each experiment~
<- samps_highn %>%
tea_data_highn_shuffled group_by(sim) %>% # for each experiment
mutate(condition_shuffled = sample(condition)) # shuffle the condition labels
To do: check you understanding – what is “sim” here?
<- tea_data_highn_shuffled %>%
tea_data_highn_shuffled_summary group_by(condition_shuffled, sim) %>%
summarize(mean = mean(rating),
sd = sd(rating)) %>%
ungroup() %>%
summarize(delta = diff(mean)) # get the difference in ratings between conditions for each experimental draw
`summarise()` has grouped output by 'condition_shuffled'. You can override
using the `.groups` argument.
Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
dplyr 1.1.0.
ℹ Please use `reframe()` instead.
ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
always returns an ungrouped data frame and adjust accordingly.
To do: what does this histogram look like and why?
hist(tea_data_highn_shuffled_summary$delta) #what does this histogram look like?
Visualizing what happens when we shuffle
OK, now see what happens to our raw data – this is just from one simulation The color refers to the ORIGINAL label before we shuffled, but our condition difference is gone
(Try replotting it so the colors refer to the condition_shuffled, modifying line 176)
ggplot(data = tea_data_highn_shuffled %>% filter(sim==3), # can change the actual simulation number here
mapping = aes(x = condition_shuffled, y = rating))+
geom_point(mapping = aes(color = condition), # color
alpha=.8,
position = position_jitter(height = .1,
width = 0.1)) +
stat_summary(fun.data = mean_cl_boot, # this boostraps the confidence interval
geom = "linerange",
size = 1) +
stat_summary(fun = "mean", # this calculates the average
geom = "point",
shape = 21,
color = "black",
fill = "white") +
scale_y_continuous(breaks = 0:10,
labels = 0:10,
limits = c(0, 10))
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
The idea is now that we can get a sampling distribution of the difference in the means between the two conditions (assuming that the null hypothesis were true), by randomly shuffling the labels and calculating the difference in means (and doing this many times).
What we get is a distribution of the differences we would expect, if there was no effect of condition.
First calculate the actual difference in a simulated dataset
= tea_data_highn %>% # in ONE experiment
difference_actual group_by(condition) %>%
summarize(mean = mean(rating)) %>%
pull(mean) %>%
diff()
#plot the distribution of the differences
ggplot(data = tea_data_highn_shuffled_summary, aes(x=delta)) +
geom_histogram(aes(y = stat(density)),
color = "black",
fill = "lightblue",
binwidth = 0.05) +
stat_density(geom = "line",
size = 1.5,
bw = 0.2) +
geom_vline(xintercept = difference_actual, color = "red", size = 2) +
labs(x = "difference between means")
Warning: `stat(density)` was deprecated in ggplot2 3.4.0.
ℹ Please use `after_stat(density)` instead.
And we can then simply calculate the p-value by using some basic data wrangling (i.e. finding the proportion of differences that were as or more extreme than the one we observed).
%>%
tea_data_highn_shuffled_summary summarize(p_value = sum(delta <= difference_actual)/n())
# A tibble: 1 × 1
p_value
<dbl>
1 0
You can also see this if you plot the distributions of the null vs empirical simulations next to each other (blue = null)
ggplot(data=tea_data_highn_shuffled_summary, aes(x=delta)) +
geom_histogram(alpha=.4, bins=20, color='blue', fill='blue') +
geom_histogram(alpha=.8, bins=20, data=tea_data_highn_summary)
Confidence intervals
Done here with one experiment, you can choose which is “tea_dataset”
= tea_data
tea_dataset # tea_dataset = tea_data_highn
<- filter(tea_dataset, condition == "tea first")$rating
tea_ratings <- filter(tea_dataset, condition == "milk first")$rating
milk_ratings
# could also do in a pipe like so, but then you have to grab the column below, as in tea_ratings$ratings; above is a vector
# tea_ratings <- tea_data_highn %>%
# filter(condition=="tea first") %>%
# select(rating)
Calculate a CI on the effect (difference between conditions)
Uses a pooled standard deviation We’re using the normal distribution here to calculate CIs since we know the population SD follows a normal
Note that this is different than the CI calculated by the two-sample t-tests, where
<- length(tea_ratings)
n_tea <- length(milk_ratings)
n_milk <- sd(tea_ratings)
sd_tea <- sd(milk_ratings)
sd_milk
<- sqrt(((n_tea - 1) * sd_tea ^ 2 + (n_milk - 1) * sd_milk ^ 2) /
tea_sd_pooled + n_milk - 2))
(n_tea
<- tea_sd_pooled * sqrt((1 / n_tea) + (1 / n_milk))
tea_se
<- mean(milk_ratings) - mean(tea_ratings)
delta_hat <- delta_hat - tea_se * qnorm(0.975)
tea_ci_lower <- delta_hat + tea_se * qnorm(0.975) tea_ci_upper
To get the 95% CI with the t-distribution
You need to get the appropriate t-statistic from the distribution, which incorporates information about the degrees of freedom
The t-distribution is more appropriate when you have smaller sample sizes and is what is used in t.tests
= length(tea_dataset$rating)
num_observations = num_observations-2 # for two sample t.test
df <- delta_hat - tea_se * qt(0.975,df)
tea_ci_lower_ttest <- delta_hat + tea_se * qt(0.975,df) tea_ci_upper_ttest
# Now the calculated CIs match those in the t-test outputs!
t.test(tea_ratings, milk_ratings, var.equal=TRUE)
Two Sample t-test
data: tea_ratings and milk_ratings
t = -2.4863, df = 16, p-value = 0.02433
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-3.4994038 -0.2783739
sample estimates:
mean of x mean of y
2.777778 4.666667
Going to plot CIs for each condition
as well as SEs, and visualizae how they’re different
=.95 # you can change this
confidence_level# this formula below gives the critical t-value (as opposed to simply taken from the normal distribution)
# qt(1 - (1 - confidence_level)/2, df = n - 1)
<- tea_data_highn %>%
tea_data_highn_summary_cis group_by(condition) %>%
summarize(cond_mean = mean(rating), cond_sd = sd(rating), n=length(rating)) %>%
mutate(error = qt(1 - (1 - confidence_level)/2, df = n - 1)* (cond_sd/sqrt(n))) %>% # this calculates CIs WITHIN each condition
mutate(ci_upper = cond_mean + error, ci_lower = cond_mean - error) %>%
mutate(se_upper = cond_mean + cond_sd/sqrt(n), se_lower = cond_mean - cond_sd/sqrt(n))
Between subjects experiment – lots of variability! I like to visualize the raw datas as well as the mean and CIs
ggplot(data = tea_data_highn, aes(x=condition, y=rating, col=condition)) +
geom_jitter(width=.1, height=0, alpha=.3) + # visualizes all the raw data, with no variation in y-axis jitter
theme_few() +
geom_pointrange(data = tea_data_highn_summary_cis, aes(x=condition, y = cond_mean, ymin = ci_lower, ymax = ci_upper)) +
ylim(0,10) +
ggtitle('Tea ratings across conditions with CIs')
ggplot(data = tea_data_highn, aes(x=condition, y=rating, col=condition)) +
geom_jitter(width=.1, height=0, alpha=.5) + # visualizes all the raw data, with no variation in y-axis jitter
theme_few() +
geom_pointrange(data = tea_data_highn_summary_cis, aes(x=condition, y = cond_mean, ymin = se_lower, ymax = se_upper)) +
ylim(0,10) +
ggtitle('Tea ratings across condition with SE')
To do: What does each dot represnet? What does the range represent in each graph? What does the confidence interval indicate? What does the SE indicate?
To do: how does this change when you use the low-n experiment?
Simulating p-values across multiple experiments
To do: run this code
First for low-n experiments
=tibble()
all_results
for (this_sim in 1:1000) {
= null_model %>%
this_experiment filter(sim==this_sim)
<- filter(this_experiment, condition == "tea first")$rating
tea_ratings <- filter(this_experiment, condition == "milk first")$rating
milk_ratings
= t.test(tea_ratings, milk_ratings)
output
= tibble(pvalue = output$p.value)
this_exp_output = bind_rows(all_results, this_exp_output)
all_results
}
To do: Look at the distribution of p-values (hint: in all_results$pvalue)
Make a histogram
What is the distribution of p-values when the null is true?
Calculate the proportion of p-values that are less than .05 What was our false positive rate?
hist(all_results$pvalue)
Now for an experiment when there is actually an effect
=tibble()
all_results_high_n
for (this_sim in 1:500) {
= samps_highn %>%
this_experiment filter(sim==this_sim)
<- filter(this_experiment, condition == "tea first")$rating
tea_ratings <- filter(this_experiment, condition == "milk first")$rating
milk_ratings
= t.test(tea_ratings, milk_ratings, paired = FALSE, var.equal = TRUE)
output
= tibble(pvalue = output$p.value)
this_exp_output = bind_rows(all_results_high_n, this_exp_output)
all_results_high_n
}
How often did we fail to reject the null hypothesis? When was our p-value greater than p=.05? What does the distribution of p-values look like?
To finish up
Publish this to an rpubs (time to set up if you haven’t!)
Excercises
Now go back (earlier in the code) and modify the DELTA in the simulation functions to be smaller so that there is only a small difference between groups. Is it still significant?
Rewrite this code with the the smaller sample size simulations. What changes?
Save out plots to your computer using “ggsave”. You might need to query the function in
Try some different plotting functions! https://rstudio.github.io/cheatsheets/html/data-visualization.html