Acknowledgement: We used materials from Prof. Michael Franke (University of Tübingen) to create these exercises.
Here are the test scores (say, from a language proficiency test) of two
groups of students. Use a \(t\)-test to
test whether there is a difference in means between these groups. Use a
two-sided test for unpaired samples with equal variance. Use the function t.test and interpret the
results.
group_1 <- c(
104, 105, 100, 91, 105, 118, 164, 168, 111, 107, 136, 149, 104, 114, 107, 95,
83, 114, 171, 176, 117, 107, 108, 107, 119, 126, 105, 119, 107, 131
)
group_2 <- c(
133, 115, 84, 79, 127, 103, 109, 128, 127, 107, 94, 95, 90, 118, 124, 108,
87, 111, 96, 89, 106, 121, 99, 86, 115, 136, 114
)
t.test(
  group_1,
  group_2,
  paired = FALSE,
  mu = 0,
  var.equal = TRUE
)
##
## Two Sample t-test
##
## data: group_1 and group_2
## t = 2.0901, df = 55, p-value = 0.04124
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.4732249 22.5045529
## sample estimates:
## mean of x mean of y
## 118.9333 107.4444
Possible interpretation: The result indicates a p-value smaller than 0.05. Thus, there is evidence against the null-hypothesis (\(H_0\): the difference of means is equal to zero). This means, in turn, that the higher average test score of group 1 (\(\mu_x\) = 118.9) compared to group 2 (\(\mu_y\) = 107.4) is unlikely to be due to chance alone.
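As a sanity check, here is a minimal sketch of what t.test computes internally for the equal-variance case; the statistic and \(p\)-value can be reproduced by hand:
# sample sizes (30 and 27, hence df = 55 in the output above)
n_1 <- length(group_1)
n_2 <- length(group_2)
# pooled variance estimate
s2_pooled <- ((n_1 - 1) * var(group_1) + (n_2 - 1) * var(group_2)) / (n_1 + n_2 - 2)
# t statistic; should match t = 2.0901 above
t_stat <- (mean(group_1) - mean(group_2)) / sqrt(s2_pooled * (1 / n_1 + 1 / n_2))
t_stat
# two-sided p-value; should match p = 0.04124 above
2 * pt(-abs(t_stat), df = n_1 + n_2 - 2)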
The goal of this exercise is to gain experience in calculating \(p\)-values for different kinds of (directed and undirected) hypotheses. We use the (allegedly easiest) case of coin flips. We would also like to become more confident in reporting the results we obtain from our code.
To obtain the right intuitions for \(p\)-values for directed and undirected
hypotheses, you can use the function plot_binomial provided
below. It allows you to plot the (binomial) sampling distribution and to
specify which (if any) parts of the plot you want to highlight. It also
calculates the total probability for all the values (of \(k\)) in the vector supplied to its argument
highlight. The result of this calculation is shown in the
plot’s title. See the examples below to understand what the function
does and how you can use it to develop better intuitions about \(p\)-values.
# the function below uses tibble, ggplot2, stringr and the pipe,
# all of which are loaded via the tidyverse
library(tidyverse)

plot_binomial <- function(theta, N, highlight = NULL) {
  # put the data together
  plotData <- tibble(x = 0:N, y = dbinom(0:N, N, theta))
  # make a simple bar plot
  out_plot <- ggplot(plotData, aes(x = x, y = y)) +
    geom_col(fill = "gray", width = 0.35) +
    labs(
      x = "test statistic k",
      y = str_c("Binomial(k, ", N, ", ", theta, ")")
    )
  # if given, highlight some bars in red
  if (!is.null(highlight)) {
    plotData2 <- tibble(x = highlight, y = dbinom(highlight, N, theta))
    out_plot <- out_plot +
      geom_col(
        data = plotData2,
        aes(x = x, y = y),
        fill = "firebrick",
        width = 0.35
      ) +
      ggtitle(
        str_c(
          "Prob. selected values: ",
          sum(dbinom(highlight, N, theta)) %>% signif(5)
        )
      )
  }
  out_plot
}
plot_binomial(theta = 0.5, N = 24, highlight = c(7:16))
plot_binomial(
  theta = 0.5,
  N = 24,
  # highlight all k that are at most as likely as k = 7
  # (the -1 shifts from vector index to the value of k)
  highlight = which(dbinom(0:24, 24, prob = 0.5) <= dbinom(7, 24, prob = 0.5)) - 1
)
In the following, you will be confronted with different scenarios, each of which has a different research hypothesis. For each of these, think about what it is that you want to learn from a test based on a \(p\)-value. For each scenario, you should therefore fix a suitable null-hypothesis about the coin’s bias. Remember that a \(p\)-value quantifies evidence against the specified null-hypothesis (keeping an alternative hypothesis in the back of our minds, so as to distinguish the case of genuinely testing a point-valued null-hypothesis from the case of testing an interval-based null-hypothesis via a single value used to generate the sampling distribution). You should therefore specify a null-hypothesis (and an alternative hypothesis) that is most conducive to shedding light on your research question. (Once more, the goal of this exercise is for you to become more comfortable with the whole logic of using \(p\)-values to draw conclusions of interest for a research goal.) When asked to judge significance, please use a significance level of \(\alpha = 0.05\).
The manufacturer of a trick coin claims that their product has a bias of \(\theta = 0.8\) of coming up heads on each toss. You make it your “research hypothesis” to find out whether this is true. Suppose you tossed the coin \(N = 45\) times and you observed \(k=42\) heads.
What null-hypothesis would you like to fix for a test that might shed light on your research question? What is the alternative hypothesis?
We test \(\theta = 0.8\). The alternative is \(\theta \neq 0.8\).
Use the function plot_binomial to plot the sampling
distribution for this null-hypothesis. Highlight a single value in this
plot, namely the one for the observed value of the test statistic \(k=42\).
plot_binomial(
theta = 0.8,
N = 45,
highlight = 42
)
Given the research question, what values of \(k\) would count as more extreme evidence against the chosen null-hypothesis?
All values of \(k\) that are at least as unlikely to occur under the null-hypothesis count as at least as strong evidence against \(H_0\).
Use the function plot_binomial to plot the sampling
distribution, but highlight all the values of \(k\) that provide at least as strong
evidence against the null-hypothesis as the observed data \(k=42\) does.
plot_binomial(
  theta = 0.8,
  N = 45,
  # all k that are at most as likely under H0 as the observed k = 42
  highlight = which(dbinom(0:45, 45, prob = 0.8) <= dbinom(42, 45, prob = 0.8)) - 1
)
Based on your answer to the previous question, is this a one-sided or a two-sided test?
This is a two-sided test.
What is the \(p\)-value of this test?
This can be read off the title of the last plot: \(p \approx 0.02389\).
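Equivalently, the same two-sided \(p\)-value can be computed directly, by summing the probabilities of all values of \(k\) that are at most as likely under the null-hypothesis as the observed \(k = 42\):
probs_H0 <- dbinom(0:45, size = 45, prob = 0.8)
sum(probs_H0[probs_H0 <= dbinom(42, 45, prob = 0.8)]) # ~ 0.02389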
Use the built-in function binom.test to run that same
test. (You should obtain the same \(p\)-value as what you answered in the
previous question.)
binom.test(42, 45, 0.8)
##
## Exact binomial test
##
## data: 42 and 45
## number of successes = 42, number of trials = 45, p-value = 0.02389
## alternative hypothesis: true probability of success is not equal to 0.8
## 95 percent confidence interval:
## 0.8173155 0.9860349
## sample estimates:
## probability of success
## 0.9333333
Give one or two concise sentences stating your results and their interpretation regarding your research hypothesis. An example (which is absurdly wrong!) could be:
We conducted a two-sided binomial test assuming the null-hypothesis that the coin’s bias is \(\theta = 0.8\) and observed a significant test result (\(N = 45\), \(k=0.42\), \(p \approx 0.02389\)). This constitutes evidence against the manufacturer’s claim that the coin’s bias is exactly 0.8. We therefore conclude that the coin is biased towards heads.
The manufacturer of a trick coin claims that their product has a bias of \(\theta \le 0.3\) of coming up heads on each toss. You make it your “research hypothesis” to find out whether this is true. Suppose you tossed the coin \(N = 32\) times and you observed \(k=15\) heads.
What null-hypothesis would you like to fix for a test that might shed light on your research question? What is the alternative hypothesis?
We test \(\theta = 0.3\), the boundary value of the manufacturer’s claimed interval \(\theta \le 0.3\). The alternative is \(\theta > 0.3\).
Use the function plot_binomial to plot the sampling
distribution for this null-hypothesis. Highlight a single value in this
plot, namely the one for the observed value of the test statistic \(k=15\).
plot_binomial(
theta = 0.3,
N = 32,
highlight = 15
)
Use the function plot_binomial to plot the sampling
distribution, but highlight all the values of \(k\) that provide at least as strong
evidence against the null-hypothesis as the observed data \(k=15\) does.
plot_binomial(
  theta = 0.3,
  N = 32,
  highlight = 15:32 # all k at least as large as the observed k = 15
)
Based on your answer to the previous question, is this a one-sided or a two-sided test?
This is a one-sided test.
What is the \(p\)-value of this test?
This can be read off the title of the last plot: \(p \approx 0.03272\).
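Equivalently, this one-sided \(p\)-value is just the upper tail of the sampling distribution from the observed \(k = 15\) upwards:
sum(dbinom(15:32, size = 32, prob = 0.3)) # ~ 0.03272
# or, using the cumulative distribution function:
pbinom(14, size = 32, prob = 0.3, lower.tail = FALSE)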
Use the built-in function binom.test to run that same
test. (You should obtain the same \(p\)-value as what you answered in the
previous question.)
binom.test(15, 32, 0.3, alternative = "greater")
##
## Exact binomial test
##
## data: 15 and 32
## number of successes = 15, number of trials = 32, p-value = 0.03272
## alternative hypothesis: true probability of success is greater than 0.3
## 95 percent confidence interval:
## 0.3154413 1.0000000
## sample estimates:
## probability of success
## 0.46875
Give one or two concise sentences stating your results and their interpretation regarding your research hypothesis.
We conducted a binomial test assuming the null-hypothesis that the coin’s bias is \(\theta \le 0.3\) against the alternative hypothesis that the bias is larger and observed a significant test result (\(N = 32\), \(k=15\), \(p \approx 0.03272\)). This constitutes evidence against the manufacturer’s claim that the coin’s bias is no bigger than 0.3.
The manufacturer of a trick coin claims that their product has a bias of \(\theta \ge 0.6\) of coming up heads on each toss. You make it your “research hypothesis” to find out whether this is true. Suppose you tossed the coin \(N = 100\) times and you observed \(k=53\) heads.
Use the built-in function binom.test to calculate a
\(p\)-value for this case. (Walk through the
previous steps for yourself if it helps you see how to set this
up.) State and interpret your results like you did in the last part of
the previous cases.
binom.test(53, 100, 0.6, alternative = "less")
##
## Exact binomial test
##
## data: 53 and 100
## number of successes = 53, number of trials = 100, p-value = 0.09298
## alternative hypothesis: true probability of success is less than 0.6
## 95 percent confidence interval:
## 0.0000000 0.6155513
## sample estimates:
## probability of success
## 0.53
We conducted a binomial test assuming the null-hypothesis that the coin’s bias is \(\theta \ge 0.6\) against the alternative hypothesis that the bias is lower. We obtained no significant test result (\(N = 100\), \(k=53\), \(p \approx 0.09298\)). We conclude that the data provide no compelling evidence against the manufacturer’s claim that the coin’s bias is at least 0.6.
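For a visual check analogous to the previous cases, the plot_binomial helper from above can be reused: all values \(k \le 53\) are at least as extreme in the direction of the alternative, and the probability shown in the plot title should match the \(p\)-value of roughly 0.093:
plot_binomial(
  theta = 0.6,
  N = 100,
  highlight = 0:53
)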
The goal of this exercise is to make you feel comfortable with applying and interpreting the results of a Pearson \(\chi^2\)-test of goodness of fit.
Imagine you are at a funfair (German: Kirmes, Jahrmarkt). As usual, you head straight for the lottery booth (German: Losbude). The vendor advertises that, of all tickets, 5% are mega-winners, 15% are winners, 15% are free rides on the merry-go-round (German: Karussell), 35% are consolation prizes (German: Trostpreise) and only the remaining 30% are blanks (German: Nieten). You are your nerdy self, as usual, and you buy 50 tickets and count the number of tickets in each category. What you got is this:
n_obs <- c(
mega_winner = 1, # hurray!
winner = 2,
free_ride = 10,
consolation = 18,
blank = 19
)
The goal of this exercise is to further hone your plotting skills, this time also challenging you to come up with your own idea for a good visual presentation.
Find an informative way of plotting the observed counts and the counts you would have expected to see when buying 50 tickets, which is this vector:¹
expected <- c(
  mega_winner = 5,
  winner = 15,
  free_ride = 15,
  consolation = 35,
  blank = 30
) * sum(n_obs) / 100
# many possibilities
# data
lottery_data <- tibble(
  category = factor(
    c("mega_winner", "winner", "free_ride", "consolation", "blank"),
    ordered = TRUE,
    levels = c("mega_winner", "winner", "free_ride", "consolation", "blank")
  ),
  observed = n_obs,
  expected = expected
)
# adjacent bars
lottery_data %>%
  pivot_longer(cols = c(observed, expected)) %>%
  ggplot(aes(x = category, y = value, fill = name)) +
  geom_col(position = "dodge")
# scatter plot
lottery_data %>%
  ggplot(aes(x = expected, y = observed)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0, color = "firebrick") +
  geom_label(aes(y = observed + 1.25, label = category)) +
  xlim(c(0, 25)) + ylim(c(0, 25))
Use the built-in function chisq.test to test the
vendor’s claim about the probability of obtaining a ticket from each
category based on the counts you observed. Interpret and report your
findings like you would in a research report.
chisq.test(n_obs, p = c(5, 15, 15, 35, 30) / 100)
##
## Chi-squared test for given probabilities
##
## data: n_obs
## X-squared = 6.8476, df = 4, p-value = 0.1442
Since the \(\chi^2\) goodness-of-fit test produces a non-significant result (\(\chi^2 = 6.85\), \(df = 4\), \(p \approx 0.1442\)), we conclude that there is no strong evidence against the vendor’s claim about the respective probabilities of obtaining a ticket from each category.
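For the record, the same statistic and \(p\)-value can be computed by hand from the n_obs and expected vectors defined above:
# by-hand chi-squared statistic; should match X-squared = 6.8476 above
chi_sq <- sum((n_obs - expected)^2 / expected)
chi_sq
# p-value from the chi-squared distribution with df = 5 - 1 = 4
pchisq(chi_sq, df = 4, lower.tail = FALSE) # ~ 0.1442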
Here’s a table of counts showing the results from an (imaginary)
survey asking students for the program in which they are enrolled and
their favorite statistical paradigm. Use a \(\chi^2\)-test (function
chisq.test) to check whether there is evidence that
preferences for statistical methods differ between different fields
of study.
observed_counts <- matrix(
  c(
    31, 56, 23,
    104, 67, 12,
    24, 34, 42,
    19, 16, 8
  ),
  nrow = 4,
  byrow = TRUE,
  dimnames = list(
    program = c("CogSci", "Psych", "Computer Science", "Philosophy"),
    preference = c("frequentist", "Bayes", "bootstrap")
  )
)
observed_counts
## preference
## program frequentist Bayes bootstrap
## CogSci 31 56 23
## Psych 104 67 12
## Computer Science 24 34 42
## Philosophy 19 16 8
chisq.test(
  observed_counts,
  correct = FALSE # using the continuity correction is also fine, but we didn't discuss it in class
)
##
## Pearson's Chi-squared test
##
## data: observed_counts
## X-squared = 69.473, df = 6, p-value = 5.243e-13
Possible interpretation: The null-hypothesis states that the variables “preference for statistical method” and “field of study” are independent of each other. The very small p-value (\(p \approx 5.2 \times 10^{-13}\)) indicates strong evidence against the null-hypothesis and, thus, in favor of the alternative hypothesis: “preference for statistical method” and “field of study” appear to be dependent.
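To see where the dependence shows up, it can help to inspect the expected counts under independence and the Pearson residuals, both of which chisq.test returns alongside the test result:
# expected counts under the null-hypothesis of independence
chisq.test(observed_counts, correct = FALSE)$expected
# Pearson residuals: cells with large absolute values drive the test result
chisq.test(observed_counts, correct = FALSE)$residuals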
The point of this exercise is to practice a kind of task that may well occur in the midterm exam. [The secondary (cheeky) point of this exercise is to introduce a function that might just happen to be useful for some other thing.]
Consider this function and try to understand what it does.
mysterious_function <- function(vec) {
  map_lgl(
    1:(length(vec) - 1),
    function(i) {vec[i] == vec[i + 1]}
  ) %>% sum()
}
Describe what this function does in plain and easy natural language.
Possible answer: The function takes a vector (vec) as input, walks through the vector and checks, for each position, whether the entry is identical to the one that follows it. The output is the number of identical consecutive pairs found in the input vector.
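A quick sanity check on a small made-up input (hypothetical values, not part of the original solution):
# identical consecutive pairs: (1,1), (0,0), (0,0) -> returns 3
mysterious_function(c(1, 1, 0, 0, 0, 1))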
The point of this exercise is to make you understand better the notion of a test statistic and to become aware of the fact that you could just invent a test statistic to maximally address your specific problem or application. This exercise should also show you that it is possible to compute a \(p\)-value based on a test statistic for whose sampling distribution you cannot give a concise mathematical characterization (not even an approximate one).
We consider a binomial test of a coin’s fairness. We have observed
\(k = 15\) heads out of \(N = 30\) flips. Use the function
binom.test to calculate a two-sided \(p\)-value for the null-hypothesis that
\(\theta = 0.5\). Interpret the
result.
binom.test(15, 30, 0.5)
##
## Exact binomial test
##
## data: 15 and 30
## number of successes = 15, number of trials = 30, p-value = 1
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
## 0.3129703 0.6870297
## sample estimates:
## probability of success
## 0.5
The p-value is 1. Hence, there’s absolutely no indication in the data that the null-hypothesis is false.
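The value of exactly 1 makes sense on reflection: \(k = 15\) is the mode of the Binomial(30, 0.5) sampling distribution, so every possible outcome is at most as likely as the observed one, and the two-sided \(p\)-value therefore sums up the entire distribution. A quick check of this reasoning:
probs_H0 <- dbinom(0:30, size = 30, prob = 0.5)
sum(probs_H0[probs_H0 <= dbinom(15, 30, prob = 0.5)]) # = 1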
The previous test addressed one aspect of our data: the number of heads in the \(N = 30\) flips. Remember that we said in class that the number \(k\) is a test statistic. It is a good one if what we want is information about \(\theta\).
But there are many vectors of raw data with \(N = 30\) that give us a value of test statistic \(k = 15\). Here are three:
obs_1 <- rep(c(1,0), each = 15) # 15 ones followed by 15 zeros
obs_2 <- rep(c(1,0), times = 15) # 15 pairs of one-zero
obs_3 <- c(1,1,1,1,0,0,0,0,0,0,1,1,1,1,0,0,0,1,1,0,0,1,1,1,1,0,0,0,0,1)
We know that all of these observations will lead to the same outcome under a binomial test. But there is another feature of these raw data vectors that should concern us, namely whether each observation was truly independent of the previous one. (Notice that the independence of each coin flip from all others is a fundamental assumption (which might just be wrong!) of the binomial test.)
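Indeed, a quick check with the vectors defined above shows that all three yield the very same value of the test statistic, so the binomial test cannot distinguish between them:
sum(obs_1) # 15
sum(obs_2) # 15
sum(obs_3) # 15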
So, to address the independence issue, let’s just come up with our
own (ad hoc) test statistic. Let’s just count the number of
non-identical consecutive outcomes, i.e., the number of times that
observation \(i\) was different from
observation \(i+1\). Define a function
called number_of_swaps that takes vectors like
obs_X from above as input and returns the number of
non-identical consecutive outcomes. Apply the function to the three
vectors of observations above.
number_of_swaps <- function(vec) {
  map_lgl(
    1:(length(vec) - 1),
    function(i) {vec[i] != vec[i + 1]} # changed to inequality
  ) %>% sum()
}
number_of_swaps(obs_1)
## [1] 1
number_of_swaps(obs_2)
## [1] 29
number_of_swaps(obs_3)
## [1] 8
The number of swaps is a test statistic. (It is simple and useful, but no claim is made here regarding whether it is the best test statistic to address independence!) But what if we do not know a formula that characterizes its sampling distribution?
The answer is: sampling! If we do not know a mathematical characterization of a distribution, but we can write an algorithm of the data-generating process, we can work with (a large set of) samples. (That’s old news.) We can use samples to approximately calculate a \(p\)-value, too. (That’s news (of sorts).)
Write a function called sample_no_swaps that takes as
input an argument n_samples specifying the number of
samples to take, and that outputs, for each sample, the number of swaps
in a vector of 30 random draws of 1s and 0s (each equally likely). [Hint: to generate
a single vector of the kind we are after, you can use this code snippet:
sample(c(1,0), size = 30, replace = TRUE)]
sample_no_swaps <- function(n_samples = 1000) {
  map_int(
    1:n_samples,
    function(i) {
      sample(c(1, 0), size = 30, replace = TRUE) %>%
        number_of_swaps()
    }
  )
}
Using the previously defined function sample_no_swaps,
get 100,000 samples of swaps, obtain counts for each number of swaps,
and plot the result using geom_col.
tibble(
  swaps = sample_no_swaps(100000)
) %>%
  dplyr::count(swaps) %>%
  ggplot(aes(x = swaps, y = n)) +
  geom_col()
Consider the vector obs_3 above. Its number of swaps is surprisingly low under our assumption of independent flips. We want to approximate a one-sided \(p\)-value using MC-sampling. Fill in the missing bits-and-pieces in the code below, compute a \(p\)-value and interpret the result.
n_samples <- 100000
MC_sampling_p_value <- function(n_samples) {
%FILL-ME%(sample_no_swaps(n_samples) %FILL-ME% number_of_swaps(%FILL-ME%))
}
MC_sampling_p_value(n_samples)
n_samples <- 100000
MC_sampling_p_value <- function(n_samples) {
  mean(sample_no_swaps(n_samples) <= number_of_swaps(obs_3))
}
MC_sampling_p_value(n_samples)
## [1] 0.01201
Possible answer: The null-hypothesis assumes that each coin-toss outcome is independent of the previous one. The sampling distribution has its peak around 14-15 swaps (the expected number under independence is \(29 \times 0.5 = 14.5\)). The number of swaps for obs_3 is much smaller (8 swaps). The MC-approximated one-sided p-value, obtained by comparing the sampling distribution against the observed value for obs_3, is significant (\(p \approx 0.012\)). This suggests that the observations in obs_3 are dependent rather than independent, so we reject \(H_0\).
In this exercise our objective is to understand the conceptual differences between Bayesian and Frequentist statistics by analyzing simple experiments with coin tosses.
First, let’s revise a few key concepts of Frequentist statistics:
Hypothesis testing: In Frequentist statistics, hypothesis testing is a method used to determine if there is enough evidence in a sample of data to infer that a certain condition is true for the entire population.
\(p\)-values: We use \(p\)-values to determine if the observed data deviate significantly from what is expected under the null hypothesis.
Method 1: Frequentist Approach
The R solution is provided below - you don’t need to code anything; however, note down your conclusion on the experiment:
# Here, we make sure everyone works with the same data.
set.seed(123)
# Simulate 50 coin tosses with equal chance (H for heads, T for tails).
coin_tosses <- sample(c("H", "T"), 50, replace = TRUE)
# Calculate the proportion of heads: we sum up the number of "H" outcomes and divide by the total number of tosses.
heads_proportion <- sum(coin_tosses == "H") / length(coin_tosses)
# Perform a two-sided binomial test (checks in either direction, i.e. more heads or more tails) to see if the observed proportion of heads
# is significantly different from 0.5.
test_result <- binom.test(sum(coin_tosses == "H"), length(coin_tosses), p = 0.5, alternative = "two.sided")
# Let's print the results.
print(test_result)
##
## Exact binomial test
##
## data: sum(coin_tosses == "H") and length(coin_tosses)
## number of successes = 30, number of trials = 50, p-value = 0.2026
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
## 0.4517940 0.7359216
## sample estimates:
## probability of success
## 0.6
# Create a sequence of head proportions from 0 to 1.
probs <- seq(0, 1, length.out = 100)
# Calculate the density for each proportion in the sequence (given the null hypothesis).
densities <- dbinom(x = round(probs * length(coin_tosses)), size = length(coin_tosses), prob = 0.5)
# Calculate the critical values for the tails of the binomial distribution - these delimit the rejection region for the null hypothesis at significance level 0.05.
critical_value <- qbinom(c(0.025, 0.975), size = length(coin_tosses), prob = 0.5) / length(coin_tosses)
# Create the plot with the probability density function for the binomial distribution.
plot(probs, densities, type = 'l', lwd = 2, col = 'blue',
     main = "Coin Toss Distribution under the Null Hypothesis",
     xlab = "Proportion of Heads", ylab = "Probability Density",
     xlim = c(0, 1), ylim = c(0, max(densities)))
# Shade the regions of the distribution that represent the critical values (tails).
# Any observed proportion of heads within these shaded areas would lead us to reject the null hypothesis.
polygon(c(probs[probs <= critical_value[1]], rev(probs[probs <= critical_value[1]])),
        c(rep(0, sum(probs <= critical_value[1])), rev(densities[probs <= critical_value[1]])),
        col = rgb(0.8, 0.8, 0.8, 0.5))
polygon(c(probs[probs >= critical_value[2]], rev(probs[probs >= critical_value[2]])),
        c(rep(0, sum(probs >= critical_value[2])), rev(densities[probs >= critical_value[2]])),
        col = rgb(0.8, 0.8, 0.8, 0.5))
# Add a vertical line to the plot at the observed proportion of heads.
# This visually indicates where our observed data falls in relation to the expected distribution.
abline(v = heads_proportion, col = "darkred", lwd = 2, lty = 2)
# Add text for the observed proportion and p-value.
text(x = heads_proportion, y = max(densities) * 0.85,
     labels = paste("Observed\nProportion\n(p-value:", round(test_result$p.value, 4), ")"),
     col = "darkred", cex = 0.8, adj = c(1, 0))
# Add a legend to explain the elements.
legend("topright", legend = c("Probability Density", "Critical Region", "Observed Proportion"),
       lty = c(1, 1, 2), lwd = c(2, 2, 2),
       col = c("blue", "grey", "darkred"), bty = "n", cex = 0.8)
YOUR CONCLUSION HERE: The results of the Frequentist approach in our coin-tossing example do not provide sufficient statistical evidence to reject the null hypothesis that the coin is fair. The p-value obtained from the test result is 0.2026, which is greater than the standard alpha level of 0.05. This indicates that the observed data (30 heads out of 50 tosses, or a 60% heads proportion) could reasonably occur by chance when the true probability of heads is 0.5. The 95% confidence interval for the true probability of heads ranges from approximately 0.452 to 0.736; since this interval contains 0.5, it further supports the lack of evidence against the coin’s fairness. Therefore, based on the Frequentist analysis, we conclude that there is no statistical basis to consider the coin biased.
Method 2: Bayesian Approach
Again, we are providing the R solution and no coding is needed from you in this exercise; just provide your conclusion on the experiment:
# We take the same data ("coin_tosses") from Method 1.
# Our prior belief: a flat Beta(1, 1) prior, i.e. all biases are equally likely a priori.
alpha_prior <- 1
beta_prior <- 1
# Update with data: the number of heads and tails are used to update the alpha and beta parameters.
alpha_post <- alpha_prior + sum(coin_tosses == "H")
beta_post <- beta_prior + sum(coin_tosses == "T")
# Define the posterior distribution as a Beta distribution with the updated parameters.
posterior <- function(x) dbeta(x, alpha_post, beta_post)
# A sequence of probabilities from 0 to 1.
x <- seq(0, 1, length = 100)
# Final plot.
plot(x, posterior(x), type = "l", xlab = "Probability of Heads", ylab = "Density", main = "Posterior Distribution")
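As a small optional extension (not part of the provided solution), the posterior can also be summarized numerically, e.g. by its mean and a central 95% credible interval:
# posterior mean of the coin's bias: (1 + 30) / (1 + 30 + 1 + 20) = 31/52, roughly 0.6
alpha_post / (alpha_post + beta_post)
# central 95% credible interval for the bias
qbeta(c(0.025, 0.975), alpha_post, beta_post)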
YOUR CONCLUSION HERE: The resulting posterior distribution gives us a new view of what we believe about the coin’s fairness after seeing the data. The peak of this distribution has shifted to 0.6, which means that, after observing the data, a probability of 0.6 for heads is more credible than our initial guess of 0.5 (which would be the peak if the coin were fair). However, the fact that the peak is at 0.6 does not mean the true probability of heads is 0.6, just that this is our best estimate given the data and our prior. The width of the posterior distribution reflects our remaining uncertainty: a narrower peak would indicate more certainty, a wider peak less. So, while the Frequentist approach did not provide enough evidence to reject the hypothesis that the coin is fair, the Bayesian approach has shifted our belief towards the coin being somewhat more likely to land on heads, though still with considerable uncertainty. In contrast to the Frequentist approach, which does not take prior belief into account and only uses the observed data to test a hypothesis, the Bayesian method provides a full probability distribution over possible values of the coin’s probability of heads. To conclude, the Frequentist p-value was not sufficiently low to reject the null hypothesis of fairness (p-value > 0.05), and the Bayesian analysis suggests a shift in belief towards a probability of heads that is higher than 0.5.
¹ Notice that the Pearson \(\chi^2\)-test rests on an approximation of normality, which is only sufficiently accurate if we have enough samples. A rule of thumb is that at most 20% of all cells should have expected frequencies below 5 in order for the test to be applicable.