Acknowledgement: We used materials from Prof. Michael Franke (University of Tübingen) to create these exercises.
Here are the test scores (say, from a language proficiency test) of two
groups of students. Use a \(t\)-test to
test whether there is a difference in means between these groups. Use a
two-sided test for unpaired samples with equal variance. Use the function t.test and interpret the
results.
group_1 <- c(
104, 105, 100, 91, 105, 118, 164, 168, 111, 107, 136, 149, 104, 114, 107, 95,
83, 114, 171, 176, 117, 107, 108, 107, 119, 126, 105, 119, 107, 131
)
group_2 <- c(
133, 115, 84, 79, 127, 103, 109, 128, 127, 107, 94, 95, 90, 118, 124, 108,
87, 111, 96, 89, 106, 121, 99, 86, 115, 136, 114
)
t.test(
  group_1,
  group_2,
  paired = FALSE,
  mu = 0,
  var.equal = TRUE
)
##
## Two Sample t-test
##
## data: group_1 and group_2
## t = 2.0901, df = 55, p-value = 0.04124
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.4732249 22.5045529
## sample estimates:
## mean of x mean of y
## 118.9333 107.4444
Possible interpretation: The result indicates a p-value smaller than 0.05. Thus, there is evidence against the null-hypothesis (\(H_0\): the difference of means is equal to zero). This means, in turn, that the higher average test score of group 1 (\(\mu_x\) = 118.9) compared to group 2 (\(\mu_y\) = 107.4) is unlikely to be due to chance alone.
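As a sanity check, here is a minimal sketch of what t.test computes internally for the equal-variance case; the statistic and \(p\)-value can be reproduced by hand:
# sample sizes (30 and 27, hence df = 55 in the output above)
n_1 <- length(group_1)
n_2 <- length(group_2)
# pooled variance estimate
s2_pooled <- ((n_1 - 1) * var(group_1) + (n_2 - 1) * var(group_2)) / (n_1 + n_2 - 2)
# t statistic; should match t = 2.0901 above
t_stat <- (mean(group_1) - mean(group_2)) / sqrt(s2_pooled * (1 / n_1 + 1 / n_2))
t_stat
# two-sided p-value; should match p = 0.04124 above
2 * pt(-abs(t_stat), df = n_1 + n_2 - 2)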
The goal of this exercise is to gain experience in calculating \(p\)-values for different kinds of (directed and undirected) hypotheses. We use the (allegedly easiest) case of coin flips. We would also like to become more confident in reporting the results we obtain from our code.
To obtain the right intuitions for \(p\)-values for directed and undirected
hypotheses, you can use the function plot_binomial provided
below. It allows you to plot the (binomial) sampling distribution and to
specify which (if any) parts of the plot you want to highlight. It also
calculates the total probability for all the values (of \(k\)) in the vector supplied to its argument
highlight. The result of this calculation is shown in the
plot’s title. See the examples below to understand what the function
does and how you can use it to develop better intuitions about \(p\)-values.
# the function below uses tibble, ggplot2, stringr and the pipe,
# all of which are loaded via the tidyverse
library(tidyverse)

plot_binomial <- function(theta, N, highlight = NULL) {
  # put the data together
  plotData <- tibble(x = 0:N, y = dbinom(0:N, N, theta))
  # make a simple bar plot
  out_plot <- ggplot(plotData, aes(x = x, y = y)) +
    geom_col(fill = "gray", width = 0.35) +
    labs(
      x = "test statistic k",
      y = str_c("Binomial(k, ", N, ", ", theta, ")")
    )
  # if given, highlight some bars in red
  if (!is.null(highlight)) {
    plotData2 <- tibble(x = highlight, y = dbinom(highlight, N, theta))
    out_plot <- out_plot +
      geom_col(
        data = plotData2,
        aes(x = x, y = y),
        fill = "firebrick",
        width = 0.35
      ) +
      ggtitle(
        str_c(
          "Prob. selected values: ",
          sum(dbinom(highlight, N, theta)) %>% signif(5)
        )
      )
  }
  out_plot
}
plot_binomial(theta = 0.5, N = 24, highlight = c(7:16))
plot_binomial(
  theta = 0.5,
  N = 24,
  # highlight all k that are at most as likely as k = 7
  # (the -1 shifts from vector index to the value of k)
  highlight = which(dbinom(0:24, 24, prob = 0.5) <= dbinom(7, 24, prob = 0.5)) - 1
)
In the following, you will be confronted with different scenarios, each of which has a different research hypothesis. For each of these, think about what it is that you want to learn from a test based on a \(p\)-value. For each scenario, you should therefore fix a suitable null-hypothesis about the coin’s bias. Remember that a \(p\)-value quantifies evidence against the specified null-hypothesis (keeping an alternative hypothesis in the back of our minds, so as to distinguish the case of genuinely testing a point-valued null-hypothesis from the case of testing an interval-based null-hypothesis via a single value used to generate the sampling distribution). You should therefore specify a null-hypothesis (and an alternative hypothesis) that is most conducive to shedding light on your research question. (Once more, the goal of this exercise is for you to become more comfortable with the whole logic of using \(p\)-values to draw conclusions of interest for a research goal.) When asked to judge significance, please use a significance level of \(\alpha = 0.05\).
The manufacturer of a trick coin claims that their product has a bias of \(\theta = 0.8\) of coming up heads on each toss. You make it your “research hypothesis” to find out whether this is true. Suppose you tossed the coin \(N = 45\) times and you observed \(k=42\) heads.
What null-hypothesis would you like to fix for a test that might shed light on your research question? What is the alternative hypothesis?
We test \(\theta = 0.8\). The alternative is \(\theta \neq 0.8\).
Use the function plot_binomial to plot the sampling
distribution for this null-hypothesis. Highlight a single value in this
plot, namely the one for the observed value of the test statistic \(k=42\).
plot_binomial(
theta = 0.8,
N = 45,
highlight = 42
)
Given the research question, what values of \(k\) would count as more extreme evidence against the chosen null-hypothesis?
All values of \(k\) that are at least as unlikely to occur under the null-hypothesis count as at least as strong evidence against \(H_0\).
Use the function plot_binomial to plot the sampling
distribution, but highlight all the values of \(k\) that provide at least as strong
evidence against the null-hypothesis as the observed data \(k=42\) does.
plot_binomial(
  theta = 0.8,
  N = 45,
  # all k that are at most as likely under H0 as the observed k = 42
  highlight = which(dbinom(0:45, 45, prob = 0.8) <= dbinom(42, 45, prob = 0.8)) - 1
)
Based on your answer to the previous question, is this a one-sided or a two-sided test?
This is a two-sided test.
What is the \(p\)-value of this test?
This can be read off the title of the last plot: \(p \approx 0.02389\).
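Equivalently, the same two-sided \(p\)-value can be computed directly, by summing the probabilities of all values of \(k\) that are at most as likely under the null-hypothesis as the observed \(k = 42\):
probs_H0 <- dbinom(0:45, size = 45, prob = 0.8)
sum(probs_H0[probs_H0 <= dbinom(42, 45, prob = 0.8)]) # ~ 0.02389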
Use the built-in function binom.test to run that same
test. (You should obtain the same \(p\)-value as what you answered in the
previous question.)
binom.test(42, 45, 0.8)
##
## Exact binomial test
##
## data: 42 and 45
## number of successes = 42, number of trials = 45, p-value = 0.02389
## alternative hypothesis: true probability of success is not equal to 0.8
## 95 percent confidence interval:
## 0.8173155 0.9860349
## sample estimates:
## probability of success
## 0.9333333
Give one or two concise sentences stating your results and their interpretation regarding your research hypothesis. An example (which is absurdly wrong!) could be:
We conducted a two-sided binomial test assuming the null-hypothesis that the coin’s bias is \(\theta = 0.8\) and observed a significant test result (\(N = 45\), \(k=0.42\), \(p \approx 0.02389\)). This constitutes evidence against the manufacturer’s claim that the coin’s bias is exactly 0.8. We therefore conclude that the coin is biased towards heads.
The manufacturer of a trick coin claims that their product has a bias of \(\theta \le 0.3\) of coming up heads on each toss. You make it your “research hypothesis” to find out whether this is true. Suppose you tossed the coin \(N = 32\) times and you observed \(k=15\) heads.
What null-hypothesis would you like to fix for a test that might shed light on your research question? What is the alternative hypothesis?
We test \(\theta = 0.3\), the boundary value of the manufacturer’s claimed interval \(\theta \le 0.3\). The alternative is \(\theta > 0.3\).
Use the function plot_binomial to plot the sampling
distribution for this null-hypothesis. Highlight a single value in this
plot, namely the one for the observed value of the test statistic \(k=15\).
plot_binomial(
theta = 0.3,
N = 32,
highlight = 15
)
Use the function plot_binomial to plot the sampling
distribution, but highlight all the values of \(k\) that provide at least as strong
evidence against the null-hypothesis as the observed data \(k=15\) does.
plot_binomial(
  theta = 0.3,
  N = 32,
  highlight = 15:32 # all k at least as large as the observed k = 15
)
Based on your answer to the previous question, is this a one-sided or a two-sided test?
This is a one-sided test.
What is the \(p\)-value of this test?
This can be read off the title of the last plot: \(p \approx 0.03272\).
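Equivalently, this one-sided \(p\)-value is just the upper tail of the sampling distribution from the observed \(k = 15\) upwards:
sum(dbinom(15:32, size = 32, prob = 0.3)) # ~ 0.03272
# or, using the cumulative distribution function:
pbinom(14, size = 32, prob = 0.3, lower.tail = FALSE)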
Use the built-in function binom.test to run that same
test. (You should obtain the same \(p\)-value as what you answered in the
previous question.)
binom.test(15, 32, 0.3, alternative = "greater")
##
## Exact binomial test
##
## data: 15 and 32
## number of successes = 15, number of trials = 32, p-value = 0.03272
## alternative hypothesis: true probability of success is greater than 0.3
## 95 percent confidence interval:
## 0.3154413 1.0000000
## sample estimates:
## probability of success
## 0.46875
Give one or two concise sentences stating your results and their interpretation regarding your research hypothesis.
We conducted a binomial test assuming the null-hypothesis that the coin’s bias is \(\theta \le 0.3\) against the alternative hypothesis that the bias is larger and observed a significant test result (\(N = 32\), \(k=15\), \(p \approx 0.03272\)). This constitutes evidence against the manufacturer’s claim that the coin’s bias is no bigger than 0.3.
The manufacturer of a trick coin claims that their product has a bias of \(\theta \ge 0.6\) of coming up heads on each toss. You make it your “research hypothesis” to find out whether this is true. Suppose you tossed the coin \(N = 100\) times and you observed \(k=53\) heads.
Use the built-in function binom.test to calculate a
\(p\)-value for this case. (Walk through the
previous steps for yourself if it helps you see how to set this
up.) State and interpret your results like you did in the last part of
the previous cases.
binom.test(53, 100, 0.6, alternative = "less")
##
## Exact binomial test
##
## data: 53 and 100
## number of successes = 53, number of trials = 100, p-value = 0.09298
## alternative hypothesis: true probability of success is less than 0.6
## 95 percent confidence interval:
## 0.0000000 0.6155513
## sample estimates:
## probability of success
## 0.53
We conducted a binomial test assuming the null-hypothesis that the coin’s bias is \(\theta \ge 0.6\) against the alternative hypothesis that the bias is lower. We obtained no significant test result (\(N = 100\), \(k=53\), \(p \approx 0.09298\)). We conclude that the data provide no compelling evidence against the manufacturer’s claim that the coin’s bias is at least 0.6.
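For a visual check analogous to the previous cases, the plot_binomial helper from above can be reused: all values \(k \le 53\) are at least as extreme in the direction of the alternative, and the probability shown in the plot title should match the \(p\)-value of roughly 0.093:
plot_binomial(
  theta = 0.6,
  N = 100,
  highlight = 0:53
)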
The goal of this exercise is to make you feel comfortable with applying and interpreting the results of a Pearson \(\chi^2\)-test of goodness of fit.
Imagine you are at a funfair (German: Kirmes, Jahrmarkt). As usual, you head straight for the lottery booth (German: Losbude). The vendor advertises that, of all tickets, 5% are mega-winners, 15% are winners, 15% are free rides on the merry-go-round (German: Karussell), 35% are consolation prizes (German: Trostpreise) and only the remaining 30% are blanks (German: Nieten). You are your nerdy self, as usual, and you buy 50 tickets and count the number of tickets in each category. What you got is this:
n_obs <- c(
mega_winner = 1, # hurray!
winner = 2,
free_ride = 10,
consolation = 18,
blank = 19
)
The goal of this exercise is to further hone your plotting skills, this time also challenging you to come up with your own idea for a good visual presentation.
Find an informative way of plotting the observed counts and the counts you would have expected to see when buying 50 tickets, which is this vector:¹
expected <- c(
  mega_winner = 5,
  winner = 15,
  free_ride = 15,
  consolation = 35,
  blank = 30
) * sum(n_obs) / 100
# many possibilities
# data
lottery_data <- tibble(
  category = factor(
    c("mega_winner", "winner", "free_ride", "consolation", "blank"),
    ordered = TRUE,
    levels = c("mega_winner", "winner", "free_ride", "consolation", "blank")
  ),
  observed = n_obs,
  expected = expected
)
# adjacent bars
lottery_data %>%
  pivot_longer(cols = c(observed, expected)) %>%
  ggplot(aes(x = category, y = value, fill = name)) +
  geom_col(position = "dodge")
# scatter plot
lottery_data %>%
  ggplot(aes(x = expected, y = observed)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0, color = "firebrick") +
  geom_label(aes(y = observed + 1.25, label = category)) +
  xlim(c(0, 25)) + ylim(c(0, 25))
Use the built-in function chisq.test to test the
vendor’s claim about the probability of obtaining a ticket from each
category based on the counts you observed. Interpret and report your
findings like you would in a research report.
chisq.test(n_obs, p = c(5, 15, 15, 35, 30) / 100)
##
## Chi-squared test for given probabilities
##
## data: n_obs
## X-squared = 6.8476, df = 4, p-value = 0.1442
Since the \(\chi^2\) goodness-of-fit test produces a non-significant result (\(\chi^2 = 6.85\), \(df = 4\), \(p \approx 0.1442\)), we conclude that there is no strong evidence against the vendor’s claim about the respective probabilities of obtaining a ticket from each category.
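For the record, the same statistic and \(p\)-value can be computed by hand from the n_obs and expected vectors defined above:
# by-hand chi-squared statistic; should match X-squared = 6.8476 above
chi_sq <- sum((n_obs - expected)^2 / expected)
chi_sq
# p-value from the chi-squared distribution with df = 5 - 1 = 4
pchisq(chi_sq, df = 4, lower.tail = FALSE) # ~ 0.1442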
Here’s a table of counts showing the results from an (imaginary)
survey asking students for the program in which they are enrolled and
their favorite statistical paradigm. Use a \(\chi^2\)-test (function
chisq.test) to check whether there is evidence that
preferences for statistical methods differ between different fields
of study.
observed_counts <- matrix(
  c(
    31, 56, 23,
    104, 67, 12,
    24, 34, 42,
    19, 16, 8
  ),
  nrow = 4,
  byrow = TRUE,
  dimnames = list(
    program = c("CogSci", "Psych", "Computer Science", "Philosophy"),
    preference = c("frequentist", "Bayes", "bootstrap")
  )
)
observed_counts
## preference
## program frequentist Bayes bootstrap
## CogSci 31 56 23
## Psych 104 67 12
## Computer Science 24 34 42
## Philosophy 19 16 8
chisq.test(
  observed_counts,
  correct = FALSE # using the continuity correction is also fine, but we didn't discuss it in class
)
##
## Pearson's Chi-squared test
##
## data: observed_counts
## X-squared = 69.473, df = 6, p-value = 5.243e-13
Possible interpretation: The null-hypothesis states that the variables “preference for statistical method” and “field of study” are independent of each other. The very small p-value (\(p \approx 5.2 \times 10^{-13}\)) indicates strong evidence against the null-hypothesis and, thus, in favor of the alternative hypothesis: “preference for statistical method” and “field of study” appear to be dependent.
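To see where the dependence shows up, it can help to inspect the expected counts under independence and the Pearson residuals, both of which chisq.test returns alongside the test result:
# expected counts under the null-hypothesis of independence
chisq.test(observed_counts, correct = FALSE)$expected
# Pearson residuals: cells with large absolute values drive the test result
chisq.test(observed_counts, correct = FALSE)$residuals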
The point of this exercise is to practice a kind of task that may well occur in the midterm exam. [The secondary (cheeky) point of this exercise is to introduce a function that might just happen to be useful for some other thing.]
Consider this function and try to understand what it does.
mysterious_function <- function(vec) {
  map_lgl(
    1:(length(vec) - 1),
    function(i) {vec[i] == vec[i + 1]}
  ) %>% sum()
}
Describe what this function does in plain and easy natural language.
Possible answer: The function takes a vector (vec) as input, walks through the vector and checks, for each position, whether the entry is identical to the one that follows it. The output is the number of identical consecutive pairs found in the input vector.
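A quick sanity check on a small made-up input (hypothetical values, not part of the original solution):
# identical consecutive pairs: (1,1), (0,0), (0,0) -> returns 3
mysterious_function(c(1, 1, 0, 0, 0, 1))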
The point of this exercise is to make you understand better the notion of a test statistic and to become aware of the fact that you could just invent a test statistic to maximally address your specific problem or application. This exercise should also show you that it is possible to compute a \(p\)-value based on a test statistic for whose sampling distribution you cannot give a concise mathematical characterization (not even an approximate one).
We consider a binomial test of a coin’s fairness. We have observed
\(k = 15\) heads out of \(N = 30\) flips. Use the function
binom.test to calculate a two-sided \(p\)-value for the null-hypothesis that
\(\theta = 0.5\). Interpret the
result.
binom.test(15, 30, 0.5)
##
## Exact binomial test
##
## data: 15 and 30
## number of successes = 15, number of trials = 30, p-value = 1
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
## 0.3129703 0.6870297
## sample estimates:
## probability of success
## 0.5
The p-value is 1. Hence, there’s absolutely no indication in the data that the null-hypothesis is false.
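The value of exactly 1 makes sense on reflection: \(k = 15\) is the mode of the Binomial(30, 0.5) sampling distribution, so every possible outcome is at most as likely as the observed one, and the two-sided \(p\)-value therefore sums up the entire distribution. A quick check of this reasoning:
probs_H0 <- dbinom(0:30, size = 30, prob = 0.5)
sum(probs_H0[probs_H0 <= dbinom(15, 30, prob = 0.5)]) # = 1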
The previous test addressed one aspect of our data: the number of heads in the \(N = 30\) flips. Remember that we said in class that the number \(k\) is a test statistic. It is a good one if what we want is information about \(\theta\).
But there are many vectors of raw data with \(N = 30\) that give us a value of test statistic \(k = 15\). Here are three:
obs_1 <- rep(c(1,0), each = 15) # 15 ones followed by 15 zeros
obs_2 <- rep(c(1,0), times = 15) # 15 pairs of one-zero
obs_3 <- c(1,1,1,1,0,0,0,0,0,0,1,1,1,1,0,0,0,1,1,0,0,1,1,1,1,0,0,0,0,1)
We know that all of these observations will lead to the same outcome under a binomial test. But there is another feature of these raw data vectors that should concern us, namely whether each observation was truly independent of the previous one. (Notice that the independence of each coin flip from all others is a fundamental assumption (which might just be wrong!) of the binomial test.)
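Indeed, a quick check with the vectors defined above shows that all three yield the very same value of the test statistic, so the binomial test cannot distinguish between them:
sum(obs_1) # 15
sum(obs_2) # 15
sum(obs_3) # 15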
So, to address the independence issue, let’s just come up with our
own (ad hoc) test statistic. Let’s just count the number of
non-identical consecutive outcomes, i.e., the number of times that
observation \(i\) was different from
observation \(i+1\). Define a function
called number_of_swaps that takes vectors like
obs_X from above as input and returns the number of
non-identical consecutive outcomes. Apply the function to the three
vectors of observations above.
number_of_swaps <- function(vec) {
  map_lgl(
    1:(length(vec) - 1),
    function(i) {vec[i] != vec[i + 1]} # changed to inequality
  ) %>% sum()
}
number_of_swaps(obs_1)
## [1] 1
number_of_swaps(obs_2)
## [1] 29
number_of_swaps(obs_3)
## [1] 8
The number of swaps is a test statistic. (It is simple and useful, but no claim is made here regarding whether it is the best test statistic to address independence!) But what if we do not know a formula that characterizes its sampling distribution?
The answer is: sampling! If we do not know a mathematical characterization of a distribution, but we can write an algorithm of the data-generating process, we can work with (a large set of) samples. (That’s old news.) We can use samples to approximately calculate a \(p\)-value, too. (That’s news (of sorts).)
Write a function called sample_no_swaps that takes as
input an argument n_samples specifying the number of
samples to take, and that outputs, for each sample, the number of swaps
in a vector of 30 random draws of 1s and 0s (each equally likely). [Hint: to generate
a single vector of the kind we are after, you can use this code snippet:
sample(c(1,0), size = 30, replace = TRUE)]
sample_no_swaps <- function(n_samples = 1000) {
  map_int(
    1:n_samples,
    function(i) {
      sample(c(1, 0), size = 30, replace = TRUE) %>%
        number_of_swaps()
    }
  )
}
Using the previously defined function sample_no_swaps,
get 100,000 samples of swaps, obtain counts for each number of swaps,
and plot the result using geom_col.
tibble(
  swaps = sample_no_swaps(100000)
) %>%
  dplyr::count(swaps) %>%
  ggplot(aes(x = swaps, y = n)) +
  geom_col()
Consider the vector obs_3 above. Its number of swaps is surprisingly low under our assumption of independent flips. We want to approximate a one-sided \(p\)-value using MC-sampling. Fill in the missing bits-and-pieces in the code below, compute a \(p\)-value and interpret the result.
n_samples <- 100000
MC_sampling_p_value <- function(n_samples) {
%FILL-ME%(sample_no_swaps(n_samples) %FILL-ME% number_of_swaps(%FILL-ME%))
}
MC_sampling_p_value(n_samples)
n_samples <- 100000
MC_sampling_p_value <- function(n_samples) {
  mean(sample_no_swaps(n_samples) <= number_of_swaps(obs_3))
}
MC_sampling_p_value(n_samples)
## [1] 0.01201
Possible answer: The null-hypothesis assumes that each coin-toss outcome is independent of the previous one. The sampling distribution has its peak around 14-15 swaps (the expected number under independence is \(29 \times 0.5 = 14.5\)). The number of swaps for obs_3 is much smaller (8 swaps). The MC-approximated one-sided p-value, obtained by comparing the sampling distribution against the observed value for obs_3, is significant (\(p \approx 0.012\)). This suggests that the observations in obs_3 are dependent rather than independent, so we reject \(H_0\).
In this exercise our objective is to understand the conceptual differences between Bayesian and Frequentist statistics by analyzing simple experiments with coin tosses.
First, let’s revise a few key concepts of Frequentist statistics:
Hypothesis testing: In Frequentist statistics, hypothesis testing is a method used to determine if there is enough evidence in a sample of data to infer that a certain condition is true for the entire population.
\(p\)-values: We use \(p\)-values to determine if the observed data deviate significantly from what is expected under the null hypothesis.
Method 1: Frequentist Approach
The R solution is provided below - you don’t need to code anything; however, note down your conclusion on the experiment:
# Here, we make sure everyone works with the same data.
set.seed(123)
# Simulate 50 coin tosses with equal chance (H for heads, T for tails).
coin_tosses <- sample(c("H", "T"), 50, replace = TRUE)
# Calculate the proportion of heads: we sum up the number of "H" outcomes and divide by the total number of tosses.
heads_proportion <- sum(coin_tosses == "H") / length(coin_tosses)
# Perform a two-sided binomial test (checks in either direction, i.e. more heads or more tails) to see if the observed proportion of heads
# is significantly different from 0.5.
test_result <- binom.test(sum(coin_tosses == "H"), length(coin_tosses), p = 0.5, alternative = "two.sided")
# Let's print the results.
print(test_result)
##
## Exact binomial test
##
## data: sum(coin_tosses == "H") and length(coin_tosses)
## number of successes = 30, number of trials = 50, p-value = 0.2026
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
## 0.4517940 0.7359216
## sample estimates:
## probability of success
## 0.6
# Create a sequence of head proportions from 0 to 1.
probs <- seq(0, 1, length.out = 100)
# Calculate the density for each proportion in the sequence (given the null hypothesis).
densities <- dbinom(x = round(probs * length(coin_tosses)), size = length(coin_tosses), prob = 0.5)
# Calculate the critical values for the tails of the binomial distribution - these delimit the rejection region for the null hypothesis at significance level 0.05.
critical_value <- qbinom(c(0.025, 0.975), size = length(coin_tosses), prob = 0.5) / length(coin_tosses)
# Create the plot with the probability density function for the binomial distribution.
plot(probs, densities, type = 'l', lwd = 2, col = 'blue',
     main = "Coin Toss Distribution under the Null Hypothesis",
     xlab = "Proportion of Heads", ylab = "Probability Density",
     xlim = c(0, 1), ylim = c(0, max(densities)))
# Shade the regions of the distribution that represent the critical values (tails).
# Any observed proportion of heads within these shaded areas would lead us to reject the null hypothesis.
polygon(c(probs[probs <= critical_value[1]], rev(probs[probs <= critical_value[1]])),
        c(rep(0, sum(probs <= critical_value[1])), rev(densities[probs <= critical_value[1]])),
        col = rgb(0.8, 0.8, 0.8, 0.5))
polygon(c(probs[probs >= critical_value[2]], rev(probs[probs >= critical_value[2]])),
        c(rep(0, sum(probs >= critical_value[2])), rev(densities[probs >= critical_value[2]])),
        col = rgb(0.8, 0.8, 0.8, 0.5))
# Add a vertical line to the plot at the observed proportion of heads.
# This visually indicates where our observed data falls in relation to the expected distribution.
abline(v = heads_proportion, col = "darkred", lwd = 2, lty = 2)
# Add text for the observed proportion and p-value.
text(x = heads_proportion, y = max(densities) * 0.85,
     labels = paste("Observed\nProportion\n(p-value:", round(test_result$p.value, 4), ")"),
     col = "darkred", cex = 0.8, adj = c(1, 0))
# Add a legend to explain the elements.
legend("topright", legend = c("Probability Density", "Critical Region", "Observed Proportion"),
       lty = c(1, 1, 2), lwd = c(2, 2, 2),
       col = c("blue", "grey", "darkred"), bty = "n", cex = 0.8)
YOUR CONCLUSION HERE: The results of the Frequentist approach in our coin-tossing example do not provide sufficient statistical evidence to reject the null hypothesis that the coin is fair. The p-value obtained from the test result is 0.2026, which is greater than the standard alpha level of 0.05. This indicates that the observed data (30 heads out of 50 tosses, or a 60% heads proportion) could reasonably occur by chance when the true probability of heads is 0.5. The 95% confidence interval for the true probability of heads ranges from approximately 0.452 to 0.736; since this interval contains 0.5, it further supports the lack of evidence against the coin’s fairness. Therefore, based on the Frequentist analysis, we conclude that there is no statistical basis to consider the coin biased.
Method 2: Bayesian Approach
Again, we are providing the R solution and no coding is needed from you in this exercise; just provide your conclusion on the experiment:
# We take the same data ("coin_tosses") from Method 1.
# Our prior belief: a flat Beta(1, 1) prior, i.e. all biases are equally likely a priori.
alpha_prior <- 1
beta_prior <- 1
# Update with data: the number of heads and tails are used to update the alpha and beta parameters.
alpha_post <- alpha_prior + sum(coin_tosses == "H")
beta_post <- beta_prior + sum(coin_tosses == "T")
# Define the posterior distribution as a Beta distribution with the updated parameters.
posterior <- function(x) dbeta(x, alpha_post, beta_post)
# A sequence of probabilities from 0 to 1.
x <- seq(0, 1, length = 100)
# Final plot.
plot(x, posterior(x), type = "l", xlab = "Probability of Heads", ylab = "Density", main = "Posterior Distribution")
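As a small optional extension (not part of the provided solution), the posterior can also be summarized numerically, e.g. by its mean and a central 95% credible interval:
# posterior mean of the coin's bias: (1 + 30) / (1 + 30 + 1 + 20) = 31/52, roughly 0.6
alpha_post / (alpha_post + beta_post)
# central 95% credible interval for the bias
qbeta(c(0.025, 0.975), alpha_post, beta_post)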
YOUR CONCLUSION HERE: The resulting posterior distribution gives us a new view of what we believe about the coin’s fairness after seeing the data. The peak of this distribution has shifted to 0.6, which means that, after observing the data, a probability of 0.6 for heads is more credible than our initial guess of 0.5 (which would be the peak if the coin were fair). However, the fact that the peak is at 0.6 does not mean the true probability of heads is 0.6, just that this is our best estimate given the data and our prior. The width of the posterior distribution reflects our remaining uncertainty: a narrower peak would indicate more certainty, a wider peak less. So, while the Frequentist approach did not provide enough evidence to reject the hypothesis that the coin is fair, the Bayesian approach has shifted our belief towards the coin being somewhat more likely to land on heads, though still with considerable uncertainty. In contrast to the Frequentist approach, which does not take prior belief into account and only uses the observed data to test a hypothesis, the Bayesian method provides a full probability distribution over possible values of the coin’s probability of heads. To conclude, the Frequentist p-value was not sufficiently low to reject the null hypothesis of fairness (p-value > 0.05), and the Bayesian analysis suggests a shift in belief towards a probability of heads that is higher than 0.5.
¹ Notice that the Pearson \(\chi^2\)-test rests on an approximation of normality, which is only sufficiently accurate if we have enough samples. A rule of thumb is that at most 20% of all cells should have expected frequencies below 5 in order for the test to be applicable.