Acknowledgements: These exercises are based on the materials prepared by Prof. Michael Franke

Exercise 1: Common probability distributions in R

The goal of this exercise is to get more comfortable with probability distributions in R, using functions like rnorm and dbinom.

Binomial distribution

Plot the binomial distribution

Random variable \(K\) follows a binomial distribution, so that \(k \sim \text{Binom}(\theta, N)\), where \(\theta\) is the bias towards heads (= success, number 1) and \(N\) is the number of coin tosses (trials). Plot the probability mass function for \(0 \le k \le N\) on the \(x\)-axis, for each one of the four values \(\theta \in \{0.2, 0.5, 0.7, 0.9\}\), keeping \(N=24\) fixed. Use a facet plot. The outcome should look roughly like the plot below.

expand_grid(
  k = 0:24,
  theta = c(0.2, 0.5, 0.7, 0.9)
) %>% 
  mutate(
    lh = dbinom(k, 24, theta),
    theta = str_c("theta = ", theta)
  ) %>% 
  ggplot(aes(x = k, y = lh)) +
  geom_col() +
  facet_wrap(~theta, ncol = 2) +
  labs(
    x = latex2exp::TeX("Number of heads $k$"),
    y = "Likelihood of observation"
  )

Interpret the plot

Describe in one concise sentence what one of the four faceted plot shows. For example (but totally wrong): “The graphs in each facet show the prior probability of latent parameter \(k\) for data observations from different conditions.”

Cumulative probability

Calculate the probability of observing no more than 7 successes for \(N= 24\) and each \(\theta \in \{0.2, 0.5, 0.7, 0.9\}\).

theta <- c(0.2, 0.5, 0.7, 0.9)
pbinom(7, 24, theta)
## [1] 9.108287e-01 3.195733e-02 4.387079e-05 1.729516e-12

Normal distribution

Sample from the normal distribution and plot

Suppose variable \(X\) follows a normal distribution, so that \(x \sim \text{Norm}(\mu,\sigma)\) for some mean \(\mu\) and standard deviation \(\sigma\). For each combination of parameter values \(\mu \in \{0, 10\}\) and \(\sigma \in \{1,5\}\), take 10,000 random samples from \(X\). Plot the results as a histogram, using facets. The output could look roughly like the plot below.

n_samples = 10000
expand_grid(
  mu = c(0,10),
  sd = c(1,5),
  samples = 1
) %>% 
  nest(samples) %>% 
  group_by(mu, sd) %>% 
  mutate(
    samples = list(s = rnorm(10000,mu,sd))
  ) %>% 
  select(- data) %>% 
  unnest(samples)%>% 
  ungroup() %>% 
  mutate(
    mu = str_c("mu = ", mu),
    sd = str_c("sigma = ", sd)
  ) %>% 
  ggplot(aes(x = samples)) +
  geom_histogram() +
  facet_grid(mu ~ sd, scales = "free") 

Cumulative probability

Calculate the probability of observing an outcome of at least 0.5, \(P(X \ge 0.5)\), for each of the four parameter value combinations used above. The outcome of your calculation should be:

expand_grid(
  mu = c(0,10),
  sd = c(1,5)
) %>% 
  mutate(
    prob = 1 - pnorm(0.5, mu, sd)
  )
## # A tibble: 4 × 3
##      mu    sd  prob
##   <dbl> <dbl> <dbl>
## 1     0     1 0.309
## 2     0     5 0.460
## 3    10     1 1    
## 4    10     5 0.971

Explain these results

Explain the results from the previous task. Why do we see this ordering in the probability scores for different parameter pairs, e.g., why does \(\mu = 1\) and \(\sigma = 1\) score the lowest, followed by …, followed by … etc.?

Exercise 2. Plot joint and marginal probabilities

Before we dive into the exercise, let’s revise what joint and marginal probabilities mean.

Joint probability calculates the likelihood of two events happening at the same time. It is denoted as \(P(A \cap B)\) which reads as “the probability of A and B occurring together” and is defined as follows:

\[ \begin{align*} P(A \cap B) &= P(A) \times P(B|A) \end{align*} \] where,

Alternatively, if events \(A\) and \(B\) are independent (the occurrence of one does not affect the occurrence of the other), the joint probability is calculated as follows:

\[ \begin{align*} P(A \cap B) &= P(A) \times P(B) \end{align*} \] Keeping this information in mind and when starting to work on ex. 2, ask yourself which formula would suit the Titanic context. Are the events independent or dependent of each other?

Marginal probability is the probability of an event occurring irrespective of the outcome of another event. It’s essentially the probability of a single event in the presence of multiple variables. In scenarios where you have joint probabilities involving two or more variables, the marginal probability of one of these variables is the sum of joint probabilities over all levels of the other variable(s).

For a simple event case with two events, \(A\) and \(B\), where \(B\) can have multiple outcomes (e.g. \(B_1, B_2, ..., B_n\)), the marginal probability of event \(A\) is expressed as:

\[ P(A) = \sum_{i} P(A \cap B_i) \]

Optional videos:

In this exercise we will be using the titanic dataset which you can directly download using R package titanic.

Your tasks:

Yes No
Male x y
Female z w

Where, the x, y, z, w represent counts of each category combination (e.g., x being the total number of males who survived).

Hint for calculating joint probabilities:

You can try to make your own calculation by hand first (using your contingency table output) and then compare your results with prop.table() function output.

Hint for calculating marginal probabilities:

Again, try calculating the probabilities by hand first, and then compare with your function result.

data <- titanic::titanic_train

data_cleaned <- data %>% 
  select(Sex, Survived) %>% 
  filter(!is.na(Sex), !is.na(Survived)) %>% 
  mutate(Survived=factor(ifelse(Survived==1, "Yes", "No"),
                         levels=c("Yes", "No")))

cont_table <- table(data_cleaned$Sex, data_cleaned$Survived)
joint_probs <- prop.table(cont_table)

ggplot(as.data.frame(joint_probs), aes(x=Var1, y=Freq, fill=Var2)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  labs(x = "Gender", 
       y = "Probability", 
       fill = "Survived", 
       title = "Joint Probability of Gender and Survival on Titanic")

marginal_probs_sex <- prop.table(margin.table(cont_table, 1))
marginal_probs_survival <- prop.table(margin.table(cont_table, 2))

ggplot(as.data.frame(marginal_probs_survival), aes(x=Var1, y=Freq)) +
  geom_bar(stat = "identity", fill = "blue") +
  labs(x = "Survival", 
       y = "Probability", 
       title = "Marginal Probability of Survival")

Question: In 2-3 sentences interpret your results, mentioning the importance of joint probability and how it differences from marginal probability.

The joint probability calculations for the Titanic dataset show the likelihood of survival in relation to gender, revealing a clear discrepancy between males and females, with females having a higher probability of survival. In contrast, the marginal probabilities of survival, irrespective of gender, present an overall picture of the chances of survival.

Exercise 3. Combining two random variables

In our fictitious example, a neuropsychologist took a large sample of 10.000 students and looked at their reaction times of a task with two conditions EASY and HARD. Suppose that the neuropsychologist randomly assigned the condition EASY or HARD to a student. He looks at the difference between the conditions. The summary statistics for the conditions of the people in the study are shown below.

tribble(~ condition, ~ mean , ~ sd,
        "EASY",   180,    5,
        "HARD", 159,    12)
## # A tibble: 2 × 3
##   condition  mean    sd
##   <chr>     <dbl> <dbl>
## 1 EASY        180     5
## 2 HARD        159    12

Take 10.000 samples and model the distributions in a single plot like so:

n_samples = 10000
df <- tibble(
  mu = c(180, 159),
  sd = c(5, 12),
  samples = 1
) %>% 
  nest(samples) %>% 
  group_by(mu) %>% 
  mutate(
    samples = list(s = rnorm(10000,mu,sd))
  ) %>% 
  select(- data) %>% 
  unnest(samples)%>% 
  ungroup() %>% 
  mutate(
    mu = str_c("mu = ", mu),
    sd = str_c("sigma = ", sd)
  )

ggplot(df, aes(x= samples, fill= mu)) +
  geom_histogram( color='#e9ecef', alpha=0.6, position='identity')

a.) What is the mean of the difference between the two heights?

b.) What is the variance of the difference between the two heights?

c.) What is the standard deviation of the difference between the two heights?

Check your results by calculating the mean, variance and standard deviation. It might help to visualize the distribution of reactiontime(EASY) - reactiontime(HARD). Try to construct the distribution similar to the one below.

samples_diff = df[df$mu == "mu = 180",]$samples - df[df$mu == "mu = 159",]$samples
df_diff <- as_tibble(samples_diff)
ggplot(df_diff, aes(x = samples_diff)) +
  geom_histogram(color='#e9ecef', alpha=0.6, position='identity')