Acknowledgements: These exercises are based on the materials prepared by Prof. Michael Franke

Exercise 1: Common probability distributions in R

The goal of this exercise is to get more comfortable with probability distributions in R, using functions like rnorm and dbinom.

Binomial distribution

Plot the binomial distribution

Random variable \(K\) follows a binomial distribution, so that \(k \sim \text{Binom}(\theta, N)\), where \(\theta\) is the bias towards heads (= success, number 1) and \(N\) is the number of coin tosses (trials). Plot the probability mass function for \(0 \le k \le N\) on the \(x\)-axis, for each one of the four values \(\theta \in \{0.2, 0.5, 0.7, 0.9\}\), keeping \(N=24\) fixed. Use a facet plot. The outcome should look roughly like the plot below.

expand_grid(
  k = 0:24,
  theta = c(0.2, 0.5, 0.7, 0.9)
) %>% 
  mutate(
    lh = dbinom(k, 24, theta),
    theta = str_c("theta = ", theta)
  ) %>% 
  ggplot(aes(x = k, y = lh)) +
  geom_col() +
  facet_wrap(~theta, ncol = 2) +
  labs(
    x = latex2exp::TeX("Number of heads $k$"),
    y = "Likelihood of observation"
  )

Interpret the plot

Describe in one concise sentence what one of the four faceted plot shows. For example (but totally wrong): “The graphs in each facet show the prior probability of latent parameter \(k\) for data observations from different conditions.”

Cumulative probability

Calculate the probability of observing no more than 7 successes for \(N= 24\) and each \(\theta \in \{0.2, 0.5, 0.7, 0.9\}\).

theta <- c(0.2, 0.5, 0.7, 0.9)
pbinom(7, 24, theta)

## [1] 9.108287e-01 3.195733e-02 4.387079e-05 1.729516e-12

Normal distribution

Sample from the normal distribution and plot

Suppose variable \(X\) follows a normal distribution, so that \(x \sim \text{Norm}(\mu,\sigma)\) for some mean \(\mu\) and standard deviation \(\sigma\). For each combination of parameter values \(\mu \in \{0, 10\}\) and \(\sigma \in \{1,5\}\), take 10,000 random samples from \(X\). Plot the results as a histogram, using facets. The output could look roughly like the plot below.

n_samples = 10000
expand_grid(
  mu = c(0,10),
  sd = c(1,5),
  samples = 1
) %>% 
  nest(samples) %>% 
  group_by(mu, sd) %>% 
  mutate(
    samples = list(s = rnorm(10000,mu,sd))
  ) %>% 
  select(- data) %>% 
  unnest(samples)%>% 
  ungroup() %>% 
  mutate(
    mu = str_c("mu = ", mu),
    sd = str_c("sigma = ", sd)
  ) %>% 
  ggplot(aes(x = samples)) +
  geom_histogram() +
  facet_grid(mu ~ sd, scales = "free")

Cumulative probability

Calculate the probability of observing an outcome of at least 0.5, \(P(X \ge 0.5)\), for each of the four parameter value combinations used above. The outcome of your calculation should be:

expand_grid(
  mu = c(0,10),
  sd = c(1,5)
) %>% 
  mutate(
    prob = 1 - pnorm(0.5, mu, sd)
  )

## # A tibble: 4 × 3
##      mu    sd  prob
##   <dbl> <dbl> <dbl>
## 1     0     1 0.309
## 2     0     5 0.460
## 3    10     1 1    
## 4    10     5 0.971

Explain these results

Explain the results from the previous task. Why do we see this ordering in the probability scores for different parameter pairs, e.g., why does \(\mu = 1\) and \(\sigma = 1\) score the lowest, followed by …, followed by … etc.?

Exercise 2. Plot joint and marginal probabilities

Before we dive into the exercise, let’s revise what joint and marginal probabilities mean.

Joint probability calculates the likelihood of two events happening at the same time. It is denoted as \(P(A \cap B)\) which reads as “the probability of A and B occurring together” and is defined as follows:

\[ \begin{align*} P(A \cap B) &= P(A) \times P(B|A) \end{align*} \] where,

\(P(A)\) is the probability of event \(A\) occurring,
\(P(B|A)\) is the probability of event \(B\) occurring given that event \(A\) has occurred.

Alternatively, if events \(A\) and \(B\) are independent (the occurrence of one does not affect the occurrence of the other), the joint probability is calculated as follows:

\[ \begin{align*} P(A \cap B) &= P(A) \times P(B) \end{align*} \] Keeping this information in mind and when starting to work on ex. 2, ask yourself which formula would suit the Titanic context. Are the events independent or dependent of each other?

Marginal probability is the probability of an event occurring irrespective of the outcome of another event. It’s essentially the probability of a single event in the presence of multiple variables. In scenarios where you have joint probabilities involving two or more variables, the marginal probability of one of these variables is the sum of joint probabilities over all levels of the other variable(s).

For a simple event case with two events, \(A\) and \(B\), where \(B\) can have multiple outcomes (e.g. \(B_1, B_2, ..., B_n\)), the marginal probability of event \(A\) is expressed as:

\[ P(A) = \sum_{i} P(A \cap B_i) \]

Optional videos:

In this exercise we will be using the titanic dataset which you can directly download using R package titanic.

Your tasks:

Clean and transform data, making sure to extract two relevant dataset columns for our task: Sex and Survived with select() function.
Filter out rows with missing values with filter(), and convert Survived column into a factor with two levels Yes and No using mutate() and filter() functions.
Create a contingency table out of Sex and Survived columns. It should resemble the following table structure:

	Yes	No
Male	x	y
Female	z	w

Where, the x, y, z, w represent counts of each category combination (e.g., x being the total number of males who survived).

Calculate joint probabilities using prop.table().
Calculate marginal probabilities for both Sex and Survived using prop.table().
Visualize the joint and marginal probabilities.

Hint for calculating joint probabilities:

\(N\) is the total number of passengers.
The probability of being a female passenger: \(P(\text{Female}) = \frac{\text{Number of female passengers}}{N}\).
The conditional probability of surviving given that the passenger is female: \(P(\text{Female} | \text{Survived}) = \frac{\text{Number of female passengers who survived}}{\text{Number of female passengers}}\).
Then the joint probability of a passenger being female and surviving is: \(P(\text{Female} \cap \text{Survived}) = P(\text{Female}) \times P(\text{Survived} \mid \text{Female})\).

You can try to make your own calculation by hand first (using your contingency table output) and then compare your results with prop.table() function output.

Hint for calculating marginal probabilities:

Calculation of marginal probability of being female: \(P(\text{Female}) = \frac{\text{Count(Female and Yes)} + \text{Count(Female and No)}}{\text{Total Count}}\)
Calculation of marginal probability of surviving: \(P(\text{Survived} = \text{Yes}) = \frac{\text{Count(Female and Yes)} + \text{Count(Male and Yes)}}{\text{Total Count}}\)

Again, try calculating the probabilities by hand first, and then compare with your function result.

data <- titanic::titanic_train

data_cleaned <- data %>% 
  select(Sex, Survived) %>% 
  filter(!is.na(Sex), !is.na(Survived)) %>% 
  mutate(Survived=factor(ifelse(Survived==1, "Yes", "No"),
                         levels=c("Yes", "No")))

cont_table <- table(data_cleaned$Sex, data_cleaned$Survived)
joint_probs <- prop.table(cont_table)

ggplot(as.data.frame(joint_probs), aes(x=Var1, y=Freq, fill=Var2)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  labs(x = "Gender", 
       y = "Probability", 
       fill = "Survived", 
       title = "Joint Probability of Gender and Survival on Titanic")

marginal_probs_sex <- prop.table(margin.table(cont_table, 1))
marginal_probs_survival <- prop.table(margin.table(cont_table, 2))

ggplot(as.data.frame(marginal_probs_survival), aes(x=Var1, y=Freq)) +
  geom_bar(stat = "identity", fill = "blue") +
  labs(x = "Survival", 
       y = "Probability", 
       title = "Marginal Probability of Survival")

Question: In 2-3 sentences interpret your results, mentioning the importance of joint probability and how it differences from marginal probability.

The joint probability calculations for the Titanic dataset show the likelihood of survival in relation to gender, revealing a clear discrepancy between males and females, with females having a higher probability of survival. In contrast, the marginal probabilities of survival, irrespective of gender, present an overall picture of the chances of survival.

Exercise 3. Combining two random variables

In our fictitious example, a neuropsychologist took a large sample of 10.000 students and looked at their reaction times of a task with two conditions EASY and HARD. Suppose that the neuropsychologist randomly assigned the condition EASY or HARD to a student. He looks at the difference between the conditions. The summary statistics for the conditions of the people in the study are shown below.

tribble(~ condition, ~ mean , ~ sd,
        "EASY",   180,    5,
        "HARD", 159,    12)

## # A tibble: 2 × 3
##   condition  mean    sd
##   <chr>     <dbl> <dbl>
## 1 EASY        180     5
## 2 HARD        159    12

Take 10.000 samples and model the distributions in a single plot like so:

n_samples = 10000
df <- tibble(
  mu = c(180, 159),
  sd = c(5, 12),
  samples = 1
) %>% 
  nest(samples) %>% 
  group_by(mu) %>% 
  mutate(
    samples = list(s = rnorm(10000,mu,sd))
  ) %>% 
  select(- data) %>% 
  unnest(samples)%>% 
  ungroup() %>% 
  mutate(
    mu = str_c("mu = ", mu),
    sd = str_c("sigma = ", sd)
  )

ggplot(df, aes(x= samples, fill= mu)) +
  geom_histogram( color='#e9ecef', alpha=0.6, position='identity')

a.) What is the mean of the difference between the two heights?

b.) What is the variance of the difference between the two heights?

c.) What is the standard deviation of the difference between the two heights?

Check your results by calculating the mean, variance and standard deviation. It might help to visualize the distribution of reactiontime(EASY) - reactiontime(HARD). Try to construct the distribution similar to the one below.

samples_diff = df[df$mu == "mu = 180",]$samples - df[df$mu == "mu = 159",]$samples
df_diff <- as_tibble(samples_diff)
ggplot(df_diff, aes(x = samples_diff)) +
  geom_histogram(color='#e9ecef', alpha=0.6, position='identity')

Week 5. Probability theory

November 15, 2023

Exercise 1: Common probability distributions in R

Binomial distribution

Plot the binomial distribution

Interpret the plot

Cumulative probability

Normal distribution

Sample from the normal distribution and plot

Cumulative probability

Explain these results

Exercise 2. Plot joint and marginal probabilities

Exercise 3. Combining two random variables