Acknowledgements: These exercises are based on the materials prepared by Prof. Michael Franke
The goal of this exercise is to get more comfortable with probability
distributions in R, using functions like rnorm and
dbinom.
Random variable \(K\) follows a binomial distribution, so that \(k \sim \text{Binom}(\theta, N)\), where \(\theta\) is the bias towards heads (= success, number 1) and \(N\) is the number of coin tosses (trials). Plot the probability mass function for \(0 \le k \le N\) on the \(x\)-axis, for each one of the four values \(\theta \in \{0.2, 0.5, 0.7, 0.9\}\), keeping \(N=24\) fixed. Use a facet plot. The outcome should look roughly like the plot below.
expand_grid(
k = 0:24,
theta = c(0.2, 0.5, 0.7, 0.9)
) %>%
mutate(
lh = dbinom(k, 24, theta),
theta = str_c("theta = ", theta)
) %>%
ggplot(aes(x = k, y = lh)) +
geom_col() +
facet_wrap(~theta, ncol = 2) +
labs(
x = latex2exp::TeX("Number of heads $k$"),
y = "Likelihood of observation"
)
Describe in one concise sentence what one of the four faceted plot shows. For example (but totally wrong): “The graphs in each facet show the prior probability of latent parameter \(k\) for data observations from different conditions.”
Calculate the probability of observing no more than 7 successes for \(N= 24\) and each \(\theta \in \{0.2, 0.5, 0.7, 0.9\}\).
theta <- c(0.2, 0.5, 0.7, 0.9)
pbinom(7, 24, theta)
## [1] 9.108287e-01 3.195733e-02 4.387079e-05 1.729516e-12
Suppose variable \(X\) follows a normal distribution, so that \(x \sim \text{Norm}(\mu,\sigma)\) for some mean \(\mu\) and standard deviation \(\sigma\). For each combination of parameter values \(\mu \in \{0, 10\}\) and \(\sigma \in \{1,5\}\), take 10,000 random samples from \(X\). Plot the results as a histogram, using facets. The output could look roughly like the plot below.
n_samples = 10000
expand_grid(
mu = c(0,10),
sd = c(1,5),
samples = 1
) %>%
nest(samples) %>%
group_by(mu, sd) %>%
mutate(
samples = list(s = rnorm(10000,mu,sd))
) %>%
select(- data) %>%
unnest(samples)%>%
ungroup() %>%
mutate(
mu = str_c("mu = ", mu),
sd = str_c("sigma = ", sd)
) %>%
ggplot(aes(x = samples)) +
geom_histogram() +
facet_grid(mu ~ sd, scales = "free")
Calculate the probability of observing an outcome of at least 0.5, \(P(X \ge 0.5)\), for each of the four parameter value combinations used above. The outcome of your calculation should be:
expand_grid(
mu = c(0,10),
sd = c(1,5)
) %>%
mutate(
prob = 1 - pnorm(0.5, mu, sd)
)
## # A tibble: 4 × 3
## mu sd prob
## <dbl> <dbl> <dbl>
## 1 0 1 0.309
## 2 0 5 0.460
## 3 10 1 1
## 4 10 5 0.971
Explain the results from the previous task. Why do we see this ordering in the probability scores for different parameter pairs, e.g., why does \(\mu = 1\) and \(\sigma = 1\) score the lowest, followed by …, followed by … etc.?
Before we dive into the exercise, let’s revise what joint and marginal probabilities mean.
Joint probability calculates the likelihood of two events happening at the same time. It is denoted as \(P(A \cap B)\) which reads as “the probability of A and B occurring together” and is defined as follows:
\[ \begin{align*} P(A \cap B) &= P(A) \times P(B|A) \end{align*} \] where,
Alternatively, if events \(A\) and \(B\) are independent (the occurrence of one does not affect the occurrence of the other), the joint probability is calculated as follows:
\[ \begin{align*} P(A \cap B) &= P(A) \times P(B) \end{align*} \] Keeping this information in mind and when starting to work on ex. 2, ask yourself which formula would suit the Titanic context. Are the events independent or dependent of each other?
Marginal probability is the probability of an event occurring irrespective of the outcome of another event. It’s essentially the probability of a single event in the presence of multiple variables. In scenarios where you have joint probabilities involving two or more variables, the marginal probability of one of these variables is the sum of joint probabilities over all levels of the other variable(s).
For a simple event case with two events, \(A\) and \(B\), where \(B\) can have multiple outcomes (e.g. \(B_1, B_2, ..., B_n\)), the marginal probability of event \(A\) is expressed as:
\[ P(A) = \sum_{i} P(A \cap B_i) \]
Optional videos:
In this exercise we will be using the titanic dataset which you can
directly download using R package titanic.
Your tasks:
Sex and Survived
with select() function.filter(), and
convert Survived column into a factor with two levels
Yes and No using mutate() and
filter() functions.Sex and
Survived columns. It should resemble the following table
structure:| Yes | No | |
|---|---|---|
| Male | x | y |
| Female | z | w |
Where, the x, y, z,
w represent counts of each category combination (e.g.,
x being the total number of males who survived).
prop.table().Sex and Survived using
prop.table().Hint for calculating joint probabilities:
You can try to make your own calculation by hand first (using your
contingency table output) and then compare your results with
prop.table() function output.
Hint for calculating marginal probabilities:
Calculation of marginal probability of being female: \(P(\text{Female}) = \frac{\text{Count(Female and Yes)} + \text{Count(Female and No)}}{\text{Total Count}}\)
Calculation of marginal probability of surviving: \(P(\text{Survived} = \text{Yes}) = \frac{\text{Count(Female and Yes)} + \text{Count(Male and Yes)}}{\text{Total Count}}\)
Again, try calculating the probabilities by hand first, and then compare with your function result.
data <- titanic::titanic_train
data_cleaned <- data %>%
select(Sex, Survived) %>%
filter(!is.na(Sex), !is.na(Survived)) %>%
mutate(Survived=factor(ifelse(Survived==1, "Yes", "No"),
levels=c("Yes", "No")))
cont_table <- table(data_cleaned$Sex, data_cleaned$Survived)
joint_probs <- prop.table(cont_table)
ggplot(as.data.frame(joint_probs), aes(x=Var1, y=Freq, fill=Var2)) +
geom_bar(stat = "identity", position = position_dodge()) +
labs(x = "Gender",
y = "Probability",
fill = "Survived",
title = "Joint Probability of Gender and Survival on Titanic")
marginal_probs_sex <- prop.table(margin.table(cont_table, 1))
marginal_probs_survival <- prop.table(margin.table(cont_table, 2))
ggplot(as.data.frame(marginal_probs_survival), aes(x=Var1, y=Freq)) +
geom_bar(stat = "identity", fill = "blue") +
labs(x = "Survival",
y = "Probability",
title = "Marginal Probability of Survival")
Question: In 2-3 sentences interpret your results, mentioning the importance of joint probability and how it differences from marginal probability.
The joint probability calculations for the Titanic dataset show the likelihood of survival in relation to gender, revealing a clear discrepancy between males and females, with females having a higher probability of survival. In contrast, the marginal probabilities of survival, irrespective of gender, present an overall picture of the chances of survival.
In our fictitious example, a neuropsychologist took a large sample of 10.000 students and looked at their reaction times of a task with two conditions EASY and HARD. Suppose that the neuropsychologist randomly assigned the condition EASY or HARD to a student. He looks at the difference between the conditions. The summary statistics for the conditions of the people in the study are shown below.
tribble(~ condition, ~ mean , ~ sd,
"EASY", 180, 5,
"HARD", 159, 12)
## # A tibble: 2 × 3
## condition mean sd
## <chr> <dbl> <dbl>
## 1 EASY 180 5
## 2 HARD 159 12
Take 10.000 samples and model the distributions in a single plot like so:
n_samples = 10000
df <- tibble(
mu = c(180, 159),
sd = c(5, 12),
samples = 1
) %>%
nest(samples) %>%
group_by(mu) %>%
mutate(
samples = list(s = rnorm(10000,mu,sd))
) %>%
select(- data) %>%
unnest(samples)%>%
ungroup() %>%
mutate(
mu = str_c("mu = ", mu),
sd = str_c("sigma = ", sd)
)
ggplot(df, aes(x= samples, fill= mu)) +
geom_histogram( color='#e9ecef', alpha=0.6, position='identity')
a.) What is the mean of the difference between the two heights?
b.) What is the variance of the difference between the two heights?
c.) What is the standard deviation of the difference between the two heights?
Check your results by calculating the mean, variance and standard
deviation. It might help to visualize the distribution of
reactiontime(EASY) - reactiontime(HARD). Try to construct
the distribution similar to the one below.
samples_diff = df[df$mu == "mu = 180",]$samples - df[df$mu == "mu = 159",]$samples
df_diff <- as_tibble(samples_diff)
ggplot(df_diff, aes(x = samples_diff)) +
geom_histogram(color='#e9ecef', alpha=0.6, position='identity')