nhgh <- read.csv("~/probability-and-inference-portfolio-Ganiyu-Mubarak/nhgh.csv")
dim(nhgh)
## [1] 6795 21
The final exam will be a one-on-one oral exam with the instructor. Please meet the instructor near the “fish-bowl” office in the Data Science Institute lobby. The exam will be recorded in Zoom. Please prepare solutions to the following is a set of questions. During the oral exam, the instructor will ask a series of questions covering topics from the course and the questions. For example, the instructor may ask:
You will be graded on both the accuracy of your responses and the clarity with which you explain course concepts and solutions to questions.
The final exam should represent your own work. Do not consult with or collaborate in any way with anyone other than the instructor.
Prior to meeting with the instructor, you should:
99-final-exam.The Monte Hall problem is a classic game show. Contestants on the show where shown three doors. Behind one randomly selected door was a sportscar; behind the other doors were goats.
At the start of the game, contestants would select a door, say door A. Then, the host would open either door B or C to reveal a goat. At that point in the game, the host would ask the contestant if she would like to change her door selection. Once a contestant decided to stay or change, the host would open the chosen door to reveal the game prize, either a goat or a car.
In this problem, consider a modified version of the Monte Hall problem in which the number of doors is variable. Rather than 3 doors, consider a game with 4 or 5 or 50 doors. In the modified version of the game, a contestant would select an initial door, say door A. Then, the host would open one of the remaining doors to reveal a goat. At that point in the game, the host would ask the contestant if she would like to change her door selection. Once a contestant decided to stay or change, the host would open the chosen door to reveal the game prize, either a goat or a car.
Consider two strategies:
C. The function game below plays a single game of Monte Hall. The function returns a vector of length two, the first element is the prize under strategy 1 and the second element is the prize under strategy 2. The function has a single input parameter, N, which is the number of doors in the game.
Use the game function to estimate the probability that both strategies result in a goat. Let N=4.
game <- function(N){
if(N<3) stop("Must have at least 3 doors")
prize <- sample(c(rep("goat",N-1),"car"), N)
guess <- sample(1:N,1)
game <- data.frame(door = 1:N, prize = prize, stringsAsFactors = FALSE) %>%
mutate(first_guess = case_when(
door == guess ~ 1
, TRUE ~ 0
)) %>%
mutate(potential_reveal = case_when(
first_guess == 1 ~ 0
, prize == "car" ~ 0
, TRUE ~ 1
)) %>%
mutate(reveal = 1*(rank(potential_reveal, ties.method = "random") == 3)) %>%
mutate(potential_switch = case_when(
first_guess == 1 ~ 0
, reveal == 1 ~ 0
, TRUE ~ 1
)) %>%
mutate(switch = 1*(rank(potential_switch, ties.method = "random") == 3))
c(game$prize[game$first_guess == 1], game$prize[game$switch == 1])
}
Solution
d <- data.frame()
for (i in 1:1000) {
g <- game(4)
for (m in 1:2) {
d[i,m] <- g[m]
}
}
d <- d %>% mutate(both_goat = if_else(V1 == "goat" & V2 == "goat", 1, 0))
d_goat <- d %>% filter(both_goat == 1)
sprintf("The probability of getting a goat through both strategies is %.2f", nrow(d_goat)/nrow(d))
## [1] "The probability of getting a goat through both strategies is 0.37"
B. Communicate the precision of your simulated probability in part C by calculating a 99% confidence interval.
Solution
prop.test(nrow(d_goat), nrow(d), conf.level = 0.99)
##
## 1-sample proportions test with continuity correction
##
## data: nrow(d_goat) out of nrow(d), null probability 0.5
## X-squared = 66.049, df = 1, p-value = 4.399e-16
## alternative hypothesis: true p is not equal to 0.5
## 99 percent confidence interval:
## 0.3321364 0.4115856
## sample estimates:
## p
## 0.371
The 99% confidence interval of the probability of getting goats through the two strategies ranges from 0.3634 to 0.3884. This means that the probability of getting goats through both strategies is likely to lie between 0.3634 and 0.3884 99 percent of the time.
A. Let D(N) be the difference between the difference in probabilities between strategy 2 and strategy 1.
\[ D(N) = P(\text{win strategy 2}|\text{N doors}) - P(\text{win strategy 1}|\text{N doors}) \]
Create a plot that shows how D changes as N increases. Put N on the x-asis, ranging from 3 to 10. Put D on the y-axis.
Solution
D <- function(N) {
d <- data.frame()
for (i in 1:1000) {
g <- game(N)
for (m in 1:2) {
d[i,m] <- g[m]
}
}
d <- d %>% mutate(second_win = if_else(V1 == "goat" & V2 == "car", 1, 0))
d <- d %>% mutate(first_win = if_else(V1 == "car" & V2 == "goat", 1, 0))
d_second <- d %>% filter(second_win == 1)
d_first <- d %>% filter(first_win == 1)
second_win <- nrow(d_second)/nrow(d)
first_win <- nrow(d_first)/nrow(d)
prob <- second_win - first_win
return(prob)
}
N <- 3:10
val <- c()
for (i in seq_along(N)) {
val[i] <- D(N[i])
}
df <- data.frame(x = N, y = val)
ggplot(df) +
geom_line(aes(x, y), color = "blue") +
labs(title = "As N increases D(N) reduces", x = "N", y = "D(N)") +
theme_bw() +
theme(plot.title = element_text(hjust=0.5))
Generally, as N increases, D(N) reduces. However, there is still a better chance of winning via strategy 2 as D(N) >= 0.
Consider a test for a rare genetic condition. Let T+ denote a test result that indicates the condition is present, while T- denotes absence. Let D+ and D- denote the true status of the disease.
C. Fill-in the probability table using the following information:
| D+ | D- | ||
|---|---|---|---|
| T+ | |||
| T- | |||
| 0.001 | 1 |
Solution
| D+ | D- | ||
|---|---|---|---|
| T+ | 0.00085 | 0.04995 | 0.0508 |
| T- | 0.00015 | 0.94905 | 0.9492 |
| 0.001 | 0.999 | 1 |
B. Calculate the negative predictive value of the test, P(D-|T-).
Solution
\[P(D-|T-)=P(T-|D-)*P(D-)\div P(T-)\]
\[P(T-|D-)*P(D-)=0.94905\] \[P(T-)=0.9492\] \[P(D-|T-)=0.99984\]
p_d_neg_t_neg = 0.94905/0.9492
p_d_neg_t_neg
## [1] 0.999842
A Create a plot that shows how the positive predictive value as a function of the prevalence of disease, P(D+).
prevalence <- sequence(0.001, 0.1, length = 50)
ppv <- ???
plot(prevalence, ppv, xlab = "Prevalence", ylab = "PPV")
Solution \[P(D+|T+)=P(T+|D+)*P(D+)\div P(T+)\] According to our table above, \[P(T+|D+)=0.85\] \[P(T+)=0.0508\] \[P(D+|T+)=0.85*P(D+)\div 0.0508\]
prevalence <- seq.int(0.001, 0.1, length = 50)
ppv <- (0.85*prevalence)/0.508
plot(prevalence, ppv, xlab = "Prevalence", ylab = "PPV",
main = "As prevalence increases, PPV increases")
Suppose the yearly hospital charges (in thousands of dollars) for a randomly selected Vanderbilt student is a mixture distribution.
For 50% of students, the hospital charges will be $0. For the remaining 50% of students, the hospital charges are a random variable described by a gamma distribution with shape = 2 and scale = 2. (Again, in thousands of dollars.)
hospital_charges <- function(N){
group <- rbinom(N, 1, 0.5)
charges <- 0*group + rgamma(N, shape = 2, scale = 2)*(1-group)
charges
}
C. What is the 90th percentile for yearly hospital charges for a randomly selected Vanderbilt student?
Solution
ninetieth <- quantile(hospital_charges(13500), 0.9)
# 13500 students used because there are about 13500 students at Vanderbilt
sprintf("The 90th percentile for yearly hospital charges for a randomly selected Vanderbilt student is: %.2f (in thousands of dollars)", ninetieth)
## [1] "The 90th percentile for yearly hospital charges for a randomly selected Vanderbilt student is: 6.02 (in thousands of dollars)"
B. Consider the class average yearly hospital charge for the students in a class of size 30. Plot the density function or a simulated histogram of the class average yearly hospital charge.
Solution
avg <- c()
for (i in 1:1000) {
avg[i] <- mean(hospital_charges(30))
}
hist(avg, freq = FALSE, main = "Distribution of class average yearly hospital charge", xlab = "Charge (in 1000 dollars)")
A. What is the probability that a randomly selected class of size 30 students will have less than 10 students with zero yearly hospital charges?
Solution
new <- c()
for (i in 1:1000) {
r <- hospital_charges(30)
new[i] <- length(r[r == 0])
}
new_less_10 <- new[new < 10]
prob_less_10 <- length(new_less_10)/length(new)
sprintf("The probability that a randomly selected class of size 30 students will have less than 10 students with zero yearly hospital charges is: %.2f", prob_less_10)
## [1] "The probability that a randomly selected class of size 30 students will have less than 10 students with zero yearly hospital charges is: 0.02"
C. Suppose diastolic blood pressure (DBP) follows a normal distribution with mean 80 mmHg and SD 15 mmHg. What is the probability that a randomly sampled person’s DBP lies between 70 and 104 mmHg?
Solution
prob <- pnorm(104, mean = 80, sd = 15) - pnorm(70, mean = 80, sd = 15)
sprintf("The probability that a randomly sampled person's DBP lies between 70 and 104 mmHg is: %.2f", prob)
## [1] "The probability that a randomly sampled person's DBP lies between 70 and 104 mmHg is: 0.69"
B. Suppose a human femur was discovered that is 37 cm long. Also suppose that using the NHANES data, researchers believe the distribution of femur bones, by sex, are distributed as follows:
Under the assumption that male and females are equally likely, what is the probability that the discovered femur was from a male? Solution
\[P(Male | Length = 37) = P(Male and Length = 37)/(P(Male and Length = 37)+ P(Female and Length = 37))\]
prob_female_37 <- dnorm(37, mean = 36, sd = 3.3)
prob_male_37 <- dnorm(37, mean = 40, sd = 3.4)
prob_male <- prob_male_37/(prob_male_37 + prob_female_37)
sprintf("The probability that the femur was from a male is: %.2f", prob_male)
## [1] "The probability that the femur was from a male is: 0.41"
A. Continuing part B, generate a plot of P(femur from male | femur length = x). Let femur length range from 25 to 50.
femur_length <- 25:50
#prob_male <- ???
plot.new()
plot.window(xlim = c(25,50), ylim = c(0,1))
#lines(femur_length, prob_male)
axis(1)
axis(2)
box()
title(xlab = "Femur Length", ylab = "P(Male | femur length)")
Solution
femur_length <- 25:50
prob_male <- dnorm(femur_length, mean = 40, sd = 3.4)/(dnorm(femur_length, mean = 40, sd = 3.4) + dnorm(femur_length, mean = 36, sd = 3.3))
plot.new()
plot.window(xlim = c(25,50), ylim = c(0,1))
lines(femur_length, prob_male)
axis(1)
axis(2)
box()
title(xlab = "Femur Length", ylab = "P(Male | femur length)",
main = "As Femur Length increases, the higher the chances of it belonging to a male")
Let us revisit the yearly hospital charges distribution from a previous section.
Recall: The yearly hospital charges (in thousands of dollars) for a randomly selected Vanderbilt student is a mixture distribution. For 50% of students, the hospital charges will be $0. For the remaining 50% of students, the hospital charges are a random variable described by a gamma distribution with shape = 2 and scale = 2. (Again, in thousands of dollars.)
hospital_charges <- function(N){
group <- rbinom(N, 1, 0.5)
charges <- 0*group + rgamma(N, shape = 2, scale = 2)*(1-group)
charges
}
C. What is E[yearly hospital charges]?
Solution
set.seed(0)
m <- hospital_charges(13500)
sprintf("The E[yearly hospital charges] is : %.2f", mean(m))
## [1] "The E[yearly hospital charges] is : 1.98"
B. Suppose Vanderbilt implements a cap of $10,000 on yearly student hospital charges. What is the mean yearly hospital charge under the new policy?
Solution
v <- m
for (i in seq_along(v)) {
if (v[i] > 10) {
v[i] <- 10
}
}
sprintf("The new E[yearly hospital charges] after a cap of $10000 is : %.2f", mean(v))
## [1] "The new E[yearly hospital charges] after a cap of $10000 is : 1.93"
A. What is the variance of yearly hospital charge under the new policy?
Solution
sprintf("The Var[yearly hospital charges] after a cap of $10000 is : %.2f", var(v))
## [1] "The Var[yearly hospital charges] after a cap of $10000 is : 6.90"
C. Consider the log normal distribution. If X is a log normal random variable, then log(X) is a normal random variable. One way to create pseudo-random draws from the log normal distribution is to generate draws from a normal distribution and then to transform the draws by exponentiating. The parameters of the log normal distribution are the parameters of the underlying normal distribution, \(\mu\) and \(\sigma\) (or \(\sigma^2\)).
Log normal data are prevalent is biological systems and econometrics.
Suppose a blood chemistry measure has a log normal distribution with \(\mu\) = 0 and \(\sigma\) = 1. Generate an histogram or density curve for the sampling distribution of the median when the sample size is 101.
Solution
vec <- c()
for (i in 1:10000) {
x <- rlnorm(101, mean = 0, sd = 1)
vec[i] <- median(x)
}
hist(vec, xlab = "Median",
main = "A sample distribution of the median of a standard lognormal disribution with 101 values")
B. Below is the CDF function for the kth order statistic when the underlying distribution is log normal with \(\mu\) = 0 and \(\sigma\) = 1. Create a plot of the ECDF of the simulated sampling distribution generated in C and overlay the CDF using the function below.
Fk <- function(x,k,n){
pbinom(k-1, n, plnorm(x), lower.tail = FALSE)
}
plot(ecdf(vec), main = "Cumulative distribution function", xlab = "Median",
ylab = "percentile")
curve(Fk(x, 51, 101), add = TRUE, col = "blue")
A. Of the 25th, 50th, and 75th quantiles of the distribution from B, which will have the tightest 95% CI? (Show the sampling distribution of each.)
Solution
vec_25 <- c()
vec_50 <- c()
vec_75 <- c()
for (i in 1:10000) {
x <- rlnorm(101, mean = 0, sd = 1)
vec_25[i] <- quantile(x, c(0.25))
vec_50[i] <- median(x)
vec_75[i] <- quantile(x, c(0.75))
}
hist(vec_25, xlab = "25th quantile",
main = "A sample distribution of the 25th quantile of a standard lognormal disribution with 101 values")
hist(vec_50, xlab = "Median",
main = "A sample distribution of the median of a standard lognormal disribution with 101 values")
hist(vec_75, xlab = "75th quantile",
main = "A sample distribution of the 75th quantile of a standard lognormal disribution with 101 values")
The sampling distribution of the 25th percentile is going to have the tightest 95% confidence interval as it has a range of about 0.5 unlike those of the median and 75th percentile which are higher.
The following code will load the NHANES data and select the first 500 rows.
d1 <- nhgh[1:500,]
C. Estimate the distribution of standing height for adult (age > 18) males using the MLE method with a normal distribution. Create a plot of the estimated density function.
Solution
d1_male_18 <- d1 %>%
filter(sex == "male" & age > 18) %>%
select(ht) %>%
pull()
mean = mean(d1_male_18)
sd = sd(d1_male_18)
nLL <- function(mean, sd){
fs <- dnorm(
x = d1_male_18
, mean = mean
, sd = sd
, log = TRUE
)
-sum(fs)
}
fit <- mle(nLL, start = list(mean=1,sd=1),method = "L-BFGS-B",
lower = c(0, 0.01))
hist(d1_male_18, freq = FALSE, main = "Distribution of adult males' heights",
xlab = "Height", ylim = c(0, 0.06))
curve(dnorm(x, mean = coef(fit)[1], sd = coef(fit)[2]), add = TRUE, col = "blue")
B. Estimate the distribution of BMI for adult (age > 18) females using using the method of moment method with the gamma distribution. Create a plot of the estimated density function.
Solution
d1_female_18 <- d1 %>%
filter(sex == "female" & age > 18) %>%
select(bmi) %>%
pull()
mean = mean(d1_female_18)
sd = sd(d1_female_18)
var = var(d1_female_18)
shape = mean^2/var
scale = var/mean
hist(d1_female_18, freq = FALSE, main = "Distribution of adult females' bmi",
xlab = "BMI", ylim = c(0, 0.08))
curve(dgamma(x, shape = shape, scale = scale), add = TRUE, col = "blue")
A. Estimate the distribution of creatinine (SCr) for adults (age > 18) using the kernel density method with a Gaussian kernel. Create a plot of the estimated density function.
Solution
d1_18 <- d1 %>%
filter(age > 18) %>%
select(SCr) %>%
pull()
d1_18 <- d1_18 %>%
na.omit()
plot(density(d1_18, adjust = 8, kernel = "gaussian"), col = "blue", main = "Distribution of adults' creatinine", xlab = "creatinine")
hist(d1_18, freq = FALSE, add = TRUE)
The following code will load the low birth weight data from the MASS package. The description of the variables in the dataset can be found in the birthwt documentation with the command ?MASS::birthwt.
bwt <- MASS::birthwt
sbwt <- bwt %>%
filter(smoke == 1) %>%
pull(bwt)
C. Generate a 95% confidence interval for the mean birthweight of infants whose mothers did smoke during pregnancy using the bootstrap.
Solution
n_bwt <- length(sbwt)
M <- 5000
out <- rep(NA, M)
for(i in 1:M){
index <- sample.int(n_bwt, n_bwt, replace = TRUE)
out[i] <- sbwt[index] %>% mean
}
quantile(out, c(0.025, 0.975))
## 2.5% 97.5%
## 2621.306 2917.778
B. Generate a 95% confidence interval for the mean birthweight of infants whose mothers did smoke during pregnancy using the Central Limit Theorem shortcut.
Solution
t.test(sbwt, mu = mean(sbwt))
##
## One Sample t-test
##
## data: sbwt
## t = 0, df = 73, p-value = 1
## alternative hypothesis: true mean is not equal to 2771.919
## 95 percent confidence interval:
## 2619.094 2924.744
## sample estimates:
## mean of x
## 2771.919
A. Let \(\mu_s\) be the mean birthweight of infants whose mothers smoked during pregnancy. Let \(\mu_{ns}\) be the mean for the non-smoking group. Use simulation to calculate the 95% confidence interval for \(\mu_s/\mu_{ns}\).
nsbwt <- bwt %>%
filter(smoke == 0) %>%
pull(bwt)
mean_n <- mean(nsbwt)
sd_n <- sd(nsbwt)
sbwt <- bwt %>%
filter(smoke == 1) %>%
pull(bwt)
mean_s <- mean(sbwt)
sd_s <- sd(sbwt)
new <- c()
for (i in 1:10000) {
s <- rnorm(length(sbwt), mean = mean_s, sd = sd_s)
ns <- rnorm(length(nsbwt), mean = mean_n, sd = sd_n)
new[i] <- mean(s)/mean(ns)
}
quantile(new, c(0.025, 0.975))
## 2.5% 97.5%
## 0.8442297 0.9734500
C. Suppose two studies were performed looking at the risk of mild complication after hernia repair using open and laparoscopic surgical approaches. The study results are below. Using the data from each study individually, perform the hypothesis test that the risk of complication between open and laparoscopic repairs are the same under the usual point null. What is the p-value from each study? What do you conclude from each study?
| Study 1 | Comp | No comp |
|---|---|---|
| Open | 30 | 70 |
| Lap | 35 | 65 |
| Study 2 | Comp | No comp |
|---|---|---|
| Open | 600 | 1400 |
| Lap | 619 | 1381 |
Solution
p_value_calc <- function(x) {
if (x < 0) {
pnorm(q=x, lower.tail=TRUE)
} else {
pnorm(q=x, lower.tail=FALSE)
}
}
p_open1 <- 30/100
p_lap1 <- 35/100
p_combo1 <- 65/200
mean_diff1 <- p_lap1 - p_open1
s_error1 <- sqrt(p_combo1*(1-p_combo1)*(2*(1/100)))
z1 <- mean_diff1/s_error1
sprintf("The p-value of the first study: %.2f", 2*p_value_calc(z1))
## [1] "The p-value of the first study: 0.45"
p_open2 <- 600/2000
p_lap2 <- 619/2000
p_combo2 <- 1219/40000
mean_diff2 <- p_lap2 - p_open2
s_error2 <- sqrt(p_combo2*(1-p_combo2)*(2*(1/2000)))
z2 <- mean_diff2/s_error2
sprintf("The p-value of the second study: %.2f", 2*p_value_calc(z2))
## [1] "The p-value of the second study: 0.08"
For both studies, the p-values are both above 0.05. This leads us to conclude that we have to fail to reject the null hypothesis given that the risk of complication from both open and laparoscopc surgical approaches are the same.
B. Suppose that prior to the studies, the researchers established an equivalence threshold of 6 percentage points. Using the confidence intervals, which studies (if any) showed a conclusive similarity between surgical approaches for the complication rate. Explain why.
Solution
study1 <- (prop.test(35, 100)$conf.int - prop.test(30, 100)$conf.int)*100
study2 <- (prop.test(619, 2000)$conf.int - prop.test(600, 2000)$conf.int)*100
sprintf("The confidence interval of the first study is between %.2f and %.2f percent", study1[1], study1[2])
## [1] "The confidence interval of the first study is between 4.46 and 5.15 percent"
sprintf("The confidence interval of the second study is between %.2f and %.2f percent", study2[1], study2[2])
## [1] "The confidence interval of the second study is between 0.93 and 0.97 percent"
Based on the two confidence intervals, it can be confirmed that both studies displayed conclsive similarity when compared with the 6 percentage points threshold. This arises from the fact that the differences between each the confidence interval values from both studies yielded values that were not lower than negative 6 percent on the lower bound and not greater than positive 6 percent on the upper bound. This means that both studies yielded results in which the risks of complication from both open and laparoscopic approaches are different.
A. If the data from the studies were combined, what is the smallest equivalence threshold that would identify a conclusive similarity between the surgical approaches?
Solution
| Study 1 | Comp | No comp |
|---|---|---|
| Open | 30 | 70 |
| Lap | 35 | 65 |
| Study 2 | Comp | No comp |
|---|---|---|
| Open | 600 | 1400 |
| Lap | 619 | 1381 |
| Study combined | Comp | No comp |
|---|---|---|
| Open | 630 | 1470 |
| Lap | 654 | 1446 |
study_comb <- (prop.test(654, 2100)$conf.int - prop.test(630, 2100)$conf.int)*100
sprintf("The confidence interval of the first study is between %.2f and %.2f", study_comb[1], study_comb[2])
## [1] "The confidence interval of the first study is between 1.12 and 1.16"
Given the fact that both the range lies from 1.12 to 1.16, a 1.2 percentage point threshold should be enough to establish conclusive similarity in the combined study as both values will fall in between -1.2 and 1.2 regardless of how they are substracted from each other.
C. Fill in the blank. The sample correlation is a measure of linear association.
B. Explain why predictions from a conditional distribution generally have smaller prediction error than predictions from the marginal distribution.
Solution
This is because outcomes of conditional distribution tend to have a smaller variance compared to those of marginal distribution. Hence, this makes the predictions related to conditional distribution probability more accurate because the predicted outcomes are likelier to be closer to the actual measurements. Moreover, smaller variance indicates smaller prediction error.
A. Use the CLT shortcut to calculate the 95% confidence interval for the correlation of arm circumference and arm length using the NHANES dataset. Is the sample correlation a reasonable measure of association for this data?
Solution
cor.test(nhgh$armc, nhgh$arml)
##
## Pearson's product-moment correlation
##
## data: nhgh$armc and nhgh$arml
## t = 46.838, df = 6601, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4811147 0.5173282
## sample estimates:
## cor
## 0.4994396
Sample correlation shows that there is a moderate relationship between the two variables. It can be inferred that the correlation is true according to its p-value being < 0.05. However, correlation is not strong enough to establish a powerful relationship between the two variables. Hence, it is not a reasonable measure of association. Moreover, the r-squared score of the relationship is 0.25 which means arm circumference can only explain 25% of arm length and vice-versa. Thus, this makes it hard to build an accurate model between these for predicting any of the two variables based on the other one.