The due date for this exam is Friday, April 07, by 2:00 PM. Late submissions will not be accepted except in exceptional circumstances. Consequently, you should plan on submitting before the due date.
This exam consists of 9 problems. Each problem will be graded out of a maximum of 5 points.
For this exam, I would prefer that you upload a single knit document (either .html or .pdf). Failing that, you may upload your R Markdown file (.Rmd) or a collection of several files. In any case, it should be obvious to me where to find the answer to each exam question.
The R Markdown document used to write this exam will be provided as a template.
About 10 years ago, a researcher made headlines with the following claim: attractive parents are more likely to have daughters than sons.
The claim was based on a survey of ~3,000 American parents. Researchers rated the attractiveness of each parent on a 5-point scale ranging from 1 (very unattractive) to 5 (very attractive). It turned out that 56% of the children of parents in the highest attractiveness category were girls, compared to 48% of the children of parents in the other categories. The finding was widely reported at the time, and the result continues to be cited today.
The argument goes that being beautiful is a more useful trait for girls than for boys, and if beauty is heritable, then natural selection might find a way to increase the number of girls born with this beneficial trait. Perhaps needless to say, the result is also controversial.
The results of the original study are straightforward, given as a table and plotted below:
library("knitr")
attractiveness <- 1:5
sexratio <- c(50, 44, 50, 47, 56)
df <- data.frame(attractiveness, sexratio)
fig <- ggplot(df) +
geom_point(aes(x = attractiveness,
y = sexratio)) +
xlim(c(0.5, 5.5)) + ylim(c(40, 60)) +
xlab("Attractiveness of parent (5-point scale)") +
ylab("Percentage of girl babies") +
theme_bw()
print(fig)
# Show the dataframe as a table
kable(df, caption = "Attractiveness and sex ratio")
| attractiveness | sexratio |
|---|---|
| 1 | 50 |
| 2 | 44 |
| 3 | 50 |
| 4 | 47 |
| 5 | 56 |
One way of analyzing this data is simply to fit a linear model. If there is a relationship between a parent’s attractiveness and the percentage of daughters, then the slope of the linear relationship should be positive.
For this problem, fit a line to the data plotted above, using numerical optimization to minimize the sum of squared error (SSE).
What is the estimated slope of the relationship?
Your solution should make use of the optim() function.
# Data to fit
x <- attractiveness
y <- sexratio
# Objective function: sum of squared errors for the line y = a * x + b
obj_fn <- function(params, x, y) {
  a <- params[1]  # slope
  b <- params[2]  # intercept
  y_pred <- a * x + b      # predicted values
  sum((y - y_pred)^2)      # sum of squared error (SSE)
}
# Minimize the SSE numerically, starting from (a, b) = (1, 1)
results <- optim(c(1, 1), obj_fn,
                 x = attractiveness, y = sexratio)
print(results)
## $par
## [1] 1.498387 44.906024
##
## $value
## [1] 56.70003
##
## $counts
## function gradient
## 221 NA
##
## $convergence
## [1] 0
##
## $message
## NULL
fig <- fig +
geom_abline(slope = results$par[1],
intercept = results$par[2],
color = "red")
print(fig)
model <- lm(y ~ x) # we are trying to predict y (sexratio) using x (attractiveness)
print("Estimated slope of the relationship: ")
## [1] "Estimated slope of the relationship: "
print(model)
##
## Call:
## lm(formula = y ~ x)
##
## Coefficients:
## (Intercept) x
## 44.9 1.5
#cat("Estimated slope of the relationship: ",model)
Fit a line to the same data, except this time using linear algebra to directly compute the parameters that minimize SSE via the normal equation.
What is the estimated slope of the relationship?
Your solution should make use of linear algebra operations only; i.e., it should not make use of optim().
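For reference, writing the model in matrix form with a design matrix \(\mathbf{X}\) whose first column is all ones (for the intercept) and whose second column holds the attractiveness values, the SSE is minimized where its gradient with respect to the parameters is zero. This yields the normal equation and its closed-form solution:
\[ \mathbf{X}^T \mathbf{X} \, \hat{\boldsymbol{\beta}} = \mathbf{X}^T \mathbf{y} \quad \Longrightarrow \quad \hat{\boldsymbol{\beta}} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}, \]
where \(\hat{\boldsymbol{\beta}}\) contains the estimated intercept and slope.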
# Compute the least-squares parameters directly via the normal equation:
#   beta = (X^T X)^{-1} X^T y
X <- cbind(1, attractiveness)  # design matrix: intercept column + attractiveness
beta <- drop(solve(t(X) %*% X) %*% t(X) %*% sexratio)
names(beta) <- c("(Intercept)", "attractiveness")
slope <- beta  # beta[1] = intercept, beta[2] = slope
fig <- fig +
geom_abline(slope = slope[2],
intercept = slope[1],
color = "red")
print(fig)
print(slope)
## (Intercept) attractiveness
## 44.9 1.5
We will next fit a line to the same data, except this time using Bayesian linear regression via Stan.
Your model should be based on the following:
\[ (\textrm{Percent female}) = w_0 + w_1 \times (\textrm{Attractiveness}) + \eta \]
where \(\eta\) is zero-mean Gaussian noise with standard deviation \(\sigma\), e.g., \(\eta \sim \textrm{Normal}(0, \sigma)\).
An equivalent way of writing this is:
\[ (\textrm{Percent female}) \sim \textrm{Normal}(w_0 + w_1 \times (\textrm{Attractiveness}), \sigma) \] This model has three parameters:

- \(w_0\), the intercept
- \(w_1\), the slope
- \(\sigma\), the standard deviation of the noise
For this problem (Problem 3), choose priors for each of these three parameters. For each parameter, state the prior you have chosen, and give a one- or two-sentence justification for why your choice of prior is appropriate.
- \(w_0 \sim \textrm{Normal}(44.9, 1)\): the fits above put the intercept near 44.9, so the prior is centered there, with an SD of 1 chosen for simplicity; this also reflects the low variance in these values. (If we did not have a prior estimate of the mean, a weakly informative prior such as \(\textrm{Normal}(0, 10)\) would be a reasonable choice.)
- \(w_1 \sim \textrm{Normal}(1.5, 0.1)\): centered on the fitted slope of 1.5, with an SD of 0.1 chosen for simplicity, reflecting the scale of our small dataset.
- \(\sigma \sim \textrm{Half-Normal}(0, 1)\): concentrates on small noise levels, which I expect given our small dataset. (In the Stan model below this is written as sigma ~ normal(0, 1); the lower=0 constraint on sigma makes it half-normal.)
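As a quick, optional sanity check of these priors (not required by the problem), here is a minimal prior-predictive sketch: draw parameters from the priors stated above and look at the implied percent-female values at the extremes of the attractiveness scale. Reasonable priors should keep most simulated values in a plausible range (roughly 40-60%).

# Prior-predictive sketch for the priors stated above (assumed forms:
# w0 ~ Normal(44.9, 1), w1 ~ Normal(1.5, 0.1), sigma ~ Half-Normal(0, 1))
set.seed(1)
n_draws <- 1000
w0_prior <- rnorm(n_draws, 44.9, 1)
w1_prior <- rnorm(n_draws, 1.5, 0.1)
sigma_prior <- abs(rnorm(n_draws, 0, 1))  # half-normal via |Normal(0, 1)|
# Implied percent-female at attractiveness 1 and 5
prior_pred_1 <- rnorm(n_draws, w0_prior + w1_prior * 1, sigma_prior)
prior_pred_5 <- rnorm(n_draws, w0_prior + w1_prior * 5, sigma_prior)
quantile(prior_pred_1, c(0.025, 0.975))
quantile(prior_pred_5, c(0.025, 0.975))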
Write a Stan model that implements your model and choice of priors.
Perform Bayesian inference using the data above and your Stan model. Generate at least 10,000 samples from the posterior.
What is the 95% highest-density credible interval (HDI) for the slope of the model?
Does your analysis support the conclusion that attractive parents are more likely to have daughters?
set.seed(43)
library(rstan)
## Loading required package: StanHeaders
## rstan (Version 2.21.8, GitRev: 2e1f913d3ca3)
## For execution on a local, multicore CPU with excess RAM we recommend calling
## options(mc.cores = parallel::detectCores()).
## To avoid recompilation of unchanged Stan programs, we recommend calling
## rstan_options(auto_write = TRUE)
##
## Attaching package: 'rstan'
## The following object is masked from 'package:tidyr':
##
## extract
# Data passed to the Stan model
data <- list(
num = 5,
attractiveness = c(1, 2, 3, 4, 5),
sexratio = c(50, 44, 50, 47, 56)
)
bayes_inf_model <- "
data {
int<lower=0> num;
vector[num] attractiveness;
vector[num] sexratio;
}
parameters {
real w0;
real w1;
real<lower=0> sigma;
}
model {
w0 ~ normal(44.9, 1);
w1 ~ normal(1.5, 0.1);
sigma ~ normal(0, 1);
sexratio ~ normal(w0 + w1 * attractiveness, sigma);
}
"
# Compile the Stan model from the model string above
model <- stan_model(model_code = bayes_inf_model)
# Draw posterior samples: 4 chains of 10,000 iterations each (first half warmup)
fit <- sampling(model, data = data, iter = 10000, chains = 4)
##
## SAMPLING FOR MODEL '5de75d5b918e4acf18c680446d14c8ae' NOW (CHAIN 1).
## Chain 1:
## Chain 1: Gradient evaluation took 8e-06 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.08 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1:
## Chain 1:
## Chain 1: Iteration: 1 / 10000 [ 0%] (Warmup)
## Chain 1: Iteration: 1000 / 10000 [ 10%] (Warmup)
## Chain 1: Iteration: 2000 / 10000 [ 20%] (Warmup)
## Chain 1: Iteration: 3000 / 10000 [ 30%] (Warmup)
## Chain 1: Iteration: 4000 / 10000 [ 40%] (Warmup)
## Chain 1: Iteration: 5000 / 10000 [ 50%] (Warmup)
## Chain 1: Iteration: 5001 / 10000 [ 50%] (Sampling)
## Chain 1: Iteration: 6000 / 10000 [ 60%] (Sampling)
## Chain 1: Iteration: 7000 / 10000 [ 70%] (Sampling)
## Chain 1: Iteration: 8000 / 10000 [ 80%] (Sampling)
## Chain 1: Iteration: 9000 / 10000 [ 90%] (Sampling)
## Chain 1: Iteration: 10000 / 10000 [100%] (Sampling)
## Chain 1:
## Chain 1: Elapsed Time: 0.048062 seconds (Warm-up)
## Chain 1: 0.050717 seconds (Sampling)
## Chain 1: 0.098779 seconds (Total)
## Chain 1:
##
## SAMPLING FOR MODEL '5de75d5b918e4acf18c680446d14c8ae' NOW (CHAIN 2).
## Chain 2:
## Chain 2: Gradient evaluation took 6e-06 seconds
## Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0.06 seconds.
## Chain 2: Adjust your expectations accordingly!
## Chain 2:
## Chain 2:
## Chain 2: Iteration: 1 / 10000 [ 0%] (Warmup)
## Chain 2: Iteration: 1000 / 10000 [ 10%] (Warmup)
## Chain 2: Iteration: 2000 / 10000 [ 20%] (Warmup)
## Chain 2: Iteration: 3000 / 10000 [ 30%] (Warmup)
## Chain 2: Iteration: 4000 / 10000 [ 40%] (Warmup)
## Chain 2: Iteration: 5000 / 10000 [ 50%] (Warmup)
## Chain 2: Iteration: 5001 / 10000 [ 50%] (Sampling)
## Chain 2: Iteration: 6000 / 10000 [ 60%] (Sampling)
## Chain 2: Iteration: 7000 / 10000 [ 70%] (Sampling)
## Chain 2: Iteration: 8000 / 10000 [ 80%] (Sampling)
## Chain 2: Iteration: 9000 / 10000 [ 90%] (Sampling)
## Chain 2: Iteration: 10000 / 10000 [100%] (Sampling)
## Chain 2:
## Chain 2: Elapsed Time: 0.047083 seconds (Warm-up)
## Chain 2: 0.050471 seconds (Sampling)
## Chain 2: 0.097554 seconds (Total)
## Chain 2:
##
## SAMPLING FOR MODEL '5de75d5b918e4acf18c680446d14c8ae' NOW (CHAIN 3).
## Chain 3:
## Chain 3: Gradient evaluation took 5e-06 seconds
## Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 0.05 seconds.
## Chain 3: Adjust your expectations accordingly!
## Chain 3:
## Chain 3:
## Chain 3: Iteration: 1 / 10000 [ 0%] (Warmup)
## Chain 3: Iteration: 1000 / 10000 [ 10%] (Warmup)
## Chain 3: Iteration: 2000 / 10000 [ 20%] (Warmup)
## Chain 3: Iteration: 3000 / 10000 [ 30%] (Warmup)
## Chain 3: Iteration: 4000 / 10000 [ 40%] (Warmup)
## Chain 3: Iteration: 5000 / 10000 [ 50%] (Warmup)
## Chain 3: Iteration: 5001 / 10000 [ 50%] (Sampling)
## Chain 3: Iteration: 6000 / 10000 [ 60%] (Sampling)
## Chain 3: Iteration: 7000 / 10000 [ 70%] (Sampling)
## Chain 3: Iteration: 8000 / 10000 [ 80%] (Sampling)
## Chain 3: Iteration: 9000 / 10000 [ 90%] (Sampling)
## Chain 3: Iteration: 10000 / 10000 [100%] (Sampling)
## Chain 3:
## Chain 3: Elapsed Time: 0.047879 seconds (Warm-up)
## Chain 3: 0.054049 seconds (Sampling)
## Chain 3: 0.101928 seconds (Total)
## Chain 3:
##
## SAMPLING FOR MODEL '5de75d5b918e4acf18c680446d14c8ae' NOW (CHAIN 4).
## Chain 4:
## Chain 4: Gradient evaluation took 4e-06 seconds
## Chain 4: 1000 transitions using 10 leapfrog steps per transition would take 0.04 seconds.
## Chain 4: Adjust your expectations accordingly!
## Chain 4:
## Chain 4:
## Chain 4: Iteration: 1 / 10000 [ 0%] (Warmup)
## Chain 4: Iteration: 1000 / 10000 [ 10%] (Warmup)
## Chain 4: Iteration: 2000 / 10000 [ 20%] (Warmup)
## Chain 4: Iteration: 3000 / 10000 [ 30%] (Warmup)
## Chain 4: Iteration: 4000 / 10000 [ 40%] (Warmup)
## Chain 4: Iteration: 5000 / 10000 [ 50%] (Warmup)
## Chain 4: Iteration: 5001 / 10000 [ 50%] (Sampling)
## Chain 4: Iteration: 6000 / 10000 [ 60%] (Sampling)
## Chain 4: Iteration: 7000 / 10000 [ 70%] (Sampling)
## Chain 4: Iteration: 8000 / 10000 [ 80%] (Sampling)
## Chain 4: Iteration: 9000 / 10000 [ 90%] (Sampling)
## Chain 4: Iteration: 10000 / 10000 [100%] (Sampling)
## Chain 4:
## Chain 4: Elapsed Time: 0.058467 seconds (Warm-up)
## Chain 4: 0.052005 seconds (Sampling)
## Chain 4: 0.110472 seconds (Total)
## Chain 4:
print(fit)
## Inference for Stan model: 5de75d5b918e4acf18c680446d14c8ae.
## 4 chains, each with iter=10000; warmup=5000; thin=1;
## post-warmup draws per chain=5000, total post-warmup draws=20000.
##
## mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
## w0 44.90 0.01 0.75 43.44 44.40 44.90 45.40 46.37 18776 1
## w1 1.50 0.00 0.10 1.31 1.44 1.50 1.57 1.69 18450 1
## sigma 2.50 0.00 0.44 1.77 2.18 2.45 2.77 3.47 21504 1
## lp__ -12.78 0.01 1.21 -15.88 -13.33 -12.47 -11.90 -11.41 10301 1
##
## Samples were drawn using NUTS(diag_e) at Fri Apr 7 17:03:21 2023.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
# Posterior samples of the slope
samples_w1 <- extract(fit)$w1
# Central 95% credible interval for the slope; the posterior of w1 is roughly
# symmetric, so this equal-tailed interval closely approximates the 95% HDI
hdi_w1 <- quantile(samples_w1, probs = c(0.025, 0.975))
cat("95% HDI for the slope is: ", hdi_w1[1], " to ", hdi_w1[2])
## 95% HDI for the slope is: 1.310319 to 1.692236
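For completeness, a true HDI can be computed as the shortest interval containing 95% of the posterior draws; for the near-symmetric posterior of w1 it nearly coincides with the equal-tailed interval above. A minimal sketch:

# Shortest interval containing `prob` of the samples (a sample-based HDI)
hdi_from_samples <- function(samples, prob = 0.95) {
  sorted <- sort(samples)
  n <- length(sorted)
  width <- ceiling(prob * n)                 # number of samples per interval
  starts <- 1:(n - width)
  widths <- sorted[starts + width] - sorted[starts]
  best <- which.min(widths)                  # shortest candidate interval wins
  c(lower = sorted[best], upper = sorted[best + width])
}
hdi_from_samples(samples_w1)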
Based on the results you obtained, what is the probability that the slope of the linear relationship is greater than zero? I.e., what is \(p(w_1 > 0)\)?
cat("the probability that the slope(w1) > 0 is ", mean(samples_w1 > 0)) # positive slope
## the probability that the slope(w1) > 0 is 1
Generate a plot that shows the data as black plot markers (just like in the Background section of this document), along with a thick red line showing the linear model you determined in Problem 2, and 100 thin gray lines showing 100 random samples from the posterior distribution obtained in Problem 4.
data <- data.frame(attractiveness, sexratio)
print(data)
## attractiveness sexratio
## 1 1 50
## 2 2 44
## 3 3 50
## 4 4 47
## 5 5 56
# Red line: the least-squares fit from Problem 2
w0_hat <- coef(lm(sexratio ~ attractiveness))[1]
w1_hat <- coef(lm(sexratio ~ attractiveness))[2]
# Gray lines: 100 random joint (w0, w1) draws from the posterior
posterior <- extract(fit)
idx <- sample(length(posterior$w1), 100)
post_lines <- data.frame(w0 = posterior$w0[idx], w1 = posterior$w1[idx])
fig <- ggplot(data, aes(x = attractiveness, y = sexratio)) +
  geom_abline(data = post_lines,
              aes(slope = w1, intercept = w0),
              color = "gray", linewidth = 0.3, alpha = 0.5) +
  geom_abline(slope = w1_hat, intercept = w0_hat,
              color = "red", linewidth = 1.5) +
  geom_point(color = "black", size = 3) +
  xlab("Attractiveness of parent (5-point scale)") +
  ylab("Percentage of girl babies") +
  theme_bw()
print(fig)
For this problem, consider the following probabilistic graphical model (PGM), defined over three random variables, \(a\), \(b\), and \(c\), connected in a chain \(a \rightarrow b \rightarrow c\):
Assume that \(p(a)\), \(p(b \mid a)\), and \(p(c \mid b)\) are known. Assume that \(a\), \(b\), and \(c\) are all continuous random variables.
What is \(p(b \mid a, c)\)?
Problem 7 ANSWER:
The PGM implies that the joint distribution factorizes as
\[ p(a, b, c) = p(c \mid b) \, p(b \mid a) \, p(a) \]
By the definition of conditional probability,
\[ p(b \mid a, c) = \frac{p(a, b, c)}{p(a, c)} \]
The denominator follows from the sum rule, marginalizing the joint over \(b\):
\[ p(a, c) = \int p(c \mid b) \, p(b \mid a) \, p(a) \, db \]
Substituting these expressions gives the answer:
\[ p(b \mid a, c) = \frac{p(c \mid b) \, p(b \mid a) \, p(a)}{\int p(c \mid b) \, p(b \mid a) \, p(a) \, db} = \frac{p(c \mid b) \, p(b \mid a)}{\int p(c \mid b) \, p(b \mid a) \, db}, \]
where the second form follows because \(p(a)\) does not depend on \(b\) and cancels.
As in HW#4, your answer should reference only the known quantities above. However, it may involve integrals of these quantities.
Referencing the same PGM as in the previous problem, what is \(p(a \mid c)\)?
Problem 8 ANSWER:
The joint distribution again factorizes as
\[ p(a, b, c) = p(c \mid b) \, p(b \mid a) \, p(a) \]
and by the definition of conditional probability,
\[ p(a \mid c) = \frac{p(a, c)}{p(c)} \]
The sum rule gives both the numerator and the denominator:
\[ p(a, c) = \int p(c \mid b) \, p(b \mid a) \, p(a) \, db \]
\[ p(c) = \iint p(c \mid b) \, p(b \mid a) \, p(a) \, da \, db \]
Substituting into \(p(a \mid c)\):
\[ p(a \mid c) = \frac{\int p(c \mid b) \, p(b \mid a) \, p(a) \, db}{\iint p(c \mid b) \, p(b \mid a) \, p(a) \, da \, db} \]
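As an optional numeric sanity check of this formula, here is a small discrete version, with made-up probability tables standing in for the known densities and sums playing the role of the integrals:

# Made-up discrete distributions standing in for the known densities
p_a <- c(0.3, 0.7)                 # p(a), a in {1, 2}
p_b_given_a <- rbind(c(0.6, 0.4),  # p(b | a): rows index a, columns index b
                     c(0.2, 0.8))
p_c_given_b <- rbind(c(0.9, 0.1),  # p(c | b): rows index b, columns index c
                     c(0.5, 0.5))
# Joint distribution: p(a, b, c) = p(c | b) p(b | a) p(a)
joint <- array(0, dim = c(2, 2, 2))
for (a in 1:2) for (b in 1:2) for (cc in 1:2) {
  joint[a, b, cc] <- p_c_given_b[b, cc] * p_b_given_a[a, b] * p_a[a]
}
# p(a | c = 1) from the formula above: sums replace the integrals
c_obs <- 1
numer <- sapply(1:2, function(a) sum(joint[a, , c_obs]))  # p(a, c)
p_a_given_c <- numer / sum(numer)                         # divide by p(c)
print(p_a_given_c)  # a valid distribution over a; sums to 1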
Assume that \(\mathbf{A}\), \(\mathbf{B}\), \(\mathbf{C}\) and \(\mathbf{D}\) are square invertible matrices with conformable dimensions.
If \(\mathbf{A} = (\mathbf{B}^T \mathbf{C})^{-1} \mathbf{D}^T\), then solve for each of \(\mathbf{B}\), \(\mathbf{C}\), and \(\mathbf{D}\) in terms of the remaining matrices.
Provide your answers in the form of syntactically correct R code.
A <- diag(3)
B <- diag(3)
C <- diag(3)
D <- diag(3)
# From A = (B^T C)^{-1} D^T it follows that B^T C A = D^T. Solving for each
# matrix in turn (using new names so the original matrices are not overwritten):
B_solution <- solve(t(C), solve(t(A)) %*% D)   # B = C^{-T} A^{-T} D
C_solution <- solve(t(B), t(D) %*% solve(A))   # C = B^{-T} D^T A^{-1}
D_solution <- t(A) %*% t(C) %*% B              # D = A^T C^T B
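As an optional sanity check (identity test matrices would mask errors, since every expression above evaluates to the identity), we can construct \(\mathbf{A}\) from random matrices via the given identity and confirm that each formula recovers the original matrix:

set.seed(1)
B0 <- matrix(rnorm(9), 3, 3)  # random square matrices are almost surely invertible
C0 <- matrix(rnorm(9), 3, 3)
D0 <- matrix(rnorm(9), 3, 3)
A0 <- solve(t(B0) %*% C0) %*% t(D0)                # A = (B^T C)^{-1} D^T
all.equal(solve(t(C0), solve(t(A0)) %*% D0), B0)   # recovers B
all.equal(solve(t(B0), t(D0) %*% solve(A0)), C0)   # recovers C
all.equal(t(A0) %*% t(C0) %*% B0, D0)              # recovers D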