This chapter (Chapter 4) introduced the simple linear regression model, a framework for estimating the association between a predictor variable and an outcome variable. The Gaussian distribution comprises the likelihood in such models, because it counts up the relative numbers of ways different combinations of means and standard deviations can produce an observation. To fit these models to data, the chapter introduced quadratic approximation of the posterior distribution and the tool quap. It also introduced new procedures for visualizing prior and posterior distributions.
Place each answer inside the code chunk (grey box). The code chunks should contain a text response or a code that completes/answers the question or activity requested. Make sure to include plots if the question requests them.
Finally, upon completion, name your final output .html
file as: YourName_ANLY505-Year-Semester.html
and publish
the assignment to your R Pubs account and submit the link to Canvas.
Each question is worth 5 points.
4-1. For the model definition below, simulate observed y values from the prior (not the posterior). Make sure to plot the simulation. \[\begin{align} \ y_i ∼ Normal(μ, σ) \\ \ μ ∼ Normal(0, 10) \\ \ σ ∼ Exponential(1) \\ \end{align}\]
set.seed(123)
prior_mu <- rnorm(1e4, 0, 10)
prior_sigma <- rexp(1e4, 1)
prior_y <- rnorm(1e4, prior_mu, prior_sigma)
dens(prior_y, xlab="observed y of prior")
4-2. Translate the model just above into a quap formula.
quap_y1 <-
alist(
y ~ dnorm(mu, sigma),
mu ~ dnorm(0, 10),
sigma ~ dexp(1)
)
4-3. Translate the quap model formula below into a
mathematical model definition:
y ~ dnorm( mu , sigma ),
mu <- a + b*x,
a ~ dnorm( 0 , 10 ),
b ~ dunif( 0 , 1 ),
sigma ~ dexp( 1 )
quap_y2 <-
alist(
y ~ dnorm(mu, sigma),
mu <- a + b*x,
a ~ dnorm(0, 10),
b ~ dunif( 0 , 1 ),
sigma ~ dexp(1)
)
4-4. A sample of students is measured for height each year for 3 years. After the third year, you want to fit a linear regression predicting height using year as a predictor. Write down the mathematical model definition for this regression, using any variable names and priors you choose. Be prepared to defend your choice of priors. Simulate from the priors that you chose to see what the model expects before it sees the data. Do this by sampling from the priors. Then consider 50 students, each simulated with a different prior draw. For each student simulate 3 years. Plot the 50 linear relationships with height(cm) on the y-axis and year on the x-axis. What can we do to make these priors more likely?
A: Supposed, the sample is defined as male student heights from middle school. The student age range is from 12 to 15. A simple linear model will be conducted for mean estimation.
Step 1, based on this sample, the heights of student are assumed to follow a normal distribution:
H ~ Normal(μ, σ)
Step 2, suppose \(\alpha\) stands for height’s mean, \(\beta\) stands for growth rate, and x stands for year. The linear model of height’s mean is
μ = \(\alpha\) + \(\beta\)*x
Step 3, the prior for the intercept. Here is some information for average height (https://censusatschool.ie/results/censusatschool-21-results/). Supposed the 12 years old male’s mean of height follows normal distribution:
\(\alpha\) ∼ Normal(165, 10)
Height <- rnorm(10000, 165, 10)
dens(Height, lwd=3, col=4, xlab="Student Height", main="Distribution of 12 years old male student's height")
Step 4, the prior for slope, supposed growth rate follows log normal distribution:
\(\beta\) ~ LogNormal(1,0.5)
Mostly, the yearly height growth was located in [0,10]. That means the three years height growth should be located within [0,30], it is reasonable for height growth.
Beta1 <- rlnorm(10000, 1, 0.5)
dens(Beta1, lwd=3, xlab="Yearly Height Growth", main="Distribution of yearly height growth")
Step 5, the prior for sigma is the standard deviation of the height. From plot below, if sigma=20, the height will be larger than 200cm. It is not reasonable. So I supposed prior for sigma follows uniform(0.10) distribution. :
σ ~ Uniform(0,10)
Height1 <- rnorm(10000, 165, 10)
Height2 <- rnorm(10000, 165, 20)
dens(Height1, lwd=3, col=4, xlab="Student Height", main="Distribution of Height with different sigma")
dens(Height2, lwd=3, col=3, add=TRUE)
legend(x = "topright",
legend = c("sigma=10", "sigma=20"),
lty = c(1, 1),
col = c(4, 3),
lwd = 2)
Step 6, fit model
Height ~ dnorm(mu, sigma ),
μ = \(\alpha\) + \(\beta\)*x, \(\alpha\) ~ dnorm(165, 10),
\(\beta\) ~ dlnorm(1, 0.5),
σ ~ dunif(0, 10)
# re-simulate of the model
#simulation with 50 student, each simulate 3 years
set.seed(123)
sample1_alpha <- rnorm(50,165,10)
sample1_beta <- rlnorm(50,1,0.5)
sample1_sigma <- runif(50,0,10)
year <- seq(from =1, to=3, len=3)
plot(NULL, xlim=c(1,3),ylim=c(150,200),
xlab ="Years", ylab = "Height(cm)", main = "linear plots of students' yearly heights")
for (i in 1:50)
lines (year, sample1_alpha[i] + sample1_beta[i] *year , lwd=2 , col=4)
4-5. (a) Now suppose I remind you that every student got taller each year. Does this information lead you to change your choice of priors? How? Again, simulate from the priors and plot. (b) Now suppose I tell you that the variance among heights for students of the same age is never more than 64cm. How does this lead you to revise your priors?
A: For question a, if every student got taller each year, the distribution of height growth rate should be always larger than 0. Log normal distribution is a right choice. Thus, I will keep beta follows a log normal distribution. For question b, since the variance is the square of σ. If it is never more than 64cm, the sigma distribution should be updated to:
σ ~ Uniform(0,8)
One issue is for prior \(\alpha\) ∼ Normal(165, 10), standard deviation is already 10. If we directly update σ, the variance for estimation should be higher than 64cm. It is better to reduce standard deviation of \(\alpha\). The prior for the intercept will update to:
\(\alpha\) ∼ Normal(165, 5)
The variance for each year is below 64cm.
# re-simulate of the model
#simulation with 50 student, each simulate 3 years
set.seed(123)
sample2_alpha <- rnorm(50,165,5)
sample2_beta <- rlnorm(50,1,0.5)
sample2_sigma <- runif(50,0,8)
# simulation for each year
simulate2_year1 <- rnorm(1e4,sample2_alpha+sample2_beta,sample2_sigma)
simulate2_year2 <- rnorm(1e4,sample2_alpha+sample2_beta*2,sample2_sigma)
simulate2_year3 <- rnorm(1e4,sample2_alpha+sample2_beta*3,sample2_sigma)
# variance matrix
simulate2 <- cbind(simulate2_year1, simulate2_year2, simulate2_year3)
simulate2_var <-apply(simulate2,2,var)
simulate2_var
## simulate2_year1 simulate2_year2 simulate2_year3
## 44.41296 50.26261 61.57743
# plot density for each year
dens(simulate2_year1, lwd=2, col=4, xlab="Student Height", main="Distribution of Height for each year")
dens(simulate2_year2, lwd=2, col=3, add=TRUE)
dens(simulate2_year3, lwd=2, col=2, add=TRUE)
legend(x = "topright",
legend = c("First Year", "Second Year", "Third Year"),
lty = c(1, 1, 1),
col = c(4, 3, 2),
lwd = 2)