Chapter 4 - Geocentric Models

This chapter introduced the simple linear regression model, a framework for estimating the association between a predictor variable and an outcome variable. The Gaussian distribution comprises the likelihood in such models, because it counts up the relative numbers of ways different combinations of means and standard deviations can produce an observation. To fit these models to data, the chapter introduced quadratic approximation of the posterior distribution and the tool quap. It also introduced new procedures for visualizing prior and posterior distributions.

Place each answer inside the code chunk (grey box). Each code chunk should contain a text response or code that completes/answers the question or activity requested. Problems are labeled Easy (E), Medium (M), and Hard (H).

Finally, upon completion, name your final output .html file as: YourName_ANLY505-Year-Semester.html, publish the assignment to your RPubs account, and submit the link to Canvas. Each question is worth 5 points.

Questions

4E1. In the model definition below, which line is the likelihood? \[\begin{align} y_i &\sim \text{Normal}(\mu, \sigma) \\ \mu &\sim \text{Normal}(0, 10) \\ \sigma &\sim \text{Exponential}(1) \end{align}\]

# The first line, y_i ~ Normal(μ, σ), is the likelihood.

4E2. In the model definition just above, how many parameters are in the posterior distribution?

# Two parameters are in the posterior distribution: μ and σ.

4E3. Using the model definition above, write down the appropriate form of Bayes’ theorem that includes the proper likelihood and priors.

# Pr(μ, σ | y) = [ ∏_i Normal(y_i | μ, σ) × Normal(μ | 0, 10) × Exponential(σ | 1) ] /
#                ∫∫ ∏_i Normal(y_i | μ, σ) × Normal(μ | 0, 10) × Exponential(σ | 1) dμ dσ

4E4. In the model definition below, which line is the linear model? \[\begin{align} y_i &\sim \text{Normal}(\mu_i, \sigma) \\ \mu_i &= \alpha + \beta x_i \\ \alpha &\sim \text{Normal}(0, 10) \\ \beta &\sim \text{Normal}(0, 1) \\ \sigma &\sim \text{Exponential}(2) \end{align}\]

# The second line, μ_i = α + βx_i, is the linear model.

4E5. In the model definition just above, how many parameters are in the posterior distribution?

# There are three parameters in the posterior distribution: α, β, σ. 

4M1. For the model definition below, simulate observed y values from the prior (not the posterior). \[\begin{align} y_i &\sim \text{Normal}(\mu, \sigma) \\ \mu &\sim \text{Normal}(0, 10) \\ \sigma &\sim \text{Exponential}(1) \end{align}\]

library(rethinking)   # for dens()

set.seed(4)
sample_mu    <- rnorm(1e4, 0, 10)                     # draw mu from its prior
sample_sigma <- rexp(1e4, 1)                          # draw sigma from its prior
prior_y      <- rnorm(1e4, sample_mu, sample_sigma)   # prior predictive draws of y
dens(prior_y)

4M2. Translate the model just above into a quap formula.

qlist <- alist(
  y ~ dnorm(mu, sigma),
  mu ~ dnorm(0, 10),
  sigma ~ dexp(1)
)
## m <- quap(qlist, data = d)
## Not run: the question supplies no data to fit the model to;
## an optional check against simulated data is sketched below.
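# Optional check (not required by the question): the formula can be fit to
# simulated data. The sketch below assumes the prior_y draws from the 4M1 chunk
# are still in the workspace and simply passes them in as y.
m_check <- quap(qlist, data = list(y = prior_y))
precis(m_check)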
4M3. Translate the quap model formula below into a mathematical model definition:

y ~ dnorm( mu , sigma ),
mu <- a + b*x,
a ~ dnorm( 0 , 10 ),
b ~ dunif( 0 , 1 ),
sigma ~ dexp( 1 )

# y_i ~ Normal(μ_i, σ)
# μ_i = α + β x_i
# α ~ Normal(0, 10)
# β ~ Uniform(0, 1)
# σ ~ Exponential(1)

4M4. A sample of students is measured for height each year for 3 years. After the third year, you want to fit a linear regression predicting height using year as a predictor. Write down the mathematical model definition for this regression, using any variable names and priors you choose. Be prepared to defend your choice of priors.

# If the same group of students is repeatedly sampled each year, then the observations are not independent and a linear mixed model would be more appropriate. However, we don't have this information, so I will use a simple linear model instead. The question refers to “students” without specifying age, so I will start with a weak prior for the intercept, α, that captures likely heights for students ranging from school-age children to college-age young adults (my assumptions are the following: from ~110 cm for a 5-year-old female to ~180 cm for a 20-year-old male). Similarly, I will use a weak prior for the slope, β, that captures likely yearly growth rates across this wide age range (from ~7.0 cm/year for a 5-year-old down to ~0.5 cm/year for a 20-year-old). Finally, I will use a uniform prior for the standard deviation of heights that is wide enough to cover the full range in case students of all ages are included.

#hi ~ Normal(μ,σ)
#μi = α+βxi
#α  ~ Normal(150,25)
#β  ~ Normal(4,2)
#σ  ~ Uniform(0,50)

#I have chosen a linear model without any polynomial terms or transformations because I noticed that a later question will ask for log transformation and I want an un-transformed point of comparison. For the alpha prior, I chose a normal distribution centered on 150 cm with an SD of 25 cm; 150 cm is in the middle of the expected distribution if both school and college students are included and 25 cm is enough variability that two SDs around the mean (i.e., 100 cm to 200 cm) should include most students at the high and low end of the age distribution. For the beta prior, I chose a normal distribution centered on 4 cm/year with an SD of 2 cm/year; 4 cm/year is in the middle of the expected distribution if both school and college students are included and 2 cm/year is enough variability that two SDs around the mean (i.e., 0 cm/year to 8 cm/year) should include most students at the high and low end of the age distribution. This also captures prior knowledge that students should only very rarely be growing less tall over time. Finally, for the sigma prior, I chose a uniform distribution from 0 cm to 50 cm; this range includes both a tight distribution of students around the same age/height and a wide range of students at both school and college ages/heights; 50 cm is a bit high, but I want a conservative prior to begin with.
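# A quick prior-predictive check of these priors (a sketch, not required by the
# question); year is assumed to take the values 1, 2, 3.
set.seed(4)
N <- 1e4
a_sim     <- rnorm(N, 150, 25)
b_sim     <- rnorm(N, 4, 2)
sigma_sim <- runif(N, 0, 50)
year      <- sample(1:3, N, replace = TRUE)
h_sim     <- rnorm(N, a_sim + b_sim * year, sigma_sim)
dens(h_sim)   # most simulated heights land in a broadly plausible ~50-250 cm range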

4M5. Now suppose I remind you that every student got taller each year. Does this information lead you to change your choice of priors? How?

# Knowing that the average height in the first year was about 120 cm and that every student got taller each year makes me more confident that we are dealing with school-age students (e.g., around 7 years old). I can therefore narrow my prior distributions to make heights and growth rates typical of older students less plausible. My expectation for σ is also much lower now, as I no longer expect a balanced mix of young and old students.

#hi ~ Normal(μ,σ)
#μi = α+βxi
#α  ~ Normal(120,10)
#β  ~ Normal(7,1)
#σ  ~ Uniform(0,20)

#I will center the α prior around 120 cm and decrease its SD to 10 cm to reflect our new knowledge about the average height. Similarly, I will recenter the β prior around 7 cm/year and decrease its SD to 1 cm/year as these values are more consistent with school age students. Finally, I will reduce the maximum value in the σ prior to 20 cm, as a higher SD is less likely with such a low average height.

4M6. Now suppose I tell you that the variance among heights for students of the same age is never more than 64cm. How does this lead you to revise your priors?

# The variance is σ², so if the variance is never more than 64 cm², then σ is never more than 8 cm. We can therefore lower the maximum of the σ prior to 8. This information may also have implications for the α prior, but I am not confident enough about that relationship to update it.

#hi ~ Normal(μ,σ)
#μi = α+βxi
#α  ~ Normal(120,10)
#β  ~ Normal(7,1)
#σ  ~ Uniform(0,8)

4M7. Refit model m4.3 from the chapter, but omit the mean weight xbar this time. Compare the new model’s posterior to that of the original model. In particular, look at the covariance among the parameters. What is different? Then compare the posterior predictions of both models.

data("Howell1")
d <- Howell1
d2 <- d[ d$age >= 18 , ]
xbar <- mean(d2$weight)
m4.3 <- quap(
alist(
      height ~ dnorm( mu , sigma ) , 
      mu <- a + b*( weight - xbar ) , 
      a ~ dnorm( 178 , 20 ) ,
      b ~ dlnorm( 0 , 1 ) ,
      sigma ~ dunif( 0 , 50 )), data=d2 )
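# The chunk above refits the original (centered) m4.3 as a baseline. Below is a
# sketch of the model with xbar omitted; the name m4.3b and the use of cov2cor()
# for the comparison are my own choices, not from the chapter.
m4.3b <- quap(
  alist(
    height ~ dnorm(mu, sigma),
    mu <- a + b * weight,          # raw weight, no centering on xbar
    a ~ dnorm(178, 20),
    b ~ dlnorm(0, 1),
    sigma ~ dunif(0, 50)
  ), data = d2)

precis(m4.3)
precis(m4.3b)

# Covariance/correlation among parameters: centering leaves a and b nearly
# uncorrelated, while the uncentered model shows a strong negative correlation
# between a and b (the intercept now refers to weight = 0).
round(cov2cor(vcov(m4.3)), 2)
round(cov2cor(vcov(m4.3b)), 2)

# Posterior predictions from the two models are essentially identical, because
# both parameterizations describe the same regression line; only the
# interpretation of a changes.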

4M8. In the chapter, we used 15 knots with the cherry blossom spline. Increase the number of knots and observe what happens to the resulting spline. Then adjust also the width of the prior on the weights—change the standard deviation of the prior and watch what happens. What do you think the combination of knot number and the prior on the weights controls?

library(splines)
data(cherry_blossoms)
d <- cherry_blossoms
d2 <- d[complete.cases(d$temp), ]

# Increase the number of knots to 30
num_knots <- 30
knot_list <- quantile(d2$year, probs = seq(0, 1, length.out = num_knots))

# Cubic B-spline basis over year, dropping the boundary knots from the interior list
B <- bs(d2$year, knots = knot_list[-c(1, num_knots)], degree = 3, intercept = TRUE)

plot(NULL, xlim = range(d2$year), ylim = c(0, 1), xlab = "year", ylab = "basis value")
for (i in 1:ncol(B)) lines(d2$year, B[, i])

knot_list
##        0% 3.448276% 6.896552% 10.34483%  13.7931% 17.24138% 20.68966% 24.13793% 
##  839.0000  895.7241  934.4483  973.1724 1011.8966 1050.6207 1089.3448 1128.0690 
## 27.58621% 31.03448% 34.48276% 37.93103% 41.37931% 44.82759% 48.27586% 51.72414% 
## 1166.7931 1205.5172 1244.2414 1282.9655 1321.6897 1360.4138 1399.1379 1437.8621 
## 55.17241% 58.62069% 62.06897% 65.51724% 68.96552% 72.41379% 75.86207% 79.31034% 
## 1476.5862 1515.3103 1554.0345 1592.7586 1631.4828 1670.2069 1708.9310 1747.6552 
## 82.75862%  86.2069% 89.65517% 93.10345% 96.55172%      100% 
## 1786.3793 1825.1034 1863.8276 1902.5517 1941.2759 1980.0000
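# The chunk above only plots the basis functions. A sketch of actually fitting the
# spline follows, so the prior on the weights can be varied; the outcome is assumed
# to be d2$temp (to match the chunk above) and the priors below are my own choices,
# following the chapter's spline model.
m_spline <- quap(
  alist(
    Tmp ~ dnorm(mu, sigma),
    mu <- a + B %*% w,
    a ~ dnorm(6, 10),
    w ~ dnorm(0, 10),          # try dnorm(0, 1) and dnorm(0, 100) and refit
    sigma ~ dexp(1)
  ),
  data = list(Tmp = d2$temp, B = B),
  start = list(w = rep(0, ncol(B))))

mu_spline <- link(m_spline)
mu_PI <- apply(mu_spline, 2, PI, 0.97)
plot(d2$year, d2$temp, col = col.alpha(rangi2, 0.3), pch = 16,
     xlab = "year", ylab = "temp")
shade(mu_PI, d2$year)

# More knots allow more local bends, and a wider prior on w lets each weight move
# farther from zero; together they control how flexible (wiggly) the fitted spline
# can be. A tighter prior shrinks the weights toward zero and smooths the curve.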

4H2. Select out all the rows in the Howell1 data with ages below 18 years of age. If you do it right, you should end up with a new data frame with 192 rows in it.

  1. Fit a linear regression to these data, using quap. Present and interpret the estimates. For every 10 units of increase in weight, how much taller does the model predict a child gets?

  2. Plot the raw data, with height on the vertical axis and weight on the horizontal axis. Superimpose the MAP regression line and 89% interval for the mean. Also superimpose the 89% interval for predicted heights.

  3. What aspects of the model fit concern you? Describe the kinds of assumptions you would change, if any, to improve the model. You don’t have to write any new code. Just explain what the model appears to be doing a bad job of, and what you hypothesize would be a better model.

#New data frame with data for ages below 18 years old
h2_data <- Howell1[Howell1$age < 18, ]
nrow(h2_data)
## [1] 192
#a
quaplist <- alist(
  height ~ dnorm(mu, sigma),
  mu <- a + b * weight,
  a ~ dnorm(110, 30),
  b ~ dnorm(0, 10),
  sigma ~ dunif(0, 60)
)

qp <- quap(quaplist, data = h2_data)
precis(qp)
##            mean         sd      5.5%     94.5%
## a     58.363192 1.39512363 56.133515 60.592869
## b      2.714246 0.06820376  2.605243  2.823249
## sigma  8.433415 0.43008890  7.746050  9.120781
# The posterior mean of b is about 2.71, so for each 10 units of increase in weight the model predicts roughly a 27.1 cm increase in height (10 × 2.71).
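# The same number can be pulled from the posterior samples as a quick check:
post <- extract.samples(qp)
mean(post$b) * 10   # roughly 27 cm of additional height per 10 units of weight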

#b
# Raw data
plot(height ~ weight, data = h2_data, col = col.alpha("red", 0.4))

# Grid of weights for prediction
weight.sq <- seq(from = min(h2_data$weight), to = max(h2_data$weight), by = 1)

# MAP regression line and 89% interval for the mean
mu <- link(qp, data = data.frame(weight = weight.sq))
mu.mean <- apply(mu, 2, mean)
mu.HPDI <- apply(mu, 2, HPDI, prob = 0.89)
lines(weight.sq, mu.mean)
shade(mu.HPDI, weight.sq)

# 89% interval for predicted (simulated) heights
sim.height <- sim(qp, data = list(weight = weight.sq))
height.HPDI <- apply(sim.height, 2, HPDI, prob = 0.89)
shade(height.HPDI, weight.sq)

# This linear model does not predict heights very well. For both low and high weights (roughly <10 and >30) it overestimates height, and for middling weights (roughly 10-30) it underestimates height. The problem is the assumption that the relationship between μ and weight is linear. A polynomial regression (e.g., quadratic) or a model with log-transformed weight would likely fit better.
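# A sketch of the quadratic model hypothesized above; the standardization of
# weight and the priors are my own choices, following the chapter's
# polynomial-regression example.
h2_data$weight_s  <- (h2_data$weight - mean(h2_data$weight)) / sd(h2_data$weight)
h2_data$weight_s2 <- h2_data$weight_s^2

qp2 <- quap(
  alist(
    height ~ dnorm(mu, sigma),
    mu <- a + b1 * weight_s + b2 * weight_s2,
    a ~ dnorm(110, 30),
    b1 ~ dnorm(0, 10),
    b2 ~ dnorm(0, 10),
    sigma ~ dunif(0, 60)
  ), data = h2_data)
precis(qp2)
# The curved mean function should track the bend in the raw data better than the
# straight line fitted above.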