Chapter 4 - Geocentric Models

This chapter introduced the simple linear regression model, a framework for estimating the association between a predictor variable and an outcome variable. The Gaussian distribution comprises the likelihood in such models, because it counts up the relative numbers of ways different combinations of means and standard deviations can produce an observation. To fit these models to data, the chapter introduced quadratic approximation of the posterior distribution and the tool quap. It also introduced new procedures for visualizing prior and posterior distributions.

Place each answer inside the code chunk (grey box). The code chunks should contain a text response or a code that completes/answers the question or activity requested. Make sure to include plots if the question requests them.

Finally, upon completion, name your final output .html file as: YourName_ANLY505-Year-Semester.html and publish the assignment to your R Pubs account and submit the link to Canvas. Each question is worth 5 points.

Questions

4-1. For the model definition below, simulate observed y values from the prior (not the posterior). Make sure to plot the simulation. \[\begin{align} \ y_i ∼ Normal(μ, σ) \\ \ μ ∼ Normal(0, 10) \\ \ σ ∼ Exponential(1) \\ \end{align}\]

set.seed(10)
sample_mu = rnorm(1e4, 0, 10)
sample_sigma = rexp(1e4, 1)
prior_y = rnorm(1e4, sample_mu, sample_sigma)
dens(prior_y, xlab="observed y values")

4-2. Translate the model just above into a quap formula.

quap_formula <- alist(
  y ~ dnorm(mu, sigma),
  mu ~ dnorm(0, 10),
  sigma ~ dexp(1))
4-3. Translate the quap model formula below into a mathematical model definition:

y ~ dnorm( mu , sigma ),
mu <- a + b*x,
a ~ dnorm( 0 , 10 ),
b ~ dunif( 0 , 1 ),
sigma ~ dexp( 1 )

\[\begin{align} \ y_i ∼ Normal(μ, σ) \\ \ μ_i ∼ α + βx_i \\ \ α ∼ Normal(0, 10) \\ \ β ∼ Uniform(0, 1) \\ \ σ ∼ Exponential (1) \\ \end{align}\]

#\[\begin{align} \ y_i ∼ Normal(μ, σ) \\ \ μ_i ∼ α + βx_i \\ \ α ∼ Normal(0, 10) \\ \ β ∼ Uniform(0, 1) \\ \ σ ∼ Exponential(1) \\ \end{align}\]

4-4. A sample of students is measured for height each year for 3 years. After the third year, you want to fit a linear regression predicting height using year as a predictor. Write down the mathematical model definition for this regression, using any variable names and priors you choose. Be prepared to defend your choice of priors. Simulate from the priors that you chose to see what the model expects before it sees the data. Do this by sampling from the priors. Then consider 50 students, each simulated with a different prior draw. For each student simulate 3 years. Plot the 50 linear relationships with height(cm) on the y-axis and year on the x-axis. What can we do to make these priors more likely?

math model: \[\begin{align} \ h_i ∼ Normal(μ, σ) \\ \ μ_i ∼ α + βx_i \\ \ α ∼ Normal(130, 10) \\ \ β ∼ Normal(0, 10) \\ \ σ ∼ Uniform(0, 20) \\ \end{align}\]

# Validate with simulation data
# the variable alpha is set up with a relatively reasonable average height of the student, which is around 130cm, so that the distribution for alpha have a mean of 130 and the variance of 10 ensure the simulated alpha shouldn't be too vary from the mean.

# beta is the slope of correlation coefficient between year and height, therefore I start by setting up a normal distribution with mean of zero and variance 10, to make this slope could be either positive or negative (but it might not be negative in the real world)

# finally the sigma give us some randomness, which we don't want this number to be too small or too big, but the sigma should be non-negative, therefore I use the uniform distribution to control the sigma range

#simulation with 50 student, each simulate 3 years
set.seed(10)
sample_alpha <- rnorm(50,130,10)
sample_beta <- rnorm(50,0,10)
sample_sigma <- runif(50, 0,20)

year <- seq(from =1, to=3, len=3)

plot(NULL, xlim=c(1,3),ylim=c(60,200),
     xlab ="Years", ylab = "Height(cm)", main = "linear plots of students' height vs year")
for (i in 1:50)
  lines (year, sample_alpha[i] + sample_beta[i] *year , lwd=3 , col=2)

#Comment: the simulated data could represent the trend of student's year vs height in general, but we could observe some outlines which are rarely happen in the real world

4-5. (a) Now suppose I remind you that every student got taller each year. Does this information lead you to change your choice of priors? How? Again, simulate from the priors and plot. (b) Now suppose I tell you that the variance among heights for students of the same age is never more than 64cm. How does this lead you to revise your priors?

# for question a, student got taller each year lead to a change of my choice of priors, for the slope of beta, instead of normal distribution, I will instead use log-normal distribution to prevent any negative values , since student got taller, so the mean of the log-normal distribution should also be greater than zero, I selected 2 in this case

# re-simulate of the model
#simulation with 50 student, each simulate 3 years
set.seed(10)
sample_alpha <- rnorm(50,130,10)
sample_beta <- rlnorm(50,2,0.5)
sample_sigma <- runif(50, 0,20)

year <- seq(from =1, to=3, len=3)

plot(NULL, xlim=c(1,3),ylim=c(60,200),
     xlab ="Years", ylab = "Height(cm)", main = "linear plots of students' height vs year")
for (i in 1:50)
  lines (year, sample_alpha[i] + sample_beta[i] *year , lwd=3 , col=2)

# for question b , since the variance among heights for students of the same age is never more than 64cm, then I need to shrink the sigma from the prior to achieve this goal , as well as reduce the sigma for the normal distribution and log normal distribuiotn  when randomly drawing the parameter alpha and beta

set.seed(10)
sample_alpha <- rnorm(1e4,130,5)
sample_beta <- rlnorm(1e4,2,0.2)
sample_sigma <- runif(1e4, 0, 5)

# Validate for the first three years of the gap for simulated highest height vs lowest height within same year

#year 1
simulate_year1 <- rnorm(1e4,sample_alpha+sample_beta,sample_sigma)
gap_year1 <- max(simulate_year1) - min(simulate_year1)

#year 2
simulate_year2 <- rnorm(1e4,sample_alpha+sample_beta *2 ,sample_sigma)
gap_year2 <- max(simulate_year2) - min(simulate_year2)

#year 3
simulate_year3 <- rnorm(1e4,sample_alpha+sample_beta *3 ,sample_sigma)
gap_year3 <- max(simulate_year3) - min(simulate_year3)

#Comment : For the simulated data, the gap between highest and lowest heights within the same year (in year 1-3) are all below 64cm

4-6. In the chapter, we used 15 knots with the cherry blossom spline. Increase the number of knots and observe what happens to the resulting spline. Then adjust also the width of the prior on the weights—change the standard deviation of the prior and watch what happens. What do you think the combination of knot number and the prior on the weights controls?

library(splines)
data(cherry_blossoms) 
d <- cherry_blossoms 
precis(d)
##                   mean          sd      5.5%      94.5%
## year       1408.000000 350.8845964 867.77000 1948.23000
## doy         104.540508   6.4070362  94.43000  115.00000
## temp          6.141886   0.6636479   5.15000    7.29470
## temp_upper    7.185151   0.9929206   5.89765    8.90235
## temp_lower    5.098941   0.8503496   3.78765    6.37000
##                                                                                                                           histogram
## year                       <U+2587><U+2587><U+2587><U+2587><U+2587><U+2587><U+2587><U+2587><U+2587><U+2587><U+2587><U+2587><U+2581>
## doy                                                                <U+2581><U+2582><U+2585><U+2587><U+2587><U+2583><U+2581><U+2581>
## temp                                                               <U+2581><U+2583><U+2585><U+2587><U+2583><U+2582><U+2581><U+2581>
## temp_upper <U+2581><U+2582><U+2585><U+2587><U+2587><U+2585><U+2582><U+2582><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
## temp_lower <U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2583><U+2585><U+2587><U+2583><U+2582><U+2581><U+2581><U+2581>
d2 <- d[ complete.cases(d$temp) , ]


# Check number of knot when it is 20
num_knots <- 20
knot_list <- quantile( d2$year , probs=seq(0,1,length.out=num_knots))
B <- bs(d2$year, knots=knot_list[-c(1,num_knots)] , degree=3 , intercept=TRUE )

quap_model <- quap(
alist(
T ~ dnorm( mu , sigma ) ,
mu <- a + B %*% w ,
a ~ dnorm(6,10),
w ~ dnorm(0,1),
sigma ~ dexp(1)
),
data=list( T=d2$temp , B=B ) ,
start=list( w=rep( 0 , ncol(B) ) ) )

post <- extract.samples(quap_model)
w <- apply( post$w , 2 , mean )

mu <- link( quap_model )
mu_PI <- apply(mu,2,PI,0.97)
plot( d2$year , d2$temp , col=col.alpha(rangi2,0.3) , pch=16 )
shade( mu_PI , d2$year , col=col.alpha("black",0.5) )

# Check number of knot when it is 30
num_knots <- 30
knot_list <- quantile( d2$year , probs=seq(0,1,length.out=num_knots))
B <- bs(d2$year, knots=knot_list[-c(1,num_knots)] , degree=3 , intercept=TRUE )

quap_model <- quap(
alist(
T ~ dnorm( mu , sigma ) ,
mu <- a + B %*% w ,
a ~ dnorm(6,10),
w ~ dnorm(0,1),
sigma ~ dexp(1)
),
data=list( T=d2$temp , B=B ) ,
start=list( w=rep( 0 , ncol(B) ) ) )

post <- extract.samples(quap_model)
w <- apply( post$w , 2 , mean )

mu <- link( quap_model )
mu_PI <- apply(mu,2,PI,0.97)
plot( d2$year , d2$temp , col=col.alpha(rangi2,0.3) , pch=16 )
shade( mu_PI , d2$year , col=col.alpha("black",0.5) )

# Increase the standard deviation of the prior
num_knots <- 20
knot_list <- quantile( d2$year , probs=seq(0,1,length.out=num_knots))
B <- bs(d2$year, knots=knot_list[-c(1,num_knots)] , degree=3 , intercept=TRUE )

quap_model <- quap(
alist(
T ~ dnorm( mu , sigma ) ,
mu <- a + B %*% w ,
a ~ dnorm(6,10),
w ~ dnorm(0,1),
sigma ~ dexp(10)
),
data=list( T=d2$temp , B=B ) ,
start=list( w=rep( 0 , ncol(B) ) ) )

post <- extract.samples(quap_model)
w <- apply( post$w , 2 , mean )

mu <- link( quap_model )
mu_PI <- apply(mu,2,PI,0.97)
plot( d2$year , d2$temp , col=col.alpha(rangi2,0.3) , pch=16 )
shade( mu_PI , d2$year , col=col.alpha("black",0.5) )

# Comment : with more knots, the resulting spline has more turning points and with a higher standard deviation, the 97% posterior interval for mu has been increased