This chapter introduced the simple linear regression model, a framework for estimating the association between a predictor variable and an outcome variable. The Gaussian distribution is the likelihood in such models, because it counts up the relative numbers of ways different combinations of means and standard deviations can produce an observation. To fit these models to data, the chapter introduced quadratic approximation of the posterior distribution and the tool quap. It also introduced new procedures for visualizing prior and posterior distributions.
Place each answer inside the code chunk (grey box). Each code chunk should contain a text response or code that completes or answers the question or activity. Problems are labeled Easy (E), Medium (M), and Hard (H).
Finally, upon completion, name your final output .html file YourName_ANLY505-Year-Semester.html, publish the assignment to your R Pubs account, and submit the link to Canvas. Each question is worth 5 points.
4E1. In the model definition below, which line is the likelihood? \[\begin{align} y_i &\sim \text{Normal}(\mu, \sigma) \\ \mu &\sim \text{Normal}(0, 10) \\ \sigma &\sim \text{Exponential}(1) \end{align}\]
#The first line, yi ∼ Normal(μ, σ), is the likelihood in the model definition above.
4E2. In the model definition just above, how many parameters are in the posterior distribution?
#In the posterior distribution of the model definition above there are two parameters: μ and σ.
4E3. Using the model definition above, write down the appropriate form of Bayes’ theorem that includes the proper likelihood and priors.
#Bayes' theorem says the posterior is proportional to the likelihood times the priors, divided by the average likelihood (the normalizing constant):
#Pr(μ, σ | y) = Likelihood x Prior / Evidence
#Substituting the likelihood and priors from the model definition above gives the appropriate form of Bayes' theorem:
#Pr(μ, σ | y) = (∏i Normal(yi | μ, σ) Normal(μ | 0, 10) Exponential(σ | 1)) / (∫∫ ∏i Normal(yi | μ, σ) Normal(μ | 0, 10) Exponential(σ | 1) dμ dσ)
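#To make the formula concrete, here is an optional grid-approximation sketch of that posterior for a small set of simulated observations; y_obs, mu_grid, sigma_grid, and par_grid are made-up names used only for illustration.
set.seed(505)
y_obs <- rnorm(20, 3, 2)  #hypothetical data
mu_grid <- seq(-10, 10, length.out = 100)
sigma_grid <- seq(0.1, 10, length.out = 100)
par_grid <- expand.grid(mu = mu_grid, sigma = sigma_grid)
#Log-likelihood of the data plus the log-priors from the model definition
log_post <- sapply(1:nrow(par_grid), function(i)
  sum(dnorm(y_obs, par_grid$mu[i], par_grid$sigma[i], log = TRUE)) +
    dnorm(par_grid$mu[i], 0, 10, log = TRUE) +
    dexp(par_grid$sigma[i], 1, log = TRUE))
#Exponentiate and normalize to get the posterior over the grid
post <- exp(log_post - max(log_post))
post <- post / sum(post)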
4E4. In the model definition below, which line is the linear model? \[\begin{align} y_i &\sim \text{Normal}(\mu, \sigma) \\ \mu_i &= \alpha + \beta x_i \\ \alpha &\sim \text{Normal}(0, 10) \\ \beta &\sim \text{Normal}(0, 1) \\ \sigma &\sim \text{Exponential}(2) \end{align}\]
#In the above model definition, the linear model is: μi = α + βxi
4E5. In the model definition just above, how many parameters are in the posterior distribution?
#From the above model definition, there are three parameters in the posterior distribution: α, β, and σ.
4M1. For the model definition below, simulate observed y values from the prior (not the posterior). \[\begin{align} y_i &\sim \text{Normal}(\mu, \sigma) \\ \mu &\sim \text{Normal}(0, 10) \\ \sigma &\sim \text{Exponential}(1) \end{align}\]
#Load the rethinking package (provides dens and quap)
library(rethinking)
#Sample the parameters from their priors
mu1 <- rnorm(1e4, 0, 10)
sigma1 <- rexp(1e4, 1)  #sigma ~ Exponential(1), per the model definition
#Prior predictive simulation of y
yi_prior <- rnorm(1e4, mu1, sigma1)
#Density plot
dens(yi_prior)
4M2. Translate the model just above into a quap formula.
#Quap formula (sigma ~ dexp(1) to match the Exponential(1) prior in the model above)
quap_form <- alist(y ~ dnorm(mu, sigma), mu ~ dnorm(0, 10), sigma ~ dexp(1))
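#As an optional quick check, the formula can be passed to quap() with some made-up data; y_sim and m4m2 are hypothetical names used only for illustration.
set.seed(505)
y_sim <- rnorm(50, 5, 2)
m4m2 <- quap(quap_form, data = list(y = y_sim))
precis(m4m2)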
4M3. Translate the quap model formula below into a mathematical model definition:
y ~ dnorm( mu , sigma ),
mu <- a + b*x,
a ~ dnorm( 0 , 10 ),
b ~ dunif( 0 , 1 ),
sigma ~ dexp( 1 )
#Mathematical model definition corresponding to the quap formula:
#yi ~ Normal(μi, σ)
#μi = a + b*xi
#a ~ Normal(0, 10)
#b ~ Uniform(0, 1)
#σ ~ Exponential(1)
4M4. A sample of students is measured for height each year for 3 years. After the third year, you want to fit a linear regression predicting height using year as a predictor. Write down the mathematical model definition for this regression, using any variable names and priors you choose. Be prepared to defend your choice of priors.
#Let hi be the height of student i and xi the year of measurement, coded 0, 1, 2 for the three years. The intercept a is then the average height in the first year; Normal(145, 20) centers it on a plausible average for a mixed-age, mixed-sex sample of students while remaining weakly informative. The slope b is the average growth per year in cm; Normal(0, 5) allows annual growth of roughly -10 to +10 cm. The prior on σ, Uniform(0, 10), says that heights of students measured in the same year vary around the mean by at most about 10 cm, for all students regardless of sex. The model definition is:
#hi ~ Normal(μi, σ)
#μi = a + b*xi
#a ~ Normal(145, 20)
#b ~ Normal(0, 5)
#σ ~ Uniform(0, 10)
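#Optional prior predictive sketch to help defend these priors (not required by the question); the object names below are made up for illustration.
set.seed(505)
n <- 1e4
a_prior <- rnorm(n, 145, 20)
b_prior <- rnorm(n, 0, 5)
sigma_prior <- runif(n, 0, 10)
year3 <- 2  #third year of measurement, with the first year coded as 0
h_prior <- rnorm(n, a_prior + b_prior * year3, sigma_prior)
dens(h_prior)  #simulated heights should mostly fall in a plausible range for students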
4M5. Now suppose I remind you that every student got taller each year. Does this information lead you to change your choice of priors? How?
#The structure of the model doesn't change, since a growth rate b is already included. However, knowing that every student got taller each year means b must be positive, so I would replace b ~ Normal(0, 5) with a strictly positive prior such as b ~ Log-Normal(0, 1).
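#Optional sketch comparing the two candidate priors for the growth rate; the Log-Normal draws are all positive, matching the information that every student got taller.
dens(rnorm(1e4, 0, 5))   #Normal(0, 5): allows negative growth rates
dens(rlnorm(1e4, 0, 1))  #Log-Normal(0, 1): strictly positive growth rates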
4M6. Now suppose I tell you that the variance among heights for students of the same age is never more than 64cm. How does this lead you to revise your priors?
#Since the standard deviation is the square root of the variance, σ can be at most √64 = 8 cm, so the uniform prior on σ is tightened to Uniform(0, 8). The revised model is:
#hi ~ Normal(μi, σ)
#μi = a + b*xi
#a ~ Normal(145, 20)
#b ~ Log-Normal(0, 1)
#σ ~ Uniform(0, 8)
4M7. Refit model m4.3 from the chapter, but omit the mean weight xbar this time. Compare the new model’s posterior to that of the original model. In particular, look at the covariance among the parameters. What is different? Then compare the posterior predictions of both models.
#Load the data
data("Howell1")
#Assign the data and keep adults only (age >= 18)
df <- Howell1
df2 <- df[df$age >= 18, ]
#Mean weight, xbar
xbar <- mean(df2$weight)
#Original model m4.3, with weight centered on xbar
m4_3 <- quap(
  alist(
    height ~ dnorm(mu, sigma),
    mu <- a + b * (weight - xbar),
    a ~ dnorm(178, 20),
    b ~ dlnorm(0, 1),
    sigma ~ dunif(0, 50)
  ), data = df2)
m4_3
##
## Quadratic approximate posterior distribution
##
## Formula:
## height ~ dnorm(mu, sigma)
## mu <- a + b * (weight - xbar)
## a ~ dnorm(178, 20)
## b ~ dlnorm(0, 1)
## sigma ~ dunif(0, 50)
##
## Posterior means:
## a b sigma
## 154.601368 0.903281 5.071881
##
## Log-likelihood: -1071.01
#Quap omitting xbar (weight not centered)
m4_3b <- quap(
  alist(
    height ~ dnorm(mu, sigma),
    mu <- a + b * (weight),
    a ~ dnorm(178, 20),
    b ~ dlnorm(0, 1),
    sigma ~ dunif(0, 50)
  ), data = df2)
m4_3b
##
## Quadratic approximate posterior distribution
##
## Formula:
## height ~ dnorm(mu, sigma)
## mu <- a + b * (weight)
## a ~ dnorm(178, 20)
## b ~ dlnorm(0, 1)
## sigma ~ dunif(0, 50)
##
## Posterior means:
## a b sigma
## 114.5341922 0.8907327 5.0727173
##
## Log-likelihood: -1071.07
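#One way (a sketch; output not shown) to compare the covariance among parameters and the posterior predictions of the two fits:
round(vcov(m4_3), 3)            #expect a and b to be nearly uncorrelated (centered predictor)
round(vcov(m4_3b), 3)           #expect a strong negative covariance between a and b
round(cov2cor(vcov(m4_3b)), 2)  #same comparison on the correlation scale
#Posterior mean predictions from both models over the observed range of weights
w_seq <- seq(min(df2$weight), max(df2$weight), length.out = 50)
mu_a <- apply(link(m4_3, data = list(weight = w_seq)), 2, mean)
mu_b <- apply(link(m4_3b, data = list(weight = w_seq)), 2, mean)
plot(df2$weight, df2$height, col = col.alpha(rangi2, 0.3))
lines(w_seq, mu_a)
lines(w_seq, mu_b, lty = 2)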
#The slope b and sigma are nearly identical in the two models, but the intercept a changes a lot (about 154.6 with centering vs. 114.5 without), because without centering a is the expected height at weight = 0 rather than at the average weight. The key difference is in the covariance among the parameters: with the centered predictor, a and b are essentially uncorrelated, whereas without centering they show a strong negative covariance. The posterior predictions are essentially identical, though, because the two parameterizations describe the same regression line over the observed range of weights.
4M8. In the chapter, we used 15 knots with the cherry blossom spline. Increase the number of knots and observe what happens to the resulting spline. Then adjust also the width of the prior on the weights—change the standard deviation of the prior and watch what happens. What do you think the combination of knot number and the prior on the weights controls?
#loading the data
data("cherry_blossoms")
#Assign the data to df
df <- cherry_blossoms
precis(df)
## mean sd 5.5% 94.5%
## year 1408.000000 350.8845964 867.77000 1948.23000
## doy 104.540508 6.4070362 94.43000 115.00000
## temp 6.141886 0.6636479 5.15000 7.29470
## temp_upper 7.185151 0.9929206 5.89765 8.90235
## temp_lower 5.098941 0.8503496 3.78765 6.37000
##                histogram
## year       ▇▇▇▇▇▇▇▇▇▇▇▇▁
## doy        ▁▂▅▇▇▃▁▁
## temp       ▁▃▅▇▃▂▁▁
## temp_upper ▁▂▅▇▇▅▂▂▁▁▁▁▁▁▁
## temp_lower ▁▁▁▁▁▁▁▃▅▇▃▂▁▁▁
df2 <- df[complete.cases(df$doy), ]
#Number of knots as used initially
knots <- 15
#Knot locations (quantiles of year)
l_knots <- quantile(df2$year, probs = seq(0, 1, length.out = knots))
l_knots
## 0% 7.142857% 14.28571% 21.42857% 28.57143% 35.71429% 42.85714% 50%
## 812 1036 1174 1269 1377 1454 1518 1583
## 57.14286% 64.28571% 71.42857% 78.57143% 85.71429% 92.85714% 100%
## 1650 1714 1774 1833 1893 1956 2015
#Increasing the number of knots to 30
knots2 <- 30
l2_knots <- quantile(df2$year, probs = seq(0, 1, length.out = knots2))
l2_knots
## 0% 3.448276% 6.896552% 10.34483% 13.7931% 17.24138% 20.68966% 24.13793%
## 812.0000 958.4828 1032.9655 1115.3448 1168.8621 1213.4138 1261.6897 1315.3793
## 27.58621% 31.03448% 34.48276% 37.93103% 41.37931% 44.82759% 48.27586% 51.72414%
## 1366.8621 1410.6897 1442.6552 1477.3103 1505.7931 1534.2759 1563.7586 1597.2414
## 55.17241% 58.62069% 62.06897% 65.51724% 68.96552% 72.41379% 75.86207% 79.31034%
## 1631.1724 1664.2069 1692.6897 1724.1724 1753.6552 1782.1379 1810.6207 1839.1034
## 82.75862% 86.2069% 89.65517% 93.10345% 96.55172% 100%
## 1867.5862 1898.0690 1928.5517 1958.0345 1986.5172 2015.0000
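#To see the effect of the extra knots and of the prior on the weights, the chapter's spline model (m4.7) can be refit with the new basis; this is a sketch following that model, with B2 and m4_8 as made-up names.
library(splines)
B2 <- bs(df2$year, knots = l2_knots[-c(1, knots2)], degree = 3, intercept = TRUE)
m4_8 <- quap(
  alist(
    D ~ dnorm(mu, sigma),
    mu <- a + B %*% w,
    a ~ dnorm(100, 10),
    w ~ dnorm(0, 10),  #widen (e.g. 100) or tighten (e.g. 1) this prior and compare the fits
    sigma ~ dexp(1)
  ),
  data = list(D = df2$doy, B = B2),
  start = list(w = rep(0, ncol(B2)))
)
#With more knots the spline has more basis functions and can wiggle more; a wider prior on the weights lets each basis weight move further from zero, while a tighter prior shrinks the spline toward a smoother, flatter curve. Together, the number of knots and the width of the prior on the weights control how flexible (wiggly) versus smooth the resulting spline is.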
4H2. Select out all the rows in the Howell1 data with ages below 18 years of age. If you do it right, you should end up with a new data frame with 192 rows in it.
Fit a linear regression to these data, using quap. Present and interpret the estimates. For every 10 units of increase in weight, how much taller does the model predict a child gets?
Plot the raw data, with height on the vertical axis and weight on the horizontal axis. Superimpose the MAP regression line and 89% interval for the mean. Also superimpose the 89% interval for predicted heights.
What aspects of the model fit concern you? Describe the kinds of assumptions you would change, if any, to improve the model. You don’t have to write any new code. Just explain what the model appears to be doing a bad job of, and what you hypothesize would be a better model.
#loading data
data(Howell1)
#set seed to 4
set.seed(4)
#assigning df for ages below 18
dh <- Howell1[Howell1$age < 18, ]
#verifying number of rows in this dh to be 192
nrow(dh)
## [1] 192
#a
q1 <-alist(height ~ dnorm(mu, sigma),
mu <- a + b * weight,
a ~ dnorm(120, 40),
b ~ dnorm(0, 10),
sigma ~ dunif(0,80)
)
m1 <- quap(q1, data = dh)
precis(m1, corr = TRUE)
## mean sd 5.5% 94.5%
## a 58.308208 1.39636168 56.076552 60.539863
## b 2.716651 0.06825855 2.607561 2.825742
## sigma 8.437170 0.43056436 7.749045 9.125295
#From the precis output, b is about 2.72, so for every 10 units of increase in weight the model predicts a child gets about 27 cm taller.
#b
plot(height ~ weight, data = dh, col = col.alpha(rangi2, 0.2))
seq_w <- seq(from = min(dh$weight), to = max(dh$weight), by = 1)
mu <- link(m1, data = data.frame(weight = seq_w))
mean_mu <- apply(mu, 2, mean)
#Superimpose the MAP regression line and the 89% interval for the mean
hpdi_mu <- apply(mu, 2, HPDI, prob = 0.89)
lines(seq_w, mean_mu)
shade(hpdi_mu, seq_w)
#89% interval for predicted heights
h_simulated <- sim(m1, data = list(weight = seq_w))
hpdi_h <- apply(h_simulated, 2, HPDI, prob = 0.89)
shade(hpdi_h, seq_w)
#c
#Looking at the plot, the raw data clearly curve: the relationship is steep for small children and flattens out for heavier (older) ones, so the straight line underestimates height at the extremes and overestimates it in the middle range of weights. This is concerning because the linear model systematically mispredicts height from weight. A better model would relax the linearity assumption, for example by modeling height as a function of log(weight) or by using a polynomial or spline in weight.
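#One optional way to relax the linearity assumption (a sketch; m1_log and its priors are made up for illustration) is to model height as a linear function of log(weight):
m1_log <- quap(
  alist(
    height ~ dnorm(mu, sigma),
    mu <- a + b * log(weight),
    a ~ dnorm(100, 100),
    b ~ dnorm(0, 100),
    sigma ~ dunif(0, 80)
  ), data = dh)
precis(m1_log)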