============================================================================================================
Reference: Farrell, M., & Myers-Smith, I. (April 17, 2018). Intro to Stan: Getting started with Bayesian modelling in Stan. “Coding Club”.
After our Intro to Stan lecture, I think it would be valuable to have you go through a similar exercise. Let’s test a second research question.
Research question: Is sea ice extent declining in the Southern Hemisphere over time? Is the same pattern happening in the Antarctic as in the Arctic? Fit a Stan model to find out!
Make sure you follow the steps we used in class.
What do your Stan model results indicate so far?
CONCLUSION: The sea ice extent is increasing in the Southern Hemisphere over time, which means the Antarctic and the Arctic share the opposite pattern. The Stan model with non-informative priors is built and its chains are examined to be converged.
seaice <- read.csv("~/Documents/HU/ANLY 505-90-O/6-Introduction to Stan/seaice.csv", stringsAsFactors=F)
options(scipen=100,digits=3)
seaice %>%
stat.desc() %>%
kable() %>% kable_styling(full_width=F)
| year | extent_north | extent_south | |
|---|---|---|---|
| nbr.val | 39.000 | 39.000 | 39.000 |
| nbr.null | 0.000 | 0.000 | 0.000 |
| nbr.na | 0.000 | 0.000 | 0.000 |
| min | 1979.000 | 10.151 | 10.693 |
| max | 2017.000 | 12.447 | 12.776 |
| range | 38.000 | 2.296 | 2.083 |
| sum | 77922.000 | 447.112 | 455.544 |
| median | 1998.000 | 11.668 | 11.673 |
| mean | 1998.000 | 11.464 | 11.681 |
| SE.mean | 1.826 | 0.106 | 0.065 |
| CI.mean.0.95 | 3.696 | 0.214 | 0.132 |
| var | 130.000 | 0.435 | 0.166 |
| std.dev | 11.402 | 0.659 | 0.407 |
| coef.var | 0.006 | 0.057 | 0.035 |
ANSWER: The dataset contains 39 observations of 3 variables, namely “year”, “extent_north” and “extent_south, starting from 1979 to 2017 and with no missing values. Such a data source can be at the National Snow & Ice Data Center. As for this analysis, the response variable is the sea ice extent in the Southern Hemisphere, i.e.,”extent_south", ranged from 10.693 to 12.776 (unit is \(10^6\ km^2\)).
seaice %>%
ggplot(aes(year,extent_south))+
geom_point()+
labs(x="Year",y="Sea ice extent in the Southern Hemisphere (10^6\ km^2)")+
ggtitle("Annual sea ice extent in the Southern Hemisphere: 1979-2017")
ANSWER: Compared to the North Hemisphere, the Southern sea ice extent fluctuated over time, increasing since 2011 and dropping significantly since 2016.
lm2a <- lm(extent_south~year, data=seaice)
summary(lm2a)
##
## Call:
## lm(formula = extent_south ~ year, data = seaice)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.2337 -0.1814 0.0159 0.1846 0.8881
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -14.19955 10.92558 -1.30 0.202
## year 0.01295 0.00547 2.37 0.023 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.384 on 37 degrees of freedom
## Multiple R-squared: 0.132, Adjusted R-squared: 0.108
## F-statistic: 5.61 on 1 and 37 DF, p-value: 0.0232
grid <- seaice %>%
data_grid(year) %>%
add_predictions(lm2a, "pred_south")
seaice %>%
ggplot(aes(year,extent_south))+
geom_point()+
geom_line(aes(y=pred_south),data=grid,color="red",size=1)+
labs(x="Year",y="Sea ice extent in the Southern Hemisphere (10^6\ km^2)")+
ggtitle("Annual sea ice extent in the Southern Hemisphere: 1979-2017")
ANSWER: The statistical mode is \(Extent_t = \alpha + \beta * Year_t + \epsilon_t\),
where
- \(\alpha\) is the mean response when Year is 0
- \(\beta\) is the change in mean response for a 1-unit change in Year
The coefficient’s p-value of 0.0232 is statistically significant at an alpha level of 0.05. The adjusted R-squared is 0.1082, indicating the linear model may not fit well.
seaice <- seaice %>%
mutate("index"=I(year)-1978) #index, regarding 1978 as 0
lm2b <- lm(extent_south~index, data=seaice)
summary(lm2b)
##
## Call:
## lm(formula = extent_south ~ index, data = seaice)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.2337 -0.1814 0.0159 0.1846 0.8881
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.42155 0.12549 91.02 <0.0000000000000002 ***
## index 0.01295 0.00547 2.37 0.023 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.384 on 37 degrees of freedom
## Multiple R-squared: 0.132, Adjusted R-squared: 0.108
## F-statistic: 5.61 on 1 and 37 DF, p-value: 0.0232
lm2b_alpha <- summary(lm2b)$coeff[1] #the intercept
lm2b_beta <- summary(lm2b)$coeff[2] #the slope
lm2b_sigma <- sigma(lm2b) #the residual error
stan_data2 <- list(N=nrow(seaice), y=seaice$extent_south, x=seaice$index)
ANSWER: After the year data were set to index from 1 to 30, the re-run of the linear model does not change the beta and the R-squared, but the alpha.
write("// Stan model for simple linear regression
data {
int < lower = 1 > N; // Sample size
vector[N] x; // Predictor
vector[N] y; // Outcome
}
parameters {
real alpha; // Intercept
real beta; // Slope (regression coefficients)
real < lower = 0 > sigma; // Error SD
}
model {
y ~ normal(alpha + beta * x, sigma);
}
generated quantities {
} // The posterior predictive distribution",
"stan_model2.stan")
stan_model2 <- "~/Documents/HU/ANLY 505-90-O/505 R/stan_model2.stan"
ANSWER: Stan model for the linear regression remains the same. It has three blocks, such as “data”, “parameters”, and “model”. The flat priors of uniform(-∞, ∞) are implicitly used. Non-informative priors are on the parameters.
(fit2 <- stan(file=stan_model2, data=stan_data2, warmup=500, iter=1000, chains=4, cores=2, thin=1))
## Inference for Stan model: stan_model2.
## 4 chains, each with iter=1000; warmup=500; thin=1;
## post-warmup draws per chain=500, total post-warmup draws=2000.
##
## mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
## alpha 11.43 0.00 0.13 11.17 11.35 11.43 11.51 11.69 907 1.00
## beta 0.01 0.00 0.01 0.00 0.01 0.01 0.02 0.02 1003 1.00
## sigma 0.40 0.00 0.05 0.32 0.36 0.39 0.43 0.51 827 1.00
## lp__ 16.27 0.05 1.32 12.81 15.77 16.62 17.20 17.73 611 1.01
##
## Samples were drawn using NUTS(diag_e) at Fri Jun 14 01:22:44 2019.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
ANSWER: The running process of the Stan model remains the same. The values of Rhat are all at 1; hence the chains have converged.
posterior2 <- extract(fit2)
p1 <- seaice %>%
add_predictions(lm2b,"pred_south") %>%
ggplot(aes(index,extent_south))+
geom_point()+
geom_line(aes(y=pred_south),color="red",size=1)+
labs(x="Index",y="Sea ice extent in the Southern Hemisphere (10^6\ km^2)")+
ggtitle("A general lm fit")
p2 <- seaice %>%
ggplot(aes(index,extent_south))+
labs(x="Index",y="Sea ice extent in the Southern Hemisphere (10^6\ km^2)")+
ggtitle("A Stan fit with non-informative priors")
for (i in 1:500) {
p2 <- p2+geom_abline(intercept=posterior2$alpha[i],slope=posterior2$beta[i],color="grey")
}
p2 <- p2+geom_point()+
geom_abline(intercept=mean(posterior2$alpha),slope=mean(posterior2$beta),color="blue",size=1)
gridExtra::grid.arrange(p1,p2, nrow=1, bottom="Models comparison")
ANSWER: The red regression line in the general lm fit and the blue regression line in the Stan fit are identical due to the non-informative priors. The multiple grey regression lines visualize the variability in the estimates from the posterior.