ANLY 505 - Problem Set #5

============================================================================================================
Reference: Farrell, M., & Myers-Smith, I. (April 17, 2018). Intro to Stan: Getting started with Bayesian modelling in Stan. “Coding Club”.

Intro to STAN Homework Set #1

After our Intro to Stan lecture, I think it would be valuable to have you go through a similar exercise. Let’s test a second research question.

Research question: Is sea ice extent declining in the Southern Hemisphere over time? Is the same pattern happening in the Antarctic as in the Arctic? Fit a Stan model to find out!

Make sure you follow the steps we used in class.

What do your Stan model results indicate so far?

CONCLUSION: The sea ice extent is increasing in the Southern Hemisphere over time, which means the Antarctic and the Arctic share the opposite pattern. The Stan model with non-informative priors is built and its chains are examined to be converged.

1. Load and Inspect Data

seaice <- read.csv("~/Documents/HU/ANLY 505-90-O/6-Introduction to Stan/seaice.csv", stringsAsFactors=F)
options(scipen=100,digits=3)
seaice %>%
  stat.desc() %>%
  kable() %>% kable_styling(full_width=F)

	year	extent_north	extent_south
nbr.val	39.000	39.000	39.000
nbr.null	0.000	0.000	0.000
nbr.na	0.000	0.000	0.000
min	1979.000	10.151	10.693
max	2017.000	12.447	12.776
range	38.000	2.296	2.083
sum	77922.000	447.112	455.544
median	1998.000	11.668	11.673
mean	1998.000	11.464	11.681
SE.mean	1.826	0.106	0.065
CI.mean.0.95	3.696	0.214	0.132
var	130.000	0.435	0.166
std.dev	11.402	0.659	0.407
coef.var	0.006	0.057	0.035

ANSWER: The dataset contains 39 observations of 3 variables, namely “year”, “extent_north” and “extent_south, starting from 1979 to 2017 and with no missing values. Such a data source can be at the National Snow & Ice Data Center. As for this analysis, the response variable is the sea ice extent in the Southern Hemisphere, i.e.,”extent_south", ranged from 10.693 to 12.776 (unit is \(10^6\ km^2\)).

2. Plot the data

seaice %>%
  ggplot(aes(year,extent_south))+
  geom_point()+
  labs(x="Year",y="Sea ice extent in the Southern Hemisphere (10^6\ km^2)")+
  ggtitle("Annual sea ice extent in the Southern Hemisphere: 1979-2017")

ANSWER: Compared to the North Hemisphere, the Southern sea ice extent fluctuated over time, increasing since 2011 and dropping significantly since 2016.

3. Run a general linear model using lm()

lm2a <- lm(extent_south~year, data=seaice)
summary(lm2a)

## 
## Call:
## lm(formula = extent_south ~ year, data = seaice)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.2337 -0.1814  0.0159  0.1846  0.8881 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## (Intercept) -14.19955   10.92558   -1.30    0.202  
## year          0.01295    0.00547    2.37    0.023 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.384 on 37 degrees of freedom
## Multiple R-squared:  0.132,  Adjusted R-squared:  0.108 
## F-statistic: 5.61 on 1 and 37 DF,  p-value: 0.0232

grid <- seaice %>%
  data_grid(year) %>%
  add_predictions(lm2a, "pred_south")

seaice %>%
  ggplot(aes(year,extent_south))+
  geom_point()+
  geom_line(aes(y=pred_south),data=grid,color="red",size=1)+
  labs(x="Year",y="Sea ice extent in the Southern Hemisphere (10^6\ km^2)")+
  ggtitle("Annual sea ice extent in the Southern Hemisphere: 1979-2017")

ANSWER: The statistical mode is \(Extent_t = \alpha + \beta * Year_t + \epsilon_t\),

where
- \(\alpha\) is the mean response when Year is 0
- \(\beta\) is the change in mean response for a 1-unit change in Year

The coefficient’s p-value of 0.0232 is statistically significant at an alpha level of 0.05. The adjusted R-squared is 0.1082, indicating the linear model may not fit well.

4. Prepare the data, re-run the lm() and extract summary statistics

seaice <- seaice %>%
  mutate("index"=I(year)-1978) #index, regarding 1978 as 0
lm2b <- lm(extent_south~index, data=seaice)
summary(lm2b)

## 
## Call:
## lm(formula = extent_south ~ index, data = seaice)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.2337 -0.1814  0.0159  0.1846  0.8881 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept) 11.42155    0.12549   91.02 <0.0000000000000002 ***
## index        0.01295    0.00547    2.37               0.023 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.384 on 37 degrees of freedom
## Multiple R-squared:  0.132,  Adjusted R-squared:  0.108 
## F-statistic: 5.61 on 1 and 37 DF,  p-value: 0.0232

lm2b_alpha <- summary(lm2b)$coeff[1] #the intercept
lm2b_beta <- summary(lm2b)$coeff[2] #the slope
lm2b_sigma <- sigma(lm2b) #the residual error
stan_data2 <- list(N=nrow(seaice), y=seaice$extent_south, x=seaice$index)

ANSWER: After the year data were set to index from 1 to 30, the re-run of the linear model does not change the beta and the R-squared, but the alpha.

5. Write the Stan model

write("// Stan model for simple linear regression

data {
 int < lower = 1 > N; // Sample size
 vector[N] x; // Predictor
 vector[N] y; // Outcome
}

parameters {
 real alpha; // Intercept
 real beta; // Slope (regression coefficients)
 real < lower = 0 > sigma; // Error SD
}

model {
 y ~ normal(alpha + beta * x, sigma);
}

generated quantities {
} // The posterior predictive distribution",

"stan_model2.stan")

stan_model2 <- "~/Documents/HU/ANLY 505-90-O/505 R/stan_model2.stan"

ANSWER: Stan model for the linear regression remains the same. It has three blocks, such as “data”, “parameters”, and “model”. The flat priors of uniform(-∞, ∞) are implicitly used. Non-informative priors are on the parameters.

6. Run the Stan model

(fit2 <- stan(file=stan_model2, data=stan_data2, warmup=500, iter=1000, chains=4, cores=2, thin=1))

## Inference for Stan model: stan_model2.
## 4 chains, each with iter=1000; warmup=500; thin=1; 
## post-warmup draws per chain=500, total post-warmup draws=2000.
## 
##        mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
## alpha 11.43    0.00 0.13 11.17 11.35 11.43 11.51 11.69   907 1.00
## beta   0.01    0.00 0.01  0.00  0.01  0.01  0.02  0.02  1003 1.00
## sigma  0.40    0.00 0.05  0.32  0.36  0.39  0.43  0.51   827 1.00
## lp__  16.27    0.05 1.32 12.81 15.77 16.62 17.20 17.73   611 1.01
## 
## Samples were drawn using NUTS(diag_e) at Fri Jun 14 01:22:44 2019.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).

ANSWER: The running process of the Stan model remains the same. The values of Rhat are all at 1; hence the chains have converged.

7. Compare your results to our results with “lm”

posterior2 <- extract(fit2)

p1 <- seaice %>%
  add_predictions(lm2b,"pred_south") %>%
  ggplot(aes(index,extent_south))+
  geom_point()+
  geom_line(aes(y=pred_south),color="red",size=1)+
  labs(x="Index",y="Sea ice extent in the Southern Hemisphere (10^6\ km^2)")+
  ggtitle("A general lm fit")

p2 <- seaice %>%
  ggplot(aes(index,extent_south))+
  labs(x="Index",y="Sea ice extent in the Southern Hemisphere (10^6\ km^2)")+
  ggtitle("A Stan fit with non-informative priors")
for (i in 1:500) {
  p2 <- p2+geom_abline(intercept=posterior2$alpha[i],slope=posterior2$beta[i],color="grey")
}
p2 <- p2+geom_point()+
  geom_abline(intercept=mean(posterior2$alpha),slope=mean(posterior2$beta),color="blue",size=1)

gridExtra::grid.arrange(p1,p2, nrow=1, bottom="Models comparison")

ANSWER: The red regression line in the general lm fit and the blue regression line in the Stan fit are identical due to the non-informative priors. The multiple grey regression lines visualize the variability in the estimates from the posterior.