Intro to STAN Homework Part #1

After our Intro to Stan lecture I think it would be valuable to have you go through a similar exercise. Let’s test a second research question.

Research question: Is sea ice extent declining in the Southern Hemisphere over time? Is the same pattern happening in the Antarctic as in the Arctic? Fit a Stan model to find out!

Make sure you follow the steps we used in class.

What do your Stan model results indicate so far?

1. Load and Inspect Data

seaice <- read.csv("seaice.csv", stringsAsFactors = FALSE)
head(seaice)
##   year extent_north extent_south
## 1 1979       12.328       11.700
## 2 1980       12.337       11.230
## 3 1981       12.127       11.435
## 4 1982       12.447       11.640
## 5 1983       12.332       11.389
## 6 1984       11.910       11.454
str(seaice)
## 'data.frame':    39 obs. of  3 variables:
##  $ year        : int  1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 ...
##  $ extent_north: num  12.3 12.3 12.1 12.4 12.3 ...
##  $ extent_south: num  11.7 11.2 11.4 11.6 11.4 ...

2. Plot the data

library(ggplot2)
plot(extent_south ~ year, pch = 1, data = seaice, 
     main = "Sea ice extent in the South, over time", 
     xlab = "Year", ylab = "Sea ice in the south", 
     col = "red") 

abline(lm(seaice$extent_south ~ seaice$year), 
       col = "black", lty = 2, lwd = 2) 

# according to the plot, the sea ice varies over time and the variance increase over time. There is no obvious trend

3. Run a general linear model using lm()

lm1 <- lm(extent_south~year, data=seaice)
summary(lm1)
## 
## Call:
## lm(formula = extent_south ~ year, data = seaice)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.23372 -0.18142  0.01587  0.18465  0.88814 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)  
## (Intercept) -14.199551  10.925576  -1.300   0.2018  
## year          0.012953   0.005468   2.369   0.0232 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3843 on 37 degrees of freedom
## Multiple R-squared:  0.1317, Adjusted R-squared:  0.1082 
## F-statistic: 5.611 on 1 and 37 DF,  p-value: 0.02318
# variable `year` has a p-value of 0.0232, which is statistically significant at an alpha level of 0.05. The adjusted R-squared is 0.1082

4. Index the data, re-run the lm(), extract summary statistics and turn the indexed data into a dataframe to pass into Stan

x <- I(seaice$year - 1978) 
y <- seaice$extent_south

lm2 <- lm(y ~ x)
summary(lm2)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.23372 -0.18142  0.01587  0.18465  0.88814 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.421555   0.125490  91.015   <2e-16 ***
## x            0.012953   0.005468   2.369   0.0232 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3843 on 37 degrees of freedom
## Multiple R-squared:  0.1317, Adjusted R-squared:  0.1082 
## F-statistic: 5.611 on 1 and 37 DF,  p-value: 0.02318
# after setting year index from 1 to 30, the new lm model doesn't change the beta and r-squared

5. Write the Stan model

stan_data <- list(N = nrow(seaice), x = x, y = y)

 write("// Stan model for simple linear regression

data {
 int < lower = 1 > N; // Sample size
 vector[N] x; // Predictor
 vector[N] y; // Outcome
}

parameters {
 real alpha; // Intercept
 real beta; // Slope (regression coefficients)
 real < lower = 0 > sigma; // Error SD
}

model {
 y ~ normal(alpha + x * beta , sigma);
}

generated quantities {
} // The posterior predictive distribution",

"stan_model1.stan")


stan_model1 <- "stan_model1.stan"

6. Run the Stan model and inspect the results

fit <- stan(file = stan_model1, data = stan_data, warmup = 500, iter = 1000, chains = 4, cores = 2, thin = 1)
## Trying to compile a simple C file
## Running /Library/Frameworks/R.framework/Resources/bin/R CMD SHLIB foo.c
## clang -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG   -I"/Library/Frameworks/R.framework/Versions/3.5/Resources/library/Rcpp/include/"  -I"/Library/Frameworks/R.framework/Versions/3.5/Resources/library/RcppEigen/include/"  -I"/Library/Frameworks/R.framework/Versions/3.5/Resources/library/RcppEigen/include/unsupported"  -I"/Library/Frameworks/R.framework/Versions/3.5/Resources/library/BH/include" -I"/Library/Frameworks/R.framework/Versions/3.5/Resources/library/StanHeaders/include/src/"  -I"/Library/Frameworks/R.framework/Versions/3.5/Resources/library/StanHeaders/include/"  -I"/Library/Frameworks/R.framework/Versions/3.5/Resources/library/rstan/include" -DEIGEN_NO_DEBUG  -D_REENTRANT  -DBOOST_DISABLE_ASSERTS -DBOOST_PENDING_INTEGER_LOG2_HPP -include stan/math/prim/mat/fun/Eigen.hpp   -I/usr/local/include   -fPIC  -Wall -g -O2  -c foo.c -o foo.o
## In file included from <built-in>:1:
## In file included from /Library/Frameworks/R.framework/Versions/3.5/Resources/library/StanHeaders/include/stan/math/prim/mat/fun/Eigen.hpp:13:
## In file included from /Library/Frameworks/R.framework/Versions/3.5/Resources/library/RcppEigen/include/Eigen/Dense:1:
## In file included from /Library/Frameworks/R.framework/Versions/3.5/Resources/library/RcppEigen/include/Eigen/Core:88:
## /Library/Frameworks/R.framework/Versions/3.5/Resources/library/RcppEigen/include/Eigen/src/Core/util/Macros.h:613:1: error: unknown type name 'namespace'
## namespace Eigen {
## ^
## /Library/Frameworks/R.framework/Versions/3.5/Resources/library/RcppEigen/include/Eigen/src/Core/util/Macros.h:613:16: error: expected ';' after top level declarator
## namespace Eigen {
##                ^
##                ;
## In file included from <built-in>:1:
## In file included from /Library/Frameworks/R.framework/Versions/3.5/Resources/library/StanHeaders/include/stan/math/prim/mat/fun/Eigen.hpp:13:
## In file included from /Library/Frameworks/R.framework/Versions/3.5/Resources/library/RcppEigen/include/Eigen/Dense:1:
## /Library/Frameworks/R.framework/Versions/3.5/Resources/library/RcppEigen/include/Eigen/Core:96:10: fatal error: 'complex' file not found
## #include <complex>
##          ^~~~~~~~~
## 3 errors generated.
## make: *** [foo.o] Error 1
fit
## Inference for Stan model: stan_model1.
## 4 chains, each with iter=1000; warmup=500; thin=1; 
## post-warmup draws per chain=500, total post-warmup draws=2000.
## 
##        mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
## alpha 11.42    0.00 0.13 11.15 11.33 11.42 11.51 11.67   781    1
## beta   0.01    0.00 0.01  0.00  0.01  0.01  0.02  0.02   891    1
## sigma  0.40    0.00 0.05  0.32  0.36  0.39  0.43  0.51   917    1
## lp__  16.23    0.05 1.32 12.74 15.67 16.61 17.18 17.72   675    1
## 
## Samples were drawn using NUTS(diag_e) at Tue Mar 24 20:18:01 2020.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).

7. Extract the posterior estimates into a list so we can plot them

posterior <- extract(fit)
str(posterior)
## List of 4
##  $ alpha: num [1:2000(1d)] 11.1 11.4 11.6 11.6 11.5 ...
##   ..- attr(*, "dimnames")=List of 1
##   .. ..$ iterations: NULL
##  $ beta : num [1:2000(1d)] 0.02353 0.00668 0.00103 0.00799 0.01225 ...
##   ..- attr(*, "dimnames")=List of 1
##   .. ..$ iterations: NULL
##  $ sigma: num [1:2000(1d)] 0.423 0.457 0.413 0.348 0.354 ...
##   ..- attr(*, "dimnames")=List of 1
##   .. ..$ iterations: NULL
##  $ lp__ : num [1:2000(1d)] 14.8 12.7 15.2 16.7 17.2 ...
##   ..- attr(*, "dimnames")=List of 1
##   .. ..$ iterations: NULL

8. Compare your results to our results to “lm”

plot(y ~ x, pch = 20, main="Comaring the resut of lm and Stan Model")
abline(lm2, col = "blue", pch=22, lty = 2, lw = 3)
abline( mean(posterior$alpha), mean(posterior$beta), col = "green", lw = 1)
legend("topleft",c("Linear Model","Stan Model"),fill=c("blue","green"))

# two lines representing lm and stan model are almost the same, they overlay with each other

9. Plot multiple estimates from the posterior

plot(y ~ x, pch = 20, main="Multiple estimates - posterior")

for (i in 1:500) {
 abline(posterior$alpha[i], posterior$beta[i], col = "yellow", lty = 1)
}

abline(lm1, col = "blue", pch=22, lty = 2, lw = 3)
abline( mean(posterior$alpha), mean(posterior$beta), col = "red", lw = 1)
legend("topleft",c("Linear Model","Stan Model", "Multiple Estimates"),fill=c("blue","red","yellow"))

# The multiple yellow regression lines visualize the variability in the estimates from the posterior. Sea ice is not declining in the Southern Hemisphere over time.