Intro to STAN Homework Set #1

After our Intro to Stan lecture I think it would be valuable to have you go through a similar exercise. Let’s test a second research question.

Research question: Is sea ice extent declining in the Southern Hemisphere over time? Is the same pattern happening in the Antarctic as in the Arctic? Fit a Stan model to find out!

Make sure you follow the steps we used in class.

What do your Stan model results indicate so far?

Load and Inspect Data

setwd('C:/Users/bneer/OneDrive/Desktop/Harrisburg University-Modelling and data simulation')
seaice <- read.csv('seaice.csv')
colnames(seaice) <- c("year", "extent_north", "extent_south")
head(seaice)
##   year extent_north extent_south
## 1 1979       12.328       11.700
## 2 1980       12.337       11.230
## 3 1981       12.127       11.435
## 4 1982       12.447       11.640
## 5 1983       12.332       11.389
## 6 1984       11.910       11.454
seaice_temp <- seaice
seaice_temp$year <- factor(seaice_temp$year)
summary(seaice_temp)
##       year     extent_north    extent_south  
##  1979   : 1   Min.   :10.15   Min.   :10.69  
##  1980   : 1   1st Qu.:10.90   1st Qu.:11.42  
##  1981   : 1   Median :11.67   Median :11.67  
##  1982   : 1   Mean   :11.46   Mean   :11.68  
##  1983   : 1   3rd Qu.:11.98   3rd Qu.:11.88  
##  1984   : 1   Max.   :12.45   Max.   :12.78  
##  (Other):33
dim(seaice_temp)
## [1] 39  3
range(seaice$year)
## [1] 1979 2017

The dataset has a total of 39 observations, which is nothing but the data from 1979 to 2017 as each year is a unique observation. The dataset contains the seaice extent in the northern and southern hemispheres year by year from 1979 to 2017. On average the ice extent in northern and southern hemisphere is 11.46 and 11.68 units respectively.

Our response variable the seaice extent in southern hemisphere ranging from 10.69 to 12.78 respectively.

Looking at the distribution of the response variable

x <- seaice$extent_south
h<-hist(x, breaks=10, col="red", xlab="Sea Ice extent in Southern hemisphere",
   main="Histogram with Normal Curve")
xfit<-seq(min(x),max(x),length=40)
yfit<-dnorm(xfit,mean=mean(x),sd=sd(x))
yfit <- yfit*diff(h$mids[1:2])*length(x)
lines(xfit, yfit, col="blue", lwd=2)

The seaice content of the southern hemisphere appears to be normally distributed.

Plot the data

plot(extent_south ~ year, pch = 20, data = seaice, main = "Annual sea ice extent in the Southern Hemisphere: 1979-2017", ylab = 'Seaice extent in Southern Hemisphere')

Seaice extent in the southern hemisphere does appear to exhibit positive relationship with year but this does not look like a complete linear relationship. We can see that the seaice extent increased in the Southern hemisphere significantly from 2011 but then decreased rapidly in 2016 and 2017.

Run a general linear model using lm()

lm_south <- lm(extent_south ~ year, data = seaice)
summary(lm_south)
## 
## Call:
## lm(formula = extent_south ~ year, data = seaice)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.23372 -0.18142  0.01587  0.18465  0.88814 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)  
## (Intercept) -14.199551  10.925576  -1.300   0.2018  
## year          0.012953   0.005468   2.369   0.0232 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3843 on 37 degrees of freedom
## Multiple R-squared:  0.1317, Adjusted R-squared:  0.1082 
## F-statistic: 5.611 on 1 and 37 DF,  p-value: 0.02318
plot(extent_south ~ year, pch = 20, data = seaice, main = "Annual sea ice extent in the Southern Hemisphere: 1979-2017", ylab = 'Seaice extent in Southern Hemisphere')
abline(lm_south, col = 2, lty = 2, lw = 3)

Running a linear model we cans ee that even though the effect of year on the response varibale seaice content in southern hemisphere is significant the multiple-r squared of the model is just about 13% meaning only 13% of the variance in southern hemisphere can be explained by year. Thus this is not a very good model.

Prepare the data, re-run the lm() and extract summary statistics

– Preparing the data

x <- I(seaice$year - 1978)
y <- seaice$extent_south
N <- length(seaice$year)

– Rerunning the model

lm_south2 <- lm(y ~ x)
summary(lm_south2)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.23372 -0.18142  0.01587  0.18465  0.88814 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.421555   0.125490  91.015   <2e-16 ***
## x            0.012953   0.005468   2.369   0.0232 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3843 on 37 degrees of freedom
## Multiple R-squared:  0.1317, Adjusted R-squared:  0.1082 
## F-statistic: 5.611 on 1 and 37 DF,  p-value: 0.02318

One important thing to do was to make the year variable from 1 to 39 as we don’t want to use the integer values of 1979 to 2017 into the model. The goal is to observe changes year by year in seaice extent of southern hemisphere for which using the 39 years data from 1 to 39 is a better option. Here we see a huge difference in intercept. Previously using the integer values of year we got and an intercept of about -14 and now its about 11. But there is no change in the slope and R-squared.

– Extracting the key summary statistics from the simple model to compare it with the stan model

lm_alpha <- summary(lm_south2)$coeff[1]  # the intercept
lm_beta <- summary(lm_south2)$coeff[2]  # the slope
lm_sigma <- sigma(lm_south2)  # the residual error

– preparing the data for stan

stan_data <- list(N = N, x = x, y = y)

Write the Stan model

write("// Stan model for simple linear regression

      data {
      int < lower = 1 > N; // Sample size
      vector[N] x; // Predictor
      vector[N] y; // Outcome
      }
      
      parameters {
      real alpha; // Intercept
      real beta; // Slope (regression coefficients)
      real < lower = 0 > sigma; // Error SD
      }
      
      model {
      y ~ normal(alpha + x * beta , sigma);
      }
      
      generated quantities {
      } // The posterior predictive distribution",

"stan_model2.stan")

Run the Stan model

library(rstan)
## Loading required package: StanHeaders
## Loading required package: ggplot2
## rstan (Version 2.19.2, GitRev: 2e1f913d3ca3)
## For execution on a local, multicore CPU with excess RAM we recommend calling
## options(mc.cores = parallel::detectCores()).
## To avoid recompilation of unchanged Stan programs, we recommend calling
## rstan_options(auto_write = TRUE)
## For improved execution time, we recommend calling
## Sys.setenv(LOCAL_CPPFLAGS = '-march=native')
## although this causes Stan to throw an error on a few processors.
library(gdata)
## gdata: Unable to locate valid perl interpreter
## gdata: 
## gdata: read.xls() will be unable to read Excel XLS and XLSX files
## gdata: unless the 'perl=' argument is used to specify the location
## gdata: of a valid perl intrpreter.
## gdata: 
## gdata: (To avoid display of this message in the future, please
## gdata: ensure perl is installed and available on the executable
## gdata: search path.)
## gdata: Unable to load perl libaries needed by read.xls()
## gdata: to support 'XLX' (Excel 97-2004) files.
## 
## gdata: Unable to load perl libaries needed by read.xls()
## gdata: to support 'XLSX' (Excel 2007+) files.
## 
## gdata: Run the function 'installXLSXsupport()'
## gdata: to automatically download and install the perl
## gdata: libaries needed to support Excel XLS and XLSX formats.
## 
## Attaching package: 'gdata'
## The following object is masked from 'package:stats':
## 
##     nobs
## The following object is masked from 'package:utils':
## 
##     object.size
## The following object is masked from 'package:base':
## 
##     startsWith
library(bayesplot)
## This is bayesplot version 1.7.0
## - Online documentation and vignettes at mc-stan.org/bayesplot
## - bayesplot theme set to bayesplot::theme_default()
##    * Does _not_ affect other ggplot2 plots
##    * See ?bayesplot_theme_set for details on theme setting
# saving the file path
stan_model2 <- "stan_model2.stan"

# getting the model fit
fit <- stan(file = stan_model2, data = stan_data, warmup = 500, 
            iter = 1000, chains = 4, cores = 2, thin = 1)

fit
## Inference for Stan model: stan_model2.
## 4 chains, each with iter=1000; warmup=500; thin=1; 
## post-warmup draws per chain=500, total post-warmup draws=2000.
## 
##        mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
## alpha 11.42    0.00 0.13 11.15 11.32 11.42 11.51 11.68   840    1
## beta   0.01    0.00 0.01  0.00  0.01  0.01  0.02  0.02   905    1
## sigma  0.40    0.00 0.05  0.32  0.36  0.39  0.43  0.50  1041    1
## lp__  16.32    0.05 1.22 13.23 15.73 16.63 17.19 17.73   682    1
## 
## Samples were drawn using NUTS(diag_e) at Mon Oct 14 02:17:37 2019.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).

The values of Rhat are about 1, which means that the chains have converged.

Compare your results to our results to “lm”

posterior <- extract(fit)

par(mfrow = c(1,2))
plot(y ~ x, pch = 20, main = 'General lm fit', ylab = 'Seaice content in Southern Hemisphere', xlab = 'Year')
abline(lm_south2, col = 2, lty = 2, lw = 3)

plot(y ~ x, pch = 20, main = 'Stan fit', ylab = 'Seaice content in Southern Hemisphere', xlab = 'Year')
abline(mean(posterior$alpha), mean(posterior$beta), col = 6, lw = 2)

The stan model looks identical to the lm model. This is because we are using a simple model and have put non-informative priors on our parameters.