MATH3841 Assignment 2

Mitchell Ward 3330753

I acknowledge I collaborated with Christian Webb.

Part 1: time Series

Question 1

A log transformation of the river flow data appears Gaussian:

elbe <- read.csv("~/Downloads/elbe.csv", sep = ";")
qqnorm(log(elbe$flow), main = "log(elbe$flow) Normal Q-Q Plot")

plot of chunk unnamed-chunk-1

A timeseries is created from the transformed data:

elbe.ts = ts(log(elbe$flow), start = 1980, frequency = 12)

Question 2a

The decomposition is plotted below.

elbe.decomp = decompose(elbe.ts)
plot(elbe.decomp)

plot of chunk unnamed-chunk-3

Qualitatively, the randomness data does appear somewhat like whitenoise.

Question 2b

To get a quick feel for the randomness data, it's normal Q-Q plot was calculated:

qqnorm(elbe.decomp$random)
qqline(elbe.decomp$random)

plot of chunk unnamed-chunk-4

This data appears to come from a Gaussian distribution. Now, importantly, inspecting the autocovariance:

acf(ts(elbe.decomp$random), na.action = na.pass)

plot of chunk unnamed-chunk-5

pacf(ts(elbe.decomp$random), na.action = na.pass)

plot of chunk unnamed-chunk-5

This data is not convicingly white noise. Some ACF values at lag \( h>0 \) fall outside \( \pm 1.96/\sqrt{n} \) in a damped sinusoidal fashion, and there are PACF values significantly different from zero up to a lag of 8.

To confirm this hypothesis, we can compare the ACF/PACF to that of a true white noise process:

norm.ts = ts(rnorm(300))
acf(norm.ts)

plot of chunk unnamed-chunk-6

pacf(norm.ts)

plot of chunk unnamed-chunk-6

As expected, for \( h>0 \), this ACF does fall within \( \pm 1.96/\sqrt{n} \) implying the autocovariance is not significantly different from zero. The log of the river data cannot be considered white noise.

Question 2c

A simple loop was written to find the best \( \text{ARMA}(p,q) \) model for \( p+q\le 4 \), deciding based on the AIC. (Note: the question asks for the maximum AIC, but my understanding is it should be minimised.)

min_aic = Inf
best_pdq = c(0, 0)
for (p in 0:4) {
    for (q in 0:(4 - p)) {
        if (p > 0 || q > 0) {
            elbe.arma = arima(elbe.decomp$random, order = c(p, 0, q))
            print(c(p, q, elbe.arma$aic))
            if (elbe.arma$aic < min_aic) {
                min_aic = elbe.arma$aic
                best_pdq = c(p, 0, q)
            }
        }
    }
}

## [1]   0.0   1.0 304.3
## [1]   0.0   2.0 297.5
## [1]   0.0   3.0 257.7
## [1]   0.0   4.0 247.1
## [1]   1.0   0.0 306.9
## [1]   1.0   1.0 301.8
## [1]   1.0   2.0 247.2
## [1]   1.0   3.0 248.3
## [1]   2   0 299
## [1]   2.0   1.0 242.1
## [1]   2.0   2.0 238.7
## [1]   3   0 300
## [1]   3   1 243
## [1]   4.0   0.0 290.1


print(best_pdq)

## [1] 2 0 2

The ACF and PACF for the best model fit, \( \text{ARMA}(2, 2) \), were plotted:

elbe.arma = arima(elbe.decomp$random, order = best_pdq)
acf(elbe.arma$residuals, na.action = na.pass)

plot of chunk unnamed-chunk-8

pacf(elbe.arma$residuals, na.action = na.pass)

plot of chunk unnamed-chunk-8

The ACF and PACF appear within bounds and show no significant pattern.

Question 3a

A 12-month \( \text{ARIMA}(0,0,0)\times(0,1,0) \) process was fitted to the data:

elbe.arima <- arima(elbe.ts, order = c(0, 0, 0), seasonal = c(0, 1, 0))
acf(elbe.arima$residuals)

plot of chunk unnamed-chunk-9

pacf(elbe.arima$residuals)

plot of chunk unnamed-chunk-9

This time series satisfies \( (1-B^{12})X_t=Z_t \). As this is just the lag 12 difference series, there will be a strong negative correlation between pairs of data exactly one season apart.

Question 3b

The sample ACF is significantly non-zero for interger lags 0 and 1. This might imply a \( \text{AR}(1) \) model. The sample PACF is also significantly non-zero for integer lags 0 and 1. This might imply \( \text{MA}(1) \) model. Using this as the starting point, and examing the AIC for the fits, a \( \text{ARIMA}(0,0,0)\times(0,1,1) \) model is chosen. It yields the following ACF and PACF:

elbe.arima <- arima(elbe.ts, order = c(0, 0, 0), seasonal = c(0, 1, 1))
acf(elbe.arima$residuals)

plot of chunk unnamed-chunk-10

pacf(elbe.arima$residuals)

plot of chunk unnamed-chunk-10

This pulls the between-season ACF and PACF values closer to zero. Clearly there is still correlation between adjacent months to be explained.

Question 3c

The ACF in part 3b might suggest a AR process with \( p\in 2,\cdots,5 \). The PACF might suggest a MA process with \( q=2 \). A \( \text{ARIMA}(2,0,2)\times(0,1,1) \) model was chosen, yielding the following ACF and PACF:

elbe.arima <- arima(elbe.ts, order = c(2, 0, 2), seasonal = c(0, 1, 1))
acf(elbe.arima$residuals)

plot of chunk unnamed-chunk-11

pacf(elbe.arima$residuals)

plot of chunk unnamed-chunk-11

The ACF and PACF appear very similar to that seen of white noise in the second half of question 2b.

Question 3d

The coefficients can be found by printing the object:

elbe.arima

## 
## Call:
## arima(x = elbe.ts, order = c(2, 0, 2), seasonal = c(0, 1, 1))
## 
## Coefficients:
##         ar1    ar2     ma1     ma2    sma1
##       0.489  0.393  -0.083  -0.450  -0.935
## s.e.  0.293  0.247   0.277   0.142   0.064
## 
## sigma^2 estimated as 0.223:  log likelihood = -205,  aic = 421.9

The model equation will be of the form:

\[ Y_t =(1-B^{12}) X_t \\ (1-\phi_1 B-\phi_2 B^2) Y_t = (1+\theta_1 B+\theta_2 B^2)(1+\Theta_1 B^{12})Z_t \]

From our fit, we find \( \phi_1=0.4887, \phi_2=0.3933, \theta_1=-0.0827, \theta_2=-0.4502, \Theta_1=-0.9351 \). Hence, the model equation is

\[ (1-0.4887B-0.3933B^2)(1-B^{12})X_t = (1-0.0827B-0.4502B^2)(1-0.9351B^{12})Z_t \]

where \( Z_t \sim WN(0, 0.2234) \).

Question 3e

As required, we fit a \( \text{ARMA}(3,0) \) model to the residuals of the seasonal ARIMA process.

elbe.arima <- arima(elbe.ts, order = c(0, 0, 0), seasonal = c(0, 1, 1))
elbe.arima.res.arma <- arima(elbe.arima$residuals, order = c(3, 0, 0))
elbe.arima.res.arma

## 
## Call:
## arima(x = elbe.arima$residuals, order = c(3, 0, 0))
## 
## Coefficients:
##         ar1     ar2    ar3  intercept
##       0.412  -0.028  0.205     -0.097
## s.e.  0.057   0.061  0.057      0.067
## 
## sigma^2 estimated as 0.228:  log likelihood = -204.2,  aic = 418.4

I was somewhat confused by the second half of this question, but my interpretation was to calculate Yule-Walker estimates for the time series of seasonal ARIMA residuals:

elbe.arima.res.yule <- ar(elbe.arima$residuals, method = "yule-walker")
elbe.arima.res.yule

## 
## Call:
## ar(x = elbe.arima$residuals, method = "yule-walker")
## 
## Coefficients:
##      1       2       3  
##  0.410  -0.034   0.200  
## 
## Order selected 3  sigma^2 estimated as  0.233

A 95% confidence intermal of the \( \text{ARMA}(3,0) \) coefficients can be calculated:

# AR1
c(elbe.arima.res.arma$coef[1] - 1.96 * sqrt(elbe.arima.res.arma$var.coef[1, 
    1]), elbe.arima.res.arma$coef[1] + 1.96 * sqrt(elbe.arima.res.arma$var.coef[1, 
    1]))

##    ar1    ar1 
## 0.3013 0.5230


# AR2
c(elbe.arima.res.arma$coef[2] - 1.96 * sqrt(elbe.arima.res.arma$var.coef[2, 
    2]), elbe.arima.res.arma$coef[2] + 1.96 * sqrt(elbe.arima.res.arma$var.coef[2, 
    2]))

##      ar2      ar2 
## -0.14851  0.09225


# AR3
c(elbe.arima.res.arma$coef[3] - 1.96 * sqrt(elbe.arima.res.arma$var.coef[3, 
    3]), elbe.arima.res.arma$coef[3] + 1.96 * sqrt(elbe.arima.res.arma$var.coef[3, 
    3]))

##     ar3     ar3 
## 0.09352 0.31627

The Yule-Walker estimates fall within this interval (and likely in a interval of much higher confidence). Both models are not significantly different.

Part 2: Spatial Statistics

The sand height data was loaded into a geodata object, and plotted to get a quick feel.

library(geoR)
sandpit = read.table("~/Desktop/sandpit.txt", header = T, quote = "\"")
sandpit.geo = as.geodata(sandpit)

plot(sandpit.geo)

plot of chunk unnamed-chunk-16

Question 1

A linear model \( \mu(x,y)=\beta_0 + \beta_1 y \) was fitted to the sand heights.

sandpit.lm = lm(sandpit.geo$data ~ sandpit.geo$coords[, "y"])
summary(sandpit.lm)

## 
## Call:
## lm(formula = sandpit.geo$data ~ sandpit.geo$coords[, "y"])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.5264 -0.3287 -0.0208  0.3447  1.6747 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 1.0449     0.0241    43.4   <2e-16 ***
## sandpit.geo$coords[, "y"]  -1.1661     0.0235   -49.6   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.53 on 498 degrees of freedom
## Multiple R-squared:  0.832,  Adjusted R-squared:  0.832 
## F-statistic: 2.47e+03 on 1 and 498 DF,  p-value: <2e-16

Clearly the model explains a significant amount of variance in the dataset.

Question 2

The residuals of the trend surface were investigated:

mean(sandpit.lm$residuals)

## [1] 1.137e-16

qqnorm(sandpit.lm$residuals)
qqline(sandpit.lm$residuals)

plot of chunk unnamed-chunk-18

These residuals do appear to resemble a zero-mean Gaussian random field.

Question 3

A new geodata object was created for the trend surface residuals. This object is what we are investigating as a weakly stationary random field.

residual.geo = sandpit.geo
residual.geo$data = sandpit.lm$residuals
plot(residual.geo)

The directional variograms in the directions of \( {0,\pi/8,\cdots, 7\pi/8} \) were plotted against the omnidirectional variogram.

plot(variog(residual.geo))
lines(variog(residual.geo), col = 3)
directions = seq(0, 7 * pi/8, length.out = 8)
for (angle in directions) {
    lines(variog(residual.geo, direction = angle))
}

plot of chunk unnamed-chunk-20

They all follow roughly the same trend and show little evidence of directional dependence in the residual data.

In contrast, plotting the directional variograms of the sand height data shows strong directional dependence:

plot(variog(sandpit.geo))
lines(variog(sandpit.geo), col = 3)
directions = seq(0, 7 * pi/8, length.out = 8)
for (angle in directions) {
    lines(variog(sandpit.geo, direction = angle))
}

plot of chunk unnamed-chunk-21

Despite the lack of a legend, it is evident the sand height dataset is anisotropic.

Question 4

From the lecture notes, we know a good approach to binning of irregularly sampled datasets for empirical semivariograms is to:

Ensure each bin has at least 30 samples.
Only calculate the semivariogram up to half the maximum distance between two samples.

A (poor) metric was used to find the maximum distance:

sqrt((max(sandpit$x) - min(sandpit$x))^2 + (max(sandpit$y) - min(sandpit$y))^2)

## [1] 5.521

From this, we decide to calculate the empircal semivariogram up to a lag of 2.5.

residual.variog = variog(residual.geo, uvec = seq(0, 2.5, l = 11), pairs.min = 30)

If each bin has >30 pairs, the following should be true:

print(all(residual.variog$ind.bin))

## [1] TRUE

As we see, there are at least 30 pairs in each bin.

We now fit a Matern model semivariogram to the data, providng an initial sill estimate of 0.275 and range of 0.75. These values were chosen by inspecting the emperical semivariogram.

# Fit Matern model. Approx sill and range read from emperical variog
residual.matern.variog = variofit(residual.variog, cov.model = "matern", ini = c(0.275, 
    0.75), fix.kappa = FALSE)
plot(residual.variog)
lines(residual.matern.variog)

plot of chunk unnamed-chunk-25

residual.matern.variog

## variofit: model parameters estimated by WLS (weighted least squares):
## covariance model is: matern
## parameter estimates:
##   tausq sigmasq     phi   kappa 
##   0.100   0.179   0.018  41.736 
## Practical Range with cor=0.05 for asymptotic range: 0.3956
## 
## variofit: minimised weighted sum of squares = 9.784

The fitted semivariogram line appears to match the emperical semivariogram nicely. The variofit object reports a partial sill of 0.179 (that is the sill minus the nugget) and a range parameter of 0.018. The fitted nugget and smoothness are:

print(residual.matern.variog$nugget)  #nugget

## [1] 0.09997

print(residual.matern.variog$kappa)  #smoothness

## [1] 41.74

Question 5

Three Matern model semivariograms are fitted based on maximum likelihood. Kappa is not fixed. Each model has different initial conditions.

sandpit.lk1 = likfit(sandpit.geo, cov.model = "matern", kappa = 0.51, fix.kappa = FALSE, 
    trend = "1st", ini = c(0.275, 0.75))
sandpit.lk2 = likfit(sandpit.geo, cov.model = "matern", kappa = 0.51, fix.kappa = FALSE, 
    trend = "1st", ini = c(0.3, 1))
sandpit.lk3 = likfit(sandpit.geo, cov.model = "matern", kappa = 0.51, fix.kappa = FALSE, 
    trend = "1st", ini = c(0.3, 1.5))
plot(residual.variog)
lines(sandpit.lk1)
lines(sandpit.lk2)
lines(sandpit.lk3)

plot of chunk unnamed-chunk-28

All three models overlap; they do converge to the same parameters. The summary outputs these parameters:

summary(sandpit.lk1)

## Summary of the parameter estimation
## -----------------------------------
## Estimation method: maximum likelihood 
## 
## Parameters of the mean component (trend):
##  beta0  beta1  beta2 
##  1.071 -0.070 -1.068 
## 
## Parameters of the spatial component:
##    correlation function: matern
##       (estimated) variance parameter sigmasq (partial sill) =  0.284
##       (estimated) cor. fct. parameter phi (range parameter)  =  0.103
##       (estimated) extra parameter kappa = 0.84
##    anisotropy parameters:
##       (fixed) anisotropy angle = 0  ( 0 degrees )
##       (fixed) anisotropy ratio = 1
## 
## Parameter of the error component:
##       (estimated) nugget =  0
## 
## Transformation parameter:
##       (fixed) Box-Cox parameter = 1 (no transformation)
## 
## Practical Range with cor=0.05 for asymptotic range: 0.3834
## 
## Maximised Likelihood:
##    log.L n.params      AIC      BIC 
##   "-232"      "7"    "478"    "508" 
## 
## non spatial model:
##    log.L n.params      AIC      BIC 
##   "-379"      "4"    "765"    "782" 
## 
## Call:
## likfit(geodata = sandpit.geo, trend = "1st", ini.cov.pars = c(0.275, 
##     0.75), fix.kappa = FALSE, kappa = 0.51, cov.model = "matern")

The practical range of 0.38 agrees with the plot.

Question 6

A prediction grid of 60x80 points was setup between \( x\in (-1.5,1.5), y\in (-2,2) \). This covers most of the range of the dataset.

sandpit.krig.pred.grid <- expand.grid(seq(-1.5, 1.5, l = 60), seq(-2, 2, l = 80))

As per the lecture notes, a linear trend was fitted to the sand height data.

sandpit.krig.mlfit <- likfit(geodata = sandpit.geo, trend = "1st", ini.cov.pars = c(1, 
    1))

sandpit.krig.mlfit

## likfit: estimated model parameters:
##     beta0     beta1     beta2     tausq   sigmasq       phi 
## " 1.0670" "-0.0592" "-1.0239" " 0.0000" " 0.2984" " 0.1990" 
## Practical Range with cor=0.05 for asymptotic range: 0.5961
## 
## likfit: maximised log-likelihood = -237

It is worth noting this fits a trend along the X axis as well (\( \beta_1=-0.0592 \)).

Finally, universal kriging is performed and the results plotted:

sandpit.krig.ukc <- krige.conv(sandpit.geo, loc = sandpit.krig.pred.grid, krige = krige.control(obj.m = sandpit.krig.mlfit, 
    trend.d = "1st", trend.l = "1st"))

image(sandpit.krig.ukc, loc = sandpit.krig.pred.grid, col = topo.colors(100), 
    main = "universal kriging predictions")

plot of chunk unnamed-chunk-33

image(sandpit.krig.ukc, val = sqrt(sandpit.krig.ukc$krige.var), main = "universal kriging std. errors")

plot of chunk unnamed-chunk-33

Question 7

Using the method suggested by the hint, the -0.57 transect was found from the predicted values on the lattice for \( -0.59 < x < 0.55 \):

loci = sandpit.krig.pred.grid
a = -0.59
b = -0.55
sum(loci[, 1] > a & loci[, 1] < b) == 80  #should be true

## [1] TRUE

ind = loci[, 1] > a & loci[, 1] < b
plot(sandpit.krig.ukc$predict[ind], xlab = "Y Coord", ylab = "Sand Height", 
    main = "x=-0.57 transect")
# Confidence intervals:
lines(sandpit.krig.ukc$predict[ind] + 1.96 * sqrt(sandpit.krig.ukc$krige.var[ind]), 
    lty = 3)
lines(sandpit.krig.ukc$predict[ind] - 1.96 * sqrt(sandpit.krig.ukc$krige.var[ind]), 
    lty = 3)

plot of chunk unnamed-chunk-34

Similarly, the 0.93 transect was found by considering the points where \( 0.91 < x < 0.95 \):

a = 0.91
b = 0.95
sum(loci[, 1] > a & loci[, 1] < b) == 80  #should be true

## [1] TRUE

ind = loci[, 1] > a & loci[, 1] < b
plot(sandpit.krig.ukc$predict[ind], xlab = "Y Coord", ylab = "sand height", 
    main = "x=0.93 transect")
# Confidence intervals:
lines(sandpit.krig.ukc$predict[ind] + 1.96 * sqrt(sandpit.krig.ukc$krige.var[ind]), 
    lty = 3)
lines(sandpit.krig.ukc$predict[ind] - 1.96 * sqrt(sandpit.krig.ukc$krige.var[ind]), 
    lty = 3)

plot of chunk unnamed-chunk-35