MATH1307 Assignment 2

Load the relevant packages and create user functions

library(TSA)
library(forecast)
library(x12)
library(tseries)
library(dLagM)
library(readr)
library(stats)
library(car)
library(dynlm)
library(lmtest)

sum.res = function(x){
  summary(x)
  checkresiduals(x$model)
  shapiro.test(x$model$residuals)
}

sum.res2 = function(x){
  summary(x)
  checkresiduals(x)
  shapiro.test(x$residuals)
}

explore=function(x){
  plot(x, xlab='Year', main="Time series plot")
  points(y=x,x=time(x), pch=as.vector(season(x)))
  acf(x, lag.max = 48, main="Sample ACF plot")
  adf.test(x)
}

Task 1

Introduction

The aim of this investigation is to analyse and forecast the amount of horizontal solar radiation reaching the ground at a particular location over the globe. The dataset used in this report is data1 which contains information on monthly average horizontal solar radiation and the monthly precipitation series measured at the same points between January 1960 and December 2014.

Method

Firstly, a time series will be created for each of the variables in data1. Then we will examine them by looking at their stationarity and decompositions. After exploring the relationship between these variables we will proceed to modelling. Three types of models will be used including time series regression models, exponential smoothing models and state-space models. The best model will be decided based on significance of the model and terms, MASE and residual analysis. Finally, the best model chosen will be used to make 2 years ahead forecasts.

Data description

The following code chunk is used to load dataset data1 and convert it to time series:

data1 <- read_csv("data1.csv")
solar = ts(data1$solar,start=c(1960,1), frequency=12)
ppt= ts(data1$ppt,start=c(1960,1), frequency=12)

Data exploration and visualisation

The following code chunk is used to explore and visualise the time series plot and sample ACF plot of monthly solar radiation:

explore(solar)

$*Figure 1: Time series plot and sample ACF plot of monthly solar radiation*$

Figure 1: Time series plot and sample ACF plot of monthly solar radiation

## 
##  Augmented Dickey-Fuller Test
## 
## data:  x
## Dickey-Fuller = -4.7661, Lag order = 8, p-value = 0.01
## alternative hypothesis: stationary

The time series plots for monthly solar radiation from Figure 1 shows that:

The series has no obvious trend or intervention point over the given time period.
This series has seasonality, so it is hard to tell if there is moving average behaviour or changing variance without removing the seasonal component first.

The sample ACF plot and the Augmented Dickey-Fuller test results for monthly solar radiation from Figure 1 shows that:

A decaying pattern in the ACF plot of monthly solar radiation suggesting the existence of trend.
Obvious seasonality was observed in the ACF plot, this supports previous observations based on time series plot.
Augmented Dickey-Fuller test gives a p value of 0.01 (< 0.05), so there is enough evidence to reject the null hypothesis of non-stationarity.

Based on the above evidence and observations made on the time series plot, solar is a stationary time series with seasonality.

The following code chunk is used to explore and visualise the time series plot of monthly precipitation:

explore(ppt)

Figure 2: Time series plot and sample ACF plot of monthly precipitation

## 
##  Augmented Dickey-Fuller Test
## 
## data:  x
## Dickey-Fuller = -6.7438, Lag order = 8, p-value = 0.01
## alternative hypothesis: stationary

The time series plots for monthly precipitation from Figure 2 shows that:

The series has no obvious trend or intervention point over the given time period.
This series has seasonality, so it is hard to tell if there is moving average behaviour or changing variance without removing the seasonal component first.

The sample ACF plot and the Augmented Dickey-Fuller test results for monthly solar radiation from Figure 2 shows that:

A decaying pattern in the ACF plot of monthly solar radiation suggesting the existence of trend.
Obvious seasonality was observed in the ACF plot, this supports previous observations based on time series plot.
Augmented Dickey-Fuller test gives a p value of 0.01 (< 0.05), so there is enough evidence to reject the null hypothesis of non-stationarity.

Based on the above evidence and observations made on the time series plot, ppt is a stationary time series with seasonality.

Decomposition

In order to better investigate seasonal, trend and other components in the time series, X12 decomposition is used. The following code chunk is used to perform and visualise X12 decomposition of solar and ppt series:

solar.x12 = x12(solar)
plot(solar.x12, sa = T, trend = T, main = "X12 decomposition of monthly solar radiation")

Figure 3: X12 decomposition of monthly solar radiation

Figure 3 shows the STL decomposition of solar series, we can observe that:

The seasonally adjusted line is very different from the original, which means that seasonality is a main cause for variation in the original series.
The trend line lies closely along the seasonally adjusted line, which means that trend is also a main cause for variation in the original series.
Areas where the trend line and the seasonally adjusted line do no overlap can be caused by outliers.

ppt.x12 = x12(ppt)
plot(ppt.x12, sa = T, trend = T, main = "X12 decomposition of monthly precipitation")

Figure 4: X12 decomposition of monthly precipitation

Figure 4 shows the STL decomposition of ppt series, we can observe that:

The seasonally adjusted line is very different from the original in the second half of the plot, but it was close to the original line in the first half of the plot. This means that seasonality is a main cause for variation in the original series in the second half of the time series.
The trend line lies shows more fluctuation in the second half than the first half of the plot. Overall, the trend line lies closely along the seasonally adjusted line, which means that trend is also a main cause for variation in the original series.
Areas where the trend line and the seasonally adjusted line do no overlap can be caused by outliers.

Model fitting with time series regression models

The following code chunk is used to test the correlation between solar and ppt:

cor(data1)

##            solar        ppt
## solar  1.0000000 -0.4540277
## ppt   -0.4540277  1.0000000

From the output above, we observe moderately negative correlation between solar and ppt series, so we can use ppt as a predictor to make forecasts of solar with distributed lag models.

The following code chunk is used to auto search for a finite DLM model based on MASE:

finiteDLMauto(x = data1$ppt, y = data1$solar,q.min = 1, q.max = 12, model.type = "dlm", error.type = "MASE", trace = TRUE)

##    q - k    MASE      AIC      BIC R.Adj.Sq Ljung-Box
## 12    12 1.55160 4578.787 4645.895  0.30769         0
## 11    11 1.56213 4590.961 4653.617  0.30199         0
## 10    10 1.57800 4602.658 4660.858  0.29615         0
## 9      9 1.59312 4615.084 4668.827  0.28882         0
## 8      8 1.60481 4625.986 4675.267  0.28251         0
## 7      7 1.60704 4632.716 4677.532  0.28096         0
## 6      6 1.60753 4637.489 4677.837  0.28214         0
## 5      5 1.61385 4644.622 4680.499  0.28072         0
## 4      4 1.64636 4663.600 4695.003  0.26577         0
## 3      3 1.66270 4688.551 4715.478  0.24326         0
## 2      2 1.67597 4712.649 4735.095  0.22173         0
## 1      1 1.68846 4728.713 4746.676  0.21036         0

It is observed that the values of MASE, AIC and BIC decrease as the lag (q) increase.

To investigate the effect of adding term k=1, the following code chunk is used to auto search for a polynomial DLM based on MASE:

finiteDLMauto(x = data1$ppt , y = data1$solar, q.max = 12, k.order =1, model.type = "poly", error.type = "MASE", trace = TRUE)

##     q - k    MASE      AIC      BIC R.Adj.Sq Ljung-Box
## 5   5 - 1 1.63767 4648.873 4666.811  0.27163         0
## 6   6 - 1 1.64621 4657.208 4675.141  0.25454         0
## 4   4 - 1 1.65579 4663.361 4681.305  0.26269         0
## 3   3 - 1 1.66668 4687.030 4704.981  0.24272         0
## 2   2 - 1 1.67675 4710.708 4728.664  0.22285         0
## 1   1 - 1 1.68846 4728.713 4746.676  0.21036         0
## 7   7 - 1 1.70853 4689.073 4706.999  0.20898         0
## 8   8 - 1 1.82815 4740.059 4757.979  0.13620         0
## 12 12 - 1 1.91409 4769.354 4787.250  0.05530         0
## 9   9 - 1 1.94303 4791.522 4809.436  0.05601         0
## 11 11 - 1 1.99229 4811.020 4828.922  0.00523         0
## 10 10 - 1 2.00664 4819.815 4837.723  0.00341         0

It is observed that the values of MASE, AIC and BIC increase when “k=1” is added in the polynomial DLM models. So we will stick to finite DLM models.

We will fit a finite DLM with q= 12 using the following code chunk:

model1 = dlm(formula=solar~ppt, data=data.frame(data1), q = 12)
sum.res(model1)

## 
## Call:
## lm(formula = as.formula(model.formula), data = design)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -18.563  -5.239  -0.796   4.137  32.430 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  19.5164     1.1151  17.501  < 2e-16 ***
## ppt.t        -5.8876     1.9508  -3.018  0.00265 ** 
## ppt.1         0.9993     2.5647   0.390  0.69694    
## ppt.2         0.4343     2.5571   0.170  0.86520    
## ppt.3         1.8763     2.5580   0.734  0.46352    
## ppt.4         1.7459     2.5587   0.682  0.49529    
## ppt.5         3.3279     2.5601   1.300  0.19410    
## ppt.6         0.7751     2.5617   0.303  0.76230    
## ppt.7         1.7937     2.5615   0.700  0.48402    
## ppt.8         0.2827     2.5593   0.110  0.91207    
## ppt.9        -1.1022     2.5615  -0.430  0.66712    
## ppt.10       -1.9333     2.5508  -0.758  0.44880    
## ppt.11       -0.5613     2.5532  -0.220  0.82605    
## ppt.12       -5.3492     1.9216  -2.784  0.00553 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.181 on 634 degrees of freedom
## Multiple R-squared:  0.3216, Adjusted R-squared:  0.3077 
## F-statistic: 23.12 on 13 and 634 DF,  p-value: < 2.2e-16
## 
## AIC and BIC values for the model:
##        AIC      BIC
## 1 4578.787 4645.895

Figure 5: Residual analysis of model 1

## 
##  Shapiro-Wilk normality test
## 
## data:  x$model$residuals
## W = 0.94168, p-value = 2.922e-15

MASE(model1)

##          MASE
## model1 1.5516

From the summary output above, we can conclude that:

Most terms are not significant at 5% level of significance in model1
The adjusted R-squared for model1 is 0.3077, which means that this model only explains 30.77% of the variability in solar radiation.
The overall model has a p value less than 2.2e-16 (< 0.05), therefore it is significant.
Model1 has AIC value of 4578.787, BIC value of 4645.895 and MASE value of 1.5516.

Figure 5 shows the residual plots for model1:

According to the ACF plot, we can observe significant seasonal lags left in the residuals.
There is an obvious non random pattern in the time series plot.
Histogram of standardised residuals shows violation in normality assumptions.
Shapiro-Wilk normality test has a p value of 2.922e-15 (<0.05), so there is enough evidence to reject null hypothesis of normality.

Overall, we can conclude that model1 is not appropriate for further analysis.

Next step:

There is no intervention point in solar or ppt, so it is not necessary to fit dynamic linear models since they are used for intervention analysis.
Since using ppt as a predictor for solar failed to give promising result, we will then use exponential smoothing models to investigate solar time series without any predictors.

Model fitting with exponential smoothing models

First of all, we must check the minimum value of solar series:

min(solar)

## [1] 1.581982

Since the minimum value of solar series is greater than 0, no transformation is needed before fitting multiplicative models.

There is obvious seasonality in solar time series, so we can fit Holt-Winter’s methods models because they have seasonal components. The following code chunk is used to fit Holt-Winter’s methods models using a loop:

seasonal = c("additive","multiplicative")
dampeds = c(TRUE,FALSE)
expand = expand.grid(seasonal,dampeds)
fit.AICc = array(NA,4)
fit.MASE=array(NA,4)
levels = array(NA, dim=c(4,2))
              
for(i in 1:4){fit.AICc[i]= hw(solar, 
                               seasonal = toString(expand[i ,1]),                    
                               damped = expand[i ,2])$model$aicc
               fit.MASE[i]= accuracy(hw(solar, 
                               seasonal = toString(expand[i ,1]),                    
                               damped = expand[i ,2]))[,6]
levels[i,1]= toString(expand[i ,1])  
levels[i,2]= expand[i ,2]
}
results = data.frame(levels, fit.MASE, fit.AICc)
colnames(results)= c("seasonal","damped","MASE", "AICc")
results

##         seasonal damped      MASE     AICc
## 1       additive   TRUE 0.2461797 5429.489
## 2 multiplicative   TRUE 0.2035619 6367.768
## 3       additive  FALSE 0.2471600 5435.662
## 4 multiplicative  FALSE 0.2233077 6649.699

From the output above, we can see that:

Models with the lowest MASE value does not have the lowest AICc, so we will look at the two lowest MASE model and lowest AICc model.
Model with “seasonal=multiplicative” and “damped = TRUE” has the lowest MASE value of 0.2035619 and AICc value of 6367.768.
Model with “seasonal=multiplicative” and “damped = FALSE” has the second lowest MASE value of 0.2233077 and AICc value of 6649.699.
Model with “seasonal=additive” and “damped = TRUE” has the second lowest AICc value of 5429.489 and MASE value of 0.2461797.

To compare between the lowest AICc and top two lowest MASE models, we will look at their residuals:

hw.md<- hw(solar, seasonal = "multiplicative", damped = TRUE, h=2*frequency(solar))
sum.res(hw.md)

## 
## Forecast method: Damped Holt-Winters' multiplicative method
## 
## Model Information:
## Damped Holt-Winters' multiplicative method 
## 
## Call:
##  hw(y = solar, h = 2 * frequency(solar), seasonal = "multiplicative",  
## 
##  Call:
##      damped = TRUE) 
## 
##   Smoothing parameters:
##     alpha = 0.805 
##     beta  = 1e-04 
##     gamma = 0.0124 
##     phi   = 0.94 
## 
##   Initial states:
##     l = 9.9033 
##     b = 0.4703 
##     s = 0.4341 0.5649 0.8263 1.1716 1.4514 1.6316
##            1.5503 1.3834 1.1045 0.8837 0.5761 0.4221
## 
##   sigma:  0.3145
## 
##      AIC     AICc      BIC 
## 6366.701 6367.768 6447.561 
## 
## Error measures:
##                      ME     RMSE      MAE       MPE     MAPE      MASE
## Training set -0.0456208 2.045442 1.239102 -2.172476 10.00296 0.2035619
##                    ACF1
## Training set 0.04177493
## 
## Forecasts:
##          Point Forecast       Lo 80     Hi 80       Lo 95     Hi 95
## Jan 2015       5.611616   3.3502090  7.873023   2.1530925  9.070139
## Feb 2015       7.060438   3.3373603 10.783517   1.3664818 12.754395
## Mar 2015      10.515176   3.8550902 17.175263   0.3294535 20.700899
## Apr 2015      12.993446   3.5145037 22.472389  -1.5033454 27.490238
## May 2015      16.284381   2.9386340 29.630128  -4.1261776 36.694939
## Jun 2015      18.261020   1.7242212 34.797820  -7.0298315 43.551872
## Jul 2015      19.031597   0.2101276 37.853067  -9.7533565 47.816551
## Aug 2015      17.105654  -1.2074754 35.418784 -10.9018606 45.113169
## Sep 2015      13.764743  -2.0801991 29.609686 -10.4680049 37.997491
## Oct 2015       9.774733  -2.2585804 21.808047  -8.6286319 28.178099
## Nov 2015       6.581938  -2.0456267 15.209503  -6.6127835 19.776659
## Dec 2015       5.163049  -2.0168711 12.342970  -5.8176914 16.143790
## Jan 2016       5.616270  -2.6571166 13.889656  -7.0367828 18.269322
## Feb 2016       7.066360  -3.9144360 18.047156  -9.7273184 23.860039
## Mar 2016      10.524088  -6.6891829 27.737360 -15.8013383 36.849515
## Apr 2016      13.004566  -9.3405081 35.349641 -21.1692761 47.178409
## May 2016      16.298445 -13.0720043 45.668893 -28.6197807 61.216670
## Jun 2016      18.276925 -16.2140342 52.767885 -34.4724452 71.026296
## Jul 2016      19.048304 -18.5466197 56.643228 -38.4481704 76.544779
## Aug 2016      17.120781 -18.1782620 52.419825 -36.8644468 71.106009
## Sep 2016      13.777000 -15.8651593 43.419159 -31.5567705 59.110770
## Oct 2016       9.783493 -12.1627698 31.729756 -23.7804195 43.347405
## Nov 2016       6.587872  -8.8063951 21.982139 -16.9556278 30.131371
## Dec 2016       5.167730  -7.4021686 17.737629 -14.0562710 24.391731

Figure 6: Residual analysis of hw.md

## 
##  Ljung-Box test
## 
## data:  Residuals from Damped Holt-Winters' multiplicative method
## Q* = 23.57, df = 7, p-value = 0.001355
## 
## Model df: 17.   Total lags used: 24

## 
##  Shapiro-Wilk normality test
## 
## data:  x$model$residuals
## W = 0.3802, p-value < 2.2e-16

From the summary output above, we can conclude that:

Model hw.md has AIC value of 6366.701, AICc value of 6367.768, BIC value of 6447.561 and MASE value of 0.2035619.
Even though hw.md model has larger AIC and BIC value than model1, MASE value is significantly reduced. This means that hw.md gives more accurate forecast values for solar series.

Figure 6 shows the residual plots for hw.md:

Residuals from hw.md model is significantly better than model1.
According to the ACF plot, we can only observe significant seasonal lags at lag 12 and lag 24 in the residuals.
There is a relatively random pattern in the time series plot.
Histogram of standardised residuals shows violation in normality assumptions, this can be caused by the existence of outliers.
Shapiro-Wilk normality test has a p value less than 2.2e-16 (<0.05), so there is enough evidence to reject null hypothesis of normality.

Overall, we can conclude that model hw.md is better than all previous models.

hw.m<- hw(solar, seasonal = "multiplicative", damped = FALSE, h=2*frequency(solar))
sum.res(hw.m)

## 
## Forecast method: Holt-Winters' multiplicative method
## 
## Model Information:
## Holt-Winters' multiplicative method 
## 
## Call:
##  hw(y = solar, h = 2 * frequency(solar), seasonal = "multiplicative",  
## 
##  Call:
##      damped = FALSE) 
## 
##   Smoothing parameters:
##     alpha = 0.8251 
##     beta  = 1e-04 
##     gamma = 0.0259 
## 
##   Initial states:
##     l = 10.5659 
##     b = -0.2371 
##     s = 0.5024 0.6292 0.9319 1.2229 1.4923 1.6309
##            1.5606 1.3672 1.0278 0.8128 0.5106 0.3114
## 
##   sigma:  0.3971
## 
##      AIC     AICc      BIC 
## 6648.746 6649.699 6725.114 
## 
## Error measures:
##                      ME    RMSE      MAE        MPE     MAPE      MASE
## Training set 0.04281136 2.12539 1.359297 -0.4896895 10.87401 0.2233077
##                    ACF1
## Training set 0.06551473
## 
## Forecasts:
##          Point Forecast        Lo 80     Hi 80       Lo 95     Hi 95
## Jan 2015       5.610335   2.75534413  8.465326   1.2440033  9.976666
## Feb 2015       6.512806   2.03809262 10.987520  -0.3306776 13.356290
## Mar 2015       8.786120   1.33667747 16.235562  -2.6068190 20.179058
## Apr 2015      10.192785  -0.02672876 20.412298  -5.4366123 25.822181
## May 2015      12.512597  -1.96157245 26.986766  -9.6237347 34.648928
## Jun 2015      14.061609  -4.41048190 32.533701 -14.1890164 42.312235
## Jul 2015      14.502088  -6.89909775 35.903274 -18.2282011 47.232377
## Aug 2015      12.945620  -8.35024040 34.241481 -19.6235881 45.514829
## Sep 2015      10.352210  -8.52365174 29.228072 -18.5159294 39.220350
## Oct 2015       7.418279  -7.51119146 22.347750 -15.4143760 30.250935
## Nov 2015       4.989730  -6.05900058 16.038460 -11.9078451 21.887305
## Dec 2015       3.871776  -5.53876625 13.282319 -10.5204066 18.263960
## Jan 2016       4.199400  -7.01753622 15.416336 -12.9554236 21.354224
## Feb 2016       4.838892  -9.30305408 18.980837 -16.7893479 26.467131
## Mar 2016       6.477174 -14.21845674 27.172805 -25.1740619 38.128410
## Apr 2016       7.452633 -18.56883703 33.474103 -32.3437711 47.249037
## May 2016       9.069749 -25.52978297 43.669280 -43.8456686 61.985166
## Jun 2016      10.099485 -31.99838423 52.197355 -54.2836502 74.482621
## Jul 2016      10.315199 -36.68017471 57.310572 -61.5580226 82.188420
## Aug 2016       9.113765 -36.29212001 54.519651 -60.3285438 78.556075
## Sep 2016       7.208700 -32.09306283 46.510463 -52.8981594 67.315559
## Oct 2016       5.105869 -25.38360449 35.595343 -41.5237568 51.735495
## Nov 2016       3.391945 -18.81647619 25.600367 -30.5729044 37.356795
## Dec 2016       2.597256 -16.07148862 21.266001 -25.9541251 31.148637

Figure 7: Residual analysis of hw.m

## 
##  Ljung-Box test
## 
## data:  Residuals from Holt-Winters' multiplicative method
## Q* = 38.585, df = 8, p-value = 5.868e-06
## 
## Model df: 16.   Total lags used: 24

## 
##  Shapiro-Wilk normality test
## 
## data:  x$model$residuals
## W = 0.37415, p-value < 2.2e-16

From the summary output above, we can conclude that:

Model hw.m has AIC value of 6648.746, AICc value of 6649.699, BIC value of 6725.114 and MASE value of 0.2233077.
Model hw.m has larger AIC, BIC and MASE value than hw,md model.

Figure 7 shows the residual plots for hw.m:

Residuals from hw.m model is slightly better than hw.md.
According to the ACF plot, we can only observe significant seasonal lags at lag 24 and 36 in the residuals. We can ignore these lags because they are further away and lag 12 is not significant.
There is a relatively random pattern in the time series plot.
Histogram of standardised residuals shows violation in normality assumptions, this can be caused by the existence of outliers.
Shapiro-Wilk normality test has a p value less than 2.2e-16 (<0.05), so there is enough evidence to reject null hypothesis of normality.

Overall, we can conclude that the model hw.m is better than all previous models.

hw.ad<- hw(solar, seasonal = "additive", damped = TRUE, h=2*frequency(solar))
sum.res(hw.ad)

## 
## Forecast method: Damped Holt-Winters' additive method
## 
## Model Information:
## Damped Holt-Winters' additive method 
## 
## Call:
##  hw(y = solar, h = 2 * frequency(solar), seasonal = "additive",  
## 
##  Call:
##      damped = TRUE) 
## 
##   Smoothing parameters:
##     alpha = 0.9999 
##     beta  = 1e-04 
##     gamma = 1e-04 
##     phi   = 0.9388 
## 
##   Initial states:
##     l = 11.154 
##     b = 0.7632 
##     s = -10.4919 -8.137 -3.348 2.5794 8.08 11.1219
##            9.9586 6.9916 1.9573 -1.8565 -7.1607 -9.6946
## 
##   sigma:  2.3446
## 
##      AIC     AICc      BIC 
## 5428.422 5429.489 5509.282 
## 
## Error measures:
##                       ME     RMSE      MAE       MPE     MAPE      MASE
## Training set -0.01091357 2.314163 1.498521 -1.468083 12.44796 0.2461797
##                   ACF1
## Training set 0.1700724
## 
## Forecasts:
##          Point Forecast     Lo 80     Hi 80       Lo 95    Hi 95
## Jan 2015       5.945198  2.940530  8.949866   1.3499548 10.54044
## Feb 2015       8.479778  4.230548 12.729009   1.9811413 14.97842
## Mar 2015      13.784309  8.579902 18.988716   5.8248548 21.74376
## Apr 2015      17.598568 11.588776 23.608360   8.4073847 26.78975
## May 2015      22.632258 15.912803 29.351713  12.3557383 32.90878
## Jun 2015      25.599150 18.238024 32.960277  14.3412786 36.85702
## Jul 2015      26.761775 18.810497 34.713053  14.6013444 38.92221
## Aug 2015      23.722251 15.221610 32.222892  10.7216429 36.72286
## Sep 2015      18.222624  9.205955 27.239293   4.4328191 32.01243
## Oct 2015      12.293259  2.788471 21.798047  -2.2430606 26.82958
## Nov 2015       7.504219 -2.464877 17.473315  -7.7421977 22.75064
## Dec 2015       5.150488 -5.262286 15.563263 -10.7744758 21.07545
## Jan 2016       5.947275 -4.891175 16.785726 -10.6287044 22.52326
## Feb 2016       8.481728 -2.766251 19.729707  -8.7205712 25.68403
## Mar 2016      13.786139  2.142988 25.429291  -4.0205242 31.59280
## Apr 2016      17.600287  5.574906 29.625668  -0.7909464 35.99152
## May 2016      22.633871 10.238009 35.029734   3.6760353 41.59171
## Jun 2016      25.600665 12.845046 38.356283   6.0926298 45.10870
## Jul 2016      26.763197 13.657667 39.868726   6.7200186 46.80637
## Aug 2016      23.723586 10.277223 37.169950   3.1591478 44.28802
## Sep 2016      18.223877  4.445085 32.002669  -2.8489663 39.29672
## Oct 2016      12.294435 -1.808972 26.397843  -9.2748652 33.86374
## Nov 2016       7.505324 -6.915414 21.926061 -14.5492911 29.55994
## Dec 2016       5.151525 -9.579726 19.882776 -17.3779790 27.68103

Figure 8: Residual analysis of hw.ad

## 
##  Ljung-Box test
## 
## data:  Residuals from Damped Holt-Winters' additive method
## Q* = 210.76, df = 7, p-value < 2.2e-16
## 
## Model df: 17.   Total lags used: 24

## 
##  Shapiro-Wilk normality test
## 
## data:  x$model$residuals
## W = 0.88756, p-value < 2.2e-16

From the summary output above, we can conclude that:

Hw.ad has AIC value of 55428.422, AICc value of 5429.489, BIC value of 5509.282 and MASE value of 0.2461797.
Hw.ad model has larger AIC, AICc BIC and MASE values than hw.m. This means that it has lower forecast accuracy than hw.m.

Figure 8 shows the residual plots for hw.ad:

Residuals of hw.ad are slightly worse than hw.m model.
According to the ACF plot, we can observe 1 significant seasonal lag at lag 12 in the residuals, this has more adverse implications than significant lags at lag 24 or lag 36.
There is a relatively random pattern in the time series plot.
Histogram of standardised residuals shows violation in normality assumptions, this can be caused by the existence of outliers.
Shapiro-Wilk normality test has a p value less than 2.2e-16 (< 0.05), so there is enough evidence to reject null hypothesis of normality.

Overall, we can conclude that model hw.m is still the best model so far.

Next step:

From x12 decompositions, we observed spikes in the remainder series, this means that variations in the series are also caused by factors other than seasonality and trend.
To model this phenomenon, we can use state-space models because it takes error terms into considerations.

Model fitting with state-space models

From previous observations based on time series plots and decompositions, we concluded that seasonality is leading cause of variations in solar time series. Therefore, state-space models with seasonality components should be used to fit the series.

The following code chunk is used to fit state-space models using a loop:

models = c("AAA","MAM","MMM")
dampeds = c(TRUE,FALSE)
expand = expand.grid(models,dampeds)
fit.AICc = array(NA,8)
fit.MASE=array(NA,8)
levels = array(NA, dim=c(8,2))
              
for(i in 1:6){fit.AICc[i]= ets(solar, 
                               model = toString(expand[i ,1]),                    
                               damped = expand[i ,2])$aicc 
              fit.MASE[i]= accuracy(ets(solar, 
                               model = toString(expand[i ,1]),                    
                               damped = expand[i ,2]))[,6]
levels[i,1]= toString(expand[i ,1])  
levels[i,2]= expand[i ,2]
}
# Becasuse "ANN" and "ANA" cannot have damped =TRUE, they will be processed sepeately
fit.AICc[7]= ets(solar, model ="ANN")$aicc
fit.AICc[8]= ets(solar, model ="ANA")$aicc
fit.MASE[7]= accuracy(ets(solar, model ="ANN"))[,6]
fit.MASE[8]= accuracy(ets(solar, model ="ANA"))[,6]
levels[7,1]="ANN"
levels[8,1]="ANA"
levels[7,2]=FALSE
levels[8,2]=FALSE
results = data.frame(levels, fit.MASE, fit.AICc)
colnames(results)= c("Model","Damped","MASE","AICc")
results

##   Model Damped      MASE     AICc
## 1   AAA   TRUE 0.2461797 5429.489
## 2   MAM   TRUE 0.3222574 5954.569
## 3   MMM   TRUE 0.3201193 5996.617
## 4   AAA  FALSE 0.2471600 5435.662
## 5   MAM  FALSE 0.3721664 6106.912
## 6   MMM  FALSE 0.5292151 6671.121
## 7   ANN  FALSE 0.6368203 6296.407
## 8   ANA  FALSE 0.2547040 5450.719

From the output above, we can see that:

Model with the lowest MASE also has the lowest AICc value
Model with “AAA” and “Damped = True” has the lowest MASE value of 0.2461797 and the lowest AICc value of 5429.489.
Model with “AAA” and “Damped = FALSE” has the second lowest MASE value of 0.2471600 and the second lowest AICc value of 5435.662.

The following code is used to trigger auto-search to search for the best model among all possible models based on AICc:

fit.auto.solar = ets(solar,model="ZZZ",ic="aicc")
fit.auto.solar$method

## [1] "ETS(A,Ad,A)"

The above output suggests model with “AAA” and “Damped = True”, same as what we concluded from outputs from the loop.

To compare between the top two lowest MASE models, we will look at their residuals:

ets.AAdA = ets(solar, model="AAA", damped = T)
sum.res2(ets.AAdA)

## ETS(A,Ad,A) 
## 
## Call:
##  ets(y = solar, model = "AAA", damped = T) 
## 
##   Smoothing parameters:
##     alpha = 0.9999 
##     beta  = 1e-04 
##     gamma = 1e-04 
##     phi   = 0.9388 
## 
##   Initial states:
##     l = 11.154 
##     b = 0.7632 
##     s = -10.4919 -8.137 -3.348 2.5794 8.08 11.1219
##            9.9586 6.9916 1.9573 -1.8565 -7.1607 -9.6946
## 
##   sigma:  2.3446
## 
##      AIC     AICc      BIC 
## 5428.422 5429.489 5509.282 
## 
## Training set error measures:
##                       ME     RMSE      MAE       MPE     MAPE      MASE
## Training set -0.01091357 2.314163 1.498521 -1.468083 12.44796 0.2461797
##                   ACF1
## Training set 0.1700724

Figure 9: Residual analysis of ets.AAdA

## 
##  Ljung-Box test
## 
## data:  Residuals from ETS(A,Ad,A)
## Q* = 210.76, df = 7, p-value < 2.2e-16
## 
## Model df: 17.   Total lags used: 24

## 
##  Shapiro-Wilk normality test
## 
## data:  x$residuals
## W = 0.88756, p-value < 2.2e-16

From the summary output above, we can conclude that:

Ets.AAdA has AIC value of 55428.422, AICc value of 5429.489, BIC value of 5509.282 and MASE value of 0.2461797.
Ets.AAdA model has smaller AIC, AICc and BIC values than hw.m but it has larger MASE value. This means that it has worse accuracy in forecasting than hw.m.

Figure 9 shows the residual plots for Ets.AAdA:

Residuals of Ets.AAdA are slightly worse than hw.m model.
According to the ACF plot, we can observe 1 significant seasonal lag at lag 12 in the residuals, this has more adverse implications than significant lags at lag 24 or lag 36.
There is a relatively random pattern in the time series plot.
Histogram of standardised residuals shows violation in normality assumptions, this can be caused by the existence of outliers.
Shapiro-Wilk normality test has a p value less than 2.2e-16 (< 0.05), so there is enough evidence to reject null hypothesis of normality.

Overall, we can conclude that model hw.m is still the best model so far.

ets.AAA = ets(solar, model="AAA", damped = F)
sum.res2(ets.AAA)

## ETS(A,A,A) 
## 
## Call:
##  ets(y = solar, model = "AAA", damped = F) 
## 
##   Smoothing parameters:
##     alpha = 0.9993 
##     beta  = 0.0019 
##     gamma = 1e-04 
## 
##   Initial states:
##     l = 11.3306 
##     b = 0.209 
##     s = -10.7542 -8.4968 -3.227 3.0768 7.9264 10.9122
##            10.1422 7.3322 2.2353 -1.9528 -7.3626 -9.8316
## 
##   sigma:  2.3575
## 
##      AIC     AICc      BIC 
## 5434.708 5435.662 5511.076 
## 
## Training set error measures:
##                      ME     RMSE      MAE       MPE     MAPE    MASE
## Training set -0.1179476 2.328736 1.504489 -2.230359 12.68101 0.24716
##                   ACF1
## Training set 0.1703355

Figure 10: Residual analysis of ets.AAA

## 
##  Ljung-Box test
## 
## data:  Residuals from ETS(A,A,A)
## Q* = 205.55, df = 8, p-value < 2.2e-16
## 
## Model df: 16.   Total lags used: 24

## 
##  Shapiro-Wilk normality test
## 
## data:  x$residuals
## W = 0.88809, p-value < 2.2e-16

From the summary output above, we can conclude that:

Ets.AAA has AIC value of 5434.708, AICc value of 5435.662, BIC value of 5511.076 and MASE value of 0.24716.
Ets.AAA model has smaller AIC, AICc and BIC values than hw.m but it has larger MASE value. This means that it has worse accuracy in forecasting than hw.m.

Figure 10 shows the residual plots for AAA:

Residuals of Ets.AAA are slightly worse than hw.m model.
According to the ACF plot, we can observe 1 significant seasonal lag at lag 12 in the residuals, this has more adverse implications than significant lags at lag 24 or lag 36.
There is a relatively random pattern in the time series plot.
Histogram of standardised residuals shows violation in normality assumptions, this can be caused by the existence of outliers.
Shapiro-Wilk normality test has a p value less than 2.2e-16 (< 0.05), so there is enough evidence to reject null hypothesis of normality.

Overall, we can conclude that model hw.m is still the best model so far.

Comparing and choosing the best model for forecasting

To summaries and compare across all exponential smoothing models and state-space models above, a dataframe containing their AICc, AIC, BIC and MASE values are created using the following code chunk:

models <- c("hw.md", "hw.m", "hw.ad","ets.AAdA","ets.AAA")
AICc <- c(hw.md$model$aicc, hw.m$model$aicc, hw.ad$model$aicc, ets.AAdA$aicc, ets.AAA$aicc)
AIC <- c(hw.md$model$aic, hw.m$model$aic, hw.ad$model$aic, ets.AAdA$aic, ets.AAA$aic)
BIC <- c(hw.md$model$bic, hw.m$model$bic, hw.ad$model$bic, ets.AAdA$bic, ets.AAA$bic)
MASE <- c(accuracy(hw.md)[,6], accuracy(hw.m)[,6], accuracy(hw.ad)[,6], 
          accuracy(ets.AAdA)[,6], accuracy(ets.AAA)[,6])
ALL <- data.frame(models, AICc, AIC, BIC, MASE)
colnames(ALL) <- c("model", "AICc", "AIC", "BIC", "MASE")
ALL <- ALL[order(MASE),]
ALL

##      model     AICc      AIC      BIC      MASE
## 1    hw.md 6367.768 6366.701 6447.561 0.2035619
## 2     hw.m 6649.699 6648.746 6725.114 0.2233077
## 3    hw.ad 5429.489 5428.422 5509.282 0.2461797
## 4 ets.AAdA 5429.489 5428.422 5509.282 0.2461797
## 5  ets.AAA 5435.662 5434.708 5511.076 0.2471600

From the output above, we can see that hw.md and hw.m have the lowest MASE.

Based on previous observations, we concluded that all five models have violation in normality assumption, and they differ the most in residuals sample ACF plot. So the residuals sample ACF plot for the top two models will be compared side by side using the following code chunk:

par(mfrow=c(1,2))
acf(hw.md$residuals,lag.max = 48, main="Sample ACF for hw.md residuals")
acf(hw.m$residuals,lag.max = 48, main="Sample ACF for hw.m residuals")

Figure 11: Residual analysis of hw.md and hw.m

Figure 11 shows the Sample ACF plots for residuals in models hw.md and hw.m, we observe that:

Residuals of hw.md shows significant lag at the first seasonal lag (lag12), the second seasonal lag (lag24) and the third seasonal lag (lag 36).
Residuals of hw.m model has no significant lag until the second seasonal lag (lag 24), which can be ignored since there is no significant lag at the first seasonal lag (lag12).

Model hw.m has the second best MASE (very close to the top one) and the best residual check, therefore it is the best model for forecasting.

The following code chunk is used to plot the forecast using hw.m model:

upper.95.int = hw.m$upper[,2]
lower.95.int = hw.m$lower[,2]
centre = hw.m$mean
par(mfrow=c(1,1))
plot(solar, xlim=c(1960, 2018), ylim=c(-60, 80), ylab="Solar radiation", xlab="Year",
     main = "Original series, forecasts and 95% forecast interval for the solar ration series")
lines(centre, col = "red")
lines(upper.95.int, col = "blue")
lines(lower.95.int, col = "blue")
legend("bottomleft", lty=1, cex=0.75, pch=1, pt.cex = 1.9, text.width = 2.3, col=c("black","red","blue"), c("Data","Forecasts","95% confidence limits"))

Figure 12: Forecast with hw.m model

Figure 12 shows the forecasts and 95% forecast interval for the solar ration series when solar series is fitted with hw.m model. Since common sense solar radiation cannot be negative numbers, we will mainly focus on the upper 95% confidence interval. Forecasts using hw.m model have reasonable mean values and confidence interval.

Conclusion

In conclusion, hw.m is determined to be the best model for forecasting solar time series out of all the models fitted above based on factors including residual checks and MASE values. Two years ahead forecasts made with hw.m model is as shown in Figure 12.

Task 2

Introduction

The aim of this investigation is to demonstrate whether the correlation between quarterly Residential Property Price Index (PPI) in Melbourne and quarterly population change is spurious or not. The dataset used in this report is data2 which contains information on the quarterly Residential Property Price Index (PPI) in Melbourne and quarterly population change over the previous quarter in Victoria between September 2003 and December 2016.

Method

Firstly the dataframe data2 will be converted into time series. Then their relationship will be examined by looking at their time series plots and sample CCF plot. If an unusually high correlation is shown in the sample CCF plot, the prewhitening process will be carried out. Prewhitening is a linear operation, therefore any linear relationships between the original series will be preserved after prewhitening. Lastly, the relationship between the post prewhitened series will be examined by looking at their time series plots and sample CCF plot. If there is no significant correlations left the correlation between these series are determined to be spurious.

Data description

The following code chunk is used to load dataset data2 and convert it to time series:

data2<- read_csv("data2.csv")

## Parsed with column specification:
## cols(
##   Quarter = col_character(),
##   price = col_double(),
##   change = col_double()
## )

colnames(data2)<- c("Quarter","res PPI", "pop change")
data2.ts<- ts(data2 [,2:3], start=c(2003,9),frequency = 4)

Data exploration and visualisation

The following code chunk is used to visualise the time series plot of Melbourne residential PPI and population change series:

plot(data2.ts,yax.flip=T, main="Time series plots of Melbourne residential PPI and population change", type="o")

Figure 13

Figure 13 is the time series plots for Melbourne residential PPI and population change, it shows that:

Both series have upward trends over the given time period and there is no intervention point or changing variance present.
The residential PPI series shows successive observations which suggests autoregressive behaviour.
The population change series shows obvious seasonality.
In both time series there are simultaneously increasing patterns in the time series plots. This implies that the series are at least moderately correlated.

To further explore the correlation structure between residential PPI and population change series the sample CCF plot will be displayed:

Spurious correlation may exist if unusually high correlation structure is observed in the sample CCF plot.

The following code chunk is used to visualise the sample CCF plot of Melbourne residential PPI and population change series:

ccf(as.vector(data2.ts[,1]), as.vector(data2.ts[,2]),ylab='CCF', main = "Sample CCF between residential PPI and population change")

Figure 14

Figure 14 is the Sample CCF plots for Melbourne residential PPI and population change, it shows that:

There is unusually high correlation structure seen in the sample CCF plot, therefore potentially indicating the existence of a spurious correlation.

To investigate the existence of a spurious correlation, prewhitening procedure will be applied to these series:

If true spurious correlation exist, it is expected to see disappearance of all existing correlation between series after prewhitening process.

Ensure stationarity

Before applying prewhitening to the series we must ensure that both series are stationary.

The following code chunk is used to explore the stationarity of Melbourne residential PPI and population change series:

par(mfrow=c(1,2))
acf(data2.ts[,1],lag.max = 48, main="Sample ACF for residential PPI")
acf(data2.ts[,2],lag.max = 48, main="Sample ACF for population change")

Figure 15

par(mfrow=c(1,1))
adf.test(data2.ts[,1])

## 
##  Augmented Dickey-Fuller Test
## 
## data:  data2.ts[, 1]
## Dickey-Fuller = -1.3264, Lag order = 3, p-value = 0.8458
## alternative hypothesis: stationary

adf.test(data2.ts[,2])

## 
##  Augmented Dickey-Fuller Test
## 
## data:  data2.ts[, 2]
## Dickey-Fuller = -1.603, Lag order = 3, p-value = 0.7344
## alternative hypothesis: stationary

Figure 15 is the Sample ACF plots and the Augmented Dickey-Fuller test results for Melbourne residential PPI and population change series, it shows that:

Decaying pattern and significant at seasonal lags implies the existence of non stationarity.
The Augmented Dickey-Fuller test has p values of 0.8458 (> 0.05) and 0.7344 (> 0.05) respectively, therefore it failed to reject the null hypothesis of non stationarity. Both time series are determined to be non-stationary, so they must be converted to stationary series before prewhitening.

The following chuck is used to convert data2.ts into stationary time series through differencing:

data2.diff <- ts.intersect(diff(diff(data2.ts[,1],4)),
                           diff(diff(data2.ts[,2],4)))
par(mfrow=c(1,2))
acf(data2.diff[,1],lag.max = 48, main="Sample ACF for residential PPI")
acf(data2.diff[,2],lag.max = 48, main="Sample ACF for population change")

Figure 16

par(mfrow=c(1,1))
adf.test(data2.diff[,1])

## 
##  Augmented Dickey-Fuller Test
## 
## data:  data2.diff[, 1]
## Dickey-Fuller = -5.1122, Lag order = 3, p-value = 0.01
## alternative hypothesis: stationary

adf.test(data2.diff[,2])

## 
##  Augmented Dickey-Fuller Test
## 
## data:  data2.diff[, 2]
## Dickey-Fuller = -3.8985, Lag order = 3, p-value = 0.02136
## alternative hypothesis: stationary

Figure 16 is the Sample ACF plots and the Augmented Dickey-Fuller test results for Melbourne residential PPI and population change series after seasonal and ordinal differencing, it shows that:

There is no decaying patterns or significant seasonal lags left.
The Augmented Dickey-Fuller test has p values of 0.01 (< 0.05) and 0.02136 (< 0.05) respectively. These values are enough to reject the null hypothesis of non stationarity.

Both time series are determined to be stationary after seasonal and ordinal differencing, they are now ready to be used to the prewhitening procedure.

Prewhitening process

The following code chunk is used to apply prewhitening process to Melbourne residential PPI and population change series:

prewhiten(as.vector(data2.diff[,1]), as.vector(data2.diff[,2]), ylab="CCF",
                    main="Sample CCF after prewhitening between residential PPI and population change")

Figure 17

Figure 17 is the sample CCF plots for Melbourne residential PPI and population change after prewhitening, it shows that:

There is no significant correlation left between residential PPI and population change.

The distinction of correlation structure between residential PPI and population change indicates that the correlation between residential PPI and population change is indeed spurious.

Conclusion

There is no significant correlations left in the sample CCF plot after prewhitening process, therefore the correlation between the quarterly Residential Property Price Index (PPI) in Melbourne and quarterly population change is spurious.

Reference

Demirhan H. Time Series Regression Models with Distributed Lag Models. Package ‘dLagM’. 2019 Aug 24. Sourced from: https://cran.r-project.org/web/packages/dLagM/dLagM.pdf

MATH1307 Assignment 2

Xiyue Shu s3705474

Load the relevant packages and create user functions

Task 1

Introduction

Method

Data description

Data exploration and visualisation

Decomposition

Model fitting with time series regression models

Model fitting with exponential smoothing models

Model fitting with state-space models

Comparing and choosing the best model for forecasting

Conclusion

Task 2

Introduction

Method

Data description

Data exploration and visualisation

Ensure stationarity

Prewhitening process

Conclusion

Reference