Introduction
Task 1 - Analysis and forecasting of horizontal solar radiation time series
Task 2 - Analysis of the correlation between Residential Property Price Index (PPI) in Melbourne and population change in Victoria from September 2003 to December 2016.
Conclusion

Introduction

The report consists of two parts. The objective of the first task is analysis and forecasting of the amount of horizontal solar radiation reaching the ground at a particular location over the globe. For this task, we will apply time series regression method to fit distributed lag models using monthly precipitation series as an independent explanatory series. We will also apply exponential smoothing methods with corresponding state-space models to forecast solar radiation series. We will then demonstrate an appropriate comparison between these methods in terms of residual assumptions and goodness of fit measures. The final goal of this analysis is to give 2 years ahead forecasts from the best suitable model in terms of its mean absolute scaled error (MASE) measure.

The aim of the second task is to analyse the correlation structure between quarterly Residential Property Price Index (PPI) in Melbourne and population change over the previous quarter in Victoria from September 2003 to December 2016. We will explore and demonstrate if the correlation between the two series is found spurious or not.

Task 1 - Analysis and forecasting of horizontal solar radiation time series

Data description

The dataset solar used for the following investigation contains information about monthly average horizontal solar radiation and monthly precipitation measured at the same points between January 1960 and December 2014.

The following code loads the dataset and creates time series objects of radiation and precipitation.

solar <- read.csv("~/Documents/Postgrads/Sem 2 2019/Forecasting/data files/data1.csv")
radiation <- ts(solar$solar, start = c(1960,1), frequency = 12)
head(radiation)

##            Jan       Feb       Mar       Apr       May       Jun
## 1960  5.051729  6.415832 10.847920 16.930264 24.030797 26.298202

precip <- ts(solar$ppt, start = c(1960,1), frequency = 12)
head(precip)

##        Jan   Feb   Mar   Apr   May   Jun
## 1960 1.333 0.921 0.947 0.615 0.544 0.703

solar_ts <- ts(solar, start = c(1960,1), frequency = 12)
precip_frc <- read.csv("~/Documents/Postgrads/Sem 2 2019/Forecasting/data files/data.x.csv")

Data exploration and visualisation

We will display a time series plot of solar radiation data to explore its characteristics.

plot(radiation, main = "Time series plot of solar radiation series", ylab = "Amount of solar radiation", xlab = "Time")
points(radiation,x=time(radiation), pch=as.vector(season(radiation)))

Figure 1

From the plot in Figure 1, we can observe the following characteristics of the series:

There is no apparent trend.
There is obvious seasonality, with lower values in December and January and higher values in June and July. The seasonal pattern is not consistent across the observed time.
Changing variance and behaviour of the series are not obvious due to the presence of seasonality.
There are two potential intervention points around 1965 and 1987.

We will further display sample ACF and conduct an Augmented Dickey-Fuller test to study stationarity and seasonality in the series. The length of our data allows to display more lags in the ACF plot to better observe any evidence of trend.

acf(radiation, lag.max = 48, main = "Sample ACF of solar radiation series")

Figure 2

adf.test(radiation, k=ar(radiation)$order)

## Warning in adf.test(radiation, k = ar(radiation)$order): p-value smaller
## than printed p-value

## 
##  Augmented Dickey-Fuller Test
## 
## data:  radiation
## Dickey-Fuller = -4.557, Lag order = 25, p-value = 0.01
## alternative hypothesis: stationary

The ACF plot in Figure 2 shows strong seasonal patterns and suggests no trend. ADF test with lag order = 25 reports stationarity in the series at 5% level of significance (p-value < 0.05). Overall, we conclude that solar radiation series has a strong seasonality pattern.

We will display a time series plot of precipitation series which we will use as a predictor series for distributed lag models.

plot(precip, main = "Time series plot of precipitation series", ylab = "Precipitation", xlab = "Time")
points(precip,x=time(precip), pch=as.vector(season(precip)))

Figure 3

Based on the plot in Figure 3, we can make the following comments on the characteristics of the series:

There might be a slight downward trend, especially in the beginning of the series.
There is a clear seasonality, while the pattern changes overtime, we can say that lower values are observed in July and August and higher values in December-January.
The existence of changing variance and behaviour of the series is not apparent due to seasonality.
There are no obvious intervention points.

To further explore the trend and seasonality components in precipitation series, we will create a sample ACF plot and conduct an ADF test over the series.

acf(precip, lag.max = 48, main = "Sample ACF plot of precipitation series")

Figure 4

adf.test(precip, k = ar(precip)$order)

## 
##  Augmented Dickey-Fuller Test
## 
## data:  precip
## Dickey-Fuller = -3.2594, Lag order = 28, p-value = 0.07769
## alternative hypothesis: stationary

From Figure 4, we can observe that there is a strong seasonal pattern, a decaying pattern of seasonal lags also suggests the possible existence of trend. The ADF test reports p-value = 0.078 > 0.05 which suggests the series is nonstationary at 5% level of significance.

To clearly display the dependent radiation series versus the explanatory precipitation series within the same plot, we will standardise the data. The following code creates a time series plot to explore the relationship of the series.

scale <- scale(solar_ts)
plot(scale, plot.type="s", col = c("firebrick","dodgerblue2"), main = "Solar Radiation vs Precipitation over time (scaled)")
legend("topleft", lty=1, text.width = 7, col = c("firebrick","dodgerblue2"), c("Solar Radiation", "Precipitation"))

Figure 5

The plot in Figure 5 shows that the dependent and the independent series are likely to be negatively correlated. High values of radiation correspond to low values of precipitation and vice versa.

We also calculate the correlation coefficient to check the relationship.

cor(radiation, precip)

## [1] -0.4540277

The correlation coefficient is reported \(r=-0.45\) which suggests a moderate negative correlation between the series and confirms the conclusion made from the plot in Figure 5. After we have explored the characteristics of the individual series and found the evidence of relationship between them, we proceed to modelling stage.

Time series regression methods

Finite distributed lag model

To find a suitable model for forecasting solar radiation values, we will try fitting distributed lag models which include an independent explanatory series and its lags to help explain the overall variation and correlation structure in our dependent series.

To specify the finite lag length for the model, we create a loop that computes accuracy measures like AIC/BIC and MASE for the models with different lag lengths and select a model with the lowest values.

for (i in 1:10){
  model1 <- dlm(x = solar$ppt, y = solar$solar, q = i)
  cat("q =", i, "AIC =", AIC(model1$model), "BIC =", BIC(model1$model), "MASE =", MASE(model1)$MASE, "\n")
}

## q = 1 AIC = 4728.713 BIC = 4746.676 MASE = 1.688457 
## q = 2 AIC = 4712.649 BIC = 4735.095 MASE = 1.675967 
## q = 3 AIC = 4688.551 BIC = 4715.478 MASE = 1.662703 
## q = 4 AIC = 4663.6 BIC = 4695.003 MASE = 1.646357 
## q = 5 AIC = 4644.622 BIC = 4680.499 MASE = 1.613848 
## q = 6 AIC = 4637.489 BIC = 4677.837 MASE = 1.607532 
## q = 7 AIC = 4632.716 BIC = 4677.532 MASE = 1.607042 
## q = 8 AIC = 4625.986 BIC = 4675.267 MASE = 1.604806 
## q = 9 AIC = 4615.084 BIC = 4668.827 MASE = 1.593121 
## q = 10 AIC = 4602.658 BIC = 4660.858 MASE = 1.577996

finite_dlm <- dlm(x = solar$ppt, y = solar$solar, q = 10)
summary(finite_dlm)

## 
## Call:
## lm(formula = model.formula, data = design)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.9353  -5.4124  -0.7911   4.0184  30.8900 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  19.0105     1.0942  17.374  < 2e-16 ***
## x.t          -7.3843     1.8995  -3.887 0.000112 ***
## x.1          -0.4763     2.5395  -0.188 0.851288    
## x.2          -0.1324     2.5734  -0.051 0.958980    
## x.3           1.7902     2.5781   0.694 0.487691    
## x.4           1.9686     2.5808   0.763 0.445877    
## x.5           3.4928     2.5807   1.353 0.176402    
## x.6           0.5243     2.5787   0.203 0.838943    
## x.7           1.6762     2.5797   0.650 0.516088    
## x.8           0.9282     2.5673   0.362 0.717817    
## x.9           0.3754     2.5338   0.148 0.882272    
## x.10         -5.3798     1.8760  -2.868 0.004272 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.256 on 638 degrees of freedom
## Multiple R-squared:  0.3081, Adjusted R-squared:  0.2962 
## F-statistic: 25.82 on 11 and 638 DF,  p-value: < 2.2e-16
## 
## AIC and BIC values for the model:
##        AIC      BIC
## 1 4602.658 4660.858

vif(finite_dlm$model)

##      x.t      x.1      x.2      x.3      x.4      x.5      x.6      x.7 
## 4.244594 7.665259 7.910115 7.952633 7.957841 7.941836 7.911213 7.901898 
##      x.8      x.9     x.10 
## 7.847965 7.653512 4.228221

It is observed that the values of information criteria as well as MASE decrease as the lag q increases, so we will fit a finite DLM with a number of lags = 10.

According to the significance tests of model coefficients obtained from the summary, nearly all lag weights of a predictor series are not statistically significant at 5% level. The adjusted \(R^2\) for finite_dlm is 0.296, which means that the model explains only 29.6% of the variability in radiation. F-test of the overall significance of the model reports the model is statistically significant at 5% level (p-value < 0.05). However, we conclude that the model is not a good fit to the data due to insignificant terms and low explainability.

There is no issue with multicollinearity in the model, VIF values are reported < 10.

residualcheck <- function(x){
  checkresiduals(x)
  bgtest(x)
  shapiro.test(x$residuals)
}
residualcheck(finite_dlm$model)

Figure 6

## 
##  Breusch-Godfrey test for serial correlation of order up to 15
## 
## data:  Residuals
## LM test = 588.43, df = 15, p-value < 2.2e-16

## 
##  Shapiro-Wilk normality test
## 
## data:  x$residuals
## W = 0.94643, p-value = 1.386e-14

The residualcheck function was created to apply a diagnostic check in a dynamic way. It displays residual analysis plots as well as performs the Breusch-Godfrey test of serial correlation and the Shapiro-Wilk normality test of the residuals.

From diagnostic checking in Figure 6, we can observe the following:

The residuals are not randomly distributed.
ACF plot shows that there is serial correlation as well as seasonality in the residuals, the Beusch-Godfrey test supports serial correlation at 5% level of significance (p-value < 0.05).
The histogram and Shapiro-Wilk (p-value < 0.05) test report that the normality assumption does not hold.

Overall, we can conclude that the finite DLM of lag 10 is not successful at capturing the autocorrelation and seasonality in the series.

Polynomial distributed lag model

The finite DLM model had significant issues in diagnostic checks and had low explainability, therefore, we will fit an second order polynomial DLM with lag length = 10.

poly_dlm <- polyDlm(x = as.vector(solar$ppt), y = as.vector(solar$solar), q = 10, k = 2)

## Estimates and t-tests for beta coefficients:
##         Estimate Std. Error t value  P(>|t|)
## beta.0    -5.040      0.435 -11.600 2.39e-28
## beta.1    -2.320      0.314  -7.400 4.33e-13
## beta.2    -0.188      0.251  -0.749 4.54e-01
## beta.3     1.370      0.237   5.770 1.23e-08
## beta.4     2.340      0.243   9.600 1.74e-20
## beta.5     2.720      0.248  11.000 7.32e-26
## beta.6     2.530      0.243  10.400 1.40e-23
## beta.7     1.750      0.235   7.430 3.39e-13
## beta.8     0.387      0.249   1.560 1.20e-01
## beta.9    -1.560      0.312  -4.990 7.68e-07
## beta.10   -4.090      0.433  -9.440 7.08e-20

summary(poly_dlm)

## 
## Call:
## "Y ~ (Intercept) + X.t"
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -19.012  -5.343  -1.090   4.097  31.641 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 18.77806    1.08588   17.29   <2e-16 ***
## z.t0        -5.04398    0.43502  -11.60   <2e-16 ***
## z.t1         3.01112    0.18300   16.45   <2e-16 ***
## z.t2        -0.29153    0.01761  -16.56   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.237 on 646 degrees of freedom
## Multiple R-squared:  0.3025, Adjusted R-squared:  0.2992 
## F-statistic: 93.38 on 3 and 646 DF,  p-value: < 2.2e-16

According to model summary, all lag weights are significant at 5% level except lag 2 and lag 8 (p-value > 0.05). The adjusted \(R^2=0.299\), the model explains about 30% of the variability in radiation which is similar to the finite DLM. The overall significance test reports the model is statistically significant at 5% level with p-value reported < 0.05.

There is an issue with multicollinearity as VIF values are greater than 10 for z.t1 and z.t2 terms.

residualcheck(poly_dlm$model)

Figure 7

## 
##  Breusch-Godfrey test for serial correlation of order up to 10
## 
## data:  Residuals
## LM test = 580.37, df = 10, p-value < 2.2e-16

## 
##  Shapiro-Wilk normality test
## 
## data:  x$residuals
## W = 0.94582, p-value = 1.12e-14

The analysis of residuals from polynomial model in Figure 7 shows the following:

The errors are not randomly spread.
There are a lot of highly significant lags in the ACF plot as well as a wavy pattern at seasonal lags, so there is autocorrelation and seasonality still present in the residuals.
Beusch-Godfrey test reports a p-value < 0.05, therefore there is serial correlation in the residuals at 5% level of significance.
The normality of the residuals is also violated, as observed from the histogram and Shapiro-Wilk normality test report (p-value < 0.05).

Overall, we can conclude that the second order polynomial of lag 10 is not successful at capturing the autocorrelation and seasonality in the series and has low explainability.

Koyck transformation

We will implement Koyck transformation model with precipitation predictor series as follows.

koyck = koyckDlm(x = as.vector(solar$ppt), y = as.vector(solar$solar))
summary(koyck$model, diagnostics = T)

## 
## Call:
## "Y ~ (Intercept) + Y.1 + X.t"
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.0926  -3.5961   0.3176   3.6103  14.8399 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.23925    0.76549  -2.925  0.00356 ** 
## Y.1          0.98546    0.02424  40.650  < 2e-16 ***
## X.t          5.34684    0.84383   6.336 4.37e-10 ***
## 
## Diagnostic tests:
##                  df1 df2 statistic p-value    
## Weak instruments   1 656     710.7  <2e-16 ***
## Wu-Hausman         1 655     146.8  <2e-16 ***
## Sargan             0  NA        NA      NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.814 on 656 degrees of freedom
## Multiple R-Squared: 0.7598,  Adjusted R-squared: 0.7591 
## Wald test:  1104 on 2 and 656 DF,  p-value: < 2.2e-16

vif(koyck$model)

##      Y.1      X.t 
## 1.605001 1.605001

From the model summary, we can conclude that all terms of Koyck model are significant at 5% level. The model is reported to be overall statistically significant at 5% level (p-value < 0.05) and its adjusted \(R^2=0.759\) which means the model explains about 76% of the variability in radiation.

According to the Weak instruments test (p-value < 0.05), the model at the first stage of least-squares estimation is significant at 5% level.
From the Wu-Hausman test (p-value < 0.05), we can conclude that there is significant correlation between the explanatory variable and the error term at 5% level.

There is no effect of multicollinearity as all VIFs are less than 10.

residualcheck(koyck$model)

Figure 8

## 
##  Shapiro-Wilk normality test
## 
## data:  x$residuals
## W = 0.98487, p-value = 2.468e-06

From the residual analysis in Figure 8, we can conclude the following:

The errors are not spread randomly.
All the lags in ACF plot are significant and have a wave-like pattern, which suggests serial correlation and seasonality remaining in the residuals.
The errors are not normal. The histogram and the Shapiro-Wilk normality test with p-value < 0.05 suggest not normal residuals.

Overall, we can conclude that the Koyck model is also not successful at capturing the autocorrelation and seasonality in the series.

Autoregressive distributed lag models

The final model type from time series regression method is Autoregressive distributed lag models. For specifying the parameters of ARDL(p,q), we create a loop that fits autoregressive DLMs for a range of lag lengths and orders of the AR process and obtains their accuracy measures, like AIC/BIC and MASE.

Three models with lowest values of MASE were chosen for fitting and analysis. The models were:

ARDL(3,5)
ARDL(4,5)
ARDL(5,5)

We create a loop to fit these candidate models and do residual analysis in a dynamical way.

for (i in 1:5){
  for(j in 1:5){
    model2 = ardlDlm(x = as.vector(solar$ppt), y = as.vector(solar$solar), p = i , q = j)
    cat("p =", i, "q =", j, "AIC =", AIC(model2$model), "BIC =", BIC(model2$model), "MASE =", MASE(model2)$MASE, "\n")
  }
}

## p = 1 q = 1 AIC = 3712.311 BIC = 3734.765 MASE = 0.8392434 
## p = 1 q = 2 AIC = 3239.416 BIC = 3266.352 MASE = 0.4971918 
## p = 1 q = 3 AIC = 3143.522 BIC = 3174.936 MASE = 0.4740063 
## p = 1 q = 4 AIC = 3138.399 BIC = 3174.288 MASE = 0.4697571 
## p = 1 q = 5 AIC = 3100.283 BIC = 3140.644 MASE = 0.450425 
## p = 2 q = 1 AIC = 3639.223 BIC = 3666.159 MASE = 0.7834855 
## p = 2 q = 2 AIC = 3229.051 BIC = 3260.476 MASE = 0.4951319 
## p = 2 q = 3 AIC = 3137.634 BIC = 3173.535 MASE = 0.4738939 
## p = 2 q = 4 AIC = 3132.962 BIC = 3173.337 MASE = 0.4702773 
## p = 2 q = 5 AIC = 3097.288 BIC = 3142.134 MASE = 0.4503599 
## p = 3 q = 1 AIC = 3608.793 BIC = 3640.207 MASE = 0.7572489 
## p = 3 q = 2 AIC = 3226.623 BIC = 3262.524 MASE = 0.4955334 
## p = 3 q = 3 AIC = 3139.409 BIC = 3179.798 MASE = 0.4737144 
## p = 3 q = 4 AIC = 3134.777 BIC = 3179.638 MASE = 0.4701162 
## p = 3 q = 5 AIC = 3098.808 BIC = 3148.139 MASE = 0.4502885 
## p = 4 q = 1 AIC = 3602.664 BIC = 3638.553 MASE = 0.7580664 
## p = 4 q = 2 AIC = 3224.285 BIC = 3264.66 MASE = 0.4959949 
## p = 4 q = 3 AIC = 3131.289 BIC = 3176.15 MASE = 0.4695096 
## p = 4 q = 4 AIC = 3131.424 BIC = 3180.772 MASE = 0.4665123 
## p = 4 q = 5 AIC = 3096.024 BIC = 3149.839 MASE = 0.4479481 
## p = 5 q = 1 AIC = 3599.402 BIC = 3639.764 MASE = 0.7572617 
## p = 5 q = 2 AIC = 3221.853 BIC = 3266.699 MASE = 0.4954501 
## p = 5 q = 3 AIC = 3127.103 BIC = 3176.434 MASE = 0.4675479 
## p = 5 q = 4 AIC = 3127.868 BIC = 3181.684 MASE = 0.4651969 
## p = 5 q = 5 AIC = 3097.877 BIC = 3156.177 MASE = 0.4479311

for (i in c(3,4,5)){
  ardl <- ardlDlm(x = as.vector(solar$ppt), y = as.vector(solar$solar), p = i, q = 5)
  summary(ardl)
  residualcheck(ardl$model)
}

## 
## Time series regression with "ts" data:
## Start = 6, End = 660
## 
## Call:
## dynlm(formula = as.formula(model.text), data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.6649  -1.4447  -0.2663   1.0644  18.7430 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.20532    0.42237   5.221 2.40e-07 ***
## X.t         -0.60604    0.54924  -1.103 0.270258    
## X.1          0.89253    0.77682   1.149 0.251001    
## X.2          1.60774    0.77838   2.065 0.039276 *  
## X.3         -0.38294    0.55738  -0.687 0.492308    
## Y.1          1.27345    0.03859  32.999  < 2e-16 ***
## Y.2         -0.02859    0.06250  -0.457 0.647495    
## Y.3         -0.40351    0.06036  -6.685 5.01e-11 ***
## Y.4         -0.22632    0.06234  -3.630 0.000305 ***
## Y.5          0.22116    0.03779   5.853 7.69e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.553 on 645 degrees of freedom
## Multiple R-squared:  0.9333, Adjusted R-squared:  0.9324 
## F-statistic:  1003 on 9 and 645 DF,  p-value: < 2.2e-16

Figure 9

## 
##  Breusch-Godfrey test for serial correlation of order up to 13
## 
## data:  Residuals
## LM test = 112.86, df = 13, p-value < 2.2e-16
## 
## 
## Time series regression with "ts" data:
## Start = 6, End = 660
## 
## Call:
## dynlm(formula = as.formula(model.text), data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.5901  -1.3826  -0.2867   1.0526  18.5709 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.46103    0.43729   5.628 2.72e-08 ***
## X.t         -0.61439    0.54768  -1.122 0.262364    
## X.1          0.78469    0.77617   1.011 0.312409    
## X.2          1.28470    0.79026   1.626 0.104509    
## X.3          0.80457    0.77946   1.032 0.302355    
## X.4         -1.20750    0.55570  -2.173 0.030148 *  
## Y.1          1.27195    0.03849  33.049  < 2e-16 ***
## Y.2         -0.01868    0.06249  -0.299 0.765112    
## Y.3         -0.40480    0.06019  -6.725 3.87e-11 ***
## Y.4         -0.23201    0.06221  -3.729 0.000209 ***
## Y.5          0.21746    0.03772   5.766 1.26e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.546 on 644 degrees of freedom
## Multiple R-squared:  0.9338, Adjusted R-squared:  0.9328 
## F-statistic: 908.5 on 10 and 644 DF,  p-value: < 2.2e-16

Figure 9

## 
##  Breusch-Godfrey test for serial correlation of order up to 14
## 
## data:  Residuals
## LM test = 103.8, df = 14, p-value = 8.811e-16
## 
## 
## Time series regression with "ts" data:
## Start = 6, End = 660
## 
## Call:
## dynlm(formula = as.formula(model.text), data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.5959  -1.3825  -0.2646   1.0410  18.5812 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.50740    0.45434   5.519 4.96e-08 ***
## X.t         -0.61416    0.54804  -1.121 0.262863    
## X.1          0.78299    0.77670   1.008 0.313788    
## X.2          1.26543    0.79241   1.597 0.110772    
## X.3          0.75184    0.79227   0.949 0.342998    
## X.4         -1.00181    0.77678  -1.290 0.197617    
## X.5         -0.21024    0.55439  -0.379 0.704639    
## Y.1          1.27063    0.03867  32.861  < 2e-16 ***
## Y.2         -0.01727    0.06264  -0.276 0.782907    
## Y.3         -0.40297    0.06043  -6.669 5.56e-11 ***
## Y.4         -0.23273    0.06229  -3.737 0.000203 ***
## Y.5          0.21571    0.03802   5.673 2.12e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.548 on 643 degrees of freedom
## Multiple R-squared:  0.9338, Adjusted R-squared:  0.9327 
## F-statistic: 824.9 on 11 and 643 DF,  p-value: < 2.2e-16

Figure 9

## 
##  Breusch-Godfrey test for serial correlation of order up to 15
## 
## data:  Residuals
## LM test = 107.98, df = 15, p-value = 3.937e-16

From model summaries, we can conclude that all the fitted ARDL models were reported to be statistically significant at 5% level with p-value < 0.05. All models have an adjusted \(R^2 = 0.933\), which means they explain about 93.3% of the variability in radiation.

Regarding model coefficient estimates, we can observe for ARDL(3,5) only X.2 lag of predictor series is significant at 5% level (p-value = 0.0393 < 0.05), for ARDL(4,5) only X.4 lag of predictor series is significant at 5% level (p-value = 03014 < 0.05), and all lags of predictor series are not statistically significant at 5% level for ARDL(5,5). All lags of independent series are statistically significant in all models except Y.2 (p-value = 0.7829 > 0.05).

The plots from diagnostic checking in Figure 9 show that there is a very similar overall picture in residuals from all three fitted models:

The residuals are not as randomly spread as desired, they show evidence of changing variance.
There are a some highly significant lags in the ACF plot. The seasonal lags are also highly significant. Therefore, there is autocorrelation and seasonality still present in the residuals.
Beusch-Godfrey test reports a p-value < 0.05, therefore there is serial correlation in the residuals at 5% level of significance.
Long tails on the histogram of residuals suggest the normality of the residuals is violated.

Based on the observation about model estimates made earlier, we can try to decrease the number of lags for predictor series. We will fit ARDL(1,5) and perform diagnostic checking.

ardl_15 <- ardlDlm(x = as.vector(solar$ppt), y = as.vector(solar$solar), p = 1, q = 5)
summary(ardl_15)

## 
## Time series regression with "ts" data:
## Start = 6, End = 660
## 
## Call:
## dynlm(formula = as.formula(model.text), data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.3364  -1.4314  -0.2553   1.0518  18.8604 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.35018    0.39678   5.923 5.13e-09 ***
## X.t         -1.03172    0.50686  -2.036 0.042207 *  
## X.1          2.26211    0.50758   4.457 9.81e-06 ***
## Y.1          1.28228    0.03818  33.584  < 2e-16 ***
## Y.2         -0.03149    0.06265  -0.503 0.615375    
## Y.3         -0.41374    0.06025  -6.867 1.54e-11 ***
## Y.4         -0.22906    0.06233  -3.675 0.000258 ***
## Y.5          0.22742    0.03770   6.032 2.72e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.56 on 647 degrees of freedom
## Multiple R-squared:  0.9328, Adjusted R-squared:  0.932 
## F-statistic:  1282 on 7 and 647 DF,  p-value: < 2.2e-16

residualcheck(ardl_15$model)

Figure 10

## 
##  Breusch-Godfrey test for serial correlation of order up to 11
## 
## data:  Residuals
## LM test = 117.99, df = 11, p-value < 2.2e-16

## 
##  Shapiro-Wilk normality test
## 
## data:  x$residuals
## W = 0.89664, p-value < 2.2e-16

The p-value of the overall significance test is < 0.05, therefore ARDL(1,5) model is statistically significant at 5% level. All the estimated terms are significant at 5% level except Y.2 - second lag of independent series (p-value = 0.6154 > 0.05). Adjusted \(R^2 = 0.932\), which means they explain 93.2% of the variability in radiation.

The plots from diagnostic checking in Figure 10 show the same picture as the diagnostic checkings in Figure 9, so the comments are the same as for previously fitted models.

Overall, none of the models from time series regression method were successful at capturing the autocorrelation and seasonal pattern in radiation series.

We create a data frame accuracy to store the accuracy measures, like AIC/BIC and MASE from the models fitted so far. The accuracy measures for further models will be appended to this data frame.

attr(koyck$model,"class") = "lm"
ardl_35 <- ardlDlm(x = as.vector(solar$ppt), y = as.vector(solar$solar), p = 3, q = 5)
ardl_45 <- ardlDlm(x = as.vector(solar$ppt), y = as.vector(solar$solar), p = 4, q = 5)
ardl_55 <- ardlDlm(x = as.vector(solar$ppt), y = as.vector(solar$solar), p = 5, q = 5)
models <- c("Finite DLM", "Poly DLM", "Koyck", "ARDL_15", "ARDL_35", "ARDL_45", "ARDL_55")
aic <- AIC(finite_dlm$model, poly_dlm$model, koyck$model, ardl_15$model, ardl_35$model, ardl_45$model, ardl_55$model)$AIC
bic <- BIC(finite_dlm$model, poly_dlm$model, koyck$model, ardl_15$model, ardl_35$model, ardl_45$model, ardl_55$model)$BIC
MASE <- MASE(finite_dlm$model, poly_dlm$model, koyck$model, ardl_15, ardl_35, ardl_45, ardl_55)$MASE
accuracy <- data.frame(models, MASE, aic, bic)
colnames(accuracy) <- c("Model", "MASE", "AIC", "BIC")
head(accuracy)

##        Model      MASE      AIC      BIC
## 1 Finite DLM 1.5779955 4602.658 4660.858
## 2   Poly DLM 1.5925546 4591.904 4614.289
## 3      Koyck 1.0324829 3946.476 3964.439
## 4    ARDL_15 0.4504250 3100.283 3140.644
## 5    ARDL_35 0.4502885 3098.808 3148.139
## 6    ARDL_45 0.4479481 3096.024 3149.839

Exponential smoothing methods

Another forecasting method we will try is exponential smoothing. Because we have found a strong seasonal component in radiation series that we want to make predictions for, we will only consider models that include either additive or multiplicative seasonality. Since there are 6 possible combinations of exponential smoothing models with seasonality that can be implemented in R, we create a loop to fit them and store their accuracy measures.

exponential = c(T,F)
seasonality <- c("additive","multiplicative")
damped <- c(T,F)
exp <- expand.grid(exponential, seasonality, damped)
exp <- exp[-c(1,5),]
fit_aic <- array(NA, 6)
fit_bic <- array(NA, 6)
fit_mase <- array(NA, 6)
levels <- array(NA, dim=c(6,3))
for (i in 1:6){
  hw <- hw(radiation, exponential = exp[i,1], seasonal = toString(exp[i,2], damped = exp[i,3]))
  fit_aic[i] <- hw$model$aic
  fit_bic[i] <- hw$model$bic
  fit_mase[i] <- accuracy(hw)[6]
  levels[i,1] <- exp[i,1]
  levels[i,2] <- toString(exp[i,2])
  levels[i,3] <- exp[i,3]
  checkresiduals(hw)
}

Figure 11

## 
##  Ljung-Box test
## 
## data:  Residuals from Holt-Winters' additive method
## Q* = 205.55, df = 8, p-value < 2.2e-16
## 
## Model df: 16.   Total lags used: 24

Figure 11

## 
##  Ljung-Box test
## 
## data:  Residuals from Holt-Winters' multiplicative method with exponential trend
## Q* = 21.246, df = 8, p-value = 0.006521
## 
## Model df: 16.   Total lags used: 24

Figure 11

## 
##  Ljung-Box test
## 
## data:  Residuals from Holt-Winters' multiplicative method
## Q* = 38.585, df = 8, p-value = 5.868e-06
## 
## Model df: 16.   Total lags used: 24

Figure 11

## 
##  Ljung-Box test
## 
## data:  Residuals from Holt-Winters' additive method
## Q* = 205.55, df = 8, p-value < 2.2e-16
## 
## Model df: 16.   Total lags used: 24

Figure 11

## 
##  Ljung-Box test
## 
## data:  Residuals from Holt-Winters' multiplicative method with exponential trend
## Q* = 21.246, df = 8, p-value = 0.006521
## 
## Model df: 16.   Total lags used: 24

Figure 11

## 
##  Ljung-Box test
## 
## data:  Residuals from Holt-Winters' multiplicative method
## Q* = 38.585, df = 8, p-value = 5.868e-06
## 
## Model df: 16.   Total lags used: 24

Figure 11 shows residual analysis plots for each model fitted. We can observe that damping the trend in fact does not make any difference in terms of model fit, the residuals look identical.

Residuals from Holt-Winters’ additive method (Figure 11 plot 1) have issues with randomness, serial correlation and normality.
Residuals from Holt-Winters’ multiplicative method with multiplicative trend (Figure 11 plot 2) have only one lag significant in ACF which is an improvement from all the previous models. However, Ljung-Box test reports p-value = 0.006521 (< 0.05) which suggests autocorrelation is still present in the residuals. There are issues with normality and there are also some large residuals (standardised residuals > 3).
Residuals from Holt-Winters’ multiplicative method with additive trend (Figure 11 plot 3) show a similar picture to the previous model.

Overall, we can observe a slight improvement in residuals from exponential smoothing models in terms of serial correlation and especially seasonality. Holt-Winters’ multiplicative method is the most successful at capturing autocorrelation and seasonality in radiation series.

We append the accuracy measures for exponential smoothing models to the accuracy data frame. The format of model names is: trend (multiplicative or additive), seasonality (multiplicative or additive) and if the trend is damped (damped) or no (N).

values <- data.frame(levels, fit_mase, fit_aic, fit_bic)
colnames(values) <- c("Trend", "Seasonality", "Damped", "MASE", "AIC", "BIC")
values$Trend <- factor(values$Trend, levels = c(T,F), labels = c("multiplicative","additive"))
values$Damped <- factor(values$Damped, levels = c(T,F), labels = c("damped","N"))
values <- unite(values, col = "Model", c("Trend","Seasonality","Damped"))
accuracy <- rbind(accuracy, values)
accuracy

##                                   Model      MASE      AIC      BIC
## 1                            Finite DLM 1.5779955 4602.658 4660.858
## 2                              Poly DLM 1.5925546 4591.904 4614.289
## 3                                 Koyck 1.0324829 3946.476 3964.439
## 4                               ARDL_15 0.4504250 3100.283 3140.644
## 5                               ARDL_35 0.4502885 3098.808 3148.139
## 6                               ARDL_45 0.4479481 3096.024 3149.839
## 7                               ARDL_55 0.4479311 3097.877 3156.177
## 8              additive_additive_damped 0.2471600 5434.708 5511.076
## 9  multiplicative_multiplicative_damped 0.2320404 6584.208 6660.576
## 10       additive_multiplicative_damped 0.2233077 6648.746 6725.114
## 11                  additive_additive_N 0.2471600 5434.708 5511.076
## 12      multiplicative_multiplicative_N 0.2320404 6584.208 6660.576
## 13            additive_multiplicative_N 0.2233077 6648.746 6725.114

State-space models variations

For each exponential smoothing method, there are two corresponding state-space models (with additive or multiplicative errors). There are 8 state-space variations which include seasonality that we cam implement in R (some combinations are forbidden due to their stability issues). We create a loop to fit these models and capture their accuracy measures for further comparison.

var <- c("AAA", "MAA", "MAM", "MMM")
damps <- c(T,F)
ets_models <- expand.grid(var, damps)
ets_aic <- array(NA, 8)
ets_mase <- array(NA,8)
ets_bic <- array(NA,8)
mod <- array(NA, dim=c(8,2))
for (i in 1:8){
  ets <- ets(radiation, model = toString(ets_models[i, 1]), damped = ets_models[i,2])
  ets_aic[i] <- ets$aic
  ets_bic[i] <- ets$bic
  ets_mase[i] <- accuracy(ets)[6]
  mod[i,1] <- toString(ets_models[i,1])
  mod[i,2] <- ets_models[i,2]
}

The auto ETS model was fitted to see what the software automatically suggested model is.

auto_fit <- ets(radiation)
summary(auto_fit)

## ETS(A,Ad,A) 
## 
## Call:
##  ets(y = radiation) 
## 
##   Smoothing parameters:
##     alpha = 0.9999 
##     beta  = 1e-04 
##     gamma = 1e-04 
##     phi   = 0.9388 
## 
##   Initial states:
##     l = 11.154 
##     b = 0.7632 
##     s = -10.4919 -8.137 -3.348 2.5794 8.08 11.1219
##            9.9586 6.9916 1.9573 -1.8565 -7.1607 -9.6946
## 
##   sigma:  2.3446
## 
##      AIC     AICc      BIC 
## 5428.422 5429.489 5509.282 
## 
## Training set error measures:
##                       ME     RMSE      MAE       MPE     MAPE      MASE
## Training set -0.01091357 2.314163 1.498521 -1.468083 12.44796 0.2461797
##                   ACF1
## Training set 0.1700724

The model suggested automatically is ETS(A,Ad,A) which is a model with additive errors, additive damped trend and additive seasonality.

checkresiduals(auto_fit)

Figure 12

## 
##  Ljung-Box test
## 
## data:  Residuals from ETS(A,Ad,A)
## Q* = 210.76, df = 7, p-value < 2.2e-16
## 
## Model df: 17.   Total lags used: 24

The residual analysis in Figure 12 shows the following:

Residuals are not spread randomly and have changing variance.
There are a lot of significant lags in the ACF plot similar to what we have seen in Holt-Winters’ additive method. Ljung-Box test reports p-value < 0.05, therefore there is evidence of autocorrelation in the residuals. In addition, lag 12 is significant which suggests seasonality effect is still present in the residuals.
According to long tails in the histogram, there are also issues with normality.

Overall, the residual analysis suggest this model is not successful at capturing autocorrelation and seasonality in radiation series.

We append the accuracy measures for state-space models to the accuracy data frame.

measures <- data.frame(mod, ets_mase, ets_aic, ets_bic)
measures$X2 <- factor(measures$X2, levels = c(T,F), labels = c("Damped","N"))
measures <- unite(measures, "Model", c("X1","X2"))
colnames(measures) <- c("Model", "MASE", "AIC", "BIC")
accuracy <- rbind(accuracy,measures)

The data frame accuracy is sorted by ascending MASE value.

accuracy <- arrange(accuracy, MASE)
kable(accuracy, caption = "Models and their accuracy parameters (sorted by MASE)")

Models and their accuracy parameters (sorted by MASE)
Model	MASE	AIC	BIC
additive_multiplicative_damped	0.2233077	6648.746	6725.114
additive_multiplicative_N	0.2233077	6648.746	6725.114
multiplicative_multiplicative_damped	0.2320404	6584.208	6660.576
multiplicative_multiplicative_N	0.2320404	6584.208	6660.576
AAA_Damped	0.2461797	5428.422	5509.282
additive_additive_damped	0.2471600	5434.708	5511.076
additive_additive_N	0.2471600	5434.708	5511.076
AAA_N	0.2471600	5434.708	5511.076
MMM_Damped	0.3201193	5995.550	6076.410
MAM_Damped	0.3222574	5953.502	6034.363
MAM_N	0.3721664	6105.959	6182.327
MAA_Damped	0.3798095	6469.079	6549.940
ARDL_55	0.4479311	3097.877	3156.177
ARDL_45	0.4479481	3096.024	3149.839
ARDL_35	0.4502885	3098.808	3148.139
ARDL_15	0.4504250	3100.283	3140.644
MAA_N	0.4748561	7602.755	7679.123
MMM_N	0.5292151	6670.168	6746.536
Koyck	1.0324829	3946.476	3964.439
Finite DLM	1.5779955	4602.658	4660.858
Poly DLM	1.5925546	4591.904	4614.289

The accuracy table will be used to compare all methods we have tried at the modelling stage in terms of their MASE. The model that minimises MASE is Holt-Winters’ multiplicative method with additive trend (there is no difference in models in terms of damping). The best state-space model in terms of lowest MASE is ETS(A,Ad,A) which is the model with additive errors, additive damped trend and additive seasonality. ETS(A,Ad,A) was also the model suggested automatically. We can see from the table that time series regression methods perform the worst in terms of MASE but this approach has lower AIC/BIC measures than the exponential smoothing approach.

Forecasting

For deciding on the final model to give two years ahead forecasts of solar radiation, we compare forecasts from three models:

Holt-Winters’ multiplicative method which has the lowest MASE and is the most successful at capturing the autocorrelation and seasonality in the series
Holt-Winters’ multiplicative method with multiplicative trend which has the second lowest MASE and is also good at capturing the autocorrelation and seasonality in the series
ETS(A,Ad,A) model was suggested by an automatic algorithm and has the lowest MASE of all state-space models but does not capture autocorrelation in the series

The fitted values and 2 year forecasts are displayed in Figure 13.

fit1 <- hw(radiation, seasonal = "multiplicative", h = 2*frequency(radiation))
fit2 <- hw(radiation, seasonal = "multiplicative", exponential = T, h = 2*frequency(radiation))
fit3 <- ets(radiation, model="AAA", damped = T)
frc_fit3 <- forecast(fit3)
plot(frc_fit3, fcol = "white", main = "Solar radiation series with two years ahead forecasts", ylab = "Radiation", ylim = c(-10,55))
lines(fitted(fit1), col = "darkgreen")
lines(fit1$mean, col = "darkgreen", lwd = 2)
lines(fitted(fit2), col = "brown2")
lines(fit2$mean, col = "brown2", lwd = 2)
lines(fitted(fit3), col = "dodgerblue3")
lines(frc_fit3$mean, col = "dodgerblue3", lwd = 2)
legend("bottomleft", lty = 1, col = c("black", "darkgreen", "brown2", "dodgerblue3"), c("Data", "Holt-Winters' Multiplicative", "Holt-Winters' Multiplicative Exponential", "ETS(A,Ad,A)"))

Figure 13

We can observe that the fitted values versus the original data are generally similar with all the models, the state-space model however is the furthest of all from the data. There are points around years 1966 and 1987 where we can see the Holt-Winters’ multiplicative method approximates the original series better.

The forecasts given by three models are quite different from each other. The state-space model suggests the highest peaks for the forecasts. Holt-Winters’ multiplicative approach suggests the amount of solar radiation will decrease over the next 2 years. Holt-Winters’ multiplicative method with multiplicative trend suggests that the amount of solar radiation will remain somewhat the same over the next year but will slightly decrease over the second year.

Overall, based on the residual analysis and the lowest MASE value, we take Holt-Winters’ multiplicative model to give 2 years ahead forecasts of the amount of solar radiation.

The final forecasts are displayed in Figure 14.

plot(fit1, fcol = "white", main = "Solar radiation series with two years ahead forecasts", ylab = "Radiation")
lines(fitted(fit1), col = "darkgreen")
lines(fit1$mean, col = "darkgreen", lwd = 2)
legend("topleft", lty = 1, col = c("black", "darkgreen"), c("Data", "Forecasts"))

Figure 14

The solar radiation 2 years ahead point forecast values with corresponding 95% confidence intervals are as follows:

frc <- fit1$mean
ub <- fit1$upper[,2]
lb <- fit1$lower[,2]
forecasts <- ts.intersect(ts(lb, start = c(2015,1), frequency = 12), ts(frc,start = c(2015,1), frequency = 12), ts(ub,start = c(2015,1), frequency = 12))
colnames(forecasts) <- c("Lower bound", "Point forecast", "Upper bound")
forecasts

##          Lower bound Point forecast Upper bound
## Jan 2015   1.2440033       5.610335    9.976666
## Feb 2015  -0.3306776       6.512806   13.356290
## Mar 2015  -2.6068190       8.786120   20.179058
## Apr 2015  -5.4366123      10.192785   25.822181
## May 2015  -9.6237347      12.512597   34.648928
## Jun 2015 -14.1890164      14.061609   42.312235
## Jul 2015 -18.2282011      14.502088   47.232377
## Aug 2015 -19.6235881      12.945620   45.514829
## Sep 2015 -18.5159294      10.352210   39.220350
## Oct 2015 -15.4143760       7.418279   30.250935
## Nov 2015 -11.9078451       4.989730   21.887305
## Dec 2015 -10.5204066       3.871776   18.263960
## Jan 2016 -12.9554236       4.199400   21.354224
## Feb 2016 -16.7893479       4.838892   26.467131
## Mar 2016 -25.1740619       6.477174   38.128410
## Apr 2016 -32.3437711       7.452633   47.249037
## May 2016 -43.8456686       9.069749   61.985166
## Jun 2016 -54.2836502      10.099485   74.482621
## Jul 2016 -61.5580226      10.315199   82.188420
## Aug 2016 -60.3285438       9.113765   78.556075
## Sep 2016 -52.8981594       7.208700   67.315559
## Oct 2016 -41.5237568       5.105869   51.735495
## Nov 2016 -30.5729044       3.391945   37.356795
## Dec 2016 -25.9541251       2.597256   31.148637

However, we can observe that the 95% confidence intervals for the forecasts from selected approach are very wide and do not provide reliable forecasts.

Task 2 - Analysis of the correlation between Residential Property Price Index (PPI) in Melbourne and population change in Victoria from September 2003 to December 2016.

The purpose of this task is to explore the correlation structure between a Residential Property Price Index series and population change in Victoria series and demonstrate if the correlation between them is spurious or not.

Data description

The data dataset used for this task contains the information on three variables: quarter, quarterly Residential Property Price Index (PPI) in Melbourne and population change over the previous quarter in Victoria from September 2003 to December 2016.

The following code loads the dataset.

data <- read_csv("~/Documents/Postgrads/Sem 2 2019/Forecasting/data files/data2.csv")

## Parsed with column specification:
## cols(
##   Quarter = col_character(),
##   price = col_double(),
##   change = col_integer()
## )

We will convert the variables of interest (price and change) to time series with a quarterly frequency, starting from the third quarter of 2003.

ppi_change <- ts(data[,2:3], start = c(2003,3), frequency = 4)
ppi <- ts(data$price, start = c(2003,3), frequency = 4)
change <- ts(data$change, start = c(2003,3), frequency = 4)
head(ppi_change)

##         price change
## 2003 Q3  60.7  14017
## 2003 Q4  62.1  12350
## 2004 Q1  60.8  17894
## 2004 Q2  60.9   9079
## 2004 Q3  60.9  16210
## 2004 Q4  62.4  13788

Data exploration

We will firstly display both series within the same plot.

plot(ppi_change, main = "Time series plots of residential PPI and population change data")

Figure 15

From the time series plot in Figure 1, we can observe that there are simultaneously increasing trends in the series. We also can suggest from this plot that the population change series has a seasonal pattern. Overall, the visualisation implies that there is some correlation between the two series.

To further explore the correlation structure between the series, we display the plot of sample cross-covariance function (CCF). Sample cross-correlations that are higher than \(1.96/\sqrt n\) magnitude (dashed lines on the plot) are deemed to be significantly different from zero.

Analysis of correlation

ccf(as.vector(ppi), as.vector(change), ylab = "CCF", main = "Sample CCF of PPI and population change")

Figure 16

The CCF plot in Figure 2 shows that a number of lags are significantly different from zero according to the \(1.96/\sqrt n\) bounds. The CCF plot suggests the existence of strong cross-correlation between the Residential Property Price Index in Melbourne and population change in Victoria. However, the cause of spurious correlation found between the series is likely to be the existence of nonstationarity in them.

Analysis of nonstationarity

We will explore and explain the nonstationarity of the series individually.

The following code displays a time series plot of quarterly Residential Property Price Index.

plot(ppi, ylab = "Property Price Index", type="o", main = "Time series plot of Residential PPI from 2003 to 2016")

Figure 17

From the plot in Figure 3, we can observe the following about the Residential PPI:

There is a clear upward general trend
There is no seasonality
There is no changing variance present
There is no intervention
Successive points imply autoregressive behaviour

We will also have a look at the sample ACF for this series. The following code creates an ACF plot of PPI series.

acf(ppi, main = "Sample ACF of Residential PPI")

Figure 18

adf.test(ppi)

## 
##  Augmented Dickey-Fuller Test
## 
## data:  ppi
## Dickey-Fuller = -1.3264, Lag order = 3, p-value = 0.8458
## alternative hypothesis: stationary

The slowly decaying pattern in ACF plot (Figure 4) suggests that the series is nonstationary, it provides the evidence of trend and no seasonality in the series.

We also conducted an Augmented Dickey-Fuller unit root test to test the \(H_0\) that the process is difference nonstationary. The ADF test reports p-value = 0.8458 > 0.05, there is not enough evidence to reject the \(H_0\) that the series is nonstationary at 5% level. Overall, we can conclude that the Residential PPI series is nonstationary.

The following code creates a time series plot of quarterly population change in Victoria from 2003 to 2016. We also put quarters on the plot to explore seasonal pattern.

plot(change, ylab = "People", main = "Time series plot of population change from 2003 to 2016")
points(change,x=time(change), pch=as.vector(season(change)))

Figure 19

From the plot in Figure 5, we can observe the following about the population change:

There is an upward general tendency
There is seasonality, population growth is higher in the first quarters and lower in the second quarters of the year
Changing variance is not obvious due to the presence of seasonality
There might be an intervention happening around year 2010
The behaviour of the series is not obvious because of the seasonal pattern

Displaying sample ACF of the series:

acf(change, main = "Sample ACF of population change")

Figure 20

adf.test(change)

## 
##  Augmented Dickey-Fuller Test
## 
## data:  change
## Dickey-Fuller = -1.603, Lag order = 3, p-value = 0.7344
## alternative hypothesis: stationary

The ACF plot in Figure 6 shows that there is a trend as well as seasonality present in the series. So we can confirm the previous observation from the time series plot. ADF test reports p-value = 0.7344 > 0.05 which means there is again not enough evidence to reject \(H_0\) that the process is nonstationary at 5% level of significance.

Based on the previous analysis, we can conclude that both of our series are nonstationary. It is difficult to assess the true dependence between two processes when we have strongly autocorrelated nonstationary data.

Prewhitening

We will apply prewhitening approach to separate the relationship between the series from their autocorrelation. The most important step in prewhitening the series is to make sure they are stationary. Based on the previous analysis of the series, we will perform both regular and seasonal difference and then apply prewhitning. The seasonal difference was done for both series to ensure they have the same length.

diff <- ts.intersect(diff(diff(ppi,4)), diff(diff(change,4)))
prewhiten(as.vector(diff[,1]), as.vector(diff[,2]), ylab='CCF', main = "Sample CFF prewhitened")

Figure 21

From the sample CCF plot after prewhitening in Figure 7 suggests that there is in fact no significant correlation found between a Residential Property Price Index and population change in Victoria. We can conclude that the strong cross-correlation pattern found between raw series was spurious.

Conclusion

Task 1:

To give predictions of solar radiation amount for the next 2 years, we used three methods and compared them based on residual analysis and MASE value. The three methods were:

Time series regression models
Exponential smoothing
State-space models

The main findings from model fitting are that time series regression, some exponential smoothing models and the automatically suggested state-space model were not successful at capturing autocorrelation and seasonality in radiation data. There were also issues with normality assumption in the residuals from all the models. The most successful models to capture seasonality and serial correlation were Holt-Winters’ multiplicative method and Holt-Winters’ multiplicative method with multiplicative trend. These models were also found to be the best in terms of minimising MASE. Based on the performed analysis, we chose Holt-Winters’ multiplicative method for forecasting. The forecasts suggest the amount of solar radiation will decrease over the next 2 years. However, point forecasts have very wide 95% confidence intervals which means they can not be considered reliable.

Task 2:

It was found that there are simultaneously increasing trends in Residential PPI and population change series. Overall, the visualisation analysis found that there is some correlation between the two series. The CCF plot suggested the existence of strong cross-correlation between the series. We applied prewhitening technique to assess the true dependence between two processes. The sample CCF plot after prewhitening suggested that there is in fact no significant correlation found between a Residential Property Price Index and population change in Victoria. In conclusion, we can say that the strong cross-correlation pattern found between raw series was spurious.

MATH1307 Assignment 2

Anna Krinochkina, 3712761

Introduction

Task 1 - Analysis and forecasting of horizontal solar radiation time series

Data description

Data exploration and visualisation

Time series regression methods

Finite distributed lag model

Polynomial distributed lag model

Koyck transformation

Autoregressive distributed lag models

Exponential smoothing methods

State-space models variations

Forecasting

Task 2 - Analysis of the correlation between Residential Property Price Index (PPI) in Melbourne and population change in Victoria from September 2003 to December 2016.

Data description

Data exploration

Analysis of correlation

Analysis of nonstationarity

Prewhitening

Conclusion