AR(2) model: \[ Y_t = \phi_1 Y_{t-1} + \phi_2 Y_{t-2} + \varepsilon_t \] where \(\varepsilon_t\) is white noise with variance \(\sigma_e^2\). The conditional sum-of-squares function is:
\[S_c(\phi_1,\phi_2) = \sum_{t=3}^{n} [(Y_t-\mu) - \phi_1(Y_{t-1}-\mu) - \phi_2(Y_{t-2}-\mu)]^2\]
The implementation in R:
AR2LS = function(Y){
  stopifnot(is.vector(Y))
  n = length(Y)
  # build the lagged regressors explicitly
  Yt_1 = c(NA, Y[-n])                # Y_{t-1}
  Yt_2 = c(NA, NA, Y[-c(n - 1, n)])  # Y_{t-2}
  m = lm(Y ~ Yt_1 + Yt_2)
  ret = list()
  ret$coefficients = m$coefficients
  m_ = mean(Y)
  phi1 = ret$coefficients[2]  # slope on Y_{t-1}; coefficient 1 is the intercept
  phi2 = ret$coefficients[3]  # slope on Y_{t-2}
  # conditional sum of squares S_c(phi1, phi2)
  ret$cond_SS = 0
  for (t in 3:n){
    ret$cond_SS = ret$cond_SS +
      ((Y[t] - m_) - phi1*(Y[t-1] - m_) - phi2*(Y[t-2] - m_))^2
  }
  n_c = n - 3 + 1  # number of terms in the sum
  ret$est_std = sqrt(ret$cond_SS/(n_c - 2))  # noise standard deviation estimate
  return(ret)
}
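As a quick check, one can simulate an AR(2) series and fit it. A minimal sketch, assuming arima.sim for the data generation and a series length of 30 (the length used for the first experiment is not stated in the text):

set.seed(1)
Y = as.vector(arima.sim(model = list(ar = c(0.1, 0.5)), n = 30))
fit = AR2LS(Y)
fit$coefficients  # intercept, phi1, phi2
fit$est_std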
With the choice of parameters \(\phi_1 = 0.1\) and \(\phi_2 = 0.5\), we have:
The fitted model:
## (Intercept) Yt_1 Yt_2
## 0.49749162 -0.01693975 0.15124382
## [1] "Conditional sum squared: 2.23659213986202"
## [1] "Estimation of STD: 2.9329639378976"
We can see that the estimated coefficients differ considerably from the original parameters.
We repeat the experiment 100 times; these are the histograms of \(\hat{\phi}_1\) and \(\hat{\phi}_2\):
## [1] "For phi1: mean= -0.004507012473987 Standard error= 0.291791919054609"
## [1] "For phi2: mean= 0.0224131438882782 Standard error= 0.208758149698439"
Now we try the estimation with n=100:
## [1] "For phi1: mean= -0.0120800459553815 Standard error= 0.130945324700548"
## [1] "For phi2: mean= 0.0896249025000784 Standard error= 0.109600880444313"
We can see that the estimation improves significantly when n is increased to 100, as indicated by the standard errors roughly halving. This is explained by the law of large numbers: the time series reveals more information as the sample size grows.
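A minimal sketch of how the replication could be organized (run_experiment is a hypothetical helper, and the reported "standard errors" are the empirical standard deviations across replications):

run_experiment = function(n, reps = 100, phi = c(0.1, 0.5)){
  est = t(replicate(reps, {
    Y = as.vector(arima.sim(model = list(ar = phi), n = n))
    AR2LS(Y)$coefficients[2:3]  # the two slope estimates
  }))
  c(mean_phi1 = mean(est[, 1]), se_phi1 = sd(est[, 1]),
    mean_phi2 = mean(est[, 2]), se_phi2 = sd(est[, 2]))
}
run_experiment(n = 30)
run_experiment(n = 100)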
We have: \[ \begin{aligned} f_{Y_T,Y_{T-1},\dots,Y_1}(y_T,y_{T-1},\dots,y_1) &= f_{Y_1,Y_2}(y_1,y_2)\,\prod_{t=3}^{T} f_{Y_t \mid Y_{t-1},Y_{t-2}}(y_t \mid y_{t-1},y_{t-2}) \\ Y_3 &= \phi_1 Y_2 + \phi_2 Y_1 + \varepsilon_3 \end{aligned} \] Hence: \[ (Y_3 \mid Y_2 = y_2, Y_1 = y_1) \sim N(\phi_1 y_2 + \phi_2 y_1,\ \sigma_e^2) \] Because \(Y_1\) and \(Y_2\) take determined values, conditionally \(Y_3\) is a constant plus the normal innovation \(\varepsilon_3\), so it is again normally distributed. The same argument repeats for every \(t\) from 3 to \(T\).
In general, we have \((Y_t \mid Y_{t-1} = y_{t-1}, Y_{t-2} = y_{t-2}) \sim N(\phi_1 y_{t-1} + \phi_2 y_{t-2},\ \sigma_e^2)\). Substituting this, together with the bivariate normal density of \((Y_1,Y_2)\) (with process variance \(\gamma_0 = k\sigma_e^2\) and lag-1 autocorrelation \(\rho_1\)), gives the following result: \[ \begin{aligned} & f_{Y_T,Y_{T-1},\dots,Y_1}(y_T,y_{T-1},\dots,y_1) = \\ & \frac{1}{2\pi k\sigma_e^2\sqrt{1-\rho_1^2}} \exp\left( -\frac{y_1^2 + y_2^2 - 2\rho_1 y_1 y_2}{2k\sigma_e^2(1-\rho_1^2)}\right) \prod_{t=3}^T \frac{1}{\sqrt{2\pi\sigma_e^2}}\exp\left( -\frac{(y_t - \phi_1 y_{t-1} - \phi_2 y_{t-2})^2}{2\sigma_e^2} \right) = \\ & \frac{1}{2\pi k\sigma_e^2\sqrt{1-\rho_1^2}} \left(\frac{1}{\sqrt{2\pi\sigma_e^2}}\right)^{T-2} \exp\left( -\frac{S(\phi_1,\phi_2)}{2\sigma_e^2} \right) \end{aligned} \] where: \[ S(\phi_1,\phi_2) = \sum_{t=3}^T (y_t - \phi_1 y_{t-1} - \phi_2 y_{t-2})^2 + \frac{y_1^2 + y_2^2 - 2\rho_1 y_1 y_2}{k(1-\rho_1^2)} \]
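Taking logarithms of the density as written above makes the role of \(S(\phi_1,\phi_2)\) explicit:
\[ \ell(\phi_1,\phi_2,\sigma_e^2) = -\frac{T}{2}\log(2\pi\sigma_e^2) - \log k - \frac{1}{2}\log(1-\rho_1^2) - \frac{S(\phi_1,\phi_2)}{2\sigma_e^2} \]
so that, apart from the dependence of \(k\) and \(\rho_1\) on the parameters, maximizing the likelihood over \((\phi_1,\phi_2)\) amounts to minimizing \(S(\phi_1,\phi_2)\).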
With n = 100, we have:
## [1] "AR2LS model coefficients:"
## (Intercept) Yt_1 Yt_2
## 0.02777017 -0.15497804 0.42960542
## [1] "arima model coefficients:"
## ar1 ar2 intercept
## -0.15654658 0.43957714 0.04439706
## [1] "For AR2LS: Standard error= 1.51698730082058"
## [1] "For arima: Standard error= 0.854337316367242"
Both models recover similar coefficient estimates, although arima attains a noticeably smaller standard error than AR2LS.
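A minimal sketch of how this comparison could be reproduced, assuming the series is again simulated with arima.sim:

set.seed(2)
Y = as.vector(arima.sim(model = list(ar = c(0.1, 0.5)), n = 100))
ls_fit = AR2LS(Y)
ml_fit = arima(Y, order = c(2, 0, 0))  # ML fit of an AR(2) with mean
ls_fit$coefficients
coef(ml_fit)
sqrt(ml_fit$sigma2)  # comparable to ls_fit$est_std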
In this task, 75% of the data was used for training and 25% for validating model performance.
## Year Month Electricity.consumption
## 1 1990 1 15249
## 2 1990 2 13211
## 3 1990 3 14548
## 4 1990 4 12803
## 5 1990 5 11845
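A minimal sketch of the split, assuming the data frame shown above is called elec (the actual object name is not shown):

elec_ts = ts(elec$Electricity.consumption, start = c(1990, 1), frequency = 12)
n_train = floor(0.75 * length(elec_ts))                     # 75% for training
trainTS = window(elec_ts, end = time(elec_ts)[n_train])
validTS = window(elec_ts, start = time(elec_ts)[n_train + 1])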
Next, we fit a linear trend model to the training data:
##
## Call:
## lm(formula = trainTS ~ time(trainTS))
##
## Residuals:
## Min 1Q Median 3Q Max
## -4008.7 -1787.5 160.5 1650.4 3559.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -125720.72 48487.82 -2.593 0.0101 *
## time(trainTS) 69.42 24.25 2.863 0.0046 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1969 on 223 degrees of freedom
## Multiple R-squared: 0.03544, Adjusted R-squared: 0.03112
## F-statistic: 8.194 on 1 and 223 DF, p-value: 0.004603
Residuals of linear model:
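These can be extracted and plotted along the following lines (linFit stands for the fitted object behind the summary above):

linFit = lm(trainTS ~ time(trainTS))
plot(resid(linFit), type = "l", ylab = "Residual")  # seasonal pattern remains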
For the model with rather few parameters (in line with the KISS principle), we consult the EACF:
## AR/MA
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13
## 0 x o o o o o o o o o o x x o
## 1 x o o o o o o o o o o x x o
## 2 x o o o o o o o o o o x x o
## 3 x x x o x o o o o o o x x o
## 4 x x x o x o o o o o o x x x
## 5 x x o x o o o o o o o x x o
## 6 x x o o x o o o o o o x o o
## 7 x x o o o o o o o o o x o o
Hence, we have the model:
model1 <- arima(trainTS, order=c(0,1,1),
seasonal=list(order=c(0,1,1), period=12))
For the model with rather many parameters (and hence a risk of overfitting), we simply increase the AR and MA orders well beyond what the EACF pattern suggests:
model2 <- arima(trainTS, order=c(3,1,5),
                seasonal=list(order=c(2,1,4), period=12))
For the KISS model:
##
## Call:
## arima(x = trainTS, order = c(0, 1, 1), seasonal = list(order = c(0, 1, 1), period = 12))
##
## Coefficients:
## ma1 sma1
## -0.4025 -0.8734
## s.e. 0.0858 0.0559
##
## sigma^2 estimated as 312039: log likelihood = -1650.49, aic = 3304.99
##
## Training set error measures:
## Warning in trainingaccuracy(f, test, d, D): test elements must be within
## sample
## ME RMSE MAE MPE MAPE
## Training set NaN NaN NaN NaN NaN
For the overfitting model:
AIC of the two models:
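The comparison could be produced with, for example:

AIC(model1, model2)  # the model with the lower AIC is preferred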
### c) Now construct lead 1 predictions for all observations in the validation data by using Arima() in the forecast library …

First, one-step-ahead predictions for all observations in the validation data:
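A minimal sketch of the standard forecast-package idiom, assuming the validation window is stored in validTS: passing an already fitted model to Arima() applies it to new data without re-estimating the parameters, and fitted() then returns the one-step-ahead predictions.

library(forecast)
refit = Arima(validTS, model = model1)  # reuse training-set parameters
lead1 = fitted(refit)                   # lead-1 predictions on the validation set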
The residuals of the model:
The ACF and QQ-Plot of residuals:
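These diagnostics can be produced along the lines of:

res = residuals(model1)
acf(res)                  # remaining autocorrelation?
qqnorm(res); qqline(res)  # approximate normality?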