INTRODUCTION TO STATISTICAL DATA ANALYSIS USING R & RSTUDIO

Session 8: Time Series Analysis

EXERCISE DONE ACCORDING TO CLASS EXAMPLE

  1. Analyze and Visualize Time Series Data: Load a time series dataset and plot it.

Set the Current Work Directory

Load the Following Libraries

Import the Consumer Prices Index Data

      DATE  CPI
1 1/1/1967 34.8
2 2/1/1967 34.7
3 3/1/1967 34.7
4 4/1/1967 34.6
5 5/1/1967 34.6
6 6/1/1967 34.9

View the Data Set

Check the Structure of the Data Set

'data.frame':   676 obs. of  2 variables:
 $ DATE: chr  "1/1/1967" "2/1/1967" "3/1/1967" "4/1/1967" ...
 $ CPI : num  34.8 34.7 34.7 34.6 34.6 34.9 35 35.2 35.2 35.3 ...

Extract the Series of Interest

Convert to time series object

      Jan  Feb  Mar  Apr  May  Jun
1967 34.8 34.7 34.7 34.6 34.6 34.9

Plot the time series

Checkk Stationarity (Augmented Dickey Fuller Test)


    Augmented Dickey-Fuller Test

data:  CPI
Dickey-Fuller = -1.7508, Lag order = 8, p-value = 0.6838
alternative hypothesis: stationary

If p-value > 0.05, the data is non-stationary → Take the first difference:

Take the First Difference and test for Stationarity


    Augmented Dickey-Fuller Test

data:  CPI_diff
Dickey-Fuller = -5.1378, Lag order = 8, p-value = 0.01
alternative hypothesis: stationary

Times Series Decomposition

Plot the ACF and PACF

  • ACF cuts off → Suggests MA(q)

  • PACF cuts off → Suggests AR(p)

Fit ARIMA

Series: CPI 
ARIMA(1,2,1)(0,0,2)[12] 

Coefficients:
         ar1      ma1     sma1     sma2
      0.2005  -0.7875  -0.1435  -0.1493
s.e.  0.0643   0.0464   0.0390   0.0388

sigma^2 = 0.1455:  log likelihood = -305.59
AIC=621.19   AICc=621.28   BIC=643.75

Training set error measures:
                      ME      RMSE       MAE        MPE      MAPE       MASE
Training set 0.008997016 0.3797776 0.2660481 0.01059324 0.2197267 0.05301554
                     ACF1
Training set -0.003908124

The structural representation of the ARIMA(1,2,1)(0,0,2)[12] model for the CPI series can be written as follows:

General ARIMA Model Form An ARIMA(p,d,q)(P,D,Q)[s] model is given by:

\[ \Phi(\beta^s) \phi(\beta) (1 - \beta)^d Y_t = \Theta(\beta^s) \theta(\beta) \epsilon_t \]

where:

\(Y_t\) is the CPI series, \(\beta\)is the backshift operator, \((1-\beta)^2\) represents the second-order differencing \((d=2)\) \(\phi(\beta)\) and \(\theta(\beta)\) are the non-seasonal AR and MA components,are the seasonal AR and MA components,\(\epsilon_t\) is the white noise error term.

\[ (1 - 0.2005\beta)(1 - \beta)^2 Y_t = (1 - 0.7875\beta)(1 - 0.1435\beta^{12} - 0.1493\beta^{24}) \epsilon_t \]

The ARIMA(1,2,1)(0,0,2)[12] model for the CPI series includes autoregressive (AR), moving average (MA), and seasonal moving average (SMA) components, each contributing to the time series’ dynamics. The autoregressive coefficient \((\phi_1=0.2005)\) suggests that past values of the CPI have a weak but positive influence on the current values. However, this effect is moderated by the presence of second-order differencing, which implies that the model is capturing trends in the data rather than just short-term dependencies. The moving average coefficient \((\theta_1=-0.7875)\) is strong and negative, indicating that past forecast errors significantly impact the current CPI values. This suggests that unexpected shocks in the past tend to be corrected in the following observations.

The seasonal moving average terms \(\phi_1=-0.1435\) and \(\phi_2=-0.1493\) at lags 12 and 24 capture the seasonality present in the CPI data. These coefficients indicate that past seasonal fluctuations, occurring at yearly and biennial intervals, influence the current value, but their effects are relatively moderate compared to the non-seasonal MA component. The variance of the error term \((\sigma^2=0.1455)\) is small, suggesting that the model’s residuals have low dispersion, which is desirable for a well-fitted model. The log-likelihood value of -305.59 and information criteria (AIC = 621.19, BIC = 643.75) indicate the model’s goodness-of-fit, with lower values preferred when comparing alternative models. Overall, the coefficients reflect a CPI process driven primarily by strong short-term shocks and moderate seasonal dependencies, requiring second-order differencing to achieve stationarity.

Note!!

The auto.arima() function selects the best (p,d,q) values based on AIC. You can also manually specify an ARIMA model:

Model Diagnostic


    Ljung-Box test

data:  Residuals from ARIMA(1,2,1)(0,0,2)[12]
Q* = 18.379, df = 20, p-value = 0.5624

Model df: 4.   Total lags used: 24

The residual diagnostic plots from the ARIMA(1,2,1)(0,0,2)[12] model provide insights into the adequacy of the model fit. The top panel displays the residual time series, which appears to fluctuate around zero with no evident trend. However, there are occasional large spikes, suggesting possible outliers or periods of higher volatility.

The autocorrelation function (ACF) plot in the lower-left panel shows that most autocorrelations lie within the confidence bounds, except for a few at higher lags, which may indicate some remaining seasonality or model misspecification. Ideally, residuals should exhibit no significant autocorrelation to confirm that the model has effectively captured the underlying structure of the data.

The histogram of residuals in the lower-right panel suggests that the residuals are approximately normally distributed, though there is some deviation in the tails. A fitted normal density curve is overlaid, and while the residuals generally conform to normality, the presence of extreme values at both ends suggests potential leptokurtic behavior.

Overall, while the model appears to capture the main dynamics of the series, the presence of occasional large residuals and some autocorrelation at higher lags warrants further investigation. Alternative model specifications or transformation techniques might be necessary to improve the fit.

Forecasting

$method
[1] "ARIMA(1,2,1)(0,0,2)[12]"

$model
Series: CPI 
ARIMA(1,2,1)(0,0,2)[12] 

Coefficients:
         ar1      ma1     sma1     sma2
      0.2005  -0.7875  -0.1435  -0.1493
s.e.  0.0643   0.0464   0.0390   0.0388

sigma^2 = 0.1455:  log likelihood = -305.59
AIC=621.19   AICc=621.28   BIC=643.75

$level
[1] 80 95

$mean
          Jan      Feb      Mar      Apr      May      Jun      Jul      Aug
2023                                     319.3141 320.0558 320.9268 321.9420
2024 326.4583 327.3673 328.4571 329.4843                                    
          Sep      Oct      Nov      Dec
2023 322.7623 323.5906 324.5323 325.5392
2024                                    

$lower
              80%      95%
May 2023 318.8252 318.5664
Jun 2023 319.2095 318.7616
Jul 2023 319.7379 319.1085
Aug 2023 320.4088 319.5972
Sep 2023 320.8763 319.8778
Oct 2023 321.3397 320.1482
Nov 2023 321.9033 320.5115
Dec 2023 322.5180 320.9187
Jan 2024 323.0310 321.2166
Feb 2024 323.5198 321.4830
Mar 2024 324.1756 321.9091
Apr 2024 324.7552 322.2518

$upper
              80%      95%
May 2023 319.8030 320.0618
Jun 2023 320.9021 321.3501
Jul 2023 322.1158 322.7452
Aug 2023 323.4751 324.2867
Sep 2023 324.6484 325.6468
Oct 2023 325.8415 327.0331
Nov 2023 327.1614 328.5532
Dec 2023 328.5604 330.1597
Jan 2024 329.8856 331.7000
Feb 2024 331.2148 333.2515
Mar 2024 332.7386 335.0051
Apr 2024 334.2133 336.7167

Plot the Forecasted Values

Performance Metrics

                      ME      RMSE       MAE        MPE      MAPE       MASE
Training set 0.008997016 0.3797776 0.2660481 0.01059324 0.2197267 0.05301554
                     ACF1
Training set -0.003908124