1. Introduction

The objective of this analysis and modelling is to review time series theory and experiment with R packages.

We will be following an ARIMA modeling procedure of the Mauna Loa CO2 dataset as follows:

  1. Perform exploratory data analysis
  2. Decomposition of data
  3. Test the stationarity
  4. Fit a model used an automated algorithm
  5. Calculate forecasts

1.1. Load Libraries and Data

The first thing to do is to load the first data set we will use. This data set contains observations on the concentration of carbon dioxide (CO2) in the atmosphere made at Mauna Loa from 1959 to 1998. This is an in-built data set in R so can be loaded via the data function

library(ggfortify)
## Loading required package: ggplot2
library(tseries)
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
library(forecast)
## Registered S3 methods overwritten by 'forecast':
##   method                 from     
##   autoplot.Arima         ggfortify
##   autoplot.acf           ggfortify
##   autoplot.ar            ggfortify
##   autoplot.bats          ggfortify
##   autoplot.decomposed.ts ggfortify
##   autoplot.ets           ggfortify
##   autoplot.forecast      ggfortify
##   autoplot.stl           ggfortify
##   autoplot.ts            ggfortify
##   fitted.ar              ggfortify
##   fortify.ts             ggfortify
##   residuals.ar           ggfortify
data("co2")

1. Exploratory Data Analysis

To perform exploratory analysis, let’s first review the data with summary statistics and plots

# Take a look at the entries
co2
##         Jan    Feb    Mar    Apr    May    Jun    Jul    Aug    Sep    Oct
## 1959 315.42 316.31 316.50 317.56 318.13 318.00 316.39 314.65 313.68 313.18
## 1960 316.27 316.81 317.42 318.87 319.87 319.43 318.01 315.74 314.00 313.68
## 1961 316.73 317.54 318.38 319.31 320.42 319.61 318.42 316.63 314.83 315.16
## 1962 317.78 318.40 319.53 320.42 320.85 320.45 319.45 317.25 316.11 315.27
## 1963 318.58 318.92 319.70 321.22 322.08 321.31 319.58 317.61 316.05 315.83
## 1964 319.41 320.07 320.74 321.40 322.06 321.73 320.27 318.54 316.54 316.71
## 1965 319.27 320.28 320.73 321.97 322.00 321.71 321.05 318.71 317.66 317.14
## 1966 320.46 321.43 322.23 323.54 323.91 323.59 322.24 320.20 318.48 317.94
## 1967 322.17 322.34 322.88 324.25 324.83 323.93 322.38 320.76 319.10 319.24
## 1968 322.40 322.99 323.73 324.86 325.40 325.20 323.98 321.95 320.18 320.09
## 1969 323.83 324.26 325.47 326.50 327.21 326.54 325.72 323.50 322.22 321.62
## 1970 324.89 325.82 326.77 327.97 327.91 327.50 326.18 324.53 322.93 322.90
## 1971 326.01 326.51 327.01 327.62 328.76 328.40 327.20 325.27 323.20 323.40
## 1972 326.60 327.47 327.58 329.56 329.90 328.92 327.88 326.16 324.68 325.04
## 1973 328.37 329.40 330.14 331.33 332.31 331.90 330.70 329.15 327.35 327.02
## 1974 329.18 330.55 331.32 332.48 332.92 332.08 331.01 329.23 327.27 327.21
## 1975 330.23 331.25 331.87 333.14 333.80 333.43 331.73 329.90 328.40 328.17
## 1976 331.58 332.39 333.33 334.41 334.71 334.17 332.89 330.77 329.14 328.78
## 1977 332.75 333.24 334.53 335.90 336.57 336.10 334.76 332.59 331.42 330.98
## 1978 334.80 335.22 336.47 337.59 337.84 337.72 336.37 334.51 332.60 332.38
## 1979 336.05 336.59 337.79 338.71 339.30 339.12 337.56 335.92 333.75 333.70
## 1980 337.84 338.19 339.91 340.60 341.29 341.00 339.39 337.43 335.72 335.84
## 1981 339.06 340.30 341.21 342.33 342.74 342.08 340.32 338.26 336.52 336.68
## 1982 340.57 341.44 342.53 343.39 343.96 343.18 341.88 339.65 337.81 337.69
## 1983 341.20 342.35 342.93 344.77 345.58 345.14 343.81 342.21 339.69 339.82
## 1984 343.52 344.33 345.11 346.88 347.25 346.62 345.22 343.11 340.90 341.18
## 1985 344.79 345.82 347.25 348.17 348.74 348.07 346.38 344.51 342.92 342.62
## 1986 346.11 346.78 347.68 349.37 350.03 349.37 347.76 345.73 344.68 343.99
## 1987 347.84 348.29 349.23 350.80 351.66 351.07 349.33 347.92 346.27 346.18
## 1988 350.25 351.54 352.05 353.41 354.04 353.62 352.22 350.27 348.55 348.72
## 1989 352.60 352.92 353.53 355.26 355.52 354.97 353.75 351.52 349.64 349.83
## 1990 353.50 354.55 355.23 356.04 357.00 356.07 354.67 352.76 350.82 351.04
## 1991 354.59 355.63 357.03 358.48 359.22 358.12 356.06 353.92 352.05 352.11
## 1992 355.88 356.63 357.72 359.07 359.58 359.17 356.94 354.92 352.94 353.23
## 1993 356.63 357.10 358.32 359.41 360.23 359.55 357.53 355.48 353.67 353.95
## 1994 358.34 358.89 359.95 361.25 361.67 360.94 359.55 357.49 355.84 356.00
## 1995 359.98 361.03 361.66 363.48 363.82 363.30 361.94 359.50 358.11 357.80
## 1996 362.09 363.29 364.06 364.76 365.45 365.01 363.70 361.54 359.51 359.65
## 1997 363.23 364.06 364.61 366.40 366.84 365.68 364.52 362.57 360.24 360.83
##         Nov    Dec
## 1959 314.66 315.43
## 1960 314.84 316.03
## 1961 315.94 316.85
## 1962 316.53 317.53
## 1963 316.91 318.20
## 1964 317.53 318.55
## 1965 318.70 319.25
## 1966 319.63 320.87
## 1967 320.56 321.80
## 1968 321.16 322.74
## 1969 322.69 323.95
## 1970 323.85 324.96
## 1971 324.63 325.85
## 1972 326.34 327.39
## 1973 327.99 328.48
## 1974 328.29 329.41
## 1975 329.32 330.59
## 1976 330.14 331.52
## 1977 332.24 333.68
## 1978 333.75 334.78
## 1979 335.12 336.56
## 1980 336.93 338.04
## 1981 338.19 339.44
## 1982 339.09 340.32
## 1983 340.98 342.82
## 1984 342.80 344.04
## 1985 344.06 345.38
## 1986 345.48 346.72
## 1987 347.64 348.78
## 1988 349.91 351.18
## 1989 351.14 352.37
## 1990 352.69 354.07
## 1991 353.64 354.89
## 1992 354.09 355.33
## 1993 355.30 356.78
## 1994 357.59 359.05
## 1995 359.61 360.74
## 1996 360.80 362.38
## 1997 362.49 364.34
# Start date, and date and frequency
tsp(co2)
## [1] 1959.000 1997.917   12.000
# Check for missing values
sum(is.na(co2))
## [1] 0
# Check the cycle of the time series
cycle(co2)
##      Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1959   1   2   3   4   5   6   7   8   9  10  11  12
## 1960   1   2   3   4   5   6   7   8   9  10  11  12
## 1961   1   2   3   4   5   6   7   8   9  10  11  12
## 1962   1   2   3   4   5   6   7   8   9  10  11  12
## 1963   1   2   3   4   5   6   7   8   9  10  11  12
## 1964   1   2   3   4   5   6   7   8   9  10  11  12
## 1965   1   2   3   4   5   6   7   8   9  10  11  12
## 1966   1   2   3   4   5   6   7   8   9  10  11  12
## 1967   1   2   3   4   5   6   7   8   9  10  11  12
## 1968   1   2   3   4   5   6   7   8   9  10  11  12
## 1969   1   2   3   4   5   6   7   8   9  10  11  12
## 1970   1   2   3   4   5   6   7   8   9  10  11  12
## 1971   1   2   3   4   5   6   7   8   9  10  11  12
## 1972   1   2   3   4   5   6   7   8   9  10  11  12
## 1973   1   2   3   4   5   6   7   8   9  10  11  12
## 1974   1   2   3   4   5   6   7   8   9  10  11  12
## 1975   1   2   3   4   5   6   7   8   9  10  11  12
## 1976   1   2   3   4   5   6   7   8   9  10  11  12
## 1977   1   2   3   4   5   6   7   8   9  10  11  12
## 1978   1   2   3   4   5   6   7   8   9  10  11  12
## 1979   1   2   3   4   5   6   7   8   9  10  11  12
## 1980   1   2   3   4   5   6   7   8   9  10  11  12
## 1981   1   2   3   4   5   6   7   8   9  10  11  12
## 1982   1   2   3   4   5   6   7   8   9  10  11  12
## 1983   1   2   3   4   5   6   7   8   9  10  11  12
## 1984   1   2   3   4   5   6   7   8   9  10  11  12
## 1985   1   2   3   4   5   6   7   8   9  10  11  12
## 1986   1   2   3   4   5   6   7   8   9  10  11  12
## 1987   1   2   3   4   5   6   7   8   9  10  11  12
## 1988   1   2   3   4   5   6   7   8   9  10  11  12
## 1989   1   2   3   4   5   6   7   8   9  10  11  12
## 1990   1   2   3   4   5   6   7   8   9  10  11  12
## 1991   1   2   3   4   5   6   7   8   9  10  11  12
## 1992   1   2   3   4   5   6   7   8   9  10  11  12
## 1993   1   2   3   4   5   6   7   8   9  10  11  12
## 1994   1   2   3   4   5   6   7   8   9  10  11  12
## 1995   1   2   3   4   5   6   7   8   9  10  11  12
## 1996   1   2   3   4   5   6   7   8   9  10  11  12
## 1997   1   2   3   4   5   6   7   8   9  10  11  12
# Review the table summary
summary(co2)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   313.2   323.5   335.2   337.1   350.3   366.8
# Plot ts
ylab <- expression(CO[2] ~ (ppm))
autoplot(co2) + labs(x ="Date", y = ylab, title="Mauna Loa CO2 (PPM) from 1959 to 1998")

The series exhibit strong seasonality and an obvious trend. We can use standard R functions to compute seasonal or annual averages to give gross descriptions of the series.

Let’s use the boxplot function to see any seasonal effects:

boxplot(co2~cycle(co2),xlab="Date", ylab = ylab ,main ="Monthly CO2 Boxplot from 1959 to 1998")

From these exploratory plots, we can make some initial inferences:

2. Time Series Decomposition

We can decompose the time series into trend, seasonal and error components.

The additive model is:

Y[t]=T[t]+S[t]+e[t]

where:

Classical decomposition of time series is performed using the decompose() function

decomposeCO2 <- decompose(co2,"additive")
autoplot(decomposeCO2)

In these decomposed plots we can again see the trend and seasonality as inferred previously, but we can also observe the estimation of the random component depicted under the “remainder”.

3. Test Stationarity

A stationary time series has the conditions that the mean, variance and covariance are not functions of time. In order to fit arima models, the time series is required to be stationary. We will use two methods to test the stationarity.

3.1. ADF

In order to test the stationarity of the time series, let’s run the Augmented Dickey-Fuller Test using the adf.test function.

First set the hypothesis test:

  • The null hypothesis \(H_{0}\): that the time series is non stationary
  • The alternative hypothesis \(H_{A}\): that the time series is stationary
adf.test(co2)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  co2
## Dickey-Fuller = -2.8299, Lag order = 7, p-value = 0.2269
## alternative hypothesis: stationary

Where the p-value is less than 5%, we strong evidence against the null hypothesis, so we reject the null hypothesis. In this case, as per the test results above, the p-value is 0.22 which is >0.05 therefore we accept the null hypothesis that the time series is non stationary.

3.2. Autocorrelation (ACF e PACF)

Another way to test for stationarity is to use autocorrelation. We will use autocorrelation function acf and partial autocorrelation function pacf. These functions plot the correlation between a series and its lags ie previous observations with a 95% confidence interval in blue. If the autocorrelation crosses the dashed blue line, it means that specific lag is significantly correlated with current series.

autoplot(acf(co2,plot=FALSE))+ labs(title="Correlogram of CO2 from 1959 to 1998") 

The maximum at lag 1 or 12 months, indicates a positive relationship with the 12 month cycle.

autoplot(pacf(co2,plot=FALSE))+ labs(title="Correlogram of CO2 from 1959 to 1998") 

4. Remove Trend and Seasonal effect

We know that we need to address two issues before we test stationary series. One, we need to remove unequal variances. We do this using log of the series. Two, we need to address the trend component. We do this by taking difference of the series. Now, let’s test the resultant series.

Differencing is the commonly used technique to remove non-stationarity. This differencing is called as the Integration part in AR(I)MA. Now, we have three parameters

p : AR

d : I

q : MA

Seasonality can easily be incorporated in the ARIMA model directly.

adf.test(diff(log(co2)), alternative="stationary", k=0)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  diff(log(co2))
## Dickey-Fuller = -8.9109, Lag order = 0, p-value = 0.01
## alternative hypothesis: stationary

We see that the series is stationary enough to do any kind of time series modelling.

Next step is to find the right parameters to be used in the ARIMA model. We already know that the ‘d’ component is 1 as we need 1 difference to make the series stationary. We do this using the Correlation plots. Following are the ACF plots for the series :

autoplot(acf(diff(log(co2)),plot=FALSE))+ labs(title="Correlogram of CO2 from 1959 to 1998") 

autoplot(pacf(diff(log(co2)),plot=FALSE))+ labs(title="Correlogram of CO2 from 1959 to 1998") 

5. Fit Model

5.1. Linear Model

Since there is an upwards trend we will look at a linear model first for comparison. We plot co2 raw dataset with a linear model.

autoplot(co2) + geom_smooth(method="lm")+ labs(x ="Date", y = ylab, title="Mauna Loa CO2 (PPM) from 1959 to 1998") 

This may not be the best model to fit as it doesn’t capture the seasonality and additive effects over time.

5.2. ARIMA Model

ARIMA stands for Auto Regression Integrated Moving Average. It is specified by three ordered parameters (p,d,q):

  • p is the order of the autoregressive model(number of time lags)
  • d is the degree of differencing(number of times the data have had past values subtracted)
  • q is the order of moving average model.

Due the fact that our times series exhibits seasonality, we will use actually a model called SARIMA, that is, as name suggest, a seasonality ARIMA. We write SARIMA as ARIMA(p,d,q)(P, D, Q)m:

  • p — the number of autoregressive
  • d — degree of differencing
  • q — the number of moving average terms
  • m — refers to the number of periods in each season
  • (P, D, Q )— represents the (p,d,q) for the seasonal part of the time series

Use the auto.arima function to fit the best model and coefficients, given the default parameters including seasonality as TRUE.

arimaCO2 <- auto.arima(co2)
arimaCO2
## Series: co2 
## ARIMA(1,1,1)(1,1,2)[12] 
## 
## Coefficients:
##          ar1      ma1     sar1     sma1     sma2
##       0.2569  -0.5847  -0.5489  -0.2620  -0.5123
## s.e.  0.1406   0.1204   0.5879   0.5701   0.4819
## 
## sigma^2 estimated as 0.08576:  log likelihood=-84.39
## AIC=180.78   AICc=180.97   BIC=205.5

How we expected model parameters are lag 1 differencing (d), an autoregressive term of first lag (p) and a moving average model of order 1 (q). The second part of the ARIMA model (P,D,Q) corresponds to the seasonal component (12 indicates the number of periods per season, in this case months).

ggtsdiag(arimaCO2)

The above figure shows the ACF of the residuals for a model. The “lag” (time span between observations) is shown along the horizontal, and the autocorrelation is on the vertical. The lines indicated bounds for statistical significance. The residual plots appear to be centered around 0 as noise, with no pattern. Ljung-Box test show a p-value pretty high; the SARIMA model is a fairly good fit.

ARIMA() fits the model using maximum likelihood estimation (assuming Gaussian residual series). Now, plot the Q–Q plots, which measures the agreement of a fitted distribution with observed data:

# qqnorm is a generic function the default method of which produces a normal QQ plot of the values in y. qqline adds a line to a “theoretical”, by default normal, quantile-quantile plot which passes through the probs quantiles, by default the first and third quartiles.
par(mfrow=c(1,2))
hist(residuals(arimaCO2), main='Mauna Loa CO2 Monthly', xlab='CO2 PPM')
qqnorm(residuals(arimaCO2))
qqline(residuals(arimaCO2))

The linearity of the points suggests that the data are normally distributed with mean = 0.

6. Forecast

Finally we can plot a forecast of the time series using the forecast function, with a 95% confidence interval where h is the forecast horizon periods in months.

forecastCO2 <- forecast(arimaCO2, level = c(95), h = 36)
autoplot(forecastCO2)