The objective of this analysis and modelling is to review time series theory and experiment with R packages.
We will be following an ARIMA modeling procedure of the Mauna Loa CO2 dataset as follows:
The first thing to do is to load the first data set we will use. This data set contains observations on the concentration of carbon dioxide (CO2) in the atmosphere made at Mauna Loa from 1959 to 1998. This is an in-built data set in R so can be loaded via the data
function
library(ggfortify)
## Loading required package: ggplot2
library(tseries)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
library(forecast)
## Registered S3 methods overwritten by 'forecast':
## method from
## autoplot.Arima ggfortify
## autoplot.acf ggfortify
## autoplot.ar ggfortify
## autoplot.bats ggfortify
## autoplot.decomposed.ts ggfortify
## autoplot.ets ggfortify
## autoplot.forecast ggfortify
## autoplot.stl ggfortify
## autoplot.ts ggfortify
## fitted.ar ggfortify
## fortify.ts ggfortify
## residuals.ar ggfortify
data("co2")
To perform exploratory analysis, let’s first review the data with summary statistics and plots
# Take a look at the entries
co2
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct
## 1959 315.42 316.31 316.50 317.56 318.13 318.00 316.39 314.65 313.68 313.18
## 1960 316.27 316.81 317.42 318.87 319.87 319.43 318.01 315.74 314.00 313.68
## 1961 316.73 317.54 318.38 319.31 320.42 319.61 318.42 316.63 314.83 315.16
## 1962 317.78 318.40 319.53 320.42 320.85 320.45 319.45 317.25 316.11 315.27
## 1963 318.58 318.92 319.70 321.22 322.08 321.31 319.58 317.61 316.05 315.83
## 1964 319.41 320.07 320.74 321.40 322.06 321.73 320.27 318.54 316.54 316.71
## 1965 319.27 320.28 320.73 321.97 322.00 321.71 321.05 318.71 317.66 317.14
## 1966 320.46 321.43 322.23 323.54 323.91 323.59 322.24 320.20 318.48 317.94
## 1967 322.17 322.34 322.88 324.25 324.83 323.93 322.38 320.76 319.10 319.24
## 1968 322.40 322.99 323.73 324.86 325.40 325.20 323.98 321.95 320.18 320.09
## 1969 323.83 324.26 325.47 326.50 327.21 326.54 325.72 323.50 322.22 321.62
## 1970 324.89 325.82 326.77 327.97 327.91 327.50 326.18 324.53 322.93 322.90
## 1971 326.01 326.51 327.01 327.62 328.76 328.40 327.20 325.27 323.20 323.40
## 1972 326.60 327.47 327.58 329.56 329.90 328.92 327.88 326.16 324.68 325.04
## 1973 328.37 329.40 330.14 331.33 332.31 331.90 330.70 329.15 327.35 327.02
## 1974 329.18 330.55 331.32 332.48 332.92 332.08 331.01 329.23 327.27 327.21
## 1975 330.23 331.25 331.87 333.14 333.80 333.43 331.73 329.90 328.40 328.17
## 1976 331.58 332.39 333.33 334.41 334.71 334.17 332.89 330.77 329.14 328.78
## 1977 332.75 333.24 334.53 335.90 336.57 336.10 334.76 332.59 331.42 330.98
## 1978 334.80 335.22 336.47 337.59 337.84 337.72 336.37 334.51 332.60 332.38
## 1979 336.05 336.59 337.79 338.71 339.30 339.12 337.56 335.92 333.75 333.70
## 1980 337.84 338.19 339.91 340.60 341.29 341.00 339.39 337.43 335.72 335.84
## 1981 339.06 340.30 341.21 342.33 342.74 342.08 340.32 338.26 336.52 336.68
## 1982 340.57 341.44 342.53 343.39 343.96 343.18 341.88 339.65 337.81 337.69
## 1983 341.20 342.35 342.93 344.77 345.58 345.14 343.81 342.21 339.69 339.82
## 1984 343.52 344.33 345.11 346.88 347.25 346.62 345.22 343.11 340.90 341.18
## 1985 344.79 345.82 347.25 348.17 348.74 348.07 346.38 344.51 342.92 342.62
## 1986 346.11 346.78 347.68 349.37 350.03 349.37 347.76 345.73 344.68 343.99
## 1987 347.84 348.29 349.23 350.80 351.66 351.07 349.33 347.92 346.27 346.18
## 1988 350.25 351.54 352.05 353.41 354.04 353.62 352.22 350.27 348.55 348.72
## 1989 352.60 352.92 353.53 355.26 355.52 354.97 353.75 351.52 349.64 349.83
## 1990 353.50 354.55 355.23 356.04 357.00 356.07 354.67 352.76 350.82 351.04
## 1991 354.59 355.63 357.03 358.48 359.22 358.12 356.06 353.92 352.05 352.11
## 1992 355.88 356.63 357.72 359.07 359.58 359.17 356.94 354.92 352.94 353.23
## 1993 356.63 357.10 358.32 359.41 360.23 359.55 357.53 355.48 353.67 353.95
## 1994 358.34 358.89 359.95 361.25 361.67 360.94 359.55 357.49 355.84 356.00
## 1995 359.98 361.03 361.66 363.48 363.82 363.30 361.94 359.50 358.11 357.80
## 1996 362.09 363.29 364.06 364.76 365.45 365.01 363.70 361.54 359.51 359.65
## 1997 363.23 364.06 364.61 366.40 366.84 365.68 364.52 362.57 360.24 360.83
## Nov Dec
## 1959 314.66 315.43
## 1960 314.84 316.03
## 1961 315.94 316.85
## 1962 316.53 317.53
## 1963 316.91 318.20
## 1964 317.53 318.55
## 1965 318.70 319.25
## 1966 319.63 320.87
## 1967 320.56 321.80
## 1968 321.16 322.74
## 1969 322.69 323.95
## 1970 323.85 324.96
## 1971 324.63 325.85
## 1972 326.34 327.39
## 1973 327.99 328.48
## 1974 328.29 329.41
## 1975 329.32 330.59
## 1976 330.14 331.52
## 1977 332.24 333.68
## 1978 333.75 334.78
## 1979 335.12 336.56
## 1980 336.93 338.04
## 1981 338.19 339.44
## 1982 339.09 340.32
## 1983 340.98 342.82
## 1984 342.80 344.04
## 1985 344.06 345.38
## 1986 345.48 346.72
## 1987 347.64 348.78
## 1988 349.91 351.18
## 1989 351.14 352.37
## 1990 352.69 354.07
## 1991 353.64 354.89
## 1992 354.09 355.33
## 1993 355.30 356.78
## 1994 357.59 359.05
## 1995 359.61 360.74
## 1996 360.80 362.38
## 1997 362.49 364.34
# Start date, and date and frequency
tsp(co2)
## [1] 1959.000 1997.917 12.000
# Check for missing values
sum(is.na(co2))
## [1] 0
# Check the cycle of the time series
cycle(co2)
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1959 1 2 3 4 5 6 7 8 9 10 11 12
## 1960 1 2 3 4 5 6 7 8 9 10 11 12
## 1961 1 2 3 4 5 6 7 8 9 10 11 12
## 1962 1 2 3 4 5 6 7 8 9 10 11 12
## 1963 1 2 3 4 5 6 7 8 9 10 11 12
## 1964 1 2 3 4 5 6 7 8 9 10 11 12
## 1965 1 2 3 4 5 6 7 8 9 10 11 12
## 1966 1 2 3 4 5 6 7 8 9 10 11 12
## 1967 1 2 3 4 5 6 7 8 9 10 11 12
## 1968 1 2 3 4 5 6 7 8 9 10 11 12
## 1969 1 2 3 4 5 6 7 8 9 10 11 12
## 1970 1 2 3 4 5 6 7 8 9 10 11 12
## 1971 1 2 3 4 5 6 7 8 9 10 11 12
## 1972 1 2 3 4 5 6 7 8 9 10 11 12
## 1973 1 2 3 4 5 6 7 8 9 10 11 12
## 1974 1 2 3 4 5 6 7 8 9 10 11 12
## 1975 1 2 3 4 5 6 7 8 9 10 11 12
## 1976 1 2 3 4 5 6 7 8 9 10 11 12
## 1977 1 2 3 4 5 6 7 8 9 10 11 12
## 1978 1 2 3 4 5 6 7 8 9 10 11 12
## 1979 1 2 3 4 5 6 7 8 9 10 11 12
## 1980 1 2 3 4 5 6 7 8 9 10 11 12
## 1981 1 2 3 4 5 6 7 8 9 10 11 12
## 1982 1 2 3 4 5 6 7 8 9 10 11 12
## 1983 1 2 3 4 5 6 7 8 9 10 11 12
## 1984 1 2 3 4 5 6 7 8 9 10 11 12
## 1985 1 2 3 4 5 6 7 8 9 10 11 12
## 1986 1 2 3 4 5 6 7 8 9 10 11 12
## 1987 1 2 3 4 5 6 7 8 9 10 11 12
## 1988 1 2 3 4 5 6 7 8 9 10 11 12
## 1989 1 2 3 4 5 6 7 8 9 10 11 12
## 1990 1 2 3 4 5 6 7 8 9 10 11 12
## 1991 1 2 3 4 5 6 7 8 9 10 11 12
## 1992 1 2 3 4 5 6 7 8 9 10 11 12
## 1993 1 2 3 4 5 6 7 8 9 10 11 12
## 1994 1 2 3 4 5 6 7 8 9 10 11 12
## 1995 1 2 3 4 5 6 7 8 9 10 11 12
## 1996 1 2 3 4 5 6 7 8 9 10 11 12
## 1997 1 2 3 4 5 6 7 8 9 10 11 12
# Review the table summary
summary(co2)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 313.2 323.5 335.2 337.1 350.3 366.8
# Plot ts
ylab <- expression(CO[2] ~ (ppm))
autoplot(co2) + labs(x ="Date", y = ylab, title="Mauna Loa CO2 (PPM) from 1959 to 1998")
The series exhibit strong seasonality and an obvious trend. We can use standard R functions to compute seasonal or annual averages to give gross descriptions of the series.
Let’s use the boxplot function to see any seasonal effects:
boxplot(co2~cycle(co2),xlab="Date", ylab = ylab ,main ="Monthly CO2 Boxplot from 1959 to 1998")
From these exploratory plots, we can make some initial inferences:
We can decompose the time series into trend, seasonal and error components.
The additive model is:
Y[t]=T[t]+S[t]+e[t]
where:
Classical decomposition of time series is performed using the decompose()
function
decomposeCO2 <- decompose(co2,"additive")
autoplot(decomposeCO2)
In these decomposed plots we can again see the trend and seasonality as inferred previously, but we can also observe the estimation of the random component depicted under the “remainder”.
A stationary time series has the conditions that the mean, variance and covariance are not functions of time. In order to fit arima models, the time series is required to be stationary. We will use two methods to test the stationarity.
In order to test the stationarity of the time series, let’s run the Augmented Dickey-Fuller Test using the adf.test
function.
First set the hypothesis test:
adf.test(co2)
##
## Augmented Dickey-Fuller Test
##
## data: co2
## Dickey-Fuller = -2.8299, Lag order = 7, p-value = 0.2269
## alternative hypothesis: stationary
Where the p-value is less than 5%, we strong evidence against the null hypothesis, so we reject the null hypothesis. In this case, as per the test results above, the p-value is 0.22 which is >0.05 therefore we accept the null hypothesis that the time series is non stationary.
Another way to test for stationarity is to use autocorrelation. We will use autocorrelation function acf
and partial autocorrelation function pacf
. These functions plot the correlation between a series and its lags ie previous observations with a 95% confidence interval in blue. If the autocorrelation crosses the dashed blue line, it means that specific lag is significantly correlated with current series.
autoplot(acf(co2,plot=FALSE))+ labs(title="Correlogram of CO2 from 1959 to 1998")
The maximum at lag 1 or 12 months, indicates a positive relationship with the 12 month cycle.
autoplot(pacf(co2,plot=FALSE))+ labs(title="Correlogram of CO2 from 1959 to 1998")
We know that we need to address two issues before we test stationary series. One, we need to remove unequal variances. We do this using log of the series. Two, we need to address the trend component. We do this by taking difference of the series. Now, let’s test the resultant series.
Differencing is the commonly used technique to remove non-stationarity. This differencing is called as the Integration part in AR(I)MA. Now, we have three parameters
p : AR
d : I
q : MA
Seasonality can easily be incorporated in the ARIMA model directly.
adf.test(diff(log(co2)), alternative="stationary", k=0)
##
## Augmented Dickey-Fuller Test
##
## data: diff(log(co2))
## Dickey-Fuller = -8.9109, Lag order = 0, p-value = 0.01
## alternative hypothesis: stationary
We see that the series is stationary enough to do any kind of time series modelling.
Next step is to find the right parameters to be used in the ARIMA model. We already know that the ‘d’ component is 1 as we need 1 difference to make the series stationary. We do this using the Correlation plots. Following are the ACF plots for the series :
autoplot(acf(diff(log(co2)),plot=FALSE))+ labs(title="Correlogram of CO2 from 1959 to 1998")
autoplot(pacf(diff(log(co2)),plot=FALSE))+ labs(title="Correlogram of CO2 from 1959 to 1998")
Since there is an upwards trend we will look at a linear model first for comparison. We plot co2 raw dataset with a linear model.
autoplot(co2) + geom_smooth(method="lm")+ labs(x ="Date", y = ylab, title="Mauna Loa CO2 (PPM) from 1959 to 1998")
This may not be the best model to fit as it doesn’t capture the seasonality and additive effects over time.
ARIMA stands for Auto Regression Integrated Moving Average. It is specified by three ordered parameters (p,d,q):
Due the fact that our times series exhibits seasonality, we will use actually a model called SARIMA, that is, as name suggest, a seasonality ARIMA. We write SARIMA as ARIMA(p,d,q)(P, D, Q)m:
Use the auto.arima
function to fit the best model and coefficients, given the default parameters including seasonality as TRUE.
arimaCO2 <- auto.arima(co2)
arimaCO2
## Series: co2
## ARIMA(1,1,1)(1,1,2)[12]
##
## Coefficients:
## ar1 ma1 sar1 sma1 sma2
## 0.2569 -0.5847 -0.5489 -0.2620 -0.5123
## s.e. 0.1406 0.1204 0.5879 0.5701 0.4819
##
## sigma^2 estimated as 0.08576: log likelihood=-84.39
## AIC=180.78 AICc=180.97 BIC=205.5
How we expected model parameters are lag 1 differencing (d), an autoregressive term of first lag (p) and a moving average model of order 1 (q). The second part of the ARIMA model (P,D,Q) corresponds to the seasonal component (12 indicates the number of periods per season, in this case months).
ggtsdiag(arimaCO2)
The above figure shows the ACF of the residuals for a model. The “lag” (time span between observations) is shown along the horizontal, and the autocorrelation is on the vertical. The lines indicated bounds for statistical significance. The residual plots appear to be centered around 0 as noise, with no pattern. Ljung-Box test show a p-value pretty high; the SARIMA model is a fairly good fit.
ARIMA() fits the model using maximum likelihood estimation (assuming Gaussian residual series). Now, plot the Q–Q plots, which measures the agreement of a fitted distribution with observed data:
# qqnorm is a generic function the default method of which produces a normal QQ plot of the values in y. qqline adds a line to a “theoretical”, by default normal, quantile-quantile plot which passes through the probs quantiles, by default the first and third quartiles.
par(mfrow=c(1,2))
hist(residuals(arimaCO2), main='Mauna Loa CO2 Monthly', xlab='CO2 PPM')
qqnorm(residuals(arimaCO2))
qqline(residuals(arimaCO2))
The linearity of the points suggests that the data are normally distributed with mean = 0.
Finally we can plot a forecast of the time series using the forecast
function, with a 95% confidence interval where h is the forecast horizon periods in months.
forecastCO2 <- forecast(arimaCO2, level = c(95), h = 36)
autoplot(forecastCO2)