Adapted from Chapter 1. Introductory to Time Series in R. Cowpertwait et al.

TS in R, Chapter 1 Introduction.

TS Class

Air Passengers Data

Sample time series data - the number of air passengers booking for the period 1949-1960.

AP <- AirPassengers
AP
##      Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1949 112 118 132 129 121 135 148 148 136 119 104 118
## 1950 115 126 141 135 125 149 170 170 158 133 114 140
## 1951 145 150 178 163 172 178 199 199 184 162 146 166
## 1952 171 180 193 181 183 218 230 242 209 191 172 194
## 1953 196 196 236 235 229 243 264 272 237 211 180 201
## 1954 204 188 235 227 234 264 302 293 259 229 203 229
## 1955 242 233 267 269 270 315 364 347 312 274 237 278
## 1956 284 277 317 313 318 374 413 405 355 306 271 306
## 1957 315 301 356 348 355 422 465 467 404 347 305 336
## 1958 340 318 362 348 363 435 491 505 404 359 310 337
## 1959 360 342 406 396 420 472 548 559 463 407 362 405
## 1960 417 391 419 461 472 535 622 606 508 461 390 432
cl = class(AP)
st = start(AP)
e = end(AP)
fr = frequency(AP)

Class = ts; start date = 1949, 1; end date = 1960, 12; frequency = 12

Preliminary plot

plot(AP, ylab = "Passengers (1000's)")

Reduce seasonal effect by aggregating data to the annual level.

layout(matrix(c(1,2), ncol=1)) # Layout divides the device up into as many rows and columns as there are in matrix
plot(aggregate(AP), ylab = "Aggregated annually")
boxplot(AP ~ cycle(AP))

From boxplot - most people travel from June-September

Unemployment Data

Maine.month <- read.table("Maine.dat", header=TRUE)
attach(Maine.month)
class(Maine.month)
## [1] "data.frame"
head(Maine.month)
##   unemploy
## 1      6.7
## 2      6.7
## 3      6.4
## 4      5.9
## 5      5.2
## 6      4.8

Use ts to convert to time series

Maine.month.ts <- ts(unemploy, start = c(1996, 1), freq = 12)

Calculate the annual mean

Calculate precise percentage for unemployment

Maine.Feb <- window(Maine.month.ts, start = c(1996,2), freq = TRUE)
Maine.Aug <- window(Maine.month.ts, start = c(1996,8), freq = TRUE)
Feb.ratio <- mean(Maine.Feb) / mean(Maine.month.ts)
Aug.ratio <- mean(Maine.Aug) / mean(Maine.month.ts)

February ration is 1.222529 and August ration is 0.8163732

Multiple Time Series

Three time series

##   choc beer elec
## 1 1451 96.3 1497
## 2 2037 84.4 1463
## 3 2477 91.2 1648
## 4 2785 81.9 1595
## 5 2994 80.5 1777
## 6 2681 70.4 1824

Create a time series for each column:

Plot all three:

Use intersect function and plot:

AP.elec <- ts.intersect(AP, Elec.ts)
AP <- AP.elec[,1]
Elec <- AP.elec[,2]
layout(1:3) 
  
plot(AP, main = "", ylab = "Air passengers / 1000's") 
plot(Elec, main = "", ylab = "Electricity production / MkWh")
plot(as.vector(AP), as.vector(Elec), xlab = "Air passengers / 1000's", ylab = "Electricity production / MWh")
abline(reg = lm(Elec ~ AP))

Decomposition in R

plot(stl()) - produces a single figure showing the original series xt and the decomposed series m_t, s_t, and z_t

Elec.decom <- decompose(Elec.ts, type = "additive")
plot(Elec.decom)

Elec.decom <- decompose(Elec.ts, type = "mult")
plot(Elec.decom)

Trend <- Elec.decom$trend
Seasonal <- Elec.decom$seasonal
ts.plot(cbind(Trend, Trend * Seasonal), lty = 1:2)

The multiplicative model is more appropriate than the additive model because the variance of the original series and trend increase with time.

Decompose a time series into seasonal, trend and irregular components using loess, acronym STL.

plot(stl(Elec.ts,s.window = 7, t.jump=1))

plot(stl(Elec.ts,s.window = 7))

Regression

Adapted from 1) Hyndman et al. Forecasting: Principles and Practice. 2021.https://otexts.com/fpp2/ 2) St-Amant. 2020. Time Series Forecasting in R. 2020. https://towardsdatascience.com/a-guide-to-forecasting-in-r-6b0c9638c261

Basic Plot

the male and female monthly deaths from lung diseases in the UK. Three time series giving the monthly deaths from bronchitis, emphysema and asthma in the UK, 1974–1979, both sexes (ldeaths), males (mdeaths) and females (fdeaths).Source: https://robjhyndman.com/hyndsight/forecast7-ggplot2/

Basic R plot

plot(mdeaths)

Using library forecast and function autoplot()

autoplot(mdeaths) 

Regression

TSLM

tslm() function is designed to fit linear models to time series data. the function calls lm() to do the estimation.

deaths.lm  <- tslm(mdeaths ~ trend + season)
summary(deaths.lm)
## 
## Call:
## tslm(formula = mdeaths ~ trend + season)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -422.70  -58.60   -0.11   64.29  644.03 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2257.1179    74.8463  30.157  < 2e-16 ***
## trend          -4.1060     0.9662  -4.250 7.72e-05 ***
## season2       -44.3940    97.0086  -0.458  0.64890    
## season3      -151.1214    97.0230  -1.558  0.12468    
## season4      -460.1821    97.0471  -4.742 1.38e-05 ***
## season5      -799.2429    97.0807  -8.233 2.21e-11 ***
## season6      -922.4702    97.1240  -9.498 1.71e-13 ***
## season7      -968.5310    97.1768  -9.967 2.91e-14 ***
## season8     -1063.4250    97.2393 -10.936 8.09e-16 ***
## season9     -1075.3190    97.3112 -11.050 5.34e-16 ***
## season10     -851.7131    97.3927  -8.745 3.04e-12 ***
## season11     -711.1071    97.4838  -7.295 8.57e-10 ***
## season12     -288.1679    97.5843  -2.953  0.00451 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 168 on 59 degrees of freedom
## Multiple R-squared:  0.875,  Adjusted R-squared:  0.8495 
## F-statistic: 34.41 on 12 and 59 DF,  p-value: < 2.2e-16

the trend is significant and has a negative coefficient. As well as seasons (here monthly data) is significant starting from month4.

Resampling

quarterly <- aggregate(mdeaths, nfrequency = 4)
deathsq.lm <- tslm(quarterly ~trend + season)
summary(deathsq.lm)
## 
## Call:
## tslm(formula = quarterly ~ trend + season)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -677.20 -145.69   43.33  144.09  797.43 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6588.16     175.57  37.524  < 2e-16 ***
## trend         -36.95      10.01  -3.692  0.00155 ** 
## season2     -1986.38     193.66 -10.257 3.49e-09 ***
## season3     -2911.76     194.43 -14.976 5.66e-12 ***
## season4     -1655.47     195.72  -8.458 7.25e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 335 on 19 degrees of freedom
## Multiple R-squared:  0.9325, Adjusted R-squared:  0.9183 
## F-statistic: 65.59 on 4 and 19 DF,  p-value: 7.482e-11

STL

After applying tslm with quarterly data, we obtain the summary output with much more improved Rsquared 0.9. All coefficients are significant and negative. We can also decompose aggregated data to view trend and seasonality. We will use STL. STL is a robust method for decomposing time series. STL is an acronym for “Seasonal and Trend decomposition using Loess”. Loess is a method for estimating nonlinear relationships. There is a negative trend and strong seasonality present in the splot.

We can set a window as “periodic” or the span (in lags) of for seasonal extraction, which should be odd and at least 7 (per Cleveland et al. 1990)

autoplot(stl(quarterly,"per")) 
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## ℹ The deprecated feature was likely used in the forecast package.
##   Please report the issue at <https://github.com/robjhyndman/forecast/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Forecasting

Forecast Library

forecast is a generic function from the library forecast for forecasting from time series or time series models.

Frequency is 12. We are going to switch to a different dataset to see different forecasting methods.

data=AirPassengers
training=window(data, start = c(1949,1), end = c(1955,12))
validation=window(data, start = c(1956,1))

Naive method

Naive method - the forecast for tomorrow is what we are observing today. Any forecasting method should be evaluated by being compared to a naive method.

We are using MAPE from MLmetrics library (mean absolute percentage error regression loss): https://www.rdocumentation.org/packages/MLmetrics/versions/1.1.1/topics/MAPE

naive = naive(training, h=length(validation))
nmap = MAPE(naive$mean, validation) * 100

Seasonal MAPE evaluation is 27.2741244.

Seasonal Naive method

Seasonal naive: the forecast for tomorrow is what we observed the week/month/year (depending what horizon we are working with) before.

Seasonal MAPE evaluation is 27.0468932.

Let us plot the sesonal naive forecast:

plot(data, col="blue", xlab="Year", ylab="Passengers", main="Seasonal Naive Forecast", type='l')
lines(naive$mean, col="red", lwd=2)

Notice that what happened in the last year of data is repeated as a forecast for the entire validation set.

Arima Model

ARIMA models contain three things: - AR(p): autoregressive part of the model. Means that we use p past observations from the timeseries as predictors - Differencing (d): transform timeseries into a stationary one by taking the differences between successive observations at appropriate lags d - MA(q): uses q past forecast errors as predictors

For seasonal data, we can use SARIMA.

To find the best estimated model, we will use auto.arima.

arima_optimal = auto.arima(training)
print(arima_optimal)
## Series: training 
## ARIMA(0,1,1)(1,1,0)[12] 
## 
## Coefficients:
##           ma1     sar1
##       -0.2591  -0.2603
## s.e.   0.1296   0.1179
## 
## sigma^2 = 96.77:  log likelihood = -262.5
## AIC=531   AICc=531.36   BIC=537.79

Optimal values: ARIMA(0,1,1) (1,1,0) [12] = (p,d,q) (P,D,Q) and 12 (Number of observation)

To forecast a SARIMA model, we will use sarima.for function from the astsa package.

sarima_forecast = sarima.for(training, n.ahead=length(validation),p=0,d=1,q=1,P=1,D=1,Q=0,S=12)

MAPE(sarima_forecast$pred, validation) * 100
## [1] 6.544624

Sarima model results are better than naive.s