Adapted from Chapter 1 of Introductory Time Series with R, Cowpertwait and Metcalfe.
Sample time series data: monthly totals of international airline passengers (in thousands) for the period 1949-1960.
AP <- AirPassengers
AP
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1949 112 118 132 129 121 135 148 148 136 119 104 118
## 1950 115 126 141 135 125 149 170 170 158 133 114 140
## 1951 145 150 178 163 172 178 199 199 184 162 146 166
## 1952 171 180 193 181 183 218 230 242 209 191 172 194
## 1953 196 196 236 235 229 243 264 272 237 211 180 201
## 1954 204 188 235 227 234 264 302 293 259 229 203 229
## 1955 242 233 267 269 270 315 364 347 312 274 237 278
## 1956 284 277 317 313 318 374 413 405 355 306 271 306
## 1957 315 301 356 348 355 422 465 467 404 347 305 336
## 1958 340 318 362 348 363 435 491 505 404 359 310 337
## 1959 360 342 406 396 420 472 548 559 463 407 362 405
## 1960 417 391 419 461 472 535 622 606 508 461 390 432
cl <- class(AP)
st <- start(AP)
e <- end(AP)
fr <- frequency(AP)
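The stored values can be printed with cat(); a minimal sketch:
cat("Class =", cl, "; start =", st, "; end =", e, "; frequency =", fr, "\n")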
Class = ts; start date = 1949, 1; end date = 1960, 12; frequency = 12
Preliminary plot
plot(AP, ylab = "Passengers (1000's)")
Reduce seasonal effect by aggregating data to the annual level.
layout(matrix(c(1, 2), ncol = 1))  # layout() divides the device into as many rows and columns as the matrix has
plot(aggregate(AP), ylab = "Aggregated annually")
boxplot(AP ~ cycle(AP))
The boxplot shows that most people travel from June to September.
Maine.month <- read.table("Maine.dat", header=TRUE)
attach(Maine.month)
class(Maine.month)
## [1] "data.frame"
head(Maine.month)
## unemploy
## 1 6.7
## 2 6.7
## 3 6.4
## 4 5.9
## 5 5.2
## 6 4.8
Use ts() to convert the data to a time series object:
Maine.month.ts <- ts(unemploy, start = c(1996, 1), freq = 12)
Calculate the annual mean
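The code for this step is not shown; a minimal sketch in the spirit of the book, summing each year with aggregate() and dividing by 12 (the object name Maine.annual.ts is an assumption):
Maine.annual.ts <- aggregate(Maine.month.ts) / 12  # aggregate() sums within years; dividing by 12 gives annual means
plot(Maine.annual.ts, ylab = "Annual mean unemployment (%)")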
Calculate precise ratios of unemployment in particular months to the overall mean using window():
Maine.Feb <- window(Maine.month.ts, start = c(1996,2), freq = TRUE)
Maine.Aug <- window(Maine.month.ts, start = c(1996,8), freq = TRUE)
Feb.ratio <- mean(Maine.Feb) / mean(Maine.month.ts)
Aug.ratio <- mean(Maine.Aug) / mean(Maine.month.ts)
The February ratio is 1.222529 and the August ratio is 0.8163732: on average, unemployment is about 22% higher than the mean in February and about 18% lower in August.
Three time series: monthly chocolate (choc), beer (beer), and electricity (elec) production.
## choc beer elec
## 1 1451 96.3 1497
## 2 2037 84.4 1463
## 3 2477 91.2 1648
## 4 2785 81.9 1595
## 5 2994 80.5 1777
## 6 2681 70.4 1824
Create a time series for each column, then plot all three:
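The corresponding code is not shown; a minimal sketch assuming the columns live in a data frame CBE read from a file cbe.dat, with monthly data starting in 1958 (the file name, start year, and the names CBE, Beer.ts, and Choc.ts are assumptions; Elec.ts is used below):
CBE <- read.table("cbe.dat", header = TRUE)
Elec.ts <- ts(CBE[, 3], start = 1958, freq = 12)  # electricity production
Beer.ts <- ts(CBE[, 2], start = 1958, freq = 12)  # beer production
Choc.ts <- ts(CBE[, 1], start = 1958, freq = 12)  # chocolate production
plot(cbind(Elec.ts, Beer.ts, Choc.ts))  # all three series in one figure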
Use the ts.intersect() function to restrict both series to their common period, then plot:
AP.elec <- ts.intersect(AP, Elec.ts)
AP <- AP.elec[,1]
Elec <- AP.elec[,2]
layout(1:3)
plot(AP, main = "", ylab = "Air passengers / 1000's")
plot(Elec, main = "", ylab = "Electricity production / MkWh")
plot(as.vector(AP), as.vector(Elec), xlab = "Air passengers / 1000's", ylab = "Electricity production / MkWh")
abline(reg = lm(Elec ~ AP))
Nesting decompose() or stl() within plot(), e.g. plot(decompose(x)), produces a single figure showing the original series x_t and the decomposed series m_t (trend), s_t (seasonal), and z_t (irregular).
Elec.decom <- decompose(Elec.ts, type = "additive")
plot(Elec.decom)
Elec.decom <- decompose(Elec.ts, type = "mult")
plot(Elec.decom)
Trend <- Elec.decom$trend
Seasonal <- Elec.decom$seasonal
ts.plot(cbind(Trend, Trend * Seasonal), lty = 1:2)
The multiplicative model is more appropriate than the additive model because the variance of the original series and trend increase with time.
stl() decomposes a time series into seasonal, trend, and irregular components using loess; hence the acronym STL, "Seasonal and Trend decomposition using Loess".
plot(stl(Elec.ts, s.window = 7, t.jump = 1))
plot(stl(Elec.ts, s.window = 7))
Adapted from: 1) Hyndman, R.J. and Athanasopoulos, G. Forecasting: Principles and Practice. 2021. https://otexts.com/fpp2/ 2) St-Amant, P. Time Series Forecasting in R. 2020. https://towardsdatascience.com/a-guide-to-forecasting-in-r-6b0c9638c261
Monthly deaths from lung diseases in the UK: three time series giving the monthly deaths from bronchitis, emphysema, and asthma in the UK, 1974-1979, for both sexes (ldeaths), males only (mdeaths), and females only (fdeaths). Source: https://robjhyndman.com/hyndsight/forecast7-ggplot2/
Basic R plot
plot(mdeaths)
Using the autoplot() function from the forecast library (loaded with library(forecast)):
autoplot(mdeaths)
The tslm() function is designed to fit linear models to time series data; internally it calls lm() to do the estimation.
deaths.lm <- tslm(mdeaths ~ trend + season)
summary(deaths.lm)
##
## Call:
## tslm(formula = mdeaths ~ trend + season)
##
## Residuals:
## Min 1Q Median 3Q Max
## -422.70 -58.60 -0.11 64.29 644.03
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2257.1179 74.8463 30.157 < 2e-16 ***
## trend -4.1060 0.9662 -4.250 7.72e-05 ***
## season2 -44.3940 97.0086 -0.458 0.64890
## season3 -151.1214 97.0230 -1.558 0.12468
## season4 -460.1821 97.0471 -4.742 1.38e-05 ***
## season5 -799.2429 97.0807 -8.233 2.21e-11 ***
## season6 -922.4702 97.1240 -9.498 1.71e-13 ***
## season7 -968.5310 97.1768 -9.967 2.91e-14 ***
## season8 -1063.4250 97.2393 -10.936 8.09e-16 ***
## season9 -1075.3190 97.3112 -11.050 5.34e-16 ***
## season10 -851.7131 97.3927 -8.745 3.04e-12 ***
## season11 -711.1071 97.4838 -7.295 8.57e-10 ***
## season12 -288.1679 97.5843 -2.953 0.00451 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 168 on 59 degrees of freedom
## Multiple R-squared: 0.875, Adjusted R-squared: 0.8495
## F-statistic: 34.41 on 12 and 59 DF, p-value: < 2.2e-16
The trend is significant and has a negative coefficient. The seasonal terms (monthly dummies here) become significant from season4 onwards.
quarterly <- aggregate(mdeaths, nfrequency = 4)
deathsq.lm <- tslm(quarterly ~ trend + season)
summary(deathsq.lm)
##
## Call:
## tslm(formula = quarterly ~ trend + season)
##
## Residuals:
## Min 1Q Median 3Q Max
## -677.20 -145.69 43.33 144.09 797.43
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6588.16 175.57 37.524 < 2e-16 ***
## trend -36.95 10.01 -3.692 0.00155 **
## season2 -1986.38 193.66 -10.257 3.49e-09 ***
## season3 -2911.76 194.43 -14.976 5.66e-12 ***
## season4 -1655.47 195.72 -8.458 7.25e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 335 on 19 degrees of freedom
## Multiple R-squared: 0.9325, Adjusted R-squared: 0.9183
## F-statistic: 65.59 on 4 and 19 DF, p-value: 7.482e-11
After applying tslm() to the quarterly data, we obtain a summary output with a much improved R-squared of 0.93. All coefficients are significant and negative. We can also decompose the aggregated data to view trend and seasonality. We will use STL, a robust method for decomposing time series; STL is an acronym for "Seasonal and Trend decomposition using Loess", and loess is a method for estimating nonlinear relationships. There is a negative trend and strong seasonality present in the plot.
We can set s.window either to "periodic" or to the span (in lags) of the loess window for seasonal extraction, which should be odd and at least 7 (per Cleveland et al. 1990).
autoplot(stl(quarterly, s.window = "periodic"))
forecast() is a generic function from the forecast library for forecasting from time series or time series models.
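For example, the tslm() fit from above can be passed to forecast(); a minimal sketch (the 12-month horizon is arbitrary, and the object name deaths.fc is mine):
deaths.fc <- forecast(deaths.lm, h = 12)  # 12-month-ahead forecast from the tslm fit
autoplot(deaths.fc)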
The frequency is 12 (monthly data). We now switch back to the AirPassengers dataset to compare different forecasting methods.
data <- AirPassengers
training <- window(data, start = c(1949, 1), end = c(1955, 12))
validation <- window(data, start = c(1956, 1))
Naive method: the forecast for tomorrow is what we observe today. Any forecasting method should be evaluated by comparing it against this naive baseline.
We are using MAPE from MLmetrics library (mean absolute percentage error regression loss): https://www.rdocumentation.org/packages/MLmetrics/versions/1.1.1/topics/MAPE
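For reference, MAPE is simply the mean of |actual - predicted| / |actual|; a hand-rolled equivalent of MLmetrics::MAPE (the function name mape is mine):
mape <- function(y_pred, y_true) mean(abs((y_true - y_pred) / y_true))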
naive_model <- naive(training, h = length(validation))  # naive forecast over the validation horizon
naive_mape <- MAPE(naive_model$mean, validation) * 100
The naive MAPE is 27.2741244.
Seasonal naive: the forecast for tomorrow is what we observed the week/month/year (depending what horizon we are working with) before.
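The seasonal naive code is not shown; a minimal sketch using snaive() from the forecast package (the object name snaive_model is mine and is reused in the plot below):
snaive_model <- snaive(training, h = length(validation))  # repeat the last observed seasonal cycle
snaive_mape <- MAPE(snaive_model$mean, validation) * 100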
The seasonal naive MAPE is 27.0468932.
Let us plot the seasonal naive forecast:
plot(data, col="blue", xlab="Year", ylab="Passengers", main="Seasonal Naive Forecast", type='l')
lines(naive$mean, col="red", lwd=2)
Notice that what happened in the last year of data is repeated as a forecast for the entire validation set.
ARIMA models contain three components:
- AR(p): the autoregressive part; p past observations of the series are used as predictors.
- I(d): the differencing order; the series is made stationary by taking differences between successive observations d times.
- MA(q): the moving-average part; q past forecast errors are used as predictors.
For seasonal data, we can use SARIMA.
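The orders can also be specified manually with Arima() from the forecast package; a minimal sketch (the orders below mirror the auto.arima() selection shown next, and the object name manual_fit is mine):
manual_fit <- Arima(training, order = c(0, 1, 1), seasonal = c(1, 1, 0))  # SARIMA(0,1,1)(1,1,0); period taken from frequency(training)
summary(manual_fit)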
To find the best estimated model, we will use auto.arima() from the forecast package.
arima_optimal <- auto.arima(training)
print(arima_optimal)
## Series: training
## ARIMA(0,1,1)(1,1,0)[12]
##
## Coefficients:
## ma1 sar1
## -0.2591 -0.2603
## s.e. 0.1296 0.1179
##
## sigma^2 = 96.77: log likelihood = -262.5
## AIC=531 AICc=531.36 BIC=537.79
Optimal values: ARIMA(0,1,1)(1,1,0)[12], read as (p,d,q)(P,D,Q)[S], where S = 12 is the number of observations per seasonal cycle.
To forecast from a SARIMA model, we will use the sarima.for() function from the astsa package (loaded with library(astsa)).
sarima_forecast <- sarima.for(training, n.ahead = length(validation), p = 0, d = 1, q = 1, P = 1, D = 1, Q = 0, S = 12)
MAPE(sarima_forecast$pred, validation) * 100
## [1] 6.544624
The SARIMA model's results (MAPE of about 6.5%) are far better than the naive methods (about 27%).