The data I will be forecasting with comes from a Kaggle competition. It records the monthly high temperatures in Rio from 1973 - 2020. Let's load the data and a few packages, then take a look at it.
library(lubridate)
library(fpp2)
library(dplyr)
library(tidyr)
library(seasonal)
library(urca)
df <- read.csv("C:/Users/jawilliams/Desktop/Forecasting/Data/station_rio_temp.csv", stringsAsFactors = F)
head(df,1)
## YEAR JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC
## 1 1973 27.73 27.97 25.7 26.49 22.42 22.76 22.14 21.03 21.46 22.46 23.06 25.85
str(df)
## 'data.frame': 47 obs. of 13 variables:
## $ YEAR: int 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 ...
## $ JAN : num 27.7 26.7 25.3 27.5 27.1 ...
## $ FEB : num 28 27.2 26.9 26.2 28.5 ...
## $ MAR : num 25.7 26.6 26.4 25.6 26.9 ...
## $ APR : num 26.5 23.9 22.8 25 24.2 ...
## $ MAY : num 22.4 22.8 21.4 22 22.4 ...
## $ JUN : num 22.8 20.7 20.5 21.2 22.1 ...
## $ JUL : num 22.1 21.2 19.7 20.1 23.1 ...
## $ AUG : num 21 21.8 23 21.1 22.3 ...
## $ SEP : num 21.5 22.9 22.4 21.3 22.4 ...
## $ OCT : num 22.5 22.8 22.6 22.1 23.9 ...
## $ NOV : num 23.1 24.5 24.1 24.4 24.5 ...
## $ DEC : num 25.9 24.5 26.5 25.6 24.8 ...
psych::describe(df)
## vars n mean sd median trimmed mad min max range skew
## YEAR 1 47 1996.00 13.71 1996.00 1996.00 17.79 1973.00 2019.0 46.00 0.00
## JAN 2 47 110.09 274.33 27.48 27.57 1.60 23.86 999.9 976.04 2.88
## FEB 3 47 151.71 327.98 27.96 77.61 1.08 25.69 999.9 974.21 2.16
## MAR 4 47 88.69 240.50 26.57 26.68 0.65 24.80 999.9 975.10 3.46
## APR 5 47 87.34 240.86 25.09 25.26 1.29 22.73 999.9 977.17 3.46
## MAY 6 47 105.96 275.60 22.87 22.93 0.80 21.37 999.9 978.53 2.88
## JUN 7 47 84.03 241.74 21.50 21.69 1.35 19.74 999.9 980.16 3.46
## JUL 8 47 104.48 276.05 21.32 21.36 1.01 19.24 999.9 980.66 2.88
## AUG 9 47 105.16 275.85 22.07 22.07 1.22 19.95 999.9 979.95 2.88
## SEP 10 47 105.59 275.71 22.44 22.56 1.10 20.53 999.9 979.37 2.88
## OCT 11 47 86.07 241.20 23.67 23.89 1.20 21.62 999.9 978.28 3.46
## NOV 12 47 107.82 275.02 24.85 25.00 1.05 23.04 999.9 976.86 2.88
## DEC 13 47 129.84 303.45 26.17 51.36 1.04 24.54 999.9 975.36 2.47
## kurtosis se
## YEAR -1.28 2.00
## JAN 6.43 40.01
## FEB 2.73 47.84
## MAR 10.16 35.08
## APR 10.16 35.13
## MAY 6.43 40.20
## JUN 10.16 35.26
## JUL 6.43 40.27
## AUG 6.43 40.24
## SEP 6.43 40.22
## OCT 10.16 35.18
## NOV 6.43 40.12
## DEC 4.20 44.26
As currently constructed, the months are column names. I am going to need to coalesce the column names into a single character/factor column to make a proper time series. Additionally, each column has a max value of 999.90. That does not make sense as a temperature (it is almost certainly a missing-value sentinel), so I am going to replace those values with the previously observed value in the data. I think this makes more sense in a time-series analysis than a PCA-style imputation, or even filling with a statistic like the mean or median.
df <- df %>%
  pivot_longer(., -YEAR, names_to = "Month", values_to = "Temp")
# replace the 999.90 sentinel with the previous observation (last observation carried forward)
for (i in 1:nrow(df)){
  if (df$Temp[i] == 999.90){
    df$Temp[i] <- df$Temp[i-1]
  }
}
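For reference, the same replacement can be done without an explicit loop. Here is a minimal vectorized sketch using dplyr::na_if() and tidyr::fill(), assuming (as holds here) that the very first row is not a sentinel:
# equivalent vectorized version: convert the sentinel to NA,
# then carry the last observation forward down the column
df <- df %>%
  mutate(Temp = na_if(Temp, 999.90)) %>%
  fill(Temp, .direction = "down")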
head(df, 1)
## # A tibble: 1 x 3
## YEAR Month Temp
## <int> <chr> <dbl>
## 1 1973 JAN 27.7
psych::describe(df)
## vars n mean sd median trimmed mad min max range skew
## YEAR 1 564 1996.00 13.58 1996.00 1996.00 17.79 1973.00 2019.00 46.00 0.00
## Month* 2 564 NaN NA NA NaN NA Inf -Inf -Inf NA
## Temp 3 564 24.36 2.41 24.48 24.35 3.05 19.24 30.25 11.01 0.01
## kurtosis se
## YEAR -1.21 0.57
## Month* NA NA
## Temp -1.08 0.10
OK, now that our data is tidy, we can create a ts object, plot the series and its distribution, and decompose the series using a few methods.
rio_ts <- ts(df$Temp, start = c(1973,1), frequency = 12)
hist(rio_ts, col = "grey", main = "Monthly Temp Highs Rio 1973 - 2020")
autoplot(rio_ts) +
  ylab("Celsius") +
  ggtitle("Monthly Temperature High: Rio 1973 - 2020")
autoplot(decompose(rio_ts, type = 'a')) +
  ggtitle("Additive Decomposition")
autoplot(decompose(rio_ts, type = 'm')) +
  ggtitle("Multiplicative Decomposition")
So there seems to be quite a bit of variability in our series, along with a clear seasonal component. Let's take a look at a few more decompositions. First we will look at STL, an acronym for "Seasonal and Trend decomposition using Loess" (loess being a method for estimating nonlinear relationships). We can also take a look at the X11 method, whose trend-cycle estimates are available for all observations, including the end points, and whose seasonal component is allowed to vary slowly over time.
rio_ts %>%
  stl(t.window = 13, s.window = "periodic", robust = TRUE) %>%  # 13-observation trend window; seasonal pattern fixed across years
  autoplot() +
  ggtitle("STL Decomposition")
rio_ts %>%
  seas(x11 = "") %>%  # X11 via the seasonal package
  autoplot() +
  ggtitle("X11 Decomposition")
So the X11 method does a poor job of decomposing our series, which makes sense given the lack of trend. Let's plot the ACF and PACF.
Acf(rio_ts)
Pacf(rio_ts)
Both plots are strongly seasonal. Let's test whether this series has a unit root and needs to be differenced. I will use the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test to stay consistent with the method used in the text. In this test, \(H_0\) is that the data are stationary, and we look for evidence that the null hypothesis is false; consequently, rejecting \(H_0\) suggests that differencing is required.
urca::ur.kpss(rio_ts)
##
## #######################################
## # KPSS Unit Root / Cointegration Test #
## #######################################
##
## The value of the test statistic is: 0.6075
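The default print method only shows the statistic. Calling summary() on the test object also reports the critical values (0.347, 0.463, 0.574, and 0.739 at the 10%, 5%, 2.5%, and 1% levels for this test type), which we need in order to interpret it:
# summary() adds the critical values for the KPSS statistic
summary(urca::ur.kpss(rio_ts))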
In this case the test statistic (0.6075) exceeds the 5% critical value (0.463), so we reject \(H_0\): the series is not stationary. Let's see how many differences it requires.
ndiffs(rio_ts)
## [1] 1
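Because the seasonality is so strong, it is also worth checking whether a seasonal difference is suggested; forecast::nsdiffs() performs that check the same way ndiffs() does for ordinary differences:
# check whether a seasonal difference is suggested as well
nsdiffs(rio_ts)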
So in order to create a stationary series, we need to difference once. Let’s replot the ACF and PACF with the differenced series.
rio_ts %>%
  diff() %>%
  Acf()
rio_ts %>%
  diff() %>%
  Pacf()
First things first, let's split our data into a training set and a testing set so we can evaluate each model's ability to forecast. We have 47 years of monthly observations; let's use the last 10 years as the test set.
train <- window(rio_ts, end = c(2009, 12))
test <- window(rio_ts, start = c(2010, 1))
First, let's build a standard linear model of the training set against time. Given how strong the seasonality is, it is also worth including seasonal dummies.
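For context, seasonaldummy(train) returns a matrix with eleven indicator columns, January through November, leaving December as the baseline to avoid perfect collinearity; a quick peek at its structure:
# one 0/1 column per month, with December as the reference level
head(seasonaldummy(train), 3)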
library(kableExtra)
my_szns <- seasonaldummy(train)
my_lm <- tslm(train ~ trend + my_szns)
summary(my_lm)
##
## Call:
## tslm(formula = train ~ trend + my_szns)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.1841 -0.8040 -0.1310 0.6475 5.0682
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 25.3347598 0.2437037 103.957 < 2e-16 ***
## trend 0.0025763 0.0004884 5.275 2.11e-07 ***
## my_sznsJan 0.9334747 0.3066122 3.044 0.002474 **
## my_sznsFeb 1.1749525 0.3066040 3.832 0.000146 ***
## my_sznsMar 0.5631869 0.3065966 1.837 0.066913 .
## my_sznsApr -0.8558759 0.3065900 -2.792 0.005478 **
## my_sznsMay -2.7930468 0.3065842 -9.110 < 2e-16 ***
## my_sznsJun -4.0361637 0.3065791 -13.165 < 2e-16 ***
## my_sznsJul -4.4084697 0.3065748 -14.380 < 2e-16 ***
## my_sznsAug -3.7623974 0.3065713 -12.273 < 2e-16 ***
## my_sznsSep -3.4544332 0.3065686 -11.268 < 2e-16 ***
## my_sznsOct -2.3151176 0.3065667 -7.552 2.59e-13 ***
## my_sznsNov -1.1412075 0.3065655 -3.723 0.000223 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.319 on 431 degrees of freedom
## Multiple R-squared: 0.6991, Adjusted R-squared: 0.6908
## F-statistic: 83.46 on 12 and 431 DF, p-value: < 2.2e-16
lm_forecasts <- forecast(my_lm, h = length(test),
                         newdata = data.frame(my_szns = I(seasonaldummy(train, length(test)))))
accuracy(test, lm_forecasts$mean) %>%
  kableExtra::kable(caption = "LM W/ Seasonal Components Accuracy")
| | ME | RMSE | MAE | MPE | MAPE | ACF1 | Theil’s U |
|---|---|---|---|---|---|---|---|
| Test set | 0.1424249 | 0.9990911 | 0.8062072 | 0.6829374 | 3.240758 | 0.284457 | 0.9596192 |
autoplot(train) +
  autolayer(lm_forecasts, PI = FALSE, series = "Forecasts") +
  ylab("Celsius") +
  ggtitle("Linear Model Forecasts")
autoplot(test) +
  autolayer(lm_forecasts, PI = FALSE, series = "Forecasts") +
  ylab("Celsius") +
  ggtitle("Linear Model Forecasts vs Observed Values")
checkresiduals(lm_forecasts)
##
## Ljung-Box test
##
## data: Residuals from Linear regression model
## Q* = 336.65, df = 11, p-value < 2.2e-16
##
## Model df: 13. Total lags used: 24
So this model does a good job forecasting. The residuals are roughly normally distributed, the forecasts track the observed values closely, a large amount of the variation is explained by the trend and the seasonal dummies (\(R^2\) = .69), and the accuracy metrics are strong. The only downside is that some slight seasonal structure appears to remain in the residuals, which the significant Ljung-Box statistic above confirms.
ETS is an acronym for (Error, Trend, Seasonal); the label can also be read as ExponenTial Smoothing, and these models are indeed a family of exponential smoothing methods. Forecasts produced using exponential smoothing are weighted averages of past observations, with the weights decaying exponentially as the observations get older. In other words, the more recent the observation, the higher the associated weight.
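To make the decay concrete, here is a quick sketch of the weights that simple exponential smoothing assigns to the most recent observations, with the smoothing parameter chosen arbitrarily for illustration:
# the weight on the observation j steps back is alpha * (1 - alpha)^j
alpha <- 0.4  # arbitrary value, purely for illustration
round(alpha * (1 - alpha)^(0:5), 3)
## [1] 0.400 0.240 0.144 0.086 0.052 0.031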
The ETS models I will test are "ZZZ" (automatically selected), "ANN" (simple exponential smoothing with additive errors), "MNN" (simple exponential smoothing with multiplicative errors), "MAM" (multiplicative Holt-Winters' method with multiplicative errors), and "MAN" (Holt's linear method with multiplicative errors).
ets_model_types <- c("ZZZ", "ANN", "MNN", "MAM", "MAN")
# fit each ETS specification, forecast over the test horizon,
# and collect training/test accuracy metrics
ts_model_list <- lapply(setNames(ets_model_types, ets_model_types),
                        function(x){
                          model <- ets(y = train, model = x)
                          fcsts <- forecast(model, h = length(test))
                          accuracy(object = fcsts, x = test)
                        })
kable(ts_model_list[["ZZZ"]], caption = "ZZZ")
| | ME | RMSE | MAE | MPE | MAPE | MASE | ACF1 | Theil’s U |
|---|---|---|---|---|---|---|---|---|
| Training set | 0.0025934 | 1.081901 | 0.8551186 | -0.1381586 | 3.547346 | 0.6370697 | 0.0294343 | NA |
| Test set | -0.3903155 | 1.120964 | 0.8967971 | -1.8873461 | 3.728914 | 0.6681205 | 0.3604648 | 0.6906844 |
kable(ts_model_list[["ANN"]], caption = "ANN")
| | ME | RMSE | MAE | MPE | MAPE | MASE | ACF1 | Theil’s U |
|---|---|---|---|---|---|---|---|---|
| Training set | -0.0034647 | 1.547825 | 1.202462 | -0.2222352 | 5.033491 | 0.8958431 | 0.2009522 | NA |
| Test set | -1.3826717 | 2.840626 | 2.393566 | -6.6314936 | 10.189557 | 1.7832243 | 0.7625628 | 1.832319 |
kable(ts_model_list[["MNN"]], caption = "MNN")
| | ME | RMSE | MAE | MPE | MAPE | MASE | ACF1 | Theil’s U |
|---|---|---|---|---|---|---|---|---|
| Training set | -0.003202 | 1.547833 | 1.202687 | -0.2212878 | 5.034303 | 0.8960109 | 0.2009767 | NA |
| Test set | -1.382672 | 2.840626 | 2.393566 | -6.6314968 | 10.189559 | 1.7832245 | 0.7625628 | 1.832319 |
kable(ts_model_list[["MAM"]], caption = "MAM")
| | ME | RMSE | MAE | MPE | MAPE | MASE | ACF1 | Theil’s U |
|---|---|---|---|---|---|---|---|---|
| Training set | -0.0030026 | 1.092631 | 0.8627335 | -0.1510995 | 3.572496 | 0.6427429 | 0.0334654 | NA |
| Test set | -0.3828081 | 1.075562 | 0.8558415 | -1.8087635 | 3.549387 | 0.6376083 | 0.2968924 | 0.6575996 |
kable(ts_model_list[["MAN"]], caption = "MAN")
| | ME | RMSE | MAE | MPE | MAPE | MASE | ACF1 | Theil’s U |
|---|---|---|---|---|---|---|---|---|
| Training set | -0.1032384 | 1.551583 | 1.215404 | -0.6372677 | 5.09441 | 0.9054853 | 0.200792 | NA |
| Test set | -7.2230846 | 8.228701 | 7.336642 | -30.2379123 | 30.63279 | 5.4658524 | 0.877491 | 5.171838 |
Per the accuracy metrics on the test set, it appears that "MAM" performs best. I'll use this model type to forecast.
mam_ets <- ets(y = train, model = "MAM")
ets_forecasts <- forecast(mam_ets, h = length(test))
accuracy(test, ets_forecasts$mean) %>%
  kableExtra::kable(caption = "MAM W/ Seasonal Components Accuracy")
| | ME | RMSE | MAE | MPE | MAPE | ACF1 | Theil’s U |
|---|---|---|---|---|---|---|---|
| Test set | 0.3828081 | 1.075562 | 0.8558415 | 1.612513 | 3.424819 | 0.2968924 | 1.002835 |
autoplot(train) +
  autolayer(ets_forecasts, PI = FALSE, series = "Forecasts") +
  ylab("Celsius") +
  ggtitle("MAM Forecasts")
autoplot(test) +
  autolayer(ets_forecasts, PI = FALSE, series = "Forecasts") +
  ylab("Celsius") +
  ggtitle("MAM Forecasts vs Observed Values")
checkresiduals(ets_forecasts)
##
## Ljung-Box test
##
## data: Residuals from ETS(M,Ad,M)
## Q* = 31.731, df = 7, p-value = 4.556e-05
##
## Model df: 17. Total lags used: 24
This model performs about the same as the linear model.
The final model type we will train is an ARIMA model. I will train two models: one using the auto.arima() function, and one using forecast::stlf(method = "arima"). The stlf() approach forecasts STL objects by applying a non-seasonal forecasting method to the seasonally adjusted data and then re-seasonalizing using the last year of the seasonal component. We will difference the series and forecast.
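Conceptually, stlf(method = "arima") amounts to something like the following sketch (the idea, not the function's exact internals):
# decompose, model the seasonally adjusted series, then re-seasonalize
fit_stl  <- stl(train, s.window = "periodic")      # STL decomposition
adj      <- seasadj(fit_stl)                       # seasonally adjusted series
fit_arma <- auto.arima(adj, seasonal = FALSE)      # non-seasonal ARIMA
fc_adj   <- forecast(fit_arma, h = length(test))   # forecast the adjusted series
# stlf() then re-seasonalizes by adding back the last year of the seasonal component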
# differencing will shorten the test set by one observation, hence h = length(test) - 1
arima_auto <- auto.arima(diff(train))
arima_stlf <- stlf(diff(train), method = "arima", h = length(test) - 1)
accuracy(arima_auto) %>%
  kable(caption = "ARIMA Using Auto Selection")
| | ME | RMSE | MAE | MPE | MAPE | MASE | ACF1 |
|---|---|---|---|---|---|---|---|
| Training set | 0.001658 | 1.292707 | 1.021138 | NaN | Inf | 0.8020089 | 0.0090488 |
accuracy(arima_stlf) %>%
  kable(caption = "STLF ARIMA")
| | ME | RMSE | MAE | MPE | MAPE | MASE | ACF1 |
|---|---|---|---|---|---|---|---|
| Training set | 0.0347985 | 0.9359685 | 0.7372372 | NaN | Inf | 0.5790313 | -0.0017964 |
checkresiduals(arima_auto)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,0,0)(2,0,0)[12] with zero mean
## Q* = 72.903, df = 21, p-value = 1.195e-07
##
## Model df: 3. Total lags used: 24
checkresiduals(arima_stlf)
## Warning in checkresiduals(arima_stlf): The fitted degrees of freedom is based on
## the model used for the seasonally adjusted data.
##
## Ljung-Box test
##
## data: Residuals from STL + ARIMA(4,0,2) with zero mean
## Q* = 42.453, df = 18, p-value = 0.0009555
##
## Model df: 6. Total lags used: 24
Both of these models perform about the same; let's forecast with both and compare against the differenced test set.
test_diff <- diff(test)
fcst_arima <- forecast(arima_auto, h = length(test_diff))
accuracy(test_diff, fcst_arima$mean) %>%
  kable(caption = "Auto Selected ARIMA")
| | ME | RMSE | MAE | MPE | MAPE | ACF1 | Theil’s U |
|---|---|---|---|---|---|---|---|
| Test set | 0.0379671 | 1.559158 | 1.271807 | -102.5899 | 754.4673 | -0.0808615 | 5.713776 |
accuracy(test_diff, arima_stlf$mean) %>%
  kable(caption = "STLF ARIMA")
| | ME | RMSE | MAE | MPE | MAPE | ACF1 | Theil’s U |
|---|---|---|---|---|---|---|---|
| Test set | 0.0090735 | 1.222493 | 1.004666 | -79.10763 | 236.2045 | -0.3560106 | 1.671647 |
autoplot(test_diff) +
  autolayer(fcst_arima, PI = FALSE, series = "Forecasts") +
  ylab("Celsius") +
  ggtitle("ARIMA(1,0,0)(2,0,0)[12] with zero mean")
autoplot(test_diff) +
  autolayer(arima_stlf$mean, series = "Forecasts") +  # $mean is a plain ts, so no PI argument here
  ylab("Celsius") +
  ggtitle("STL + ARIMA(4,0,2) with zero mean")
checkresiduals(fcst_arima)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,0,0)(2,0,0)[12] with zero mean
## Q* = 72.903, df = 21, p-value = 1.195e-07
##
## Model df: 3. Total lags used: 24
checkresiduals(arima_stlf$residuals)
The STLF ARIMA slightly outperforms the auto-selected model.
Based on the distribution of the residuals and the evaluation statistics, it appears that the linear model performs best. (Keep in mind that the ARIMA models were evaluated on the differenced series, so their error metrics are not directly comparable to the others'.)