INTRO

The data I will be forecasting with comes from a Kaggle competition. It records the monthly high temperatures from 1973 - 2020 in Rio. Let’s load the data, and some packages and take a look at it.

library(lubridate)
library(fpp2)
library(dplyr)
library(tidyr)
library(seasonal)
library(urca)

df <- read.csv("C:/Users/jawilliams/Desktop/Forecasting/Data/station_rio_temp.csv", stringsAsFactors = F)

head(df,1)
##   YEAR   JAN   FEB  MAR   APR   MAY   JUN   JUL   AUG   SEP   OCT   NOV   DEC
## 1 1973 27.73 27.97 25.7 26.49 22.42 22.76 22.14 21.03 21.46 22.46 23.06 25.85
str(df)
## 'data.frame':    47 obs. of  13 variables:
##  $ YEAR: int  1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 ...
##  $ JAN : num  27.7 26.7 25.3 27.5 27.1 ...
##  $ FEB : num  28 27.2 26.9 26.2 28.5 ...
##  $ MAR : num  25.7 26.6 26.4 25.6 26.9 ...
##  $ APR : num  26.5 23.9 22.8 25 24.2 ...
##  $ MAY : num  22.4 22.8 21.4 22 22.4 ...
##  $ JUN : num  22.8 20.7 20.5 21.2 22.1 ...
##  $ JUL : num  22.1 21.2 19.7 20.1 23.1 ...
##  $ AUG : num  21 21.8 23 21.1 22.3 ...
##  $ SEP : num  21.5 22.9 22.4 21.3 22.4 ...
##  $ OCT : num  22.5 22.8 22.6 22.1 23.9 ...
##  $ NOV : num  23.1 24.5 24.1 24.4 24.5 ...
##  $ DEC : num  25.9 24.5 26.5 25.6 24.8 ...
psych::describe(df)
##      vars  n    mean     sd  median trimmed   mad     min    max  range skew
## YEAR    1 47 1996.00  13.71 1996.00 1996.00 17.79 1973.00 2019.0  46.00 0.00
## JAN     2 47  110.09 274.33   27.48   27.57  1.60   23.86  999.9 976.04 2.88
## FEB     3 47  151.71 327.98   27.96   77.61  1.08   25.69  999.9 974.21 2.16
## MAR     4 47   88.69 240.50   26.57   26.68  0.65   24.80  999.9 975.10 3.46
## APR     5 47   87.34 240.86   25.09   25.26  1.29   22.73  999.9 977.17 3.46
## MAY     6 47  105.96 275.60   22.87   22.93  0.80   21.37  999.9 978.53 2.88
## JUN     7 47   84.03 241.74   21.50   21.69  1.35   19.74  999.9 980.16 3.46
## JUL     8 47  104.48 276.05   21.32   21.36  1.01   19.24  999.9 980.66 2.88
## AUG     9 47  105.16 275.85   22.07   22.07  1.22   19.95  999.9 979.95 2.88
## SEP    10 47  105.59 275.71   22.44   22.56  1.10   20.53  999.9 979.37 2.88
## OCT    11 47   86.07 241.20   23.67   23.89  1.20   21.62  999.9 978.28 3.46
## NOV    12 47  107.82 275.02   24.85   25.00  1.05   23.04  999.9 976.86 2.88
## DEC    13 47  129.84 303.45   26.17   51.36  1.04   24.54  999.9 975.36 2.47
##      kurtosis    se
## YEAR    -1.28  2.00
## JAN      6.43 40.01
## FEB      2.73 47.84
## MAR     10.16 35.08
## APR     10.16 35.13
## MAY      6.43 40.20
## JUN     10.16 35.26
## JUL      6.43 40.27
## AUG      6.43 40.24
## SEP      6.43 40.22
## OCT     10.16 35.18
## NOV      6.43 40.12
## DEC      4.20 44.26

Exploratory Data Analysis

As currently constructed, the months are column names. I am going to need to coalesce the column names into a single character vector/ factor column to make an accurate time series. Additionally, each column has a max value 0f 999.90. That does not make sense, so I am going to go ahead and replace those values w/ the previously observed value in the data. I think this makes more sense in a time-series analysis instead of some sort of PCA imputation or even using a statistic like a mean or a median.

df <- df %>% 
  pivot_longer(., -YEAR, names_to = "Month", values_to = "Temp")

for (i in 1:nrow(df)){
  if(df$Temp[i]==999.90){
    df$Temp[i]=df$Temp[i-1]
  }
  
}

head(df, 1)
## # A tibble: 1 x 3
##    YEAR Month  Temp
##   <int> <chr> <dbl>
## 1  1973 JAN    27.7
psych::describe(df)
##        vars   n    mean    sd  median trimmed   mad     min     max range skew
## YEAR      1 564 1996.00 13.58 1996.00 1996.00 17.79 1973.00 2019.00 46.00 0.00
## Month*    2 564     NaN    NA      NA     NaN    NA     Inf    -Inf  -Inf   NA
## Temp      3 564   24.36  2.41   24.48   24.35  3.05   19.24   30.25 11.01 0.01
##        kurtosis   se
## YEAR      -1.21 0.57
## Month*       NA   NA
## Temp      -1.08 0.10

Ok, now that our data is tidy, we can go ahead and create a ts object and plot the series and it’s distribution. Let’s also go ahead and decompose our series using a few methods.

rio_ts <- ts(df$Temp, start = c(1973,1), frequency = 12)

hist(rio_ts, col = "grey", main = "Monthly Temp Highs Rio 1973- 2020")

autoplot(rio_ts)+
  ylab("Celcius")+
  ggtitle("Monthly Temperature High: Rio 1973 - 2020")

autoplot(decompose(rio_ts, type = 'a'))+
  ggtitle("Additive Decomposition")

autoplot(decompose(rio_ts, type = 'm'))+
  ggtitle("Multiplicative Decomposition")

So there seems to be quite a bit of variability in our series. There seems to be a seasonal component, Let’s take a look at a few more decompositions. First we will look at STL, which is an acronym for “Seasonal and Trend decomposition using Loess”, while Loess is a method for estimating nonlinear relationships. We can also take a look at the X11 method, which has the features of trend-cycle estimates are available for all observations including the end points, and the seasonal component is allowed to vary slowly over time.

rio_ts %>%
  stl(t.window=13, s.window="periodic", robust=TRUE) %>%
  autoplot() +
  ggtitle("STL Decomposition")

rio_ts %>%
  seas(x11 = "") %>%
  autoplot() +
  ggtitle("X11 Decomposition")

So the x11 method does a poor job decomposing our series, which makes sense due to the lack of trend. Let’s plot the acf and pacf plots.

Acf(rio_ts)

Pacf(rio_ts)

Very seasonal in both components. Let’s test if this series has a unit root, and needs to be differenced. I will use the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test to stay consistent w/ the method used in the text. In this test, \(Ho\) is that the data are stationary, and we look for evidence that the null hypothesis is false. Consequently, rejections of \(Ho\) suggest that differencing is required.

urca::ur.kpss(rio_ts)
## 
## ####################################### 
## # KPSS Unit Root / Cointegration Test # 
## ####################################### 
## 
## The value of the test statistic is: 0.6075

In this case, we see that the series is not stationary. Let’s see how many differences it requires.

ndiffs(rio_ts)
## [1] 1

So in order to create a stationary series, we need to difference once. Let’s replot the ACF and PACF with the differenced series.

rio_ts %>% 
  diff() %>% 
  Acf()

rio_ts %>% 
  diff() %>% 
  Pacf()

Models

First things first, lets split our data into a training set, and a testing set to evaluate the models abilities to forecast. We have 47 years of monthly observations. Lets use the last 10 years of observations as the test set.

train <- window(rio_ts, end = c(2009, 12))

test <- window(rio_ts, start = c(2010, 1))

Linear Model

First, lets just build a standard linear model of the training set against time. I think it also is worth it to include seasonal dummies due to the prevalence of seasonality.

library(kableExtra)

my_szns <- seasonaldummy(train)
my_lm <- tslm(train ~ trend + my_szns)

summary(my_lm)
## 
## Call:
## tslm(formula = train ~ trend + my_szns)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.1841 -0.8040 -0.1310  0.6475  5.0682 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 25.3347598  0.2437037 103.957  < 2e-16 ***
## trend        0.0025763  0.0004884   5.275 2.11e-07 ***
## my_sznsJan   0.9334747  0.3066122   3.044 0.002474 ** 
## my_sznsFeb   1.1749525  0.3066040   3.832 0.000146 ***
## my_sznsMar   0.5631869  0.3065966   1.837 0.066913 .  
## my_sznsApr  -0.8558759  0.3065900  -2.792 0.005478 ** 
## my_sznsMay  -2.7930468  0.3065842  -9.110  < 2e-16 ***
## my_sznsJun  -4.0361637  0.3065791 -13.165  < 2e-16 ***
## my_sznsJul  -4.4084697  0.3065748 -14.380  < 2e-16 ***
## my_sznsAug  -3.7623974  0.3065713 -12.273  < 2e-16 ***
## my_sznsSep  -3.4544332  0.3065686 -11.268  < 2e-16 ***
## my_sznsOct  -2.3151176  0.3065667  -7.552 2.59e-13 ***
## my_sznsNov  -1.1412075  0.3065655  -3.723 0.000223 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.319 on 431 degrees of freedom
## Multiple R-squared:  0.6991, Adjusted R-squared:  0.6908 
## F-statistic: 83.46 on 12 and 431 DF,  p-value: < 2.2e-16
lm_forecasts <- forecast(my_lm, h = length(test),
    data.frame(my_szns=I(seasonaldummy(train,length(test)))))

accuracy(test, lm_forecasts$mean) %>% 
  kableExtra::kable(caption = "LM W/ Seasonal Components Accuracy")
LM W/ Seasonal Components Accuracy
ME RMSE MAE MPE MAPE ACF1 Theil’s U
Test set 0.1424249 0.9990911 0.8062072 0.6829374 3.240758 0.284457 0.9596192
autoplot(train)+
  autolayer(lm_forecasts, PI = F, series = "Forecasts")+
  ylab("Celcius")+
  ggtitle("Linear Model Forecasts")

autoplot(test)+
  autolayer(lm_forecasts, PI = F, series = "Forecasts")+
  ylab("Celcius")+
  ggtitle("Linear Model Forecasts V Observed Values")

checkresiduals(lm_forecasts)

## 
##  Ljung-Box test
## 
## data:  Residuals from Linear regression model
## Q* = 336.65, df = 11, p-value < 2.2e-16
## 
## Model df: 13.   Total lags used: 24

So this model does a good job forecasting. The residuals are normally distributed, the forecasts look identical to the observed values, a large amount of the variation is explained by the trend and the seasonal dummies (\(R^2\) = .69) and the accuracy metrics perform very well. The only downsides are there appear to be some slight seasonal components remaining in the

ETS

ETS is an acronym for for (Error, Trend, Seasonal). This label can also be thought of as ExponenTial Smoothing. They are a type of exponential smoothing. Forecasts produced using exponential smoothing methods are weighted averages of past observations, with the weights decaying exponentially as the observations get older. In other words, the more recent the observation the higher the associated weight.

The ETS models I will test are “ZZZ”, which is auto selected, “ANN” is simple exponential smoothing with additive errors, “MNN” is simple exponential smoothing with multiplicative errors, “MAM” is multiplicative Holt-Winters’ method with multiplicative errors, and “MAN” Holt’s linear method with multiplicative errors.

ets_model_types <- c("ZZZ", "ANN", "MNN", "MAM", "MAN")

ts_model_list <- lapply(setNames(unique(ets_model_types), 
                                  unique(ets_model_types)),
                         
                         function(x){
                           
                           model <- ets(y = train, model = x)
                           
                           fcsts <- forecast(model, h = length(test))
                           
                           eval_metrics <-
                             accuracy(object = fcsts,
                                      x = test)
                           return(eval_metrics)
                           
                         })

kable(ts_model_list[["ZZZ"]], caption = "ZZZ")
ZZZ
ME RMSE MAE MPE MAPE MASE ACF1 Theil’s U
Training set 0.0025934 1.081901 0.8551186 -0.1381586 3.547346 0.6370697 0.0294343 NA
Test set -0.3903155 1.120964 0.8967971 -1.8873461 3.728914 0.6681205 0.3604648 0.6906844
kable(ts_model_list[["ANN"]], caption = "ANN")
ANN
ME RMSE MAE MPE MAPE MASE ACF1 Theil’s U
Training set -0.0034647 1.547825 1.202462 -0.2222352 5.033491 0.8958431 0.2009522 NA
Test set -1.3826717 2.840626 2.393566 -6.6314936 10.189557 1.7832243 0.7625628 1.832319
kable(ts_model_list[["MNN"]], caption = "MNN")
MNN
ME RMSE MAE MPE MAPE MASE ACF1 Theil’s U
Training set -0.003202 1.547833 1.202687 -0.2212878 5.034303 0.8960109 0.2009767 NA
Test set -1.382672 2.840626 2.393566 -6.6314968 10.189559 1.7832245 0.7625628 1.832319
kable(ts_model_list[["MAM"]], caption = "MAM")
MAM
ME RMSE MAE MPE MAPE MASE ACF1 Theil’s U
Training set -0.0030026 1.092631 0.8627335 -0.1510995 3.572496 0.6427429 0.0334654 NA
Test set -0.3828081 1.075562 0.8558415 -1.8087635 3.549387 0.6376083 0.2968924 0.6575996
kable(ts_model_list[["MAN"]], caption = "MAN")
MAN
ME RMSE MAE MPE MAPE MASE ACF1 Theil’s U
Training set -0.1032384 1.551583 1.215404 -0.6372677 5.09441 0.9054853 0.200792 NA
Test set -7.2230846 8.228701 7.336642 -30.2379123 30.63279 5.4658524 0.877491 5.171838

Per the accuracy metrics on the training set data, it appears that “MAM” performs best. I’ll use this model type to forecast.

mam_ets <- ets(y = train, model = "MAM")

ets_forecasts <- forecast(mam_ets, h = length(test))

accuracy(test, ets_forecasts$mean) %>% 
  kableExtra::kable(caption = "MAM W/ Seasonal Components Accuracy")
MAM W/ Seasonal Components Accuracy
ME RMSE MAE MPE MAPE ACF1 Theil’s U
Test set 0.3828081 1.075562 0.8558415 1.612513 3.424819 0.2968924 1.002835
autoplot(train)+
  autolayer(ets_forecasts, PI = F, series = "Forecasts")+
  ylab("Celcius")+
  ggtitle("MAM Forecasts")

autoplot(test)+
  autolayer(ets_forecasts, PI = F, series = "Forecasts")+
  ylab("Celcius")+
  ggtitle("MAM Forecasts V Observed Values")

checkresiduals(ets_forecasts)

## 
##  Ljung-Box test
## 
## data:  Residuals from ETS(M,Ad,M)
## Q* = 31.731, df = 7, p-value = 4.556e-05
## 
## Model df: 17.   Total lags used: 24

The model performs relatively the same as the linear model.

ARIMA

The final model type we will train is and ARIMA MOdel. I will train two models, one using the auto.arima() function and one where I will use forecast::stlf(method = "arima"). This function will forecast STL objects are obtained by applying a non-seasonal forecasting method to the seasonally adjusted data and re-seasonalizing using the last year of the seasonal component. We will difference the series and forecast.

arima_auto <- auto.arima(diff(train))

arima_stlf <- stlf(diff(train), method = "arima", h = length(test)-1)

accuracy(arima_auto) %>% 
  kable(caption = "ARIMA USing Auto Selction")
ARIMA USing Auto Selction
ME RMSE MAE MPE MAPE MASE ACF1
Training set 0.001658 1.292707 1.021138 NaN Inf 0.8020089 0.0090488
accuracy(arima_stlf) %>% 
  kable(caption = "STLF ARIMA")
STLF ARIMA
ME RMSE MAE MPE MAPE MASE ACF1
Training set 0.0347985 0.9359685 0.7372372 NaN Inf 0.5790313 -0.0017964
checkresiduals(arima_auto)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,0,0)(2,0,0)[12] with zero mean
## Q* = 72.903, df = 21, p-value = 1.195e-07
## 
## Model df: 3.   Total lags used: 24
checkresiduals(arima_stlf)
## Warning in checkresiduals(arima_stlf): The fitted degrees of freedom is based on
## the model used for the seasonally adjusted data.

## 
##  Ljung-Box test
## 
## data:  Residuals from STL +  ARIMA(4,0,2) with zero mean
## Q* = 42.453, df = 18, p-value = 0.0009555
## 
## Model df: 6.   Total lags used: 24

Both of these models perform relatively the same, lets forecast both of them and compare to the test set.

test_diff <- diff(test)

fcst_arima <- forecast(arima_auto, h = length(diff(test)))

accuracy(test_diff, fcst_arima$mean) %>% 
  kable(caption = "Auto Selected ARIMA")
Auto Selected ARIMA
ME RMSE MAE MPE MAPE ACF1 Theil’s U
Test set 0.0379671 1.559158 1.271807 -102.5899 754.4673 -0.0808615 5.713776
accuracy(test_diff, arima_stlf$mean) %>% 
  kable(caption = "STLF ARIMA")
STLF ARIMA
ME RMSE MAE MPE MAPE ACF1 Theil’s U
Test set 0.0090735 1.222493 1.004666 -79.10763 236.2045 -0.3560106 1.671647
autoplot(test_diff)+
  autolayer(fcst_arima, PI = F, series = "Forecasts")+
  ylab("Celcius")+
  ggtitle("ARIMA(1,0,0)(2,0,0)[12] with zero mean")

autoplot(test_diff)+
  autolayer(arima_stlf$mean, PI = F, series = "Forecasts")+
  ylab("Celcius")+
  ggtitle("STL +  ARIMA(4,0,2) with zero mean")

checkresiduals(fcst_arima)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,0,0)(2,0,0)[12] with zero mean
## Q* = 72.903, df = 21, p-value = 1.195e-07
## 
## Model df: 3.   Total lags used: 24
checkresiduals(arima_stlf$residuals)

The STLF ARIMA just slightly outperforms the auto selected model.

COnclusion

Based on the distribution of residuals, and the evaluation statistics it appears that the linear model performs best.