The data I will be forecasting with comes from a Kaggle competition. It records the monthly high temperatures in Rio from 1973 - 2020. Let's load the data and a few packages, then take a look at it.
library(lubridate)
library(fpp2)
library(dplyr)
library(tidyr)
library(seasonal)
library(urca)
df <- read.csv("C:/Users/jawilliams/Desktop/Forecasting/Data/station_rio_temp.csv", stringsAsFactors = F)
head(df,1)
## YEAR JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC
## 1 1973 27.73 27.97 25.7 26.49 22.42 22.76 22.14 21.03 21.46 22.46 23.06 25.85
str(df)
## 'data.frame': 47 obs. of 13 variables:
## $ YEAR: int 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 ...
## $ JAN : num 27.7 26.7 25.3 27.5 27.1 ...
## $ FEB : num 28 27.2 26.9 26.2 28.5 ...
## $ MAR : num 25.7 26.6 26.4 25.6 26.9 ...
## $ APR : num 26.5 23.9 22.8 25 24.2 ...
## $ MAY : num 22.4 22.8 21.4 22 22.4 ...
## $ JUN : num 22.8 20.7 20.5 21.2 22.1 ...
## $ JUL : num 22.1 21.2 19.7 20.1 23.1 ...
## $ AUG : num 21 21.8 23 21.1 22.3 ...
## $ SEP : num 21.5 22.9 22.4 21.3 22.4 ...
## $ OCT : num 22.5 22.8 22.6 22.1 23.9 ...
## $ NOV : num 23.1 24.5 24.1 24.4 24.5 ...
## $ DEC : num 25.9 24.5 26.5 25.6 24.8 ...
psych::describe(df)
## vars n mean sd median trimmed mad min max range skew
## YEAR 1 47 1996.00 13.71 1996.00 1996.00 17.79 1973.00 2019.0 46.00 0.00
## JAN 2 47 110.09 274.33 27.48 27.57 1.60 23.86 999.9 976.04 2.88
## FEB 3 47 151.71 327.98 27.96 77.61 1.08 25.69 999.9 974.21 2.16
## MAR 4 47 88.69 240.50 26.57 26.68 0.65 24.80 999.9 975.10 3.46
## APR 5 47 87.34 240.86 25.09 25.26 1.29 22.73 999.9 977.17 3.46
## MAY 6 47 105.96 275.60 22.87 22.93 0.80 21.37 999.9 978.53 2.88
## JUN 7 47 84.03 241.74 21.50 21.69 1.35 19.74 999.9 980.16 3.46
## JUL 8 47 104.48 276.05 21.32 21.36 1.01 19.24 999.9 980.66 2.88
## AUG 9 47 105.16 275.85 22.07 22.07 1.22 19.95 999.9 979.95 2.88
## SEP 10 47 105.59 275.71 22.44 22.56 1.10 20.53 999.9 979.37 2.88
## OCT 11 47 86.07 241.20 23.67 23.89 1.20 21.62 999.9 978.28 3.46
## NOV 12 47 107.82 275.02 24.85 25.00 1.05 23.04 999.9 976.86 2.88
## DEC 13 47 129.84 303.45 26.17 51.36 1.04 24.54 999.9 975.36 2.47
## kurtosis se
## YEAR -1.28 2.00
## JAN 6.43 40.01
## FEB 2.73 47.84
## MAR 10.16 35.08
## APR 10.16 35.13
## MAY 6.43 40.20
## JUN 10.16 35.26
## JUL 6.43 40.27
## AUG 6.43 40.24
## SEP 6.43 40.22
## OCT 10.16 35.18
## NOV 6.43 40.12
## DEC 4.20 44.26
As currently constructed, the months are column names. I am going to need to coalesce the column names into a single character/factor column to make a proper time series. Additionally, each column has a max value of 999.90. That does not make sense as a temperature (it is almost certainly a missing-value sentinel), so I am going to replace those values with the previously observed value in the data. I think this makes more sense in a time-series analysis than a PCA-style imputation, or even filling with a statistic like the mean or median.
df <- df %>%
  pivot_longer(., -YEAR, names_to = "Month", values_to = "Temp")
# replace the 999.90 sentinel with the previous observation (last observation carried forward)
for (i in 1:nrow(df)){
  if (df$Temp[i] == 999.90){
    df$Temp[i] <- df$Temp[i-1]
  }
}
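For reference, the same replacement can be done without an explicit loop. Here is a minimal vectorized sketch using dplyr::na_if() and tidyr::fill(), assuming (as holds here) that the very first row is not a sentinel:
# equivalent vectorized version: convert the sentinel to NA,
# then carry the last observation forward down the column
df <- df %>%
  mutate(Temp = na_if(Temp, 999.90)) %>%
  fill(Temp, .direction = "down")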
head(df, 1)
## # A tibble: 1 x 3
## YEAR Month Temp
## <int> <chr> <dbl>
## 1 1973 JAN 27.7
psych::describe(df)
## vars n mean sd median trimmed mad min max range skew
## YEAR 1 564 1996.00 13.58 1996.00 1996.00 17.79 1973.00 2019.00 46.00 0.00
## Month* 2 564 NaN NA NA NaN NA Inf -Inf -Inf NA
## Temp 3 564 24.36 2.41 24.48 24.35 3.05 19.24 30.25 11.01 0.01
## kurtosis se
## YEAR -1.21 0.57
## Month* NA NA
## Temp -1.08 0.10
OK, now that our data is tidy, we can create a ts object, plot the series and its distribution, and decompose the series using a few methods.
rio_ts <- ts(df$Temp, start = c(1973,1), frequency = 12)
hist(rio_ts, col = "grey", main = "Monthly Temp Highs Rio 1973 - 2020")
autoplot(rio_ts) +
  ylab("Celsius") +
  ggtitle("Monthly Temperature High: Rio 1973 - 2020")
autoplot(decompose(rio_ts, type = 'a')) +
  ggtitle("Additive Decomposition")
autoplot(decompose(rio_ts, type = 'm')) +
  ggtitle("Multiplicative Decomposition")
So there seems to be quite a bit of variability in our series, along with a clear seasonal component. Let's take a look at a few more decompositions. First we will look at STL, an acronym for "Seasonal and Trend decomposition using Loess" (loess being a method for estimating nonlinear relationships). We can also take a look at the X11 method, whose trend-cycle estimates are available for all observations, including the end points, and whose seasonal component is allowed to vary slowly over time.
rio_ts %>%
  stl(t.window = 13, s.window = "periodic", robust = TRUE) %>%  # 13-observation trend window; seasonal pattern fixed across years
  autoplot() +
  ggtitle("STL Decomposition")
rio_ts %>%
  seas(x11 = "") %>%  # X11 via the seasonal package
  autoplot() +
  ggtitle("X11 Decomposition")
So the X11 method does a poor job of decomposing our series, which makes sense given the lack of trend. Let's plot the ACF and PACF.
Acf(rio_ts)
Pacf(rio_ts)
Both plots are strongly seasonal. Let's test whether this series has a unit root and needs to be differenced. I will use the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test to stay consistent with the method used in the text. In this test, \(H_0\) is that the data are stationary, and we look for evidence that the null hypothesis is false; consequently, rejecting \(H_0\) suggests that differencing is required.
urca::ur.kpss(rio_ts)
##
## #######################################
## # KPSS Unit Root / Cointegration Test #
## #######################################
##
## The value of the test statistic is: 0.6075
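The default print method only shows the statistic. Calling summary() on the test object also reports the critical values (0.347, 0.463, 0.574, and 0.739 at the 10%, 5%, 2.5%, and 1% levels for this test type), which we need in order to interpret it:
# summary() adds the critical values for the KPSS statistic
summary(urca::ur.kpss(rio_ts))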
In this case the test statistic (0.6075) exceeds the 5% critical value (0.463), so we reject \(H_0\): the series is not stationary. Let's see how many differences it requires.
ndiffs(rio_ts)
## [1] 1
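Because the seasonality is so strong, it is also worth checking whether a seasonal difference is suggested; forecast::nsdiffs() performs that check the same way ndiffs() does for ordinary differences:
# check whether a seasonal difference is suggested as well
nsdiffs(rio_ts)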
So in order to create a stationary series, we need to difference once. Let’s replot the ACF and PACF with the differenced series.
rio_ts %>%
  diff() %>%
  Acf()
rio_ts %>%
  diff() %>%
  Pacf()
First things first, let's split our data into a training set and a testing set so we can evaluate each model's ability to forecast. We have 47 years of monthly observations; let's use the last 10 years as the test set.
train <- window(rio_ts, end = c(2009, 12))
test <- window(rio_ts, start = c(2010, 1))
First, let's build a standard linear model of the training set against time. Given how strong the seasonality is, it is also worth including seasonal dummies.
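For context, seasonaldummy(train) returns a matrix with eleven indicator columns, January through November, leaving December as the baseline to avoid perfect collinearity; a quick peek at its structure:
# one 0/1 column per month, with December as the reference level
head(seasonaldummy(train), 3)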
library(kableExtra)
my_szns <- seasonaldummy(train)
my_lm <- tslm(train ~ trend + my_szns)
summary(my_lm)
##
## Call:
## tslm(formula = train ~ trend + my_szns)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.1841 -0.8040 -0.1310 0.6475 5.0682
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 25.3347598 0.2437037 103.957 < 2e-16 ***
## trend 0.0025763 0.0004884 5.275 2.11e-07 ***
## my_sznsJan 0.9334747 0.3066122 3.044 0.002474 **
## my_sznsFeb 1.1749525 0.3066040 3.832 0.000146 ***
## my_sznsMar 0.5631869 0.3065966 1.837 0.066913 .
## my_sznsApr -0.8558759 0.3065900 -2.792 0.005478 **
## my_sznsMay -2.7930468 0.3065842 -9.110 < 2e-16 ***
## my_sznsJun -4.0361637 0.3065791 -13.165 < 2e-16 ***
## my_sznsJul -4.4084697 0.3065748 -14.380 < 2e-16 ***
## my_sznsAug -3.7623974 0.3065713 -12.273 < 2e-16 ***
## my_sznsSep -3.4544332 0.3065686 -11.268 < 2e-16 ***
## my_sznsOct -2.3151176 0.3065667 -7.552 2.59e-13 ***
## my_sznsNov -1.1412075 0.3065655 -3.723 0.000223 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.319 on 431 degrees of freedom
## Multiple R-squared: 0.6991, Adjusted R-squared: 0.6908
## F-statistic: 83.46 on 12 and 431 DF, p-value: < 2.2e-16
lm_forecasts <- forecast(my_lm, h = length(test),
                         newdata = data.frame(my_szns = I(seasonaldummy(train, length(test)))))
accuracy(test, lm_forecasts$mean) %>%
  kableExtra::kable(caption = "LM W/ Seasonal Components Accuracy")
| | ME | RMSE | MAE | MPE | MAPE | ACF1 | Theil’s U |
|---|---|---|---|---|---|---|---|
| Test set | 0.1424249 | 0.9990911 | 0.8062072 | 0.6829374 | 3.240758 | 0.284457 | 0.9596192 |
autoplot(train) +
  autolayer(lm_forecasts, PI = FALSE, series = "Forecasts") +
  ylab("Celsius") +
  ggtitle("Linear Model Forecasts")
autoplot(test) +
  autolayer(lm_forecasts, PI = FALSE, series = "Forecasts") +
  ylab("Celsius") +
  ggtitle("Linear Model Forecasts vs Observed Values")
checkresiduals(lm_forecasts)
##
## Ljung-Box test
##
## data: Residuals from Linear regression model
## Q* = 336.65, df = 11, p-value < 2.2e-16
##
## Model df: 13. Total lags used: 24
So this model does a good job forecasting. The residuals are roughly normally distributed, the forecasts track the observed values closely, a large amount of the variation is explained by the trend and the seasonal dummies (\(R^2\) = .69), and the accuracy metrics are strong. The only downside is that some slight seasonal structure appears to remain in the residuals, which the significant Ljung-Box statistic above confirms.
ETS is an acronym for (Error, Trend, Seasonal); the label can also be read as ExponenTial Smoothing, and these models are indeed a family of exponential smoothing methods. Forecasts produced using exponential smoothing are weighted averages of past observations, with the weights decaying exponentially as the observations get older. In other words, the more recent the observation, the higher the associated weight.
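To make the decay concrete, here is a quick sketch of the weights that simple exponential smoothing assigns to the most recent observations, with the smoothing parameter chosen arbitrarily for illustration:
# the weight on the observation j steps back is alpha * (1 - alpha)^j
alpha <- 0.4  # arbitrary value, purely for illustration
round(alpha * (1 - alpha)^(0:5), 3)
## [1] 0.400 0.240 0.144 0.086 0.052 0.031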
The ETS models I will test are "ZZZ" (automatically selected), "ANN" (simple exponential smoothing with additive errors), "MNN" (simple exponential smoothing with multiplicative errors), "MAM" (multiplicative Holt-Winters' method with multiplicative errors), and "MAN" (Holt's linear method with multiplicative errors).
ets_model_types <- c("ZZZ", "ANN", "MNN", "MAM", "MAN")
# fit each ETS specification, forecast over the test horizon,
# and collect training/test accuracy metrics
ts_model_list <- lapply(setNames(ets_model_types, ets_model_types),
                        function(x){
                          model <- ets(y = train, model = x)
                          fcsts <- forecast(model, h = length(test))
                          accuracy(object = fcsts, x = test)
                        })
kable(ts_model_list[["ZZZ"]], caption = "ZZZ")
| | ME | RMSE | MAE | MPE | MAPE | MASE | ACF1 | Theil’s U |
|---|---|---|---|---|---|---|---|---|
| Training set | 0.0025934 | 1.081901 | 0.8551186 | -0.1381586 | 3.547346 | 0.6370697 | 0.0294343 | NA |
| Test set | -0.3903155 | 1.120964 | 0.8967971 | -1.8873461 | 3.728914 | 0.6681205 | 0.3604648 | 0.6906844 |
kable(ts_model_list[["ANN"]], caption = "ANN")
| | ME | RMSE | MAE | MPE | MAPE | MASE | ACF1 | Theil’s U |
|---|---|---|---|---|---|---|---|---|
| Training set | -0.0034647 | 1.547825 | 1.202462 | -0.2222352 | 5.033491 | 0.8958431 | 0.2009522 | NA |
| Test set | -1.3826717 | 2.840626 | 2.393566 | -6.6314936 | 10.189557 | 1.7832243 | 0.7625628 | 1.832319 |
kable(ts_model_list[["MNN"]], caption = "MNN")
| | ME | RMSE | MAE | MPE | MAPE | MASE | ACF1 | Theil’s U |
|---|---|---|---|---|---|---|---|---|
| Training set | -0.003202 | 1.547833 | 1.202687 | -0.2212878 | 5.034303 | 0.8960109 | 0.2009767 | NA |
| Test set | -1.382672 | 2.840626 | 2.393566 | -6.6314968 | 10.189559 | 1.7832245 | 0.7625628 | 1.832319 |
kable(ts_model_list[["MAM"]], caption = "MAM")
| | ME | RMSE | MAE | MPE | MAPE | MASE | ACF1 | Theil’s U |
|---|---|---|---|---|---|---|---|---|
| Training set | -0.0030026 | 1.092631 | 0.8627335 | -0.1510995 | 3.572496 | 0.6427429 | 0.0334654 | NA |
| Test set | -0.3828081 | 1.075562 | 0.8558415 | -1.8087635 | 3.549387 | 0.6376083 | 0.2968924 | 0.6575996 |
kable(ts_model_list[["MAN"]], caption = "MAN")
| | ME | RMSE | MAE | MPE | MAPE | MASE | ACF1 | Theil’s U |
|---|---|---|---|---|---|---|---|---|
| Training set | -0.1032384 | 1.551583 | 1.215404 | -0.6372677 | 5.09441 | 0.9054853 | 0.200792 | NA |
| Test set | -7.2230846 | 8.228701 | 7.336642 | -30.2379123 | 30.63279 | 5.4658524 | 0.877491 | 5.171838 |
Per the accuracy metrics on the test set, it appears that "MAM" performs best. I'll use this model type to forecast.
mam_ets <- ets(y = train, model = "MAM")
ets_forecasts <- forecast(mam_ets, h = length(test))
accuracy(test, ets_forecasts$mean) %>%
  kableExtra::kable(caption = "MAM W/ Seasonal Components Accuracy")
| | ME | RMSE | MAE | MPE | MAPE | ACF1 | Theil’s U |
|---|---|---|---|---|---|---|---|
| Test set | 0.3828081 | 1.075562 | 0.8558415 | 1.612513 | 3.424819 | 0.2968924 | 1.002835 |
autoplot(train) +
  autolayer(ets_forecasts, PI = FALSE, series = "Forecasts") +
  ylab("Celsius") +
  ggtitle("MAM Forecasts")
autoplot(test) +
  autolayer(ets_forecasts, PI = FALSE, series = "Forecasts") +
  ylab("Celsius") +
  ggtitle("MAM Forecasts vs Observed Values")
checkresiduals(ets_forecasts)
##
## Ljung-Box test
##
## data: Residuals from ETS(M,Ad,M)
## Q* = 31.731, df = 7, p-value = 4.556e-05
##
## Model df: 17. Total lags used: 24
This model performs about the same as the linear model.
The final model type we will train is an ARIMA model. I will train two models: one using the auto.arima() function, and one using forecast::stlf(method = "arima"). The stlf() approach forecasts STL objects by applying a non-seasonal forecasting method to the seasonally adjusted data and then re-seasonalizing using the last year of the seasonal component. We will difference the series and forecast.
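Conceptually, stlf(method = "arima") amounts to something like the following sketch (the idea, not the function's exact internals):
# decompose, model the seasonally adjusted series, then re-seasonalize
fit_stl  <- stl(train, s.window = "periodic")      # STL decomposition
adj      <- seasadj(fit_stl)                       # seasonally adjusted series
fit_arma <- auto.arima(adj, seasonal = FALSE)      # non-seasonal ARIMA
fc_adj   <- forecast(fit_arma, h = length(test))   # forecast the adjusted series
# stlf() then re-seasonalizes by adding back the last year of the seasonal component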
# differencing will shorten the test set by one observation, hence h = length(test) - 1
arima_auto <- auto.arima(diff(train))
arima_stlf <- stlf(diff(train), method = "arima", h = length(test) - 1)
accuracy(arima_auto) %>%
  kable(caption = "ARIMA Using Auto Selection")
| | ME | RMSE | MAE | MPE | MAPE | MASE | ACF1 |
|---|---|---|---|---|---|---|---|
| Training set | 0.001658 | 1.292707 | 1.021138 | NaN | Inf | 0.8020089 | 0.0090488 |
accuracy(arima_stlf) %>%
  kable(caption = "STLF ARIMA")
| | ME | RMSE | MAE | MPE | MAPE | MASE | ACF1 |
|---|---|---|---|---|---|---|---|
| Training set | 0.0347985 | 0.9359685 | 0.7372372 | NaN | Inf | 0.5790313 | -0.0017964 |
checkresiduals(arima_auto)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,0,0)(2,0,0)[12] with zero mean
## Q* = 72.903, df = 21, p-value = 1.195e-07
##
## Model df: 3. Total lags used: 24
checkresiduals(arima_stlf)
## Warning in checkresiduals(arima_stlf): The fitted degrees of freedom is based on
## the model used for the seasonally adjusted data.
##
## Ljung-Box test
##
## data: Residuals from STL + ARIMA(4,0,2) with zero mean
## Q* = 42.453, df = 18, p-value = 0.0009555
##
## Model df: 6. Total lags used: 24
Both of these models perform about the same; let's forecast with both and compare against the differenced test set.
test_diff <- diff(test)
fcst_arima <- forecast(arima_auto, h = length(test_diff))
accuracy(test_diff, fcst_arima$mean) %>%
  kable(caption = "Auto Selected ARIMA")
| | ME | RMSE | MAE | MPE | MAPE | ACF1 | Theil’s U |
|---|---|---|---|---|---|---|---|
| Test set | 0.0379671 | 1.559158 | 1.271807 | -102.5899 | 754.4673 | -0.0808615 | 5.713776 |
accuracy(test_diff, arima_stlf$mean) %>%
  kable(caption = "STLF ARIMA")
| | ME | RMSE | MAE | MPE | MAPE | ACF1 | Theil’s U |
|---|---|---|---|---|---|---|---|
| Test set | 0.0090735 | 1.222493 | 1.004666 | -79.10763 | 236.2045 | -0.3560106 | 1.671647 |
autoplot(test_diff) +
  autolayer(fcst_arima, PI = FALSE, series = "Forecasts") +
  ylab("Celsius") +
  ggtitle("ARIMA(1,0,0)(2,0,0)[12] with zero mean")
autoplot(test_diff) +
  autolayer(arima_stlf$mean, series = "Forecasts") +  # $mean is a plain ts, so no PI argument here
  ylab("Celsius") +
  ggtitle("STL + ARIMA(4,0,2) with zero mean")
checkresiduals(fcst_arima)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,0,0)(2,0,0)[12] with zero mean
## Q* = 72.903, df = 21, p-value = 1.195e-07
##
## Model df: 3. Total lags used: 24
checkresiduals(arima_stlf$residuals)
The STLF ARIMA slightly outperforms the auto-selected model.
Based on the distribution of the residuals and the evaluation statistics, it appears that the linear model performs best. (Keep in mind that the ARIMA models were evaluated on the differenced series, so their error metrics are not directly comparable to the others'.)