###########################################################################################################
In this analysis we will be using different time series techniques and model structures to describe a time series and find the most appropriate model by evaluating forecasts on a test set.
For this analysis I chose to use New York City (Central Park) Daily Temperature data. I looked up New York daily weather data from the National Weather Service Forecast Office’s website https://w2.weather.gov/climate/xmacis.php?wfo=okx and transfered daily weather data (provided for each month in the site) into a spreasheet from January 2018 until March 2020. I then exported the data into a csv format to be able to load it and perform my analysis. The data includes precipitation, temperature, snowfall information from a station in Central Park and I chose to analyze and model daily Average Temperature.
rm(list = ls())
library(plyr)
library(dplyr)
library(tidyr)
library(ggplot2)
library(lubridate)
library(zoo)
library(data.table)
library(forecast)
library(fpp2)
# import data
df <- read.csv("nyc_temp.csv")
# look at loaded data
df %>% head()
## Date MaxTemp MinTemp AvgTemp DepTemp HDD CDD Precipitation NewSnow
## 1 1/1/18 19 7 13.0 -20.4 52 0 0 0
## 2 1/2/18 26 13 19.5 -13.8 45 0 0 0
## 3 1/3/18 30 16 23.0 -10.1 42 0 0 0
## 4 1/4/18 29 19 24.0 -9.0 41 0 0.76 9.8
## 5 1/5/18 19 9 14.0 -18.9 51 0 0 0
## 6 1/6/18 13 6 9.5 -23.3 55 0 0 0
## SnowDepth
## 1 T
## 2 0
## 3 T
## 4 1
## 5 7
## 6 6
# chose average temperature column
df_avg_temp <- df[4]
# look at average temperature
df_avg_temp %>% head()
## AvgTemp
## 1 13.0
## 2 19.5
## 3 23.0
## 4 24.0
## 5 14.0
## 6 9.5
Lets note here that our daily data starts on 1/1/2018 and ends on 3/31/2020
# Look at descriptive statistics
df_avg_temp %>% dim()
## [1] 821 1
df_avg_temp %>% summary()
## AvgTemp
## Min. : 9.00
## 1st Qu.:40.50
## Median :52.50
## Mean :54.35
## 3rd Qu.:70.00
## Max. :88.50
# check if there are any missing values
any(is.na(df_avg_temp))
## [1] FALSE
As we can see above: 1) We have 821 total observations (number of days in 2 years and 3 months). 2) The minimum average temperature in a day is 9 degrees F, the maximum is 88.50 degrees F and the mean is 54.35 degrees F. 3) Our data does not have any missing values that we would have to deal with.
# create ts object
ts_temp <- ts(df_avg_temp$AvgTemp, start = c(2018,1), end = c(2020,91), frequency = 365)
# plot
autoplot(ts_temp) +
ggtitle("Daily Average Temperature in NYC") +
ylab("Average Temperature (F)") +
xlab("Day")
We can make the following observations:
There is definitely seasonality (as expected). We see average temperatures’ troughs at the end/beginning of each year and peaks around summer months. We also see string daily variability.
We do not necessarily observe variation as far as seasonality is concerned. Seasonality amplitudes and daily volatility seem to be at similar levels as time increases.
There might be a small positive trend - as observed by slightly increasing levels at seasonal troughs but we can definitely not say that with certainty.
We will be spliting the daily data the following way: We will keep the first 2 full years (2018, 2019) to fit our models and the first three months of 2020 to test our models.
ts_temp_train <- window(ts_temp, end = 2020)
ts_temp_test <- window(ts_temp, start = 2020)
We will start by investigating and fitting a fairly simple model: the seasonal naive method. In this case we assume that any future forecast will be the exact same as the previous observed value from the same season of the year (eg same day of last year). This is a very simple model but since we do not expect huge deviations of average temperature from year to year (althought there definitely is volatility and possibly a trend). This technique is very robust and despite its lack of any particular complexity we expect reasonable results.
Fit seasonal naive model:
fit.snaive <- snaive(ts_temp_train, h=91)
Forecast
forecast.snaive <- fit.snaive %>% forecast()
Plot Forecasts
autoplot(forecast.snaive) +
autolayer(ts_temp_test, series="True Values") +
scale_x_continuous(limits = c(2019.6, 2020.2)) +
guides(colour = guide_legend(title = NULL))
## Scale for 'x' is already present. Adding another scale for 'x', which will
## replace the existing scale.
## Warning: Removed 584 row(s) containing missing values (geom_path).
## Warning: Removed 17 row(s) containing missing values (geom_path).
RMSE on Test Set
RMSE.snaive <- sqrt(mean(forecast.snaive$mean - ts_temp_test)^2)
RMSE.snaive
## [1] 5.788889
Check Residuals
fit.snaive %>% checkresiduals()
##
## Ljung-Box test
##
## data: Residuals from Seasonal naive method
## Q* = 608.59, df = 146, p-value < 2.2e-16
##
## Model df: 0. Total lags used: 146
As we can see above, the seasonal naive method produced a reasonable extrapolation and forecasts and we computed an out of sample RMSE of 5.788889. Residuals seem quite acceptable as they approximate a normal distribution and do not seem to be extensively autocorrelated. However, we will be definitely trying out more appropriate methods for better forecasts.
We will now try modeling our time series and producing forecasts using a slightly more complex model structure. Specifically, by using a linear regression with a trend predictor as well as fourier terms in order to appropriately model the monthly and daily variations in our data.
First we will be performing Cross Validation to see what is the optimal K parameter to for fourier components in our regression (minimizing AICc):
for (k in 1:30){
Fourier_AICc <- CV(
tslm(ts_temp_train ~ trend + fourier(ts_temp_train, K = k)))[['AICc']]
print(paste0("AICc for ", k, " terms = ", Fourier_AICc))
}
## [1] "AICc for 1 terms = 2906.41271533509"
## [1] "AICc for 2 terms = 2898.10913459815"
## [1] "AICc for 3 terms = 2897.23213518888"
## [1] "AICc for 4 terms = 2898.92703380921"
## [1] "AICc for 5 terms = 2889.97005572319"
## [1] "AICc for 6 terms = 2885.54111395005"
## [1] "AICc for 7 terms = 2883.45311260349"
## [1] "AICc for 8 terms = 2881.78714575574"
## [1] "AICc for 9 terms = 2884.4570565178"
## [1] "AICc for 10 terms = 2887.35683022475"
## [1] "AICc for 11 terms = 2889.29370719243"
## [1] "AICc for 12 terms = 2881.92366514089"
## [1] "AICc for 13 terms = 2875.44238413478"
## [1] "AICc for 14 terms = 2866.34280233341"
## [1] "AICc for 15 terms = 2869.59668612562"
## [1] "AICc for 16 terms = 2870.09516593584"
## [1] "AICc for 17 terms = 2874.40794493977"
## [1] "AICc for 18 terms = 2872.10944539411"
## [1] "AICc for 19 terms = 2873.64429308035"
## [1] "AICc for 20 terms = 2873.83195763287"
## [1] "AICc for 21 terms = 2875.44924394133"
## [1] "AICc for 22 terms = 2878.5183585661"
## [1] "AICc for 23 terms = 2874.76985014149"
## [1] "AICc for 24 terms = 2874.85680248355"
## [1] "AICc for 25 terms = 2865.24852340686"
## [1] "AICc for 26 terms = 2868.51752829822"
## [1] "AICc for 27 terms = 2870.19493848034"
## [1] "AICc for 28 terms = 2858.58685869846"
## [1] "AICc for 29 terms = 2862.60047185393"
## [1] "AICc for 30 terms = 2863.68350765109"
As we can see above, we minimize AICc by using 28 terms.
Fit Linear Regression with Fourier Terms:
fit.fourier <- tslm(ts_temp_train ~ trend + fourier(ts_temp_train, K=28))
Plot Fit to Make Sure it is Reasonable
autoplot(ts_temp_train) +
autolayer(fit.fourier$fitted.values, series="Fitted Series") +
xlab("Time (Days)") +
ylab("Average Temperature (F)") +
ggtitle("Daily Average Temperature in NYC") +
guides(colour = guide_legend(title = NULL))
Forecast
# forecast using fitted model for test set
forecast.fourier <- forecast(fit.fourier, newdata=data.frame(fourier(ts_temp_test, K = 30)), h = 91)
Plot Forecasts
autoplot(forecast.fourier) +
autolayer(ts_temp_test, series="True Values") +
scale_x_continuous(limits = c(2019.6, 2020.2)) +
guides(colour = guide_legend(title = NULL))
## Scale for 'x' is already present. Adding another scale for 'x', which will
## replace the existing scale.
## Warning: Removed 584 row(s) containing missing values (geom_path).
## Warning: Removed 17 row(s) containing missing values (geom_path).
RMSE on test set
RMSE.fourier <- sqrt(mean(forecast.fourier$mean - ts_temp_test)^2)
RMSE.fourier
## [1] 5.495157
Check Residuals
fit.fourier %>% checkresiduals()
##
## Breusch-Godfrey test for serial correlation of order up to 146
##
## data: Residuals from Linear regression model
## LM test = 335.14, df = 146, p-value < 2.2e-16
As we can see above, the linear regression model structure with the fourier terms produced reasonable forecasts and we computed an out of sample RMSE of 5.495157 - better than the seasonal naive method. Residuals seem acceptable as they approximate a normal distribution and do not seem to be extensively autocorrelated.
We will now try decomposing our series using STL and then extrapolating using a simple naive model.
Fit:
fit.stl <- ts_temp_train %>% stl(s.window="periodic", robust=TRUE)
Look at Decomposition
fit.stl %>% autoplot()
Look at Decomposition
fit.stl %>% summary()
## Call:
## stl(x = ., s.window = "periodic", robust = TRUE)
##
## Time.series components:
## seasonal trend remainder
## Min. :-39.52042 Min. :55.91210 Min. :-37.52827
## 1st Qu.:-14.99644 1st Qu.:56.31646 1st Qu.: -3.07830
## Median : -1.10506 Median :56.45133 Median : 0.01762
## Mean : -0.01845 Mean :56.38009 Mean : -0.55453
## 3rd Qu.: 15.28905 3rd Qu.:56.48436 3rd Qu.: 2.80264
## Max. : 26.33357 Max. :56.54139 Max. : 24.89563
## IQR:
## STL.seasonal STL.trend STL.remainder data
## 30.2855 0.1679 5.8809 30.7500
## % 98.5 0.5 19.1 100.0
##
## Weights:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.8242 0.9452 0.8676 0.9887 1.0000
##
## Other components: List of 5
## $ win : Named num [1:3] 7311 549 365
## $ deg : Named int [1:3] 0 1 1
## $ jump : Named num [1:3] 732 55 37
## $ inner: int 1
## $ outer: int 15
Forecast
forecast.stl <- fit.stl %>% forecast(method="naive", h=91)
Plot Forecasts
autoplot(forecast.stl) +
autolayer(ts_temp_test, series="True Values") +
scale_x_continuous(limits = c(2019.6, 2020.2)) +
guides(colour = guide_legend(title = NULL))
## Scale for 'x' is already present. Adding another scale for 'x', which will
## replace the existing scale.
## Warning: Removed 584 row(s) containing missing values (geom_path).
## Warning: Removed 17 row(s) containing missing values (geom_path).
RMSE on test set
RMSE.stl <- sqrt(mean(forecast.stl$mean - ts_temp_test)^2)
RMSE.stl
## [1] 8.067703
As we can see above, prediction intervals using this method are wider and increaseing dramatically while RMSE on the test set is 8.067703 - which is worse than any of the previous methods.
We can now try a slightly more complex time series method, specifically an STL decomposition with an ETS model fit. We will be using the stlf function to fit those together.
Fit
fit.ets= ts_temp_train %>% stlf(robust = TRUE,
method = "ets", h = 91)
Look at Fitted Model
fit.ets$model
## ETS(A,N,N)
##
## Call:
## ets(y = na.interp(x), model = etsmodel, allow.multiplicative.trend = allow.multiplicative.trend)
##
## Smoothing parameters:
## alpha = 0.5943
##
## Initial states:
## l = 33.7512
##
## sigma: 5.4886
##
## AIC AICc BIC
## 7313.819 7313.852 7327.602
As we can see above, the time series was decomposed and then an optimal ETS(A,N,N) model was fit that is a Simple Exponential Smoothing.
Plot
fit.ets %>% autoplot()
Forecast
forecast.ets <- fit.ets %>% forecast()
Plot Forecasts on Test Set
autoplot(forecast.ets) +
autolayer(ts_temp_test, series="True Values") +
scale_x_continuous(limits = c(2019.6, 2020.2)) +
guides(colour = guide_legend(title = NULL))
## Scale for 'x' is already present. Adding another scale for 'x', which will
## replace the existing scale.
## Warning: Removed 584 row(s) containing missing values (geom_path).
## Warning: Removed 17 row(s) containing missing values (geom_path).
RMSE on test set
RMSE.ets <- sqrt(mean(forecast.ets$mean - ts_temp_test)^2)
RMSE.ets
## [1] 6.036763
Check residuals
fit.ets %>% checkresiduals()
## Warning in checkresiduals(.): The fitted degrees of freedom is based on the
## model used for the seasonally adjusted data.
##
## Ljung-Box test
##
## data: Residuals from STL + ETS(A,N,N)
## Q* = 301.91, df = 144, p-value = 3.03e-13
##
## Model df: 2. Total lags used: 146
As we can see above, the STL decomposition and Simple Exponential Smoothing (ETS(A,N,N)) model structure with the fourier terms produced reasonable forecasts and fit the data well. We computed an out of sample RMSE of 6.036763 - better than the previous STL and Naive method but worse than the linear model with fourier terms and the simple seasonal naive method. Residuals seem acceptable as they approximate a normal distribution and do not seem to be extensively autocorrelated.
We will now try an STL Decomposition with ARIMA model structure.
Fit
fit.arima <- ts_temp_train %>% stlf(method = "arima", h = 91)
Model Summary
fit.arima$model
## Series: x
## ARIMA(1,0,3) with non-zero mean
##
## Coefficients:
## ar1 ma1 ma2 ma3 mean
## 0.7700 0.0576 -0.1887 -0.1986 55.7952
## s.e. 0.1173 0.1265 0.1077 0.0649 0.3988
##
## sigma^2 estimated as 13.83: log likelihood=-1995.34
## AIC=4002.68 AICc=4002.8 BIC=4030.25
As we can see, after decomposition an optimal ARIMA(1,0,3) model was fitted using 1 autoregressive term, 3 moving average terms and no differecing.
Plot
fit.arima %>% autoplot()
Forecast
forecast.arima <- fit.arima %>% forecast()
Plot Forecasts on Test Set
autoplot(forecast.arima) +
autolayer(ts_temp_test, series="True Values") +
scale_x_continuous(limits = c(2019.6, 2020.2)) +
guides(colour = guide_legend(title = NULL))
## Scale for 'x' is already present. Adding another scale for 'x', which will
## replace the existing scale.
## Warning: Removed 584 row(s) containing missing values (geom_path).
## Warning: Removed 17 row(s) containing missing values (geom_path).
RMSE on test set
RMSE.arima <- sqrt(mean(forecast.arima$mean - ts_temp_test)^2)
RMSE.arima
## [1] 5.039388
Check Residuals
fit.arima %>% checkresiduals
## Warning in checkresiduals(.): The fitted degrees of freedom is based on the
## model used for the seasonally adjusted data.
##
## Ljung-Box test
##
## data: Residuals from STL + ARIMA(1,0,3) with non-zero mean
## Q* = 231.19, df = 141, p-value = 2.534e-06
##
## Model df: 5. Total lags used: 146
As we can see above, the STL decomposition and ARIMA model structure with 1 AR and 3 AM terms produced good forecasts and fit the data very well. We computed an out of sample RMSE of 5.039388 - the best out of all model structures tried out. Residuals seem acceptable as they approximate a normal distribution and do not seem to be extremely autocorrelated (low p-value in Ljung-Box test).