###########################################################################################################

In this analysis we will be using different time series techniques and model structures to describe a time series and find the most appropriate model by evaluating forecasts on a test set.

For this analysis I chose to use New York City (Central Park) Daily Temperature data. I looked up New York daily weather data from the National Weather Service Forecast Office’s website https://w2.weather.gov/climate/xmacis.php?wfo=okx and transfered daily weather data (provided for each month in the site) into a spreasheet from January 2018 until March 2020. I then exported the data into a csv format to be able to load it and perform my analysis. The data includes precipitation, temperature, snowfall information from a station in Central Park and I chose to analyze and model daily Average Temperature.

rm(list = ls())

Load Useful libraries

library(plyr)
library(dplyr)
library(tidyr)
library(ggplot2)
library(lubridate)
library(zoo)
library(data.table)
library(forecast)
library(fpp2)

Data Load and Preparation

Import Data

# import data
df <- read.csv("nyc_temp.csv")
# look at loaded data
df %>% head()
##     Date MaxTemp MinTemp AvgTemp DepTemp HDD CDD Precipitation NewSnow
## 1 1/1/18      19       7    13.0   -20.4  52   0             0       0
## 2 1/2/18      26      13    19.5   -13.8  45   0             0       0
## 3 1/3/18      30      16    23.0   -10.1  42   0             0       0
## 4 1/4/18      29      19    24.0    -9.0  41   0          0.76     9.8
## 5 1/5/18      19       9    14.0   -18.9  51   0             0       0
## 6 1/6/18      13       6     9.5   -23.3  55   0             0       0
##   SnowDepth
## 1         T
## 2         0
## 3         T
## 4         1
## 5         7
## 6         6
# chose average temperature column
df_avg_temp <-  df[4]
# look at average temperature
df_avg_temp %>% head()
##   AvgTemp
## 1    13.0
## 2    19.5
## 3    23.0
## 4    24.0
## 5    14.0
## 6     9.5

Lets note here that our daily data starts on 1/1/2018 and ends on 3/31/2020

Summary Statistics

# Look at descriptive statistics
df_avg_temp %>% dim()
## [1] 821   1
df_avg_temp %>% summary()
##     AvgTemp     
##  Min.   : 9.00  
##  1st Qu.:40.50  
##  Median :52.50  
##  Mean   :54.35  
##  3rd Qu.:70.00  
##  Max.   :88.50
# check if there are any missing values
any(is.na(df_avg_temp))
## [1] FALSE

As we can see above: 1) We have 821 total observations (number of days in 2 years and 3 months). 2) The minimum average temperature in a day is 9 degrees F, the maximum is 88.50 degrees F and the mean is 54.35 degrees F. 3) Our data does not have any missing values that we would have to deal with.

Create Time Series Object

# create ts object
ts_temp <-  ts(df_avg_temp$AvgTemp, start = c(2018,1), end = c(2020,91), frequency = 365)

Plot Data

# plot
autoplot(ts_temp) +
  ggtitle("Daily Average Temperature in NYC") +
  ylab("Average Temperature (F)") +
  xlab("Day")

We can make the following observations:

  1. There is definitely seasonality (as expected). We see average temperatures’ troughs at the end/beginning of each year and peaks around summer months. We also see string daily variability.

  2. We do not necessarily observe variation as far as seasonality is concerned. Seasonality amplitudes and daily volatility seem to be at similar levels as time increases.

  3. There might be a small positive trend - as observed by slightly increasing levels at seasonal troughs but we can definitely not say that with certainty.

Split Data to Create Training and Test Sets

We will be spliting the daily data the following way: We will keep the first 2 full years (2018, 2019) to fit our models and the first three months of 2020 to test our models.

ts_temp_train <- window(ts_temp, end = 2020)
ts_temp_test <- window(ts_temp, start = 2020)

Models and Evaluation

Sesonal Naive Model

We will start by investigating and fitting a fairly simple model: the seasonal naive method. In this case we assume that any future forecast will be the exact same as the previous observed value from the same season of the year (eg same day of last year). This is a very simple model but since we do not expect huge deviations of average temperature from year to year (althought there definitely is volatility and possibly a trend). This technique is very robust and despite its lack of any particular complexity we expect reasonable results.

Fit seasonal naive model:

fit.snaive <- snaive(ts_temp_train, h=91)

Forecast

forecast.snaive <- fit.snaive %>% forecast()

Plot Forecasts

autoplot(forecast.snaive) +
  autolayer(ts_temp_test, series="True Values") +
  scale_x_continuous(limits = c(2019.6, 2020.2)) +
  guides(colour = guide_legend(title = NULL)) 
## Scale for 'x' is already present. Adding another scale for 'x', which will
## replace the existing scale.
## Warning: Removed 584 row(s) containing missing values (geom_path).
## Warning: Removed 17 row(s) containing missing values (geom_path).

RMSE on Test Set

RMSE.snaive <- sqrt(mean(forecast.snaive$mean - ts_temp_test)^2)
RMSE.snaive
## [1] 5.788889

Check Residuals

fit.snaive %>% checkresiduals()

## 
##  Ljung-Box test
## 
## data:  Residuals from Seasonal naive method
## Q* = 608.59, df = 146, p-value < 2.2e-16
## 
## Model df: 0.   Total lags used: 146

As we can see above, the seasonal naive method produced a reasonable extrapolation and forecasts and we computed an out of sample RMSE of 5.788889. Residuals seem quite acceptable as they approximate a normal distribution and do not seem to be extensively autocorrelated. However, we will be definitely trying out more appropriate methods for better forecasts.

Linear Regression with Fourier Component

We will now try modeling our time series and producing forecasts using a slightly more complex model structure. Specifically, by using a linear regression with a trend predictor as well as fourier terms in order to appropriately model the monthly and daily variations in our data.

First we will be performing Cross Validation to see what is the optimal K parameter to for fourier components in our regression (minimizing AICc):

for (k in 1:30){
  Fourier_AICc <- CV(
    tslm(ts_temp_train ~ trend + fourier(ts_temp_train, K = k)))[['AICc']]
  print(paste0("AICc for ", k, " terms = ", Fourier_AICc))
}
## [1] "AICc for 1 terms = 2906.41271533509"
## [1] "AICc for 2 terms = 2898.10913459815"
## [1] "AICc for 3 terms = 2897.23213518888"
## [1] "AICc for 4 terms = 2898.92703380921"
## [1] "AICc for 5 terms = 2889.97005572319"
## [1] "AICc for 6 terms = 2885.54111395005"
## [1] "AICc for 7 terms = 2883.45311260349"
## [1] "AICc for 8 terms = 2881.78714575574"
## [1] "AICc for 9 terms = 2884.4570565178"
## [1] "AICc for 10 terms = 2887.35683022475"
## [1] "AICc for 11 terms = 2889.29370719243"
## [1] "AICc for 12 terms = 2881.92366514089"
## [1] "AICc for 13 terms = 2875.44238413478"
## [1] "AICc for 14 terms = 2866.34280233341"
## [1] "AICc for 15 terms = 2869.59668612562"
## [1] "AICc for 16 terms = 2870.09516593584"
## [1] "AICc for 17 terms = 2874.40794493977"
## [1] "AICc for 18 terms = 2872.10944539411"
## [1] "AICc for 19 terms = 2873.64429308035"
## [1] "AICc for 20 terms = 2873.83195763287"
## [1] "AICc for 21 terms = 2875.44924394133"
## [1] "AICc for 22 terms = 2878.5183585661"
## [1] "AICc for 23 terms = 2874.76985014149"
## [1] "AICc for 24 terms = 2874.85680248355"
## [1] "AICc for 25 terms = 2865.24852340686"
## [1] "AICc for 26 terms = 2868.51752829822"
## [1] "AICc for 27 terms = 2870.19493848034"
## [1] "AICc for 28 terms = 2858.58685869846"
## [1] "AICc for 29 terms = 2862.60047185393"
## [1] "AICc for 30 terms = 2863.68350765109"

As we can see above, we minimize AICc by using 28 terms.

Fit Linear Regression with Fourier Terms:

fit.fourier <- tslm(ts_temp_train ~ trend + fourier(ts_temp_train, K=28))

Plot Fit to Make Sure it is Reasonable

autoplot(ts_temp_train) +
  autolayer(fit.fourier$fitted.values, series="Fitted Series") +
  xlab("Time (Days)") +
  ylab("Average Temperature (F)") +
  ggtitle("Daily Average Temperature in NYC") +
  guides(colour = guide_legend(title = NULL))

Forecast

# forecast using fitted model for test set
forecast.fourier <- forecast(fit.fourier, newdata=data.frame(fourier(ts_temp_test, K = 30)), h = 91)

Plot Forecasts

autoplot(forecast.fourier) +
  autolayer(ts_temp_test, series="True Values") +
  scale_x_continuous(limits = c(2019.6, 2020.2)) +
  guides(colour = guide_legend(title = NULL)) 
## Scale for 'x' is already present. Adding another scale for 'x', which will
## replace the existing scale.
## Warning: Removed 584 row(s) containing missing values (geom_path).
## Warning: Removed 17 row(s) containing missing values (geom_path).

RMSE on test set

RMSE.fourier <- sqrt(mean(forecast.fourier$mean - ts_temp_test)^2)
RMSE.fourier
## [1] 5.495157

Check Residuals

fit.fourier %>% checkresiduals()

## 
##  Breusch-Godfrey test for serial correlation of order up to 146
## 
## data:  Residuals from Linear regression model
## LM test = 335.14, df = 146, p-value < 2.2e-16

As we can see above, the linear regression model structure with the fourier terms produced reasonable forecasts and we computed an out of sample RMSE of 5.495157 - better than the seasonal naive method. Residuals seem acceptable as they approximate a normal distribution and do not seem to be extensively autocorrelated.

STL Decomposition + Naive

We will now try decomposing our series using STL and then extrapolating using a simple naive model.

Fit:

fit.stl <- ts_temp_train %>% stl(s.window="periodic", robust=TRUE)

Look at Decomposition

fit.stl %>% autoplot()

Look at Decomposition

fit.stl %>% summary()
##  Call:
##  stl(x = ., s.window = "periodic", robust = TRUE)
## 
##  Time.series components:
##     seasonal             trend            remainder        
##  Min.   :-39.52042   Min.   :55.91210   Min.   :-37.52827  
##  1st Qu.:-14.99644   1st Qu.:56.31646   1st Qu.: -3.07830  
##  Median : -1.10506   Median :56.45133   Median :  0.01762  
##  Mean   : -0.01845   Mean   :56.38009   Mean   : -0.55453  
##  3rd Qu.: 15.28905   3rd Qu.:56.48436   3rd Qu.:  2.80264  
##  Max.   : 26.33357   Max.   :56.54139   Max.   : 24.89563  
##  IQR:
##      STL.seasonal STL.trend STL.remainder data   
##      30.2855       0.1679    5.8809       30.7500
##    %  98.5          0.5      19.1         100.0  
## 
##  Weights:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.8242  0.9452  0.8676  0.9887  1.0000 
## 
##  Other components: List of 5
##  $ win  : Named num [1:3] 7311 549 365
##  $ deg  : Named int [1:3] 0 1 1
##  $ jump : Named num [1:3] 732 55 37
##  $ inner: int 1
##  $ outer: int 15

Forecast

forecast.stl <- fit.stl %>% forecast(method="naive", h=91)

Plot Forecasts

autoplot(forecast.stl) +
  autolayer(ts_temp_test, series="True Values") +
  scale_x_continuous(limits = c(2019.6, 2020.2)) +
  guides(colour = guide_legend(title = NULL))
## Scale for 'x' is already present. Adding another scale for 'x', which will
## replace the existing scale.
## Warning: Removed 584 row(s) containing missing values (geom_path).
## Warning: Removed 17 row(s) containing missing values (geom_path).

RMSE on test set

RMSE.stl <- sqrt(mean(forecast.stl$mean - ts_temp_test)^2)
RMSE.stl
## [1] 8.067703

As we can see above, prediction intervals using this method are wider and increaseing dramatically while RMSE on the test set is 8.067703 - which is worse than any of the previous methods.

STL & ETS Forecast

We can now try a slightly more complex time series method, specifically an STL decomposition with an ETS model fit. We will be using the stlf function to fit those together.

Fit

fit.ets= ts_temp_train %>% stlf(robust = TRUE,
                    method = "ets", h = 91)

Look at Fitted Model

fit.ets$model
## ETS(A,N,N) 
## 
## Call:
##  ets(y = na.interp(x), model = etsmodel, allow.multiplicative.trend = allow.multiplicative.trend) 
## 
##   Smoothing parameters:
##     alpha = 0.5943 
## 
##   Initial states:
##     l = 33.7512 
## 
##   sigma:  5.4886
## 
##      AIC     AICc      BIC 
## 7313.819 7313.852 7327.602

As we can see above, the time series was decomposed and then an optimal ETS(A,N,N) model was fit that is a Simple Exponential Smoothing.

Plot

fit.ets %>% autoplot()

Forecast

forecast.ets <- fit.ets %>% forecast()

Plot Forecasts on Test Set

autoplot(forecast.ets) +
  autolayer(ts_temp_test, series="True Values") +
  scale_x_continuous(limits = c(2019.6, 2020.2)) +
  guides(colour = guide_legend(title = NULL))
## Scale for 'x' is already present. Adding another scale for 'x', which will
## replace the existing scale.
## Warning: Removed 584 row(s) containing missing values (geom_path).
## Warning: Removed 17 row(s) containing missing values (geom_path).

RMSE on test set

RMSE.ets <- sqrt(mean(forecast.ets$mean - ts_temp_test)^2)
RMSE.ets
## [1] 6.036763

Check residuals

fit.ets %>% checkresiduals()
## Warning in checkresiduals(.): The fitted degrees of freedom is based on the
## model used for the seasonally adjusted data.

## 
##  Ljung-Box test
## 
## data:  Residuals from STL +  ETS(A,N,N)
## Q* = 301.91, df = 144, p-value = 3.03e-13
## 
## Model df: 2.   Total lags used: 146

As we can see above, the STL decomposition and Simple Exponential Smoothing (ETS(A,N,N)) model structure with the fourier terms produced reasonable forecasts and fit the data well. We computed an out of sample RMSE of 6.036763 - better than the previous STL and Naive method but worse than the linear model with fourier terms and the simple seasonal naive method. Residuals seem acceptable as they approximate a normal distribution and do not seem to be extensively autocorrelated.

STL Decomposition + ARIMA

We will now try an STL Decomposition with ARIMA model structure.

Fit

fit.arima <- ts_temp_train %>% stlf(method = "arima", h = 91)

Model Summary

fit.arima$model
## Series: x 
## ARIMA(1,0,3) with non-zero mean 
## 
## Coefficients:
##          ar1     ma1      ma2      ma3     mean
##       0.7700  0.0576  -0.1887  -0.1986  55.7952
## s.e.  0.1173  0.1265   0.1077   0.0649   0.3988
## 
## sigma^2 estimated as 13.83:  log likelihood=-1995.34
## AIC=4002.68   AICc=4002.8   BIC=4030.25

As we can see, after decomposition an optimal ARIMA(1,0,3) model was fitted using 1 autoregressive term, 3 moving average terms and no differecing.

Plot

fit.arima %>% autoplot()

Forecast

forecast.arima <- fit.arima %>% forecast()

Plot Forecasts on Test Set

autoplot(forecast.arima) +
  autolayer(ts_temp_test, series="True Values") +
  scale_x_continuous(limits = c(2019.6, 2020.2)) +
  guides(colour = guide_legend(title = NULL))
## Scale for 'x' is already present. Adding another scale for 'x', which will
## replace the existing scale.
## Warning: Removed 584 row(s) containing missing values (geom_path).
## Warning: Removed 17 row(s) containing missing values (geom_path).

RMSE on test set

RMSE.arima <- sqrt(mean(forecast.arima$mean - ts_temp_test)^2)
RMSE.arima
## [1] 5.039388

Check Residuals

fit.arima %>% checkresiduals
## Warning in checkresiduals(.): The fitted degrees of freedom is based on the
## model used for the seasonally adjusted data.

## 
##  Ljung-Box test
## 
## data:  Residuals from STL +  ARIMA(1,0,3) with non-zero mean
## Q* = 231.19, df = 141, p-value = 2.534e-06
## 
## Model df: 5.   Total lags used: 146

As we can see above, the STL decomposition and ARIMA model structure with 1 AR and 3 AM terms produced good forecasts and fit the data very well. We computed an out of sample RMSE of 5.039388 - the best out of all model structures tried out. Residuals seem acceptable as they approximate a normal distribution and do not seem to be extremely autocorrelated (low p-value in Ljung-Box test).