1 Project Setting

The setting for this project is as follows. You are an employee of the US Embassy in Beijing. After noticing the strange thickness in the air, and getting a bit of a cough, you decide that you do not what any other visitor to suffer breathing the highly dangerous Beijing air. So, you decide you want to get funding for the US Embassy to provide free air pollution masks to all visitors. However, the bureaucrats who can give you the money want to know two things:

How much money should we allocate to you over the next year
How many air pollution masks should be made available each day over the course of the year (rush shipping is expensive)

So, your task is to predict the air pollution levels each day in Beijing at the US Embassy (who records it every hour), so that you can give the bureaucrats an actionable plan. We are assuming that more people will come in to get their free mask on a day where air pollution is bad, than on a day where it isnt, and we certainly do not want to run out, as that would be unfair.

Given this setting, we will now discuss how the data was processed:

2 Data Pre-processing

2.1 Data Definition

Open to the public is data from 5 cities:

Beijing
Chengdu
Guangzhou
Shanghai
Shenyang

Each of these datasets contains:

Air quality measurements at multiple locations (including a US Embassy post)
Humity
Air pressure
Temperature
Precipitation
Wind direction
Wind Speed

Measured at every hour. For detailed info on the data quality and collection methods, please refer to this publication

2.2 Pre-processing pipeline

The next step in our analysis workflow is to get the data ready for analysis. First, we will have to do a few tasks

import data
Represent data as a time series
inspect for NAs
figure out how to deal with NAs
store in a conveniently manipulatable format

All of this can be seen in R/preprocessing.R, but we will do an in depth explanation here

2.2.1 Required libraries

library(functional)  # to compose the preprocessing pipeline
library(data.table)  # to read csvs
library(rlist)  # for list manipulations
library(pipeR)  # fast, dumb pipes
library(imputeTS)  # to impute NAs
library(pander)  # so i can read the output
library(foreach)  # go fast
library(doParallel)  # go fast

2.2.2 Data loading

We want to load all of the csv files stored in data/, so we will write a function which loads all of them into a list of data.table(fast data frame) objects. We will want the name of each item in the list to be the name of the csv file, minus the useless numbers

import <- function(path) {
    # first we list the files in our path, in this case 'data/'
    files <- list.files(path)
    # then we search for csvs
    files <- files[grepl(files, pattern = ".csv")]
    # paste the csv names onto the filepath(data/whatever.csv)
    filepaths <- sapply(files, function(x) paste0(path, x))
    # read them in to a list
    out <- lapply(filepaths, fread)
    # Get rid of .csv in each filename
    fnames <- gsub(".csv", "", files)
    # get rid of the confusing numbers
    fnames <- gsub("[[:digit:]]+", "", fnames)
    # set the names of the list to be the clean filenams
    names(out) <- fnames
    out
}

2.2.3 NA identification

Next we want to see how many NAs we are dealing with. Since we have about 20 million total observations, we will represent these as a proportion. We will apply the NA count function to our list using the beautiful rapply function

naCount <- function(xs) {
    rapply(xs, function(x) sum(is.na(x)/length(x)), how = "list")
}
# load in the data now
datadir = "data/"
china <- import(datadir)
pander(naCount(china))

BeijingPM_:
- No: 0
- year: 0
- month: 0
- day: 0
- hour: 0
- season: 0
- PM_Dongsi: 0.5236
- PM_Dongsihuan: 0.61
- PM_Nongzhanguan: 0.5259
- PM_US Post: 0.04178
- DEWP: 0.00009509
- HUMI: 0.006447
- PRES: 0.006447
- TEMP: 0.00009509
- cbwd: 0.00009509
- Iws: 0.00009509
- precipitation: 0.009204
- Iprec: 0.009204
ChengduPM_:
- No: 0
- year: 0
- month: 0
- day: 0
- hour: 0
- season: 0
- PM_Caotangsi: 0.5356
- PM_Shahepu: 0.5323
- PM_US Post: 0.4504
- DEWP: 0.01006
- HUMI: 0.01017
- PRES: 0.009908
- TEMP: 0.01002
- cbwd: 0.009908
- Iws: 0.01014
- precipitation: 0.0562
- Iprec: 0.0562
GuangzhouPM_:
- No: 0
- year: 0
- month: 0
- day: 0
- hour: 0
- season: 0.00001902
- PM_City Station: 0.3848
- PM_5th Middle School: 0.5988
- PM_US Post: 0.3848
- DEWP: 0.00001902
- HUMI: 0.00001902
- PRES: 0.00001902
- TEMP: 0.00001902
- cbwd: 0.00001902
- Iws: 0.00001902
- precipitation: 0.00001902
- Iprec: 0.00001902
ShanghaiPM_:
- No: 0
- year: 0
- month: 0
- day: 0
- hour: 0
- season: 0
- PM_Jingan: 0.5303
- PM_US Post: 0.3527
- PM_Xuhui: 0.521
- DEWP: 0.0002472
- HUMI: 0.0002472
- PRES: 0.0005325
- TEMP: 0.0002472
- cbwd: 0.0002282
- Iws: 0.0002282
- precipitation: 0.07624
- Iprec: 0.07624
ShenyangPM_:
- No: 0
- year: 0
- month: 0
- day: 0
- hour: 0
- season: 0
- PM_Taiyuanjie: 0.5362
- PM_US Post: 0.5877
- PM_Xiaoheyan: 0.5317
- DEWP: 0.01316
- HUMI: 0.01293
- PRES: 0.01316
- TEMP: 0.01316
- cbwd: 0.01316
- Iws: 0.01316
- precipitation: 0.2427
- Iprec: 0.2427

It looks like the quality of the data collected outside of the US Embassy posts and in cities other than beijing is really bad. Luckily for us, BeijingPM$PM_US.Post has relatively high quality data. That being said, we are still going to have to impute the NAs. Before that though, lets convert this to time series data, as we will be imputing the NAs in a different way with time series data than we will for normal data.

2.2.4 Data transformation: Hourly time series

First, we define a function which converts the data to a time series object:

tots <- function(v) {
    ts(v, frequency = 365 * 24)
}

Next, we are going to want want to scale this up, to work on a whole data frame. We will also convert the data frame to a list, just to avoid any typing issues down the road. We also do not want to convert non time series data to time series, so we must avoid those as well:

totslist <- function(df) {
    # define the column names which we do not want to alter
    badlist <- c("No", "year", "month", "day", "hour", "season", "cbwd")
    # get the column names of the original data frame
    nms <- colnames(df)
    # coerce it into a list
    df <- as.list(df)
    # if the name of the item of the list is in our to not change list, do
    # nothing otherwise, convert to ts
    for (name in nms) {
        if (name %in% badlist) {
            df[[name]] <- df[[name]]
        } else {
            df[[name]] <- tots(df[[name]])
        }
    }
    # return the created list
    df
}

Next, we want to scale this function to work on a list of data frames:

totsall <- function(xs) {
    lapply(xs, totslist)
}

2.2.5 NA imputation

Imputing NAs in time series is a tricky topic, as we cannot just ignore them (that would mess up our sampling rate). There are several methods of dealing with NAs in time series, including kalman filters, polynomial interpretation, and spline interpolation. In this case, as we are only concerned with the time series that have relatively few NAs, I elected to use spline interpolation, as it is much faster than the kalman filter, while still providing pretty good accuracy. First, a function was constructed to use the try catch pattern to impute NAs of items with the time-series class.

The logic is as follows:

try spline interpolation on the vector
- if the output is a ts object, return the results
- if the output is an error or a non time series object, return the original object

imp_test <- function(v) {
    out <- try(na.interpolation(v, "spline"))
    ifelse(is.ts(out), return(out), return(v))
}

Next, we apply that function to a single list, in parallel to save time:

impute <- function(xs) {
    foreach(i = 1:length(xs), .final = function(x) {
        setNames(x, names(xs))
    }) %dopar% imp_test(xs[[i]])
}

Finally, we apply that function to our list of lists

impute_list <- function(xs) {
    lapply(xs, impute)
}

Next we are ready to convert to a more useful data structure, and then pipeline all these functions together for quick preprocessing

2.2.6 Pipelining

First, we will convert our slow, massive list into the fast, superior hash table. For large, complex datasets, a hash table was shown to be at more or less 3 times faster than a list, and as we will be repeatedly accessing data from this, we would prefer it to be a quick index. For more info on R hash tables, please refer to this link.

to_hash <- function(xs) {
    list2env(xs, envir = NULL, hash = TRUE)
}

Univariate EDA Next we are ready to pipeline it all together into a single preprocessing function, using functional::Compose

preprocess <- Compose(import, totsall, impute_list, to_hash)

3 Univariate EDA

Required libraries

library(ggthemes)
library(ggplot2)
library(cowplot)

3.1 Helper functions

First, we define some helper functions, in order to save us time and make our code more readable. First, lets steal some stuff from forecast and tswge:

# pretty plots for our analysis
seasonplot <- forecast::ggseasonplot
subseriesplot <- forecast::ggsubseriesplot
lagplot <- forecast::gglagplot
sampplot <- tswge::plotts.sample.wge
# clean up errant data points
clean <- forecast::tsclean

Next let us define a function which creates a function to resample time series, and then define ones with new sampling rates. Finally, we will write a function which applies each resampling function to a vector

# resample generator
change_samples <- function(n) {
    function(xs) {
        out <- unname(tapply(xs, (seq_along(xs) - 1)%/%n, sum))
        out <- ts(out, frequency = (8760/n))
        out
    }
}

# daily and weekly sampling, monthly is 4 weeks
to_daily <- change_samples(24)
to_weekly <- change_samples(24 * 7)
to_monthly <- change_samples(24 * 7 * 4)
to_season <- change_samples(24 * (365/4))

# pipelining final cleaning and conversion, removing outlier and the couple
# of errant negative values (which do not make sense)
cleandays <- function(xs) {
    xs %>>% clean %>>% abs %>>% to_daily
}

cleanweeks <- function(xs) {
    xs %>>% clean %>>% abs %>>% to_weekly
}
cleanmonths <- function(xs) {
    xs %>>% clean %>>% abs %>>% to_monthly
}
cleanseas <- function(xs) {
    xs %>>% clean %>>% abs %>>% to_season
}

# resample the whole ts
resample <- function(xs) {
    day <- xs %>>% cleandays %>>% window(end = 6)
    week <- xs %>>% cleanweeks %>>% window(end = 6)
    month <- xs %>>% cleanmonths %>>% window(end = 6)
    seas <- xs %>>% cleanseas %>>% window(end = 6)
    list(day = day, week = week, month = month, season = seas)
}

3.2 Wge sample plots

First lets load in and resample our data:

library(ggthemes)
library(ggplot2)
library(cowplot)
china <- preprocess(datadir)

#> Error in na.interpolation(v, "spline") : Input x is not numeric
#> Error in na.interpolation(v, "spline") : Input x is not numeric
#> Error in na.interpolation(v, "spline") : Input x is not numeric
#> Error in na.interpolation(v, "spline") : Input x is not numeric
#> Error in na.interpolation(v, "spline") : Input x is not numeric

bj <- china$BeijingPM_$PM_US
bj <- resample(bj)

3.2.1 Daily data

Now lets look at the sample plots, first with the daily data

dy <- sampplot(bj$day)

So looking at this, we see that we have what looks like a wandering behavior, according to the frequency plot, but the lags tell a different story. There appears to be some strong, high order ARMA stuff going on. Looking at the line plot, we see that there looks to be a long seasonal component, maybe a yearly one, which would explain why the Parzen window is telling us its somewhat wandering. We also see what looks to be a few more peaks in the parzenwindow, which either tells us that there is a complex seasonality going on (likely at this sampling rate), or we have some sort of high order ARMA model (or potentially both). Lets widen our sampling rate to weeks, months, and quarters, to see if there is anything gong on there, maybe be able to catch that yearly pattern. Lets check it out:

So lets go through each of these plots and discuss:

3.2.2 Weekly data

wy <- sampplot(bj$w)

Here we can see better that the time series is not so much wandering, as having a very long seasonal pattern (parzen window). We can tell by the little hook on the left. The ACFs are not so useful in this case, as they are tiny beyond lag 1, due to our weird sampling. However, we know for sure there is some sort of seasonal pattern going on. Potentialy there is a multiseasonal pattern again, or some high order ARMA.

3.2.3 Monthly data

my <- sampplot(bj$m)

Here we see another oscillating component in the ACFs, indicating again seasonal or high order ARMA (it looks a lot like patemp from tswge, which is seasonal data that is well described by a high order ARMA model). The frequency is again painting a similar picture, there is something going on with a clear frequency. lets look at the quarterly data to know for sure

3.2.4 Quarterly Data

sy <- sampplot(bj$s)

With a bit more than 20 data points, and a supposed seasonal period of a year (4 datapoints), we would expect to have a frequency of about 0.2, and lo and behold, we have one. There is some strong evidence of a yearly seasonal pattern in the data (also look at the peak in the ACFs at 4, pretty good evidence right there). The scatterplot tells us a pretty obvious story as well

3.3 Seasonal Plots

Lets check out this seasonal pattern a bit more with a seasonal plot. We will plot them all at once this time and then discuss:

sda <- seasonplot(bj$day) + scale_color_hc() + theme_hc() + ggtitle("Seasonal plot: Daily")
sdw <- bj$week %>>% seasonplot + theme_hc() + scale_color_hc() + ggtitle("Seasonal plot: Weekly")
sdm <- bj$month %>>% seasonplot + theme_hc() + scale_color_hc() + ggtitle("Seasonal plot: Monthly")
sds <- bj$seas %>>% seasonplot + theme_hc() + scale_color_hc() + ggtitle("Seasonal plot: Quarterly")
plot_grid(sda, sdw, sdm, sds, ncol = 2)

3.3.1 Interpretation

The shape of these seasonal plots, and how they all kind of line up, indicate to me that there is some sort of seasonality. It is especially apparent in the Monthly plot, where we have it line up very well, but it is clear in the other plots as well.

Why is this happening? Is there a real world explanation to this?

Indeed there is. Every year, around November, citywide central heating turns on inBeijing for the winter. This would cause a clear change in the air pollution, as heating is not an energy efficient process. In the summer, it goes down, because central heating is off, and people are more likely to go outside and walk and open their windows.

4 Classical Univariate Analysis

With our newfound knowledge, we will now perform a classical analysis of this time series.

library(tswgewrapped)
# > my wrapper scripts of tswge, with more convenient syntax

4.1 Setup

The setup for this bit of analysis is as follows:

china <- preprocess("data/")

#> Error in na.interpolation(v, "spline") : Input x is not numeric
#> Error in na.interpolation(v, "spline") : Input x is not numeric
#> Error in na.interpolation(v, "spline") : Input x is not numeric
#> Error in na.interpolation(v, "spline") : Input x is not numeric
#> Error in na.interpolation(v, "spline") : Input x is not numeric

bj <- china$BeijingPM_$PM_US
bjs <- resample(bj)
bjus <- bjs$day

# split
train <- window(bjus, end = 5)
test <- window(bjus, start = 5)

# clean up
rm(bj, bjus, bjs)

4.2 S3 objects and methods

To make analysis of models from different libraries with different structures easier, I wrote some S3 classes and generic methods in order to view and evaluate them. These are shown below

4.2.1 Scores generic

The scores generic is used to get the score of a model of any type.

scores <- function(obj) {
    UseMethod("scores")
}

4.2.2 The wge class

We next create a wge class, where we can store our tswge forecasts

as.wge <- function(x) structure(x, class = "wge")

We will use 3 methods to evaluate TSWGE models:

4.2.2.1 ASE

ASE <- function(predicted, actual) {
    mean((actual - predicted)^2)
}

4.2.2.2 MAPE

MAPE <- function(predicted, actual) {
    100 * mean(abs((actual - predicted)/actual))
}

4.2.2.3 Confidence Score

Confidence score is a made up metric I created, in order to evaluate the prediction interval. How it works is as follows: For each point in the actual observations, if it lies outside the prediction interval, give that point a score of 1. If it is within the confidence interval, give it a score of 0. Then, average the scores and multiply by 100 to get the percentage we got wrong.

checkConfint <- function(upper, lower, actual) {
    (actual < lower) | (actual > upper)
}

confScore <- function(upper, lower, actual) {
    rawScores <- ifelse(checkConfint(upper, lower, actual), 1, 0)
    return(sum(rawScores)/(length(actual)) * 100)
}

4.2.2.4 Method for scoring wge objects

To score tswge objects, we can simply use the following S3 method:

scores.wge <- function(xs) {
    mape <- MAPE(xs$f, test)
    ase <- ASE(xs$f, test)
    confs <- confScore(xs$ul, xs$ll, test)
    c(ASE = ase, MAPE = mape, Conf.Score = confs)
}

4.2.3 Visual comparison of prediction and reality

To compare the predictions vs reality of a wge object, I wrote an autoplot method for wge objects:

.testPredPlot <- function(xs) {
    p <- ggplot() + theme_hc() + scale_color_hc()
    doplot <- function(df) {
        p <<- p + geom_line(data = df, aes(x = t, y = ppm, color = type))
    }
    out <- lapply(xs, doplot)
    out[[2]]
}


autoplot.wge <- function(obj) {
    testdf <- data.frame(type = "actual", t = seq_along(test), ppm = as.numeric(test))
    preddf <- data.frame(type = "predicted", t = seq_along(obj$f), ppm = as.numeric(obj$f))
    confdf <- data.frame(upper = obj$ul, lower = obj$ll, t = seq_along(test))
    dfl <- list(testdf, preddf)
    .testPredPlot(dfl) + geom_line(data = confdf, aes(x = t, y = lower, alpha = 0.2), 
        linetype = 3003) + geom_line(data = confdf, aes(x = t, y = upper, alpha = 0.2), 
        linetype = 3003) + guides(alpha = FALSE)
}

4.3 Differencing

Before we begin our analysis, we will create a seasonal and trend based difference of our original time series, mostly for the sake of thouroughness. The difference function used is as follows

difference

#> function (type, x, n) 
#> {
#>     szn_trans <- function(x, n) {
#>         artrans.wge(x, phi.tr = c(rep(0, n - 1), 1))
#>     }
#>     arima_trans <- function(x, n) {
#>         f <- artrans.wge(x, phi.tr = 1)
#>         if (n == 1) {
#>             res <- f
#>             return(res)
#>         }
#>         else {
#>             arima_trans(f, n - 1)
#>         }
#>     }
#>     if (is.character(enexpr(type)) == F) {
#>         type <- as.character(enexpr(type))
#>     }
#>     if (type %in% c("arima", "ARIMA", "Arima")) {
#>         return((arima_trans(x, n)))
#>     }
#>     if (type %in% c("ARUMA", "Aruma", "aruma", "Seasonal", "seasonal")) {
#>         szn_trans(x, n)
#>     }
#> }
#> <bytecode: 0xa230c38>
#> <environment: namespace:tswgewrapped>

4.3.1 Seasonal differencing

Our EDA tells us there should be a seasonality of ~365 days. So, lets take that out of our time series.

trainSea <- difference(seasonal, train, 365)

Nothing appeared to happen, This is either because the difference is out of the scale of our ACF plot, (365 is a lot of lags), or because there is something more going on. We will see.

4.3.2 Trend based differencing

Just for the sake of completeness, lets take a trend based difference of our data:

trainTrend <- difference(arima, train, 1)

There is little to say here, I do not believe in this model.

4.4 Estimation of order

To estimate the order, first all the train sets were put into a list:

trainers <- list(ARMA = train, Seasonal = trainSea, ARIMA = trainTrend)

Then, a wrapper around aic.wge speed up (especially with parallel computation) aic5.wge, was used on the trainers. The speed differences are minor but still much time is saved. The wrappers are:

tswgewrapped:::expand

#> function (v1, v2) 
#> {
#>     c(sapply(v1, function(x) rep(x, length(v2))))
#> }
#> <bytecode: 0x5b62368>
#> <environment: namespace:tswgewrapped>

tswgewrapped:::rewrite

#> function (v1, v2) 
#> {
#>     rep(v1, length(v2))
#> }
#> <bytecode: 0x5ad45a0>
#> <environment: namespace:tswgewrapped>

tswgewrapped:::getpq

#> function (x, p = 8, q = 5, type = "aic", silent = TRUE) 
#> {
#>     if (silent == FALSE) {
#>         cat("Calculating ", type, " for ARMA(", p, ", ", q, " )\n", 
#>             sep = "")
#>     }
#>     res <- try(aic.wge(x, p, q, type))
#>     if (is.list(res)) {
#>         out <- c(res$p, res$q, res$value)
#>     }
#>     else {
#>         out <- c(p, q, 9999)
#>     }
#>     out
#> }
#> <bytecode: 0x59a2978>
#> <environment: namespace:tswgewrapped>

aic5

#> function (x, p = 0:8, q = 0:5, type = "aic", silent = TRUE) 
#> {
#>     ip <- expand(p, q)
#>     iq <- rewrite(q, p)
#>     out <- mapply(function(v1, v2) getpq(x, v1, v2, type, silent), 
#>         ip, iq)
#>     out <- (as.data.frame(t(out)))
#>     colnames(out) <- c("p", "q", type)
#>     head(out[order(out[, 3], decreasing = F), ], 5)
#> }
#> <bytecode: 0x58da788>
#> <environment: namespace:tswgewrapped>

tswgewrapped:::aics

#> [[1]]
#> function (x, p = 0:8, q = 0:5, silent) 
#> aic5(x, p, q, type = "aic", silent)
#> <bytecode: 0x57781b8>
#> <environment: namespace:tswgewrapped>
#> 
#> [[2]]
#> function (x, p = 0:8, q = 0:5, silent) 
#> aic5(x, p, q, type = "bic", silent)
#> <bytecode: 0x5778810>
#> <environment: namespace:tswgewrapped>

aicbic

#> function (vec, p = 0:8, q = 0:5, parallel = FALSE, cl = NULL, 
#>     silent = FALSE) 
#> {
#>     if (parallel == TRUE) {
#>         parLapply(cl, aics, function(f) f(vec, p, q, silent = TRUE))
#>     }
#>     else {
#>         lapply(aics, function(f) f(vec, p, q, silent))
#>     }
#> }
#> <bytecode: 0x567fbc8>
#> <environment: namespace:tswgewrapped>

The rest of the source code can be seen here. The calculation of all the AICs and BICs is shown below

aics <- lapply(trainers, aicbic, 0:10, 0:8)

pander(aics)

ARMA:

  *

    -------------------------
     &nbsp;   p    q    aic
    -------- ---- --- -------
     **90**   9    8   14.62

     **53**   5    7   14.62

     **68**   7    4   14.62

     **11**   1    1   14.62

     **99**   10   8   14.62
    -------------------------

  *

    ------------------------
     &nbsp;   p   q    bic
    -------- --- --- -------
     **11**   1   1   14.63

     **19**   2   0   14.63

     **4**    0   3   14.63

     **20**   2   1   14.63

     **12**   1   2   14.63
    ------------------------

Seasonal:

  *

    ------------------------
     &nbsp;   p    q   aic
    -------- ---- --- ------
     **50**   5    4   15.3

     **51**   5    5   15.3

     **72**   7    8   15.3

     **70**   7    6   15.3

     **99**   10   8   15.3
    ------------------------

  *

    ------------------------
     &nbsp;   p   q    bic
    -------- --- --- -------
     **11**   1   1   15.32

     **3**    0   2   15.33

     **19**   2   0   15.33

     **4**    0   3   15.33

     **12**   1   2   15.33
    ------------------------

ARIMA:

  *

    ------------------------
     &nbsp;   p   q    aic
    -------- --- --- -------
     **58**   6   3   14.63

     **38**   4   1   14.63

     **49**   5   3   14.63

     **44**   4   7   14.63

     **20**   2   1   14.63
    ------------------------

  *

    ------------------------
     &nbsp;   p   q    bic
    -------- --- --- -------
     **20**   2   1   14.64

     **4**    0   3   14.64

     **38**   4   1   14.65

     **56**   6   1   14.66

     **49**   5   3   14.66
    ------------------------

4.5 Estimation of Parameters

To estimate the parameters, we will be using the wrapper estimation function I wrote (just more convenient syntax), as well as the wrapper around ljung.wge I wrote, which is just more convenient to read (and doesnt print 100 residuals)

estimate

#> function (xs, p, q = 0, type = "mle", ...) 
#> {
#>     if (q > 0) {
#>         return(est.arma.wge(xs, p, q, ...))
#>     }
#>     else {
#>         return(est.ar.wge(xs, p, type, ...))
#>     }
#> }
#> <bytecode: 0xf8e0298>
#> <environment: namespace:tswgewrapped>

tswgewrapped:::silence

#> function (x) 
#> {
#>     sink("/dev/null")
#>     x
#>     sink()
#> }
#> <bytecode: 0x658ca00>
#> <environment: namespace:tswgewrapped>

hush

#> function (f) 
#> {
#>     silence(res <- f)
#>     return(res)
#> }
#> <bytecode: 0x4ec3ba0>
#> <environment: namespace:tswgewrapped>

ljung_box

#> function (x, p, q, k_val = c(24, 48)) 
#> {
#>     ljung <- function(k) {
#>         hush(ljung.wge(x = x, p = p, q = q, K = k))
#>     }
#>     sapply(k_val, ljung)
#> }
#> <bytecode: 0x1296b818>
#> <environment: namespace:tswgewrapped>

All estimations were run with different orders until the best results were found:

4.5.1 ARMA estimates

estARMA <- estimate(train, 9, 8)

#> 
#> Coefficients of Original polynomial:  
#> 0.1142 0.2027 0.4502 -0.5994 0.0159 0.0679 0.7182 0.0737 -0.1559 
#> 
#> Factor                 Roots                Abs Recip    System Freq 
#> 1-1.3528B+0.9692B^2    0.6979+-0.7381i      0.9845       0.1295
#> 1+1.5505B+0.9575B^2   -0.8097+-0.6235i      0.9785       0.3956
#> 1-0.9753B              1.0253               0.9753       0.0000
#> 1+0.5870B+0.8377B^2   -0.3503+-1.0349i      0.9153       0.3020
#> 1+0.4933B             -2.0270               0.4933       0.5000
#> 1-0.4169B              2.3985               0.4169       0.0000
#>   
#>

ljung_box(estARMA$res, 9, 8)

#>            [,1]             [,2]            
#> test       "Ljung-Box test" "Ljung-Box test"
#> K          24               48              
#> chi.square 12.9592          31.2323         
#> df         7                31              
#> pval       0.07310876       0.4545512

par(mfrow = c(1, 2))
acf(estARMA$res)
plot(estARMA$res)

4.5.2 ARUMA estimates

estSeas <- estimate(trainSea, 7, 6)

#> 
#> Coefficients of Original polynomial:  
#> -0.1091 -0.1268 0.5919 -0.3199 -0.0725 -0.4779 0.2916 
#> 
#> Factor                 Roots                Abs Recip    System Freq 
#> 1+1.5049B+0.9193B^2   -0.8184+-0.6464i      0.9588       0.3936
#> 1+0.5136B+0.8599B^2   -0.2987+-1.0362i      0.9273       0.2947
#> 1-1.3829B+0.7006B^2    0.9869+-0.6733i      0.8370       0.0953
#> 1-0.5265B              1.8993               0.5265       0.0000
#>   
#>

ljung_box(estSeas$res, 7, 6)

#>            [,1]             [,2]            
#> test       "Ljung-Box test" "Ljung-Box test"
#> K          24               48              
#> chi.square 18.49173         44.33813        
#> df         11               35              
#> pval       0.07084993       0.1339141

par(mfrow = c(1, 2))
acf(estARMA$res)
plot(estARMA$res)

4.5.3 ARIMA estimates

estTrend <- estimate(trainTrend, 6, 3)

#> 
#> Coefficients of Original polynomial:  
#> -0.0335 -0.6124 0.4784 -0.1861 0.0499 -0.0888 
#> 
#> Factor                 Roots                Abs Recip    System Freq 
#> 1+0.6325B+0.7674B^2   -0.4121+-1.0646i      0.8760       0.3088
#> 1-1.0402B+0.3712B^2    1.4011+-0.8549i      0.6093       0.0872
#> 1+0.4412B+0.3116B^2   -0.7078+-1.6456i      0.5582       0.3147
#>   
#>

ljung_box(estTrend$res, 6, 3)

#>            [,1]             [,2]            
#> test       "Ljung-Box test" "Ljung-Box test"
#> K          24               48              
#> chi.square 17.52443         43.90103        
#> df         15               39              
#> pval       0.2884916        0.2715692

par(mfrow = c(1, 2))
acf(estARMA$res)
plot(estARMA$res)

4.5.4 Discussion

All of our models look pretty good as far as estimates go. However, I think only ARUMA and ARMA are appropriate. We are only proceeding with ARIMA to be thorough.

4.6 Forecasting

We will now proceed with our daily, one year ahead forecasts, as well as the evaluation of them. The fcst function from tswgewrapped is again just a wrappper for fore.whatever.wge:

tswgewrapped::fcst

#> function (type, ...) 
#> {
#>     phrase <- paste0("fore.", enexpr(type), ".wge")
#>     func <- parse_expr(phrase)
#>     eval(expr((!!func)(...)))
#> }
#> <bytecode: 0x5d63758>
#> <environment: namespace:tswgewrapped>

4.6.1 ARMA

armaCast <- tswgewrapped::fcst(type = aruma, x = train, phi = estARMA$phi, theta = estARMA$theta, 
    n.ahead = length(test)) %>>% as.wge

This one damps to the mean pretty quickly, however this may be a decent predictor, it is common sense that most likely the result will be the mean. It may provide a good baseline in the future, but is not in its own right a very good predictor

4.6.2 ARUMA

seaCast <- tswgewrapped::fcst(type = aruma, x = train, phi = estSeas$phi, theta = estSeas$theta, 
    s = 365, n.ahead = length(test)) %>>% as.wge

This looks a lot more like the time series, so it is likely a more appropriate model at this forecast horizon, but the forecasts themselves appear to be off

4.6.3 ARIMA

This will just be the last value over and over, highly uninteresting and not appropriate

trendCast <- tswgewrapped::fcst(type = aruma, x = train, phi = estTrend$phi, 
    theta = estTrend$theta, d = 1, n.ahead = length(test)) %>>% as.wge

Nothing surprising here

4.6.3.1 Note

An airline model was not attempted, because it does not make much sense.

4.7 Model evaluation

4.7.1 Scores

As the appropiateness of each model has already been noted, so now we can look at the scores of each model:

casts <- list(arma = armaCast, seasonal = seaCast, arima = trendCast)
data.frame(list.rbind(lapply(casts, scores)))

These results are not surprising, our common sense ARMA forecast in the long run did pretty well (just the mean), and will serve well as a benchmark for future forecasts. Our seasonal model, which looks pretty, did not do so hot, as it was just off a bit. Our ARIMA model only did well because the last value was near the mean. Below we can see the comparison between the train and test:

autoplot(armaCast) + ggtitle("ARMA")
autoplot(seaCast) + ggtitle("Seasonal")
autoplot(trendCast) + ggtitle("ARIMA")

One important thing to note is the prediction intervals for these models. They did an excellent job of capturing the actual data within their prediction intervals. This is really useful information to us.

4.8 Discussion

Although our forecasts were not very good, especially our ARMA forecast will provide us with a nice benchmark to compare our newer models to, at least by ASE. I think the seasonality is an important factor in this dataset, and although our seasonal model did not do so well, it is worth looking more into. We will proceed next with some multiseasonal approaches. As it is daily data, there tend to be two major seasonalities: a weekly pattern, and a seasonal pattern.

4.9 Saving the important data:

We would like to save the important information from our models, so that way in the end, we have a nice list of all of our information. That is, the scores of the model, the predictions themselves, and the the prediction limits:

makeModel <- function(...) {
    UseMethod("makeModel")
}

makeModel.wge <- function(obj) {
    score <- scores(obj)
    return(list(preds = obj$f, ul = obj$ul, ll = obj$ll, ase = score[1], mape = score[2], 
        Conf.Score = score[3]))
}
models <- lapply(casts, makeModel)

5 Multiseasonal setup

First, we will load up the forecast library, in order to get functionality for multiseasonal forecasts

library(forecast)

Next, we will recast our train and test sets as msts objects:

periods <- c(7, 365.25)
train <- msts(train, seasonal.periods = periods, end = 5)
test <- msts(test, seasonal.periods = periods, start = 5)

6 Multiseasonal Dynamic Harmonic Regression

In class, we learned that you can perform standard regression on a time series, then apply our arima analysis on the errors of that time series, and then adjust for that. The auto.Arima function does all of this for you, if you supply it with an x regressor. The math behid it is basically representing the equation as follows

\[y_t = \beta_0 + \beta_1 x_{1,t} + ... + \beta_k x_{k,t} + \eta_t\]

where $\eta_t$ is a time series of errors. By estimating that time series, we can then properly estimate $y_t$, as well as make forecasts.

In the case of multiple seasonalities, we can get a bit more creative. We can represent the time series as a fourier series, that is we can write a bunch of sine and cosine terms, a set for our weekly seasonality, and a set for our yearly seasonality. We can see this below:

exampleFourier <- data.frame(fourier(train, K = c(3, 100)))

Note that we do three fourier expansions of our weekly trend, and about 100 of our yearly trend. We want to do about half as many expansions as our trend. Lets check out what they look like, we will look at one fourier expansions on a weekly level, and one on a yearly level

weekly <- exampleFourier[[1]] + exampleFourier[[2]] + exampleFourier[[3]] + 
    exampleFourier[[4]] + exampleFourier[[5]] + exampleFourier[[6]]
par(mfrow = c(1, 2))
plot(weekly[1:63], type = "l")
plot(train[1:63], type = "l")

So we see we may not have all of the patterns of the data on a weekly scale represented, but it is not a half bad representation. Let us try with the yearly data as well(we will provide a smoothing on the weekly level first so we can see)

First lets do a low pass filter of our training dataset:

mafun <- function(xs, n) {
    stats::filter(xs, rep(1, n))/n
}
trsmooth <- mafun(train, 56)

yearly <- exampleFourier[[7]] + exampleFourier[[8]] + exampleFourier[[9]] + 
    exampleFourier[[10]] + exampleFourier[[11]] + exampleFourier[[12]]
par(mfrow = c(1, 2))
plot(yearly, type = "l")
plot(trsmooth, type = "l")

Again, not a bad little representation of the series. If we incorporated the remaining 206 components of our fourier expansion of the yearly trend, we would have a really powerful representation of the 365.25 day period.

We can use this fourier expansion as a regressor on our time series, and this should give us a pretty good representation of the complex seasonality of this series.

6.1 S3 methods

We need to construct a few S3 methods for this section. First, the forecast function for Arima objects (those constructed by auto.arima and Arima) returns a forecast of the same length as the length of the external regressors (regardless of the value of h). To deal with this, we write a new method called newFore, which creates a new Arima object that can make forecasts off the length of the test set, and then does a forecast using those xregressors. This returns the same value as running:

predict(`Arima object`, newxreg = `regressors from the test set`, n.ahead = whatever)

but gives us the benefits of the forecast object, such as nice plots, confidence intervals, etc.

newFore <- function(...) {
    UseMethod("newFore")
}
newFore.Arima <- function(obj, newdata, xreg = NULL, h = 1) {
    refit <- Arima(newdata, model = obj, xreg = xreg)
    forecast(refit, h = h, xreg = xreg)
}

Next we want to add new methods for these forecast objects, so we can view our results in a consitent manner:

as.fore <- function(x) structure(x, class = "fore")

autoplot.fore <- function(obj) {
    testdf <- data.frame(type = "actual", t = seq_along(test), ppm = as.numeric(test))
    preddf <- data.frame(type = "predicted", t = seq_along(test), ppm = as.numeric(obj$fitted[1:length(test)]))
    
    confdf <- data.frame(upper = obj$upper[, 2], lower = obj$lower[, 2], t = seq_along(test))
    dfl <- list(testdf, preddf)
    .testPredPlot(dfl) + geom_line(data = confdf, aes(x = t, y = lower, alpha = 0.2), 
        linetype = 3003) + geom_line(data = confdf, aes(x = t, y = upper, alpha = 0.2), 
        linetype = 3003) + guides(alpha = FALSE)
}




scores.fore <- function(obj) {
    mape <- MAPE(obj$fitted[1:length(test)], test)
    ase <- ASE(obj$fitted, test)
    confs <- confScore(obj$upper[1:length(test), 2], obj$lower[1:length(test), 
        2], test)
    c(ASE = ase, MAPE = mape, Conf.Score = confs)
}

6.2 Building the model

First we construct a fourier expansion of both the train and test set:

trainExp <- fourier(train, K = c(3, 100))
testExp <- fourier(test, K = c(3, 100))

Next, we plug this in to the auto.arima function, and build our model. Note we are passing the parameter lambda = 0 into the function. Internally, auto.arima does a Box-Cox transformation on the data before running it, to make the model work smoothly. In our case, we do not want values less than zero, because that wouldnt make sense. So, we set lambda, the parameter passed into the box cox transformation to 0, which represents a logarithmic transformation. In turn, this will keep our model, and thus our predictions

mseadyn <- auto.arima(train, seasonal = FALSE, lambda = 0, xreg = trainExp)

Lets check out where the roots lie on the unit circle:

autoplot(mseadyn)

And lets see what order model it identified within the residuals:

data.frame(p = length(mseadyn$model$phi), q = (mseadyn$model$theta), d = length(mseadyn$model$Delta))

6.3 Forecasting

First, lets look at what the origina forecast looks like, without doing the newFore function:

forecast(mseadyn, xreg = trainExp) %>>% autoplot

Whew, 5 years ahead. Thats a bit ridiculous. Now lets run the newFore on it, and then check it out again

mseafor <- newFore(mseadyn, newdata = test, xreg = testExp, h = 1)
autoplot(mseafor)

Note the scale on this plot it is now a one year ahead forecast, which is exactly what we want. Unlike the basic ARUMA forecast we did earlier, the confidence limits also lie above zero in all cases, and hopefully it will be a bit better.

6.4 Model Evaluation

This model seems to be slightly more appropriate than the simpler models we fit, as it is highly likely this data is multiseasonal, and I much prefer having a lower ARMA order than (9,8). Lets check it out

6.4.1 Visual inspection

mseafor %>>% as.fore %>>% autoplot

This appears to be a ridiculously good forecast. Lets check out how it scored:

6.4.2 Scores

mseafor %>>% as.fore %>>% scores %>>% t %>>% data.frame

Wow. The ASE is nearly a million units smaller than the baseline ARMA forecast. This is a massive improvement, and it seems that our hypothesis of multiseasonality in the data was correct. Lets try a different multiple seasonality technique. Aside fro that, a good percentage of the data was stored in our prediction interval. This indicates that on its own, this model is a good forecasting tool.

6.5 Model saving

We will now add this model to our list of models:

makeModel.fore <- function(obj) {
    score <- scores(obj)
    return(list(preds = obj$fitted, ul = obj$upper[, 2], ll = obj$lower[, 2], 
        ase = score[1], mape = score[2], Conf.Score = score[3]))
}

models$dynamic <- makeModel(as.fore(mseafor))

7 TBATS

TBATS is a new (to me) and exciting method for dealing with time series with complex seasonalities. TBATS stands for: Trigonometric seasonality, Box-Cox transformation, ARMA errors, Trend and Seasonal components

What?

Basically, the TBATS algorithm is going to:

Perform a box-cox transformation on the time series
Break it into Seasonal, ARMA (noise/residual), and Trend (with damping) components
Estimate the Seasonal component(s) as a fourier series
Estimate the trend component
Estimate the ARMA component
Combine everything back together to represent the series

How is this different from what we just did

Unlike what we just did with dynamic harmonic regression, where the seasonality was fixed (as it is simply read in from the attributes of the msts object), seasonality here is allowed to change over time, as it is calculated dynamically from the data. This means that that pattern we saw in our EDA in the quarterly season plots can be maybe accounted for (where year one and two had a different pattern than the rest of the years).

Why havent i heard of this

Unlike a lot of time series techniques, tbats is from the 21st century, and not well proven. It also has a hard time sometimes, as it is an automated modeling tool, so some special cases will mess it up. It also is exceptionally slow, without using parallel processing it took about 20 minutes to run, on this small amount of data. Despite this, one of its great strengths is its automation, making it very easy to get a pretty good model, a lot of the time.

Lets go ahead and get into it.

7.1 S3 methods

tbats comes from the forecast package, and has similar issues with forecasting as we saw above (its forecasts are the same length as the data it was trained on, h is ignored). So lets write a newFore method for bats objects

newFore.bats <- function(obj, newdata, h = 1) {
    refit <- tbats(model = obj, y = newdata)
    forecast(refit, h)
}

Thankfully, the structure of a tbats forecast is exactly the same as the structure of our dynamic regression forecasts, so for the rest of the methods, we can just coerce our tbats forecasts into fore objects.

7.2 Model Building

First, lets note whether train and test are msts or ts objects, just for sanity

sapply(list(train, test), class)

#>      [,1]   [,2]  
#> [1,] "msts" "msts"
#> [2,] "ts"   "ts"

Ok great we are good to go.

cores <- parallel::detectCores()
# > [1] 12
bjbats <- tbats(train, use.parallel = TRUE, num.cores = cores - 1)

autoplot(bjbats)

This is interesting too, so we see the I think trend that the algorithm smoothed out of the time series. Lets go ahead and make the forecast and see how we did

7.3 Forecasting

Since we already know what will happen if we just make a forecast with the forecast function, lets instead use our newFore method without comparison

batF <- newFore(obj = bjbats, test, h = 366)
autoplot(batF)

Interesting. What is the autoplot showing us? It is a good question. It is hard to say whether this model is any good or not just from this. Lets try looking at the actual forecasted values with our autoplot method

7.4 Model Evaluation

7.4.1 Visual Inspection

forbats <- as.fore(batF)
autoplot(forbats)

The autoplot of the forecast itself was strange, but looking at the fitted values, it was a pretty dang good forecast. It is interesting how both of these multiseasonal models captured the shape of the data nicely (see around the 100 point mark), but not the scale. TBATS did an exceptional job of getting the shape of the data (even more than our dynamic regression, see around t = 200), but it did worse as far as scaling goes.

Lets check out the scores:

7.4.2 Scores

scores(forbats) %>>% data.frame

By the numbers, this is an exceptional model as well (or at least better than our naive baseline). As far as appropriateness goes, it is hard to say as the TBATS algorithm is a bit black box. However, it got a pretty good forecast and looks similar to our original data, so I will give it a pass. I believe the fact that the seasonality can change is what made this a bit better than our regression with ARMA errors. As far as prediction intervals go, this model managed to capture even further the swings of the data.

7.5 Saving the Forecast

models$tbats <- makeModel(forbats)

8 Prophet Algorithm

Facebook released their internal fast and easy forecasting algorithm, prophet. The prophet algorithm works in the following manner:

It decomposes the time series into seasonal, trend, and holiday components.
The seasonal component is estimated as in TBATS, with fourier series. By default, it looks for a daily (if we have sub daily data), weekly, and yearly trends, but these can be specified (for example we can include a 5 day work week). These fourier series are then multiplied by a normal distribution with the same variance as our data.
The trend component is modeled basically as logistic growth. However, it is not just simple logistic growth, they use piecewise logistic growth, which means that the trend component is allowed to change with time.
- If the trend does not follow a logistic growth pattern, it also checks for a linear trend with changepoitnts (aka a piecewise function).
If a list of holidays and events with their dates are provided, a matrix of dummy variable regressors for them is added to the model.
Components are combined as a general additive model (GAM), and then you are done

8.1 S3 methods

We will give the usual S3 methods. Because prophet forecasts include the entire dataset, we will have to write methods that take out the training parts. First, we define the coercion method

as.proph <- function(x) structure(x, class = "proph")

Next we define our autoplot and scoring methods:

scores.proph <- function(obj) {
    c(ase = ASE(obj$yhat[-(1:length(train))], test), mape = MAPE(obj$yhat[-(1:length(train))], 
        test), Conf.Score = confScore(upper = obj$yhat_upper[-(1:length(train))], 
        lower = obj$yhat_lower[-(1:length(train))], test))
}
autoplot.proph <- function(obj) {
    testdf <- data.frame(type = "actual", t = seq_along(test), ppm = as.numeric(test))
    preddf <- data.frame(type = "predicted", t = seq_along(test), ppm = as.numeric(obj$yhat[-(1:length(train))]))
    confdf <- data.frame(t = seq_along(test), upper = obj$yhat_upper[-(1:length(train))], 
        lower = obj$yhat_lower[-(1:length(train))])
    dfl <- list(testdf, preddf)
    .testPredPlot(dfl) + geom_line(data = confdf, aes(x = t, y = lower, alpha = 0.2), 
        linetype = 3003) + geom_line(data = confdf, aes(x = t, y = upper, alpha = 0.2), 
        linetype = 3003) + guides(alpha = FALSE)
    
}

8.2 Building a model

First, we need to load the data in the proper format

st <- as.Date("2010-1-1")
en <- as.Date("2013-12-31")
trainDates <- seq(st, en, by = "day")
# data MUST BE IN THIS EXACT FORMAT
traindf <- data.frame(ds = trainDates, y = as.numeric(train))

Next, lets make our model. This is pretty easy

library(prophet)
model <- prophet(traindf)

8.3 Forecasting

First, we must make a data frame for the future, because prophet has a silly API

future <- make_future_dataframe(model, periods = 366)

Now we can forecast, with the predict function

fore <- predict(model, future)
plot(model, fore)

We can also view how it modeled/forecasted all the components

prophet_plot_components(model, fore) + theme_hc()

#> NULL

8.4 Model Evaluation

Lets check out how we did. First, lets take a look at our forecast:

pfore <- as.proph(fore)
autoplot(pfore)

This is a pretty weird little forecast. Because of the smoothing, and the normalized nature of all the prophet algorithm (it is much closer to regression than anything else), it captures the shape of the series well, but it does not get the extremety of the series. This model may be good for detecting overall patterns in the data, but not for specific predictions. Lets check out the scores.

data.frame(t(scores(pfore)))

scores.proph

#> function(obj) {
#>     c(ase = ASE(obj$yhat[-(1:length(train))], test), mape = MAPE(obj$yhat[-(1:length(train))], 
#>         test), Conf.Score = confScore(upper = obj$yhat_upper[-(1:length(train))], 
#>         lower = obj$yhat_lower[-(1:length(train))], test))
#> }

# function(obj){ c( ase = ASE(obj$yhat[-(1:length(train))], test), mape =
# MAPE(obj$yhat[-(1:length(train))], test), Conf.Score = confScore(upper =
# obj$yhat_upper, lower = obj$yhat_lower, test) ) }
str(pfore)

#> List of 19
#>  $ ds                        : POSIXct[1:1827], format: "2010-01-01" "2010-01-02" ...
#>  $ trend                     : num [1:1827] 2525 2525 2525 2524 2524 ...
#>  $ additive_terms            : num [1:1827] -227.6 -39.4 -86.7 -113.8 47.6 ...
#>  $ additive_terms_lower      : num [1:1827] -227.6 -39.4 -86.7 -113.8 47.6 ...
#>  $ additive_terms_upper      : num [1:1827] -227.6 -39.4 -86.7 -113.8 47.6 ...
#>  $ weekly                    : num [1:1827] 26.1 147.8 29.9 -70.2 17.3 ...
#>  $ weekly_lower              : num [1:1827] 26.1 147.8 29.9 -70.2 17.3 ...
#>  $ weekly_upper              : num [1:1827] 26.1 147.8 29.9 -70.2 17.3 ...
#>  $ yearly                    : num [1:1827] -253.7 -187.2 -116.6 -43.7 30.2 ...
#>  $ yearly_lower              : num [1:1827] -253.7 -187.2 -116.6 -43.7 30.2 ...
#>  $ yearly_upper              : num [1:1827] -253.7 -187.2 -116.6 -43.7 30.2 ...
#>  $ multiplicative_terms      : num [1:1827] 0 0 0 0 0 0 0 0 0 0 ...
#>  $ multiplicative_terms_lower: num [1:1827] 0 0 0 0 0 0 0 0 0 0 ...
#>  $ multiplicative_terms_upper: num [1:1827] 0 0 0 0 0 0 0 0 0 0 ...
#>  $ yhat_lower                : num [1:1827] -109.41 1.68 257.35 91.15 88.32 ...
#>  $ yhat_upper                : num [1:1827] 4662 4662 4556 4661 4854 ...
#>  $ trend_lower               : num [1:1827] 2525 2525 2525 2524 2524 ...
#>  $ trend_upper               : num [1:1827] 2525 2525 2525 2524 2524 ...
#>  $ yhat                      : num [1:1827] 2297 2485 2438 2411 2572 ...
#>  - attr(*, "row.names")= int [1:1827] 1 2 3 4 5 6 7 8 9 10 ...
#>  - attr(*, "class")= chr "proph"

These are not brilliant, only ever so slightly better than our baseline ARMA/mean score. This says a lot abouthow hard to beat common sense is.

8.5 Saving the model

str(pfore)

#> List of 19
#>  $ ds                        : POSIXct[1:1827], format: "2010-01-01" "2010-01-02" ...
#>  $ trend                     : num [1:1827] 2525 2525 2525 2524 2524 ...
#>  $ additive_terms            : num [1:1827] -227.6 -39.4 -86.7 -113.8 47.6 ...
#>  $ additive_terms_lower      : num [1:1827] -227.6 -39.4 -86.7 -113.8 47.6 ...
#>  $ additive_terms_upper      : num [1:1827] -227.6 -39.4 -86.7 -113.8 47.6 ...
#>  $ weekly                    : num [1:1827] 26.1 147.8 29.9 -70.2 17.3 ...
#>  $ weekly_lower              : num [1:1827] 26.1 147.8 29.9 -70.2 17.3 ...
#>  $ weekly_upper              : num [1:1827] 26.1 147.8 29.9 -70.2 17.3 ...
#>  $ yearly                    : num [1:1827] -253.7 -187.2 -116.6 -43.7 30.2 ...
#>  $ yearly_lower              : num [1:1827] -253.7 -187.2 -116.6 -43.7 30.2 ...
#>  $ yearly_upper              : num [1:1827] -253.7 -187.2 -116.6 -43.7 30.2 ...
#>  $ multiplicative_terms      : num [1:1827] 0 0 0 0 0 0 0 0 0 0 ...
#>  $ multiplicative_terms_lower: num [1:1827] 0 0 0 0 0 0 0 0 0 0 ...
#>  $ multiplicative_terms_upper: num [1:1827] 0 0 0 0 0 0 0 0 0 0 ...
#>  $ yhat_lower                : num [1:1827] -109.41 1.68 257.35 91.15 88.32 ...
#>  $ yhat_upper                : num [1:1827] 4662 4662 4556 4661 4854 ...
#>  $ trend_lower               : num [1:1827] 2525 2525 2525 2524 2524 ...
#>  $ trend_upper               : num [1:1827] 2525 2525 2525 2524 2524 ...
#>  $ yhat                      : num [1:1827] 2297 2485 2438 2411 2572 ...
#>  - attr(*, "row.names")= int [1:1827] 1 2 3 4 5 6 7 8 9 10 ...
#>  - attr(*, "class")= chr "proph"

makeModel.proph <- function(obj) {
    score = scores(obj)
    return(list(preds = obj$yhat, ul = obj$yhat_upper, ll = obj$yhat_lower, 
        ase = score[1], mape = score[2], Conf.Score = score[3]))
}
models$prophet <- makeModel(pfore)

9 Discussion of Multiseasonal models as a whole

Considering the nature of our data, and our background knowledge of central heating in china, both of these multiseasonal models seem to be more or less appropriate. They both handle this seasonality in a similar manner, and both proved a very reasonable forecast.

10 Multivariate Setup

Now it is time for us to dive in to the hopefully even more powerful multivariate models. Before we do so, we are going to need to redefine our train and test sets, which means we are going to have to write some more functions. Instead of confusing ourselves, lets go ahead and clean away our train and test sets, to avoid confusion:

rm(train, test)

10.1 Functions

First, we are now going to have to convert our non time series/numerical data to daily, which means we will have to find a new method of resampling them. For that, we are going to find the most common categorical value in each day, and then take that. We will use the .mode function, defined below:

.mode <- function(x) {
    ux <- unique(x)
    ux[which.max(tabulate(match(x, ux)))]
}

.mode(c(rep("cat", 4), rep("dog", 8), rep("moose", 1)))

#> [1] "dog"

Next we are going to write a function which lumps the categorical variables into groups of each day, takes the .mode of each group, and returns a new vector in that format.

For the lumping, we will use the %/% operator, which returns an integer specifying how many times a number divides another number. For instance:

((1:50) - 1)%/%5

#>  [1] 0 0 0 0 0 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4 5 5 5 5 5 6 6 6 6 6
#> [36] 7 7 7 7 7 8 8 8 8 8 9 9 9 9 9

Then, we will use the glorious tapply function, which does the same thing as the apply family, except it applies the function onto a grouped variable (table apply).

.dailyMode <- function(xs) {
    out <- tapply(xs, (seq_along(xs) - 1)%/%24, .mode)
    out <- unname(out)
    out[1:1826]  # the number of days in 6 years, because we cleaned away extra days
}

Next we are going to write a function that applies this (and our daily time series conversion) onto a list of time series objects and categorical, and spit out a data frame:

dlist <- function(xs) {
    tsbj <- xs %>>% purrr::keep(is.ts)
    tsnobj <- xs %>>% purrr::discard(is.ts)
    out <- append(lapply(tsbj, function(x) window(cleandays(x), end = 6)), lapply(tsnobj, 
        .dailyMode))
    as.data.frame(out)
}

Finally, we are going to write a function that creates a train set and a test set of data frames of time series and categorical variables. Note the use of the <<- superassignment operator. This causes the value to be stored one level out, so if it is a single level function, this will cause the value to be stored in the global environment:

dfsplit <- function(df, n = 5) {
    index <- length(window(df[[1]], end = n))
    tsdf <- df %>% purrr::keep(is.ts)
    nodf <- df %>% purrr::discard(is.ts)
    outs <- lapply(tsdf, function(x) window(x, end = n))
    outn <- nodf[1:index, ]
    train <<- as.data.frame(append(outs, outn))
    tests <- lapply(tsdf, function(x) window(x, start = n))
    test <<- as.data.frame(append(tests, nodf[(index):nrow(nodf), ]))
}

Note we have to do things in this weird way, because otherwise it is extremely difficult to preserve the class ts, which we would like for time series data to be in. Just cbinding or anything like that breaks it immediately.

11 Multivariate EDA

First, lets load up our data:

bj <- (china$BeijingPM_)
bj2 <- dlist(bj)
dfsplit(bj2)

Before we begin our analysis, it is definitely important to do more EDA, as we have a whole lot more information. First, lets plot all the time series in the training dataframe:

library(tidyverse)

plotAllTs <- function(df) {
    df %>% keep(is.ts) %>% gather() %>% mutate(.data = ., t = rep(1:nrow(df), 
        (nrow(.)/nrow(df)))) %>% ggplot(aes(x = t, y = value)) + facet_wrap(~key, 
        scales = "free") + geom_line() + theme_hc() + theme(axis.text.y = element_blank())
}
plotAllTs(train)

Already, we see some interesting stuff. First of all we probably want to get rid of the other stations observing particulates in our analysis, as this isnt very interesting. Of all the time seres here, humidity appears to have a similar trend pattern to our target. Pressure looks similar but a bit off, so at a high lag maybe it is interesting. Dewpoint has the wrong period, and lprec and precipitation are the same but hard to glean anything out of.

11.1 Analysis of wind direction:

Lets see if the wind direction is of any use to us:

ggplot(train) + geom_point(aes(x = No, y = PM_US.Post, color = cbwd)) + geom_line(aes(x = No, 
    y = PM_US.Post, alpha = 0.1)) + geom_smooth(aes(x = No, y = PM_US.Post), 
    method = "auto") + theme_hc() + scale_color_hc()

Just upon a glimpse this is not that useful. Lets look at a table:

train %>% arrange(cbwd) %>% group_by(cbwd) %>% summarise(mean = mean(PM_US.Post))

It looks like it is an interesting factor to some degree, but at this scale with our mode technique, it is probably strongly distorted. Still it is worth a look. We will note this and examine it when we are performing our models.

11.2 Analysis of all numeric variables

plot_vs_response <- function(x) {
    plot(train$PM_US.Post ~ train[[x]], xlab = x)
    lw1 <- loess(train$PM_US.Post ~ train[[x]])
    j <- order(train[[x]])
    lines(train[[x]][j], lw1$fitted[j], col = "red", lwd = 3)
}


numNames <- train %>% purrr::keep(is.numeric) %>% names
numNames <- numNames[-c(1:4, 11:17)]

par(mfrow = c(2, 3))
lapply(numNames, plot_vs_response)

So all of these appear to be fairly interesting i think. However, the summing technique we performed does not make these plots easily interpretable. Mathematically however, they likely still work out, as it is the relationship that matters. When we do linear regression, whether or not we have the total temperature sum or the mean (sum/n) should not make a difference in the significance of the relationships, only the slopes. This means we should be ok. Precipitation does not look useful.

11.3 CCF analysis

Lets look at the cross correlation between all the variables next:

ppm <- train$PM_US.Post
ccfplot <- function(x) {
    ccf(ppm, train[[x]], main = x)
}
ccfs <- lapply(numNames, ccfplot)

It appears that low lags are where these variables have the most cross correlation, other than temperature, which seems to have a bigger CCF at a pretty far from zero lag.

Next, to further our EDA, lets perform a cluster analysis of the time series data

12 Time Series Clustering

There are two types of clustering techniques: Subsequence clustering, and multivariate clustering. Thankfully for us, it has been definitively shown that subsequence clustering (mostly used for anomaly detection, but also for determining if we should break a time series into multiple time series for forecasting) is completely useless. However, multivariate time series clustering is still a valid and exciting technique.

Multivariate time series clustering is commonly used in the world of quantitative finance, to determine which stocks react to other which other stocks. By finding out if X stock is associated with Y stock, they can do things to make money. With further research, they can determine the nature of the relationship between X stock and Y stock. If X stock grows with Y stock, they can time their investments so that when they see Y stock grow, they do something that makes them money with X stock. If the relationship is negative, they can bet on Y stock (which is growing) and short X stock.

We can attempt to use this clustering to get a better idea of how our time series are associated with each other.

12.1 More info on clustering time series

source

12.1.1 Types of time series clustering

12.1.1.1 Hierarchical Clustering

Hierarchical clustering works in one of 2 ways:

Each item is given its own group. Groups are compared by a similarity metric. Good groups are merged. This is tried over and over until the clusters do not get better.
All the items are put in a group. The group is divided until it cannot be divided more without messing up the clusters

12.1.1.2 Partitional Clustering

Partitional Clustering kind of works the opposite of hierarchical clustering. With partitional clustering, you specify the number of groups, and it then solves the problem of maximizing inter cluster distance while minimizing intra cluster distrance. An example of this is kmeans.

12.1.1.3 Fuzzy Clustering

Fuzzy clustering is similar to partitional clustering, except instead of hard clusters, they are soft or fuzzy. This means that items can be in multiple clusters, but more in one cluster than another. An example of this is c-means.

12.1.2 Time series distance measuers

12.1.2.1 Dynamic Time Warping

Dynamic time warping, or DTW, pictured above, finds the best way to warp two time series to line up with each other over time. It creates a matrix which has dimensions m*n, where m and n are the size of two different time series. It does this for every possible pair. Then, for each matrix, it finds a path of indices which has the minimum number of steps and costs the least to line the two time series up. The distance measure is then computed based on the the number of steps it took to line the two up, weighted by cost. For more info please see the aforementioned paper

12.1.2.2 Shape-Based Distance (SBD)

Shape based distance is more or less a cheaper version of DTW. What it does is it takes the cross correlation between the two time series (with normalized coefficients), makes a series out of that, fourier transforms that, and calculates the distance based off of that fourier series.

12.1.2.3 Global Alignment Kernel (GAK)

A global alignment kernel is another powerful distance measure (albeit really computational expensive). Instead of calculating a matrix of costs, and then running through that (relatively cheap), it calculates a set of all possible alignments and minimizes that, and then returns a similarity measure based on that alignment. Its a bit complex but the aforementioned paper explains it nicely.

12.1.3 Centroids

12.1.3.1 Shape extraction

Shape extraction works very similarly to SBD. It picks a random time series as a centroid, and uses SBD to match it to the other time series optimally. As mentioned in the paper, it uses an eigenvector of a matrix made wit SBD aligned series

12.1.3.2 DTW barycenter averaging (DBA)

This centroid is represented by the average of a bunch of points grouped by their DTW alignments. It select a random centroid series, and with each iteration, it calculates the DTW alignment with each series in the cluster and the centroid. This is repeated until convergence is met.

12.1.3.3 Partition around mediods (PAM)

A mediod is just a time series whose average distance to all the other objects in the cluster is at the smallest (and an element of the original data). Then, that cluster is partitioned around all the mediods iteratively. IN our case, a given number of series are chosen as initial centroids. Then all the distances are calculated, and then clusters are created based off of those distances. Then, it picks the one that is closest to all the other time series in the new clusters as a mediod, and does it again, until convergence

12.2 Moving Forward

Now that we have a basic understanding of how time series clustering works, we can try it out.

12.3 Clustering setup

library(dtwclust)

traints <- as.list(keep(train, is.ts))

Now that we have our data and libraries loaded, lets set ourselves up for parallel computation:

# alter for the appropriate number of cores for you my computer has 12, but
# i have a lot of tabs open define the cluster
workers <- makeCluster(10L)
# load the necessary libraries onto the cluster
invisible(clusterEvalQ(workers, library(dtwclust)))
# register the cluster for parallel computation
registerDoParallel(workers)

12.4 Partitional Clustering

As we are new to this, lets set ourselves up for a grid search of a bunch of clusters. We will first check out a partitional cluster. For this, we want to use the dtw distance metric, but also check out the DBA centroid. We will also play with their normalization methods, and the time window they use in alignment. we will also try out 2-6 clusters

12.4.1 Grid search setup

cfg <- compare_clusterings_configs(types = "partitional", k = 2L:6L, controls = list(partitional = partitional_control(iter.max = 40L)), 
    distances = pdc_configs("distance", partitional = list(dtw_basic = list(window.size = seq(from = 5L, 
        to = 50L, by = 5L), norm = c("L1", "L2")))), centroids = pdc_configs("centroid", 
        share.config = c("p"), dba = list(window.size = seq(from = 5L, to = 50L, 
            by = 5L), norm = c("L1", "L2"))), no.expand = c("window.size", "norm"))

For evaluation, as we do not have labeled data, we will use silhouette validation as our metric.

evaluators <- cvi_evaluators("Sil")

12.4.2 Partitional Grid Search

Lets find the best cluster now:

comparison <- compare_clusterings(traints, types = "partitional", configs = cfg, 
    seed = 666L, score.clus = evaluators$score, pick.clus = evaluators$pick)

stopCluster(cl)
registerDoSEQ()

12.4.3 Partitional grid search: model picking

comparison$results$partitional %>% arrange(desc(Sil)) %>% head

Well we have it. We can explore the results more ourself, the most important part is we found two groups

12.4.4 Best Cluster

We can conveniently rerun the best cluster with the repeat_clustering function:

clusters <- repeat_clustering(traints, comparison, comparison$pick$config_id)
plot(clusters)

clusts <- clusters@cluster
names(clusts) <- names(traints)
data.frame((clusts))

So looks like air pressure is on its own. Lets see if hierarchical clustering can help us out with a better cluster

12.5 Hierarchical clustering

Lets try out a bunch of different hierarchical clusters. We should check out different preprocessing methods, different control methods, all the possible centroids, GAK, DTW, SBD, and different window sizes for DTW.

12.5.1 Grid search setup

cfg2 <- compare_clusterings_configs(types = "h", k = 2L:6L, controls = list(hierarchical = hierarchical_control(method = "all")), 
    preprocs = pdc_configs("preproc", none = list(), zscore = list()), centroid = pdc_configs("centroid", 
        shape_extraction = list(), default = list(), dba = list()), distances = pdc_configs("distance", 
        hierarchical = list(dtw_basic = list(window.size = seq(from = 5L, to = 50L, 
            by = 5L), norm = c("L1", "L2")), gak = list(), sbd = list())), no.expand = c("window.size", 
        "norm"))

evaluators <- cvi_evaluators("Sil")

12.5.2 Hierarchical grid search

Note we will use silhouettes for consistency

comparison2 <- compare_clusterings(traints, types = "hierarchical", configs = cfg2, 
    seed = 666L, score.clus = evaluators$score, pick.clus = evaluators$pick)

Looks like GAK and DBA centroids absolutely won. Lets visualize the results

comparison2$results$hiera %>% arrange(desc(Sil))

comparison2$results$hierarchical %>% arrange(desc(Sil), distance) %>% ggplot(aes(fill = distance, 
    y = Sil, x = distance)) + geom_boxplot() + facet_wrap(centroid ~ .) + scale_fill_hc() + 
    theme_hc()

So it looks like GAK out performed DTW, which outperformed SBD. From centroid to centroid, it really didnt matter.

12.5.3 Best Cluster

cluster2 <- repeat_clustering(traints, comparison2, comparison2$pick$config_id)
plot(cluster2)

plot(cluster2, type = "sc")

At this point, it is obvious that our results are not going to change with more work. We have two clusters, pressure in one and everything else in the rest.

12.6 Discussion of results

With our clustering, we discovered a few interesting things. Firstly, it appears that by all metrics,air pressure is really different from the rest of the series. Whether that is a good thing or bad thing remains to be seen, but regardless, it is interesting. We will have to pay attention to it as a predictor.

We also need to note how strongly humidity is clustered with our PM2.5 content.This hopefully suggests it will be another strong predictor variable. However, it is important to keep in mind that all our time series are clustered pretty closely, the largest distance is only 0.015. So our clustering may not have had any important results, but at the very least we learned something.

We will now proceed with our multivariate analysis.

13 VAR

Our first multivariate technique is going to be VAR. One of the interesting properties of VAR is what it means about the data. VAR is telling us that our time series are endogenous, that is to say that they all affect each other. Now with our weather patterns, this makes sense. Weather is a complex, dynamic system of complex feedback cycles, so it is natural that all the weather variables are endogenous to each other. However; let us discuss how our air particulate content relates to the weather, to determine weather a VAR model is physically appropriate

13.1 Discussion of Appropriateness

In this section we will discuss how each of our weather patterns interplay (or dont) with our air quality.

13.1.1 Humidity

13.1.1.1 Effect of Humidity on Air Quality

Humidity has a strong effect on air quality. On a humid day, water in the air will “stick” to air particulates. This weighs them down and makes them much larger, causing them to stick around for longer. This will cause the air particulates to hang around in one place.

It is also important to note that this has an effect on temperature. When water sticks to air pollution, energy from the sun is reflected off of the water, imparting a bit of energy to the water and spreading the sunlight some more. This adds to the haze appearance when there is a lot of pollution. Over time, due to water’s high specific heat, should cause the air to heat up, as the small amount of extra energy stored in the water attached to the pollution will be transferred quickly to the air. (Note this should also have an affect on pressure slowly, because $P \propto T$).

13.1.1.2 Effect of Air Quality on Humidity

Surprisingly, air quality also has a positive feedback with humidity. This is due to the “sticking” effect we discussed earlier. Heavy air particulate matter with water attached causes other air particles to stick around too. You can actually prove this with an experiment.

Supplies

An aerosol, such as hairspray
A jar with a lid
Ice
Warm water

Take the warm water and put it in your jar. Put the lid on upside down with ice in it, for about 30 seconds (this causes the evaporated water in the air to condense a bit, raising the humidity).

Now, lift the ice lid up, and spray a tiny bit of your aerosol in the jar and quickly replave the ice lid.

Then, watch a cloud form inside the jar.

You can do this experiment with any air particulate, for example smoke from a match also works nicely.

It is safe to say that a bidirectional VAR forecast of humidity is appropriate.

13.1.2 Temperature

13.1.2.1 How Temperature Relates to Surface Air Quality

Temperature has a clear effect on air quality. Not only does cold air cause air particulates to stick around more (slower moving), but also more condensed water particles in the air, which we already know about.

Similarly, in the winter, tropospheric temperature inversions can occur when the stratosphere (think, up high) gets heated more than the troposphere (we live here). Normally, the air naturally convects because it is hotter in the troposphere than the stratosphere, but in the winter, with our long nights, the opposite can occur. This is a temerature inversion. This causes the convection to stop, and the air to remain trapped there. This traps the pollutants in the same place, and causes them to accumilate

13.1.2.2 Effect of air quality on temperature

The air quality + humidity should have a slight effect on temperature, but global weather patterns are more powerful

13.1.2.3 Discussion

It is appropriate to say that temperature affects air quality, but only slightly appropriate to say the opposite

13.1.3 Dewpoint

Dewppoint is simply a function of humidity and temperature, so this variable will not be included in our model, as it is simply overfitting.

13.1.4 Wind speed

This is a complex relationship, but basically it does have some sort of complex relationship with air pollution. When it is windy, pollutants can be moved thousands of miles, while when it is still, they can accumulate. So it depends, but it does have an effect. Heavy particulte matter may also slow the wind down, but it is unclear and unproven. Overall though, it is somewhat appropriate to do this bidirectional forecast.

13.1.5 Pressure

Given the complex relationship between pressure, temperature, humidity, and wind speed, which are all somewhat related to air quality in interesting ways, it is safe to include pressure in our VAR models as well

13.1.6 Precipitation

No matter how much i think, I cannot say that you can predict surface air quality with precipitation or vice versa, so this is likely inappropriate

13.2 S3 methods

We need three S3 methods here, an as method, an autoplot method, and a scores method:

as.var <- function(x) structure(x, class = "var")

autoplot.var <- function(obj) {
    us <- obj$fcst$PM_US.Post
    testdf <- data.frame(type = "actual", t = seq_along(test$PM_US.Post), ppm = as.numeric(test$PM_US.Post))
    preddf <- data.frame(type = "predicted", t = seq_along(test$PM_US.Post), 
        ppm = (obj$fcst$PM_US.Post[, 1]))
    dfl <- list(testdf, preddf)
    confdf <- data.frame(t = seq_along(test$PM_US.Post), upper = us[, 3], lower = us[, 
        2])
    .testPredPlot(dfl) + geom_line(data = confdf, aes(x = t, y = lower, alpha = 0.2), 
        linetype = 3003) + geom_line(data = confdf, aes(x = t, y = upper, alpha = 0.2), 
        linetype = 3003) + guides(alpha = FALSE)
}
scores.var <- function(obj) {
    mape <- MAPE(obj$fcst$PM[, 1], test$PM_US)
    ase <- ASE(obj$fcst$PM[, 1], test$PM_US)
    us <- obj$fcst$PM_US.Post
    conf <- confScore(upper = us[, 3], lower = us[, 2], test$PM_US)
    c(ASE = ase, MAPE = mape, Conf.Score = conf)
}

13.3 Order Selection:

First, lets select the possible orders for our VAR forecast

train[1:3] <- NULL
train[9:14] <- NULL
test[1:3] <- NULL
test[9:14] <- NULL
traints <- (purrr::keep(train, is.ts))
traints <- (purrr::keep(train, is.ts))
library(vars)

# simple order
ord <- VARselect(traints, lag.max = 20, type = c("both"))

# Seaosnal order
ords <- VARselect(traints, lag.max = 20, type = "both", season = c(7, 365))

# Another order
ordn <- VARselect(traints, lag.max = 20)

Lets check out what the orders are:

data.frame(rbind(ord$selection, ords$selection, ordn$selection), row.names = c("ord", 
    "ords", "ordn"))

13.4 Model Tuning

Next, lets pick the proper parameters with a grid search. First, lets build the grid:

types <- c("const", "trend", "both", "none")
ics <- c("AIC", "HQ", "SC", "FPE")
ps <- c(8, 3, 1)
szn <- c(NULL, 365)
init <- list(ps = ps, types = types, ics = ics, szn = szn)
grid <- expand.grid(init, stringsAsFactors = FALSE)

Then lets set ourselves up for parallel processing:

workers <- makeCluster(11L)
invisible(clusterEvalQ(workers, library(vars)))
registerDoParallel(workers)

Finally, lets run the grid search

gridSearch <- foreach(i = 1:nrow(grid), .combine = rbind) %dopar% {
    model <- VAR(traints, p = grid[i, 1], type = grid[i, 2], ic = grid[i, 3], 
        season = grid[i, 4])
    preds <- predict(model, n.ahead = 366)
    pred <- preds$fcst$PM_US.Post[, 1]
    ASE <- mean((test$PM_US.Post - pred)^2)
    c(ASE = ASE, p = grid[i, 1], type = grid[i, 2], ic = grid[i, 3], season = grid[i, 
        4])
}
stopCluster(workers)
registerDoSEQ()

Lets check out the results:

gridSearch <- gridSearch %>% as.data.frame
gridSearch$ASE <- as.numeric(gridSearch$ASE)
gridSearch[2:4] <- lapply(gridSearch[2:4], factor)
gridSearch %>% arrange(ASE)

Next lets visualize them

library(latticeExtra)
gridV <- (cloud(ASE ~ p + type, gridSearch, panel.3d.cloud = panel.3dbars, scales = list(arrows = F, 
    col = 1), xbase = 0.2, ybase = 0.2, pretty = T, col.facet = level.colors(gridSearch$ASE, 
    at = do.breaks(range(gridSearch$ASE), 20), col.regions = cm.colors), colorkey = list(col = cm.colors, 
    at = do.breaks(range(gridSearch$ASE), 20)), screen = list(z = 40, x = -30)))
gridV

13.5 VAR Model: parameter selection and forecasting

We will now go through and select the proper parameters for the model, by calculating the ASE of each VAR prediction, as well as looking at the summary of a linear model (to see significant regressors and help us pick), as well as thinking about our appropriateness assessment:

var1 <- VAR(traints, p = 3, type = "none", ic = "AIC", season = 365)
preds <- predict(var1, n.ahead = 366)
pred <- preds$fcst$PM_US.Post[, 1]
mean((test$PM_US.Post - pred)^2)

#> [1] 3748146

plot(preds)

thelm <- summary(lm(data = traints, PM_US.Post ~ .))

trainsimp <- traints %>% select(PM_US.Post, HUMI, DEWP, TEMP, precipitation, 
    PRES, Iws)
var2 <- VAR(trainsimp, p = 3, type = "none", ic = "AIC", season = 365)
preds <- predict(var2, n.ahead = 366)
pred <- preds$fcst$PM_US.Post[, 1]
mean((test$PM_US.Post - pred)^2)

#> [1] 3747685

trainsimp2 <- traints %>% select(PM_US.Post, HUMI, TEMP, precipitation, PRES, 
    Iws)

var3 <- VAR(trainsimp2, p = 3, type = "none", ic = "AIC", season = 365)
preds <- predict(var3, n.ahead = 366)
pred <- preds$fcst$PM_US.Post[, 1]
mean((test$PM_US.Post - pred)^2)

#> [1] 3746216

trainsimp3 <- traints %>% select(PM_US.Post, HUMI, TEMP, PRES, Iws)

var4 <- VAR(trainsimp3, p = 3, type = "none", ic = "AIC", season = 365)
preds <- predict(var4, n.ahead = 366)
pred <- preds$fcst$PM_US.Post[, 1]
mean((test$PM_US.Post - pred)^2)

#> [1] 3745787

plot(preds)

We have now reached a point where we cannot improve our model any further (trust me i tried). Lets remember these variables for the rest of our multivariate analysis.

13.6 Model assessment

vpred <- as.var(preds)
autoplot(vpred)

data.frame(t(scores(vpred)))

Objectively, this model performed worse than the ARMA/mean baseline model. However, visually, it was only off by a bit. It seems to have gotten the period a bit wrong, while it is accurate in scale. It also has a few issues with the shape, but the forecast does not look unreasonable. This may be an interesting model in our bagged results.

13.7 Saving the Model

makeModel.var <- function(obj) {
    score = scores(obj)
    us <- obj$fcst$PM_US.Post
    return(list(preds = us[, 1], ul = us[, 3], ll = us[, 2], ase = score[1], 
        mape = score[2], Conf.Score = score[3]))
}

models$var <- makeModel(vpred)

14 Neural Network models

We will try two different neural networks, a neural net ar (nnetar) model and a Long short-term memory (LSTM) model.

The nnetar model is a simple artificial neural network (ANN). For this we will use the nnetar function from the forecast library. It is still an ANN, just as the ones from nnfor, and it is significantly less customizable than the models from nnfor. However, with this much data, and the slowness of nnfor, it actually takes over 7 hours to produce a model with our data using nnfor. As I plan on eventually deploying this model, and am not going to be spending months tuning a slow nnfor model to find the right hyperparameters (with that slowness, I will simply be using near defaults), there is no benefit from using the slower, more customizable model.

A LSTM model is much more complex than the nnetar model, but will hopefully provide us with a better view of the data. We will explor both in depth in the following sections

15 ANN Model: nnetar

For the first attempt at neural network forecasting, we will be using nnetar. Nnetaris basically a less flexible implementation of the functions in the nnfor. What it lacks in flexibiility it makes up for with extreme speed. We will have our slow model in the next section. For discussion on the comparison of nnfor and nnetar, please see this link.

15.1 What does the nnetar function do?

Basically, the nnetar function takes in your data, properly scales it for the neural network. We will discuss scaling extensively in the next section, do not worry. After scaling, if needed, it calculates the AR order of the time series, as well as the lags. Then, it takes your scaled time series, the things it calculated and adjustements it made, into the nnet function from the nnet library. Then, you have a model and you make predictions. It is very user friendly, and very very fast.

15.2 S3 Methods

We will use standard, unremarkable coercion, autoplotting, and scoring methods for the neural net forecasts.

as.nfor <- function(x) structure(x, class = "nfor")

scores.nfor <- function(obj) {
    mape <- MAPE(obj$mean, test[[1]])
    ase <- ASE(obj$mean, test[[1]])
    confs <- confScore(upper = obj$upper[, 2], lower = obj$lower[, 2], test[[1]])
    c(ASE = ase, MAPE = mape, Conf.Score = confs)
}
autoplot.nfor <- function(obj) {
    testdf <- data.frame(type = "actual", t = seq_along(test[[1]]), ppm = as.numeric(test[[1]]))
    preddf <- data.frame(type = "predicted", t = seq_along(test[[1]]), ppm = as.numeric(obj$mean))
    confdf <- data.frame(t = seq_along(test[[1]]), upper = obj$upper[, 2], lower = obj$lower[, 
        2])
    dfl <- list(testdf, preddf)
    .testPredPlot(dfl) + geom_line(data = confdf, aes(x = t, y = lower, alpha = 0.2), 
        linetype = 3003) + geom_line(data = confdf, aes(x = t, y = upper, alpha = 0.2), 
        linetype = 3003) + guides(alpha = FALSE)
}

15.3 Fitting the model

This is much simpler and faster than the nnfor mlp and elm function, and without significant tinkering or expertise. Lets check it out.

train <- traints %>% dplyr::select(PM_US.Post, HUMI, TEMP, PRES, Iws)
nnet <- nnetar(train[[1]], xreg = train[-1], repeats = 1000)

Lets check out what the model did

data.frame(summary(nnet))

Interesting. We see that it chose an AR(1) our data. Lets look more closely at the model and the method to see what else is going on.

nnet$model

#> 
#> Average of 1000 networks, each of which is
#> a 32-16-1 network with 545 weights
#> options were - linear output units

nnet$method

#> [1] "NNAR(27,1,16)[365]"

Pretty simple. 32 input nodes, 16 nodes in the hidden layer, and then collapsed into the output node. We can also see the details of our model. It is a NNAR(27,1,16)[365]. That is read as a neural network model, with the data lagged 27 times, AR(1), and 16 nodes in the hidden layer.

15.4 Forecasting

The nnet models do not have the same issue as the other forecast models, so we can forecast without an s3 method

test <- test %>% dplyr::select(PM_US.Post, HUMI, TEMP, PRES, Iws)

nnetfor <- forecast(nnet, h = 366, xreg = test[-1], PI = TRUE)
summary(nnetfor)

summary(nnetfor)

#> 
#> Forecast method: NNAR(27,1,16)[365]
#> 
#> Model Information:
#> 
#> Average of 1000 networks, each of which is
#> a 32-16-1 network with 545 weights
#> options were - linear output units 
#> 
#> Error measures:
#>                    ME     RMSE      MAE       MPE     MAPE       MASE
#> Training set -2.89273 255.8209 175.6316 -7.769946 13.20826 0.09239913
#>                    ACF1
#> Training set 0.04863066
#> 
#> Forecasts:
#>          Point Forecast      Lo 80     Hi 80       Lo 95     Hi 95
#> 5.002740     1339.20550 1011.85824 1670.6711  857.642823 1851.1271
#> 5.005479     1072.26691  746.07991 1422.5409  584.665921 1599.8374
#> 5.008219     3281.69830 2927.27151 3661.5466 2708.337736 3880.4528
#> 5.010959     3064.48096 2706.95320 3471.8127 2517.434651 3687.0782
#> 5.013699     4556.19267 4189.38914 4953.8422 4007.710172 5168.3828
#> 5.016438     3650.69394 3315.96512 4007.2461 3137.514323 4199.3474
#> 5.019178     5060.29640 4675.27418 5464.4144 4483.506858 5634.0865
#> 5.021918     4647.92472 4270.49016 5017.1308 4054.253112 5210.0362
#> 5.024658     1095.72828  750.81122 1451.8302  513.228930 1657.7528
#> 5.027397      124.64159 -222.10335  469.2862 -391.302772  625.3250
#> 5.030137      947.99981  615.11889 1302.5336  407.006327 1497.3995
#> 5.032877     1459.87298 1107.92088 1812.9981  922.349359 1969.3696
#> 5.035616     1156.11511  823.42339 1506.7453  641.718666 1686.7778
#> 5.038356     2389.15683 2011.13576 2764.0454 1822.154157 2931.1461
#> 5.041096     2175.54803 1806.60137 2594.0679 1615.215991 2819.8103
#> 5.043836     2671.33788 2295.61226 3030.5593 2130.184948 3225.6743
#> 5.046575     4261.25416 3902.64144 4625.4366 3683.159259 4872.1414
#> 5.049315     2181.27793 1809.85389 2545.2824 1604.471911 2716.9366
#> 5.052055     1016.44815  660.43977 1381.5785  454.538837 1547.9646
#> 5.054795     1663.48973 1302.77972 2044.3881 1109.557537 2278.9433
#> 5.057534      208.40978 -142.04538  545.1323 -315.604080  715.0302
#> 5.060274      848.87614  500.31863 1225.4736  333.342421 1420.5820
#> 5.063014     1927.36826 1563.02717 2267.4899 1404.887231 2491.4851
#> 5.065753     3503.33876 3129.28113 3921.3180 2921.573128 4137.8058
#> 5.068493     2813.73489 2407.02876 3237.3838 2204.989494 3422.5302
#> 5.071233     1322.99016  971.74984 1726.1907  769.804229 1923.3792
#> 5.073973     1145.10358  824.07322 1540.6060  601.212583 1750.9945
#> 5.076712     1663.36460 1347.98974 2067.6659 1172.791778 2231.0488
#> 5.079452     1077.29895  756.78786 1477.9547  566.761966 1664.8647
#> 5.082192     4141.80219 3738.53213 4521.4711 3540.273216 4761.4195
#> 5.084932     3632.14728 3190.84333 3977.7264 2935.709522 4181.4765
#> 5.087671     3964.10321 3492.11418 4344.6998 3263.879611 4555.4213
#> 5.090411     4669.97800 4257.46413 5036.1584 4022.918534 5233.0617
#> 5.093151     1936.76346 1583.72951 2298.3926 1398.180466 2469.3966
#> 5.095890      410.95151   88.09599  771.7577 -111.312182  955.9518
#> 5.098630      -62.73875 -382.47534  300.6455 -547.047881  454.4265
#> 5.101370      810.78421  458.00350 1215.3599  305.885308 1383.6918
#> 5.104110     3470.48150 3087.87975 3920.4471 2875.634587 4155.4488
#> 5.106849     2442.99958 2076.17785 2789.8361 1929.506348 2971.2965
#> 5.109589     2082.90088 1705.24424 2483.9142 1501.050425 2683.3213
#> 5.112329      441.12394  131.10675  813.5223  -68.826925 1002.2304
#> 5.115068      306.31884  -34.49080  714.0193 -231.241729  882.8351
#> 5.117808     1960.02718 1647.71565 2380.8029 1415.940250 2584.5935
#> 5.120548     1459.95121 1128.99371 1857.2987  908.518207 2043.9908
#> 5.123288     1025.59762  704.60523 1392.1999  540.178374 1545.0355
#> 5.126027     4054.44992 3603.51958 4486.5424 3386.661920 4732.5465
#> 5.128767     5417.71484 4949.64338 5851.5731 4689.720706 6081.0369
#> 5.131507     4235.95144 3832.32488 4633.0501 3604.510969 4851.2300
#> 5.134247     2109.10927 1726.73953 2499.3963 1555.246455 2671.7370
#> 5.136986     1020.04493  653.92133 1355.6565  463.475785 1558.8091
#> 5.139726     1567.26470 1204.87683 1966.4314  990.121550 2186.7936
#> 5.142466     2518.59917 2179.16718 2903.7867 1982.700803 3133.0654
#> 5.145205     4506.68941 4109.31942 4947.7197 3901.069447 5185.4595
#> 5.147945     5030.82192 4560.24225 5490.0893 4331.399721 5711.9559
#> 5.150685     4432.95967 4052.90363 4815.1828 3862.227863 5001.6398
#> 5.153425     5134.79042 4741.66382 5519.2959 4530.918703 5740.8619
#> 5.156164     5384.02305 4985.75904 5755.3968 4770.046906 6017.1330
#> 5.158904     5603.69168 5159.95782 5975.5308 4927.415169 6203.5252
#> 5.161644     1294.01429  950.86114 1633.6417  765.351751 1822.9615
#> 5.164384     2280.72214 1905.94676 2625.7322 1722.153138 2840.1198
#> 5.167123     1601.45004 1235.52762 2000.3932 1026.221812 2170.2820
#> 5.169863     2308.98310 1950.09748 2730.1708 1732.364976 2965.1794
#> 5.172603     3726.75793 3345.07384 4202.7025 3175.893430 4417.4248
#> 5.175342      893.70690  582.65168 1259.5642  416.997036 1427.9778
#> 5.178082      233.70448  -77.57567  644.9516 -259.268045  781.9006
#> 5.180822      311.94033  -24.17591  670.2967 -188.719375  852.7438
#> 5.183562      812.27307  469.07221 1171.8810  255.413694 1338.3329
#> 5.186301     2334.04013 1980.95639 2730.1262 1769.034796 2963.3056
#> 5.189041     2151.60602 1771.76291 2559.5405 1577.742455 2764.8018
#> 5.191781     2308.40678 1931.10943 2702.0562 1743.756538 2931.1690
#> 5.194521     2928.90131 2563.41828 3355.0956 2375.885053 3540.9891
#> 5.197260     1249.52145  904.64868 1644.4759  693.399949 1823.2122
#> 5.200000      156.00057 -149.71626  557.9509 -308.989650  771.2694
#> 5.202740      -81.52132 -425.26754  276.3496 -584.993072  485.6870
#> 5.205479      874.02715  478.12864 1245.6138  304.661172 1422.5047
#> 5.208219      884.32307  532.27847 1264.0464  354.544948 1482.9091
#> 5.210959      747.76019  425.63863 1137.5560  210.443257 1306.2489
#> 5.213699     -242.98104 -565.16321  144.1200 -764.884046  339.3119
#> 5.216438      674.25798  323.41272 1087.3260  136.063630 1260.0074
#> 5.219178      222.81910  -93.27160  586.2539 -308.055533  782.2805
#> 5.221918      418.12692   62.78139  806.9649 -108.536338  994.0245
#> 5.224658      545.02480  176.98451  953.4324    6.694831 1154.2573
#> 5.227397      623.61224  260.51548 1038.9239   56.978698 1234.9255
#> 5.230137     1822.18953 1457.12427 2266.9873 1246.892348 2475.8278
#> 5.232877     2438.19606 2073.60318 2832.5300 1882.557348 3015.0050
#> 5.235616     2787.94290 2410.14350 3202.3436 2199.453351 3419.6071
#> 5.238356     2315.26427 1956.58656 2844.5040 1724.921068 3081.0693
#> 5.241096     2115.61873 1743.94240 2601.0031 1510.689205 2804.2377
#> 5.243836     1826.45685 1442.17046 2278.2382 1219.988528 2470.0078
#> 5.246575     1144.14454  743.79127 1555.6671  557.189995 1740.0962
#> 5.249315     1036.05501  633.11849 1389.9824  442.242324 1572.1173
#> 5.252055     1304.03347  938.87441 1725.7057  760.699543 1919.8415
#> 5.254795     1075.40623  727.24414 1502.5849  537.226158 1702.0793
#> 5.257534      238.41204 -101.64055  667.5157 -306.593445  845.0439
#> 5.260274      377.15783   34.44831  773.4903 -120.605441  978.6424
#> 5.263014      452.94937  100.55160  818.4850  -80.511047 1014.2798
#> 5.265753      834.19213  433.10893 1180.4150  226.072713 1353.5598
#> 5.268493     1360.80626  990.62160 1750.3929  755.726722 1969.4416
#> 5.271233     1872.91892 1486.51887 2316.0928 1260.549579 2533.7550
#> 5.273973     1802.24412 1440.76821 2196.8817 1231.938581 2393.9921
#> 5.276712     1572.49300 1135.02572 1984.0423  900.215339 2165.0096
#> 5.279452     1542.95774 1133.98921 1976.9861  934.700267 2178.3280
#> 5.282192     1761.25633 1306.42296 2210.9993 1095.806728 2461.5882
#> 5.284932     1648.92966 1216.34580 2080.9157 1041.212487 2292.9794
#> 5.287671     1402.71516 1026.78813 1875.2363  808.152218 2126.7971
#> 5.290411      779.08320  445.68652 1167.3658  241.399405 1347.7438
#> 5.293151      509.05037  197.03416  952.3378  -41.502059 1164.4427
#> 5.295890     1603.25031 1286.11020 2034.2894 1103.226402 2206.2872
#> 5.298630     1702.71948 1349.27028 2110.3871 1175.100927 2282.6283
#> 5.301370     1387.07491 1045.02837 1777.2429  857.484454 1976.1361
#> 5.304110      843.45755  515.57590 1313.3492  309.969782 1506.8547
#> 5.306849      307.46289  -19.67237  767.7331 -207.883690 1013.0170
#> 5.309589      683.78938  310.90254 1111.8767  102.778541 1357.7287
#> 5.312329     1796.48933 1453.24113 2158.0380 1208.259998 2382.9012
#> 5.315068     2079.56334 1643.76220 2467.8589 1448.210780 2686.3045
#> 5.317808     1813.70296 1369.26626 2251.0084 1105.222914 2468.3653
#> 5.320548     1769.17186 1349.75845 2229.6508 1086.037098 2490.8161
#> 5.323288     1513.08782 1067.94688 2000.1957  817.223517 2220.3719
#> 5.326027     1328.22923  889.46912 1763.8929  658.776752 2046.9940
#> 5.328767     1206.29192  801.25907 1636.1395  643.476893 1854.1926
#> 5.331507     1402.46057 1051.74226 1809.4455  875.300586 2030.1605
#> 5.334247     1744.09973 1385.37643 2137.9889 1178.291187 2341.4351
#> 5.336986      653.03222  346.09740 1187.8512  134.719068 1401.1668
#> 5.339726      788.75123  474.27173 1227.2936  276.635647 1425.7192
#> 5.342466      492.95826   73.16160  880.9058  -93.312985 1090.4150
#> 5.345205      760.02883  375.19726 1147.7440  146.051482 1370.2150
#> 5.347945     1372.75151 1000.74589 1830.3763  754.793369 2064.9193
#> 5.350685     1271.44536  920.67807 1711.9304  729.790192 1921.9990
#> 5.353425     1548.72324 1166.51741 1979.9804  938.602206 2209.9840
#> 5.356164     1803.86492 1386.75454 2162.2317 1189.167325 2362.7209
#> 5.358904     1802.18453 1338.35269 2230.9355 1117.104784 2497.3396
#> 5.361644     2761.39169 2396.87758 3175.3454 2179.209769 3395.9783
#> 5.364384     2227.14956 1850.02434 2643.1442 1637.557633 2851.5345
#> 5.367123     1460.53942 1103.82187 1854.5220  879.380805 2100.9307
#> 5.369863      903.99471  595.85735 1291.7683  366.638891 1499.9222
#> 5.372603     1196.29003  789.94918 1602.0256  548.505098 1847.2532
#> 5.375342     1060.65534  660.93528 1474.4249  434.132997 1730.2175
#> 5.378082      671.70141  341.81595 1170.2688  130.821692 1373.3934
#> 5.380822     1060.62953  719.58696 1507.5397  512.566405 1723.2543
#> 5.383562     1454.10650 1111.21788 1918.8157  909.759212 2085.8801
#> 5.386301     1025.77752  670.01293 1523.1333  466.087412 1767.8786
#> 5.389041     1047.92554  679.53048 1492.2826  455.104998 1732.5528
#> 5.391781     1654.68020 1251.23450 2010.1395 1031.444833 2167.2093
#> 5.394521     1880.12008 1485.73340 2272.0338 1266.260370 2475.5236
#> 5.397260     2318.64465 1925.00434 2748.3599 1726.101174 2955.9273
#> 5.400000     1747.05275 1299.11908 2175.7503 1089.347099 2381.2309
#> 5.402740     1078.89032  667.15149 1489.9299  439.810611 1710.7990
#> 5.405479     1211.03546  814.98712 1569.5276  629.339667 1792.8098
#> 5.408219     1130.81692  765.16752 1567.6353  559.278491 1772.0795
#> 5.410959      890.41503  598.79938 1315.9357  391.272555 1529.6412
#> 5.413699      830.69918  514.18967 1282.6550  300.178917 1487.4614
#> 5.416438      690.99443  352.47951 1166.5024  140.268909 1349.4337
#> 5.419178     1312.72277  957.66073 1769.1496  769.311121 1990.7598
#> 5.421918     1301.95145  935.84873 1854.7755  707.795035 2105.1606
#> 5.424658     1332.25552  983.76640 1818.1396  737.217391 2075.4721
#> 5.427397      955.30762  636.20558 1449.9107  444.859826 1665.2925
#> 5.430137      617.59735  300.46115 1095.4582   99.536135 1314.4195
#> 5.432877     1492.10778 1148.27769 1935.7973  920.251676 2152.9523
#> 5.435616     1886.77813 1451.33818 2302.2786 1243.288864 2511.4108
#> 5.438356     2303.09418 1852.48462 2690.6670 1652.795131 2942.2231
#> 5.441096     2131.48734 1686.20589 2605.8134 1459.828193 2829.8044
#> 5.443836     2196.81118 1738.25990 2666.5970 1506.716559 2894.0051
#> 5.446575     1709.35872 1238.11319 2126.9449  982.166577 2444.0887
#> 5.449315      877.84324  463.09562 1359.1883  264.105518 1639.7292
#> 5.452055     1120.50635  740.24233 1575.4116  511.719051 1803.7518
#> 5.454795     1491.80607 1080.22873 1999.7009  871.350914 2206.9099
#> 5.457534     1637.09232 1271.57553 2160.2527 1035.825921 2364.8334
#> 5.460274     1794.58006 1451.18089 2253.2945 1273.278701 2453.3620
#> 5.463014     2223.33723 1895.23428 2666.0771 1674.720098 2850.8705
#> 5.465753     1920.85410 1552.90750 2413.8436 1323.903880 2634.7670
#> 5.468493     2236.57927 1845.40481 2661.2462 1672.851244 2891.2017
#> 5.471233     1908.23902 1536.77939 2445.4768 1292.970919 2703.7515
#> 5.473973     1942.17103 1538.28947 2480.4624 1275.503594 2709.2405
#> 5.476712     2215.01492 1793.49445 2742.8055 1520.350486 3032.0207
#> 5.479452     1835.70651 1356.91466 2285.2955 1096.746044 2537.6526
#> 5.482192     1642.26542 1173.01984 2079.0586  986.675844 2334.8779
#> 5.484932     1901.16609 1514.66311 2334.4344 1309.423144 2537.7177
#> 5.487671     1779.64345 1423.67525 2220.9710 1250.795938 2412.6849
#> 5.490411     1318.12337 1016.19596 1830.8775  780.025379 2054.5040
#> 5.493151      789.70860  461.64355 1316.4737  270.937998 1565.0409
#> 5.495890      946.12725  551.68966 1445.5074  350.746617 1707.0813
#> 5.498630     1108.73453  707.66416 1599.9051  512.138771 1827.8412
#> 5.501370     1786.84528 1413.96380 2191.5529 1177.742498 2388.9179
#> 5.504110     2543.18179 2180.50626 3023.1945 1960.839112 3255.4630
#> 5.506849     2330.80163 1989.79670 2840.2425 1723.905599 3040.8319
#> 5.509589     1935.66277 1612.22273 2403.1874 1392.092202 2665.1855
#> 5.512329     1708.32390 1382.11453 2237.4151 1127.742252 2498.7669
#> 5.515068     2246.39763 1865.99623 2606.4996 1631.265104 2800.6658
#> 5.517808     2515.52833 2073.78922 2934.3002 1871.371645 3180.4965
#> 5.520548     1651.26785 1231.78487 2176.2572  976.507377 2395.5220
#> 5.523288      659.70841  339.28859 1177.4919   89.093228 1436.0455
#> 5.526027      544.64223  211.46856  998.4419   -9.812119 1203.9084
#> 5.528767      573.65984  232.77433  980.2888   54.522247 1218.0476
#> 5.531507      525.83151  176.13466  962.8125  -60.545558 1169.4122
#> 5.534247      484.83512  172.72676  958.9233  -39.445627 1142.7861
#> 5.536986      684.98632  345.12913 1169.1507   97.051853 1354.5759
#> 5.539726     1382.21730 1016.58857 1779.0753  827.826280 1965.5735
#> 5.542466     1909.30597 1499.14553 2261.5284 1281.655870 2481.2735
#> 5.545205     2124.92557 1703.84557 2578.2176 1489.812891 2825.9102
#> 5.547945     2074.84900 1630.56386 2634.6132 1398.117036 2866.7775
#> 5.550685     1981.99069 1566.60608 2469.0605 1326.922499 2712.3140
#> 5.553425     1898.14558 1506.94148 2310.5883 1331.683264 2480.5272
#> 5.556164     1261.71642  857.90746 1705.7475  630.044980 1930.3174
#> 5.558904      837.78831  452.18674 1289.3410  225.536915 1510.3808
#> 5.561644      910.87903  521.42745 1343.6774  311.159973 1527.1595
#> 5.564384     1584.60805 1185.99171 2046.5042  995.640843 2263.9592
#> 5.567123     1438.75092 1071.97040 1904.9812  842.762719 2105.4542
#> 5.569863      871.47178  520.19382 1336.8806  339.174551 1554.7133
#> 5.572603     1272.84954  883.65967 1661.8736  710.379129 1864.9401
#> 5.575342     2201.75012 1776.34605 2533.0472 1590.306218 2688.5483
#> 5.578082     2632.41010 2234.92851 2977.2152 2025.090135 3183.7188
#> 5.580822     2324.58769 1926.97387 2778.7765 1762.580139 2991.8050
#> 5.583562     2148.30553 1803.78362 2576.4194 1561.856798 2826.9569
#> 5.586301     1913.19509 1565.03540 2370.1910 1338.382615 2612.5358
#> 5.589041     2079.66802 1663.73695 2488.3794 1444.433928 2695.4613
#> 5.591781     2142.30430 1741.87700 2559.0278 1511.688768 2778.8050
#> 5.594521     2241.81073 1901.56226 2675.5393 1621.212881 2867.1111
#> 5.597260     2020.91205 1629.01587 2467.2351 1395.207720 2688.2622
#> 5.600000     1383.93481 1010.22867 1828.9718  758.997932 2087.0936
#> 5.602740     1165.71235  741.99423 1575.7721  554.347045 1775.9169
#> 5.605479     1440.10234  998.46996 1833.0942  817.537664 2035.2885
#> 5.608219     1788.28187 1392.56283 2208.7829 1142.521813 2385.1794
#> 5.610959     1465.97458 1122.60921 2002.0824  898.421063 2247.0742
#> 5.613699      744.76459  485.19027 1318.7136  263.972583 1527.7493
#> 5.616438      319.10128   46.31797  853.6378 -209.955367 1109.9520
#> 5.619178     1704.13502 1366.81144 2189.8909 1200.454649 2422.1267
#> 5.621918     1797.48948 1402.43104 2349.2482 1134.985109 2574.0252
#> 5.624658     2061.28769 1642.22444 2644.3286 1381.067542 2863.1610
#> 5.627397     1957.53285 1558.32691 2517.8074 1343.829075 2725.2653
#> 5.630137     1550.93487 1202.00490 2071.7085  969.406192 2289.2089
#> 5.632877     1283.57591  905.30671 1758.8678  676.703903 1972.1517
#> 5.635616     1172.04305  722.02688 1601.9005  458.056092 1816.6824
#> 5.638356     1722.24048 1314.08621 2100.6821 1105.936433 2300.0844
#> 5.641096     1615.84211 1233.25042 2106.3719 1004.285541 2365.3371
#> 5.643836     1420.50873 1099.44238 1952.6583  902.437622 2199.5089
#> 5.646575     1200.35672  897.10685 1671.2415  714.922562 1877.3350
#> 5.649315     1027.66789  638.55041 1491.8772  411.845849 1725.7276
#> 5.652055      684.07689  328.40074 1163.8807  151.950640 1368.7206
#> 5.654795      228.91172  -59.48776  708.7280 -281.307622  959.4922
#> 5.657534      120.96256 -172.61273  635.1022 -387.704560  871.9500
#> 5.660274      881.32254  577.61223 1419.0887  322.442929 1610.2289
#> 5.663014     1827.06845 1462.87816 2319.3366 1217.998766 2548.8861
#> 5.665753     2618.83323 2211.31835 3092.8912 1974.158358 3350.7403
#> 5.668493     3255.94158 2835.58601 3723.3735 2625.202774 3963.1305
#> 5.671233     3255.39553 2859.24381 3766.7656 2644.380045 3987.2078
#> 5.673973     3398.45663 3060.25890 3948.2402 2788.505088 4184.1882
#> 5.676712     2290.25524 1844.13391 2805.9096 1580.543282 3034.4962
#> 5.679452     2116.06775 1577.10069 2494.0814 1351.255116 2752.7220
#> 5.682192     2185.46322 1654.44168 2590.0160 1430.957895 2862.8098
#> 5.684932     2131.95104 1719.92604 2598.0147 1502.116287 2865.9887
#> 5.687671     1755.91858 1412.05994 2237.1886 1214.274123 2474.3252
#> 5.690411      817.21176  489.15283 1299.7028  312.285248 1491.7253
#> 5.693151     1146.36756  801.97750 1604.4369  583.133536 1810.5372
#> 5.695890     1391.79117 1010.52622 1862.8211  788.344973 2092.4861
#> 5.698630     1719.02453 1315.78195 2237.2121 1120.511190 2507.2542
#> 5.701370     2184.30905 1750.83908 2758.3883 1526.805576 3041.7566
#> 5.704110     2620.22742 2123.33631 3188.1230 1880.350955 3441.3112
#> 5.706849     2819.26071 2254.20344 3344.5404 2003.006507 3625.1703
#> 5.709589     2240.06732 1613.05423 2798.6360 1300.273113 3086.7309
#> 5.712329     2514.91060 1822.76748 3173.4193 1493.462239 3544.4920
#> 5.715068     2756.22257 1986.53217 3365.6362 1585.863115 3748.4887
#> 5.717808     2290.81100 1582.88917 2827.0690 1292.285863 3096.8189
#> 5.720548     2171.63054 1617.45589 2740.3086 1320.758144 3031.0100
#> 5.723288     2215.56483 1842.83086 2781.4023 1574.070857 3077.2453
#> 5.726027     1442.76638 1162.35465 2058.8797  943.852348 2279.2373
#> 5.728767     1982.61713 1693.19327 2615.9650 1497.372149 2838.9854
#> 5.731507     3511.56448 3143.99145 4257.3153 2854.308992 4568.7244
#> 5.734247     4066.98052 3508.28277 4819.3412 3190.256929 5144.2374
#> 5.736986     4673.03279 4124.56276 5382.4701 3727.593048 5685.2741
#> 5.739726     3379.82910 2977.61429 4104.4235 2661.726486 4351.9687
#> 5.742466     2672.52041 2222.80543 3452.5512 1883.398423 3692.1925
#> 5.745205     2739.35627 2214.58062 3520.0294 1819.584406 3799.3732
#> 5.747945     2376.61111 1655.39566 3016.4422 1286.329928 3293.3026
#> 5.750685     1256.71887  704.76201 1688.8935  399.455814 1985.6926
#> 5.753425     3363.27353 2847.86713 3799.0320 2540.998813 4101.7579
#> 5.756164     4059.57934 3595.12204 4552.3083 3330.075501 4806.0914
#> 5.758904     4823.71511 4286.04649 5429.4877 3924.405008 5705.8248
#> 5.761644     4304.93857 3752.95049 4977.6599 3458.583094 5298.4280
#> 5.764384     3218.91340 2662.75149 3669.6876 2343.790640 3928.0335
#> 5.767123     2500.20591 1917.92658 2977.5322 1630.980676 3238.0600
#> 5.769863     3523.20842 2913.74723 4208.5129 2585.815565 4570.5882
#> 5.772603     5320.76335 4584.61540 6130.5732 4284.957062 6554.4786
#> 5.775342     6825.31500 5956.84944 7477.9351 5597.351090 7949.0125
#> 5.778082     6435.50869 5532.67050 6913.7408 5181.592284 7276.4162
#> 5.780822     4659.48838 3896.74789 5003.7900 3569.122561 5244.5479
#> 5.783562      915.80130  508.08246 1301.8880  316.429232 1518.0622
#> 5.786301      141.31020 -133.19656  560.4422 -346.961618  774.7574
#> 5.789041      877.07911  557.47624 1340.5853  367.722393 1582.7429
#> 5.791781      825.66272  529.42890 1215.1530  335.127430 1381.9756
#> 5.794521      961.04393  535.89791 1356.0338  340.066991 1557.8270
#> 5.797260     2261.49970 1880.07823 2655.3619 1675.354601 2905.5619
#> 5.800000     3185.44021 2859.70133 3656.2505 2662.951529 3893.8174
#> 5.802740     4265.03456 3877.56355 4815.9046 3645.421907 5038.6823
#> 5.805479     5336.03929 4889.40013 5796.8106 4618.355122 6054.4976
#> 5.808219     2011.36940 1618.45923 2385.3808 1407.543628 2564.9865
#> 5.810959     2493.83695 2104.46732 2869.3530 1884.055339 3063.0961
#> 5.813699     3673.79389 3163.11004 4087.7375 2852.625626 4311.6182
#> 5.816438     5816.91063 5188.41003 6226.9618 4839.944949 6490.5151
#> 5.819178     7090.36039 6485.28393 7433.2228 6262.536479 7682.0430
#> 5.821918     1893.43945 1489.29400 2202.1998 1298.111759 2363.5995
#> 5.824658      392.13892   68.04909  754.6563 -140.234233  947.3764
#> 5.827397     2437.92979 2094.27619 2844.3757 1886.875088 3020.1831
#> 5.830137     3274.39154 2906.87570 3787.9481 2626.593297 3997.2914
#> 5.832877     3934.02275 3544.83538 4511.9851 3285.818067 4736.8229
#> 5.835616     3941.62867 3571.49903 4431.3631 3370.263222 4658.1462
#> 5.838356     2028.84787 1660.20864 2405.3852 1501.541270 2619.6414
#> 5.841096      408.30850   53.17513  747.2980 -145.809693  973.4113
#> 5.843836     2356.81531 1940.26514 2739.3937 1711.298481 2913.3870
#> 5.846575     5076.19482 4631.75839 5437.5053 4452.481608 5653.9376
#> 5.849315     2777.31111 2405.98286 3149.8597 2184.604145 3314.5557
#> 5.852055      972.02539  619.12322 1293.7625  430.209063 1492.1821
#> 5.854795     2070.05459 1709.09945 2444.5984 1529.549762 2661.9611
#> 5.857534     3655.58141 3223.26976 4045.9010 3012.349439 4290.8253
#> 5.860274     4216.49804 3767.48410 4627.6066 3538.269753 4859.5948
#> 5.863014     4807.32102 4272.88689 5201.4645 4075.083366 5467.8800
#> 5.865753     1700.79523 1275.64230 2020.6780 1123.611075 2186.7178
#> 5.868493      162.70030 -180.50249  520.5441 -386.776695  698.5046
#> 5.871233      336.58997   14.84860  717.8487 -210.817860  953.1292
#> 5.873973     1594.24235 1242.14105 2009.9658 1030.869178 2229.1342
#> 5.876712     4247.60117 3851.95382 4643.5350 3598.423733 4863.8023
#> 5.879452     1726.79571 1383.60998 2114.9341 1202.395059 2262.1936
#> 5.882192      910.80668  584.68506 1310.9916  396.143362 1481.5880
#> 5.884932     2554.78735 2128.10088 2917.2171 1930.612562 3109.7647
#> 5.887671     5047.36390 4607.40155 5353.6829 4424.182956 5554.7841
#> 5.890411     6754.00034 6346.30356 7112.4349 6126.059385 7273.6348
#> 5.893151     4782.96873 4370.39299 5173.4132 4166.275838 5414.2935
#> 5.895890     1580.21565 1238.00559 1988.9165 1024.601170 2154.6764
#> 5.898630     3546.14537 3224.95793 3965.3997 3009.258111 4168.8612
#> 5.901370     3010.00082 2640.18356 3444.1658 2452.803090 3656.8493
#> 5.904110     5256.97800 4808.45005 5658.0714 4593.772589 5892.2359
#> 5.906849     7381.74414 6930.65326 7722.1541 6687.689543 7952.6811
#> 5.909589     2911.28119 2536.75872 3284.9924 2375.657167 3493.7180
#> 5.912329     1864.13901 1507.87535 2228.7325 1326.554314 2467.0749
#> 5.915068     4343.97411 3966.38420 4709.9717 3760.965273 4917.1118
#> 5.917808     1623.22527 1301.41708 1987.7872 1109.570612 2173.7314
#> 5.920548      455.02797  122.96454  798.4788  -54.258839  967.7240
#> 5.923288     1001.82495  686.28231 1382.6481  534.135802 1606.2918
#> 5.926027      371.94579   35.39497  720.5184 -130.755943  904.9348
#> 5.928767      516.18215  172.78335  866.3072   25.170164 1075.4245
#> 5.931507      763.15214  449.62897 1125.4330  255.341241 1280.7408
#> 5.934247     2223.18551 1867.05014 2625.4145 1640.866993 2779.3496
#> 5.936986     1872.09593 1487.13888 2197.0968 1335.503951 2353.8174
#> 5.939726     2029.58132 1674.77814 2402.2378 1507.772584 2585.6720
#> 5.942466     3912.56448 3544.60796 4277.8349 3334.086104 4480.8363
#> 5.945205     2781.81695 2418.24720 3160.0340 2204.561793 3334.1135
#> 5.947945     1827.93104 1502.34505 2197.0985 1293.881219 2434.1875
#> 5.950685      493.32217  161.01243  842.2407   17.902779 1059.1554
#> 5.953425     2021.24789 1622.25403 2392.3539 1433.197368 2604.3571
#> 5.956164     3186.81076 2821.30021 3567.6945 2625.145295 3762.0705
#> 5.958904     1209.45987  845.03222 1535.1383  657.612097 1730.0034
#> 5.961644      751.71858  391.18502 1122.1082  207.258106 1310.3864
#> 5.964384     2155.30229 1743.26351 2569.5664 1532.379176 2797.6358
#> 5.967123     3358.22954 2983.90224 3773.5847 2750.678526 4015.5985
#> 5.969863     1487.08416 1159.54504 1852.4702  988.750269 2052.4877
#> 5.972603      167.62963 -143.51261  498.7079 -338.024009  693.3939
#> 5.975342      441.36026   92.19126  781.6495  -99.142992  978.1282
#> 5.978082     2241.09549 1826.63894 2633.7527 1648.049803 2863.6830
#> 5.980822     2297.95127 1928.62308 2678.4636 1774.801605 2870.1625
#> 5.983562      537.85449  199.62422  923.7762    5.031299 1128.0251
#> 5.986301      883.52077  525.16518 1297.6274  288.187258 1477.4858
#> 5.989041     2607.02528 2228.93105 3055.6847 2018.050881 3272.3306
#> 5.991781     3773.48803 3358.48869 4205.4128 3115.767762 4395.2801
#> 5.994521     3596.19724 3161.48550 3994.0003 2943.010380 4200.5396
#> 5.997260     2917.91696 2529.43442 3348.1583 2307.945234 3569.6427
#> 6.000000     1977.96914 1644.62011 2426.2562 1435.240011 2630.1158
#> 6.002740      347.22479   11.90982  727.6610 -164.578460  915.4272

autoplot(nnetfor) + theme_hc()

15.5 Model evaluation

Lets take a closer look at the forecast:

nfor <- as.nfor(nnetfor)
autoplot(nfor)

Visually, this is not a bad forecast. We managed to capture a lot of the behavior of the time series, and even neared it on a few of the peaks. Lets check out the scores to see. I imagine the MAPE will be very good in this case.

data.frame(t(scores(nfor)))

Looks like this one did about the same, maybe slightly better, than the common sense ARMA benchmark, in terms of ASE. However in terms of MAPE, this is far and away our best model. This means that in general, it was pretty close to the truth. However, when it missed, it missed by a lot. This is evidenced in our plot as well. We will now move to the more complex, and interesting, LSTM model

15.6 Saving the model

15.6.1 How is Prediction Interval Calculated for an ANN/MLP

Please refer to this chapter of Forecasting: Principles and Practice.

makeModel.nfor <- function(obj) {
    score <- scores(obj)
    return(list(preds = obj$mean, ul = obj$upper[, 2], ll = obj$upper[, 2], 
        ase = score[1], mape = score[2], Conf.Score = score[3]))
}
models$nnetar = makeModel(nfor)

16 Long Short-Term Memory (LSTM)

LSTM models are a type of recurrent neural network. Before we get into the actual modeling, we need to stop and think about what this means.

The model in this section is the result of hours and hours of hard work (over 700 models were trained by hand), and can be used, in its trained form, in analysis/daily/winner.h5.

To load and test on your own dataset, you simply need to get the data in the right format (I will work on automating this later, for a fully reproducible, deployable model), and run lstm_model <- load_model_hdf5("/path/to/winner.h5"), and the fully trained model will pop up in your global environment.will pop up in your global environment.

16.1 What is a recurrent neural network(RNN)?

To explain this, we need to discuss what a normal neural network, a feedforward neural network is. In a feedforward neural network, inputs go into the input layer, is mapped through an activation function, goes to some number of hidden layers with their own activation functions. The hidden layer finds a numerical way to represent the data that was fed into it, and then this representation is passed into the output layer. Information is fed straight in, and straight out. This means that a simple NN has no way of discerning anything about the order of things, or time. They know nothing about previous iterations.

In contrast, in an RNN, the inputs are not just the data, but the previous outputs of the RNN. This means it has a sort of memory about what it did in the past, and can learn from it. This also means they have a notion of order and time, and will do well for dealing with time series. The information of previous iterations is stored in the RNNs hidden layers, so with each iteration they are aware of the past. They are especially good at finding correlations between events which occured a long way apart.

An RNN is a neural network that shares weights it has created over time

16.2 What is an LSTM?

An LSTM is a special class of RNN, where extra (outside of the normal information flow of a RNN) information is stored in a gated cell. The cell is updated and changed by the RNN. What happens is the signal from inside the RNN is passed through a sigmoid funciton, and then based on the strength and weight of that signal, either allow it to be stored in the long term memory cell. This functional method allows for backpropogation of errors (it can remember the error of previous steps, and allow that to help the gradient boosted optimization of the network (we will discuss this)).

There are three gates in the LSTM:

The input gate
- Information from the inputs is stored here
The forget gate
- Information from the hidden layer is used to update this gate. It is allowed to both remember and forget information. This is what separates the LSTM from a GRU
The ouptut gate
- Information at the output layer is stored here

Thats the gist of it. We will get more into it as we build our model.

For an extremely helpful discussion of neural networks and especially RNNs, please refer to this link

Please also note that the code in this section is based off of the code found in the excellent book Deep Learning with R, an excellent investment.

16.3 Data preparation

As LSTMs rely on the sigmoid function, which is between zero and one, we must prepare our data accordingly (aka center and scale it). It also has to be in matrix form (for now). We also want the LSTM to see the lagged version of our variables. We are going to lag not only the xreg variables, but also our target data itself. This allows the neural network to do a sort of autocorrelation analysis of the data, as well as analysis of the cross corelation at many lags. For this, we will use the mlag_dfr function from my tswgewrapped package. This takes as an input column names, a vector of lags, and then adds those columns to the data frame. As we have tons of data, and are probably only going to go to about lag 50, we will deal with the NAs created in our typical manner: spline interpolation. Lets check out the mlag_dfr and lag_dfr functions:

tswgewrapped::lag_dfr

#> function (df, vars, lags) 
#> {
#>     out <- lapply(lags, function(x) stats::lag(df[[vars]], x))
#>     out <- do.call(cbind.data.frame, out)
#>     cnames <- sapply(lags, function(x) paste(vars, x, sep = "_"))
#>     colnames(out) <- cnames
#>     out
#> }
#> <bytecode: 0x106c20a8>
#> <environment: namespace:tswgewrapped>

tswgewrapped::mlag_dfr

#> function (df, vars, lags) 
#> {
#>     out <- lapply(vars, function(x) lag_dfr(df, x, lags))
#>     out <- do.call(data.frame, unlist(out, recursive = FALSE))
#>     cbind(df, out)
#> }
#> <bytecode: 0x9c6ba58>
#> <environment: namespace:tswgewrapped>

Now lets do the initial preprocessing of our data

library(keras)
# reset the data state of this report
rm(train, test, traints)
dfsplit(bj2)

# unneccessarily remove columns we arent using
train[1:3] <- NULL
train[9:14] <- NULL
test[1:3] <- NULL
test[9:14] <- NULL
traints <- (purrr::keep(train, is.ts))
test <- test %>% dplyr::select(PM_US.Post, HUMI, TEMP, PRES, Iws)
train <- traints %>% dplyr::select(PM_US.Post, HUMI, TEMP, PRES, Iws)
test <- mlag_dfr(test, names(test), 1:50)
train <- mlag_dfr(train, names(train), 1:50)
test <- impute(test) %>% as.data.frame
train <- impute(train) %>% as.data.frame
train <- data.matrix(train)
test <- data.matrix(test)

# make an overall data matrix
dat <- rbind(train, test)

# We have one too many columns because of the nature of the ts objects (with
# their windows, this is fine for forecasting but not for an LSTM)
dat <- dat[-1, ]

# do not worry about this, we are doing this so later on when we do our
# bagging, we have a validation technique it will all make sense i promise

maxt <- floor(nrow(dat) * 4/5)
maxv <- floor(nrow(dat))
val <- dat[(maxv - 365 * 2):(maxv - 365), ]
val <- scale(val, center = apply(val, 2, mean), scale = apply(val, 2, sd))
valScale <- attr(val, "scaled:scale")[1]
valCent <- attr(val, "scaled:center")[1]
val <- array(val, c(nrow(val), 2, 255))

# normalize the data for the lstm
mn <- apply(dat, 2, mean)
std <- apply(dat, 2, sd)
dat <- scale(dat, center = mn, scale = std)

# lets also make an array for our test dataset in the form of our input to
# the lstm, which we will discuss next Note we are doing this in a
# convoluded manner because we want to be able to denormalize our data. Next
# time i will use the recipes package because this is a massive headache

test2 <- scale(test, center = apply(test, 2, mean), scale = apply(test, 2, sd))
test3 <- array(test2, c(nrow(test2), 10, 255))

That was a lot of code, but we did nothing fancy. Next, we are going to write a generator function. This keeps us from having to load our data into the GPU’s memory (if you are using a GPU).

16.4 Generator function

As mentioned above, if you are running GPU based calculations, you often dont want to load an entire dataset in your GPU’s memory. So, it is good practice to write a generator function. What this will do is create another function. It will generate samples, moving forward through time, on the fly. As we want to move through the time series slowly, with overlap, there would be a massive amount of data to store. We need to define three values for this:

Lookback how far back we will go to grab our observations steps How far apart our samples will be delay How far ahead our targets will be from the observations batch_size How many times we do this per batch

lookback <- 60  # observations will go back 60 days
steps <- 6  # We will sample every 6 days
delay <- 1  # we will set our prediction targets to be 1 day ahead
batch_size <- 150  #the number of times we will do this per batch

Next, we write a series of functions to generate these samples for our model

generator <- function(data, lookback, delay, min_index, max_index, shuffle = FALSE, 
    batch_size = 128, step = 6) {
    if (is.null(max_index)) 
        max_index <- nrow(data) - delay - 1
    i <- min_index + lookback
    function() {
        if (shuffle) {
            rows <- sample(c((min_index + lookback):max_index), size = batch_size)
        } else {
            if (i + batch_size >= max_index) 
                i <<- min_index + lookback
            rows <- c(i:min(i + batch_size - 1, max_index))
            i <<- i + length(rows)
        }
        
        samples <- array(0, dim = c(length(rows), lookback/step, dim(data)[[-1]]))
        targets <- array(0, dim = c(length(rows)))
        
        for (j in 1:length(rows)) {
            indices <- seq(rows[[j]] - lookback, rows[[j]] - 1, length.out = dim(samples)[[2]])
            samples[j, , ] <- data[indices, ]
            targets[[j]] <- data[rows[[j]] + delay, 2]
        }
        list(samples, targets)
    }
}


train_gen <- generator(dat, lookback = lookback, delay = delay, min_index = 1, 
    max_index = maxt, shuffle = TRUE, step = step, batch_size = batch_size)

val_gen = generator(dat, lookback = lookback, delay = delay, min_index = maxt + 
    1, max_index = maxv, step = step, batch_size = batch_size)

16.5 Model Construction

We will discuss now the construction of our model. Our final model consists of 6 layers in this case, 5 LSTM layers and one standard densely connected neural network. We will discuss each unique layer in some detail, as extreme measures were necessary to avoid overfitting with such a small amount of data

16.5.1 How Does Training an LSTM Work?

Training an LSTM works as follows:

You define the number of batches per epoch. An epoch is a single training run for a neural network. So you define the number of times your generator function samples the data per epoch
In each epoch, the neural net first reads in your training data, and trains on that, using whatever masking/dropout you have (we will discuss this). Then, when it is done with that, it runs the validation (test set in this case) data through the neural net, and compares results. On the validation set, it removes any predefined masking or dropout (which is why the loss is generally lower on the validation set). Things that are confusing now will be discussed later, as it makes more sense to define with an example. Basically, each epoch it runs through the training data N steps, with some tricks to prevent overfitting the training data. Then it runs through once with you validation/test set. With each step of this, weights for each parameter are stored in the nodes themselves, and other useful information such as error is stored in the gates. The weights and the info in the gates gets added to (and removed from in the case of the forget gate). This is done N steps, over M epochs, repeatedly updating the LSTM’s weights and other information until we say so.

16.5.2 Simple LSTM Layer

The first layer consists of an LSTM with 10 hidden nodes. A small amount of nodes at the beginning was necessary to avoid overfitting (even at 50 nodes we rapidly overfit). The next tool we used to combat overfitting was dropout. There are two types of dropout. First we will discuss normal dropout:

16.5.2.1 Normal Dropout

Normal Dropout is used in the input and output layers of a neural network. For a given dropout rate N, it uses matrix multiplication to randomly remove N percentage of the data going in to the node. This is NOT used on the validation set. By showing the neural network less points of the significantly larger training set, it prevents overfitting on the training set

16.5.2.2 Recurrent Dropout

Recurrent Dropout is the exact same thing as normal dropout, except it affects the hidden layer instead of the input and output layers

Our simple LSTM layer just has these parameters

16.5.3 Bidirectional LSTM Layers

Next we pop the output of our first LSTM layer into two BiLSTM layers. A normal LSTM passes the data in from start to end. A bidirectional LSTM actually represents two LSTMs in parallel, one that reads the data from start to end, and one that reads the data from end to start. This helps it find even more interesting patterns and nuances in our data, increasing its representational power. Essentially, we are going to allow the LSTM to track long term patterns with equal power as short term powers with this.

16.5.4 The Dense Layer

A dense layer is a standard neural network. Adding this to our model further increases the representational power.

16.6 The Model

We will now define the model:

model_lstm <- keras_model_sequential() %>% layer_lstm(units = 10, dropout = 0.4, 
    recurrent_dropout = 0.1, activation = "sigmoid", input_shape = list(NULL, 
        dim(dat)[[-1]]), return_sequences = TRUE) %>% bidirectional(layer_lstm(units = 40, 
    dropout = 0.4, recurrent_dropout = 0.4, activation = "sigmoid", return_sequences = TRUE)) %>% 
    bidirectional(layer_lstm(units = 80, dropout = 0.4, recurrent_dropout = 0.4, 
        activation = "sigmoid")) %>% layer_dense(units = 1)

Lets see what the model looks like now

model_lstm

#> Model
#> Model: "sequential_16"
#> ___________________________________________________________________________
#> Layer (type)                     Output Shape                  Param #     
#> ===========================================================================
#> lstm_48 (LSTM)                   (None, None, 10)              10640       
#> ___________________________________________________________________________
#> bidirectional_32 (Bidirectional) (None, None, 80)              16320       
#> ___________________________________________________________________________
#> bidirectional_33 (Bidirectional) (None, 160)                   103040      
#> ___________________________________________________________________________
#> dense_16 (Dense)                 (None, 1)                     161         
#> ===========================================================================
#> Total params: 130,161
#> Trainable params: 130,161
#> Non-trainable params: 0
#> ___________________________________________________________________________

16.7 Compiling the model

Next, we need to compile the model, or define how it trains. We will want to train it to optimize ASE, as we have been using that metric on the rest of the models. For our optimizing method, we will use rmsprop. To learn how RMSprop works, please seethis link. It is typically a good optimizer for RNNs.

The parameter to really tune here is learning rate. We will see in the next section how to tune it. By default, the learning rate is 0.001.

model_lstm %>% compile(optimizer = optimizer_rmsprop(), loss = "mse")

16.8 Training the model

Finally, we get to train our model. We will save the history of our training, so we can diagnose the model

lstmHist <- model_lstm %>% fit_generator(train_gen, steps_per_epoch = 40, epochs = 20, 
    validation_data = val_gen, validation_steps = val_steps)

16.9 Model Diagnostics

Lets look at the loss curve of our training to see how we did:

plot(lstmHist)

There are two things to know about the loss curve:

If the validation loss is greater than training loss, you are severly overfitting.
If the loss is too steep, learning rate is too high. If it is too flat, the learning rate is too low

For an amazing guide about tuning this, please refer to this link

Knowing this, we can now assess our model. First of all, thanks to our dropout, we are absolutely not overfitting. Looking at the shape, the learning rate is likely slightly too high, but this is not worth tuning. (This model performs excellently, and after several hundred tries I know it is not worth it to fiddle further).

16.10 Forecasting

16.10.1 Calculating Prediction Interval with LSTMs

As with the ANN, there is no set way of calculting the prediction interval of an LSTM, as it is not a stochastic, parametric tool, it is simply a composition of lots of functions. However, we can use our dropout to approximate a prediction interval. For an in depth explanation, please refer to this paper. Basically, dropout is a random process. So each time we run our model on the training set, with our dropout, and our random generator function, we will get slightly different results (whereas when we make a prediction, we will not have the dropout, so it will be the same result every time). So, what we can do, is we can evaluate how well our model represents the training set, and use that to calculte a prediction interval. First, we will write a function that gets the mean error of our model over the training set:

getMeanError <- function(mod) {
    # 28 is simply the integer number of times we need to sample from our
    # training set to represent the whole thing.  In this case it will run
    # through the entire set three times
    sqerror <- evaluate(mod, train_gen, 28)
    return(sqrt(sqerror))
}

Next, we will write a function which does this n times, so we have a sample of errors. First, a brief note about writing for loops in R

16.10.1.1 How to write a good for loop in R

First, we must allocate the space in our computer’s RAM for the object we are constructing. If we do not do this, with each iteration of the for loop, R will actually destroy the object, and then create a new one. This creates horrificly slow for loops. For more info on this, please see the amazing book The R Inferno. To allocate this space, we do as follows:

out <- double(n)

Note that we allocated it as a double. If you want to make the for loops even faster, you must help out your R interpreter. If you do not define the type of the output of your loop, with each iteration the R interpreter will have to guess the type, which prevents it from optimizing. Double is the equivalent of numeric, but makes more sense to the reader.

Next, we must define the for loop. We will us seq_along instead of length, because if somebody defines the object to be something with length zero, seq_along will behave properly, while length will break everything

for (i in seq_along(out)){
  do things here
}

Now, lets write our for loop function, to get the errors n timese:

GetErrors <- function(n) {
    errors <- double(n)
    for (i in seq_along(errors)) {
        cat("evaluation number", i, "\n")
        errors[i] <- getMeanError(model_lstm)
    }
    return(errors)
}

Next, lets get our errors, for 1000 runs of the LSTM

trainErr <- GetErrors(1000)

Next, lets look at the errors, and get their mean and std deviation:

hist(trainErr)

errorMean <- mean(trainErr)
errorStd <- sd(trainErr)

Finally, lets write a function that takes our predictions, and gives them an interval for uncertainty:

makeInterval <- function(prediction) {
    pm <- errorMean + errorStd
    upper <- prediction + pm
    lower <- prediction - pm
    return(data.frame(fitted = prediction, upper = upper, lower = lower))
}

16.10.2 S3 methods

We define three S3 methods for predictions made by Keras:

as.keras <- function(x) structure(x, class = "keras")
autoplot.keras <- function(obj) {
    testdf <- data.frame(type = "actual", t = seq_along(test[, 1]), ppm = as.numeric(test[, 
        1]))
    preddf <- data.frame(type = "predicted", t = seq_along(test[, 1]), ppm = as.numeric(obj$fitted))
    confdf <- data.frame(t = seq_along(test[, 1]), upper = obj$upper, lower = obj$lower)
    dfl <- list(testdf, preddf)
    .testPredPlot(dfl) + geom_line(data = confdf, aes(x = t, y = lower, alpha = 0.2), 
        linetype = 3003) + geom_line(data = confdf, aes(x = t, y = upper, alpha = 0.2), 
        linetype = 3003) + guides(alpha = FALSE)
}
scores.keras <- function(obj) {
    c(ase = ASE(obj$fitted, test[, 1]), mape = MAPE(obj$fitted, test[, 1]), 
        Conf.Score = confScore(upper = obj$upper, lower = obj$lower, test[, 
            1]))
}

Now that we are set up, lets predict off of the test set to see how we did

lstmTest <- predict(model_lstm, test3, n.ahead = nrow(test2))
lstmTest <- makeInterval(lstmTest)
descaleTest <- function(x) {
    x * attr(test2, "scaled:scale")[1] + attr(test2, "scaled:center")[1]
}
lstmTest <- lapply(lstmTest, descaleTest)
lstmTest <- as.keras(lstmTest)

That was pretty easy. Now we see why we made these test2 and test3 objects, with the proper dimensions for our neural network. Without this it would be very difficult. The dimensions of the input array for our model are as follows:

array(nrow(input), lookback/step, ncol(input))

16.11 Forecast Evaluation

autoplot(lstmTest)

Interesting. It matched the shape of the data excellently, or at least the overal pattern, but the peaks and troughs seem to be flattened. This is due to our slightly too high learning rate.To put it in human terms, the LSTM is learning too fast, and “skipping over” some details.

However, again trying to guess these is a bit overkill at this length forecast, and being conservative and “pretty good” is completely appropriate. Lets check out the scores:

data.frame(t(scores(lstmTest)))

Wow. This gave us the very best forecast so far. Let us try now to build an ensemble forecast. To do this, and this will be explained later, lets fit the LSTM to the second to last year of our dataset:

lstmVal <- predict(model_lstm, val, n.ahead = nrow(test2))
lstmVal <- makeInterval(lstmVal)
descaleVal <- function(x) {
    x * valScale + valCent
}
lstmVal <- lapply(lstmVal, descaleVal)
lstmVal <- as.keras(lstmVal)

16.12 Calculating a Prediction Interval from an LSTM

There is no defined

17 Ensmeble Forecast

18 Reproducing this analysis

All work performed in this project is intended to be completely reproducible. The following sections describe the necessary steps to reproduce this analysis. Obviously the first step is to clone the parent repository:

git clone https://github.com/josephsdavid/ChinesePM.git

18.1 Reproducing the working environment

18.1.1 Using Nix

The author is a huge fan of the reproducible package manager nix, which allows the user to share a reproducible file shell.nix. When someone else runs from the command line nix-shell, the contents of the file are installed in isolation from the rest of the users system, and is dropped into a shell which perfectly reproduces the working environment of this project. This includes automatically installing and setting up the tensorflow/keras deep learning stack, installing complex dependencies, and if you comment out the tensorflow line and uncomment the tensorflow with CUDA line, will automatically set up tensorflow to run with Nvidia CUDA.

To install remaining dependencies (namely devtools packages), you can run Rscript libs.R to automatically install them. This technique is highly recommended. Below are the contents of libs.R:

library(remotes)

# packages that dont build on nix shell
options(repos = "http://cran.rstudio.org")
myPackages = installed.packages()
cranPackages <- c("nnfor", "dtwclust")
to.install <- setdiff(cranPackages, myPackages[, 1])
if (length(to.install) > 0) install.packages(to.install)

if (!("tswgewrapped" %in% installed.packages())) {
    install_github("josephsdavid/tswgewrapped")
}

The shell.nix file can be seen in the github repository.

Note that using this method will also automatically setup tensorflow, Keras, tensorboard, and any other necesary dependencies for deep learning time series models (with R and python).

18.1.2 Without Nix

To reproduce the R dependencies required for this analysis, please run Rscript noNix.R This reads shell.nix, searches for R packages, compares what is there to what is installed on your computer, and then if they are different, it installs them

library(remotes)
options(repos = "http://cran.rstudio.org")

myPackages = installed.packages()
shell <- readLines("shell.nix")
shell <- shell[grepl(shell, pattern = "rPackages")]
shell <- gsub("rPackages.", "", shell)
qhell <- gsub(" ", "", shell)
otherPackages <- c("nnfor", "dtwclust")
cranPackages <- c(shell, otherPackages)

to.install <- setdiff(cranPackages, myPackages[, 1])
if (length(to.install) > 0) install.packages(to.install)

if (!("tswgewrapped" %in% installed.packages())) {
    install_github("josephsdavid/tswgewrapped")
}

Note that this method will require you to manually set up a deep learning stack

18.2 Reproducing the models

To reproduce the models from this project, you have two options: you can either load the saved .Rda files, or run all the models (as shown in this report). Note this will likely take several days, as some of the models take a long time to train, and multiple grid searches over 10,000+ hyperparameters were performed.

Time Series Project: Analysis of PM2.5 in Beijing

David Josephs

2019-08-11