The setting for this project is as follows. You are an employee of the US Embassy in Beijing. After noticing the strange thickness in the air, and getting a bit of a cough, you decide that you do not want any other visitor to suffer breathing the highly dangerous Beijing air. So, you decide you want to get funding for the US Embassy to provide free air pollution masks to all visitors. However, the bureaucrats who can give you the money want to know two things:
How much money should we allocate to you over the next year?
How many air pollution masks should be made available each day over the course of the year (rush shipping is expensive)?
So, your task is to predict the air pollution levels each day in Beijing at the US Embassy (which records them every hour), so that you can give the bureaucrats an actionable plan. We are assuming that more people will come in to get their free mask on a day when air pollution is bad than on a day when it isn't, and we certainly do not want to run out, as that would be unfair.
Given this setting, we will now discuss how the data was processed:
Open to the public is data from 5 cities:
Beijing
Chengdu
Guangzhou
Shanghai
Shenyang
Each of these datasets contains:
Air quality measurements at multiple locations (including a US Embassy post)
Humidity
Air pressure
Temperature
Precipitation
Wind direction
Wind Speed
Each of these is measured every hour. For detailed info on the data quality and collection methods, please refer to this publication
The next step in our workflow is to get the data ready for analysis. To do that, we will have to complete a few tasks:
import data
Represent data as a time series
inspect for NAs
figure out how to deal with NAs
store in a conveniently manipulatable format
All of this can be seen in R/preprocessing.R, but we will do an in depth explanation here
library(functional) # to compose the preprocessing pipeline
library(data.table) # to read csvs
library(rlist) # for list manipulations
library(pipeR) # fast, dumb pipes
library(imputeTS) # to impute NAs
library(pander) # so i can read the output
library(foreach) # go fast
library(doParallel) # go fast
We want to load all of the csv files stored in data/, so we will write a function which loads all of them into a list of data.table (fast data frame) objects. We will want the name of each item in the list to be the name of the csv file, minus the useless numbers.
import <- function(path) {
# first we list the files in our path, in this case 'data/'
files <- list.files(path)
# then we search for csvs
files <- files[grepl(files, pattern = ".csv")]
# paste the csv names onto the filepath(data/whatever.csv)
filepaths <- sapply(files, function(x) paste0(path, x))
# read them in to a list
out <- lapply(filepaths, fread)
# Get rid of .csv in each filename
fnames <- gsub(".csv", "", files)
# get rid of the confusing numbers
fnames <- gsub("[[:digit:]]+", "", fnames)
# set the names of the list to be the clean filenames
names(out) <- fnames
out
}
Next we want to see how many NAs we are dealing with. Since we have about 20 million total observations, we will represent these as a proportion. We will apply the NA count function to our list using the beautiful rapply function
naCount <- function(xs) {
rapply(xs, function(x) sum(is.na(x))/length(x), how = "list")
}
# load in the data now
datadir = "data/"
china <- import(datadir)
pander(naCount(china))
(pander output: tables of NA proportions for BeijingPM_, ChengduPM_, GuangzhouPM_, ShanghaiPM_, and ShenyangPM_; the numeric values are omitted here)
It looks like the quality of the data collected outside of the US Embassy posts, and in cities other than Beijing, is really bad. Luckily for us, BeijingPM_$PM_US.Post has relatively high quality data. That being said, we are still going to have to impute the NAs. Before that though, let's convert this to time series data, as we will be imputing the NAs differently for time series data than for normal data.
First, we define a function which converts the data to a time series object:
tots <- function(v) {
ts(v, frequency = 365 * 24)
}
Next, we are going to want to scale this up to work on a whole data frame. We will also convert the data frame to a list, just to avoid any type issues down the road. We also do not want to convert non time series data to time series, so we must avoid those columns as well:
totslist <- function(df) {
# define the column names which we do not want to alter
badlist <- c("No", "year", "month", "day", "hour", "season", "cbwd")
# get the column names of the original data frame
nms <- colnames(df)
# coerce it into a list
df <- as.list(df)
# if the name of the item of the list is in our to not change list, do
# nothing otherwise, convert to ts
for (name in nms) {
if (name %in% badlist) {
df[[name]] <- df[[name]]
} else {
df[[name]] <- tots(df[[name]])
}
}
# return the created list
df
}
Next, we want to scale this function to work on a list of data frames:
totsall <- function(xs) {
lapply(xs, totslist)
}
Imputing NAs in time series is a tricky topic, as we cannot just ignore them (that would mess up our sampling rate). There are several methods of dealing with NAs in time series, including Kalman filters, polynomial interpolation, and spline interpolation. In this case, as we are only concerned with the time series that have relatively few NAs, I elected to use spline interpolation, as it is much faster than the Kalman filter while still providing pretty good accuracy. First, a function was constructed to use the try-catch pattern to impute the NAs of items with the time series class.
The logic is as follows: try to interpolate the series with a spline; if the result is a ts object, return the result, otherwise return the original input unchanged.
imp_test <- function(v) {
out <- try(na.interpolation(v, "spline"))
ifelse(is.ts(out), return(out), return(v))
}
Next, we apply that function to a single list, in parallel to save time:
impute <- function(xs) {
foreach(i = 1:length(xs), .final = function(x) {
setNames(x, names(xs))
}) %dopar% imp_test(xs[[i]])
}
Finally, we apply that function to our list of lists
impute_list <- function(xs) {
lapply(xs, impute)
}
Next we are ready to convert to a more useful data structure, and then pipeline all these functions together for quick preprocessing
First, we will convert our slow, massive list into a fast hash table. For large, complex datasets, a hash table has been shown to be roughly 3 times faster than a list, and as we will be repeatedly accessing data from this structure, we would prefer lookups to be quick. For more info on R hash tables, please refer to this link.
to_hash <- function(xs) {
list2env(xs, envir = NULL, hash = TRUE)
}
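As a rough illustration of the underlying mechanism (a hypothetical toy entry, not our data), a hashed environment supports named storage and lookup much like a list:
h <- new.env(hash = TRUE)                              # a hashed environment
assign("BeijingPM_", data.frame(x = 1:3), envir = h)   # store an item by key
get("BeijingPM_", envir = h)                           # fast lookup by name
ls(h)                                                  # list the stored keys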
Next we are ready to pipeline it all together into a single preprocessing function, using functional::Compose
preprocess <- Compose(import, totsall, impute_list, to_hash)
Univariate EDA
Required libraries
library(ggthemes)
library(ggplot2)
library(cowplot)
First, we define some helper functions, in order to save us time and make our code more readable. Let's borrow some tools from forecast and tswge:
# pretty plots for our analysis
seasonplot <- forecast::ggseasonplot
subseriesplot <- forecast::ggsubseriesplot
lagplot <- forecast::gglagplot
sampplot <- tswge::plotts.sample.wge
# clean up errant data points
clean <- forecast::tsclean
Next, let us define a function which generates time series resampling functions, and then use it to define resamplers at new sampling rates. Finally, we will write a function which applies each resampling function to a vector:
# resample generator
change_samples <- function(n) {
function(xs) {
out <- unname(tapply(xs, (seq_along(xs) - 1)%/%n, sum))
out <- ts(out, frequency = (8760/n))
out
}
}
# daily and weekly sampling, monthly is 4 weeks
to_daily <- change_samples(24)
to_weekly <- change_samples(24 * 7)
to_monthly <- change_samples(24 * 7 * 4)
to_season <- change_samples(24 * (365/4))
# pipelining final cleaning and conversion, removing outliers and the couple
# of errant negative values (which do not make sense)
cleandays <- function(xs) {
xs %>>% clean %>>% abs %>>% to_daily
}
cleanweeks <- function(xs) {
xs %>>% clean %>>% abs %>>% to_weekly
}
cleanmonths <- function(xs) {
xs %>>% clean %>>% abs %>>% to_monthly
}
cleanseas <- function(xs) {
xs %>>% clean %>>% abs %>>% to_season
}
# resample the whole ts
resample <- function(xs) {
day <- xs %>>% cleandays %>>% window(end = 6)
week <- xs %>>% cleanweeks %>>% window(end = 6)
month <- xs %>>% cleanmonths %>>% window(end = 6)
seas <- xs %>>% cleanseas %>>% window(end = 6)
list(day = day, week = week, month = month, season = seas)
}
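As a quick sanity check of the resamplers (a toy input, not our data): 48 hourly values of 1 should collapse into two daily totals of 24.
to_daily(rep(1, 48))
# should give a ts of c(24, 24) with frequency 365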
First, let's load in and resample our data:
library(ggthemes)
library(ggplot2)
library(cowplot)
china <- preprocess(datadir)
#> Error in na.interpolation(v, "spline") : Input x is not numeric
#> Error in na.interpolation(v, "spline") : Input x is not numeric
#> Error in na.interpolation(v, "spline") : Input x is not numeric
#> Error in na.interpolation(v, "spline") : Input x is not numeric
#> Error in na.interpolation(v, "spline") : Input x is not numeric
bj <- china$BeijingPM_$PM_US
bj <- resample(bj)
Now let's look at the sample plots, first with the daily data:
dy <- sampplot(bj$day)
So looking at this, the frequency plot suggests wandering behavior, but the lags tell a different story: there appears to be some strong, high order ARMA structure going on. Looking at the line plot, there seems to be a long seasonal component, maybe a yearly one, which would explain why the Parzen window is telling us it is somewhat wandering. We also see what look to be a few more peaks in the Parzen window, which tells us either that there is a complex seasonality going on (likely at this sampling rate), or that we have some sort of high order ARMA model (or potentially both). Let's widen our sampling rate to weeks, months, and quarters to see if there is anything going on there, and maybe catch that yearly pattern. Let's check it out:
So let's go through each of these plots and discuss:
wy <- sampplot(bj$w)
Here we can see better that the time series is not so much wandering as having a very long seasonal pattern (Parzen window). We can tell by the little hook on the left. The ACFs are not so useful in this case, as they are tiny beyond lag 1, due to our weird sampling. However, we know for sure there is some sort of seasonal pattern going on. Potentially there is a multiseasonal pattern again, or some high order ARMA.
my <- sampplot(bj$m)
Here we see another oscillating component in the ACFs, again indicating a seasonal or high order ARMA structure (it looks a lot like patemp from tswge, which is seasonal data that is well described by a high order ARMA model). The frequency plot paints a similar picture: there is something going on with a clear frequency. Let's look at the quarterly data to know for sure:
sy <- sampplot(bj$s)
With a bit more than 20 data points, and a supposed seasonal period of a year (4 data points), we would expect a frequency of about 0.2, and lo and behold, we have one. There is strong evidence of a yearly seasonal pattern in the data (also look at the peak in the ACFs at lag 4, pretty good evidence right there). The scatterplot tells a pretty obvious story as well.
Let's check out this seasonal pattern a bit more with a seasonal plot. We will plot them all at once this time and then discuss:
sda <- seasonplot(bj$day) + scale_color_hc() + theme_hc() + ggtitle("Seasonal plot: Daily")
sdw <- bj$week %>>% seasonplot + theme_hc() + scale_color_hc() + ggtitle("Seasonal plot: Weekly")
sdm <- bj$month %>>% seasonplot + theme_hc() + scale_color_hc() + ggtitle("Seasonal plot: Monthly")
sds <- bj$seas %>>% seasonplot + theme_hc() + scale_color_hc() + ggtitle("Seasonal plot: Quarterly")
plot_grid(sda, sdw, sdm, sds, ncol = 2)
The shape of these seasonal plots, and how they all roughly line up, indicates to me that there is some sort of seasonality. It is especially apparent in the Monthly plot, where the years line up very well, but it is clear in the other plots as well.
Why is this happening? Is there a real world explanation for this?
Indeed there is. Every year, around November, citywide central heating turns on in Beijing for the winter. This causes a clear change in the air pollution, as heating is not an energy efficient process. In the summer, pollution goes down, because central heating is off, and people are more likely to go outside, walk, and open their windows.
With our newfound knowledge, we will now perform a classical analysis of this time series.
library(tswgewrapped)
# > my wrapper scripts of tswge, with more convenient syntax
The setup for this bit of analysis is as follows:
china <- preprocess("data/")
#> Error in na.interpolation(v, "spline") : Input x is not numeric
#> Error in na.interpolation(v, "spline") : Input x is not numeric
#> Error in na.interpolation(v, "spline") : Input x is not numeric
#> Error in na.interpolation(v, "spline") : Input x is not numeric
#> Error in na.interpolation(v, "spline") : Input x is not numeric
bj <- china$BeijingPM_$PM_US
bjs <- resample(bj)
bjus <- bjs$day
# split
train <- window(bjus, end = 5)
test <- window(bjus, start = 5)
# clean up
rm(bj, bjus, bjs)
To make analysis of models from different libraries with different structures easier, I wrote some S3 classes and generic methods in order to view and evaluate them. These are shown below
The scores generic is used to get the score of a model of any type.
scores <- function(obj) {
UseMethod("scores")
}
We next create a wge class, where we can store our tswge forecasts
as.wge <- function(x) structure(x, class = "wge")
We will use 3 methods to evaluate TSWGE models:
ASE <- function(predicted, actual) {
mean((actual - predicted)^2)
}
MAPE <- function(predicted, actual) {
100 * mean(abs((actual - predicted)/actual))
}
Confidence score is a made-up metric I created in order to evaluate the prediction interval. It works as follows: for each point in the actual observations, if it lies outside the prediction interval, give that point a score of 1. If it is within the interval, give it a score of 0. Then, average the scores and multiply by 100 to get the percentage we got wrong.
checkConfint <- function(upper, lower, actual) {
(actual < lower) | (actual > upper)
}
confScore <- function(upper, lower, actual) {
rawScores <- ifelse(checkConfint(upper, lower, actual), 1, 0)
return(sum(rawScores)/(length(actual)) * 100)
}
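A tiny worked example with made-up numbers: with a prediction interval of [0, 10] at both points and actual values of 5 and 12, one of the two points falls outside, so the score is 50.
confScore(upper = c(10, 10), lower = c(0, 0), actual = c(5, 12))
# 50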
To score tswge objects, we can simply use the following S3 method:
scores.wge <- function(xs) {
mape <- MAPE(xs$f, test)
ase <- ASE(xs$f, test)
confs <- confScore(xs$ul, xs$ll, test)
c(ASE = ase, MAPE = mape, Conf.Score = confs)
}
To compare the predictions vs reality of a wge object, I wrote an autoplot method for wge objects:
.testPredPlot <- function(xs) {
p <- ggplot() + theme_hc() + scale_color_hc()
doplot <- function(df) {
p <<- p + geom_line(data = df, aes(x = t, y = ppm, color = type))
}
out <- lapply(xs, doplot)
out[[2]]
}
autoplot.wge <- function(obj) {
testdf <- data.frame(type = "actual", t = seq_along(test), ppm = as.numeric(test))
preddf <- data.frame(type = "predicted", t = seq_along(obj$f), ppm = as.numeric(obj$f))
confdf <- data.frame(upper = obj$ul, lower = obj$ll, t = seq_along(test))
dfl <- list(testdf, preddf)
.testPredPlot(dfl) + geom_line(data = confdf, aes(x = t, y = lower, alpha = 0.2),
linetype = "3003") + geom_line(data = confdf, aes(x = t, y = upper, alpha = 0.2),
linetype = "3003") + guides(alpha = FALSE)
}
Before we begin our analysis, we will create a seasonal and a trend based difference of our original time series, mostly for the sake of thoroughness. The difference function used is as follows:
difference
#> function (type, x, n)
#> {
#> szn_trans <- function(x, n) {
#> artrans.wge(x, phi.tr = c(rep(0, n - 1), 1))
#> }
#> arima_trans <- function(x, n) {
#> f <- artrans.wge(x, phi.tr = 1)
#> if (n == 1) {
#> res <- f
#> return(res)
#> }
#> else {
#> arima_trans(f, n - 1)
#> }
#> }
#> if (is.character(enexpr(type)) == F) {
#> type <- as.character(enexpr(type))
#> }
#> if (type %in% c("arima", "ARIMA", "Arima")) {
#> return((arima_trans(x, n)))
#> }
#> if (type %in% c("ARUMA", "Aruma", "aruma", "Seasonal", "seasonal")) {
#> szn_trans(x, n)
#> }
#> }
#> <bytecode: 0xa230c38>
#> <environment: namespace:tswgewrapped>
Our EDA tells us there should be a seasonality of ~365 days. So, let's take that out of our time series:
trainSea <- difference(seasonal, train, 365)
Nothing appeared to happen. This is either because the difference is out of the scale of our ACF plot (365 is a lot of lags), or because there is something more going on. We will see.
Just for the sake of completeness, let's take a trend based difference of our data:
trainTrend <- difference(arima, train, 1)
There is little to say here, I do not believe in this model.
To estimate the order, first all the train sets were put into a list:
trainers <- list(ARMA = train, Seasonal = trainSea, ARIMA = trainTrend)
Then, a sped-up wrapper around aic.wge (a faster version of aic5.wge, especially with parallel computation) was used on the trainers. The speed differences are minor, but much time is still saved. The wrappers are:
tswgewrapped:::expand
#> function (v1, v2)
#> {
#> c(sapply(v1, function(x) rep(x, length(v2))))
#> }
#> <bytecode: 0x5b62368>
#> <environment: namespace:tswgewrapped>
tswgewrapped:::rewrite
#> function (v1, v2)
#> {
#> rep(v1, length(v2))
#> }
#> <bytecode: 0x5ad45a0>
#> <environment: namespace:tswgewrapped>
tswgewrapped:::getpq
#> function (x, p = 8, q = 5, type = "aic", silent = TRUE)
#> {
#> if (silent == FALSE) {
#> cat("Calculating ", type, " for ARMA(", p, ", ", q, " )\n",
#> sep = "")
#> }
#> res <- try(aic.wge(x, p, q, type))
#> if (is.list(res)) {
#> out <- c(res$p, res$q, res$value)
#> }
#> else {
#> out <- c(p, q, 9999)
#> }
#> out
#> }
#> <bytecode: 0x59a2978>
#> <environment: namespace:tswgewrapped>
aic5
#> function (x, p = 0:8, q = 0:5, type = "aic", silent = TRUE)
#> {
#> ip <- expand(p, q)
#> iq <- rewrite(q, p)
#> out <- mapply(function(v1, v2) getpq(x, v1, v2, type, silent),
#> ip, iq)
#> out <- (as.data.frame(t(out)))
#> colnames(out) <- c("p", "q", type)
#> head(out[order(out[, 3], decreasing = F), ], 5)
#> }
#> <bytecode: 0x58da788>
#> <environment: namespace:tswgewrapped>
tswgewrapped:::aics
#> [[1]]
#> function (x, p = 0:8, q = 0:5, silent)
#> aic5(x, p, q, type = "aic", silent)
#> <bytecode: 0x57781b8>
#> <environment: namespace:tswgewrapped>
#>
#> [[2]]
#> function (x, p = 0:8, q = 0:5, silent)
#> aic5(x, p, q, type = "bic", silent)
#> <bytecode: 0x5778810>
#> <environment: namespace:tswgewrapped>
aicbic
#> function (vec, p = 0:8, q = 0:5, parallel = FALSE, cl = NULL,
#> silent = FALSE)
#> {
#> if (parallel == TRUE) {
#> parLapply(cl, aics, function(f) f(vec, p, q, silent = TRUE))
#> }
#> else {
#> lapply(aics, function(f) f(vec, p, q, silent))
#> }
#> }
#> <bytecode: 0x567fbc8>
#> <environment: namespace:tswgewrapped>
The rest of the source code can be seen here. The calculation of all the AICs and BICs is shown below
aics <- lapply(trainers, aicbic, 0:10, 0:8)
pander(aics)
ARMA:
*
-------------------------
p q aic
-------- ---- --- -------
**90** 9 8 14.62
**53** 5 7 14.62
**68** 7 4 14.62
**11** 1 1 14.62
**99** 10 8 14.62
-------------------------
*
------------------------
p q bic
-------- --- --- -------
**11** 1 1 14.63
**19** 2 0 14.63
**4** 0 3 14.63
**20** 2 1 14.63
**12** 1 2 14.63
------------------------
Seasonal:
*
------------------------
p q aic
-------- ---- --- ------
**50** 5 4 15.3
**51** 5 5 15.3
**72** 7 8 15.3
**70** 7 6 15.3
**99** 10 8 15.3
------------------------
*
------------------------
p q bic
-------- --- --- -------
**11** 1 1 15.32
**3** 0 2 15.33
**19** 2 0 15.33
**4** 0 3 15.33
**12** 1 2 15.33
------------------------
ARIMA:
*
------------------------
p q aic
-------- --- --- -------
**58** 6 3 14.63
**38** 4 1 14.63
**49** 5 3 14.63
**44** 4 7 14.63
**20** 2 1 14.63
------------------------
*
------------------------
p q bic
-------- --- --- -------
**20** 2 1 14.64
**4** 0 3 14.64
**38** 4 1 14.65
**56** 6 1 14.66
**49** 5 3 14.66
------------------------
To estimate the parameters, we will be using the wrapper estimation function I wrote (just more convenient syntax), as well as the wrapper around ljung.wge I wrote, which is just more convenient to read (and doesn't print 100 residuals):
estimate
#> function (xs, p, q = 0, type = "mle", ...)
#> {
#> if (q > 0) {
#> return(est.arma.wge(xs, p, q, ...))
#> }
#> else {
#> return(est.ar.wge(xs, p, type, ...))
#> }
#> }
#> <bytecode: 0xf8e0298>
#> <environment: namespace:tswgewrapped>
tswgewrapped:::silence
#> function (x)
#> {
#> sink("/dev/null")
#> x
#> sink()
#> }
#> <bytecode: 0x658ca00>
#> <environment: namespace:tswgewrapped>
hush
#> function (f)
#> {
#> silence(res <- f)
#> return(res)
#> }
#> <bytecode: 0x4ec3ba0>
#> <environment: namespace:tswgewrapped>
ljung_box
#> function (x, p, q, k_val = c(24, 48))
#> {
#> ljung <- function(k) {
#> hush(ljung.wge(x = x, p = p, q = q, K = k))
#> }
#> sapply(k_val, ljung)
#> }
#> <bytecode: 0x1296b818>
#> <environment: namespace:tswgewrapped>
All estimations were run with different orders until the best results were found:
estARMA <- estimate(train, 9, 8)
#>
#> Coefficients of Original polynomial:
#> 0.1142 0.2027 0.4502 -0.5994 0.0159 0.0679 0.7182 0.0737 -0.1559
#>
#> Factor Roots Abs Recip System Freq
#> 1-1.3528B+0.9692B^2 0.6979+-0.7381i 0.9845 0.1295
#> 1+1.5505B+0.9575B^2 -0.8097+-0.6235i 0.9785 0.3956
#> 1-0.9753B 1.0253 0.9753 0.0000
#> 1+0.5870B+0.8377B^2 -0.3503+-1.0349i 0.9153 0.3020
#> 1+0.4933B -2.0270 0.4933 0.5000
#> 1-0.4169B 2.3985 0.4169 0.0000
#>
#>
ljung_box(estARMA$res, 9, 8)
#> [,1] [,2]
#> test "Ljung-Box test" "Ljung-Box test"
#> K 24 48
#> chi.square 12.9592 31.2323
#> df 7 31
#> pval 0.07310876 0.4545512
par(mfrow = c(1, 2))
acf(estARMA$res)
plot(estARMA$res)
estSeas <- estimate(trainSea, 7, 6)
#>
#> Coefficients of Original polynomial:
#> -0.1091 -0.1268 0.5919 -0.3199 -0.0725 -0.4779 0.2916
#>
#> Factor Roots Abs Recip System Freq
#> 1+1.5049B+0.9193B^2 -0.8184+-0.6464i 0.9588 0.3936
#> 1+0.5136B+0.8599B^2 -0.2987+-1.0362i 0.9273 0.2947
#> 1-1.3829B+0.7006B^2 0.9869+-0.6733i 0.8370 0.0953
#> 1-0.5265B 1.8993 0.5265 0.0000
#>
#>
ljung_box(estSeas$res, 7, 6)
#> [,1] [,2]
#> test "Ljung-Box test" "Ljung-Box test"
#> K 24 48
#> chi.square 18.49173 44.33813
#> df 11 35
#> pval 0.07084993 0.1339141
par(mfrow = c(1, 2))
acf(estSeas$res)
plot(estSeas$res)
estTrend <- estimate(trainTrend, 6, 3)
#>
#> Coefficients of Original polynomial:
#> -0.0335 -0.6124 0.4784 -0.1861 0.0499 -0.0888
#>
#> Factor Roots Abs Recip System Freq
#> 1+0.6325B+0.7674B^2 -0.4121+-1.0646i 0.8760 0.3088
#> 1-1.0402B+0.3712B^2 1.4011+-0.8549i 0.6093 0.0872
#> 1+0.4412B+0.3116B^2 -0.7078+-1.6456i 0.5582 0.3147
#>
#>
ljung_box(estTrend$res, 6, 3)
#> [,1] [,2]
#> test "Ljung-Box test" "Ljung-Box test"
#> K 24 48
#> chi.square 17.52443 43.90103
#> df 15 39
#> pval 0.2884916 0.2715692
par(mfrow = c(1, 2))
acf(estTrend$res)
plot(estTrend$res)
All of our models look pretty good as far as estimates go. However, I think only ARUMA and ARMA are appropriate. We are only proceeding with ARIMA to be thorough.
We will now proceed with our daily, one year ahead forecasts, as well as their evaluation. The fcst function from tswgewrapped is again just a wrapper for fore.whatever.wge:
tswgewrapped::fcst
#> function (type, ...)
#> {
#> phrase <- paste0("fore.", enexpr(type), ".wge")
#> func <- parse_expr(phrase)
#> eval(expr((!!func)(...)))
#> }
#> <bytecode: 0x5d63758>
#> <environment: namespace:tswgewrapped>
armaCast <- tswgewrapped::fcst(type = aruma, x = train, phi = estARMA$phi, theta = estARMA$theta,
n.ahead = length(test)) %>>% as.wge
This one damps to the mean pretty quickly; however, it may still be a decent predictor, since common sense says the most likely outcome is the mean. It may provide a good baseline in the future, but it is not in its own right a very good predictor.
seaCast <- tswgewrapped::fcst(type = aruma, x = train, phi = estSeas$phi, theta = estSeas$theta,
s = 365, n.ahead = length(test)) %>>% as.wge
This looks a lot more like the time series, so it is likely a more appropriate model at this forecast horizon, but the forecasts themselves appear to be off
The next one will just forecast the last value over and over: highly uninteresting and not appropriate.
trendCast <- tswgewrapped::fcst(type = aruma, x = train, phi = estTrend$phi,
theta = estTrend$theta, d = 1, n.ahead = length(test)) %>>% as.wge
Nothing surprising here
An airline model was not attempted, because it does not make much sense.
As the appropriateness of each model has already been noted, we can now look at the scores of each model:
casts <- list(arma = armaCast, seasonal = seaCast, arima = trendCast)
data.frame(list.rbind(lapply(casts, scores)))
These results are not surprising: our common sense ARMA forecast (essentially just the mean) did pretty well in the long run, and will serve as a benchmark for future forecasts. Our seasonal model, which looks pretty, did not do so hot, as it was just a bit off. Our ARIMA model only did well because the last value was near the mean. Below we can see the comparison between the predictions and the test set:
autoplot(armaCast) + ggtitle("ARMA")
autoplot(seaCast) + ggtitle("Seasonal")
autoplot(trendCast) + ggtitle("ARIMA")
One important thing to note is the prediction intervals for these models. They did an excellent job of capturing the actual data within their prediction intervals. This is really useful information to us.
Although our forecasts were not very good, our ARMA forecast in particular will provide us with a nice benchmark to compare our newer models to, at least by ASE. I think the seasonality is an important factor in this dataset, and although our seasonal model did not do so well, it is worth looking into more. We will proceed next with some multiseasonal approaches. As this is daily data, there tend to be two major seasonalities: a weekly pattern and a yearly pattern.
We would like to save the important information from our models, so that in the end we have a nice list of everything we need: the scores of the model, the predictions themselves, and the prediction limits:
makeModel <- function(...) {
UseMethod("makeModel")
}
makeModel.wge <- function(obj) {
score <- scores(obj)
return(list(preds = obj$f, ul = obj$ul, ll = obj$ll, ase = score[1], mape = score[2],
Conf.Score = score[3]))
}
models <- lapply(casts, makeModel)
First, we will load up the forecast library, in order to get functionality for multiseasonal forecasts
library(forecast)
Next, we will recast our train and test sets as msts objects:
periods <- c(7, 365.25)
train <- msts(train, seasonal.periods = periods, end = 5)
test <- msts(test, seasonal.periods = periods, start = 5)
In class, we learned that you can perform standard regression on a time series, then apply our ARIMA analysis to the errors of that regression, and then adjust for it. The auto.arima function does all of this for you if you supply it with an xreg regressor matrix. The math behind it basically represents the equation as follows:
\[y_t = \beta_0 + \beta_1 x_{1,t} + ... + \beta_k x_{k,t} + \eta_t\]
where \(\eta_t\) is a time series of errors. By estimating that time series, we can then properly estimate \(y_t\), as well as make forecasts.
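A minimal sketch of this idea (hypothetical objects y_train, xreg_train, and xreg_future; the actual model we fit below uses Fourier regressors):
# regress y on external regressors, modeling the error series as ARIMA
fit <- forecast::auto.arima(y_train, xreg = xreg_train)
# forecasting requires future values of the same regressors
fc <- forecast::forecast(fit, xreg = xreg_future)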
In the case of multiple seasonalities, we can get a bit more creative. We can represent the seasonality as a Fourier series; that is, we can write a bunch of sine and cosine terms, one set for our weekly seasonality and one set for our yearly seasonality. We can see this below:
exampleFourier <- data.frame(fourier(train, K = c(3, 100)))
Note that we use three Fourier term pairs for our weekly seasonality, and about 100 for our yearly seasonality (the number of terms can be at most half the seasonal period). Let's check out what they look like; we will look at one Fourier expansion on a weekly level, and one on a yearly level:
weekly <- exampleFourier[[1]] + exampleFourier[[2]] + exampleFourier[[3]] +
exampleFourier[[4]] + exampleFourier[[5]] + exampleFourier[[6]]
par(mfrow = c(1, 2))
plot(weekly[1:63], type = "l")
plot(train[1:63], type = "l")
So we see we may not have all of the patterns of the data on a weekly scale represented, but it is not a half bad representation. Let us try with the yearly data as well (we will apply a smoothing on the weekly level first so we can see it).
First, let's do a low pass filter of our training dataset:
mafun <- function(xs, n) {
stats::filter(xs, rep(1, n))/n
}
trsmooth <- mafun(train, 56)
yearly <- exampleFourier[[7]] + exampleFourier[[8]] + exampleFourier[[9]] +
exampleFourier[[10]] + exampleFourier[[11]] + exampleFourier[[12]]
par(mfrow = c(1, 2))
plot(yearly, type = "l")
plot(trsmooth, type = "l")
Again, not a bad little representation of the series. If we incorporated the remaining components of our Fourier expansion of the yearly trend, we would have a really powerful representation of the 365.25 day period.
We can use this fourier expansion as a regressor on our time series, and this should give us a pretty good representation of the complex seasonality of this series.
We need to construct a few S3 methods for this section. First, the forecast function for Arima objects (those constructed by auto.arima and Arima) returns a forecast of the same length as the external regressors (regardless of the value of h). To deal with this, we write a new generic called newFore, which refits the Arima object on the test data and then forecasts using the test-set regressors. This returns the same values as running:
predict(`Arima object`, newxreg = `regressors from the test set`, n.ahead = whatever)
but gives us the benefits of the forecast object, such as nice plots, confidence intervals, etc.
newFore <- function(...) {
UseMethod("newFore")
}
newFore.Arima <- function(obj, newdata, xreg = NULL, h = 1) {
refit <- Arima(newdata, model = obj, xreg = xreg)
forecast(refit, h = h, xreg = xreg)
}
Next we want to add new methods for these forecast objects, so we can view our results in a consistent manner:
as.fore <- function(x) structure(x, class = "fore")
autoplot.fore <- function(obj) {
testdf <- data.frame(type = "actual", t = seq_along(test), ppm = as.numeric(test))
preddf <- data.frame(type = "predicted", t = seq_along(test), ppm = as.numeric(obj$fitted[1:length(test)]))
confdf <- data.frame(upper = obj$upper[, 2], lower = obj$lower[, 2], t = seq_along(test))
dfl <- list(testdf, preddf)
.testPredPlot(dfl) + geom_line(data = confdf, aes(x = t, y = lower, alpha = 0.2),
linetype = "3003") + geom_line(data = confdf, aes(x = t, y = upper, alpha = 0.2),
linetype = "3003") + guides(alpha = FALSE)
}
scores.fore <- function(obj) {
mape <- MAPE(obj$fitted[1:length(test)], test)
ase <- ASE(obj$fitted, test)
confs <- confScore(obj$upper[1:length(test), 2], obj$lower[1:length(test),
2], test)
c(ASE = ase, MAPE = mape, Conf.Score = confs)
}
First we construct a fourier expansion of both the train and test set:
trainExp <- fourier(train, K = c(3, 100))
testExp <- fourier(test, K = c(3, 100))
Next, we plug this into the auto.arima function and build our model. Note that we are passing the parameter lambda = 0 into the function. When lambda is supplied, auto.arima applies a Box-Cox transformation to the data before fitting, to make the model work smoothly. In our case, we do not want values less than zero, because that wouldn't make sense. So, we set lambda, the Box-Cox parameter, to 0, which represents a logarithmic transformation. In turn, this will keep our model, and thus our predictions, above zero.
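To see that lambda = 0 corresponds to a log transformation, a quick check on toy values:
forecast::BoxCox(c(1, 10, 100), lambda = 0)
log(c(1, 10, 100))  # identical results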
mseadyn <- auto.arima(train, seasonal = FALSE, lambda = 0, xreg = trainExp)
Let's check out where the roots lie on the unit circle:
autoplot(mseadyn)
And let's see what order of model it identified within the residuals:
data.frame(p = length(mseadyn$model$phi), q = length(mseadyn$model$theta), d = length(mseadyn$model$Delta))
First, let's look at what the original forecast looks like, without using the newFore function:
forecast(mseadyn, xreg = trainExp) %>>% autoplot
Whew, 5 years ahead. That's a bit ridiculous. Now let's run newFore on it, and then check it out again:
mseafor <- newFore(mseadyn, newdata = test, xreg = testExp, h = 1)
autoplot(mseafor)
Note the scale on this plot: it is now a one year ahead forecast, which is exactly what we want. Unlike the basic ARUMA forecasts we did earlier, the confidence limits also lie above zero in all cases, and hopefully it will be a bit better.
This model seems to be slightly more appropriate than the simpler models we fit, as it is highly likely this data is multiseasonal, and I much prefer having a lower ARMA order than (9, 8). Let's check it out:
mseafor %>>% as.fore %>>% autoplot
This appears to be a ridiculously good forecast. Let's check out how it scored:
mseafor %>>% as.fore %>>% scores %>>% t %>>% data.frame
Wow. The ASE is nearly a million units smaller than the baseline ARMA forecast. This is a massive improvement, and it seems that our hypothesis of multiseasonality in the data was correct. Let's try a different multiple seasonality technique next. Aside from that, a good percentage of the data was captured within our prediction interval. This indicates that on its own, this model is a good forecasting tool.
We will now add this model to our list of models:
makeModel.fore <- function(obj) {
score <- scores(obj)
return(list(preds = obj$fitted, ul = obj$upper[, 2], ll = obj$lower[, 2],
ase = score[1], mape = score[2], Conf.Score = score[3]))
}
models$dynamic <- makeModel(as.fore(mseafor))
TBATS is a new (to me) and exciting method for dealing with time series with complex seasonalities. TBATS stands for: Trigonometric seasonality, Box-Cox transformation, ARMA errors, Trend and Seasonal components
What?
Basically, the TBATS algorithm is going to:
Perform a box-cox transformation on the time series
Break it into Seasonal, ARMA (noise/residual), and Trend (with damping) components
Estimate the Seasonal component(s) as a fourier series
Estimate the trend component
Estimate the ARMA component
Combine everything back together to represent the series
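A minimal sketch of what that looks like in code (on a hypothetical msts series y; the real fit on our training data comes below):
fit <- forecast::tbats(y)                  # Box-Cox, trend, ARMA errors, trigonometric seasonality
comps <- forecast::tbats.components(fit)   # the decomposed level/trend/seasonal series
fc <- forecast::forecast(fit, h = 30)      # forecast 30 steps ahead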
How is this different from what we just did?
Unlike what we just did with dynamic harmonic regression, where the seasonality was fixed (as it is simply read in from the attributes of the msts object), seasonality here is allowed to change over time, as it is calculated dynamically from the data. This means that the pattern we saw in our EDA in the quarterly season plots can maybe be accounted for (where years one and two had a different pattern than the rest of the years).
Why haven't I heard of this?
Unlike a lot of time series techniques, TBATS is from the 21st century, and not yet well proven. It also has a hard time sometimes: since it is an automated modeling tool, some special cases will mess it up. It is also exceptionally slow; without parallel processing it took about 20 minutes to run on this small amount of data. Despite this, one of its great strengths is its automation, making it very easy to get a pretty good model a lot of the time.
Let's go ahead and get into it.
tbats comes from the forecast package, and has the same forecasting quirk we saw above (its forecasts are the same length as the data it was trained on; h is ignored). So let's write a newFore method for bats objects:
newFore.bats <- function(obj, newdata, h = 1) {
refit <- tbats(model = obj, y = newdata)
forecast(refit, h)
}
Thankfully, the structure of a tbats forecast is exactly the same as the structure of our dynamic regression forecasts, so for the rest of the methods, we can just coerce our tbats forecasts into fore objects.
First, let's note whether train and test are msts or ts objects, just for sanity:
sapply(list(train, test), class)
#> [,1] [,2]
#> [1,] "msts" "msts"
#> [2,] "ts" "ts"
Ok great we are good to go.
cores <- parallel::detectCores()
# > [1] 12
bjbats <- tbats(train, use.parallel = TRUE, num.cores = cores - 1)
autoplot(bjbats)
This is interesting too: we can see the trend that the algorithm smoothed out of the time series. Let's go ahead and make the forecast and see how we did.
Since we already know what will happen if we just call the forecast function directly, let's skip that comparison and use our newFore method:
batF <- newFore(obj = bjbats, test, h = 366)
autoplot(batF)
Interesting. What is the autoplot showing us? It is a good question. It is hard to say whether this model is any good or not just from this. Let's try looking at the actual forecasted values with our autoplot method:
forbats <- as.fore(batF)
autoplot(forbats)
The autoplot of the forecast itself was strange, but looking at the fitted values, it was a pretty dang good forecast. It is interesting how both of these multiseasonal models captured the shape of the data nicely (see around the 100 point mark), but not the scale. TBATS did an exceptional job of getting the shape of the data (even more than our dynamic regression, see around t = 200), but it did worse as far as scaling goes.
Let's check out the scores:
scores(forbats) %>>% data.frame
By the numbers, this is an exceptional model as well (or at least much better than our naive baseline). As far as appropriateness goes, it is hard to say, as the TBATS algorithm is a bit of a black box. However, it produced a pretty good forecast that looks similar to our original data, so I will give it a pass. I believe the fact that the seasonality can change is what made this a bit better than our regression with ARMA errors. As far as prediction intervals go, this model managed to capture even more of the swings of the data.
models$tbats <- makeModel(forbats)
Facebook released their internal fast and easy forecasting algorithm, prophet. The prophet algorithm works in the following manner:
It decomposes the time series into seasonal, trend, and holiday components.
The seasonal component is estimated as in TBATS, with Fourier series. By default, it looks for daily (if we have sub daily data), weekly, and yearly patterns, but these can be specified (for example, we can include a 5 day work week). These Fourier series are then multiplied by a normal distribution with the same variance as our data.
The trend component is modeled basically as logistic growth. However, it is not just simple logistic growth: they use piecewise logistic growth, which means that the trend component is allowed to change with time.
If a list of holidays and events with their dates are provided, a matrix of dummy variable regressors for them is added to the model.
Components are combined as a generalized additive model (GAM), and then you are done.
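A small sketch of how holidays and extra seasonalities can be specified by hand (hypothetical dates and settings; below we simply use the defaults):
library(prophet)
# hypothetical holiday table: named events with their dates
holidays <- data.frame(holiday = "spring_festival",
                       ds = as.Date(c("2010-02-14", "2011-02-03")))
m <- prophet(holidays = holidays, weekly.seasonality = FALSE)
m <- add_seasonality(m, name = "workweek", period = 7, fourier.order = 3)
m <- fit.prophet(m, history_df)  # history_df: a hypothetical data frame with ds and y columns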
We will give the usual S3 methods. Because prophet forecasts include the entire dataset, we will have to write methods that take out the training parts. First, we define the coercion method
as.proph <- function(x) structure(x, class = "proph")
Next we define our autoplot and scoring methods:
scores.proph <- function(obj) {
c(ase = ASE(obj$yhat[-(1:length(train))], test), mape = MAPE(obj$yhat[-(1:length(train))],
test), Conf.Score = confScore(upper = obj$yhat_upper[-(1:length(train))],
lower = obj$yhat_lower[-(1:length(train))], test))
}
autoplot.proph <- function(obj) {
testdf <- data.frame(type = "actual", t = seq_along(test), ppm = as.numeric(test))
preddf <- data.frame(type = "predicted", t = seq_along(test), ppm = as.numeric(obj$yhat[-(1:length(train))]))
confdf <- data.frame(t = seq_along(test), upper = obj$yhat_upper[-(1:length(train))],
lower = obj$yhat_lower[-(1:length(train))])
dfl <- list(testdf, preddf)
.testPredPlot(dfl) + geom_line(data = confdf, aes(x = t, y = lower, alpha = 0.2),
linetype = "3003") + geom_line(data = confdf, aes(x = t, y = upper, alpha = 0.2),
linetype = "3003") + guides(alpha = FALSE)
}
First, we need to load the data in the proper format
st <- as.Date("2010-1-1")
en <- as.Date("2013-12-31")
trainDates <- seq(st, en, by = "day")
# data MUST BE IN THIS EXACT FORMAT
traindf <- data.frame(ds = trainDates, y = as.numeric(train))
Next, let's make our model. This is pretty easy:
library(prophet)
model <- prophet(traindf)
First, we must make a data frame for the future, because prophet has a silly API
future <- make_future_dataframe(model, periods = 366)
Now we can forecast, with the predict function
fore <- predict(model, future)
plot(model, fore)
We can also view how it modeled/forecasted all the components
prophet_plot_components(model, fore) + theme_hc()
#> NULL
Let's check out how we did. First, let's take a look at our forecast:
pfore <- as.proph(fore)
autoplot(pfore)
This is a pretty weird little forecast. Because of the smoothing, and the normalized nature of the prophet algorithm (it is much closer to regression than anything else), it captures the shape of the series well, but it does not get the extremity of the series. This model may be good for detecting overall patterns in the data, but not for specific predictions. Let's check out the scores:
data.frame(t(scores(pfore)))
These are not brilliant, only ever so slightly better than our baseline ARMA/mean score. This says a lot about how hard common sense is to beat.
str(pfore)
#> List of 19
#> $ ds : POSIXct[1:1827], format: "2010-01-01" "2010-01-02" ...
#> $ trend : num [1:1827] 2525 2525 2525 2524 2524 ...
#> $ additive_terms : num [1:1827] -227.6 -39.4 -86.7 -113.8 47.6 ...
#> $ additive_terms_lower : num [1:1827] -227.6 -39.4 -86.7 -113.8 47.6 ...
#> $ additive_terms_upper : num [1:1827] -227.6 -39.4 -86.7 -113.8 47.6 ...
#> $ weekly : num [1:1827] 26.1 147.8 29.9 -70.2 17.3 ...
#> $ weekly_lower : num [1:1827] 26.1 147.8 29.9 -70.2 17.3 ...
#> $ weekly_upper : num [1:1827] 26.1 147.8 29.9 -70.2 17.3 ...
#> $ yearly : num [1:1827] -253.7 -187.2 -116.6 -43.7 30.2 ...
#> $ yearly_lower : num [1:1827] -253.7 -187.2 -116.6 -43.7 30.2 ...
#> $ yearly_upper : num [1:1827] -253.7 -187.2 -116.6 -43.7 30.2 ...
#> $ multiplicative_terms : num [1:1827] 0 0 0 0 0 0 0 0 0 0 ...
#> $ multiplicative_terms_lower: num [1:1827] 0 0 0 0 0 0 0 0 0 0 ...
#> $ multiplicative_terms_upper: num [1:1827] 0 0 0 0 0 0 0 0 0 0 ...
#> $ yhat_lower : num [1:1827] -109.41 1.68 257.35 91.15 88.32 ...
#> $ yhat_upper : num [1:1827] 4662 4662 4556 4661 4854 ...
#> $ trend_lower : num [1:1827] 2525 2525 2525 2524 2524 ...
#> $ trend_upper : num [1:1827] 2525 2525 2525 2524 2524 ...
#> $ yhat : num [1:1827] 2297 2485 2438 2411 2572 ...
#> - attr(*, "row.names")= int [1:1827] 1 2 3 4 5 6 7 8 9 10 ...
#> - attr(*, "class")= chr "proph"
makeModel.proph <- function(obj) {
score = scores(obj)
return(list(preds = obj$yhat, ul = obj$yhat_upper, ll = obj$yhat_lower,
ase = score[1], mape = score[2], Conf.Score = score[3]))
}
models$prophet <- makeModel(pfore)
Considering the nature of our data, and our background knowledge of central heating in China, both of these multiseasonal models seem to be more or less appropriate. They both handle this seasonality in a similar manner, and both produced a very reasonable forecast.
Now it is time for us to dive into the hopefully even more powerful multivariate models. Before we do so, we are going to need to redefine our train and test sets, which means we are going to have to write some more functions. To avoid confusing ourselves, let's go ahead and clear away the existing train and test sets:
rm(train, test)
First, we are now going to have to convert our non time series (categorical and index) data to daily, which means we will have to find a new method of resampling it. For that, we are going to find the most common categorical value in each day and take that. We will use the .mode function, defined below:
.mode <- function(x) {
ux <- unique(x)
ux[which.max(tabulate(match(x, ux)))]
}
.mode(c(rep("cat", 4), rep("dog", 8), rep("moose", 1)))
#> [1] "dog"
Next we are going to write a function which lumps the categorical variables into groups of each day, takes the .mode of each group, and returns a new vector in that format.
For the lumping, we will use the %/% (integer division) operator, which returns how many whole times one number divides into another. For instance:
((1:50) - 1)%/%5
#> [1] 0 0 0 0 0 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4 5 5 5 5 5 6 6 6 6 6
#> [36] 7 7 7 7 7 8 8 8 8 8 9 9 9 9 9
Then, we will use the glorious tapply function, which does the same thing as the apply family, except it applies the function onto a grouped variable (table apply).
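As a toy illustration (made-up values): six "hourly" wind directions, grouped into blocks of three and reduced to the mode of each block.
winds <- c("NE", "NE", "SE", "SE", "SE", "NE")
tapply(winds, (seq_along(winds) - 1)%/%3, .mode)
# expected: "NE" for the first block, "SE" for the second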
.dailyMode <- function(xs) {
out <- tapply(xs, (seq_along(xs) - 1)%/%24, .mode)
out <- unname(out)
out[1:1826] # keep only the days covered by our windowed (end = 6) series, since we cleaned away the extra days
}
Next we are going to write a function that applies this (and our daily time series conversion) onto a list of time series objects and categorical, and spit out a data frame:
dlist <- function(xs) {
tsbj <- xs %>>% purrr::keep(is.ts)
tsnobj <- xs %>>% purrr::discard(is.ts)
out <- append(lapply(tsbj, function(x) window(cleandays(x), end = 6)), lapply(tsnobj,
.dailyMode))
as.data.frame(out)
}
Finally, we are going to write a function that creates a train set and a test set as data frames of time series and categorical variables. Note the use of the <<- superassignment operator. This causes the value to be assigned in the enclosing environment, so for a function defined at the top level, the value ends up in the global environment:
dfsplit <- function(df, n = 5) {
index <- length(window(df[[1]], end = n))
tsdf <- df %>% purrr::keep(is.ts)
nodf <- df %>% purrr::discard(is.ts)
outs <- lapply(tsdf, function(x) window(x, end = n))
outn <- nodf[1:index, ]
train <<- as.data.frame(append(outs, outn))
tests <- lapply(tsdf, function(x) window(x, start = n))
test <<- as.data.frame(append(tests, nodf[(index):nrow(nodf), ]))
}
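As a tiny illustration of the superassignment used above (a toy example, unrelated to our data):
counter <- 0
bump <- function() {
counter <<- counter + 1  # assigned in the enclosing environment, here the global one
}
bump()
counter  # now 1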
Note we have to do things in this somewhat weird way, because otherwise it is extremely difficult to preserve the ts class, which we would like our time series data to keep. Just cbinding or anything like that breaks it immediately.
First, let's load up our data:
bj <- (china$BeijingPM_)
bj2 <- dlist(bj)
dfsplit(bj2)
Before we begin our analysis, it is definitely important to do more EDA, as we have a whole lot more information. First, let's plot all the time series in the training data frame:
library(tidyverse)
plotAllTs <- function(df) {
df %>% keep(is.ts) %>% gather() %>% mutate(.data = ., t = rep(1:nrow(df),
(nrow(.)/nrow(df)))) %>% ggplot(aes(x = t, y = value)) + facet_wrap(~key,
scales = "free") + geom_line() + theme_hc() + theme(axis.text.y = element_blank())
}
plotAllTs(train)
Already, we see some interesting stuff. First of all, we probably want to drop the other stations observing particulates from our analysis, as they are not very interesting. Of all the time series here, humidity appears to have a trend pattern similar to our target. Pressure looks similar but a bit off, so maybe it is interesting at a high lag. Dewpoint has the wrong period, and lprec and precipitation are the same but hard to glean anything out of.
Let's see if the wind direction is of any use to us:
ggplot(train) + geom_point(aes(x = No, y = PM_US.Post, color = cbwd)) + geom_line(aes(x = No,
y = PM_US.Post, alpha = 0.1)) + geom_smooth(aes(x = No, y = PM_US.Post),
method = "auto") + theme_hc() + scale_color_hc()
Just upon a glimpse, this is not that useful. Let's look at a table:
train %>% arrange(cbwd) %>% group_by(cbwd) %>% summarise(mean = mean(PM_US.Post))
It looks like it is an interesting factor to some degree, but at this scale, with our mode technique, it is probably strongly distorted. Still, it is worth a look. We will note this and examine it when we are building our models.
plot_vs_response <- function(x) {
plot(train$PM_US.Post ~ train[[x]], xlab = x)
lw1 <- loess(train$PM_US.Post ~ train[[x]])
j <- order(train[[x]])
lines(train[[x]][j], lw1$fitted[j], col = "red", lwd = 3)
}
numNames <- train %>% purrr::keep(is.numeric) %>% names
numNames <- numNames[-c(1:4, 11:17)]
par(mfrow = c(2, 3))
lapply(numNames, plot_vs_response)
So all of these appear to be fairly interesting, I think. However, the summing technique we performed does not make these plots easily interpretable. Mathematically, however, they likely still work out, as it is the relationship that matters. When we do linear regression, whether we use the total temperature sum or the mean (sum/n) should not make a difference in the significance of the relationships, only in the slopes. This means we should be ok. Precipitation does not look useful.
Let's look at the cross correlation between all the variables next:
ppm <- train$PM_US.Post
ccfplot <- function(x) {
ccf(ppm, train[[x]], main = x)
}
ccfs <- lapply(numNames, ccfplot)
It appears that these variables have the most cross correlation at low lags, other than temperature, which seems to have a bigger CCF at a lag pretty far from zero.
Next, to further our EDA, let's perform a cluster analysis of the time series data.
There are two types of clustering techniques: Subsequence clustering, and multivariate clustering. Thankfully for us, it has been definitively shown that subsequence clustering (mostly used for anomaly detection, but also for determining if we should break a time series into multiple time series for forecasting) is completely useless. However, multivariate time series clustering is still a valid and exciting technique.
Multivariate time series clustering is commonly used in the world of quantitative finance, to determine which stocks react to which other stocks. By finding out whether stock X is associated with stock Y, traders can do things to make money. With further research, they can determine the nature of the relationship between stock X and stock Y. If stock X grows with stock Y, they can time their investments so that when they see stock Y grow, they do something that makes them money with stock X. If the relationship is negative, they can bet on stock Y (which is growing) and short stock X.
We can attempt to use this clustering to get a better idea of how our time series are associated with each other.
Hierarchical clustering works in one of 2 ways:
Each item is given its own group. Groups are compared by a similarity metric. Good groups are merged. This is tried over and over until the clusters do not get better.
All the items are put in a group. The group is divided until it cannot be divided more without messing up the clusters
Partitional clustering works more or less the opposite of hierarchical clustering. With partitional clustering, you specify the number of groups, and it then solves the problem of maximizing inter cluster distance while minimizing intra cluster distance. An example of this is k-means.
Fuzzy clustering is similar to partitional clustering, except instead of hard clusters, they are soft or fuzzy. This means that items can be in multiple clusters, but more in one cluster than another. An example of this is c-means.
Dynamic time warping, or DTW, finds the best way to warp two time series to line up with each other over time. It creates a matrix with dimensions m*n, where m and n are the lengths of two different time series. It does this for every possible pair. Then, for each matrix, it finds a path of indices which has the minimum number of steps and costs the least to line the two time series up. The distance measure is then computed based on the number of steps it took to line the two up, weighted by cost. For more info please see the aforementioned paper.
Shape based distance (SBD) is more or less a cheaper version of DTW. It takes the cross correlation between the two time series (with normalized coefficients), makes a series out of that, Fourier transforms it, and calculates the distance based off of that Fourier series.
A global alignment kernel is another powerful distance measure (albeit a really computationally expensive one). Instead of calculating a matrix of costs and then running through that (relatively cheap), it calculates a set of all possible alignments and minimizes over that, and then returns a similarity measure based on that alignment. It's a bit complex, but the aforementioned paper explains it nicely.
Shape extraction works very similarly to SBD. It picks a random time series as a centroid, and uses SBD to match it to the other time series optimally. As mentioned in the paper, it uses an eigenvector of a matrix made with SBD-aligned series.
The DBA centroid is represented by the average of a bunch of points grouped by their DTW alignments. It selects a random centroid series, and with each iteration, it calculates the DTW alignment between each series in the cluster and the centroid. This is repeated until convergence is met.
A medoid is just a time series whose average distance to all the other objects in the cluster is the smallest (and which is an element of the original data). The cluster is then partitioned around the medoids iteratively. In our case, a given number of series are chosen as initial centroids. Then all the distances are calculated, and clusters are created based off of those distances. Then, the series closest to all the other time series in each new cluster is picked as a medoid, and this is repeated until convergence.
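A small sketch of what these distance measures look like in dtwclust, on two toy series (hypothetical values):
x <- sin(seq(0, 2 * pi, length.out = 50))
y <- sin(seq(0, 2 * pi, length.out = 50) + 0.5)  # same shape, shifted in phase
dtwclust::dtw_basic(x, y, window.size = 5)       # DTW distance with a warping window
dtwclust::SBD(x, y)$dist                         # shape-based distance (cross-correlation based)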
Now that we have a basic understanding of how time series clustering works, we can try it out.
library(dtwclust)
traints <- as.list(keep(train, is.ts))
Now that we have our data and libraries loaded, let's set ourselves up for parallel computation:
# alter for the appropriate number of cores for you; my computer has 12, but
# I have a lot of tabs open. Define the cluster
workers <- makeCluster(10L)
# load the necessary libraries onto the cluster
invisible(clusterEvalQ(workers, library(dtwclust)))
# register the cluster for parallel computation
registerDoParallel(workers)
As we are new to this, let's set ourselves up for a grid search over a bunch of clusterings. We will first check out partitional clustering. For this, we want to use the DTW distance metric, but also check out the DBA centroid. We will also play with the normalization methods, and the time window used in alignment. We will also try out 2-6 clusters.
cfg <- compare_clusterings_configs(types = "partitional", k = 2L:6L, controls = list(partitional = partitional_control(iter.max = 40L)),
distances = pdc_configs("distance", partitional = list(dtw_basic = list(window.size = seq(from = 5L,
to = 50L, by = 5L), norm = c("L1", "L2")))), centroids = pdc_configs("centroid",
share.config = c("p"), dba = list(window.size = seq(from = 5L, to = 50L,
by = 5L), norm = c("L1", "L2"))), no.expand = c("window.size", "norm"))
For evaluation, as we do not have labeled data, we will use silhouette validation as our metric.
evaluators <- cvi_evaluators("Sil")
Let's find the best cluster now:
comparison <- compare_clusterings(traints, types = "partitional", configs = cfg,
seed = 666L, score.clus = evaluators$score, pick.clus = evaluators$pick)
stopCluster(workers)
registerDoSEQ()
comparison$results$partitional %>% arrange(desc(Sil)) %>% head
Well, there we have it. We can explore the results more ourselves; the most important part is that we found two groups.
We can conveniently rerun the best cluster with the repeat_clustering function:
clusters <- repeat_clustering(traints, comparison, comparison$pick$config_id)
plot(clusters)
clusts <- clusters@cluster
names(clusts) <- names(traints)
data.frame(clusts)
So it looks like air pressure is on its own. Let's see if hierarchical clustering can give us a better clustering.
Let's try out a bunch of different hierarchical clusterings. We should check out different preprocessing methods, different control methods, all the possible centroids, the GAK, DTW, and SBD distances, and different window sizes for DTW.
cfg2 <- compare_clusterings_configs(types = "h", k = 2L:6L, controls = list(hierarchical = hierarchical_control(method = "all")),
preprocs = pdc_configs("preproc", none = list(), zscore = list()), centroid = pdc_configs("centroid",
shape_extraction = list(), default = list(), dba = list()), distances = pdc_configs("distance",
hierarchical = list(dtw_basic = list(window.size = seq(from = 5L, to = 50L,
by = 5L), norm = c("L1", "L2")), gak = list(), sbd = list())), no.expand = c("window.size",
"norm"))
evaluators <- cvi_evaluators("Sil")
Note we will use silhouettes for consistency
comparison2 <- compare_clusterings(traints, types = "hierarchical", configs = cfg2,
seed = 666L, score.clus = evaluators$score, pick.clus = evaluators$pick)
Looks like GAK (as the distance) and DBA (as the centroid) absolutely won. Let's visualize the results:
comparison2$results$hierarchical %>% arrange(desc(Sil))
comparison2$results$hierarchical %>% arrange(desc(Sil), distance) %>% ggplot(aes(fill = distance,
y = Sil, x = distance)) + geom_boxplot() + facet_wrap(centroid ~ .) + scale_fill_hc() +
theme_hc()
So it looks like GAK outperformed DTW, which outperformed SBD. From centroid to centroid, it really didn't matter.
cluster2 <- repeat_clustering(traints, comparison2, comparison2$pick$config_id)
plot(cluster2)
plot(cluster2, type = "sc")
At this point, it is obvious that our results are not going to change with more work. We have two clusters: pressure in one and everything else in the other.
With our clustering, we discovered a few interesting things. Firstly, it appears that by all metrics, air pressure is really different from the rest of the series. Whether that is a good or a bad thing remains to be seen, but regardless, it is interesting. We will have to pay attention to it as a predictor.
We also need to note how strongly humidity is clustered with our PM2.5 content. This hopefully suggests it will be another strong predictor variable. However, it is important to keep in mind that all our time series are clustered pretty closely; the largest distance is only 0.015. So our clustering may not have produced any important results, but at the very least we learned something.
We will now proceed with our multivariate analysis.
Our first multivariate technique is going to be VAR (vector autoregression). One of the interesting properties of VAR is what it implies about the data: VAR assumes that our time series are endogenous, that is, that they all affect each other. With our weather patterns, this makes sense. Weather is a complex, dynamic system full of feedback cycles, so it is natural that all the weather variables are endogenous to each other. However, let us discuss how our air particulate content relates to the weather, to determine whether a VAR model is physically appropriate.
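To make that concrete, a VAR(p) model for our vector of \(k\) series \(\mathbf{y}_t\) has the form
\[
\mathbf{y}_t = \mathbf{c} + A_1 \mathbf{y}_{t-1} + A_2 \mathbf{y}_{t-2} + \cdots + A_p \mathbf{y}_{t-p} + \boldsymbol{\varepsilon}_t,
\]
where each \(A_i\) is a \(k \times k\) coefficient matrix, so every series is regressed on the past of every other series. That is exactly the endogeneity assumption we need to justify below.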
In this section we will discuss how each of our weather variables interplays (or doesn't) with our air quality.
Humidity has a strong effect on air quality. On a humid day, water in the air will "stick" to air particulates. This weighs them down and makes them much larger, causing them to hang around in one place for longer.
It is also important to note that this has an effect on temperature. When water sticks to air pollution, energy from the sun reflects off of the water, imparting a bit of energy to it and scattering the sunlight, which adds to the hazy appearance when there is a lot of pollution. Over time, due to water's high specific heat, this should cause the air to heat up, as the small amount of extra energy stored in the water attached to the pollution is transferred to the air. (Note this should also slowly affect pressure, because \(P \propto T\).)
Surprisingly, air quality also has a positive feedback with humidity. This is due to the "sticking" effect we discussed earlier: heavy air particulate matter with water attached causes other air particles to stick around too. You can actually demonstrate this with an experiment.
Supplies
An aerosol, such as hairspray
A jar with a lid
Ice
Warm water
Take the warm water and put it in your jar. Put the lid on upside down with ice in it for about 30 seconds (this cools the air at the top of the jar so the water vapor is close to condensing, i.e., the relative humidity rises).
Now, lift the ice lid up, spray a tiny bit of your aerosol in the jar, and quickly replace the ice lid.
Then, watch a cloud form inside the jar.
You can do this experiment with any air particulate, for example smoke from a match also works nicely.
It is safe to say that a bidirectional VAR forecast of humidity is appropriate.
Temperature has a clear effect on air quality. Not only does cold air cause air particulates to stick around more (they move more slowly), it also means more condensed water droplets in the air, whose effect we already know about.
Similarly, in the winter, temperature inversions can occur when the air higher up gets heated more than the air near the surface (where we live). Normally the air convects because it is warmer near the ground than aloft, but in the winter, with the long nights, the opposite can occur. This is a temperature inversion. It shuts the convection down and leaves the air trapped in place, which traps the pollutants and causes them to accumulate.
The air quality + humidity should have a slight effect on temperature, but global weather patterns are more powerful
It is appropriate to say that temperature affects air quality, but only slightly appropriate to say the opposite
Dew point is simply a function of humidity and temperature, so this variable will not be included in our model; adding it would be redundant and invite overfitting.
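For reference only (this is the standard Magnus approximation, not something we will put in the model), dew point can be written in terms of temperature \(T\) (in °C) and relative humidity \(RH\) (in %) as
\[
T_d \approx \frac{b\,\gamma}{a - \gamma}, \qquad \gamma = \ln\!\left(\frac{RH}{100}\right) + \frac{a\,T}{b + T},
\]
with constants roughly \(a \approx 17.27\) and \(b \approx 237.7\,^{\circ}\mathrm{C}\); this is why keeping dew point alongside humidity and temperature adds no new information.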
Wind speed has a complex relationship with air pollution. When it is windy, pollutants can be carried thousands of miles; when it is still, they can accumulate locally. So it depends, but wind does have an effect. Heavy particulate matter may also slow the wind down, but that is unclear and unproven. Overall, it is somewhat appropriate to include wind speed in a bidirectional forecast.
Given the complex relationship between pressure, temperature, humidity, and wind speed, which are all somewhat related to air quality in interesting ways, it is safe to include pressure in our VAR models as well
No matter how much I think about it, I cannot say that you can predict surface air quality with precipitation or vice versa, so including precipitation is likely inappropriate.
We need three S3 methods here, an as method, an autoplot method, and a scores method:
as.var <- function(x) structure(x, class = "var")
autoplot.var <- function(obj) {
us <- obj$fcst$PM_US.Post
testdf <- data.frame(type = "actual", t = seq_along(test$PM_US.Post), ppm = as.numeric(test$PM_US.Post))
preddf <- data.frame(type = "predicted", t = seq_along(test$PM_US.Post),
ppm = (obj$fcst$PM_US.Post[, 1]))
dfl <- list(testdf, preddf)
confdf <- data.frame(t = seq_along(test$PM_US.Post), upper = us[, 3], lower = us[,
2])
.testPredPlot(dfl) + geom_line(data = confdf, aes(x = t, y = lower, alpha = 0.2),
linetype = 3003) + geom_line(data = confdf, aes(x = t, y = upper, alpha = 0.2),
linetype = 3003) + guides(alpha = FALSE)
}
scores.var <- function(obj) {
    mape <- MAPE(obj$fcst$PM_US.Post[, 1], test$PM_US.Post)
    ase <- ASE(obj$fcst$PM_US.Post[, 1], test$PM_US.Post)
    us <- obj$fcst$PM_US.Post
    conf <- confScore(upper = us[, 3], lower = us[, 2], test$PM_US.Post)
c(ASE = ase, MAPE = mape, Conf.Score = conf)
}
First, let's select the possible orders for our VAR forecast.
train[1:3] <- NULL
train[9:14] <- NULL
test[1:3] <- NULL
test[9:14] <- NULL
traints <- (purrr::keep(train, is.ts))
library(vars)
# simple order
ord <- VARselect(traints, lag.max = 20, type = c("both"))
# Seasonal order
ords <- VARselect(traints, lag.max = 20, type = "both", season = c(7, 365))
# Another order
ordn <- VARselect(traints, lag.max = 20)
Let's check out what the orders are:
data.frame(rbind(ord$selection, ords$selection, ordn$selection), row.names = c("ord",
"ords", "ordn"))
Next, let's pick the proper parameters with a grid search. First, let's build the grid:
types <- c("const", "trend", "both", "none")
ics <- c("AIC", "HQ", "SC", "FPE")
ps <- c(8, 3, 1)
szn <- c(NULL, 365)  # note: c() drops NULL, so this collapses to just 365 and only the seasonal case is searched
init <- list(ps = ps, types = types, ics = ics, szn = szn)
grid <- expand.grid(init, stringsAsFactors = FALSE)
Then let's set ourselves up for parallel processing:
workers <- makeCluster(11L)
invisible(clusterEvalQ(workers, library(vars)))
registerDoParallel(workers)
Finally, let's run the grid search:
gridSearch <- foreach(i = 1:nrow(grid), .combine = rbind) %dopar% {
model <- VAR(traints, p = grid[i, 1], type = grid[i, 2], ic = grid[i, 3],
season = grid[i, 4])
preds <- predict(model, n.ahead = 366)
pred <- preds$fcst$PM_US.Post[, 1]
ASE <- mean((test$PM_US.Post - pred)^2)
c(ASE = ASE, p = grid[i, 1], type = grid[i, 2], ic = grid[i, 3], season = grid[i,
4])
}
stopCluster(workers)
registerDoSEQ()
Let's check out the results:
gridSearch <- gridSearch %>% as.data.frame
gridSearch$ASE <- as.numeric(gridSearch$ASE)
gridSearch[2:4] <- lapply(gridSearch[2:4], factor)
gridSearch %>% arrange(ASE)
Next, let's visualize them:
library(latticeExtra)
gridV <- (cloud(ASE ~ p + type, gridSearch, panel.3d.cloud = panel.3dbars, scales = list(arrows = F,
col = 1), xbase = 0.2, ybase = 0.2, pretty = T, col.facet = level.colors(gridSearch$ASE,
at = do.breaks(range(gridSearch$ASE), 20), col.regions = cm.colors), colorkey = list(col = cm.colors,
at = do.breaks(range(gridSearch$ASE), 20)), screen = list(z = 40, x = -30)))
gridV
We will now go through and select the proper parameters for the model by calculating the ASE of each VAR prediction, looking at the summary of a linear model (to see which regressors are significant and help us pick), and keeping our appropriateness assessment in mind:
var1 <- VAR(traints, p = 3, type = "none", ic = "AIC", season = 365)
preds <- predict(var1, n.ahead = 366)
pred <- preds$fcst$PM_US.Post[, 1]
mean((test$PM_US.Post - pred)^2)
#> [1] 3748146
plot(preds)
thelm <- summary(lm(data = traints, PM_US.Post ~ .))
trainsimp <- traints %>% select(PM_US.Post, HUMI, DEWP, TEMP, precipitation,
PRES, Iws)
var2 <- VAR(trainsimp, p = 3, type = "none", ic = "AIC", season = 365)
preds <- predict(var2, n.ahead = 366)
pred <- preds$fcst$PM_US.Post[, 1]
mean((test$PM_US.Post - pred)^2)
#> [1] 3747685
trainsimp2 <- traints %>% select(PM_US.Post, HUMI, TEMP, precipitation, PRES,
Iws)
var3 <- VAR(trainsimp2, p = 3, type = "none", ic = "AIC", season = 365)
preds <- predict(var3, n.ahead = 366)
pred <- preds$fcst$PM_US.Post[, 1]
mean((test$PM_US.Post - pred)^2)
#> [1] 3746216
trainsimp3 <- traints %>% select(PM_US.Post, HUMI, TEMP, PRES, Iws)
var4 <- VAR(trainsimp3, p = 3, type = "none", ic = "AIC", season = 365)
preds <- predict(var4, n.ahead = 366)
pred <- preds$fcst$PM_US.Post[, 1]
mean((test$PM_US.Post - pred)^2)
#> [1] 3745787
plot(preds)
We have now reached a point where we cannot improve our model any further (trust me, I tried). Let's remember these variables for the rest of our multivariate analysis.
vpred <- as.var(preds)
autoplot(vpred)
data.frame(t(scores(vpred)))
Objectively, this model performed worse than the ARMA/mean baseline model. However, visually, it was only off by a bit. It seems to have gotten the period a bit wrong, while it is accurate in scale. It also has a few issues with the shape, but the forecast does not look unreasonable. This may be an interesting model in our bagged results.
makeModel.var <- function(obj) {
score = scores(obj)
us <- obj$fcst$PM_US.Post
return(list(preds = us[, 1], ul = us[, 3], ll = us[, 2], ase = score[1],
mape = score[2], Conf.Score = score[3]))
}
models$var <- makeModel(vpred)
We will try two different neural networks: a neural network autoregression (nnetar) model and a long short-term memory (LSTM) model.
The nnetar model is a simple artificial neural network (ANN). For this we will use the nnetar function from the forecast library. It is still an ANN, just like the ones from nnfor, but it is significantly less customizable. However, with this much data, nnfor is slow: it takes over 7 hours to produce a single model on our data. Since I plan on eventually deploying this model, and am not going to spend months tuning a slow nnfor model to find the right hyperparameters (with that slowness, I would effectively be using near-default settings anyway), there is no benefit to the slower, more customizable option.
An LSTM model is much more complex than the nnetar model, but will hopefully give us a better view of the data. We will explore both in depth in the following sections.
For the first attempt at neural network forecasting, we will be using nnetar. nnetar is basically a less flexible implementation of the functions in nnfor; what it lacks in flexibility it makes up for with extreme speed. We will have our slow model in the next section. For a discussion comparing nnfor and nnetar, please see this link.
Basically, the nnetar function takes in your data and properly scales it for the neural network (we will discuss scaling extensively in the next section, do not worry). After scaling, it calculates the AR order of the time series, as well as the lags to use. Then it passes your scaled series, along with everything it calculated and the adjustments it made, into the nnet function from the nnet library. Then you have a model and can make predictions. It is very user friendly, and very, very fast.
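To make the interface concrete, here is a minimal, self-contained sketch on the built-in AirPassengers series (not our data), with the lags and hidden-layer size specified by hand rather than chosen automatically; the specific values are purely illustrative:
library(forecast)
# p lagged inputs, P seasonal lags, `size` hidden nodes; nnetar scales the
# inputs itself and averages `repeats` networks
toy_fit <- nnetar(AirPassengers, p = 12, P = 1, size = 8, repeats = 50)
toy_fc <- forecast(toy_fit, h = 24)
plot(toy_fc)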
We will use standard, unremarkable coercion, autoplotting, and scoring methods for the neural net forecasts.
as.nfor <- function(x) structure(x, class = "nfor")
scores.nfor <- function(obj) {
mape <- MAPE(obj$mean, test[[1]])
ase <- ASE(obj$mean, test[[1]])
confs <- confScore(upper = obj$upper[, 2], lower = obj$lower[, 2], test[[1]])
c(ASE = ase, MAPE = mape, Conf.Score = confs)
}
autoplot.nfor <- function(obj) {
testdf <- data.frame(type = "actual", t = seq_along(test[[1]]), ppm = as.numeric(test[[1]]))
preddf <- data.frame(type = "predicted", t = seq_along(test[[1]]), ppm = as.numeric(obj$mean))
confdf <- data.frame(t = seq_along(test[[1]]), upper = obj$upper[, 2], lower = obj$lower[,
2])
dfl <- list(testdf, preddf)
.testPredPlot(dfl) + geom_line(data = confdf, aes(x = t, y = lower, alpha = 0.2),
linetype = 3003) + geom_line(data = confdf, aes(x = t, y = upper, alpha = 0.2),
linetype = 3003) + guides(alpha = FALSE)
}
This is much simpler and faster than the nnfor mlp and elm functions, and it needs no significant tinkering or expertise. Let's check it out.
train <- traints %>% dplyr::select(PM_US.Post, HUMI, TEMP, PRES, Iws)
nnet <- nnetar(train[[1]], xreg = train[-1], repeats = 1000)
Let's check out what the model did:
data.frame(summary(nnet))
Interesting. Let's look more closely at the model and the method to see what orders it chose for our data and what else is going on.
nnet$model
#>
#> Average of 1000 networks, each of which is
#> a 32-16-1 network with 545 weights
#> options were - linear output units
nnet$method
#> [1] "NNAR(27,1,16)[365]"
Pretty simple: 32 input nodes, 16 nodes in the hidden layer, collapsed into a single output node. We can also see the details of our model. It is an NNAR(27,1,16)[365], read as a neural network autoregression with 27 lagged inputs, 1 seasonal lag (at period 365), and 16 nodes in the hidden layer; those 27 + 1 lags plus our 4 external regressors account for the 32 input nodes.
The nnet models do not have the same issue as the other forecast models, so we can forecast without an s3 method
test <- test %>% dplyr::select(PM_US.Post, HUMI, TEMP, PRES, Iws)
nnetfor <- forecast(nnet, h = 366, xreg = test[-1], PI = TRUE)
summary(nnetfor)
#>
#> Forecast method: NNAR(27,1,16)[365]
#>
#> Model Information:
#>
#> Average of 1000 networks, each of which is
#> a 32-16-1 network with 545 weights
#> options were - linear output units
#>
#> Error measures:
#> ME RMSE MAE MPE MAPE MASE
#> Training set -2.89273 255.8209 175.6316 -7.769946 13.20826 0.09239913
#> ACF1
#> Training set 0.04863066
#>
#> Forecasts:
#> Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
#> 5.002740 1339.20550 1011.85824 1670.6711 857.642823 1851.1271
#> 5.005479 1072.26691 746.07991 1422.5409 584.665921 1599.8374
#> 5.008219 3281.69830 2927.27151 3661.5466 2708.337736 3880.4528
#> 5.010959 3064.48096 2706.95320 3471.8127 2517.434651 3687.0782
#> 5.013699 4556.19267 4189.38914 4953.8422 4007.710172 5168.3828
#> 5.016438 3650.69394 3315.96512 4007.2461 3137.514323 4199.3474
#> 5.019178 5060.29640 4675.27418 5464.4144 4483.506858 5634.0865
#> 5.021918 4647.92472 4270.49016 5017.1308 4054.253112 5210.0362
#> 5.024658 1095.72828 750.81122 1451.8302 513.228930 1657.7528
#> 5.027397 124.64159 -222.10335 469.2862 -391.302772 625.3250
#> 5.030137 947.99981 615.11889 1302.5336 407.006327 1497.3995
#> 5.032877 1459.87298 1107.92088 1812.9981 922.349359 1969.3696
#> 5.035616 1156.11511 823.42339 1506.7453 641.718666 1686.7778
#> 5.038356 2389.15683 2011.13576 2764.0454 1822.154157 2931.1461
#> 5.041096 2175.54803 1806.60137 2594.0679 1615.215991 2819.8103
#> 5.043836 2671.33788 2295.61226 3030.5593 2130.184948 3225.6743
#> 5.046575 4261.25416 3902.64144 4625.4366 3683.159259 4872.1414
#> 5.049315 2181.27793 1809.85389 2545.2824 1604.471911 2716.9366
#> 5.052055 1016.44815 660.43977 1381.5785 454.538837 1547.9646
#> 5.054795 1663.48973 1302.77972 2044.3881 1109.557537 2278.9433
#> 5.057534 208.40978 -142.04538 545.1323 -315.604080 715.0302
#> 5.060274 848.87614 500.31863 1225.4736 333.342421 1420.5820
#> 5.063014 1927.36826 1563.02717 2267.4899 1404.887231 2491.4851
#> 5.065753 3503.33876 3129.28113 3921.3180 2921.573128 4137.8058
#> 5.068493 2813.73489 2407.02876 3237.3838 2204.989494 3422.5302
#> 5.071233 1322.99016 971.74984 1726.1907 769.804229 1923.3792
#> 5.073973 1145.10358 824.07322 1540.6060 601.212583 1750.9945
#> 5.076712 1663.36460 1347.98974 2067.6659 1172.791778 2231.0488
#> 5.079452 1077.29895 756.78786 1477.9547 566.761966 1664.8647
#> 5.082192 4141.80219 3738.53213 4521.4711 3540.273216 4761.4195
#> 5.084932 3632.14728 3190.84333 3977.7264 2935.709522 4181.4765
#> 5.087671 3964.10321 3492.11418 4344.6998 3263.879611 4555.4213
#> 5.090411 4669.97800 4257.46413 5036.1584 4022.918534 5233.0617
#> 5.093151 1936.76346 1583.72951 2298.3926 1398.180466 2469.3966
#> 5.095890 410.95151 88.09599 771.7577 -111.312182 955.9518
#> 5.098630 -62.73875 -382.47534 300.6455 -547.047881 454.4265
#> 5.101370 810.78421 458.00350 1215.3599 305.885308 1383.6918
#> 5.104110 3470.48150 3087.87975 3920.4471 2875.634587 4155.4488
#> 5.106849 2442.99958 2076.17785 2789.8361 1929.506348 2971.2965
#> 5.109589 2082.90088 1705.24424 2483.9142 1501.050425 2683.3213
#> 5.112329 441.12394 131.10675 813.5223 -68.826925 1002.2304
#> 5.115068 306.31884 -34.49080 714.0193 -231.241729 882.8351
#> 5.117808 1960.02718 1647.71565 2380.8029 1415.940250 2584.5935
#> 5.120548 1459.95121 1128.99371 1857.2987 908.518207 2043.9908
#> 5.123288 1025.59762 704.60523 1392.1999 540.178374 1545.0355
#> 5.126027 4054.44992 3603.51958 4486.5424 3386.661920 4732.5465
#> 5.128767 5417.71484 4949.64338 5851.5731 4689.720706 6081.0369
#> 5.131507 4235.95144 3832.32488 4633.0501 3604.510969 4851.2300
#> 5.134247 2109.10927 1726.73953 2499.3963 1555.246455 2671.7370
#> 5.136986 1020.04493 653.92133 1355.6565 463.475785 1558.8091
#> 5.139726 1567.26470 1204.87683 1966.4314 990.121550 2186.7936
#> 5.142466 2518.59917 2179.16718 2903.7867 1982.700803 3133.0654
#> 5.145205 4506.68941 4109.31942 4947.7197 3901.069447 5185.4595
#> 5.147945 5030.82192 4560.24225 5490.0893 4331.399721 5711.9559
#> 5.150685 4432.95967 4052.90363 4815.1828 3862.227863 5001.6398
#> 5.153425 5134.79042 4741.66382 5519.2959 4530.918703 5740.8619
#> 5.156164 5384.02305 4985.75904 5755.3968 4770.046906 6017.1330
#> 5.158904 5603.69168 5159.95782 5975.5308 4927.415169 6203.5252
#> 5.161644 1294.01429 950.86114 1633.6417 765.351751 1822.9615
#> 5.164384 2280.72214 1905.94676 2625.7322 1722.153138 2840.1198
#> 5.167123 1601.45004 1235.52762 2000.3932 1026.221812 2170.2820
#> 5.169863 2308.98310 1950.09748 2730.1708 1732.364976 2965.1794
#> 5.172603 3726.75793 3345.07384 4202.7025 3175.893430 4417.4248
#> 5.175342 893.70690 582.65168 1259.5642 416.997036 1427.9778
#> 5.178082 233.70448 -77.57567 644.9516 -259.268045 781.9006
#> 5.180822 311.94033 -24.17591 670.2967 -188.719375 852.7438
#> 5.183562 812.27307 469.07221 1171.8810 255.413694 1338.3329
#> 5.186301 2334.04013 1980.95639 2730.1262 1769.034796 2963.3056
#> 5.189041 2151.60602 1771.76291 2559.5405 1577.742455 2764.8018
#> 5.191781 2308.40678 1931.10943 2702.0562 1743.756538 2931.1690
#> 5.194521 2928.90131 2563.41828 3355.0956 2375.885053 3540.9891
#> 5.197260 1249.52145 904.64868 1644.4759 693.399949 1823.2122
#> 5.200000 156.00057 -149.71626 557.9509 -308.989650 771.2694
#> 5.202740 -81.52132 -425.26754 276.3496 -584.993072 485.6870
#> 5.205479 874.02715 478.12864 1245.6138 304.661172 1422.5047
#> 5.208219 884.32307 532.27847 1264.0464 354.544948 1482.9091
#> 5.210959 747.76019 425.63863 1137.5560 210.443257 1306.2489
#> 5.213699 -242.98104 -565.16321 144.1200 -764.884046 339.3119
#> 5.216438 674.25798 323.41272 1087.3260 136.063630 1260.0074
#> 5.219178 222.81910 -93.27160 586.2539 -308.055533 782.2805
#> 5.221918 418.12692 62.78139 806.9649 -108.536338 994.0245
#> 5.224658 545.02480 176.98451 953.4324 6.694831 1154.2573
#> 5.227397 623.61224 260.51548 1038.9239 56.978698 1234.9255
#> 5.230137 1822.18953 1457.12427 2266.9873 1246.892348 2475.8278
#> 5.232877 2438.19606 2073.60318 2832.5300 1882.557348 3015.0050
#> 5.235616 2787.94290 2410.14350 3202.3436 2199.453351 3419.6071
#> 5.238356 2315.26427 1956.58656 2844.5040 1724.921068 3081.0693
#> 5.241096 2115.61873 1743.94240 2601.0031 1510.689205 2804.2377
#> 5.243836 1826.45685 1442.17046 2278.2382 1219.988528 2470.0078
#> 5.246575 1144.14454 743.79127 1555.6671 557.189995 1740.0962
#> 5.249315 1036.05501 633.11849 1389.9824 442.242324 1572.1173
#> 5.252055 1304.03347 938.87441 1725.7057 760.699543 1919.8415
#> 5.254795 1075.40623 727.24414 1502.5849 537.226158 1702.0793
#> 5.257534 238.41204 -101.64055 667.5157 -306.593445 845.0439
#> 5.260274 377.15783 34.44831 773.4903 -120.605441 978.6424
#> 5.263014 452.94937 100.55160 818.4850 -80.511047 1014.2798
#> 5.265753 834.19213 433.10893 1180.4150 226.072713 1353.5598
#> 5.268493 1360.80626 990.62160 1750.3929 755.726722 1969.4416
#> 5.271233 1872.91892 1486.51887 2316.0928 1260.549579 2533.7550
#> 5.273973 1802.24412 1440.76821 2196.8817 1231.938581 2393.9921
#> 5.276712 1572.49300 1135.02572 1984.0423 900.215339 2165.0096
#> 5.279452 1542.95774 1133.98921 1976.9861 934.700267 2178.3280
#> 5.282192 1761.25633 1306.42296 2210.9993 1095.806728 2461.5882
#> 5.284932 1648.92966 1216.34580 2080.9157 1041.212487 2292.9794
#> 5.287671 1402.71516 1026.78813 1875.2363 808.152218 2126.7971
#> 5.290411 779.08320 445.68652 1167.3658 241.399405 1347.7438
#> 5.293151 509.05037 197.03416 952.3378 -41.502059 1164.4427
#> 5.295890 1603.25031 1286.11020 2034.2894 1103.226402 2206.2872
#> 5.298630 1702.71948 1349.27028 2110.3871 1175.100927 2282.6283
#> 5.301370 1387.07491 1045.02837 1777.2429 857.484454 1976.1361
#> 5.304110 843.45755 515.57590 1313.3492 309.969782 1506.8547
#> 5.306849 307.46289 -19.67237 767.7331 -207.883690 1013.0170
#> 5.309589 683.78938 310.90254 1111.8767 102.778541 1357.7287
#> 5.312329 1796.48933 1453.24113 2158.0380 1208.259998 2382.9012
#> 5.315068 2079.56334 1643.76220 2467.8589 1448.210780 2686.3045
#> 5.317808 1813.70296 1369.26626 2251.0084 1105.222914 2468.3653
#> 5.320548 1769.17186 1349.75845 2229.6508 1086.037098 2490.8161
#> 5.323288 1513.08782 1067.94688 2000.1957 817.223517 2220.3719
#> 5.326027 1328.22923 889.46912 1763.8929 658.776752 2046.9940
#> 5.328767 1206.29192 801.25907 1636.1395 643.476893 1854.1926
#> 5.331507 1402.46057 1051.74226 1809.4455 875.300586 2030.1605
#> 5.334247 1744.09973 1385.37643 2137.9889 1178.291187 2341.4351
#> 5.336986 653.03222 346.09740 1187.8512 134.719068 1401.1668
#> 5.339726 788.75123 474.27173 1227.2936 276.635647 1425.7192
#> 5.342466 492.95826 73.16160 880.9058 -93.312985 1090.4150
#> 5.345205 760.02883 375.19726 1147.7440 146.051482 1370.2150
#> 5.347945 1372.75151 1000.74589 1830.3763 754.793369 2064.9193
#> 5.350685 1271.44536 920.67807 1711.9304 729.790192 1921.9990
#> 5.353425 1548.72324 1166.51741 1979.9804 938.602206 2209.9840
#> 5.356164 1803.86492 1386.75454 2162.2317 1189.167325 2362.7209
#> 5.358904 1802.18453 1338.35269 2230.9355 1117.104784 2497.3396
#> 5.361644 2761.39169 2396.87758 3175.3454 2179.209769 3395.9783
#> 5.364384 2227.14956 1850.02434 2643.1442 1637.557633 2851.5345
#> 5.367123 1460.53942 1103.82187 1854.5220 879.380805 2100.9307
#> 5.369863 903.99471 595.85735 1291.7683 366.638891 1499.9222
#> 5.372603 1196.29003 789.94918 1602.0256 548.505098 1847.2532
#> 5.375342 1060.65534 660.93528 1474.4249 434.132997 1730.2175
#> 5.378082 671.70141 341.81595 1170.2688 130.821692 1373.3934
#> 5.380822 1060.62953 719.58696 1507.5397 512.566405 1723.2543
#> 5.383562 1454.10650 1111.21788 1918.8157 909.759212 2085.8801
#> 5.386301 1025.77752 670.01293 1523.1333 466.087412 1767.8786
#> 5.389041 1047.92554 679.53048 1492.2826 455.104998 1732.5528
#> 5.391781 1654.68020 1251.23450 2010.1395 1031.444833 2167.2093
#> 5.394521 1880.12008 1485.73340 2272.0338 1266.260370 2475.5236
#> 5.397260 2318.64465 1925.00434 2748.3599 1726.101174 2955.9273
#> 5.400000 1747.05275 1299.11908 2175.7503 1089.347099 2381.2309
#> 5.402740 1078.89032 667.15149 1489.9299 439.810611 1710.7990
#> 5.405479 1211.03546 814.98712 1569.5276 629.339667 1792.8098
#> 5.408219 1130.81692 765.16752 1567.6353 559.278491 1772.0795
#> 5.410959 890.41503 598.79938 1315.9357 391.272555 1529.6412
#> 5.413699 830.69918 514.18967 1282.6550 300.178917 1487.4614
#> 5.416438 690.99443 352.47951 1166.5024 140.268909 1349.4337
#> 5.419178 1312.72277 957.66073 1769.1496 769.311121 1990.7598
#> 5.421918 1301.95145 935.84873 1854.7755 707.795035 2105.1606
#> 5.424658 1332.25552 983.76640 1818.1396 737.217391 2075.4721
#> 5.427397 955.30762 636.20558 1449.9107 444.859826 1665.2925
#> 5.430137 617.59735 300.46115 1095.4582 99.536135 1314.4195
#> 5.432877 1492.10778 1148.27769 1935.7973 920.251676 2152.9523
#> 5.435616 1886.77813 1451.33818 2302.2786 1243.288864 2511.4108
#> 5.438356 2303.09418 1852.48462 2690.6670 1652.795131 2942.2231
#> 5.441096 2131.48734 1686.20589 2605.8134 1459.828193 2829.8044
#> 5.443836 2196.81118 1738.25990 2666.5970 1506.716559 2894.0051
#> 5.446575 1709.35872 1238.11319 2126.9449 982.166577 2444.0887
#> 5.449315 877.84324 463.09562 1359.1883 264.105518 1639.7292
#> 5.452055 1120.50635 740.24233 1575.4116 511.719051 1803.7518
#> 5.454795 1491.80607 1080.22873 1999.7009 871.350914 2206.9099
#> 5.457534 1637.09232 1271.57553 2160.2527 1035.825921 2364.8334
#> 5.460274 1794.58006 1451.18089 2253.2945 1273.278701 2453.3620
#> 5.463014 2223.33723 1895.23428 2666.0771 1674.720098 2850.8705
#> 5.465753 1920.85410 1552.90750 2413.8436 1323.903880 2634.7670
#> 5.468493 2236.57927 1845.40481 2661.2462 1672.851244 2891.2017
#> 5.471233 1908.23902 1536.77939 2445.4768 1292.970919 2703.7515
#> 5.473973 1942.17103 1538.28947 2480.4624 1275.503594 2709.2405
#> 5.476712 2215.01492 1793.49445 2742.8055 1520.350486 3032.0207
#> 5.479452 1835.70651 1356.91466 2285.2955 1096.746044 2537.6526
#> 5.482192 1642.26542 1173.01984 2079.0586 986.675844 2334.8779
#> 5.484932 1901.16609 1514.66311 2334.4344 1309.423144 2537.7177
#> 5.487671 1779.64345 1423.67525 2220.9710 1250.795938 2412.6849
#> 5.490411 1318.12337 1016.19596 1830.8775 780.025379 2054.5040
#> 5.493151 789.70860 461.64355 1316.4737 270.937998 1565.0409
#> 5.495890 946.12725 551.68966 1445.5074 350.746617 1707.0813
#> 5.498630 1108.73453 707.66416 1599.9051 512.138771 1827.8412
#> 5.501370 1786.84528 1413.96380 2191.5529 1177.742498 2388.9179
#> 5.504110 2543.18179 2180.50626 3023.1945 1960.839112 3255.4630
#> 5.506849 2330.80163 1989.79670 2840.2425 1723.905599 3040.8319
#> 5.509589 1935.66277 1612.22273 2403.1874 1392.092202 2665.1855
#> 5.512329 1708.32390 1382.11453 2237.4151 1127.742252 2498.7669
#> 5.515068 2246.39763 1865.99623 2606.4996 1631.265104 2800.6658
#> 5.517808 2515.52833 2073.78922 2934.3002 1871.371645 3180.4965
#> 5.520548 1651.26785 1231.78487 2176.2572 976.507377 2395.5220
#> 5.523288 659.70841 339.28859 1177.4919 89.093228 1436.0455
#> 5.526027 544.64223 211.46856 998.4419 -9.812119 1203.9084
#> 5.528767 573.65984 232.77433 980.2888 54.522247 1218.0476
#> 5.531507 525.83151 176.13466 962.8125 -60.545558 1169.4122
#> 5.534247 484.83512 172.72676 958.9233 -39.445627 1142.7861
#> 5.536986 684.98632 345.12913 1169.1507 97.051853 1354.5759
#> 5.539726 1382.21730 1016.58857 1779.0753 827.826280 1965.5735
#> 5.542466 1909.30597 1499.14553 2261.5284 1281.655870 2481.2735
#> 5.545205 2124.92557 1703.84557 2578.2176 1489.812891 2825.9102
#> 5.547945 2074.84900 1630.56386 2634.6132 1398.117036 2866.7775
#> 5.550685 1981.99069 1566.60608 2469.0605 1326.922499 2712.3140
#> 5.553425 1898.14558 1506.94148 2310.5883 1331.683264 2480.5272
#> 5.556164 1261.71642 857.90746 1705.7475 630.044980 1930.3174
#> 5.558904 837.78831 452.18674 1289.3410 225.536915 1510.3808
#> 5.561644 910.87903 521.42745 1343.6774 311.159973 1527.1595
#> 5.564384 1584.60805 1185.99171 2046.5042 995.640843 2263.9592
#> 5.567123 1438.75092 1071.97040 1904.9812 842.762719 2105.4542
#> 5.569863 871.47178 520.19382 1336.8806 339.174551 1554.7133
#> 5.572603 1272.84954 883.65967 1661.8736 710.379129 1864.9401
#> 5.575342 2201.75012 1776.34605 2533.0472 1590.306218 2688.5483
#> 5.578082 2632.41010 2234.92851 2977.2152 2025.090135 3183.7188
#> 5.580822 2324.58769 1926.97387 2778.7765 1762.580139 2991.8050
#> 5.583562 2148.30553 1803.78362 2576.4194 1561.856798 2826.9569
#> 5.586301 1913.19509 1565.03540 2370.1910 1338.382615 2612.5358
#> 5.589041 2079.66802 1663.73695 2488.3794 1444.433928 2695.4613
#> 5.591781 2142.30430 1741.87700 2559.0278 1511.688768 2778.8050
#> 5.594521 2241.81073 1901.56226 2675.5393 1621.212881 2867.1111
#> 5.597260 2020.91205 1629.01587 2467.2351 1395.207720 2688.2622
#> 5.600000 1383.93481 1010.22867 1828.9718 758.997932 2087.0936
#> 5.602740 1165.71235 741.99423 1575.7721 554.347045 1775.9169
#> 5.605479 1440.10234 998.46996 1833.0942 817.537664 2035.2885
#> 5.608219 1788.28187 1392.56283 2208.7829 1142.521813 2385.1794
#> 5.610959 1465.97458 1122.60921 2002.0824 898.421063 2247.0742
#> 5.613699 744.76459 485.19027 1318.7136 263.972583 1527.7493
#> 5.616438 319.10128 46.31797 853.6378 -209.955367 1109.9520
#> 5.619178 1704.13502 1366.81144 2189.8909 1200.454649 2422.1267
#> 5.621918 1797.48948 1402.43104 2349.2482 1134.985109 2574.0252
#> 5.624658 2061.28769 1642.22444 2644.3286 1381.067542 2863.1610
#> 5.627397 1957.53285 1558.32691 2517.8074 1343.829075 2725.2653
#> 5.630137 1550.93487 1202.00490 2071.7085 969.406192 2289.2089
#> 5.632877 1283.57591 905.30671 1758.8678 676.703903 1972.1517
#> 5.635616 1172.04305 722.02688 1601.9005 458.056092 1816.6824
#> 5.638356 1722.24048 1314.08621 2100.6821 1105.936433 2300.0844
#> 5.641096 1615.84211 1233.25042 2106.3719 1004.285541 2365.3371
#> 5.643836 1420.50873 1099.44238 1952.6583 902.437622 2199.5089
#> 5.646575 1200.35672 897.10685 1671.2415 714.922562 1877.3350
#> 5.649315 1027.66789 638.55041 1491.8772 411.845849 1725.7276
#> 5.652055 684.07689 328.40074 1163.8807 151.950640 1368.7206
#> 5.654795 228.91172 -59.48776 708.7280 -281.307622 959.4922
#> 5.657534 120.96256 -172.61273 635.1022 -387.704560 871.9500
#> 5.660274 881.32254 577.61223 1419.0887 322.442929 1610.2289
#> 5.663014 1827.06845 1462.87816 2319.3366 1217.998766 2548.8861
#> 5.665753 2618.83323 2211.31835 3092.8912 1974.158358 3350.7403
#> 5.668493 3255.94158 2835.58601 3723.3735 2625.202774 3963.1305
#> 5.671233 3255.39553 2859.24381 3766.7656 2644.380045 3987.2078
#> 5.673973 3398.45663 3060.25890 3948.2402 2788.505088 4184.1882
#> 5.676712 2290.25524 1844.13391 2805.9096 1580.543282 3034.4962
#> 5.679452 2116.06775 1577.10069 2494.0814 1351.255116 2752.7220
#> 5.682192 2185.46322 1654.44168 2590.0160 1430.957895 2862.8098
#> 5.684932 2131.95104 1719.92604 2598.0147 1502.116287 2865.9887
#> 5.687671 1755.91858 1412.05994 2237.1886 1214.274123 2474.3252
#> 5.690411 817.21176 489.15283 1299.7028 312.285248 1491.7253
#> 5.693151 1146.36756 801.97750 1604.4369 583.133536 1810.5372
#> 5.695890 1391.79117 1010.52622 1862.8211 788.344973 2092.4861
#> 5.698630 1719.02453 1315.78195 2237.2121 1120.511190 2507.2542
#> 5.701370 2184.30905 1750.83908 2758.3883 1526.805576 3041.7566
#> 5.704110 2620.22742 2123.33631 3188.1230 1880.350955 3441.3112
#> 5.706849 2819.26071 2254.20344 3344.5404 2003.006507 3625.1703
#> 5.709589 2240.06732 1613.05423 2798.6360 1300.273113 3086.7309
#> 5.712329 2514.91060 1822.76748 3173.4193 1493.462239 3544.4920
#> 5.715068 2756.22257 1986.53217 3365.6362 1585.863115 3748.4887
#> 5.717808 2290.81100 1582.88917 2827.0690 1292.285863 3096.8189
#> 5.720548 2171.63054 1617.45589 2740.3086 1320.758144 3031.0100
#> 5.723288 2215.56483 1842.83086 2781.4023 1574.070857 3077.2453
#> 5.726027 1442.76638 1162.35465 2058.8797 943.852348 2279.2373
#> 5.728767 1982.61713 1693.19327 2615.9650 1497.372149 2838.9854
#> 5.731507 3511.56448 3143.99145 4257.3153 2854.308992 4568.7244
#> 5.734247 4066.98052 3508.28277 4819.3412 3190.256929 5144.2374
#> 5.736986 4673.03279 4124.56276 5382.4701 3727.593048 5685.2741
#> 5.739726 3379.82910 2977.61429 4104.4235 2661.726486 4351.9687
#> 5.742466 2672.52041 2222.80543 3452.5512 1883.398423 3692.1925
#> 5.745205 2739.35627 2214.58062 3520.0294 1819.584406 3799.3732
#> 5.747945 2376.61111 1655.39566 3016.4422 1286.329928 3293.3026
#> 5.750685 1256.71887 704.76201 1688.8935 399.455814 1985.6926
#> 5.753425 3363.27353 2847.86713 3799.0320 2540.998813 4101.7579
#> 5.756164 4059.57934 3595.12204 4552.3083 3330.075501 4806.0914
#> 5.758904 4823.71511 4286.04649 5429.4877 3924.405008 5705.8248
#> 5.761644 4304.93857 3752.95049 4977.6599 3458.583094 5298.4280
#> 5.764384 3218.91340 2662.75149 3669.6876 2343.790640 3928.0335
#> 5.767123 2500.20591 1917.92658 2977.5322 1630.980676 3238.0600
#> 5.769863 3523.20842 2913.74723 4208.5129 2585.815565 4570.5882
#> 5.772603 5320.76335 4584.61540 6130.5732 4284.957062 6554.4786
#> 5.775342 6825.31500 5956.84944 7477.9351 5597.351090 7949.0125
#> 5.778082 6435.50869 5532.67050 6913.7408 5181.592284 7276.4162
#> 5.780822 4659.48838 3896.74789 5003.7900 3569.122561 5244.5479
#> 5.783562 915.80130 508.08246 1301.8880 316.429232 1518.0622
#> 5.786301 141.31020 -133.19656 560.4422 -346.961618 774.7574
#> 5.789041 877.07911 557.47624 1340.5853 367.722393 1582.7429
#> 5.791781 825.66272 529.42890 1215.1530 335.127430 1381.9756
#> 5.794521 961.04393 535.89791 1356.0338 340.066991 1557.8270
#> 5.797260 2261.49970 1880.07823 2655.3619 1675.354601 2905.5619
#> 5.800000 3185.44021 2859.70133 3656.2505 2662.951529 3893.8174
#> 5.802740 4265.03456 3877.56355 4815.9046 3645.421907 5038.6823
#> 5.805479 5336.03929 4889.40013 5796.8106 4618.355122 6054.4976
#> 5.808219 2011.36940 1618.45923 2385.3808 1407.543628 2564.9865
#> 5.810959 2493.83695 2104.46732 2869.3530 1884.055339 3063.0961
#> 5.813699 3673.79389 3163.11004 4087.7375 2852.625626 4311.6182
#> 5.816438 5816.91063 5188.41003 6226.9618 4839.944949 6490.5151
#> 5.819178 7090.36039 6485.28393 7433.2228 6262.536479 7682.0430
#> 5.821918 1893.43945 1489.29400 2202.1998 1298.111759 2363.5995
#> 5.824658 392.13892 68.04909 754.6563 -140.234233 947.3764
#> 5.827397 2437.92979 2094.27619 2844.3757 1886.875088 3020.1831
#> 5.830137 3274.39154 2906.87570 3787.9481 2626.593297 3997.2914
#> 5.832877 3934.02275 3544.83538 4511.9851 3285.818067 4736.8229
#> 5.835616 3941.62867 3571.49903 4431.3631 3370.263222 4658.1462
#> 5.838356 2028.84787 1660.20864 2405.3852 1501.541270 2619.6414
#> 5.841096 408.30850 53.17513 747.2980 -145.809693 973.4113
#> 5.843836 2356.81531 1940.26514 2739.3937 1711.298481 2913.3870
#> 5.846575 5076.19482 4631.75839 5437.5053 4452.481608 5653.9376
#> 5.849315 2777.31111 2405.98286 3149.8597 2184.604145 3314.5557
#> 5.852055 972.02539 619.12322 1293.7625 430.209063 1492.1821
#> 5.854795 2070.05459 1709.09945 2444.5984 1529.549762 2661.9611
#> 5.857534 3655.58141 3223.26976 4045.9010 3012.349439 4290.8253
#> 5.860274 4216.49804 3767.48410 4627.6066 3538.269753 4859.5948
#> 5.863014 4807.32102 4272.88689 5201.4645 4075.083366 5467.8800
#> 5.865753 1700.79523 1275.64230 2020.6780 1123.611075 2186.7178
#> 5.868493 162.70030 -180.50249 520.5441 -386.776695 698.5046
#> 5.871233 336.58997 14.84860 717.8487 -210.817860 953.1292
#> 5.873973 1594.24235 1242.14105 2009.9658 1030.869178 2229.1342
#> 5.876712 4247.60117 3851.95382 4643.5350 3598.423733 4863.8023
#> 5.879452 1726.79571 1383.60998 2114.9341 1202.395059 2262.1936
#> 5.882192 910.80668 584.68506 1310.9916 396.143362 1481.5880
#> 5.884932 2554.78735 2128.10088 2917.2171 1930.612562 3109.7647
#> 5.887671 5047.36390 4607.40155 5353.6829 4424.182956 5554.7841
#> 5.890411 6754.00034 6346.30356 7112.4349 6126.059385 7273.6348
#> 5.893151 4782.96873 4370.39299 5173.4132 4166.275838 5414.2935
#> 5.895890 1580.21565 1238.00559 1988.9165 1024.601170 2154.6764
#> 5.898630 3546.14537 3224.95793 3965.3997 3009.258111 4168.8612
#> 5.901370 3010.00082 2640.18356 3444.1658 2452.803090 3656.8493
#> 5.904110 5256.97800 4808.45005 5658.0714 4593.772589 5892.2359
#> 5.906849 7381.74414 6930.65326 7722.1541 6687.689543 7952.6811
#> 5.909589 2911.28119 2536.75872 3284.9924 2375.657167 3493.7180
#> 5.912329 1864.13901 1507.87535 2228.7325 1326.554314 2467.0749
#> 5.915068 4343.97411 3966.38420 4709.9717 3760.965273 4917.1118
#> 5.917808 1623.22527 1301.41708 1987.7872 1109.570612 2173.7314
#> 5.920548 455.02797 122.96454 798.4788 -54.258839 967.7240
#> 5.923288 1001.82495 686.28231 1382.6481 534.135802 1606.2918
#> 5.926027 371.94579 35.39497 720.5184 -130.755943 904.9348
#> 5.928767 516.18215 172.78335 866.3072 25.170164 1075.4245
#> 5.931507 763.15214 449.62897 1125.4330 255.341241 1280.7408
#> 5.934247 2223.18551 1867.05014 2625.4145 1640.866993 2779.3496
#> 5.936986 1872.09593 1487.13888 2197.0968 1335.503951 2353.8174
#> 5.939726 2029.58132 1674.77814 2402.2378 1507.772584 2585.6720
#> 5.942466 3912.56448 3544.60796 4277.8349 3334.086104 4480.8363
#> 5.945205 2781.81695 2418.24720 3160.0340 2204.561793 3334.1135
#> 5.947945 1827.93104 1502.34505 2197.0985 1293.881219 2434.1875
#> 5.950685 493.32217 161.01243 842.2407 17.902779 1059.1554
#> 5.953425 2021.24789 1622.25403 2392.3539 1433.197368 2604.3571
#> 5.956164 3186.81076 2821.30021 3567.6945 2625.145295 3762.0705
#> 5.958904 1209.45987 845.03222 1535.1383 657.612097 1730.0034
#> 5.961644 751.71858 391.18502 1122.1082 207.258106 1310.3864
#> 5.964384 2155.30229 1743.26351 2569.5664 1532.379176 2797.6358
#> 5.967123 3358.22954 2983.90224 3773.5847 2750.678526 4015.5985
#> 5.969863 1487.08416 1159.54504 1852.4702 988.750269 2052.4877
#> 5.972603 167.62963 -143.51261 498.7079 -338.024009 693.3939
#> 5.975342 441.36026 92.19126 781.6495 -99.142992 978.1282
#> 5.978082 2241.09549 1826.63894 2633.7527 1648.049803 2863.6830
#> 5.980822 2297.95127 1928.62308 2678.4636 1774.801605 2870.1625
#> 5.983562 537.85449 199.62422 923.7762 5.031299 1128.0251
#> 5.986301 883.52077 525.16518 1297.6274 288.187258 1477.4858
#> 5.989041 2607.02528 2228.93105 3055.6847 2018.050881 3272.3306
#> 5.991781 3773.48803 3358.48869 4205.4128 3115.767762 4395.2801
#> 5.994521 3596.19724 3161.48550 3994.0003 2943.010380 4200.5396
#> 5.997260 2917.91696 2529.43442 3348.1583 2307.945234 3569.6427
#> 6.000000 1977.96914 1644.62011 2426.2562 1435.240011 2630.1158
#> 6.002740 347.22479 11.90982 727.6610 -164.578460 915.4272
autoplot(nnetfor) + theme_hc()
Let's take a closer look at the forecast:
nfor <- as.nfor(nnetfor)
autoplot(nfor)
Visually, this is not a bad forecast. We managed to capture a lot of the behavior of the time series, and even came close on a few of the peaks. Let's check out the scores to see; I imagine the MAPE will be very good in this case.
data.frame(t(scores(nfor)))
Looks like this one did about the same as, or maybe slightly better than, the common-sense ARMA benchmark in terms of ASE. In terms of MAPE, however, this is far and away our best model. This means that in general it was pretty close to the truth, but when it missed, it missed by a lot. This is evidenced in our plot as well. We will now move on to the more complex, and more interesting, LSTM model.
Please refer to this chapter of Forecasting: Principles and Practice.
makeModel.nfor <- function(obj) {
score <- scores(obj)
    return(list(preds = obj$mean, ul = obj$upper[, 2], ll = obj$lower[, 2],
        ase = score[1], mape = score[2], Conf.Score = score[3]))
}
models$nnetar = makeModel(nfor)
LSTM models are a type of recurrent neural network. Before we get into the actual modeling, we need to stop and think about what this means.
The model in this section is the result of hours and hours of hard work (over 700 models were trained by hand), and its trained form can be found at analysis/daily/winner.h5.
To load and test it on your own dataset, you simply need to get the data into the right format (I will work on automating this later, for a fully reproducible, deployable model) and run lstm_model <- load_model_hdf5("/path/to/winner.h5"); the fully trained model will pop up in your global environment.
To explain this, we need to discuss what a normal neural network, a feedforward neural network, is. In a feedforward network, inputs go into the input layer, are mapped through an activation function, and pass through some number of hidden layers, each with its own activation function. The hidden layers find a numerical way to represent the data fed into them, and that representation is passed to the output layer. Information is fed straight in and straight out. This means a simple NN has no way of discerning anything about order or time; it knows nothing about previous iterations.
In contrast, in an RNN the inputs are not just the data but also the previous outputs of the RNN. This means it has a sort of memory of what it did in the past and can learn from it. It also means it has a notion of order and time, which makes it well suited to time series. The information from previous iterations is stored in the RNN's hidden layers, so with each iteration it is aware of the past. RNNs are especially good at finding relationships between events which occurred a long way apart.
In short, an RNN is a neural network that shares the weights it has learned across time steps.
An LSTM is a special class of RNN where extra information, outside of the normal information flow of an RNN, is stored in a gated cell. The cell is updated and changed by the network: the signal from inside the RNN is passed through a sigmoid function, and based on the strength and weight of that signal it is either written into the long-term memory cell or not. This gated design also makes backpropagation through time workable: the cell can carry the error of previous steps, which helps the gradient-based optimization of the network (we will discuss this).
There are three gates in the LSTM: the input gate, the forget gate, and the output gate.
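For reference, the standard LSTM update equations, with \(\sigma\) the sigmoid and \(\odot\) elementwise multiplication, are:
\[
\begin{aligned}
f_t &= \sigma(W_f[h_{t-1}, x_t] + b_f) && \text{(forget gate)}\\
i_t &= \sigma(W_i[h_{t-1}, x_t] + b_i) && \text{(input gate)}\\
o_t &= \sigma(W_o[h_{t-1}, x_t] + b_o) && \text{(output gate)}\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh(W_c[h_{t-1}, x_t] + b_c)\\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
\]
The forget gate decides what to drop from the memory cell \(c_t\), the input gate decides what new information to write into it, and the output gate decides how much of the cell to expose as the hidden state \(h_t\).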
That's the gist of it. We will get more into it as we build our model.
For an extremely helpful discussion of neural networks and especially RNNs, please refer to this link
Please also note that the code in this section is based on the code found in the excellent book Deep Learning with R, a worthwhile investment.
As LSTMs rely on the sigmoid function, which is bounded between zero and one, we must prepare our data accordingly (that is, center and scale it). It also has to be in matrix form (for now). We also want the LSTM to see lagged versions of our variables. We are going to lag not only the xreg variables but also our target itself. This allows the neural network to do a sort of autocorrelation analysis of the data, as well as an analysis of the cross-correlation at many lags. For this we will use the mlag_dfr function from my tswgewrapped package, which takes column names and a vector of lags and adds the lagged columns to the data frame. As we have tons of data and are probably only going to go to about lag 50, we will deal with the NAs created in our typical manner: spline interpolation. Let's check out the mlag_dfr and lag_dfr functions:
tswgewrapped::lag_dfr
#> function (df, vars, lags)
#> {
#> out <- lapply(lags, function(x) stats::lag(df[[vars]], x))
#> out <- do.call(cbind.data.frame, out)
#> cnames <- sapply(lags, function(x) paste(vars, x, sep = "_"))
#> colnames(out) <- cnames
#> out
#> }
#> <bytecode: 0x106c20a8>
#> <environment: namespace:tswgewrapped>
tswgewrapped::mlag_dfr
#> function (df, vars, lags)
#> {
#> out <- lapply(vars, function(x) lag_dfr(df, x, lags))
#> out <- do.call(data.frame, unlist(out, recursive = FALSE))
#> cbind(df, out)
#> }
#> <bytecode: 0x9c6ba58>
#> <environment: namespace:tswgewrapped>
Now let's do the initial preprocessing of our data:
library(keras)
# reset the data state of this report
rm(train, test, traints)
dfsplit(bj2)
# remove the columns we aren't using
train[1:3] <- NULL
train[9:14] <- NULL
test[1:3] <- NULL
test[9:14] <- NULL
traints <- (purrr::keep(train, is.ts))
test <- test %>% dplyr::select(PM_US.Post, HUMI, TEMP, PRES, Iws)
train <- traints %>% dplyr::select(PM_US.Post, HUMI, TEMP, PRES, Iws)
test <- mlag_dfr(test, names(test), 1:50)
train <- mlag_dfr(train, names(train), 1:50)
test <- impute(test) %>% as.data.frame
train <- impute(train) %>% as.data.frame
train <- data.matrix(train)
test <- data.matrix(test)
# make an overall data matrix
dat <- rbind(train, test)
# Drop the first row: the ts objects (with their windows) leave us one extra,
# which is fine for forecasting but not for an LSTM
dat <- dat[-1, ]
# do not worry about this; we are doing it so that later on, when we do our
# bagging, we have a validation set. It will all make sense, I promise
maxt <- floor(nrow(dat) * 4/5)
maxv <- floor(nrow(dat))
val <- dat[(maxv - 365 * 2):(maxv - 365), ]
val <- scale(val, center = apply(val, 2, mean), scale = apply(val, 2, sd))
valScale <- attr(val, "scaled:scale")[1]
valCent <- attr(val, "scaled:center")[1]
val <- array(val, c(nrow(val), 2, 255))
# normalize the data for the lstm
mn <- apply(dat, 2, mean)
std <- apply(dat, 2, sd)
dat <- scale(dat, center = mn, scale = std)
# let's also make an array for our test dataset in the shape of the LSTM's
# input, which we will discuss next. Note we are doing this in a convoluted
# manner because we want to be able to denormalize our data later. Next time
# I will use the recipes package, because this is a massive headache
test2 <- scale(test, center = apply(test, 2, mean), scale = apply(test, 2, sd))
test3 <- array(test2, c(nrow(test2), 10, 255))
That was a lot of code, but we did nothing fancy. Next, we are going to write a generator function. This keeps us from having to load our entire dataset into the GPU's memory at once (if you are using a GPU).
As mentioned above, if you are running GPU-based calculations, you often don't want to load an entire dataset into your GPU's memory. So it is good practice to write a generator: a function that creates another function, which generates samples on the fly, moving forward through time. Since we want to move through the time series slowly, with overlap, storing every sample up front would take a massive amount of memory. We need to define four values for this (listed below, then set in code):
lookback: how far back we go to grab our observations
step: how far apart the sampled observations are
delay: how far ahead the target is from the observations
batch_size: how many samples we draw per batch
lookback <- 60 # observations will go back 60 days
step <- 6  # we will sample every 6 days (the generator calls below expect this name)
delay <- 1 # we will set our prediction targets to be 1 day ahead
batch_size <- 150 #the number of times we will do this per batch
Next, we write a series of functions to generate these samples for our model
generator <- function(data, lookback, delay, min_index, max_index, shuffle = FALSE,
batch_size = 128, step = 6) {
if (is.null(max_index))
max_index <- nrow(data) - delay - 1
i <- min_index + lookback
function() {
if (shuffle) {
rows <- sample(c((min_index + lookback):max_index), size = batch_size)
} else {
if (i + batch_size >= max_index)
i <<- min_index + lookback
rows <- c(i:min(i + batch_size - 1, max_index))
i <<- i + length(rows)
}
samples <- array(0, dim = c(length(rows), lookback/step, dim(data)[[-1]]))
targets <- array(0, dim = c(length(rows)))
for (j in 1:length(rows)) {
indices <- seq(rows[[j]] - lookback, rows[[j]] - 1, length.out = dim(samples)[[2]])
samples[j, , ] <- data[indices, ]
targets[[j]] <- data[rows[[j]] + delay, 2]
}
list(samples, targets)
}
}
train_gen <- generator(dat, lookback = lookback, delay = delay, min_index = 1,
max_index = maxt, shuffle = TRUE, step = step, batch_size = batch_size)
val_gen = generator(dat, lookback = lookback, delay = delay, min_index = maxt +
1, max_index = maxv, step = step, batch_size = batch_size)
We will now discuss the construction of our model. Our final model consists of 6 layers: 5 LSTM layers (counting each bidirectional layer as two) and one standard densely connected layer. We will discuss each unique layer in some detail, as extreme measures were necessary to avoid overfitting with such a small amount of data.
Training an LSTM works as follows:
You define the number of batches per epoch. An epoch is a single pass of training for the neural network, so you are defining the number of times your generator function samples the data per epoch (steps_per_epoch).
In each epoch, the neural net first reads in your training data and trains on it, using whatever masking/dropout you have specified (we will discuss this). When it is done with that, it runs the validation data (the test set in this case) through the network and compares results. On the validation set it removes any predefined masking or dropout, which is why the loss is generally lower on the validation set. Things that are confusing now will be discussed later, as they make more sense to define with an example. Basically, each epoch runs through the training data for N steps, with some tricks to prevent overfitting, and then runs through the validation/test set once. At each step, the weights for each parameter are stored in the nodes themselves, and other useful information, such as error, is stored in the gates; information is added to them (and, via the forget gate, removed from them) as training proceeds. This is done for N steps over M epochs, repeatedly updating the LSTM's weights and other information until we say stop.
The first layer consists of an LSTM with 10 hidden nodes. A small number of nodes at the beginning was necessary to avoid overfitting (even at 50 nodes we overfit rapidly). The next tool we used to combat overfitting was dropout. There are two types of dropout; first we will discuss normal dropout:
Normal dropout is applied to the inputs and outputs of a layer. For a given dropout rate N, it randomly masks out a fraction N of the values going into the node. This is NOT used on the validation set. By showing the neural network fewer points of the significantly larger training set, it helps prevent overfitting on the training set.
Recurrent dropout is the same idea as normal dropout, except it is applied to the recurrent (hidden) state instead of the inputs and outputs.
Our simple LSTM layer just has these parameters
Next we feed the output of our first LSTM layer into two BiLSTM layers. A normal LSTM passes over the data from start to end. A bidirectional LSTM actually represents two LSTMs in parallel: one that reads the data from start to end, and one that reads it from end to start. This helps it find even more interesting patterns and nuances in our data, increasing its representational power. Essentially, we are allowing the LSTM to track long-term patterns with the same power as short-term ones.
A dense layer is a standard, fully connected neural network layer. Adding this to our model further increases the representational power.
We will now define the model:
model_lstm <- keras_model_sequential() %>% layer_lstm(units = 10, dropout = 0.4,
recurrent_dropout = 0.1, activation = "sigmoid", input_shape = list(NULL,
dim(dat)[[-1]]), return_sequences = TRUE) %>% bidirectional(layer_lstm(units = 40,
dropout = 0.4, recurrent_dropout = 0.4, activation = "sigmoid", return_sequences = TRUE)) %>%
bidirectional(layer_lstm(units = 80, dropout = 0.4, recurrent_dropout = 0.4,
activation = "sigmoid")) %>% layer_dense(units = 1)
Let's see what the model looks like now:
model_lstm
#> Model
#> Model: "sequential_16"
#> ___________________________________________________________________________
#> Layer (type) Output Shape Param #
#> ===========================================================================
#> lstm_48 (LSTM) (None, None, 10) 10640
#> ___________________________________________________________________________
#> bidirectional_32 (Bidirectional) (None, None, 80) 16320
#> ___________________________________________________________________________
#> bidirectional_33 (Bidirectional) (None, 160) 103040
#> ___________________________________________________________________________
#> dense_16 (Dense) (None, 1) 161
#> ===========================================================================
#> Total params: 130,161
#> Trainable params: 130,161
#> Non-trainable params: 0
#> ___________________________________________________________________________
Next, we need to compile the model, that is, define how it trains. We want to train it to minimize the mean squared error, which is the same quantity as the ASE we have been using on the rest of the models. For our optimizer we will use RMSprop; to learn how RMSprop works, please see this link. It is typically a good optimizer for RNNs.
The parameter to really tune here is learning rate. We will see in the next section how to tune it. By default, the learning rate is 0.001.
model_lstm %>% compile(optimizer = optimizer_rmsprop(), loss = "mse")
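If the loss curve below suggests the learning rate should be lowered, the only change is to pass it explicitly when compiling. A sketch (the 5e-4 value is purely illustrative, and depending on your keras version the argument is named lr or learning_rate):
model_lstm %>% compile(optimizer = optimizer_rmsprop(lr = 5e-4), loss = "mse")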
Finally, we get to train our model. We will save the history of our training, so we can diagnose the model
# val_steps (how many validation batches cover the validation window once) is
# not defined elsewhere in this report, so we define it here following the
# usual generator bookkeeping
val_steps <- floor((maxv - (maxt + 1) - lookback)/batch_size)
lstmHist <- model_lstm %>% fit_generator(train_gen, steps_per_epoch = 40, epochs = 20,
    validation_data = val_gen, validation_steps = val_steps)
Let's look at the loss curve of our training to see how we did:
plot(lstmHist)
There are two things to know about the loss curve:
If the validation loss is greater than the training loss, you are severely overfitting.
If the loss curve is too steep, the learning rate is too high; if it is too flat, the learning rate is too low.
For an amazing guide about tuning this, please refer to this link
Knowing this, we can now assess our model. First of all, thanks to our dropout, we are absolutely not overfitting. Looking at the shape, the learning rate is likely slightly too high, but this is not worth tuning. (This model performs excellently, and after several hundred tries I know it is not worth it to fiddle further).
As with the ANN, there is no set way of calculating a prediction interval for an LSTM: it is not a stochastic, parametric model, just a composition of many functions. However, we can use our dropout to approximate one. For an in-depth explanation, please refer to this paper. Basically, dropout is a random process, so each time we evaluate our model on the training set, with dropout and our random generator function active, we will get slightly different results (whereas when we make a prediction, dropout is off, so we get the same result every time). So we can evaluate how well our model represents the training set and use the spread of those evaluations to calculate a prediction interval. First, we will write a function that gets the mean error of our model over the training set:
getMeanError <- function(mod) {
    # 28 is the integer number of batches needed to sample our whole training
    # set; with this batch size that works out to roughly three passes through
    # the data
sqerror <- evaluate(mod, train_gen, 28)
return(sqrt(sqerror))
}
Next, we will write a function which does this n times, so we have a sample of errors. First, a brief note about writing for loops in R.
First, we must allocate space for the object we are constructing. If we do not, then with each iteration of the for loop R will copy the growing object and reallocate it, which makes for horrifically slow loops. For more info on this, please see the amazing book The R Inferno. To allocate this space, we do as follows:
out <- double(n)
Note that we allocated it as a double. If you want to make your for loops even faster, help the R interpreter out: by declaring the output as a double up front, its type never changes inside the loop, so R never has to coerce and copy the whole object. double is equivalent to numeric, but makes the intent clearer to the reader.
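To make the point concrete, here is a toy comparison between growing a vector inside the loop and preallocating it (both function names are made up purely for illustration):
growSlowly <- function(n) {
    out <- c()
    for (i in 1:n) out <- c(out, i^2)  # copies and reallocates the vector on every iteration
    out
}
growFast <- function(n) {
    out <- double(n)  # allocate once, with a known type
    for (i in 1:n) out[i] <- i^2
    out
}
# system.time(growSlowly(1e5)) will be dramatically slower than system.time(growFast(1e5))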
Next, we must define the for loop. We will use seq_along instead of 1:length(...), because if somebody passes an object of length zero, seq_along behaves properly (zero iterations), while 1:length(...) would still iterate and break everything:
for (i in seq_along(out)) {
    # do things here
}
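To see why, compare the two approaches on a zero-length vector:
x <- double(0)
1:length(x)   # c(1, 0): a loop over this would still run twice
seq_along(x)  # integer(0): the loop body is (correctly) never executed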
Now, let's write our for loop function, to get the errors n times:
GetErrors <- function(n) {
errors <- double(n)
for (i in seq_along(errors)) {
cat("evaluation number", i, "\n")
errors[i] <- getMeanError(model_lstm)
}
return(errors)
}
Next, let's get our errors for 1000 runs of the LSTM:
trainErr <- GetErrors(1000)
Next, let's look at the errors and get their mean and standard deviation:
hist(trainErr)
errorMean <- mean(trainErr)
errorStd <- sd(trainErr)
Finally, let's write a function that takes our predictions and gives them an uncertainty interval:
makeInterval <- function(prediction) {
    # interval half-width: mean training error plus one standard deviation
    pm <- errorMean + errorStd
    upper <- prediction + pm
    lower <- prediction - pm
    return(data.frame(fitted = prediction, upper = upper, lower = lower))
}
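As a quick sanity check, here is a hypothetical usage (the numbers are made up; the real call happens on the test-set predictions below):
toyPreds <- c(100, 120, 95)
makeInterval(toyPreds)  # a data.frame with columns fitted, upper, and lower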
We define a constructor and two S3 methods (autoplot and scores) for predictions made by Keras:
as.keras <- function(x) structure(x, class = "keras")
autoplot.keras <- function(obj) {
testdf <- data.frame(type = "actual", t = seq_along(test[, 1]), ppm = as.numeric(test[,
1]))
preddf <- data.frame(type = "predicted", t = seq_along(test[, 1]), ppm = as.numeric(obj$fitted))
confdf <- data.frame(t = seq_along(test[, 1]), upper = obj$upper, lower = obj$lower)
dfl <- list(testdf, preddf)
.testPredPlot(dfl) + geom_line(data = confdf, aes(x = t, y = lower, alpha = 0.2),
    linetype = "3003") + geom_line(data = confdf, aes(x = t, y = upper, alpha = 0.2),
    linetype = "3003") + guides(alpha = FALSE)
}
scores.keras <- function(obj) {
c(ase = ASE(obj$fitted, test[, 1]), mape = MAPE(obj$fitted, test[, 1]),
Conf.Score = confScore(upper = obj$upper, lower = obj$lower, test[,
1]))
}
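Because as.keras simply tags a list with the class "keras", the generics dispatch to these methods automatically. A tiny, made-up illustration (the commented calls would also need the test set in scope):
x <- as.keras(list(fitted = 1:3, upper = 2:4, lower = 0:2))
class(x)       # "keras"
# autoplot(x)  # dispatches to autoplot.keras()
# scores(x)    # dispatches to scores.keras()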
Now that we are set up, let's predict on the test set to see how we did:
lstmTest <- predict(model_lstm, test3, n.ahead = nrow(test2))
lstmTest <- makeInterval(lstmTest)
descaleTest <- function(x) {
    # undo the scaling applied by scale(), using the attributes stored on test2
    x * attr(test2, "scaled:scale")[1] + attr(test2, "scaled:center")[1]
}
lstmTest <- lapply(lstmTest, descaleTest)
lstmTest <- as.keras(lstmTest)
That was pretty easy. Now we see why we made the test2 and test3 objects, with the proper dimensions for our neural network; without them this would be very difficult. The input array for our model has dimensions (samples, timesteps, features), that is:
array(data, dim = c(nrow(input), lookback/step, ncol(input)))
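For instance (purely illustrative numbers), 100 samples with a 24-step lookback at step 1 over 5 predictor columns would be stored as:
x <- array(0, dim = c(100, 24, 5))
dim(x)  # 100 samples, 24 timesteps, 5 features
With the shapes sorted out, let's plot the forecast: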
autoplot(lstmTest)
Interesting. It matched the shape of the data excellently, or at least the overall pattern, but the peaks and troughs seem to be flattened. This is due to our slightly too high learning rate. To put it in human terms, the LSTM is learning too fast, and “skipping over” some details.
However, trying to capture every peak and trough exactly is again a bit of overkill at this forecast length, and being conservative and “pretty good” is completely appropriate. Let's check out the scores:
data.frame(t(scores(lstmTest)))
Wow. This gave us the very best forecast so far. Let us now try to build an ensemble forecast. To do this (the reasoning will be explained later), let's fit the LSTM to the second-to-last year of our dataset:
lstmVal <- predict(model_lstm, val, n.ahead = nrow(test2))
lstmVal <- makeInterval(lstmVal)
descaleVal <- function(x) {
    # undo scaling using the validation set's center and scale, stored earlier
    x * valScale + valCent
}
lstmVal <- lapply(lstmVal, descaleVal)
lstmVal <- as.keras(lstmVal)
All work performed in this project is intended to be completely reproducible. The following sections describe the necessary steps to reproduce this analysis. Obviously the first step is to clone the parent repository:
git clone https://github.com/josephsdavid/ChinesePM.git
The author is a huge fan of the reproducible package manager nix, which lets the user share a reproducible file, shell.nix. When someone else runs nix-shell from the command line, the contents of the file are installed in isolation from the rest of the user's system, and they are dropped into a shell which perfectly reproduces the working environment of this project. This includes automatically installing and setting up the tensorflow/keras deep learning stack and other complex dependencies; and if you comment out the tensorflow line and uncomment the tensorflow-with-CUDA line, it will automatically set up tensorflow to run with Nvidia CUDA.
To install the remaining dependencies (namely packages installed from GitHub via remotes/devtools), you can run Rscript libs.R. This technique is highly recommended. Below are the contents of libs.R:
library(remotes)
# packages that don't build in the nix shell
options(repos = "http://cran.rstudio.org")
myPackages = installed.packages()
cranPackages <- c("nnfor", "dtwclust")
to.install <- setdiff(cranPackages, myPackages[, 1])
if (length(to.install) > 0) install.packages(to.install)
if (!("tswgewrapped" %in% installed.packages())) {
install_github("josephsdavid/tswgewrapped")
}
The shell.nix file can be seen in the GitHub repository.
Note that using this method will also automatically set up tensorflow, Keras, tensorboard, and any other necessary dependencies for deep learning time series models (with R and Python).
To reproduce the R dependencies required for this analysis without nix, please run Rscript noNix.R. This script reads shell.nix, searches it for R packages, compares what it finds to what is installed on your computer, and installs anything that is missing:
library(remotes)
options(repos = "http://cran.rstudio.org")
myPackages = installed.packages()
shell <- readLines("shell.nix")
# keep only the lines that declare R packages
shell <- shell[grepl(shell, pattern = "rPackages")]
# strip the "rPackages." prefix and any whitespace to get bare package names
shell <- gsub("rPackages.", "", shell)
shell <- gsub(" ", "", shell)
otherPackages <- c("nnfor", "dtwclust")
cranPackages <- c(shell, otherPackages)
to.install <- setdiff(cranPackages, myPackages[, 1])
if (length(to.install) > 0) install.packages(to.install)
if (!("tswgewrapped" %in% installed.packages())) {
install_github("josephsdavid/tswgewrapped")
}
Note that this method will require you to set up a deep learning stack manually.
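If you go that route, a minimal sketch of a manual setup using the keras R package might look like the following (exact arguments differ between package versions, so treat this as illustrative rather than definitive):
install.packages("keras")
library(keras)
install_keras()  # installs Keras and TensorFlow into a private environment;
                 # older versions accepted tensorflow = "gpu" for a CUDA build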
To reproduce the models from this project, you have two options: you can either load the saved .Rda files, or re-run all the models (as shown in this report). Note that the latter will likely take several days, as some of the models take a long time to train, and multiple grid searches over 10,000+ hyperparameter combinations were performed.
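For the first option, loading a saved object is a one-liner (the file name here is illustrative; see the repository for the actual paths):
load("models/some_model.Rda")  # restores the saved object(s) into the current session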