Overview

This project covers material in Forecasting: Principles and Practice by Hyndman and Athanasopoulos.

[The project is] a time series forecasting assignment. You [are] given a data set with basic information and basic objectives. Very much like a busy boss that does not have time to spell everything out. You can add your creativity. You can use additional tools beside R (however you should use some R). You will want to come up with a very nice report that is readable in Word. I care about technical accuracy, but presentation matters! I used think flash and flare were false and fake for analysts without substance. While I still think that (to a lesser extent), so what! It does in the real world! There are many things in the world or system that we don’t agree with, but that is reality. You have to work within the system. So, oftentimes your job output will be judged on flash as well as technical accuracy, so be good at both. [This is a] single Effort, no interaction with others outside of meetups on this assignment.

Submission: Word readable document for report (all in one), Excel readable for outputs (all in one, separate sheets).

System Setup

if (!require("xts")) install.packages("xts")
if (!require("xlsx")) install.packages("xlsx")
if (!require("tseries")) install.packages("tseries")
if (!require("forecast")) install.packages("forecast")

To use the xlsx package on Linux, RJava and Rgdal must be installed.

sudo apt-get update
sudo apt-get install default-jre # Install Java
sudo apt-get install default-jdk # Install JDK
sudo R CMD javareconf # Assotiate the JDK installed with R
sudo apt-get install r-cran-rjava # Install RJava
sudo apt-get install libgdal-dev libproj-dev # Install Rgdal

ATM Forecast

In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars, other than that it is straight forward. I am being somewhat ambiguous on purpose. I am giving you data, please provide your written report on your findings, visuals, discussion and your R code all within a Word readable document, except the forecast which you will put in an Excel readable file. I must be able to cut and paste your R code and run it in R studio or other interpreter.

Import Data

There are several packages for importing Excel files into R. The native method in R version 3.3.1 (2016-06-21) uses the readxl() function which translates Excel cell/column date types into R POSIXct data types. This new and improved date format is not compatible with the current version of other R packages. The read.xls() function from the openxlsx package does not currently support https and requires use of gsub() as a workaround. The xlsx package is versatile and has the added benefit of a function that exports to Excel.

github <- "https://raw.githubusercontent.com/jzuniga123"
file <- "/SPS/master/DATA%20624/ATM624Data.xlsx"
download.file(paste0(github, file), "temp.xlsx", mode="wb")
atm <- xlsx::read.xlsx("temp.xlsx", sheetIndex=1, header=T)
invisible(file.remove("temp.xlsx"))

Explore Data

The Time Series data are first explored through number summaries and visualizations.

Number Summaries

The summary() function provides descriptive statistics on quantitative and qualitative variables. The xts::periodocity() function describes the periods between dates in the time series. The complete.cases() function returns observations in the datasets that have no missing values.

summary(atm)

##       DATE              ATM           Cash        
##  Min.   :2009-05-01   ATM1:365   Min.   :    0.0  
##  1st Qu.:2009-08-01   ATM2:365   1st Qu.:    0.5  
##  Median :2009-11-01   ATM3:365   Median :   73.0  
##  Mean   :2009-10-31   ATM4:365   Mean   :  155.6  
##  3rd Qu.:2010-02-01   NA's: 14   3rd Qu.:  114.0  
##  Max.   :2010-05-14              Max.   :10919.8  
##                                  NA's   :19

xts::periodicity(unique(atm$DATE))

## Daily periodicity from 2009-05-01 to 2010-05-14

atm[!complete.cases(atm), ]

##           DATE  ATM Cash
## 87  2009-06-13 ATM1   NA
## 93  2009-06-16 ATM1   NA
## 98  2009-06-18 ATM2   NA
## 105 2009-06-22 ATM1   NA
## 110 2009-06-24 ATM2   NA
## 731 2010-05-01 <NA>   NA
## 732 2010-05-02 <NA>   NA
## 733 2010-05-03 <NA>   NA
## 734 2010-05-04 <NA>   NA
## 735 2010-05-05 <NA>   NA
## 736 2010-05-06 <NA>   NA
## 737 2010-05-07 <NA>   NA
## 738 2010-05-08 <NA>   NA
## 739 2010-05-09 <NA>   NA
## 740 2010-05-10 <NA>   NA
## 741 2010-05-11 <NA>   NA
## 742 2010-05-12 <NA>   NA
## 743 2010-05-13 <NA>   NA
## 744 2010-05-14 <NA>   NA

summary(factor(atm$ATM)[!is.na(atm$Cash) & atm$Cash %% 1 != 0])

## ATM1 ATM2 ATM3 ATM4 
##    0    0    0  365

This is a time series spanning May 1, 2009 to April 30, 2010. The time series contains cash transaction amounts for four ATMs. The cash transactions have missing values and skewed distribution since the mean is higher than the third quartile. There are also non-integer transactions at ATM 4 implying that these data are likely debit card purchase transactions.

Visualizations

Visualizations consists of Time Plots, Histograms, ACF plots, and PACF plots.

Time Plot and Histogram

The Time Plot is a line graph that plots each observed value against the time of the observation, with a single line connecting each observation across the entire period. The Histogram displays the frequency at which values in a vector occur.

par(mfrow=c(4, 2), mar = c(3, 5, 0, 0), oma = c(0, 0, 0.5, 0.5))
for(i in 1:length(levels(atm$ATM))) {
  atm_sub <- subset(atm, ATM == paste0("ATM", i))
  atm_ts <- xts::xts(atm_sub$Cash, order.by=atm_sub$DATE)
  n <- nrow(atm_ts); l <- rep(1, n); m <- rep(20, n); h <- rep(100, n)
  print(plot(cbind(atm_ts, l, m,h), main=paste0("ATM", i)))
  hist(atm_ts, col="steelblue", xlab="", main="")
}

The Time Plots and Histograms for ATM1 and ATM2 are unremarkable. The Time Plot and Histogram of ATM3 shows the data consists mostly of zero values with a handful of transactions occurring at the end of the series. ATM3 will not be modeled due to these degenerative properties. The Time Plot and Histogram of ATM4 shows an extreme outlier around the three-quarter mark of the series. The horizontal lines in the Time Plots delineate $1, $20, and $100 in red, green, and blue; respectively.

ACF and PACF

The ACF plot shows the autocorrelations between each observation and its immediate predecessor (lagged observation). The PACF plot shows the autocorrelations between the current observation and each individual lagged observation The xts::xts()function converts data to a time series object which displays better in visualizations than time series objects created using other packages.

par(mfrow=c(4, 2), mar = c(3, 5, 0, 0), oma = c(0, 0, 0.5, 0.5))
for(i in 1:length(levels(atm$ATM))) {
  atm_sub <- subset(atm, ATM == paste0("ATM", i))
  atm_ts <- xts::xts(atm_sub$Cash, order.by=atm_sub$DATE)
  acf(na.omit(atm_ts), ylab=paste0("ACF ATM", i), main="") 
  pacf(na.omit(atm_ts), ylab=paste0("PACF ATM", i), main="")
}

The ACF and PACF plots for ATM1, ATM2, and ATM3 show autocorrelation between each observation and its immediate predecessor and autocorrelation between the current observation and other individual lagged observations. The ACF and PACF plots for ATM3 however, are not reliable due to the dearth of observations.

Clean Data

Data are cleaned using forecast::tsclean() and then converted to a time series object using the ts() function. The tsclean() function imputes nulls and removes outliers. The ts()function converts data to a time series object which is compatible with the forecast package.

for(i in 1:length(levels(atm$ATM))) {
  atm_num <- paste0("ATM", i)
  atm_sub <- subset(atm, ATM == atm_num, select=-2)
  atm_sub$Cash <- forecast::tsclean(atm_sub$Cash, replace.missing=T)
  assign(atm_num, ts(atm_sub$Cash, frequency = 7, start=start(atm_sub$DATE)))
}

The frequency parameter is selected in accordance with the documentation accompanying the ts() function:

The value of argument frequency is used when the series is sampled an integral number of times in each unit time interval. For example, one could use a value of 7 for frequency when the data are sampled daily, and the natural time period is a week, or 12 when the data are sampled monthly and the natural time period is a year. Values of 4 and 12 are assumed in (e.g.) print methods to imply a quarterly and monthly series respectively.

Examine Trend

A moving average smoother is helpful in examining what kind of trend is involved in a series. Moving average models should not be confused with moving average smoothing. A moving average model is used for forecasting future values while moving average smoothing is used for estimating the trend-cycle component of past values. The ma() function computes a simple moving average smoother of a given time series.

par(mfrow=c(3, 1), mar = c(0, 4, 0, 0), oma = c(0, 0, 0.5, 0.5))
plot(ATM1, col=8, xaxt = "n", ylab="ATM1")
lines(forecast::ma(ATM1, order=7), col=2)
lines(forecast::ma(ATM1, order=30), col=4)
plot(ATM2, col=8, xaxt = "n", ylab="ATM3")
lines(forecast::ma(ATM2, order=7), col=2)
lines(forecast::ma(ATM2, order=30), col=4)
plot(ATM4, col=8, xaxt = "n", ylab="ATM4")
lines(forecast::ma(ATM4, order=7), col=2)
lines(forecast::ma(ATM4, order=30), col=4)

The 7-day (weekly) and 30-day (monthly) moving average smoother line shows that the data for the ATMs have no apparent trend.

Decomposition

The Decomposition Plot decomposes and plots the observed values, the underlying trend, seasonality, and randomness of the time series data.

plot(decompose(ATM1))

plot(decompose(ATM2))

plot(decompose(ATM4))

Plotting the trend-cycle and seasonal indices computed by additive decomposition shows that the data have no apparent trend, seasonal fluctuations, and fairly random residuals.

Dickey-Fuller Test

An augmented Dickey-Fuller unit root test evaluates if the data exhibit a Stationarity process with deterministic trend or a Stationarity process with stochastic trend.

tseries::adf.test(ATM1)

## 
##  Augmented Dickey-Fuller Test
## 
## data:  ATM1
## Dickey-Fuller = -4.5329, Lag order = 7, p-value = 0.01
## alternative hypothesis: stationary

tseries::adf.test(ATM2)

## 
##  Augmented Dickey-Fuller Test
## 
## data:  ATM2
## Dickey-Fuller = -6.046, Lag order = 7, p-value = 0.01
## alternative hypothesis: stationary

tseries::adf.test(ATM4)

## 
##  Augmented Dickey-Fuller Test
## 
## data:  ATM4
## Dickey-Fuller = -5.6304, Lag order = 7, p-value = 0.01
## alternative hypothesis: stationary

The augmented Dickey-Fuller unit root test $p$-values are below $\alpha=0.05$. Therefore, the null hypothesis that the data has unit roots is rejected. The data exhibit stochastic trend which suggests using regression (AR) in lieu of differencing. Autoregressive (AR) modeling acts like partial differencing when $\phi<1$. When $\phi=1$ the AR(1) model is like a first-order difference.

Model Data

Train-Test Split

The train and test sets are created by referencing rows by index.

index_train <- 1:(length(ATM1) - 30)
ATM1_train <- ts(ATM1[index_train], frequency=7)
ATM1_test <- ts(ATM1[-index_train], frequency=7)
index_train <- 1:(length(ATM2) - 30)
ATM2_train <- ts(ATM2[index_train], frequency=7)
ATM2_test <- ts(ATM2[-index_train], frequency=7)
index_train <- 1:(length(ATM3) - 30)
ATM3_train <- ts(ATM3[index_train], frequency=7)
ATM3_test <- ts(ATM3[-index_train], frequency=7)
index_train <- 1:(length(ATM4) - 30)
ATM4_train <- ts(ATM4[index_train], frequency=7)
ATM4_test <- ts(ATM4[-index_train], frequency=7)

The indexed rows for the test set are a window at the end of the times series. The window sized for the test set is that of the desired prediction. The training set window is comprised of the indexes which are the complement of the test set indexes.

Transformation

The Augmented Dickey-Fuller Test results support not differencing. Data can be seasonally adjusted for modeling and then reseasonalized for predictions. The modeling algorithm being used evaluates seasonal components and produces predictions that reflect the seasonality in the underlying data. Therefore, the data need not be seasonally adjusted.Heteroskedasticity refers to the circumstance in which the variability of a variable is unequal across the range of values of a second variable. Box-Cox transformations can help to stabilize the variance of a time series. Some typical Box-Cox transformations are: \[-2\Rightarrow \frac { 1 }{ y_t^{ 2 } }, \quad -1\Rightarrow \frac { 1 }{ y_t }, \quad -0.5\Rightarrow \frac { 1 }{ \sqrt { y_t } }, \quad 0\Rightarrow \log { \left( y_t \right) }, \quad 0.5\Rightarrow \sqrt { y_t }, \quad 1\Rightarrow y_t, \quad 2\Rightarrow y_t^{ 2 }\]

(lambda1 <- forecast::BoxCox.lambda(ATM1_train))

## [1] 0.4905058

(lambda2 <- forecast::BoxCox.lambda(ATM2_train))

## [1] 0.6682213

(lambda4 <- forecast::BoxCox.lambda(ATM4_train))

## [1] 0.4986909

The Box-Cox transformation parameters suggested are around $\lambda=0.5$. This rounded (more interpretable) value is suggestive of a ${ 1 }/{ \sqrt { y_t } }$ transformation. These Box-Cox transformations stabilize the variance and make each series relatively homoskedastic with equal variance.

ARIMA Model

The auto.arima() function chooses an ARIMA model automatically. It uses a variation of the Hyndman and Khandakar algorithm which combines unit root tests, minimization of the AICc, and MLE to obtain an ARIMA model. The function takes some short-cuts in order to speed up the computation and will not always yield the best model. Setting stepwise and approximation to FALSE prevents the function from taking short-cuts.

(fit1 <- forecast::auto.arima(ATM1_train, stepwise=F, approximation=F, d=0, lambda=lambda1))

## Series: ATM1_train 
## ARIMA(0,0,1)(2,0,0)[7] with non-zero mean 
## Box Cox transformation: lambda= 0.4905058 
## 
## Coefficients:
##          ma1    sar1    sar2     mean
##       0.1374  0.4184  0.4328  15.0641
## s.e.  0.0649  0.0555  0.0575   1.1921
## 
## sigma^2 estimated as 9.273:  log likelihood=-647.98
## AIC=1305.97   AICc=1306.21   BIC=1323.67

(fit2 <- forecast::auto.arima(ATM2_train, stepwise=F, approximation=F, d=0, lambda=lambda2))

## Series: ATM2_train 
## ARIMA(0,0,0)(2,0,0)[7] with non-zero mean 
## Box Cox transformation: lambda= 0.6682213 
## 
## Coefficients:
##         sar1    sar2     mean
##       0.3731  0.4630  21.3885
## s.e.  0.0555  0.0573   2.1841
## 
## sigma^2 estimated as 47.44:  log likelihood=-856.41
## AIC=1720.81   AICc=1720.97   BIC=1734.98

(fit4 <- forecast::auto.arima(ATM4_train, stepwise=F, approximation=F, d=0, lambda=lambda4))

## Series: ATM4_train 
## ARIMA(1,0,0)(2,0,0)[7] with non-zero mean 
## Box Cox transformation: lambda= 0.4986909 
## 
## Coefficients:
##          ar1    sar1    sar2     mean
##       0.0923  0.1974  0.1696  36.0373
## s.e.  0.0635  0.0622  0.0636   1.9024
## 
## sigma^2 estimated as 324.1:  log likelihood=-1097.31
## AIC=2204.63   AICc=2204.87   BIC=2222.33

The auto.arima() function suggests an $\textrm{ARIMA}(0,0,1)(2,0,0)_{7}$ model for ATM1, an $\textrm{ARIMA}(0,0,0)(2,0,0)_{7}$ model for ATM2, and an $\textrm{ARIMA}(1,0,0)(2,0,0)_{7}$ model for ATM4.

Evaluate

ACF and PACF

par(mfrow=c(3, 2), mar = c(3, 5, 0, 0), oma = c(0, 0, 0.5, 0.5))
acf(residuals(fit1), ylab="ACF ATM1"); pacf(residuals(fit1), ylab="PACF ATM1")
acf(residuals(fit2), ylab="ACF ATM2"); pacf(residuals(fit2), ylab="PACF ATM2")
acf(residuals(fit4), ylab="ACF ATM4"); pacf(residuals(fit4), ylab="PACF ATM4")

The residuals of the models appear to display the characteristics of White Noise in the ACF and PACF plots with only one of the twenty residuals (or 0.05%) being significant. At a 95% confidence interval this is within probabilistic expectations.

Box-Ljung Test

The Box-Ljung test is helpful in assessing if data follow a White Noise pattern. The arma attribute of the fitted model returns a vector containing the ARIMA model parameters $p$, $q$, $P$, $Q$, period, $d$, and $D$; in that order.

Box.test(residuals(fit1), lag=7, fitdf=sum(fit1$arma[1:2]), type="Ljung-Box")

## 
##  Box-Ljung test
## 
## data:  residuals(fit1)
## X-squared = 5.1641, df = 6, p-value = 0.5229

Box.test(residuals(fit2), lag=7, fitdf=sum(fit1$arma[1:2]), type="Ljung-Box")

## 
##  Box-Ljung test
## 
## data:  residuals(fit2)
## X-squared = 9.7143, df = 6, p-value = 0.1372

Box.test(residuals(fit4), lag=7, fitdf=sum(fit1$arma[1:2]), type="Ljung-Box")

## 
##  Box-Ljung test
## 
## data:  residuals(fit4)
## X-squared = 3.6851, df = 6, p-value = 0.7192

The null hypothesis of independence is not rejected. The Box-Ljung shows that the autocorrelations of the residuals from the models are not significantly different from zero at $\alpha=0.05$. The residuals of the models display the characteristics of White Noise. The models pass the required checks and are therefore suitable for forecasting.

Forecast

ATM3 was not modeled due to its degenerative properties. To forecast values for ATM3, the model for an ATM with a similar mean will be used.

c(mean(ATM1), mean(ATM2), mean(ATM3[ATM3!=0]), mean(ATM4))

## [1]  84.15479  62.59178  87.66667 444.75681

The mean of ATM1 is very close to the mean of the few values in ATM3. Therefore, the $\textrm{ARIMA}(0,0,1)(2,0,0)_{7}$ model for ATM1 will be used to make predictions for ATM3.

fit3 <- forecast::Arima(ATM3_train, model=fit1)

In-Sample

Forecasts are done using the forecast::forecast() function. Since the data were not seasonally adjusted, they need not be reseasonalized prior to forecast. Prediction point estimates are represented by a blue line, prediction intervals are represented by blue bands, and actual values are represented by a red line.

fcast1 <- forecast::forecast(fit1, h=30)
fcast2 <- forecast::forecast(fit2, h=30)
fcast3 <- forecast::forecast(fit3, h=30)
fcast4 <- forecast::forecast(fit4, h=30)
par(mfrow=c(4, 1), mar = c(0, 4, 0, 0), oma = c(4, 4, 2, 0.5))
plot(fcast1, ylab="Cash ATM1", main="", xaxt="n"); 
lines(lag(ATM1_test, -length(ATM1_train)), col="red")
plot(fcast2, ylab="Cash ATM2", main="", xaxt="n"); 
lines(lag(ATM2_test, -length(ATM2_train)), col="red")
plot(fcast3, ylab="Cash ATM3", main="", xaxt="n")
lines(lag(ATM3_test, -length(ATM3_train)), col="red")
plot(fcast4, ylab="Cash ATM4", main="", xaxt="n")
lines(lag(ATM4_test, -length(ATM4_train)), col="red")
title("ATM Predictions", outer=TRUE)

The predictions appear to produce a useful forecasts that reflect patterns in the original data.

Model Accuracy

The accuracy() function is helpful for obtaining summary measures of the forecast accuracy: Mean Error (ME), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), Mean Percentage Error (MPE), Mean Absolute Percentage Error (MAPE), Mean Absolute Scaled Error (MASE), and Autocorrelation of errors at lag 1 (ACF1).

round(forecast::accuracy(fcast1, length(ATM1_test)), 3)

##                   ME   RMSE    MAE      MPE    MAPE  MASE   ACF1
## Training set   2.735 24.143 15.932  -84.616 101.797 0.435 -0.019
## Test set     -73.955 73.955 73.955 -246.518 246.518 2.018     NA

round(forecast::accuracy(fcast2, length(ATM2_test)), 3)

##                   ME   RMSE    MAE     MPE   MAPE  MASE  ACF1
## Training set   2.067 26.242 18.655    -Inf    Inf 0.434 0.052
## Test set     -58.287 58.287 58.287 -194.29 194.29 1.355    NA

round(forecast::accuracy(fcast4, length(ATM4_test)), 3)

##                    ME    RMSE     MAE       MPE     MAPE  MASE  ACF1
## Training set   81.003 359.089 285.911  -400.987  445.408 0.733 0.015
## Test set     -539.929 539.929 539.929 -1799.765 1799.765 1.384    NA

These accuracy for the predications vary. ATM1 and ATM2 predictions are more accurate than ATM4 predictions. The closer the original data are to being White Noise, the less accurate the predictions.

Out-of-Sample

Forecasts are done using the forecast::forecast() function. Since the data were not seasonally adjusted, they need not be reseasonalized prior to forecast. These out-of-sample predictions and for exporting.

fcast1 <- forecast::forecast(forecast::Arima(ATM1, model=fit1), h=30)
fcast2 <- forecast::forecast(forecast::Arima(ATM2, model=fit2), h=30)
fcast3 <- forecast::forecast(forecast::Arima(ATM3, model=fit1), h=30)
fcast4 <- forecast::forecast(forecast::Arima(ATM4, model=fit4), h=30)

Export Results

Results are exported to an Excel file using the xlsx::write.xlsx() function which includes an option for appending a worksheet to an existing file.

xlsx::write.xlsx(fcast1, file="DATA624_Project1.xlsx", 
  sheetName="ATM1", col.names = T, row.names = T, append = F)
xlsx::write.xlsx(fcast2, file="DATA624_Project1.xlsx", 
  sheetName="ATM2", col.names = T, row.names = T, append = T)
xlsx::write.xlsx(fcast3, file="DATA624_Project1.xlsx", 
  sheetName="ATM3", col.names = T, row.names = T, append = T)
xlsx::write.xlsx(fcast4, file="DATA624_Project1.xlsx", 
  sheetName="ATM4", col.names = T, row.names = T, append = T)

Power Forecast

Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and a monthly forecast for 2014. The data is given in a single file. The variable KWH is power consumption in Kilowatt hours, the rest is straight forward. Add this to your existing files above.

Import Data

github <- "https://raw.githubusercontent.com/jzuniga123"
file <- "/SPS/master/DATA%20624/ResidentialCustomerForecastLoad-624.xlsx"
download.file(paste0(github, file), "temp.xlsx", mode="wb")
power <- xlsx::read.xlsx("temp.xlsx", sheetIndex=1, header=T)
power$YYYY.MMM <- as.Date(paste0(power$YYYY.MMM,"-01"), format = "%Y-%b-%d")
invisible(file.remove("temp.xlsx"))

The ambiguous “YYYY-MMM” format dates in this file are interpreted as factors. They must be converted to dates.

Explore Data

The Time Series data are first explored through number summaries and visualizations.

Number Summaries

summary(power)

##   CaseSequence      YYYY.MMM               KWH          
##  Min.   :733.0   Min.   :1998-01-01   Min.   :  770523  
##  1st Qu.:780.8   1st Qu.:2001-12-24   1st Qu.: 5429912  
##  Median :828.5   Median :2005-12-16   Median : 6283324  
##  Mean   :828.5   Mean   :2005-12-15   Mean   : 6502475  
##  3rd Qu.:876.2   3rd Qu.:2009-12-08   3rd Qu.: 7620524  
##  Max.   :924.0   Max.   :2013-12-01   Max.   :10655730  
##                                       NA's   :1

xts::periodicity(unique(power$YYYY.MMM))

## Monthly periodicity from 1998-01-01 to 2013-12-01

power[!complete.cases(power), ]

##     CaseSequence   YYYY.MMM KWH
## 129          861 2008-09-01  NA

This is a time series spanning January 1, 1998 to December 1, 2013. The time series contains kWh usage. There is only one missing value.

Visualizations

Visualizations consists of Time Plots, Histograms, ACF plots, and PACF plots.

Time Plot and Histogram

kWh <- xts::xts(power$KWH, order.by=power$YYYY.MMM)
par(mfrow=c(2, 1), mar = c(3, 5, 0, 0), oma = c(0, 0, 0.5, 0.5))
plot(kWh, main="kWh")
hist(kWh, col="steelblue", xlab="", main="")

The Time Plot and Histogram of ATM4 shows an outlier around the three-quarter mark of the series.

ACF and PACF

par(mfrow=c(2, 1), mar = c(3, 5, 0, 0), oma = c(0, 0, 0.5, 0.5))
acf(na.omit(kWh), ylab="kWh", main="") 
pacf(na.omit(kWh), ylab="kWh", main="")

The ACF and PACF plots show autocorrelation between each observation and its immediate predecessor and autocorrelation between the current observation and other individual lagged observations.

Clean Data

kWh <- ts(forecast::tsclean(power$KWH, replace.missing=T), 
          frequency = 12, start=start(power$YYYY.MMM))
kWh[kWh==min(kWh)] <- mean(kWh[kWh!=min(kWh)])

The tsclean() function removed the missing value but did not remove the outlier. The outlier is manually replaced with the mean of the series excluding the outlier. The frequency parameter is selected in accordance with the documentation accompanying the ts() function:

The value of argument frequency is used when the series is sampled an integral number of times in each unit time interval. For example, one could use a value of 7 for frequency when the data are sampled daily, and the natural time period is a week, or 12 when the data are sampled monthly and the natural time period is a year. Values of 4 and 12 are assumed in (e.g.) print methods to imply a quarterly and monthly series respectively.

Examine Trend

plot(kWh, col=8, xaxt = "n", ylab="ATM1")
lines(forecast::ma(kWh, order=6), col=2)
lines(forecast::ma(kWh, order=12), col=4)

The 6-month (biannual) and 12-month (annual) moving average smoother line shows that the data only a slight apparent trend.

Decomposition

The Decomposition Plot decomposes and plots the observed values, the underlying trend, seasonality, and randomness of the time series data.

plot(decompose(kWh))

Plotting the trend-cycle and seasonal indices computed by additive decomposition shows that the data have a slight apparent trend, seasonal fluctuations, and fairly random residuals.

Dickey-Fuller Test

An augmented Dickey-Fuller unit root test evaluates if the data exhibit a Stationarity process with deterministic trend or a Stationarity process with stochastic trend.

tseries::adf.test(kWh)

## 
##  Augmented Dickey-Fuller Test
## 
## data:  kWh
## Dickey-Fuller = -4.5466, Lag order = 5, p-value = 0.01
## alternative hypothesis: stationary

The augmented Dickey-Fuller unit root test $p$-value is below $\alpha=0.05$. Therefore, the null hypothesis that the data has unit roots is rejected. The data exhibit stochastic trend which suggests using regression (AR) in lieu of differencing. Autoregressive (AR) modeling acts like partial differencing when $\phi<1$. When $\phi=1$ the AR(1) model is like a first-order difference.

Model Data

Train-Test Split

The train and test sets are created by referencing rows by index.

index_train <- 1:(length(kWh) - 12)
kWh_train <- ts(kWh[index_train], frequency=12)
kWh_test <- ts(kWh[-index_train], frequency=12)

Transformation

(lambda <- forecast::BoxCox.lambda(kWh_train))

## [1] -0.1738523

The Box-Cox transformation parameter suggested is about $\lambda=-0.25$. This rounded (slightly more interpretable) value is suggestive of an inverse quartic root. This Box-Cox transformation stabilizes the variance and makes the series relatively homoskedastic with equal variance.

ARIMA Model

(fit <- forecast::auto.arima(kWh_train, stepwise=F, approximation=F, d=0, lambda=lambda))

## Series: kWh_train 
## ARIMA(0,0,3)(2,1,0)[12] with drift         
## Box Cox transformation: lambda= -0.1738523 
## 
## Coefficients:
##          ma1     ma2     ma3     sar1     sar2  drift
##       0.2803  0.0857  0.2227  -0.7729  -0.4416  1e-04
## s.e.  0.0757  0.0823  0.0687   0.0742   0.0812  1e-04
## 
## sigma^2 estimated as 3.652e-05:  log likelihood=623.26
## AIC=-1232.52   AICc=-1231.82   BIC=-1210.65

The auto.arima() function suggests an $\textrm{ARIMA}(0,0,3)(2,1,0)_{12}$ model.

Evaluate

ACF and PACF

par(mfrow=c(2, 1), mar = c(3, 5, 0, 0), oma = c(0, 0, 0.5, 0.5))
acf(residuals(fit), ylab="ACF kWh"); pacf(residuals(fit), ylab="PACF kWh")

The residuals of the model appear to display the characteristics of White Noise in both the ACF and PACF plots. None of the residuals are significant. At a 95% confidence interval this is well within probabilistic expectations.

Box-Ljung Test

Box.test(residuals(fit), lag=7, fitdf=sum(fit$arma[1:2]), type="Ljung-Box")

## 
##  Box-Ljung test
## 
## data:  residuals(fit)
## X-squared = 7.2788, df = 4, p-value = 0.1219

The null hypothesis of independence is not rejected. The Box-Ljung shows that the autocorrelations of the residuals from the model are not significantly different from zero at $\alpha=0.05$. The residuals of the model displays the characteristics of White Noise. The model passes the required checks and is therefore suitable for forecasting.

Forecast

In-Sample

fcast <- forecast::forecast(fit, h=12)
plot(fcast, ylab="kWh", main="kWh Predictions", xaxt="n")
lines(lag(kWh_test, -length(kWh_train)), col="red")

The prediction appears to produce a useful forecasts that reflect patterns in the original data.

Model Accuracy

round(forecast::accuracy(fcast, length(kWh_test)), 3)

##                       ME      RMSE       MAE         MPE       MAPE  MASE
## Training set    39670.75  581907.7  456894.4  5.8000e-02 7.0750e+00 0.413
## Test set     -9047400.57 9047400.6 9047400.6 -7.5395e+07 7.5395e+07 8.183
##               ACF1
## Training set 0.115
## Test set        NA

These accuracy for the predications is fair. The large metrics are representative of the large values found in the data.

Out-of-Sample

fcast <- forecast::forecast(forecast::Arima(kWh, model=fit), h=12)

Export Results

Results are exported to an Excel file using the xlsx::write.xlsx() function which includes an option for appending a worksheet to an existing file.

xlsx::write.xlsx(fcast, file="DATA624_Project1.xlsx", 
  sheetName="kWh", col.names = T, row.names = T, append = T)

Waterflow Forecast

Part C consists of two data sets. These are simple 2 columns sets, however they have different time stamps. Your optional assignment is to time-base sequence the data and aggregate based on hour (example of what this looks like, follows). Note for multiple recordings within an hour, take the mean. Then to determine if the data is stationary and can it be forecast. If so, provide a week forward forecast and present results as above in a Word readable file and the forecast in an Excel readable file.

Import Data

github <- "https://raw.githubusercontent.com/jzuniga123"
file <- "/SPS/master/DATA%20624/Waterflow_Pipe1.xlsx"
download.file(paste0(github, file), "temp.xlsx", mode="wb")
pipe1 <- xlsx::read.xlsx("temp.xlsx", sheetIndex=1, header=T)
file <- "/SPS/master/DATA%20624/Waterflow_Pipe2.xlsx"
download.file(paste0(github, file), "temp.xlsx", mode="wb")
pipe2 <- xlsx::read.xlsx("temp.xlsx", sheetIndex=1, header=T)
invisible(file.remove("temp.xlsx"))

Explore Data

The Time Series data are first explored through number summaries and visualizations.

Number Summaries

summary(pipe1)

##    Date.Time                     WaterFlow     
##  Min.   :2015-10-23 00:24:06   Min.   : 1.067  
##  1st Qu.:2015-10-25 11:21:35   1st Qu.:13.683  
##  Median :2015-10-27 20:07:30   Median :19.880  
##  Mean   :2015-10-27 20:49:15   Mean   :19.897  
##  3rd Qu.:2015-10-30 08:24:51   3rd Qu.:26.159  
##  Max.   :2015-11-01 23:35:43   Max.   :38.913

summary(pipe2)

##    Date.Time                     WaterFlow     
##  Min.   :2015-10-23 01:00:00   Min.   : 1.885  
##  1st Qu.:2015-11-02 10:45:00   1st Qu.:28.140  
##  Median :2015-11-12 20:30:00   Median :39.682  
##  Mean   :2015-11-12 20:30:00   Mean   :39.556  
##  3rd Qu.:2015-11-23 06:15:00   3rd Qu.:50.782  
##  Max.   :2015-12-03 16:00:00   Max.   :78.303

xts::periodicity(unique(pipe1$Date.Time))

## 10.0198883493741 minute periodicity from 2015-10-23 00:24:06 to 2015-11-01 23:35:43

xts::periodicity(unique(pipe2$Date.Time))

## Hourly periodicity from 2015-10-23 01:00:00 to 2015-12-03 16:00:00

These are time series’ spanning October 23, 2015 to November 1, 2015 for Pipe 1 and to December 3, 2015 for Pipe 2. The time series contains water flow measurements for two pipes. The datasets do not contain any missing values or apparent outliers.

Sequence

The xts:: period.apply() function groups an xts time series object into a specified period and applies an aggregate function. It is similar to the apply() function except that it “applies a function to non-overlapping intervals along a vector.” The xts::periodocity() function describes the periods between dates in the time series.

Pipe1 <- xts::xts(pipe1$WaterFlow, order.by=pipe2$Date.Time)
Pipe1h <- xts::period.apply(Pipe1, xts::endpoints(Pipe1, "hours"), mean)
xts::periodicity(Pipe1h)

## Hourly periodicity from 2015-10-23 01:59:59 to 2015-12-03 16:00:00

The data for Pipe 1 which have a period of every ten minutes are grouped and converted to hourly with the mean() function applied to the aggregated hourly periods.

Aggregate

The merge() function can be used to join two time series objects.

Pipe2 <- xts::xts(pipe2$WaterFlow, order.by=pipe2$Date.Time)
Pipes <- merge(merge(Pipe1, Pipe2, join='inner'), Pipe1+Pipe2, join='inner')
head(Pipes)

##                         Pipe1    Pipe2 Pipe1...Pipe2
## 2015-10-23 01:00:00 23.369599 18.81079      42.18039
## 2015-10-23 01:59:59 28.002881 43.08703      71.08991
## 2015-10-23 03:00:00 23.065895 37.98770      61.05360
## 2015-10-23 04:00:00 29.972809 36.12038      66.09319
## 2015-10-23 04:59:59  5.997953 31.85126      37.84921
## 2015-10-23 06:00:00 15.935223 28.23809      44.17331

The merge() function takes two arguments. Therefore, Pipe1 and Pipe2 are first merged using an inner join, then the sum of the Pipes is merged to resultant multi-series using an inner join.

Visualizations

Visualizations consists of Time Plots, Histograms, ACF plots, and PACF plots.

Time Plot and Histogram

par(mfrow=c(3, 2), mar = c(3, 5, 0, 0), oma = c(0, 0, 0.5, 0.5))
plot(Pipe1, main="Pipe 1")
hist(Pipe1, col="steelblue", xlab="", main="")
plot(Pipe1h, main="Pipe 1 Hourly")
hist(Pipe1h, col="steelblue", xlab="", main="")
plot(Pipe2, main="Pipe 2")
hist(Pipe2, col="steelblue", xlab="", main="")

The Time Plots and Histograms for the Pipes and the hourly aggregate of Pipe 1 are remarkable insofar as the data following what appears to be a Gaussian distribution.

ACF and PACF

par(mfrow=c(3, 2), mar = c(3, 5, 0, 0), oma = c(0, 0, 0.5, 0.5))
acf(Pipe1, ylab="ACF Pipe 1", main="") 
pacf(Pipe1, ylab="PACF Pipe 1", main="")
acf(Pipe1h, ylab="ACF Pipe 1 Hourly", main="") 
pacf(Pipe1h, ylab="PACF Pipe 1 Hourly", main="")
acf(Pipe2, ylab="ACF Pipe 2", main="") 
pacf(Pipe2, ylab="PACF Pipe 2", main="")

The ACF and PACF plots for Pipes and the hourly aggregate of Pipe 1 show little if any autocorrelation between each observation and its immediate predecessor and little if any autocorrelation between the current observation and other individual lagged observations.

Clean Data

Pipe1 <- ts(Pipe1h, frequency = 12, start=start(Pipe1h))
Pipe2 <- ts(pipe2$WaterFlow, frequency = 12, start=start(pipe2$Date.Time))

The hourly aggregate of Pipe 1 is being used in this analysis. The frequency parameter is selected in accordance with the documentation accompanying the ts() function:

The value of argument frequency is used when the series is sampled an integral number of times in each unit time interval. For example, one could use a value of 7 for frequency when the data are sampled daily, and the natural time period is a week, or 12 when the data are sampled monthly and the natural time period is a year. Values of 4 and 12 are assumed in (e.g.) print methods to imply a quarterly and monthly series respectively.

Examine Trend

par(mfrow=c(2, 1), mar = c(0, 4, 0, 0), oma = c(0, 0, 0.5, 0.5))
plot(Pipe1, col=8, xaxt = "n", ylab="Pipe 1 Hourly")
lines(forecast::ma(Pipe1, order=2), col=2)
lines(forecast::ma(Pipe1, order=14), col=4)
plot(Pipe2, col=8, xaxt = "n", ylab="Pipe 2")
lines(forecast::ma(Pipe2, order=2), col=2)
lines(forecast::ma(Pipe2, order=14), col=4)

The 24-hour (daily) and 168-hour (weekly) moving average smoother line shows that the data for the ATMs have no apparent trend.

Decomposition

The Decomposition Plot decomposes and plots the observed values, the underlying trend, seasonality, and randomness of the time series data.

plot(decompose(Pipe1))

plot(decompose(Pipe2))

Plotting the trend-cycle and seasonal indices computed by additive decomposition shows that the data have no apparent trend, some seasonal fluctuations, and fairly random residuals.

Dickey-Fuller Test

An augmented Dickey-Fuller unit root test evaluates if the data exhibit a Stationarity process with deterministic trend or a Stationarity process with stochastic trend.

tseries::adf.test(Pipe1)

## 
##  Augmented Dickey-Fuller Test
## 
## data:  Pipe1
## Dickey-Fuller = -7.4928, Lag order = 8, p-value = 0.01
## alternative hypothesis: stationary

tseries::adf.test(Pipe2)

## 
##  Augmented Dickey-Fuller Test
## 
## data:  Pipe2
## Dickey-Fuller = -9.5123, Lag order = 9, p-value = 0.01
## alternative hypothesis: stationary

Model Data

Train-Test Split

The train and test sets are created by referencing rows by index.

index_train <- 1:(length(Pipe1) - 24*7)
Pipe1_train <- ts(Pipe1[index_train], frequency=12)
Pipe1_test <- ts(Pipe1[-index_train], frequency=12)
index_train <- 1:(length(Pipe2) - 24*7)
Pipe2_train <- ts(Pipe2[index_train], frequency=12)
Pipe2_test <- ts(Pipe2[-index_train], frequency=12)

Transformation

(lambda1 <- forecast::BoxCox.lambda(Pipe1_train))

## [1] 0.9951213

(lambda2 <- forecast::BoxCox.lambda(Pipe2_train))

## [1] 0.9064793

The Box-Cox transformations suggested are around $\lambda=1$. This rounded (more interpretable) value is suggestive of no transformation. It will still be applied however since Box-Cox transformations stabilize the variance and make each series relatively homoskedastic with equal variance.

ARIMA Model

(fit1 <- forecast::auto.arima(Pipe1_train, stepwise=F, approximation=F, d=0, lambda=lambda1))

## Series: Pipe1_train 
## ARIMA(0,0,0)(2,0,2)[12] with non-zero mean 
## Box Cox transformation: lambda= 0.9951213 
## 
## Coefficients:
##         sar1     sar2     sma1    sma2     mean
##       1.1463  -0.6880  -1.2000  0.6353  18.5951
## s.e.  0.2137   0.1838   0.2241  0.2150   0.2630
## 
## sigma^2 estimated as 53.23:  log likelihood=-1697.88
## AIC=3407.77   AICc=3407.94   BIC=3433.04

(fit2 <- forecast::auto.arima(Pipe2_train, stepwise=F, approximation=F, d=0, lambda=lambda2))

## Series: Pipe2_train 
## ARIMA(0,0,0)(2,0,0)[12] with non-zero mean 
## Box Cox transformation: lambda= 0.9064793 
## 
## Coefficients:
##          sar1    sar2     mean
##       -0.0040  0.0894  29.6302
## s.e.   0.0348  0.0350   0.4281
## 
## sigma^2 estimated as 128.7:  log likelihood=-3199.75
## AIC=6407.51   AICc=6407.55   BIC=6426.4

The auto.arima() function suggests an $\textrm{ARIMA}(0,0,0)(2,0,2)_{12}$ model for Pipe 1, and an $\textrm{ARIMA}(0,0,0)(2,0,0)_{12}$ model for Pipe 2.

Evaluate

ACF and PACF

par(mfrow=c(2, 2), mar = c(3, 5, 0, 0), oma = c(0, 0, 0.5, 0.5))
acf(residuals(fit1), ylab="ACF Pipe 1"); pacf(residuals(fit1), ylab="PACF Pipe 1")
acf(residuals(fit2), ylab="ACF Pipe 2"); pacf(residuals(fit2), ylab="PACF Pipe 2")

The residuals of the models appear to display the characteristics of White Noise in the ACF and PACF plots with under one of the twenty residuals (or 0.05%) being significant. At a 95% confidence interval this is within probabilistic expectations.

Box-Ljung Test

Box.test(residuals(fit1), lag=7, fitdf=sum(fit1$arma[1:2]), type="Ljung-Box")

## 
##  Box-Ljung test
## 
## data:  residuals(fit1)
## X-squared = 7.8646, df = 7, p-value = 0.3447

Box.test(residuals(fit2), lag=7, fitdf=sum(fit1$arma[1:2]), type="Ljung-Box")

## 
##  Box-Ljung test
## 
## data:  residuals(fit2)
## X-squared = 4.9378, df = 7, p-value = 0.6676

Forecast

In-Sample

fcast1 <- forecast::forecast(fit1, h=24*7)
fcast2 <- forecast::forecast(fit2, h=24*7)
par(mfrow=c(2, 1), mar = c(0, 4, 0, 0), oma = c(4, 4, 2, 0.5))
plot(fcast1, ylab="Waterflow Pipe 1", main="", xaxt="n")
lines(lag(Pipe1_test, -length(Pipe1_train)), col="red")
plot(fcast2, ylab="Waterflow Pipe 2", main="", xaxt="n")
lines(lag(Pipe2_test, -length(Pipe2_train)), col="red")
title("Waterflow Predictions", outer=TRUE)

The predictions appear to produce forecasts that do not reflect patterns in the original data. The predictions intervals capture the actual data but the point estimates revert to the mean. This is likely due to the original data being Gaussian White Noise.

Model Accuracy

round(forecast::accuracy(fcast1, length(Pipe1_test)), 3)

##                   ME    RMSE     MAE     MPE   MAPE   MASE   ACF1
## Training set  -0.002   7.365   6.005 -22.484 43.434  0.690 -0.029
## Test set     146.326 146.326 146.326  87.099 87.099 16.813     NA

round(forecast::accuracy(fcast2, length(Pipe2_test)), 3)

##                   ME    RMSE     MAE     MPE   MAPE  MASE   ACF1
## Training set   0.331  15.837  12.933 -32.738 55.922 0.701 -0.026
## Test set     128.829 128.829 128.829  76.684 76.684 6.985     NA

The closer data are to being White Noise, the less accurate predictions become because White Noise is not predicable. The accuracy for these predications is without meaning since the original data are likely Gaussian White Noise.

Out-of-Sample

fcast1 <- forecast::forecast(forecast::Arima(Pipe1, model=fit1), h=30)
fcast2 <- forecast::forecast(forecast::Arima(Pipe2, model=fit2), h=30)

Export Results

Results are exported to an Excel file using the xlsx::write.xlsx() function which includes an option for appending a worksheet to an existing file.

xlsx::write.xlsx(fcast1, file="DATA624_Project1.xlsx", 
  sheetName="Pipe1", col.names = T, row.names = T, append = T)
xlsx::write.xlsx(fcast2, file="DATA624_Project1.xlsx", 
  sheetName="Pipe2", col.names = T, row.names = T, append = T)
xlsx::write.xlsx(Pipes, file="DATA624_Project1.xlsx", 
  sheetName="PipeMerge", col.names = T, row.names = T, append = T)