Project 1

The RPubs version of this work can be found here, and the source/data can be found on GitHub here.

Description

This project consists of three parts, two required and one bonus, and is worth 15% of your grade. The project is due at 11:59 PM on Sunday, March 31. I will accept late submissions, with a penalty, until the meetup after that, when we review some projects.

rm(list = ls())

library(knitr)
library(ggplot2)
library(tidyr)
library(dplyr)
library(tseries)
library(forecast)
library(lubridate)
library(tidyverse)

Load Data

Load all the data for all three parts from GitHub in the interest of reproducibility.

temp_file <- tempfile(fileext = ".xlsx")

download.file(url = "https://github.com/plb2018/DATA624/raw/master/project1/ATM624Data.xlsx?raw=true", 
              destfile = temp_file, 
              mode = "wb", 
              quiet = TRUE)

ATM <- readxl::read_excel(temp_file,skip=0,col_types = c("date","text","numeric"))


download.file(url = "https://github.com/plb2018/DATA624/raw/master/project1/ResidentialCustomerForecastLoad-624.xlsx?raw=true", 
              destfile = temp_file, 
              mode = "wb", 
              quiet = TRUE)

power <- readxl::read_excel(temp_file,skip=0,col_types = c("numeric","text","numeric"))




download.file(url = "https://github.com/plb2018/DATA624/raw/master/project1/Waterflow_Pipe1.xlsx?raw=true", 
              destfile = temp_file, 
              mode = "wb", 
              quiet = TRUE)

water1 <- readxl::read_excel(temp_file,skip=0,col_types = c("date","numeric"))

download.file(url = "https://github.com/plb2018/DATA624/raw/master/project1/Waterflow_Pipe2.xlsx?raw=true", 
              destfile = temp_file, 
              mode = "wb", 
              quiet = TRUE)

water2 <- readxl::read_excel(temp_file,skip=0,col_types = c("date","numeric"))
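Since the same download-and-read pattern repeats four times above, it could be wrapped in a small helper. Below is a minimal sketch (read_gh_xlsx is a hypothetical name, not a function from any package):

#helper: download an xlsx to a temp file and read it with the given column types
read_gh_xlsx <- function(url, col_types) {
  tmp <- tempfile(fileext = ".xlsx")
  download.file(url = url, destfile = tmp, mode = "wb", quiet = TRUE)
  readxl::read_excel(tmp, col_types = col_types)
}

#e.g. water1 <- read_gh_xlsx("https://github.com/plb2018/DATA624/raw/master/project1/Waterflow_Pipe1.xlsx?raw=true",
#                            col_types = c("date","numeric"))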

Part A – ATM Forecast

In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars; other than that it is straightforward. I am being somewhat ambiguous on purpose to make this have a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an Excel file; please provide your written report on your findings, visuals, discussion and your R code via an RPubs link, along with the actual .rmd file. Also please submit the forecast, which you will put in an Excel-readable file.

First we’ll inspect the data to see if we can gain any intuition about what we are working with.

Inspect the Data

We find a few missing values:

* 3 cases where we have no Cash value for ATM1 and 2 cases for ATM2, which we will impute after further investigation
* The entire month of May is missing, as this is the period we are asked to forecast

ATM[!complete.cases(ATM),]
p <- ggplot(ATM[complete.cases(ATM),],aes(x=DATE,y=Cash,color=ATM))+
  geom_line()

p + facet_grid(rows = vars(ATM),scales="free")

p <- ggplot(ATM[complete.cases(ATM),],aes(x=Cash,color=ATM))+
  geom_histogram()

p + facet_grid(cols = vars(ATM),scales="free")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

From plotting the data, we can see that ATM1 and ATM2 look roughly similar in terms of withdrawals and cyclicality / seasonality. ATM3 appears as though it has just come “on-line” and thus has almost no history.

For ATM 1 we see a bi-modal distribution with peaks near 20 and 200 whereas for ATM2 we see a more constant/uniform distribution.

ATM4 has several oddities. First, we see a monstrous outlier value of 10919.76. Given that these values are in 100s, the outlier equates to more than $1 million, which is substantially more than ATMs typically hold - quick googling suggests that most ATMs carry about $10K and the max is around $200K - so something is wrong here. Common sense (based on context) dictates that this outlier is likely an error. I also note that the data from ATM4 are not integers, which is odd as ATMs only dispense bills. I wonder whether this data is from an ATM that dispenses non-USD, yet is reported in USD, or something similar…
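A quick way to confirm that the extreme value stands alone is to list all suspiciously large withdrawals (a small sketch; the 9000 cutoff is an arbitrary threshold well above normal activity):

#list any Cash values above an arbitrary threshold
ATM[!is.na(ATM$Cash) & ATM$Cash > 9000,]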

Check for Seasonality / Cyclicality

We will eliminate the massive outlier from ATM4 and examine the data for any seasonal / cyclical patterns.

#drop the outlier (guarding against the NAs still present in Cash)
ATM2 <- ATM[!is.na(ATM$Cash) & ATM$Cash < 9500,]

#add a weekday col
ATM2$weekday <- factor(weekdays(as.Date(ATM2$DATE)))

#reorder levels
ATM2$weekday <- ordered(ATM2$weekday,levels = c("Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday"))

#drop nas for now
ATM2 <- ATM2[complete.cases(ATM2),]

#rescale ATM4, for plotting
ATM2$Cash[ATM2$ATM == "ATM4"] <- ATM2$Cash[ATM2$ATM == "ATM4"]/5

ggplot(ATM2,aes(x=weekday,y=Cash,color=ATM))+
  geom_boxplot()+
  ggtitle("ATM Cyclicality")

From the above plot we can see that there is a weekly cycle where Thursday appears to have dramatically lower withdrawals than all other days. Note that in this plot, ATM4 has been re-scaled to make the visualization easier on the eyes.

Impute Missing Values

Given that the day-of-week appears to be important, we will use it when imputing missing values. We will impute using the median value for the weekday of each missing data point. As can be seen below, a missing “Thursday” probably warrants a different value than a missing “Saturday”.

medians.by.weekday <- ATM2 %>%
    group_by(weekday,ATM) %>%
    summarise(Cash = median(Cash, na.rm=TRUE)) %>%
    ungroup()
ggplot(medians.by.weekday,aes(x=weekday,y=Cash,color=ATM,group=ATM,fill=ATM))+
  geom_bar(stat="identity",position="dodge")+
  ggtitle("Values For Imputation")

Given that there are so few missing values, we’ll perform the imputation manually. If there were substantially more, or we expected substantially more in the future, we would obviously build a more efficient process to handle this.
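For reference, a more general version of the imputation might look something like the sketch below (impute_weekday_median is a hypothetical helper; it assumes the same DATE/ATM/Cash columns and dplyr):

#hypothetical generalization: fill every missing Cash value with the
#median for its ATM / weekday combination in a single pass
impute_weekday_median <- function(df) {
  df %>%
    mutate(weekday = weekdays(as.Date(DATE))) %>%
    group_by(ATM, weekday) %>%
    mutate(Cash = ifelse(is.na(Cash), median(Cash, na.rm = TRUE), Cash)) %>%
    ungroup()
}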

ATM$weekday <- factor(weekdays(as.Date(ATM$DATE)))

missing.values <- head(ATM[is.na(ATM$Cash) ,],5)

kable(missing.values,caption = "The Missing Values with Weekdays")
The Missing Values with Weekdays
DATE ATM Cash weekday
2009-06-13 ATM1 NA Saturday
2009-06-16 ATM1 NA Tuesday
2009-06-18 ATM2 NA Thursday
2009-06-22 ATM1 NA Monday
2009-06-24 ATM2 NA Wednesday
#grab index for missing values
missing.idx <- head(which(is.na(ATM$Cash)),5)

#manually replace each one
ATM$Cash[missing.idx[1]]<- medians.by.weekday$Cash[(medians.by.weekday$weekday=="Saturday") & (medians.by.weekday$ATM == "ATM1")]

ATM$Cash[missing.idx[2]] <- medians.by.weekday$Cash[(medians.by.weekday$weekday=="Tuesday") & (medians.by.weekday$ATM == "ATM1")]

ATM$Cash[missing.idx[3]] <- medians.by.weekday$Cash[(medians.by.weekday$weekday=="Thursday") & (medians.by.weekday$ATM == "ATM2")]


ATM$Cash[missing.idx[4]] <- medians.by.weekday$Cash[(medians.by.weekday$weekday=="Monday") & (medians.by.weekday$ATM == "ATM1")]

ATM$Cash[missing.idx[5]] <- medians.by.weekday$Cash[(medians.by.weekday$weekday=="Wednesday") & (medians.by.weekday$ATM == "ATM2")]


#also replace the huge ATM4 outlier (it falls on a Tuesday) with the ATM4 Tuesday median
ATM$Cash[!is.na(ATM$Cash) & ATM$Cash > 9000]  <- medians.by.weekday$Cash[(medians.by.weekday$weekday=="Tuesday") & (medians.by.weekday$ATM == "ATM4")]

#check updated values
kable( ATM[missing.idx,],caption = "The Imputed Values")
The Imputed Values
DATE ATM Cash weekday
2009-06-13 ATM1 96 Saturday
2009-06-16 ATM1 99 Tuesday
2009-06-18 ATM2 8 Thursday
2009-06-22 ATM1 85 Monday
2009-06-24 ATM2 28 Wednesday

Split the data into 4 Timeseries, Build Models

The characteristics of the ATMs seem distinct enough that they could be modelled as 4 separate processes. Given more time, I would likely try to tackle this as a more generalized “N-in, 1-out” problem so that I could make use of ALL the ATM data for each prediction, but I have not done that here.

ATM1

ATM1 <- ATM[ATM$ATM == "ATM1",]

ATM1 <- ts(ATM1[c("Cash")])

ggtsdisplay(ATM1, points = FALSE, 
            main = "Activity for ATM1",
            xlab = "Day",
            ylab = "Withdrawls (000's)")

The weekly cycle is quite apparent in all 3 of the above plots, and the relevant period is 7 days (i.e. Thursdays appear to show a major slowdown in activity). We’ll difference using a lag of 7 periods and run some stationarity tests.

ATM1 <- ATM[ATM$ATM == "ATM1",]

ATM1 <- ATM1[complete.cases(ATM1),]

ATM1 <- ts(ATM1[c("Cash")],frequency = 7)

ggtsdisplay(ATM1, 
            main = "Activity for ATM1",
            xlab = "Day",
            ylab = "Withdrawls (000's)")

Box.test(diff(ATM1,lag=7), type = "Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  diff(ATM1, lag = 7)
## X-squared = 9.8085, df = 1, p-value = 0.001737
kpss.test(diff(ATM1,lag=7))
## Warning in kpss.test(diff(ATM1, lag = 7)): p-value greater than printed p-value
## 
##  KPSS Test for Level Stationarity
## 
## data:  diff(ATM1, lag = 7)
## KPSS Level = 0.021721, Truncation lag parameter = 5, p-value = 0.1
ggtsdisplay(diff(ATM1,lag=7), 
            main = "ATM1 - Diff, Lag=7",
            xlab = "Day",
            ylab = "")

Based on the KPSS output the series appears stationary, so we will move on to the modelling part.
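As a cross-check on the amount of differencing required, the unit-root helpers in forecast can be consulted (a quick sketch; based on the KPSS result above we would expect one seasonal difference and no further regular differences):

nsdiffs(ATM1)               #estimated seasonal differences needed at frequency 7
ndiffs(diff(ATM1, lag = 7)) #estimated regular differences after seasonal differencing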

The spike at lag 1 suggests an MA component, and the large spike at lag 7 is consistent with the seasonal component we’ve already discussed. The variability looks reasonably stable, but we’ll apply a Box-Cox transformation anyway just to be sure.

We use auto.arima to search a model-space and return an appropriate model:

lambda = BoxCox.lambda(ATM1)
ATM1.arima <- auto.arima(ATM1,approximation = F,lambda = lambda)

checkresiduals(ATM1.arima)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,0,2)(0,1,1)[7]
## Q* = 10.52, df = 11, p-value = 0.4843
## 
## Model df: 3.   Total lags used: 14
ATM1.arima %>% forecast(h=31) %>% autoplot()

kpss.test(resid(ATM1.arima))
## Warning in kpss.test(resid(ATM1.arima)): p-value greater than printed p-value
## 
##  KPSS Test for Level Stationarity
## 
## data:  resid(ATM1.arima)
## KPSS Level = 0.094343, Truncation lag parameter = 5, p-value = 0.1

The model looks reasonably good, with the residuals approximately normally distributed (albeit with some left skew) and the ACF values all in-bounds. The forecast plot also looks reasonable to me.
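If we wanted a quantitative check on out-of-sample performance, a simple holdout comparison would look something like this sketch (the 31-day test window is an assumption chosen to mirror the forecast horizon):

n     <- length(ATM1)
train <- subset(ATM1, end = n - 31)   #all but the last 31 days
test  <- subset(ATM1, start = n - 30) #the final 31 days
fit   <- auto.arima(train, approximation = FALSE, lambda = BoxCox.lambda(train))
accuracy(forecast(fit, h = 31), test) #compare holdout errors against training errors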

ATM2

Given the similarity between ATM1 & ATM2, we’ll follow more-or-less the same process.

ATM2 <- ATM[ATM$ATM == "ATM2",]


ATM2 <- ATM2[complete.cases(ATM2),]

ATM2 <- ts(ATM2[c("Cash")],frequency = 7)

ggtsdisplay(ATM2, 
            main = "Activity for ATM2",
            xlab = "Day",
            ylab = "Withdrawals (000's)")

Box.test(diff(ATM2,lag=7), type = "Ljung-Box")
kpss.test(diff(ATM2,lag=7))

ggtsdisplay(diff(ATM2,lag=7), 
            main = "ATM2 - Diff, Lag=7",
            xlab = "Day",
            ylab = "")

lambda = BoxCox.lambda(ATM2)
ATM2.arima <- auto.arima(ATM2,approximation = F,lambda = lambda)

checkresiduals(ATM2.arima)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(2,0,2)(0,1,1)[7]
## Q* = 9.987, df = 9, p-value = 0.3515
## 
## Model df: 5.   Total lags used: 14
ATM2.arima %>% forecast(h=31) %>% autoplot()

kpss.test(resid(ATM2.arima))
## Warning in kpss.test(resid(ATM2.arima)): p-value greater than printed p-value
## 
##  KPSS Test for Level Stationarity
## 
## data:  resid(ATM2.arima)
## KPSS Level = 0.071598, Truncation lag parameter = 5, p-value = 0.1

The model for ATM2 looks reasonable as well - residuals appear to be random, the ACF is more-or-less entirely in-bounds, and the distribution of residuals is normal. The model parameters are slightly different from those of ATM1.

ATM3

ATM3 poses a bit of a tricky problem. We have almost no data and thus nothing to model. However, the limited data that we DO have appears reasonably similar to that of ATM1 and ATM2, so rather than having no forecast, or using a fixed value or similar, we will use the mean of ATM1 and ATM2 as a proxy. I feel this is appropriate because we know ATM3 is, in fact, an ATM (which tells us a lot!) and that the limited data we have looks similar to a few other ATMs we know something about.

ATM3 <- (ATM1 + ATM2) /2

ggtsdisplay(ATM3, 
            main = "Activity for ATM3 (Proxy)",
            xlab = "Day",
            ylab = "Withdrawals (000's)")

Box.test(diff(ATM3,lag=7), type = "Ljung-Box")
kpss.test(diff(ATM3,lag=7))

ggtsdisplay(diff(ATM3,lag=7), 
            main = "ATM3 (Proxy) - Diff, Lag=7",
            xlab = "Day",
            ylab = "")

lambda = BoxCox.lambda(ATM3)
ATM3.arima <- auto.arima(ATM3,approximation = F,lambda = lambda)

checkresiduals(ATM3.arima)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(2,0,2)(0,1,1)[7]
## Q* = 10.476, df = 9, p-value = 0.3134
## 
## Model df: 5.   Total lags used: 14
ATM3.arima %>% forecast(h=31) %>% autoplot()

kpss.test(resid(ATM3.arima))
## Warning in kpss.test(resid(ATM3.arima)): p-value greater than printed p-value
## 
##  KPSS Test for Level Stationarity
## 
## data:  resid(ATM3.arima)
## KPSS Level = 0.096755, Truncation lag parameter = 5, p-value = 0.1

ATM4

ATM4 <- ATM[ATM$ATM == "ATM4",]


ATM4 <- ATM4[complete.cases(ATM4),]

ATM4 <- ts(ATM4[c("Cash")],frequency = 7)

ggtsdisplay(ATM4, 
            main = "Activity for ATM4",
            xlab = "Day",
            ylab = "Withdrawals (000's)")

Box.test(diff(ATM4,lag=7), type = "Ljung-Box")
kpss.test(diff(ATM4,lag=7))

ggtsdisplay(diff(ATM4,lag=7), 
            main = "ATM4 - Diff, Lag=7",
            xlab = "Day",
            ylab = "")

lambda = BoxCox.lambda(ATM4)
ATM4.arima <- auto.arima(ATM4,approximation = F,lambda = lambda)

checkresiduals(ATM4.arima)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,0,0)(2,0,0)[7] with non-zero mean
## Q* = 17.788, df = 10, p-value = 0.05864
## 
## Model df: 4.   Total lags used: 14
ATM4.arima %>% forecast(h=31) %>% autoplot()

kpss.test(resid(ATM4.arima))
## Warning in kpss.test(resid(ATM4.arima)): p-value greater than printed p-value
## 
##  KPSS Test for Level Stationarity
## 
## data:  resid(ATM4.arima)
## KPSS Level = 0.048277, Truncation lag parameter = 5, p-value = 0.1

Output the Predictions

Below is a table containing the predictions for each ATM

atm1.f <- ATM1.arima %>% forecast(h=31)
atm2.f <- ATM2.arima %>% forecast(h=31)
atm3.f <- ATM3.arima %>% forecast(h=31)
atm4.f <- ATM4.arima %>% forecast(h=31)

ATM.f <- cbind(atm1.f$mean,atm2.f$mean,atm3.f$mean,atm4.f$mean)

colnames(ATM.f) <- c("ATM1","ATM2","ATM3","ATM4")
row.names(ATM.f) <- as.character(seq.Date(as.Date("2010/5/1"),by="day",length.out=31))

kable(ATM.f)
ATM1 ATM2 ATM3 ATM4
2010-05-01 86.70689 65.570389 76.518060 238.3850
2010-05-02 100.58583 71.420388 87.010739 346.0242
2010-05-03 73.68117 11.665773 43.414625 421.9467
2010-05-04 4.23253 2.506298 4.046689 158.9724
2010-05-05 100.16166 98.076872 99.373513 385.1930
2010-05-06 79.34712 89.044895 85.713901 264.4395
2010-05-07 85.75795 65.813527 76.793954 326.8497
2010-05-08 87.20172 65.595578 76.579836 207.1759
2010-05-09 100.39771 71.435234 87.179046 382.2706
2010-05-10 73.68117 11.649449 43.306440 372.5236
2010-05-11 4.23253 2.505951 4.013465 199.9500
2010-05-12 100.16166 98.106872 99.533873 354.1557
2010-05-13 79.34712 89.033761 85.714920 242.4417
2010-05-14 85.75795 65.794021 76.665286 367.6915
2010-05-15 87.20172 65.612794 76.638523 291.4327
2010-05-16 100.39771 71.445892 87.266385 353.1292
2010-05-17 73.68117 11.638179 43.230970 366.2645
2010-05-18 4.23253 2.505590 4.000651 269.7907
2010-05-19 100.16166 98.127762 99.633279 355.3942
2010-05-20 79.34712 89.026356 85.698752 305.8082
2010-05-21 85.75795 65.780310 76.593109 346.2168
2010-05-22 87.20172 65.624554 76.685456 303.0557
2010-05-23 100.39771 71.453533 87.308673 354.6069
2010-05-24 73.68117 11.630400 43.180981 355.3480
2010-05-25 4.23253 2.505255 3.997501 296.6022
2010-05-26 100.16166 98.142302 99.692670 349.4277
2010-05-27 79.34712 89.021441 85.679321 314.7151
2010-05-28 85.75795 65.770676 76.554380 350.2818
2010-05-29 87.20172 65.632582 76.719640 325.2836
2010-05-30 100.39771 71.459002 87.327043 349.0576
2010-05-31 73.68117 11.625033 43.149243 351.8726
write.csv(ATM.f,"ATM_predictions.csv")
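If a native .xlsx were preferred over a CSV, something like openxlsx::write.xlsx(as.data.frame(ATM.f), "ATM_predictions.xlsx") would also satisfy the Excel-readable requirement (assuming the openxlsx package is installed).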

Part B – Forecasting Power

Part B consists of a simple dataset of residential power usage from January 1998 until December 2013. Your assignment is to model these data and provide a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in kilowatt hours; the rest is straightforward. Add this to your existing files above.

Inspect the data

#quick inspection and check for missing data
kable(tail(power))
CaseSequence YYYY-MMM KWH
919 2013-Jul 8415321
920 2013-Aug 9080226
921 2013-Sep 7968220
922 2013-Oct 5759367
923 2013-Nov 5769083
924 2013-Dec 9606304
kable(power[!complete.cases(power),])
CaseSequence YYYY-MMM KWH
861 2008-Sep NA
power.ts <-ts(power$KWH, start = c(1998, 1), frequency = 12)

autoplot(power.ts )

ggseasonplot(power.ts )

ggsubseriesplot(power.ts )
## Warning: Removed 16 row(s) containing missing values (geom_path).

gglagplot(power.ts )

ggAcf(power.ts )

We see only 1 missing case in the table above (Sep 2008), which we can interpolate using the mean of the other September values. We also see an extreme value in July 2010.

We can see from the time series plot that the data has no major long-term trend and reasonably stable variability, though there is an indication that the amplitude may be increasing post-2010. There is a pronounced seasonality visible in the timeseries plot and, more specifically, the seasonal plot - summer and winter equate to higher power consumption than do fall and spring.

The seasonality is also abundantly clear in the ACF plot.

The seasonality and lag both suggest that we want to use a 12-month lag when evaluating this data. We will also try a Box-Cox transformation, given that the variability appears to be picking up towards the end of the series.
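To make that trend/seasonal structure explicit before modelling, a quick STL decomposition is useful (a small sketch; na.interp from forecast is used here only to bridge the single missing value, which we impute properly below):

#decompose into trend / seasonal / remainder components
power.stl <- stl(power.ts, s.window = "periodic", na.action = forecast::na.interp)
autoplot(power.stl)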

Impute missing value & Winsorize

We’ll fix the missing value; we also considered winsorizing the data to reduce that extreme value from July 2010, though the winsorization is left commented out below.

#find missing idx
idx <- which(!complete.cases(power))

#collect all values from same month
sept.kwh <- power$KWH[seq.int(9, length(power$KWH), 12L)]

#compute mean and inject
power$KWH[idx] <- mean(sept.kwh,na.rm = T)

#considered but not applied (Winsorize is from the DescTools package):
#power$KWH <- DescTools::Winsorize(power$KWH)

#rebuild the TS
power.ts <-ts(power$KWH, start = c(1998, 1), frequency = 12)
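For reference, forecast also provides a seasonally aware, one-line alternative to the manual monthly-mean imputation above (shown commented out, since the value is already imputed at this point):

#power.ts <- na.interp(ts(power$KWH, start = c(1998, 1), frequency = 12))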

More Data Work

power.ts.l <- BoxCox(power.ts, BoxCox.lambda(power.ts))

ggtsdisplay(diff(power.ts.l,lag=12,differences=1), main = "KWH Consumption - Seasonally Differenced (Lag = 12)")

kpss.test(diff(power.ts.l,lag=12,differences=1))
## Warning in kpss.test(diff(power.ts.l, lag = 12, differences = 1)): p-value
## greater than printed p-value
## 
##  KPSS Test for Level Stationarity
## 
## data:  diff(power.ts.l, lag = 12, differences = 1)
## KPSS Level = 0.06921, Truncation lag parameter = 4, p-value = 0.1

The data are now stationary as per the KPSS test. We see lag=12 as a stand-out in both the ACF and PACF.

lambda = BoxCox.lambda(power.ts)
power.arima <- auto.arima(power.ts,approximation = F,lambda = lambda)

checkresiduals(power.arima )

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,1,2)(2,0,0)[12]
## Q* = 20.429, df = 20, p-value = 0.4314
## 
## Model df: 4.   Total lags used: 24
power.arima %>% forecast(h=12) %>% autoplot()

kpss.test(resid(power.arima))
## Warning in kpss.test(resid(power.arima)): p-value greater than printed p-value
## 
##  KPSS Test for Level Stationarity
## 
## data:  resid(power.arima)
## KPSS Level = 0.18848, Truncation lag parameter = 4, p-value = 0.1

Based on the residual plots above, the model seems to be reasonably good! The fly in the ointment seems to be that major outlier month, July 2010. The ACF is mostly in-bounds and the distribution of residuals is mostly normal; however, there is significant left skew as a result of the outlier in the data.
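One way to quantify, and optionally neutralize, that July 2010 outlier is the tsoutliers/tsclean pair from forecast (a sketch only; it was not applied to the submitted forecast):

tsoutliers(power.ts)             #flags the outlying index and a suggested replacement
power.clean <- tsclean(power.ts) #replaces flagged outliers via STL-based interpolation
autoplot(power.clean) + autolayer(power.ts, series = "raw")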

Output predictions

power.f <-power.arima %>% forecast(h=12)

power.f <- data.frame(power.f$mean)

colnames(power.f) <- "KWH"
row.names(power.f) <- as.character(seq.Date(as.Date("2014/1/1"),by="month",length.out=12))

kable(power.f)
KWH
2014-01-01 9054756
2014-02-01 7517891
2014-03-01 6679011
2014-04-01 6289639
2014-05-01 6312917
2014-06-01 7578198
2014-07-01 7998690
2014-08-01 8386987
2014-09-01 7478936
2014-10-01 6186266
2014-11-01 6242122
2014-12-01 7583984
write.csv(power.f,"Power_predictions.csv")

Part C – BONUS, optional (part or all)

Part C consists of two data sets. These are simple 2-column sets, however they have different time stamps. Your optional assignment is to time-base sequence the data and aggregate based on hour (an example of what this looks like follows). Note: for multiple recordings within an hour, take the mean. Then determine whether the data is stationary and whether it can be forecast. If so, provide a week-forward forecast and present results via RPubs and .rmd, and the forecast in an Excel-readable file.

Prepare the data

head(water1)
head(water2)
#join them
water <- rbind(water1,water2)

#sort by timestamp (the Date Time column is already POSIXct)
water <- water[order(water$`Date Time`),]

#aggregate by hour and take the mean...
water <- water %>%
  separate(`Date Time`, into = c("date", "time"), sep = " ") %>%
  separate(time, into = c("hour", "minute", "second"), sep = ":") %>%
  group_by(date, hour) %>%
  summarise(WaterFlow = mean(WaterFlow,na.rm = T)) %>%
  ungroup() %>%
  rename(time = hour)

#create timeseries
water.ts <- ts(water$WaterFlow,frequency=24)
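As a quick sanity check on the aggregation (a small sketch), we can tabulate how many hourly observations each day ended up with; most days should come out close to 24:

#distribution of hourly-observation counts per day
table(table(water$date))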

Inspect the data

autoplot(water.ts)

ggseasonplot(water.ts,polar=T )

ggsubseriesplot(water.ts )

gglagplot(water.ts )

ggAcf(water.ts )

There’s an obvious change in variability at around day 12, which sticks out like a sore thumb. Otherwise, we see that there is no clear seasonal pattern and that the ACF is generally positive out to 48 lags.

On closer inspection, it appears as though the data from the two sets are quite different in terms of magnitude. In order to deal with the apparent jump in variance, I’m going to scale “water1” by the ratio of the standard deviations of water2 and water1.

scale.factor <- sd(water2$WaterFlow)/sd(water1$WaterFlow)

water1$WaterFlow <- water1$WaterFlow * scale.factor

#join them
water <- rbind(water1,water2)

#sort by timestamp (the Date Time column is already POSIXct)
water <- water[order(water$`Date Time`),]

#aggregate by hour and take the mean...
water <- water %>%
  separate(`Date Time`, into = c("date", "time"), sep = " ") %>%
  separate(time, into = c("hour", "minute", "second"), sep = ":") %>%
  group_by(date, hour) %>%
  summarise(WaterFlow = mean(WaterFlow,na.rm = T)) %>%
  ungroup() %>%
  rename(time = hour)

#create timeseries
water.ts <- ts(water$WaterFlow,frequency=24)
autoplot(water.ts)

Transform and inspect

water.ts.l <- BoxCox(water.ts, BoxCox.lambda(water.ts))

ggtsdisplay(diff(water.ts.l,lag=1,differences=1), main = "WaterFlow - 1st Order Differenced (Lag = 1)")

kpss.test(diff(water.ts.l,lag=1,differences=1))
## Warning in kpss.test(diff(water.ts.l, lag = 1, differences = 1)): p-value
## greater than printed p-value
## 
##  KPSS Test for Level Stationarity
## 
## data:  diff(water.ts.l, lag = 1, differences = 1)
## KPSS Level = 0.011026, Truncation lag parameter = 7, p-value = 0.1

We see that after a Box-Cox transform and a 1st-order difference with a lag of one, we get something stationary, where the ACF is mostly in-bounds and the PACF appears to have a well-defined, exponential-looking decay to zero.
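Given that ACF/PACF shape, a low-order candidate model could be compared against whatever auto.arima selects below using AICc (a sketch; the (0,1,1) order is just one plausible reading of the plots, not the chosen model):

fit.manual <- Arima(water.ts, order = c(0, 1, 1), lambda = BoxCox.lambda(water.ts))
fit.manual$aicc #compare against the auto.arima fit's AICc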

Build a Model

lambda = BoxCox.lambda(water.ts)
water.arima <- auto.arima(water.ts,approximation = F,lambda = lambda)

checkresiduals(water.arima )

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,0,0)(1,0,0)[24] with non-zero mean
## Q* = 62.932, df = 46, p-value = 0.04911
## 
## Model df: 2.   Total lags used: 48
water.arima %>% forecast(h=168) %>% autoplot()

kpss.test(resid(water.arima))
## Warning in kpss.test(resid(water.arima)): p-value greater than printed p-value
## 
##  KPSS Test for Level Stationarity
## 
## data:  resid(water.arima)
## KPSS Level = 0.11557, Truncation lag parameter = 7, p-value = 0.1

I had tried this previously without scaling the data from “water1” and got results that looked sub-par (not shown here). Upon scaling the input data (which, admittedly, was quite naive - I’m sure there’s a much better way!) I get results that look reasonable.

The residuals look reasonably normal and the ACF is mostly in-bounds.

Output predictions

water.f <-water.arima %>% forecast(h=168)

water.f <- data.frame(water.f$mean)

colnames(water.f) <- "WaterFlow"
row.names(water.f) <- seq(ymd_hm("2015-12-3 17:00"), ymd_hm("2015-12-10 16:00"), by = "hour")

kable(water.f)
WaterFlow
2015-12-03 17:00:00 38.73527
2015-12-03 18:00:00 39.74757
2015-12-03 19:00:00 40.60960
2015-12-03 20:00:00 38.86769
2015-12-03 21:00:00 40.00868
2015-12-03 22:00:00 39.48159
2015-12-03 23:00:00 37.51949
2015-12-04 00:00:00 39.27028
2015-12-04 01:00:00 41.29809
2015-12-04 02:00:00 39.51524
2015-12-04 03:00:00 40.92247
2015-12-04 04:00:00 40.10006
2015-12-04 05:00:00 40.10572
2015-12-04 06:00:00 37.97261
2015-12-04 07:00:00 38.76030
2015-12-04 08:00:00 36.99283
2015-12-04 09:00:00 37.85080
2015-12-04 10:00:00 39.13071
2015-12-04 11:00:00 41.03850
2015-12-04 12:00:00 38.19784
2015-12-04 13:00:00 40.64643
2015-12-04 14:00:00 39.03807
2015-12-04 15:00:00 38.34283
2015-12-04 16:00:00 40.63772
2015-12-04 17:00:00 38.73654
2015-12-04 18:00:00 38.80984
2015-12-04 19:00:00 38.87193
2015-12-04 20:00:00 38.74616
2015-12-04 21:00:00 38.82868
2015-12-04 22:00:00 38.79062
2015-12-04 23:00:00 38.64794
2015-12-05 00:00:00 38.77534
2015-12-05 01:00:00 38.92130
2015-12-05 02:00:00 38.79306
2015-12-05 03:00:00 38.89439
2015-12-05 04:00:00 38.83526
2015-12-05 05:00:00 38.83567
2015-12-05 06:00:00 38.68104
2015-12-05 07:00:00 38.73836
2015-12-05 08:00:00 38.60936
2015-12-05 09:00:00 38.67215
2015-12-05 10:00:00 38.76523
2015-12-05 11:00:00 38.90271
2015-12-05 12:00:00 38.69746
2015-12-05 13:00:00 38.87457
2015-12-05 14:00:00 38.75851
2015-12-05 15:00:00 38.70801
2015-12-05 16:00:00 38.87395
2015-12-05 17:00:00 38.73664
2015-12-05 18:00:00 38.74196
2015-12-05 19:00:00 38.74646
2015-12-05 20:00:00 38.73734
2015-12-05 21:00:00 38.74333
2015-12-05 22:00:00 38.74056
2015-12-05 23:00:00 38.73020
2015-12-06 00:00:00 38.73945
2015-12-06 01:00:00 38.75005
2015-12-06 02:00:00 38.74074
2015-12-06 03:00:00 38.74809
2015-12-06 04:00:00 38.74380
2015-12-06 05:00:00 38.74383
2015-12-06 06:00:00 38.73261
2015-12-06 07:00:00 38.73677
2015-12-06 08:00:00 38.72740
2015-12-06 09:00:00 38.73196
2015-12-06 10:00:00 38.73872
2015-12-06 11:00:00 38.74870
2015-12-06 12:00:00 38.73380
2015-12-06 13:00:00 38.74666
2015-12-06 14:00:00 38.73823
2015-12-06 15:00:00 38.73457
2015-12-06 16:00:00 38.74661
2015-12-06 17:00:00 38.73664
2015-12-06 18:00:00 38.73703
2015-12-06 19:00:00 38.73736
2015-12-06 20:00:00 38.73669
2015-12-06 21:00:00 38.73713
2015-12-06 22:00:00 38.73693
2015-12-06 23:00:00 38.73618
2015-12-07 00:00:00 38.73685
2015-12-07 01:00:00 38.73762
2015-12-07 02:00:00 38.73694
2015-12-07 03:00:00 38.73748
2015-12-07 04:00:00 38.73716
2015-12-07 05:00:00 38.73717
2015-12-07 06:00:00 38.73635
2015-12-07 07:00:00 38.73665
2015-12-07 08:00:00 38.73597
2015-12-07 09:00:00 38.73630
2015-12-07 10:00:00 38.73679
2015-12-07 11:00:00 38.73752
2015-12-07 12:00:00 38.73644
2015-12-07 13:00:00 38.73737
2015-12-07 14:00:00 38.73676
2015-12-07 15:00:00 38.73649
2015-12-07 16:00:00 38.73737
2015-12-07 17:00:00 38.73664
2015-12-07 18:00:00 38.73667
2015-12-07 19:00:00 38.73670
2015-12-07 20:00:00 38.73665
2015-12-07 21:00:00 38.73668
2015-12-07 22:00:00 38.73666
2015-12-07 23:00:00 38.73661
2015-12-08 00:00:00 38.73666
2015-12-08 01:00:00 38.73671
2015-12-08 02:00:00 38.73667
2015-12-08 03:00:00 38.73670
2015-12-08 04:00:00 38.73668
2015-12-08 05:00:00 38.73668
2015-12-08 06:00:00 38.73662
2015-12-08 07:00:00 38.73664
2015-12-08 08:00:00 38.73660
2015-12-08 09:00:00 38.73662
2015-12-08 10:00:00 38.73666
2015-12-08 11:00:00 38.73671
2015-12-08 12:00:00 38.73663
2015-12-08 13:00:00 38.73670
2015-12-08 14:00:00 38.73665
2015-12-08 15:00:00 38.73663
2015-12-08 16:00:00 38.73670
2015-12-08 17:00:00 38.73664
2015-12-08 18:00:00 38.73665
2015-12-08 19:00:00 38.73665
2015-12-08 20:00:00 38.73664
2015-12-08 21:00:00 38.73665
2015-12-08 22:00:00 38.73665
2015-12-08 23:00:00 38.73664
2015-12-09 00:00:00 38.73665
2015-12-09 01:00:00 38.73665
2015-12-09 02:00:00 38.73665
2015-12-09 03:00:00 38.73665
2015-12-09 04:00:00 38.73665
2015-12-09 05:00:00 38.73665
2015-12-09 06:00:00 38.73664
2015-12-09 07:00:00 38.73664
2015-12-09 08:00:00 38.73664
2015-12-09 09:00:00 38.73664
2015-12-09 10:00:00 38.73665
2015-12-09 11:00:00 38.73665
2015-12-09 12:00:00 38.73664
2015-12-09 13:00:00 38.73665
2015-12-09 14:00:00 38.73664
2015-12-09 15:00:00 38.73664
2015-12-09 16:00:00 38.73665
2015-12-09 17:00:00 38.73664
2015-12-09 18:00:00 38.73664
2015-12-09 19:00:00 38.73664
2015-12-09 20:00:00 38.73664
2015-12-09 21:00:00 38.73664
2015-12-09 22:00:00 38.73664
2015-12-09 23:00:00 38.73664
2015-12-10 00:00:00 38.73664
2015-12-10 01:00:00 38.73664
2015-12-10 02:00:00 38.73664
2015-12-10 03:00:00 38.73664
2015-12-10 04:00:00 38.73664
2015-12-10 05:00:00 38.73664
2015-12-10 06:00:00 38.73664
2015-12-10 07:00:00 38.73664
2015-12-10 08:00:00 38.73664
2015-12-10 09:00:00 38.73664
2015-12-10 10:00:00 38.73664
2015-12-10 11:00:00 38.73664
2015-12-10 12:00:00 38.73664
2015-12-10 13:00:00 38.73664
2015-12-10 14:00:00 38.73664
2015-12-10 15:00:00 38.73664
2015-12-10 16:00:00 38.73664
write.csv(water.f,"Water_predictions.csv")