Project 1

The RPubs version of this work can be found here, and the source/data can be found on GitHub here.

Description

This project consists of three parts, two required and one bonus, and is worth 15% of your grade. The project is due at 11:59 PM on Sunday, March 31. I will accept late submissions, with a penalty, until the meetup after that, when we review some projects.

rm(list = ls())

library(knitr)
library(ggplot2)
library(tidyr)
library(dplyr)
library(tseries)
library(forecast)
library(lubridate)
library(tidyverse)

Load Data

Load all the data for all three parts from GitHub in the interest of reproducibility.

temp_file <- tempfile(fileext = ".xlsx")

download.file(url = "https://github.com/plb2018/DATA624/raw/master/project1/ATM624Data.xlsx?raw=true", 
              destfile = temp_file, 
              mode = "wb", 
              quiet = TRUE)

ATM <- readxl::read_excel(temp_file,skip=0,col_types = c("date","text","numeric"))


download.file(url = "https://github.com/plb2018/DATA624/raw/master/project1/ResidentialCustomerForecastLoad-624.xlsx?raw=true", 
              destfile = temp_file, 
              mode = "wb", 
              quiet = TRUE)

power <- readxl::read_excel(temp_file,skip=0,col_types = c("numeric","text","numeric"))




download.file(url = "https://github.com/plb2018/DATA624/raw/master/project1/Waterflow_Pipe1.xlsx?raw=true", 
              destfile = temp_file, 
              mode = "wb", 
              quiet = TRUE)

water1 <- readxl::read_excel(temp_file,skip=0,col_types = c("date","numeric"))

download.file(url = "https://github.com/plb2018/DATA624/raw/master/project1/Waterflow_Pipe2.xlsx?raw=true", 
              destfile = temp_file, 
              mode = "wb", 
              quiet = TRUE)

water2 <- readxl::read_excel(temp_file,skip=0,col_types = c("date","numeric"))
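Since the same download-and-read pattern repeats four times above, it could be wrapped in a small helper. Below is a minimal sketch (read_gh_xlsx is a hypothetical name, not a function from any package):

#helper: download an xlsx to a temp file and read it with the given column types
read_gh_xlsx <- function(url, col_types) {
  tmp <- tempfile(fileext = ".xlsx")
  download.file(url = url, destfile = tmp, mode = "wb", quiet = TRUE)
  readxl::read_excel(tmp, col_types = col_types)
}

#e.g. water1 <- read_gh_xlsx("https://github.com/plb2018/DATA624/raw/master/project1/Waterflow_Pipe1.xlsx?raw=true",
#                            col_types = c("date","numeric"))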

Part A – ATM Forecast

In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars; other than that it is straightforward. I am being somewhat ambiguous on purpose to make this have a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an Excel file; please provide your written report on your findings, visuals, discussion and your R code via an RPubs link, along with the actual .rmd file. Also please submit the forecast, which you will put in an Excel-readable file.

First we’ll inspect the data to see if we can gain any intuition about what we are working with.

Inspect the Data

We find a few missing values:

* 3 cases where we have no Cash value for ATM1 and 2 cases for ATM2, which we will impute after further investigation
* The entire month of May is missing, as this is the period we are asked to forecast

ATM[!complete.cases(ATM),]
p <- ggplot(ATM[complete.cases(ATM),],aes(x=DATE,y=Cash,color=ATM))+
  geom_line()

p + facet_grid(rows = vars(ATM),scales="free")

p <- ggplot(ATM[complete.cases(ATM),],aes(x=Cash,color=ATM))+
  geom_histogram()

p + facet_grid(cols = vars(ATM),scales="free")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

From plotting the data, we can see that ATM1 and ATM2 look roughly similar in terms of withdrawals and cyclicality / seasonality. ATM3 appears as though it has just come “on-line” and thus has almost no history.

For ATM 1 we see a bi-modal distribution with peaks near 20 and 200 whereas for ATM2 we see a more constant/uniform distribution.

ATM4 has several oddities. First, we see a monstrous outlier value of 10919.76. Given that these values are in 100s, the outlier equates to more than $1 million, which is substantially more than ATMs typically hold - quick googling suggests that most ATMs carry about $10K and the max is around $200K - so something is wrong here. Common sense (based on context) dictates that this outlier is likely an error. I also note that the data from ATM4 are not integers, which is odd as ATMs only dispense bills. I wonder whether this data is from an ATM that dispenses non-USD, yet is reported in USD, or something similar…
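A quick way to confirm that the extreme value stands alone is to list all suspiciously large withdrawals (a small sketch; the 9000 cutoff is an arbitrary threshold well above normal activity):

#list any Cash values above an arbitrary threshold
ATM[!is.na(ATM$Cash) & ATM$Cash > 9000,]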

Check for Seasonality / Cyclicality

We will eliminate the massive outlier from ATM4 and examine the data for any seasonal / cyclical patterns.

#drop the outlier (guarding against the NAs still present in Cash)
ATM2 <- ATM[!is.na(ATM$Cash) & ATM$Cash < 9500,]

#add a weekday col
ATM2$weekday <- factor(weekdays(as.Date(ATM2$DATE)))

#reorder levels
ATM2$weekday <- ordered(ATM2$weekday,levels = c("Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday"))

#drop nas for now
ATM2 <- ATM2[complete.cases(ATM2),]

#rescale ATM4, for plotting
ATM2$Cash[ATM2$ATM == "ATM4"] <- ATM2$Cash[ATM2$ATM == "ATM4"]/5

ggplot(ATM2,aes(x=weekday,y=Cash,color=ATM))+
  geom_boxplot()+
  ggtitle("ATM Cyclicality")

From the above plot we can see that there is a weekly cycle where Thursday appears to have dramatically lower withdrawals than all other days. Note that in this plot, ATM4 has been re-scaled to make the visualization easier on the eyes.

Impute Missing Values

Given that the day-of-week appears to be important, we will use it when imputing missing values. We will impute using the median value for the weekday of each missing data point. As can be seen below, a missing “Thursday” probably warrants a different value than a missing “Saturday”.

medians.by.weekday <- ATM2 %>%
    group_by(weekday,ATM) %>%
    summarise(Cash = median(Cash, na.rm=TRUE)) %>%
    ungroup()
ggplot(medians.by.weekday,aes(x=weekday,y=Cash,color=ATM,group=ATM,fill=ATM))+
  geom_bar(stat="identity",position="dodge")+
  ggtitle("Values For Imputation")

Given that there are so few missing values, we’ll perform the imputation manually. If there were substantially more, or we expected substantially more in the future, we would obviously build a more efficient process to handle this.
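For reference, a more general version of the imputation might look something like the sketch below (impute_weekday_median is a hypothetical helper; it assumes the same DATE/ATM/Cash columns and dplyr):

#hypothetical generalization: fill every missing Cash value with the
#median for its ATM / weekday combination in a single pass
impute_weekday_median <- function(df) {
  df %>%
    mutate(weekday = weekdays(as.Date(DATE))) %>%
    group_by(ATM, weekday) %>%
    mutate(Cash = ifelse(is.na(Cash), median(Cash, na.rm = TRUE), Cash)) %>%
    ungroup()
}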

ATM$weekday <- factor(weekdays(as.Date(ATM$DATE)))

missing.values <- head(ATM[is.na(ATM$Cash) ,],5)

kable(missing.values,caption = "The Missing Values with Weekdays")
The Missing Values with Weekdays
DATE ATM Cash weekday
2009-06-13 ATM1 NA Saturday
2009-06-16 ATM1 NA Tuesday
2009-06-18 ATM2 NA Thursday
2009-06-22 ATM1 NA Monday
2009-06-24 ATM2 NA Wednesday
#grab index for missing values
missing.idx <- head(which(is.na(ATM$Cash)),5)

#manually replace each one
ATM$Cash[missing.idx[1]]<- medians.by.weekday$Cash[(medians.by.weekday$weekday=="Saturday") & (medians.by.weekday$ATM == "ATM1")]

ATM$Cash[missing.idx[2]] <- medians.by.weekday$Cash[(medians.by.weekday$weekday=="Tuesday") & (medians.by.weekday$ATM == "ATM1")]

ATM$Cash[missing.idx[3]] <- medians.by.weekday$Cash[(medians.by.weekday$weekday=="Thursday") & (medians.by.weekday$ATM == "ATM2")]


ATM$Cash[missing.idx[4]] <- medians.by.weekday$Cash[(medians.by.weekday$weekday=="Monday") & (medians.by.weekday$ATM == "ATM1")]

ATM$Cash[missing.idx[5]] <- medians.by.weekday$Cash[(medians.by.weekday$weekday=="Wednesday") & (medians.by.weekday$ATM == "ATM2")]


#also replace the huge ATM4 outlier (it falls on a Tuesday) with the ATM4 Tuesday median
ATM$Cash[!is.na(ATM$Cash) & ATM$Cash > 9000]  <- medians.by.weekday$Cash[(medians.by.weekday$weekday=="Tuesday") & (medians.by.weekday$ATM == "ATM4")]

#check updated values
kable( ATM[missing.idx,],caption = "The Imputed Values")
The Imputed Values
DATE ATM Cash weekday
2009-06-13 ATM1 96 Saturday
2009-06-16 ATM1 99 Tuesday
2009-06-18 ATM2 8 Thursday
2009-06-22 ATM1 85 Monday
2009-06-24 ATM2 28 Wednesday

Split the data into 4 Timeseries, Build Models

The characteristics of the ATMs seem distinct enough that they could be modelled as 4 separate processes. Given more time, I would likely try to tackle this as a more generalized “N-in, 1-out” problem so that I could make use of ALL the ATM data for each prediction, but I have not done that here.

ATM1

ATM1 <- ATM[ATM$ATM == "ATM1",]

ATM1 <- ts(ATM1[c("Cash")])

ggtsdisplay(ATM1, points = FALSE, 
            main = "Activity for ATM1",
            xlab = "Day",
            ylab = "Withdrawls (000's)")

The weekly cycle is quite apparent in all 3 of the above plots, and the relevant period is 7 days (i.e. Thursdays appear to show a major slowdown in activity). We’ll difference using a lag of 7 periods and run some stationarity tests.

ATM1 <- ATM[ATM$ATM == "ATM1",]

ATM1 <- ATM1[complete.cases(ATM1),]

ATM1 <- ts(ATM1[c("Cash")],frequency = 7)

ggtsdisplay(ATM1, 
            main = "Activity for ATM1",
            xlab = "Day",
            ylab = "Withdrawls (000's)")

Box.test(diff(ATM1,lag=7), type = "Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  diff(ATM1, lag = 7)
## X-squared = 9.8085, df = 1, p-value = 0.001737
kpss.test(diff(ATM1,lag=7))
## Warning in kpss.test(diff(ATM1, lag = 7)): p-value greater than printed p-value
## 
##  KPSS Test for Level Stationarity
## 
## data:  diff(ATM1, lag = 7)
## KPSS Level = 0.021721, Truncation lag parameter = 5, p-value = 0.1
ggtsdisplay(diff(ATM1,lag=7), 
            main = "ATM1 - Diff, Lag=7",
            xlab = "Day",
            ylab = "")

Based on the KPSS output the series appears stationary, so we will move on to the modelling part.
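As a cross-check on the amount of differencing required, the unit-root helpers in forecast can be consulted (a quick sketch; based on the KPSS result above we would expect one seasonal difference and no further regular differences):

nsdiffs(ATM1)               #estimated seasonal differences needed at frequency 7
ndiffs(diff(ATM1, lag = 7)) #estimated regular differences after seasonal differencing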

The spike at lag 1 suggests an MA component, and the large spike at lag 7 is consistent with the seasonal component we’ve already discussed. The variability looks reasonably stable, but we’ll apply a Box-Cox transformation anyway just to be sure.

We use auto.arima to search a model-space and return an appropriate model:

lambda = BoxCox.lambda(ATM1)
ATM1.arima <- auto.arima(ATM1,approximation = F,lambda = lambda)

checkresiduals(ATM1.arima)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,0,2)(0,1,1)[7]
## Q* = 10.52, df = 11, p-value = 0.4843
## 
## Model df: 3.   Total lags used: 14
ATM1.arima %>% forecast(h=31) %>% autoplot()

kpss.test(resid(ATM1.arima))
## Warning in kpss.test(resid(ATM1.arima)): p-value greater than printed p-value
## 
##  KPSS Test for Level Stationarity
## 
## data:  resid(ATM1.arima)
## KPSS Level = 0.094343, Truncation lag parameter = 5, p-value = 0.1

The model looks reasonably good, with the residuals approximately normally distributed (albeit with some left skew) and the ACF values all in-bounds. The forecast plot also looks reasonable to me.
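If we wanted a quantitative check on out-of-sample performance, a simple holdout comparison would look something like this sketch (the 31-day test window is an assumption chosen to mirror the forecast horizon):

n     <- length(ATM1)
train <- subset(ATM1, end = n - 31)   #all but the last 31 days
test  <- subset(ATM1, start = n - 30) #the final 31 days
fit   <- auto.arima(train, approximation = FALSE, lambda = BoxCox.lambda(train))
accuracy(forecast(fit, h = 31), test) #compare holdout errors against training errors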

ATM2

Given the similarity between ATM1 & ATM2, we’ll follow more-or-less the same process.

ATM2 <- ATM[ATM$ATM == "ATM2",]


ATM2 <- ATM2[complete.cases(ATM2),]

ATM2 <- ts(ATM2[c("Cash")],frequency = 7)

ggtsdisplay(ATM2, 
            main = "Activity for ATM2",
            xlab = "Day",
            ylab = "Withdrawals (000's)")

Box.test(diff(ATM2,lag=7), type = "Ljung-Box")
kpss.test(diff(ATM2,lag=7))

ggtsdisplay(diff(ATM2,lag=7), 
            main = "ATM2 - Diff, Lag=7",
            xlab = "Day",
            ylab = "")

lambda = BoxCox.lambda(ATM2)
ATM2.arima <- auto.arima(ATM2,approximation = F,lambda = lambda)

checkresiduals(ATM2.arima)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(2,0,2)(0,1,1)[7]
## Q* = 9.987, df = 9, p-value = 0.3515
## 
## Model df: 5.   Total lags used: 14
ATM2.arima %>% forecast(h=31) %>% autoplot()

kpss.test(resid(ATM2.arima))
## Warning in kpss.test(resid(ATM2.arima)): p-value greater than printed p-value
## 
##  KPSS Test for Level Stationarity
## 
## data:  resid(ATM2.arima)
## KPSS Level = 0.071598, Truncation lag parameter = 5, p-value = 0.1

The model for ATM2 looks reasonable as well - residuals appear to be random, the ACF is more-or-less entirely in-bounds, and the distribution of residuals is normal. The model parameters are slightly different from those of ATM1.

ATM3

ATM3 poses a bit of a tricky problem. We have almost no data and thus nothing to model. However, the limited data that we DO have appears reasonably similar to that of ATM1 and ATM2, so rather than having no forecast, or using a fixed value or similar, we will use the mean of ATM1 and ATM2 as a proxy. I feel this is appropriate because we know ATM3 is, in fact, an ATM (which tells us a lot!) and that the limited data we have looks similar to a few other ATMs we know something about.

ATM3 <- (ATM1 + ATM2) /2

ggtsdisplay(ATM3, 
            main = "Activity for ATM3 (Proxy)",
            xlab = "Day",
            ylab = "Withdrawals (000's)")

Box.test(diff(ATM3,lag=7), type = "Ljung-Box")
kpss.test(diff(ATM3,lag=7))

ggtsdisplay(diff(ATM3,lag=7), 
            main = "ATM3 (Proxy) - Diff, Lag=7",
            xlab = "Day",
            ylab = "")

lambda = BoxCox.lambda(ATM3)
ATM3.arima <- auto.arima(ATM3,approximation = F,lambda = lambda)

checkresiduals(ATM3.arima)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(2,0,2)(0,1,1)[7]
## Q* = 10.476, df = 9, p-value = 0.3134
## 
## Model df: 5.   Total lags used: 14
ATM3.arima %>% forecast(h=31) %>% autoplot()

kpss.test(resid(ATM3.arima))
## Warning in kpss.test(resid(ATM3.arima)): p-value greater than printed p-value
## 
##  KPSS Test for Level Stationarity
## 
## data:  resid(ATM3.arima)
## KPSS Level = 0.096755, Truncation lag parameter = 5, p-value = 0.1

ATM4

ATM4 <- ATM[ATM$ATM == "ATM4",]


ATM4 <- ATM4[complete.cases(ATM4),]

ATM4 <- ts(ATM4[c("Cash")],frequency = 7)

ggtsdisplay(ATM4, 
            main = "Activity for ATM4",
            xlab = "Day",
            ylab = "Withdrawals (000's)")

Box.test(diff(ATM4,lag=7), type = "Ljung-Box")
kpss.test(diff(ATM4,lag=7))

ggtsdisplay(diff(ATM4,lag=7), 
            main = "ATM4 - Diff, Lag=7",
            xlab = "Day",
            ylab = "")

lambda = BoxCox.lambda(ATM4)
ATM4.arima <- auto.arima(ATM4,approximation = F,lambda = lambda)

checkresiduals(ATM4.arima)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,0,0)(2,0,0)[7] with non-zero mean
## Q* = 17.788, df = 10, p-value = 0.05864
## 
## Model df: 4.   Total lags used: 14
ATM4.arima %>% forecast(h=31) %>% autoplot()

kpss.test(resid(ATM4.arima))
## Warning in kpss.test(resid(ATM4.arima)): p-value greater than printed p-value
## 
##  KPSS Test for Level Stationarity
## 
## data:  resid(ATM4.arima)
## KPSS Level = 0.048277, Truncation lag parameter = 5, p-value = 0.1

Output the Predictions

Below is a table containing the predictions for each ATM

atm1.f <- ATM1.arima %>% forecast(h=31)
atm2.f <- ATM2.arima %>% forecast(h=31)
atm3.f <- ATM3.arima %>% forecast(h=31)
atm4.f <- ATM4.arima %>% forecast(h=31)

ATM.f <- cbind(atm1.f$mean,atm2.f$mean,atm3.f$mean,atm4.f$mean)

colnames(ATM.f) <- c("ATM1","ATM2","ATM3","ATM4")
row.names(ATM.f) <- as.character(seq.Date(as.Date("2010/5/1"),by="day",length.out=31))

kable(ATM.f)
ATM1 ATM2 ATM3 ATM4
2010-05-01 86.70689 65.570389 76.518060 238.3850
2010-05-02 100.58583 71.420388 87.010739 346.0242
2010-05-03 73.68117 11.665773 43.414625 421.9467
2010-05-04 4.23253 2.506298 4.046689 158.9724
2010-05-05 100.16166 98.076872 99.373513 385.1930
2010-05-06 79.34712 89.044895 85.713901 264.4395
2010-05-07 85.75795 65.813527 76.793954 326.8497
2010-05-08 87.20172 65.595578 76.579836 207.1759
2010-05-09 100.39771 71.435234 87.179046 382.2706
2010-05-10 73.68117 11.649449 43.306440 372.5236
2010-05-11 4.23253 2.505951 4.013465 199.9500
2010-05-12 100.16166 98.106872 99.533873 354.1557
2010-05-13 79.34712 89.033761 85.714920 242.4417
2010-05-14 85.75795 65.794021 76.665286 367.6915
2010-05-15 87.20172 65.612794 76.638523 291.4327
2010-05-16 100.39771 71.445892 87.266385 353.1292
2010-05-17 73.68117 11.638179 43.230970 366.2645
2010-05-18 4.23253 2.505590 4.000651 269.7907
2010-05-19 100.16166 98.127762 99.633279 355.3942
2010-05-20 79.34712 89.026356 85.698752 305.8082
2010-05-21 85.75795 65.780310 76.593109 346.2168
2010-05-22 87.20172 65.624554 76.685456 303.0557
2010-05-23 100.39771 71.453533 87.308673 354.6069
2010-05-24 73.68117 11.630400 43.180981 355.3480
2010-05-25 4.23253 2.505255 3.997501 296.6022
2010-05-26 100.16166 98.142302 99.692670 349.4277
2010-05-27 79.34712 89.021441 85.679321 314.7151
2010-05-28 85.75795 65.770676 76.554380 350.2818
2010-05-29 87.20172 65.632582 76.719640 325.2836
2010-05-30 100.39771 71.459002 87.327043 349.0576
2010-05-31 73.68117 11.625033 43.149243 351.8726
write.csv(ATM.f,"ATM_predictions.csv")
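If a native .xlsx were preferred over a CSV, something like openxlsx::write.xlsx(as.data.frame(ATM.f), "ATM_predictions.xlsx") would also satisfy the Excel-readable requirement (assuming the openxlsx package is installed).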

Part B – Forecasting Power

Part B consists of a simple dataset of residential power usage from January 1998 until December 2013. Your assignment is to model these data and provide a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in kilowatt hours; the rest is straightforward. Add this to your existing files above.

Inspect the data

#quick inspection and check for missing data
kable(tail(power))
CaseSequence YYYY-MMM KWH
919 2013-Jul 8415321
920 2013-Aug 9080226
921 2013-Sep 7968220
922 2013-Oct 5759367
923 2013-Nov 5769083
924 2013-Dec 9606304
kable(power[!complete.cases(power),])
CaseSequence YYYY-MMM KWH
861 2008-Sep NA
power.ts <-ts(power$KWH, start = c(1998, 1), frequency = 12)

autoplot(power.ts )

ggseasonplot(power.ts )

ggsubseriesplot(power.ts )
## Warning: Removed 16 row(s) containing missing values (geom_path).

gglagplot(power.ts )

ggAcf(power.ts )

We see only 1 missing case in the table above (Sep 2008), which we can interpolate using the mean of the other September values. We also see an extreme value in July 2010.

We can see from the time series plot that the data has no major long-term trend and reasonably stable variability, though there is an indication that the amplitude may be increasing post-2010. There is a pronounced seasonality visible in the timeseries plot and, more specifically, the seasonal plot - summer and winter equate to higher power consumption than do fall and spring.

The seasonality is also abundantly clear in the ACF plot.

The seasonality and lag both suggest that we want to use a 12-month lag when evaluating this data. We will also try a Box-Cox transformation, given that the variability appears to be picking up towards the end of the series.
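To make that trend/seasonal structure explicit before modelling, a quick STL decomposition is useful (a small sketch; na.interp from forecast is used here only to bridge the single missing value, which we impute properly below):

#decompose into trend / seasonal / remainder components
power.stl <- stl(power.ts, s.window = "periodic", na.action = forecast::na.interp)
autoplot(power.stl)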

Impute missing value & Winsorize

We’ll fix the missing value; we also considered winsorizing the data to reduce that extreme value from July 2010, though the winsorization is left commented out below.

#find missing idx
idx <- which(!complete.cases(power))

#collect all values from same month
sept.kwh <- power$KWH[seq.int(9, length(power$KWH), 12L)]

#compute mean and inject
power$KWH[idx] <- mean(sept.kwh,na.rm = T)

#considered but not applied (Winsorize is from the DescTools package):
#power$KWH <- DescTools::Winsorize(power$KWH)

#rebuild the TS
power.ts <-ts(power$KWH, start = c(1998, 1), frequency = 12)
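For reference, forecast also provides a seasonally aware, one-line alternative to the manual monthly-mean imputation above (shown commented out, since the value is already imputed at this point):

#power.ts <- na.interp(ts(power$KWH, start = c(1998, 1), frequency = 12))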

More Data Work

power.ts.l <- BoxCox(power.ts, BoxCox.lambda(power.ts))

ggtsdisplay(diff(power.ts.l,lag=12,differences=1), main = "KWH Consumption - Seasonally Differenced (Lag = 12)")

kpss.test(diff(power.ts.l,lag=12,differences=1))
## Warning in kpss.test(diff(power.ts.l, lag = 12, differences = 1)): p-value
## greater than printed p-value
## 
##  KPSS Test for Level Stationarity
## 
## data:  diff(power.ts.l, lag = 12, differences = 1)
## KPSS Level = 0.06921, Truncation lag parameter = 4, p-value = 0.1

The data are now stationary as per the KPSS test. We see lag=12 as a stand-out in both the ACF and PACF.

lambda = BoxCox.lambda(power.ts)
power.arima <- auto.arima(power.ts,approximation = F,lambda = lambda)

checkresiduals(power.arima )

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,1,2)(2,0,0)[12]
## Q* = 20.429, df = 20, p-value = 0.4314
## 
## Model df: 4.   Total lags used: 24
power.arima %>% forecast(h=12) %>% autoplot()

kpss.test(resid(power.arima))
## Warning in kpss.test(resid(power.arima)): p-value greater than printed p-value
## 
##  KPSS Test for Level Stationarity
## 
## data:  resid(power.arima)
## KPSS Level = 0.18848, Truncation lag parameter = 4, p-value = 0.1

Based on the residual plots above, the model seems to be reasonably good! The fly in the ointment seems to be that major outlier month, July 2010. The ACF is mostly in-bounds and the distribution of residuals is mostly normal; however, there is significant left skew as a result of the outlier in the data.
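One way to quantify, and optionally neutralize, that July 2010 outlier is the tsoutliers/tsclean pair from forecast (a sketch only; it was not applied to the submitted forecast):

tsoutliers(power.ts)             #flags the outlying index and a suggested replacement
power.clean <- tsclean(power.ts) #replaces flagged outliers via STL-based interpolation
autoplot(power.clean) + autolayer(power.ts, series = "raw")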

Output predictions

power.f <-power.arima %>% forecast(h=12)

power.f <- data.frame(power.f$mean)

colnames(power.f) <- "KWH"
row.names(power.f) <- as.character(seq.Date(as.Date("2014/1/1"),by="month",length.out=12))

kable(power.f)
KWH
2014-01-01 9054756
2014-02-01 7517891
2014-03-01 6679011
2014-04-01 6289639
2014-05-01 6312917
2014-06-01 7578198
2014-07-01 7998690
2014-08-01 8386987
2014-09-01 7478936
2014-10-01 6186266
2014-11-01 6242122
2014-12-01 7583984
write.csv(power.f,"Power_predictions.csv")

Part C – BONUS, optional (part or all)

Part C consists of two data sets. These are simple 2-column sets, however they have different time stamps. Your optional assignment is to time-base sequence the data and aggregate based on hour (an example of what this looks like follows). Note: for multiple recordings within an hour, take the mean. Then determine whether the data is stationary and whether it can be forecast. If so, provide a week-forward forecast and present results via RPubs and .rmd, and the forecast in an Excel-readable file.

Prepare the data

head(water1)
head(water2)
#join them
water <- rbind(water1,water2)

#sort by timestamp (the Date Time column is already POSIXct)
water <- water[order(water$`Date Time`),]

#aggregate by hour and take the mean...
water <- water %>%
  separate(`Date Time`, into = c("date", "time"), sep = " ") %>%
  separate(time, into = c("hour", "minute", "second"), sep = ":") %>%
  group_by(date, hour) %>%
  summarise(WaterFlow = mean(WaterFlow,na.rm = T)) %>%
  ungroup() %>%
  rename(time = hour)

#create timeseries
water.ts <- ts(water$WaterFlow,frequency=24)
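As a quick sanity check on the aggregation (a small sketch), we can tabulate how many hourly observations each day ended up with; most days should come out close to 24:

#distribution of hourly-observation counts per day
table(table(water$date))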

Inspect the data

autoplot(water.ts)

ggseasonplot(water.ts,polar=T )

ggsubseriesplot(water.ts )

gglagplot(water.ts )

ggAcf(water.ts )

There’s an obvious change in variability at around day 12, which sticks out like a sore thumb. Otherwise, we see that there is no clear seasonal pattern and that the ACF is generally positive out to 48 lags.

On closer inspection, it appears as though the data from the two sets are quite different in terms of magnitude. In order to deal with the apparent jump in variance, I’m going to scale “water1” by the ratio of the standard deviations of water2 and water1.

scale.factor <- sd(water2$WaterFlow)/sd(water1$WaterFlow)

water1$WaterFlow <- water1$WaterFlow * scale.factor

#join them
water <- rbind(water1,water2)

#sort by timestamp (the Date Time column is already POSIXct)
water <- water[order(water$`Date Time`),]

#aggregate by hour and take the mean...
water <- water %>%
  separate(`Date Time`, into = c("date", "time"), sep = " ") %>%
  separate(time, into = c("hour", "minute", "second"), sep = ":") %>%
  group_by(date, hour) %>%
  summarise(WaterFlow = mean(WaterFlow,na.rm = T)) %>%
  ungroup() %>%
  rename(time = hour)

#create timeseries
water.ts <- ts(water$WaterFlow,frequency=24)
autoplot(water.ts)

Transform and inspect

water.ts.l <- BoxCox(water.ts, BoxCox.lambda(water.ts))

ggtsdisplay(diff(water.ts.l,lag=1,differences=1), main = "WaterFlow - 1st Order Differenced (Lag = 1)")

kpss.test(diff(water.ts.l,lag=1,differences=1))
## Warning in kpss.test(diff(water.ts.l, lag = 1, differences = 1)): p-value
## greater than printed p-value
## 
##  KPSS Test for Level Stationarity
## 
## data:  diff(water.ts.l, lag = 1, differences = 1)
## KPSS Level = 0.011026, Truncation lag parameter = 7, p-value = 0.1

We see that after a Box-Cox transform and a 1st-order difference with a lag of one, we get something stationary, where the ACF is mostly in-bounds and the PACF appears to have a well-defined, exponential-looking decay to zero.
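Given that ACF/PACF shape, a low-order candidate model could be compared against whatever auto.arima selects below using AICc (a sketch; the (0,1,1) order is just one plausible reading of the plots, not the chosen model):

fit.manual <- Arima(water.ts, order = c(0, 1, 1), lambda = BoxCox.lambda(water.ts))
fit.manual$aicc #compare against the auto.arima fit's AICc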

Build a Model

lambda = BoxCox.lambda(water.ts)
water.arima <- auto.arima(water.ts,approximation = F,lambda = lambda)

checkresiduals(water.arima )

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,0,0)(1,0,0)[24] with non-zero mean
## Q* = 62.932, df = 46, p-value = 0.04911
## 
## Model df: 2.   Total lags used: 48
water.arima %>% forecast(h=168) %>% autoplot()

kpss.test(resid(water.arima))
## Warning in kpss.test(resid(water.arima)): p-value greater than printed p-value
## 
##  KPSS Test for Level Stationarity
## 
## data:  resid(water.arima)
## KPSS Level = 0.11557, Truncation lag parameter = 7, p-value = 0.1

I had tried this previously without scaling the data from “water1” and got results that looked sub-par (not shown here). Upon scaling the input data (which, admittedly, was quite naive - I’m sure there’s a much better way!) I get results that look reasonable.

The residuals look reasonably normal and the ACF is mostly in-bounds.

Output predictions

water.f <-water.arima %>% forecast(h=168)

water.f <- data.frame(water.f$mean)

colnames(water.f) <- "WaterFlow"
row.names(water.f) <- seq(ymd_hm("2015-12-3 17:00"), ymd_hm("2015-12-10 16:00"), by = "hour")

kable(water.f)
WaterFlow
2015-12-03 17:00:00 38.73527
2015-12-03 18:00:00 39.74757
2015-12-03 19:00:00 40.60960
2015-12-03 20:00:00 38.86769
2015-12-03 21:00:00 40.00868
2015-12-03 22:00:00 39.48159
2015-12-03 23:00:00 37.51949
2015-12-04 00:00:00 39.27028
2015-12-04 01:00:00 41.29809
2015-12-04 02:00:00 39.51524
2015-12-04 03:00:00 40.92247
2015-12-04 04:00:00 40.10006
2015-12-04 05:00:00 40.10572
2015-12-04 06:00:00 37.97261
2015-12-04 07:00:00 38.76030
2015-12-04 08:00:00 36.99283
2015-12-04 09:00:00 37.85080
2015-12-04 10:00:00 39.13071
2015-12-04 11:00:00 41.03850
2015-12-04 12:00:00 38.19784
2015-12-04 13:00:00 40.64643
2015-12-04 14:00:00 39.03807
2015-12-04 15:00:00 38.34283
2015-12-04 16:00:00 40.63772
2015-12-04 17:00:00 38.73654
2015-12-04 18:00:00 38.80984
2015-12-04 19:00:00 38.87193
2015-12-04 20:00:00 38.74616
2015-12-04 21:00:00 38.82868
2015-12-04 22:00:00 38.79062
2015-12-04 23:00:00 38.64794
2015-12-05 00:00:00 38.77534
2015-12-05 01:00:00 38.92130
2015-12-05 02:00:00 38.79306
2015-12-05 03:00:00 38.89439
2015-12-05 04:00:00 38.83526
2015-12-05 05:00:00 38.83567
2015-12-05 06:00:00 38.68104
2015-12-05 07:00:00 38.73836
2015-12-05 08:00:00 38.60936
2015-12-05 09:00:00 38.67215
2015-12-05 10:00:00 38.76523
2015-12-05 11:00:00 38.90271
2015-12-05 12:00:00 38.69746
2015-12-05 13:00:00 38.87457
2015-12-05 14:00:00 38.75851
2015-12-05 15:00:00 38.70801
2015-12-05 16:00:00 38.87395
2015-12-05 17:00:00 38.73664
2015-12-05 18:00:00 38.74196
2015-12-05 19:00:00 38.74646
2015-12-05 20:00:00 38.73734
2015-12-05 21:00:00 38.74333
2015-12-05 22:00:00 38.74056
2015-12-05 23:00:00 38.73020
2015-12-06 00:00:00 38.73945
2015-12-06 01:00:00 38.75005
2015-12-06 02:00:00 38.74074
2015-12-06 03:00:00 38.74809
2015-12-06 04:00:00 38.74380
2015-12-06 05:00:00 38.74383
2015-12-06 06:00:00 38.73261
2015-12-06 07:00:00 38.73677
2015-12-06 08:00:00 38.72740
2015-12-06 09:00:00 38.73196
2015-12-06 10:00:00 38.73872
2015-12-06 11:00:00 38.74870
2015-12-06 12:00:00 38.73380
2015-12-06 13:00:00 38.74666
2015-12-06 14:00:00 38.73823
2015-12-06 15:00:00 38.73457
2015-12-06 16:00:00 38.74661
2015-12-06 17:00:00 38.73664
2015-12-06 18:00:00 38.73703
2015-12-06 19:00:00 38.73736
2015-12-06 20:00:00 38.73669
2015-12-06 21:00:00 38.73713
2015-12-06 22:00:00 38.73693
2015-12-06 23:00:00 38.73618
2015-12-07 00:00:00 38.73685
2015-12-07 01:00:00 38.73762
2015-12-07 02:00:00 38.73694
2015-12-07 03:00:00 38.73748
2015-12-07 04:00:00 38.73716
2015-12-07 05:00:00 38.73717
2015-12-07 06:00:00 38.73635
2015-12-07 07:00:00 38.73665
2015-12-07 08:00:00 38.73597
2015-12-07 09:00:00 38.73630
2015-12-07 10:00:00 38.73679
2015-12-07 11:00:00 38.73752
2015-12-07 12:00:00 38.73644
2015-12-07 13:00:00 38.73737
2015-12-07 14:00:00 38.73676
2015-12-07 15:00:00 38.73649
2015-12-07 16:00:00 38.73737
2015-12-07 17:00:00 38.73664
2015-12-07 18:00:00 38.73667
2015-12-07 19:00:00 38.73670
2015-12-07 20:00:00 38.73665
2015-12-07 21:00:00 38.73668
2015-12-07 22:00:00 38.73666
2015-12-07 23:00:00 38.73661
2015-12-08 00:00:00 38.73666
2015-12-08 01:00:00 38.73671
2015-12-08 02:00:00 38.73667
2015-12-08 03:00:00 38.73670
2015-12-08 04:00:00 38.73668
2015-12-08 05:00:00 38.73668
2015-12-08 06:00:00 38.73662
2015-12-08 07:00:00 38.73664
2015-12-08 08:00:00 38.73660
2015-12-08 09:00:00 38.73662
2015-12-08 10:00:00 38.73666
2015-12-08 11:00:00 38.73671
2015-12-08 12:00:00 38.73663
2015-12-08 13:00:00 38.73670
2015-12-08 14:00:00 38.73665
2015-12-08 15:00:00 38.73663
2015-12-08 16:00:00 38.73670
2015-12-08 17:00:00 38.73664
2015-12-08 18:00:00 38.73665
2015-12-08 19:00:00 38.73665
2015-12-08 20:00:00 38.73664
2015-12-08 21:00:00 38.73665
2015-12-08 22:00:00 38.73665
2015-12-08 23:00:00 38.73664
2015-12-09 00:00:00 38.73665
2015-12-09 01:00:00 38.73665
2015-12-09 02:00:00 38.73665
2015-12-09 03:00:00 38.73665
2015-12-09 04:00:00 38.73665
2015-12-09 05:00:00 38.73665
2015-12-09 06:00:00 38.73664
2015-12-09 07:00:00 38.73664
2015-12-09 08:00:00 38.73664
2015-12-09 09:00:00 38.73664
2015-12-09 10:00:00 38.73665
2015-12-09 11:00:00 38.73665
2015-12-09 12:00:00 38.73664
2015-12-09 13:00:00 38.73665
2015-12-09 14:00:00 38.73664
2015-12-09 15:00:00 38.73664
2015-12-09 16:00:00 38.73665
2015-12-09 17:00:00 38.73664
2015-12-09 18:00:00 38.73664
2015-12-09 19:00:00 38.73664
2015-12-09 20:00:00 38.73664
2015-12-09 21:00:00 38.73664
2015-12-09 22:00:00 38.73664
2015-12-09 23:00:00 38.73664
2015-12-10 00:00:00 38.73664
2015-12-10 01:00:00 38.73664
2015-12-10 02:00:00 38.73664
2015-12-10 03:00:00 38.73664
2015-12-10 04:00:00 38.73664
2015-12-10 05:00:00 38.73664
2015-12-10 06:00:00 38.73664
2015-12-10 07:00:00 38.73664
2015-12-10 08:00:00 38.73664
2015-12-10 09:00:00 38.73664
2015-12-10 10:00:00 38.73664
2015-12-10 11:00:00 38.73664
2015-12-10 12:00:00 38.73664
2015-12-10 13:00:00 38.73664
2015-12-10 14:00:00 38.73664
2015-12-10 15:00:00 38.73664
2015-12-10 16:00:00 38.73664
write.csv(water.f,"Water_predictions.csv")