The RPubs version of this work can be found here, and source/data can be found on GitHub here.
This project consists of three parts - two required and one bonus - and is worth 15% of your grade. The project is due at 11:59 PM on Sunday, March 31. I will accept late submissions, with a penalty, until the meetup after that, when we review some projects.
rm(list = ls())
library(knitr)
library(ggplot2)
library(tidyr)
library(dplyr)
library(tseries)
library(forecast)
library(lubridate)
library(tidyverse)
Load all the data for all three parts from GitHub in the interest of reproducibility.
temp_file <- tempfile(fileext = ".xlsx")
download.file(url = "https://github.com/plb2018/DATA624/raw/master/project1/ATM624Data.xlsx?raw=true",
destfile = temp_file,
mode = "wb",
quiet = TRUE)
ATM <- readxl::read_excel(temp_file,skip=0,col_types = c("date","text","numeric"))
download.file(url = "https://github.com/plb2018/DATA624/raw/master/project1/ResidentialCustomerForecastLoad-624.xlsx?raw=true",
destfile = temp_file,
mode = "wb",
quiet = TRUE)
power <- readxl::read_excel(temp_file,skip=0,col_types = c("numeric","text","numeric"))
download.file(url = "https://github.com/plb2018/DATA624/raw/master/project1/Waterflow_Pipe1.xlsx?raw=true",
destfile = temp_file,
mode = "wb",
quiet = TRUE)
water1 <- readxl::read_excel(temp_file,skip=0,col_types = c("date","numeric"))
download.file(url = "https://github.com/plb2018/DATA624/raw/master/project1/Waterflow_Pipe2.xlsx?raw=true",
destfile = temp_file,
mode = "wb",
quiet = TRUE)
water2 <- readxl::read_excel(temp_file,skip=0,col_types = c("date","numeric"))
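The four downloads above share one pattern; a small helper like the sketch below (equivalent to the code above) would keep this tidy if more files were added:

#helper equivalent to the repeated download/read pattern above
read_gh_xlsx <- function(file_name, col_types) {
  base_url <- "https://github.com/plb2018/DATA624/raw/master/project1/"
  tmp <- tempfile(fileext = ".xlsx")
  download.file(url = paste0(base_url, file_name, "?raw=true"),
                destfile = tmp, mode = "wb", quiet = TRUE)
  readxl::read_excel(tmp, skip = 0, col_types = col_types)
}
#e.g. ATM <- read_gh_xlsx("ATM624Data.xlsx", c("date", "text", "numeric"))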
In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars; other than that it is straightforward. I am being somewhat ambiguous on purpose to make this have a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an Excel file; please provide your written report on your findings, visuals, discussion and your R code via an RPubs link, along with the actual .rmd file. Also, please submit the forecast in an Excel-readable file.
First we’ll inspect the data to see if we can gain any intuition about what we are working with.
We find a few missing values:

* 3 cases where we have no Cash values for ATM1 and 2 cases for ATM2, which we will impute after further investigation
* The entire month of May is missing, as this is what we are asked to forecast
ATM[!complete.cases(ATM),]
p <- ggplot(ATM[complete.cases(ATM),],aes(x=DATE,y=Cash,color=ATM))+
geom_line()
p + facet_grid(rows = vars(ATM),scales="free")
p <- ggplot(ATM[complete.cases(ATM),],aes(x=Cash,color=ATM))+
geom_histogram()
p + facet_grid(cols = vars(ATM),scales="free")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
From plotting the data, we can see that ATM1 and ATM2 look roughly similar in terms of withdrawals and cyclicality / seasonality. ATM3 appears as though it has just come “on-line” and thus has almost no history.
For ATM1 we see a bimodal distribution with peaks near 20 and 200, whereas for ATM2 we see a flatter, more uniform distribution.
ATM4 has several oddities. First, we see a monstrous outlier value of 10919.76. Given that these values are in hundreds, the outlier equates to more than $1 million, which is substantially more than ATMs typically hold - quick googling suggests that most ATMs carry about $10K and the max is around $200K - so something is wrong here. Common sense (based on context) dictates that this outlier is likely an error. I also note that the data from ATM4 are not integers, which is odd, as ATMs only dispense bills. I wonder whether this data is from an ATM that dispenses non-USD, yet is reported in USD, or something similar…
We will eliminate the massive outlier from ATM4 and examine the data for any seasonal / cyclical patterns.
#drop the outlier
ATM2 <- ATM[which(ATM$Cash < 9500),]
#add a weekday col
ATM2$weekday <- factor(weekdays(as.Date(ATM2$DATE)))
#reorder levels
ATM2$weekday <- ordered(ATM2$weekday,levels = c("Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday"))
#drop nas for now
ATM2 <- ATM2[complete.cases(ATM2),]
#rescale ATM4, for plotting
ATM2$Cash[ATM2$ATM == "ATM4"] <- ATM2$Cash[ATM2$ATM == "ATM4"]/5
ggplot(ATM2,aes(x=weekday,y=Cash,color=ATM))+
geom_boxplot()+
ggtitle("ATM Cyclicality")
From the above plots we can see that there is a weekly cycle where Thursday appears to have dramatically lower withdrawals than all other days. Note that in this plot, ATM4 has been re-scaled to make the visualization easier on the eyes.
Given that the day-of-week appears to be important, we will use this information when imputing missing values. We will impute using the median value for the weekday of each missing data point. As can be seen below, a missing “Thursday” probably warrants a different value than a missing “Saturday”.
medians.by.weekday <- ATM2 %>%
group_by(weekday,ATM) %>%
summarise(Cash = median(Cash, na.rm = TRUE))
ggplot(medians.by.weekday,aes(x=weekday,y=Cash,color=ATM,group=ATM,fill=ATM))+
geom_bar(stat="identity",position="dodge")+
ggtitle("Values For Imputation")
Given that there are so few missing values, we’ll perform the imputation manually. If there were substantially more, or expected to be substantially more in the future, we would obviously build a more efficient process to handle this.
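For reference, here is a sketch of what that more scalable process might look like - join the weekday/ATM medians computed above onto the raw data, then coalesce (this assumes the weekday column added just below):

#sketch: scalable imputation via a join rather than row-by-row patching
impute_by_weekday <- function(df, medians) {
  df %>%
    left_join(medians, by = c("weekday", "ATM"), suffix = c("", ".med")) %>%
    mutate(Cash = coalesce(Cash, Cash.med)) %>%
    select(-Cash.med)
}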
ATM$weekday <- factor(weekdays(as.Date(ATM$DATE)))
missing.values <- head(ATM[is.na(ATM$Cash) ,],5)
kable(missing.values,caption = "The Missing Values with Weekdays")
| DATE | ATM | Cash | weekday |
|---|---|---|---|
| 2009-06-13 | ATM1 | NA | Saturday |
| 2009-06-16 | ATM1 | NA | Tuesday |
| 2009-06-18 | ATM2 | NA | Thursday |
| 2009-06-22 | ATM1 | NA | Monday |
| 2009-06-24 | ATM2 | NA | Wednesday |
#grab index for missing values
missing.idx <- head(which(is.na(ATM$Cash)),5)
#manually replace each one
ATM$Cash[missing.idx[1]]<- medians.by.weekday$Cash[(medians.by.weekday$weekday=="Saturday") & (medians.by.weekday$ATM == "ATM1")]
ATM$Cash[missing.idx[2]] <- medians.by.weekday$Cash[(medians.by.weekday$weekday=="Tuesday") & (medians.by.weekday$ATM == "ATM1")]
ATM$Cash[missing.idx[3]] <- medians.by.weekday$Cash[(medians.by.weekday$weekday=="Thursday") & (medians.by.weekday$ATM == "ATM2")]
ATM$Cash[missing.idx[4]] <- medians.by.weekday$Cash[(medians.by.weekday$weekday=="Monday") & (medians.by.weekday$ATM == "ATM1")]
ATM$Cash[missing.idx[5]] <- medians.by.weekday$Cash[(medians.by.weekday$weekday=="Wednesday") & (medians.by.weekday$ATM == "ATM2")]
#replace the ATM4 outlier; which() sidesteps the NA Cash values for May
ATM$Cash[which(ATM$Cash > 9000)] <- medians.by.weekday$Cash[(medians.by.weekday$weekday=="Tuesday") & (medians.by.weekday$ATM == "ATM4")]
#check updated values
kable( ATM[missing.idx,],caption = "The Imputed Values")
| DATE | ATM | Cash | weekday |
|---|---|---|---|
| 2009-06-13 | ATM1 | 96 | Saturday |
| 2009-06-16 | ATM1 | 99 | Tuesday |
| 2009-06-18 | ATM2 | 8 | Thursday |
| 2009-06-22 | ATM1 | 85 | Monday |
| 2009-06-24 | ATM2 | 28 | Wednesday |
The characteristics of the ATMs seem distinct enough that they could be modelled as 4 separate processes. Given more time, I would likely try to tackle this as a more generalized “N-in, 1-out” problem so that I could make use of ALL the ATM data for each prediction, but I have not done that here.
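For illustration only, a rough sketch of one such “N-in” approach - a vector autoregression fit jointly on all four series (this assumes the vars package is installed; it is not loaded above and was not used for the forecasts in this report):

#sketch only: joint model across all ATMs via a VAR (vars package assumed installed)
library(vars)
atm.wide <- ATM %>%
  drop_na() %>%
  select(DATE, ATM, Cash) %>%
  pivot_wider(names_from = ATM, values_from = Cash)
atm.var <- VAR(ts(as.matrix(atm.wide[,-1]), frequency = 7), p = 7)
atm.var.fc <- predict(atm.var, n.ahead = 31)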
ATM1 <- ATM[ATM$ATM == "ATM1",]
ATM1 <- ts(ATM1[c("Cash")])
ggtsdisplay(ATM1, points = FALSE,
main = "Activity for ATM1",
xlab = "Day",
ylab = "Withdrawls (000's)")
The weekly cycle is quite apparent in all 3 of the above plots and the relevant period is 7 days (i.e. Thursdays appear to show a major slowdown in activity). We’ll difference using a lag of 7 periods and run some stationarity tests.
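As a quick cross-check on that choice, the forecast package can suggest differencing orders directly (a sketch, using the same frequency-7 series that is built below):

#automated differencing suggestions from the forecast package
nsdiffs(ts(ATM$Cash[which(ATM$ATM == "ATM1")], frequency = 7))  #seasonal differences
ndiffs(ATM$Cash[which(ATM$ATM == "ATM1")])                      #first differences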
ATM1 <- ATM[ATM$ATM == "ATM1",]
ATM1 <- ATM1[complete.cases(ATM1),]
ATM1 <- ts(ATM1[c("Cash")],frequency = 7)
ggtsdisplay(ATM1,
main = "Activity for ATM1",
xlab = "Day",
ylab = "Withdrawls (000's)")
Box.test(diff(ATM1,lag=7), type = "Ljung-Box")
##
## Box-Ljung test
##
## data: diff(ATM1, lag = 7)
## X-squared = 9.8085, df = 1, p-value = 0.001737
kpss.test(diff(ATM1,lag=7))
## Warning in kpss.test(diff(ATM1, lag = 7)): p-value greater than printed p-value
##
## KPSS Test for Level Stationarity
##
## data: diff(ATM1, lag = 7)
## KPSS Level = 0.021721, Truncation lag parameter = 5, p-value = 0.1
ggtsdisplay(diff(ATM1,lag=7),
main = "ATM1 - Diff, Lag=7",
xlab = "Day",
ylab = "")
Based on the KPSS output, the series appears stationary, so we will move on to the modelling part.
The spike at lag 1 suggests an MA component, and the large spike at lag 7 is consistent with the seasonal component we’ve already discussed. The variability looks reasonably stable, but we’ll apply a Box-Cox transformation anyway just to be sure.
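For comparison with the automated search below, a hand-specified candidate implied by that reading of the ACF/PACF might look like the following sketch (auto.arima ultimately selects ARIMA(0,0,2)(0,1,1)[7], a close relative):

#manually specified candidate: MA(1) plus a seasonal difference and seasonal MA(1)
ATM1.manual <- Arima(ATM1, order = c(0,0,1), seasonal = c(0,1,1))
summary(ATM1.manual)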
We use auto.arima to search a model-space and return an appropriate model:
lambda = BoxCox.lambda(ATM1)
ATM1.arima <- auto.arima(ATM1,approximation = F,lambda = lambda)
checkresiduals(ATM1.arima)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,0,2)(0,1,1)[7]
## Q* = 10.52, df = 11, p-value = 0.4843
##
## Model df: 3. Total lags used: 14
ATM1.arima %>% forecast(h=31) %>% autoplot()
kpss.test(resid(ATM1.arima))
## Warning in kpss.test(resid(ATM1.arima)): p-value greater than printed p-value
##
## KPSS Test for Level Stationarity
##
## data: resid(ATM1.arima)
## KPSS Level = 0.094343, Truncation lag parameter = 5, p-value = 0.1
The model looks reasonably good, with the residuals approximately normally distributed (albeit with some left skew) and the ACF values all in-bounds. The forecast plot also looks reasonable to me.
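As one extra sanity check (beyond the residual diagnostics above), forecast’s accuracy() reports in-sample error measures for the fitted model:

#in-sample error measures for the ATM1 model
accuracy(ATM1.arima)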
Given the similarity between ATM1 & ATM2, we’ll follow more-or-less the same process.
ATM2 <- ATM[ATM$ATM == "ATM2",]
ATM2 <- ATM2[complete.cases(ATM2),]
ATM2 <- ts(ATM2[c("Cash")],frequency = 7)
ggtsdisplay(ATM2,
main = "Activity for ATM2",
xlab = "Day",
ylab = "Withdrawals (hundreds of $)")
Box.test(diff(ATM2,lag=7), type = "Ljung-Box")
kpss.test(diff(ATM2,lag=7))
ggtsdisplay(diff(ATM2,lag=7),
main = "ATM2 - Diff, Lag=7",
xlab = "Day",
ylab = "")
lambda = BoxCox.lambda(ATM2)
ATM2.arima <- auto.arima(ATM2,approximation = F,lambda = lambda)
checkresiduals(ATM2.arima)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(2,0,2)(0,1,1)[7]
## Q* = 9.987, df = 9, p-value = 0.3515
##
## Model df: 5. Total lags used: 14
ATM2.arima %>% forecast(h=31) %>% autoplot()
kpss.test(resid(ATM2.arima))
## Warning in kpss.test(resid(ATM2.arima)): p-value greater than printed p-value
##
## KPSS Test for Level Stationarity
##
## data: resid(ATM2.arima)
## KPSS Level = 0.071598, Truncation lag parameter = 5, p-value = 0.1
The model for ATM2 looks reasonable as well - residuals appear to be random, the ACF is more-or-less entirely in-bounds, and the distribution of residuals is normal. The model parameters are slightly different from those of ATM1.
ATM3 poses a bit of a tricky problem. We have almost no data and thus nothing to model. However, the limited data that we DO have appears reasonably similar to that of ATM1 and ATM2, so rather than having no forecast or using a fixed value or similar, we will use the mean of ATM1 and ATM2. I feel this is appropriate because we know ATM3 is, in fact, an ATM (which tells us a lot!) and that the limited data we have looks similar to a few other ATMs we know something about.
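For reference, the simpler alternative we decided against - a plain mean forecast built only from ATM3’s own few non-zero observations - would be a one-liner (a sketch; with so few points the prediction intervals would be close to meaningless):

#sketch of the rejected alternative: mean forecast from ATM3's own observations
atm3.obs <- ATM$Cash[which(ATM$ATM == "ATM3" & ATM$Cash > 0)]
meanf(atm3.obs, h = 31)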
ATM3 <- (ATM1 + ATM2) /2
ggtsdisplay(ATM3,
main = "Activity for ATM3 (Proxy)",
xlab = "Day",
ylab = "Withdrawals (hundreds of $)")
Box.test(diff(ATM3,lag=7), type = "Ljung-Box")
kpss.test(diff(ATM3,lag=7))
ggtsdisplay(diff(ATM3,lag=7),
main = "ATM3 (Proxy) - Diff, Lag=7",
xlab = "Day",
ylab = "")
lambda = BoxCox.lambda(ATM3)
ATM3.arima <- auto.arima(ATM3,approximation = F,lambda = lambda)
checkresiduals(ATM3.arima)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(2,0,2)(0,1,1)[7]
## Q* = 10.476, df = 9, p-value = 0.3134
##
## Model df: 5. Total lags used: 14
ATM3.arima %>% forecast(h=31) %>% autoplot()
kpss.test(resid(ATM3.arima))
## Warning in kpss.test(resid(ATM3.arima)): p-value greater than printed p-value
##
## KPSS Test for Level Stationarity
##
## data: resid(ATM3.arima)
## KPSS Level = 0.096755, Truncation lag parameter = 5, p-value = 0.1
ATM4 <- ATM[ATM$ATM == "ATM4",]
ATM4 <- ATM4[complete.cases(ATM4),]
ATM4 <- ts(ATM4[c("Cash")],frequency = 7)
ggtsdisplay(ATM4,
main = "Activity for ATM4",
xlab = "Day",
ylab = "Withdrawals (hundreds of $)")
Box.test(diff(ATM4,lag=7), type = "Ljung-Box")
kpss.test(diff(ATM4,lag=7))
ggtsdisplay(diff(ATM4,lag=7),
main = "ATM4 - Diff, Lag=7",
xlab = "Day",
ylab = "")
lambda = BoxCox.lambda(ATM4)
ATM4.arima <- auto.arima(ATM4,approximation = F,lambda = lambda)
checkresiduals(ATM4.arima)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,0,0)(2,0,0)[7] with non-zero mean
## Q* = 17.788, df = 10, p-value = 0.05864
##
## Model df: 4. Total lags used: 14
ATM4.arima %>% forecast(h=31) %>% autoplot()
kpss.test(resid(ATM4.arima))
## Warning in kpss.test(resid(ATM4.arima)): p-value greater than printed p-value
##
## KPSS Test for Level Stationarity
##
## data: resid(ATM4.arima)
## KPSS Level = 0.048277, Truncation lag parameter = 5, p-value = 0.1
Below is a table containing the May 2010 predictions for each ATM:
atm1.f <- ATM1.arima %>% forecast(h=31)
atm2.f <- ATM2.arima %>% forecast(h=31)
atm3.f <- ATM3.arima %>% forecast(h=31)
atm4.f <- ATM4.arima %>% forecast(h=31)
ATM.f <- data.frame(ATM1 = as.numeric(atm1.f$mean),
ATM2 = as.numeric(atm2.f$mean),
ATM3 = as.numeric(atm3.f$mean),
ATM4 = as.numeric(atm4.f$mean))
row.names(ATM.f) <- as.character(seq.Date(as.Date("2010/5/1"),by="day",length.out=31))
kable(ATM.f)
| | ATM1 | ATM2 | ATM3 | ATM4 |
|---|---|---|---|---|
| 2010-05-01 | 86.70689 | 65.570389 | 76.518060 | 238.3850 |
| 2010-05-02 | 100.58583 | 71.420388 | 87.010739 | 346.0242 |
| 2010-05-03 | 73.68117 | 11.665773 | 43.414625 | 421.9467 |
| 2010-05-04 | 4.23253 | 2.506298 | 4.046689 | 158.9724 |
| 2010-05-05 | 100.16166 | 98.076872 | 99.373513 | 385.1930 |
| 2010-05-06 | 79.34712 | 89.044895 | 85.713901 | 264.4395 |
| 2010-05-07 | 85.75795 | 65.813527 | 76.793954 | 326.8497 |
| 2010-05-08 | 87.20172 | 65.595578 | 76.579836 | 207.1759 |
| 2010-05-09 | 100.39771 | 71.435234 | 87.179046 | 382.2706 |
| 2010-05-10 | 73.68117 | 11.649449 | 43.306440 | 372.5236 |
| 2010-05-11 | 4.23253 | 2.505951 | 4.013465 | 199.9500 |
| 2010-05-12 | 100.16166 | 98.106872 | 99.533873 | 354.1557 |
| 2010-05-13 | 79.34712 | 89.033761 | 85.714920 | 242.4417 |
| 2010-05-14 | 85.75795 | 65.794021 | 76.665286 | 367.6915 |
| 2010-05-15 | 87.20172 | 65.612794 | 76.638523 | 291.4327 |
| 2010-05-16 | 100.39771 | 71.445892 | 87.266385 | 353.1292 |
| 2010-05-17 | 73.68117 | 11.638179 | 43.230970 | 366.2645 |
| 2010-05-18 | 4.23253 | 2.505590 | 4.000651 | 269.7907 |
| 2010-05-19 | 100.16166 | 98.127762 | 99.633279 | 355.3942 |
| 2010-05-20 | 79.34712 | 89.026356 | 85.698752 | 305.8082 |
| 2010-05-21 | 85.75795 | 65.780310 | 76.593109 | 346.2168 |
| 2010-05-22 | 87.20172 | 65.624554 | 76.685456 | 303.0557 |
| 2010-05-23 | 100.39771 | 71.453533 | 87.308673 | 354.6069 |
| 2010-05-24 | 73.68117 | 11.630400 | 43.180981 | 355.3480 |
| 2010-05-25 | 4.23253 | 2.505255 | 3.997501 | 296.6022 |
| 2010-05-26 | 100.16166 | 98.142302 | 99.692670 | 349.4277 |
| 2010-05-27 | 79.34712 | 89.021441 | 85.679321 | 314.7151 |
| 2010-05-28 | 85.75795 | 65.770676 | 76.554380 | 350.2818 |
| 2010-05-29 | 87.20172 | 65.632582 | 76.719640 | 325.2836 |
| 2010-05-30 | 100.39771 | 71.459002 | 87.327043 | 349.0576 |
| 2010-05-31 | 73.68117 | 11.625033 | 43.149243 | 351.8726 |
write.csv(ATM.f,"ATM_predictions.csv")
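A CSV opens directly in Excel; if a native .xlsx is preferred, the writexl package (assumed installed, not loaded above) makes it a one-liner:

#optional: native Excel output (writexl assumed installed)
writexl::write_xlsx(data.frame(Date = row.names(ATM.f), ATM.f), "ATM_predictions.xlsx")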
Part B consists of a simple dataset of residential power usage from January 1998 until December 2013. Your assignment is to model these data and provide a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in kilowatt hours; the rest is straightforward. Add this to your existing files above.
#quick inspection and check for missing data
kable(tail(power))
| CaseSequence | YYYY-MMM | KWH |
|---|---|---|
| 919 | 2013-Jul | 8415321 |
| 920 | 2013-Aug | 9080226 |
| 921 | 2013-Sep | 7968220 |
| 922 | 2013-Oct | 5759367 |
| 923 | 2013-Nov | 5769083 |
| 924 | 2013-Dec | 9606304 |
kable(power[!complete.cases(power),])
| CaseSequence | YYYY-MMM | KWH |
|---|---|---|
| 861 | 2008-Sep | NA |
power.ts <-ts(power$KWH, start = c(1998, 1), frequency = 12)
autoplot(power.ts )
ggseasonplot(power.ts )
ggsubseriesplot(power.ts )
## Warning: Removed 16 row(s) containing missing values (geom_path).
gglagplot(power.ts )
ggAcf(power.ts )
We see only 1 missing case in the table above (September 2008), which we can interpolate using the mean of the other September values. We also see an extreme value in July 2010.
We can see from the time series plot that the data has no major long-term trend and reasonably stable variability, though there is an indication that the amplitude may be increasing post-2010. There is a pronounced seasonality visible in the time series plot and, more specifically, the seasonal plot - summer and winter equate to higher power consumption than do fall and spring.
The seasonality is also abundantly clear in the ACF plot.
The seasonality and lag both suggest that we want to use a 12-month lag when evaluating these data. We will also try a Box-Cox transformation, given that the variability appears to be picking up towards the end of the series.
We’ll fix the missing value; we also considered winsorizing the data to damp that extreme July 2010 value (the call is left commented out below).
#find missing idx
idx <- which(!complete.cases(power))
#collect all values from same month
sept.kwh <- power$KWH[seq.int(9, length(power$KWH), 12L)]
#compute mean and inject
power$KWH[idx] <- mean(sept.kwh,na.rm = T)
#power$KWH <- Winsorize(power$KWH )
#rebuild the TS
power.ts <-ts(power$KWH, start = c(1998, 1), frequency = 12)
power.ts.l <- BoxCox(power.ts, BoxCox.lambda(power.ts))
ggtsdisplay(diff(power.ts.l,differences=1,lag=12), main = "KWH Consumption - Seasonally Differenced (Lag = 12)")
kpss.test(diff(power.ts.l,differences=1,lag=12))
## Warning in kpss.test(diff(power.ts.l, differences = 1, lag = 12)): p-value greater
## than printed p-value
##
## KPSS Test for Level Stationarity
##
## data: diff(power.ts.l, differences = 1, lag = 12)
## KPSS Level = 0.06921, Truncation lag parameter = 4, p-value = 0.1
The data are now stationary as per the KPSS test. We see lag=12 as a stand-out in both the ACF and PACF.
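As an aside, the Winsorize() call left commented out earlier would require the DescTools package; a base-R sketch of the same idea, capping KWH at the 1st/99th percentiles, is below (not applied to the modelled series):

#base-R winsorizing sketch (not applied)
caps <- quantile(power$KWH, probs = c(0.01, 0.99), na.rm = TRUE)
kwh.wins <- pmin(pmax(power$KWH, caps[1]), caps[2])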
lambda = BoxCox.lambda(power.ts)
power.arima <- auto.arima(power.ts,approximation = F,lambda = lambda)
checkresiduals(power.arima )
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,1,2)(2,0,0)[12]
## Q* = 20.429, df = 20, p-value = 0.4314
##
## Model df: 4. Total lags used: 24
power.arima %>% forecast(h=12) %>% autoplot()
kpss.test(resid(power.arima))
## Warning in kpss.test(resid(power.arima)): p-value greater than printed p-value
##
## KPSS Test for Level Stationarity
##
## data: resid(power.arima)
## KPSS Level = 0.18848, Truncation lag parameter = 4, p-value = 0.1
Based on the residual plots above, the model seems to be reasonably good! The fly in the ointment seems to be that major outlier month of July 2010. The ACF is mostly in-bounds and the distribution of residuals is mostly normal; however, there is significant left skew as a result of the outlier in the data.
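One remedy we did not apply here is forecast’s tsclean(), which replaces identified outliers (and any remaining NAs) using seasonally adjusted interpolation; refitting on the cleaned series would be a quick experiment (a sketch, not used for the submitted forecast):

#sketch: refit on an outlier-cleaned series
power.clean <- tsclean(power.ts)
power.arima.clean <- auto.arima(power.clean, approximation = FALSE,
                                lambda = BoxCox.lambda(power.clean))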
power.f <-power.arima %>% forecast(h=12)
power.f <- data.frame(power.f$mean)
colnames(power.f) <- "KWH"
row.names(power.f) <- as.character(seq.Date(as.Date("2014/1/1"),by="month",length.out=12))
kable(power.f)
| | KWH |
|---|---|
| 2014-01-01 | 9054756 |
| 2014-02-01 | 7517891 |
| 2014-03-01 | 6679011 |
| 2014-04-01 | 6289639 |
| 2014-05-01 | 6312917 |
| 2014-06-01 | 7578198 |
| 2014-07-01 | 7998690 |
| 2014-08-01 | 8386987 |
| 2014-09-01 | 7478936 |
| 2014-10-01 | 6186266 |
| 2014-11-01 | 6242122 |
| 2014-12-01 | 7583984 |
write.csv(power.f,"Power_predictions.csv")
Part C consists of two data sets. These are simple 2-column sets; however, they have different time stamps. Your optional assignment is to time-base sequence the data and aggregate based on hour (an example of what this looks like follows). Note: for multiple recordings within an hour, take the mean. Then determine whether the data is stationary and whether it can be forecast. If so, provide a week-forward forecast and present results via RPubs and .rmd, and the forecast in an Excel-readable file.
head(water1)
head(water2)
#join them
water <- rbind(water1,water2)
#sort data (the `Date Time` column is already POSIXct, so we can order on it directly)
water <- water[order(water$`Date Time`),]
#aggregate by hour and take the mean
water <- water %>%
separate(`Date Time`, into = c("date", "time"), sep = " ") %>%
separate(time, into = c("hour", "minute", "second"), sep = ":") %>%
group_by(date, hour) %>%
summarise(WaterFlow = mean(WaterFlow, na.rm = TRUE)) %>%
ungroup()
#create timeseries
water.ts <- ts(water$WaterFlow,frequency=24)
autoplot(water.ts)
ggseasonplot(water.ts,polar=T )
ggsubseriesplot(water.ts )
gglagplot(water.ts )
ggAcf(water.ts )
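As an aside, the hourly aggregation above could be done more directly with lubridate’s floor_date() (lubridate is already loaded); a sketch that should produce an equivalent hourly mean from the same raw `Date Time` column:

#sketch: equivalent hourly aggregation via floor_date()
water.alt <- rbind(water1, water2) %>%
  mutate(hour = floor_date(`Date Time`, unit = "hour")) %>%
  group_by(hour) %>%
  summarise(WaterFlow = mean(WaterFlow, na.rm = TRUE)) %>%
  arrange(hour)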
There’s an obvious change in variability at around day 12 which sticks out like a sore thumb. Otherwise, we see that there is no clear seasonal pattern and that the ACF is generally positive out to 48 lags.
On closer inspection, it appears as though the data from the two sets are quite different in terms of magnitude. In order to deal with the apparent jump in variance, I’m going to scale “water1” by the ratio of the standard deviations of water2 and water1.
scale.factor <- sd(water2$WaterFlow)/sd(water1$WaterFlow)
water1$WaterFlow <- water1$WaterFlow * scale.factor
#join them
water <- rbind(water1,water2)
#sort data (the `Date Time` column is already POSIXct)
water <- water[order(water$`Date Time`),]
#aggregate by hour and take the mean
water <- water %>%
separate(`Date Time`, into = c("date", "time"), sep = " ") %>%
separate(time, into = c("hour", "minute", "second"), sep = ":") %>%
group_by(date, hour) %>%
summarise(WaterFlow = mean(WaterFlow, na.rm = TRUE)) %>%
ungroup()
#create timeseries
water.ts <- ts(water$WaterFlow,frequency=24)
autoplot(water.ts)
water.ts.l <- BoxCox(water.ts, BoxCox.lambda(water.ts))
ggtsdisplay(diff(water.ts.l,differences=1,lag=1), main = "WaterFlow - First Difference (Lag = 1)")
kpss.test(diff(water.ts.l,differences=1,lag=1))
## Warning in kpss.test(diff(water.ts.l, differences = 1, lag = 1)): p-value greater than
## printed p-value
##
## KPSS Test for Level Stationarity
##
## data: diff(water.ts.l, differences = 1, lag = 1)
## KPSS Level = 0.011026, Truncation lag parameter = 7, p-value = 0.1
We see that, after a Box-Cox transform and a first-order difference with a lag of one, we get something stationary where the ACF is mostly in-bounds and the PACF appears to have a well-defined, exponential-looking decay to zero.
lambda = BoxCox.lambda(water.ts)
water.arima <- auto.arima(water.ts,approximation = F,lambda = lambda)
checkresiduals(water.arima )
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,0,0)(1,0,0)[24] with non-zero mean
## Q* = 62.932, df = 46, p-value = 0.04911
##
## Model df: 2. Total lags used: 48
water.arima %>% forecast(h=168) %>% autoplot()
kpss.test(resid(water.arima))
## Warning in kpss.test(resid(water.arima)): p-value greater than printed p-value
##
## KPSS Test for Level Stationarity
##
## data: resid(water.arima)
## KPSS Level = 0.11557, Truncation lag parameter = 7, p-value = 0.1
I had tried this previously without scaling the data from “water1” and got results that looked sub-par (not shown here). Upon scaling the input data (which, admittedly, was quite naive - I’m sure there’s a much better way!) I get results that look reasonable.
The residuals look reasonably normal and the ACF is mostly in-bounds, though the Ljung-Box p-value (~0.049) is borderline, suggesting a little autocorrelation may remain.
water.f <-water.arima %>% forecast(h=168)
water.f <- data.frame(water.f$mean)
colnames(water.f) <- "WaterFlow"
row.names(water.f) <- seq(ymd_hm("2015-12-3 17:00"), ymd_hm("2015-12-10 16:00"), by = "hour")
kable(water.f)
| WaterFlow | |
|---|---|
| 2015-12-03 17:00:00 | 38.73527 |
| 2015-12-03 18:00:00 | 39.74757 |
| 2015-12-03 19:00:00 | 40.60960 |
| 2015-12-03 20:00:00 | 38.86769 |
| 2015-12-03 21:00:00 | 40.00868 |
| 2015-12-03 22:00:00 | 39.48159 |
| 2015-12-03 23:00:00 | 37.51949 |
| 2015-12-04 00:00:00 | 39.27028 |
| 2015-12-04 01:00:00 | 41.29809 |
| 2015-12-04 02:00:00 | 39.51524 |
| 2015-12-04 03:00:00 | 40.92247 |
| 2015-12-04 04:00:00 | 40.10006 |
| 2015-12-04 05:00:00 | 40.10572 |
| 2015-12-04 06:00:00 | 37.97261 |
| 2015-12-04 07:00:00 | 38.76030 |
| 2015-12-04 08:00:00 | 36.99283 |
| 2015-12-04 09:00:00 | 37.85080 |
| 2015-12-04 10:00:00 | 39.13071 |
| 2015-12-04 11:00:00 | 41.03850 |
| 2015-12-04 12:00:00 | 38.19784 |
| 2015-12-04 13:00:00 | 40.64643 |
| 2015-12-04 14:00:00 | 39.03807 |
| 2015-12-04 15:00:00 | 38.34283 |
| 2015-12-04 16:00:00 | 40.63772 |
| 2015-12-04 17:00:00 | 38.73654 |
| 2015-12-04 18:00:00 | 38.80984 |
| 2015-12-04 19:00:00 | 38.87193 |
| 2015-12-04 20:00:00 | 38.74616 |
| 2015-12-04 21:00:00 | 38.82868 |
| 2015-12-04 22:00:00 | 38.79062 |
| 2015-12-04 23:00:00 | 38.64794 |
| 2015-12-05 00:00:00 | 38.77534 |
| 2015-12-05 01:00:00 | 38.92130 |
| 2015-12-05 02:00:00 | 38.79306 |
| 2015-12-05 03:00:00 | 38.89439 |
| 2015-12-05 04:00:00 | 38.83526 |
| 2015-12-05 05:00:00 | 38.83567 |
| 2015-12-05 06:00:00 | 38.68104 |
| 2015-12-05 07:00:00 | 38.73836 |
| 2015-12-05 08:00:00 | 38.60936 |
| 2015-12-05 09:00:00 | 38.67215 |
| 2015-12-05 10:00:00 | 38.76523 |
| 2015-12-05 11:00:00 | 38.90271 |
| 2015-12-05 12:00:00 | 38.69746 |
| 2015-12-05 13:00:00 | 38.87457 |
| 2015-12-05 14:00:00 | 38.75851 |
| 2015-12-05 15:00:00 | 38.70801 |
| 2015-12-05 16:00:00 | 38.87395 |
| 2015-12-05 17:00:00 | 38.73664 |
| 2015-12-05 18:00:00 | 38.74196 |
| 2015-12-05 19:00:00 | 38.74646 |
| 2015-12-05 20:00:00 | 38.73734 |
| 2015-12-05 21:00:00 | 38.74333 |
| 2015-12-05 22:00:00 | 38.74056 |
| 2015-12-05 23:00:00 | 38.73020 |
| 2015-12-06 00:00:00 | 38.73945 |
| 2015-12-06 01:00:00 | 38.75005 |
| 2015-12-06 02:00:00 | 38.74074 |
| 2015-12-06 03:00:00 | 38.74809 |
| 2015-12-06 04:00:00 | 38.74380 |
| 2015-12-06 05:00:00 | 38.74383 |
| 2015-12-06 06:00:00 | 38.73261 |
| 2015-12-06 07:00:00 | 38.73677 |
| 2015-12-06 08:00:00 | 38.72740 |
| 2015-12-06 09:00:00 | 38.73196 |
| 2015-12-06 10:00:00 | 38.73872 |
| 2015-12-06 11:00:00 | 38.74870 |
| 2015-12-06 12:00:00 | 38.73380 |
| 2015-12-06 13:00:00 | 38.74666 |
| 2015-12-06 14:00:00 | 38.73823 |
| 2015-12-06 15:00:00 | 38.73457 |
| 2015-12-06 16:00:00 | 38.74661 |
| 2015-12-06 17:00:00 | 38.73664 |
| 2015-12-06 18:00:00 | 38.73703 |
| 2015-12-06 19:00:00 | 38.73736 |
| 2015-12-06 20:00:00 | 38.73669 |
| 2015-12-06 21:00:00 | 38.73713 |
| 2015-12-06 22:00:00 | 38.73693 |
| 2015-12-06 23:00:00 | 38.73618 |
| 2015-12-07 00:00:00 | 38.73685 |
| 2015-12-07 01:00:00 | 38.73762 |
| 2015-12-07 02:00:00 | 38.73694 |
| 2015-12-07 03:00:00 | 38.73748 |
| 2015-12-07 04:00:00 | 38.73716 |
| 2015-12-07 05:00:00 | 38.73717 |
| 2015-12-07 06:00:00 | 38.73635 |
| 2015-12-07 07:00:00 | 38.73665 |
| 2015-12-07 08:00:00 | 38.73597 |
| 2015-12-07 09:00:00 | 38.73630 |
| 2015-12-07 10:00:00 | 38.73679 |
| 2015-12-07 11:00:00 | 38.73752 |
| 2015-12-07 12:00:00 | 38.73644 |
| 2015-12-07 13:00:00 | 38.73737 |
| 2015-12-07 14:00:00 | 38.73676 |
| 2015-12-07 15:00:00 | 38.73649 |
| 2015-12-07 16:00:00 | 38.73737 |
| 2015-12-07 17:00:00 | 38.73664 |
| 2015-12-07 18:00:00 | 38.73667 |
| 2015-12-07 19:00:00 | 38.73670 |
| 2015-12-07 20:00:00 | 38.73665 |
| 2015-12-07 21:00:00 | 38.73668 |
| 2015-12-07 22:00:00 | 38.73666 |
| 2015-12-07 23:00:00 | 38.73661 |
| 2015-12-08 00:00:00 | 38.73666 |
| 2015-12-08 01:00:00 | 38.73671 |
| 2015-12-08 02:00:00 | 38.73667 |
| 2015-12-08 03:00:00 | 38.73670 |
| 2015-12-08 04:00:00 | 38.73668 |
| 2015-12-08 05:00:00 | 38.73668 |
| 2015-12-08 06:00:00 | 38.73662 |
| 2015-12-08 07:00:00 | 38.73664 |
| 2015-12-08 08:00:00 | 38.73660 |
| 2015-12-08 09:00:00 | 38.73662 |
| 2015-12-08 10:00:00 | 38.73666 |
| 2015-12-08 11:00:00 | 38.73671 |
| 2015-12-08 12:00:00 | 38.73663 |
| 2015-12-08 13:00:00 | 38.73670 |
| 2015-12-08 14:00:00 | 38.73665 |
| 2015-12-08 15:00:00 | 38.73663 |
| 2015-12-08 16:00:00 | 38.73670 |
| 2015-12-08 17:00:00 | 38.73664 |
| 2015-12-08 18:00:00 | 38.73665 |
| 2015-12-08 19:00:00 | 38.73665 |
| 2015-12-08 20:00:00 | 38.73664 |
| 2015-12-08 21:00:00 | 38.73665 |
| 2015-12-08 22:00:00 | 38.73665 |
| 2015-12-08 23:00:00 | 38.73664 |
| 2015-12-09 00:00:00 | 38.73665 |
| 2015-12-09 01:00:00 | 38.73665 |
| 2015-12-09 02:00:00 | 38.73665 |
| 2015-12-09 03:00:00 | 38.73665 |
| 2015-12-09 04:00:00 | 38.73665 |
| 2015-12-09 05:00:00 | 38.73665 |
| 2015-12-09 06:00:00 | 38.73664 |
| 2015-12-09 07:00:00 | 38.73664 |
| 2015-12-09 08:00:00 | 38.73664 |
| 2015-12-09 09:00:00 | 38.73664 |
| 2015-12-09 10:00:00 | 38.73665 |
| 2015-12-09 11:00:00 | 38.73665 |
| 2015-12-09 12:00:00 | 38.73664 |
| 2015-12-09 13:00:00 | 38.73665 |
| 2015-12-09 14:00:00 | 38.73664 |
| 2015-12-09 15:00:00 | 38.73664 |
| 2015-12-09 16:00:00 | 38.73665 |
| 2015-12-09 17:00:00 | 38.73664 |
| 2015-12-09 18:00:00 | 38.73664 |
| 2015-12-09 19:00:00 | 38.73664 |
| 2015-12-09 20:00:00 | 38.73664 |
| 2015-12-09 21:00:00 | 38.73664 |
| 2015-12-09 22:00:00 | 38.73664 |
| 2015-12-09 23:00:00 | 38.73664 |
| 2015-12-10 00:00:00 | 38.73664 |
| 2015-12-10 01:00:00 | 38.73664 |
| 2015-12-10 02:00:00 | 38.73664 |
| 2015-12-10 03:00:00 | 38.73664 |
| 2015-12-10 04:00:00 | 38.73664 |
| 2015-12-10 05:00:00 | 38.73664 |
| 2015-12-10 06:00:00 | 38.73664 |
| 2015-12-10 07:00:00 | 38.73664 |
| 2015-12-10 08:00:00 | 38.73664 |
| 2015-12-10 09:00:00 | 38.73664 |
| 2015-12-10 10:00:00 | 38.73664 |
| 2015-12-10 11:00:00 | 38.73664 |
| 2015-12-10 12:00:00 | 38.73664 |
| 2015-12-10 13:00:00 | 38.73664 |
| 2015-12-10 14:00:00 | 38.73664 |
| 2015-12-10 15:00:00 | 38.73664 |
| 2015-12-10 16:00:00 | 38.73664 |
write.csv(water.f,"Water_predictions.csv")