This project consists of 3 parts - two required and one bonus and is worth 15% of your grade. The project is due at 11:59 PM on Sunday March 31. I will accept late submissions with a penalty until the meetup after that when we review some projects.

library(tidyverse)
library(scales)
library(lubridate)
library(knitr)
library(fpp2)
library(readxl)
library(forecast)

Part A – ATM Forecast

ATM1 Time Series

Set the frequency to 7 and plot the ACF/PACF

atm1<-atm.ts[, "ATM1"]
atm1.ts<-ts(atm1, frequency = 7)

ggtsdisplay(atm1.ts, points = FALSE, main = "Withdrawals from ATM1", xlab = "Week", ylab = "Cash in hundreds")

From the above plot, we notice:

  • weekly seasonality is clearly presented in The ACF and spectrum plots

  • Large spikes in the ACF lags 7, 14, and 21

  • Large spikes in the spectrum plot at frequencies 1, 2, and 3

  • Seasonal ARIMA model is suggested by the plot

let's display the differencing with lag 7 for autocorrelation

ggtsdisplay(diff(atm1.ts,7), points = FALSE,  main = "Differenced (lag-7) withdrawals from ATM1", xlab = "Week", ylab = "Cash in hundreds")

Clearly, there is no non-seasonal differencing.

Let’s create model with Box-Cox, specify ARIMA parameters and extract AIC

atm1.lambda<-BoxCox.lambda(atm1.ts)
atm1.lambda
## [1] 0.2081476

\(\lambda\) = 0.2081476

function to create models & return AIC values for time series

atm.aic<-function(p, d, q, P, D, Q) {
  AIC(Arima(atm1.ts, order = c(p, d, q), seasonal = c(P, D, Q), lambda = atm1.lambda))
}

Different combinations of p, q, P, Q:

expand.grid(p = 0:1, q = 0:1, P = 0:1, Q = 0:1) %>%
  filter(p > 0 | q > 0 | P > 0 | Q > 0) %>% 
  mutate(aic = pmap_dbl(list(p, 0, q, P, 1, Q), atm.aic)) %>% 
  slice(which.min(aic))
##   p q P Q      aic
## 1 1 1 0 1 1096.583

The minimum AIC value is for non-seasonal AR(1) & MA(1) and seasonal AR(0) & MA(1).

The model used is ARIMA(1,0,1)(0,1,1)

atm1.fit<-Arima(atm1.ts, order = c(1, 0, 1), seasonal = c(0, 1, 1), lambda = atm1.lambda)
atm1.fit
## Series: atm1.ts 
## ARIMA(1,0,1)(0,1,1)[7] 
## Box Cox transformation: lambda= 0.2081476 
## 
## Coefficients:
##           ar1     ma1     sma1
##       -0.5040  0.6201  -0.6438
## s.e.   0.2417  0.2183   0.0427
## 
## sigma^2 estimated as 1.252:  log likelihood=-544.29
## AIC=1096.58   AICc=1096.7   BIC=1112.11

Let’s check on residuals using the Ljung-Box test

Box.test(resid(atm1.fit), type = "L", fitdf = 3, lag = 7)
## 
##  Box-Ljung test
## 
## data:  resid(atm1.fit)
## X-squared = 8.0185, df = 4, p-value = 0.0909
ggtsdisplay(resid(atm1.fit), points = FALSE, plot.type = "histogram", main = "Residuals for ARIMA(1,0,1)(0,1,1) fit of ATM1 withdrawals", xlab = "Week", ylab = "Residual")

p-value (=0.09) is greater than 0.05, therefore the residuals may have white noise.

The residuals are normally distrubuted around a mean of 0 (zero)

Therefore we accept this model and can be used in forecasting.

ATM2 Time Series

Let’s do the same oparations for ATM2

atm2<-atm.ts[, "ATM2"]
atm2.ts<-ts(atm2, frequency = 7)
atm2.ts[which(is.na(atm2.ts))]<-median(atm2.ts, na.rm = TRUE)

ggtsdisplay(atm2.ts, points = FALSE, main = "Withdrawals from ATM2", xlab = "Week", ylab = "Cash in hundreds")

We also notice presence of seasonality in ATM2 which is differenced with lag = 7

ggtsdisplay(diff(atm2.ts, 7), points = FALSE, main = "Differenced (lag-7) withdrawals from ATM2", xlab = "Week", ylab = "Cash in hundreds")

  • The spikes in ACF & PACF in the non-differenced series at k=2 & k=5

  • Large spike at k=7 suggests D=1, while the stationary nature of the timeseries suggests d=0

Will do the same thing as before; that is create model with Box-Cox, specify ARIMA parameters and extract AIC

atm2.lambda<-BoxCox.lambda(atm2.ts)
atm2.lambda
## [1] 0.7278107
atm2.aic<-function(p, d, q, P, D, Q) {
  AIC(Arima(atm2.ts, order = c(p, d, q), seasonal = c(P, D, Q), lambda = atm2.lambda))
}

expand.grid(p = c(0, 2, 5), q = c(0, 2, 5), P = 0:1, Q = 0:1) %>%
  filter(p > 0 | q > 0 | P > 0 | Q > 0) %>% 
  mutate(aic = pmap_dbl(list(p, 0, q, P, 1, Q), atm2.aic)) %>% 
  slice(which.min(aic))
##   p q P Q      aic
## 1 5 5 0 1 2544.791
  • The minimum AIC value is for non-seasonal AR(5) & MA(5) and seasonal AR(0) & MA(1).

  • ARIMA(5,0,5)(0,1,1) is the model used

atm2.fit<-Arima(atm2.ts, order = c(5, 0, 5), seasonal = c(0, 1, 1), lambda = atm2.lambda)
Box.test(resid(atm2.fit), type = "L", fitdf = 11, lag = 14)
## 
##  Box-Ljung test
## 
## data:  resid(atm2.fit)
## X-squared = 4.696, df = 3, p-value = 0.1955
ggtsdisplay(resid(atm2.fit), points = FALSE, plot.type = "histogram", main = "Residuals for ARIMA(5,0,5)(0,1,1) of ATM 2 withdrawals", xlab = "Week", ylab = "Residual")

p-value is greater than 0.05, therefore the residuals may be white noise as suggested by Ljung-Box test.

The residuals are normally distrubuted around a mean of 0 (zero)

Therefore we accept this model and can be used in forecasting.

ATM3 Time Series

again, let’s do the same operations as before.

atm3<-atm.ts[(nrow(atm.ts) - 2):nrow(atm.ts), "ATM3"]
atm3.ts<-ts(atm3, start = 363)
atm3.ts<-atm.ts[, "ATM3"]
atm3.ts[which(atm3.ts == 0)]<-NA
atm3.ts
## Time Series:
## Start = 1 
## End = 365 
## Frequency = 1 
##   [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
##  [26] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
##  [51] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
##  [76] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [101] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [126] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [151] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [176] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [201] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [226] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [251] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [276] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [301] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [326] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [351] NA NA NA NA NA NA NA NA NA NA NA NA 96 82 85
#let's summarize to see the mean
summary(atm3.ts)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   82.00   83.50   85.00   87.67   90.50   96.00     362

Most of the values are NA. ATM 3 has only 3 transactions and we’ll use simple mean function to forecast for ATM 3

ATM4 Time Series

again, let’s do the same operations.

atm4<-atm.ts[, "ATM4"]
atm4[which.max(atm4)]<-median(atm4, na.rm = TRUE)
atm4.ts<-ts(atm4, frequency = 7)

summary(atm4.ts)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    1.563  124.334  403.839  445.233  704.192 1712.075
ggtsdisplay(atm4.ts, points = FALSE, main = "Withdrawals from ATM4", xlab = "Week", ylab = "Cash in hundreds")

Notice the weekly seasonality and and differenced with lag=7

ggtsdisplay(diff(atm4.ts, 7), points = FALSE, main = "Differenced (lag-7) withdrawals from ATM4", xlab = "Week", ylab = "Cash in hundreds")

we notice stationary time series with a large spike at k=7 suggesting D=1 and d=0

Will do the same thing as before; that is create model with Box-Cox, specify ARIMA parameters and extract AIC

atm4.lambda<-BoxCox.lambda(atm4.ts)
atm4.lambda
## [1] 0.4525697
atm4.aic<-function(p, d, q, P, D, Q) {
  AIC(Arima(atm4.ts, order = c(p, d, q), seasonal = c(P, D, Q), lambda = atm4.lambda))
}

#creating possible combinations of p, q, P, Q except all zeros
expand.grid(p = c(0, 2, 5), q = c(0, 2, 5), P = 0:1, Q = 0:1) %>%
  filter(p > 0 | q > 0 | P > 0 | Q > 0) %>% 
  mutate(aic = pmap_dbl(list(p, 0, q, P, 1, Q), atm4.aic)) %>% 
  slice(which.min(aic))
##   p q P Q      aic
## 1 5 2 0 1 2863.061
  • The minimum AIC value is for non-seasonal AR(0) & MA(2) and seasonal AR(0) & MA(1).

  • ARIMA(0,0,2)(0,1,1) will be the model used here

atm4.fit<-Arima(atm4, order = c(0, 0, 2), seasonal = c(0, 1, 1), lambda = atm4.lambda)

Box.test(resid(atm4.fit), type = "L", fitdf = 3, lag = 7)
## 
##  Box-Ljung test
## 
## data:  resid(atm4.fit)
## X-squared = 27.674, df = 4, p-value = 1.452e-05
ggtsdisplay(resid(atm4.fit), points = FALSE, plot.type = "histogram", main = "Residuals for ARIMA(0,0,2)(0,1,1) of ATM4 withdrawals", xlab = "Week", ylab = "Residual")

p-value is greater than 0.05, therefore the residuals may be white noise as suggested by Ljung-Box test.

The residuals are normally distrubuted around a mean of 0 (zero)

Therefore we accept this model and can be used in forecasting.

ATM Forcasting

atm1_forecast<-forecast(atm1.fit, 31, level = 95)
atm2_forecast<-forecast(atm2.fit, 31, level = 95)
atm3_forecast<-meanf(atm3.ts, 31, level = 95)
atm4_forecast<-forecast(atm4.fit, 31, level = 95)
gridExtra::grid.arrange(
  autoplot(atm1_forecast) + 
    labs(title = "ATM1: ARIMA(1,0,1)(0,1,1)", x = "Week", y = NULL) +
    theme(legend.position = "none"),
  autoplot(atm2_forecast) + 
    labs(title = "ATM2: ARIMA(5,0,5)(0,1,1)", x = "Week", y = NULL) +
    theme(legend.position = "none"),
  autoplot(atm3_forecast) + 
    labs(title = "ATM3: mean", x = "Day", y = NULL) +
    theme(legend.position = "none"),
  autoplot(atm4_forecast) + 
    labs(title = "ATM4: ARIMA(0,0,2)(0,1,1)", x = "Week", y = NULL) +
    theme(legend.position = "none"),
  top = grid::textGrob("Forecasted ATM withdrawals (in hundreds of dollars) for May 2010\n")
)

  • The forecast shows ATM 3 has a single mean value forecast

  • The forecast shows ATMs 1, 2, and 4 have seasonality

The forcasted values are saved in a sv file like so:

data_frame(DATE = rep(max(tidy.atm.df$DATE) + 1:31, 4), ATM = rep(names(tidy.atm.df)[-1], each = 31), Cash = c(atm1_forecast$mean, atm2_forecast$mean, atm3_forecast$mean, atm4_forecast$mean)) %>%
  write_csv("Forecast_Project1_Abdelmalek_ATM624Data_PartA_Output.csv")
## Warning: `data_frame()` is deprecated as of tibble 1.1.0.
## Please use `tibble()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.

Part B – Forecasting Power

Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in Kilowatt hours, the rest is straight forward. Add this to your existing files above.

power.df<-read_excel("ResidentialCustomerForecastLoad-624.xlsx")

kable(head(power.df))
CaseSequence YYYY-MMM KWH
733 1998-Jan 6862583
734 1998-Feb 5838198
735 1998-Mar 5420658
736 1998-Apr 5010364
737 1998-May 4665377
738 1998-Jun 6467147
summary(power.df)
##   CaseSequence     YYYY-MMM              KWH          
##  Min.   :733.0   Length:192         Min.   :  770523  
##  1st Qu.:780.8   Class :character   1st Qu.: 5429912  
##  Median :828.5   Mode  :character   Median : 6283324  
##  Mean   :828.5                      Mean   : 6502475  
##  3rd Qu.:876.2                      3rd Qu.: 7620524  
##  Max.   :924.0                      Max.   :10655730  
##                                     NA's   :1

according to the summary above, there is only 1 missing value. So I’ll just remove that row.

power.df <- power.df %>% filter(!is.na(KWH))
dim(power.df)
## [1] 191   3

At this point, the data is good and we can start creating the time series

power.ts<-ts(power.df[, "KWH"], start = c(1998, 1), frequency = 12)
head(power.ts)
##          Jan     Feb     Mar     Apr     May     Jun
## 1998 6862583 5838198 5420658 5010364 4665377 6467147
summary(power.ts)
##       KWH          
##  Min.   :  770523  
##  1st Qu.: 5429912  
##  Median : 6283324  
##  Mean   : 6502475  
##  3rd Qu.: 7620524  
##  Max.   :10655730
autoplot(power.ts, main = "Monthly KWH Readings") + xlab("Month") + ylab("Kilowatts used per hour (KWH)") 

We notice:

  • Presence of seasonality in the time series data

  • July 2010 shows an unusual very low reading; probably due to power outage during summer time.I will assume this is a correct reading and I will not chnage the value for that month.

  • A slight upward trend after July 2010

Let’s use ggseasonal to explore any hidden details

ggseasonplot(power.ts) + ggtitle("Killowatt hour usage by year/month") + xlab("Month") + ylab("Kilowatts used per hour (KWH)")

It looks like we almost have a sin (sinusoidal) function with a period of 1 year, revealing the pattern of the following years.

ndiffs(power.ts)
## [1] 1
ggtsdisplay(power.ts, points = FALSE, plot.type = "histogram")

Clearly seasonality and histogram are not normal. Therefore we may need to use Box-Cox for transformation.

power.ts.lambda <- BoxCox.lambda(power.ts)
power.ts.lambda
## [1] 0.06361748

First, let’s see using Box-Cox to normalize the the histogram, what will happen

power.ts.box <- power.ts %>% BoxCox(lambda = 0.1)  
ggtsdisplay(power.ts.box, points = FALSE, plot.type = "histogram")

Yes the histogram is slightly normalized, but it is not quite significant. Therefore, Box-Cox transformation may not be appropriate here.

power.ts.fit <- auto.arima(power.ts)
checkresiduals(power.ts.fit)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,0,1)(0,1,2)[12] with drift
## Q* = 17.494, df = 20, p-value = 0.6207
## 
## Model df: 4.   Total lags used: 24

Well Well Well!! It looks like the Residual histogram of the non-Box-Coxed series is more normal. So we will not consider Box-Box. We just discard it.

Instead, we will use auti.arima for our prediction.

power.ts.forecast <- forecast(power.ts.fit, 12, level = 95)
autoplot(power.ts.forecast) + labs(title = "KWH:  ARIMA(0, 0, 1)(0, 1, 2)", x = "Week", y = NULL) + theme(legend.position = "none")

Let’s write the monthly forecasts of 2014 to excel

mSeq <- max(power.df$CaseSequence)
power.dates <- format(seq(ymd("2014-01-01"), ymd("2014-12-31"), by = "1 month"), "%Y-%b")
power.ts.forecast.df <- data.frame("CaseSequence" = c(mSeq + 1, mSeq + 2, mSeq + 3, mSeq + 4, mSeq + 5, mSeq + 6, mSeq + 7, mSeq + 8, mSeq + 9, mSeq + 10, mSeq + 11, mSeq + 12),`YYYY-MMM` = power.dates, "KWH" = c(round(power.ts.forecast$mean)))
write_csv(power.ts.forecast.df, "Forecast_Project1_Abdelmalek_ResidentialCustomerData_PartB_Output.csv")

Part C – Waterflow

Part C consists of two data sets. These are simple 2 columns sets, however they have different time stamps. Your optional assignment is to time-base sequence the data and aggregate based on hour (example of what this looks like, follows). Note for multiple recordings within an hour, take the mean. Then to determine if the data is stationary and can it be forecast. If so, provide a week forward forecast and present results via Rpubs and .rmd and the forecast in an Excel readable file.

waterflow1 <- read_excel('Waterflow_Pipe1.xlsx', col_types =c("date", "numeric"))
waterflow2 <- read_excel('Waterflow_Pipe2.xlsx', col_types =c("date", "numeric"))

kable(head(waterflow1))
Date Time WaterFlow
2015-10-23 00:24:06 23.369599
2015-10-23 00:40:02 28.002881
2015-10-23 00:53:51 23.065895
2015-10-23 00:55:40 29.972809
2015-10-23 01:19:17 5.997953
2015-10-23 01:23:58 15.935223
kable(head(waterflow2))
Date Time WaterFlow
2015-10-23 01:00:00 18.81079
2015-10-23 02:00:00 43.08703
2015-10-23 03:00:00 37.98770
2015-10-23 04:00:00 36.12038
2015-10-23 05:00:00 31.85126
2015-10-23 06:00:00 28.23809

Let’s fix the column names for consistency

colnames(waterflow1)= c("DateTime","WaterFlow")
colnames(waterflow2)= c("DateTime","WaterFlow")

Looking at the 2 readings, we notice that the first pipe is reading more than once within the hour, while the second pipe is reading once every hour. We need to normalize both pipes by converting the first pipe into Hourly.

wpc1 <-waterflow1 %>% 
  mutate(Date = date(DateTime),
         Hour = hour(DateTime) + 1) %>% 
  group_by(Date, Hour) %>% 
  summarize(WaterFlow = mean(WaterFlow)) %>% 
  ungroup() %>%
  mutate(DateTime = ymd_h(paste(Date, Hour))) %>% 
  select(DateTime, WaterFlow)

kable(head(wpc1))
DateTime WaterFlow
2015-10-23 01:00:00 26.10280
2015-10-23 02:00:00 18.85202
2015-10-23 03:00:00 15.15857
2015-10-23 04:00:00 23.07886
2015-10-23 05:00:00 15.48219
2015-10-23 06:00:00 22.72539

We now use full join by adding waterflows through both of the pipes, so each data in each pipe is an hour of waterflow.

water.data<-full_join(wpc1, waterflow2, by = "DateTime", suffix = c("_1", "_2")) %>% 
  mutate(WaterFlow_1 = ifelse(is.na(WaterFlow_1), 0, WaterFlow_1)) %>% 
  mutate(WaterFlow = WaterFlow_1 + WaterFlow_2) %>% 
  select(DateTime, WaterFlow)

kable(head(water.data, 10))
DateTime WaterFlow
2015-10-23 01:00:00 44.91359
2015-10-23 02:00:00 61.93904
2015-10-23 03:00:00 53.14628
2015-10-23 04:00:00 59.19923
2015-10-23 05:00:00 47.33345
2015-10-23 06:00:00 50.96348
2015-10-23 07:00:00 30.45346
2015-10-23 08:00:00 45.04664
2015-10-23 09:00:00 76.56548
2015-10-23 10:00:00 75.37564

Let’s create our time series

waterflow.ts<-ts(water.data$WaterFlow, frequency = 24)
summary(waterflow.ts)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.885  31.253  43.558  44.250  56.950 106.499
dim(water.data)
## [1] 1000    2
autoplot(waterflow.ts) +
  labs(title = "Two Pipelines Hourly Water Flow", x = "Day", y = "Total Waterflow")

We see variability over the period of time. So because of Variance, we will be using BoxCox.

waterflow.lambda<-BoxCox.lambda(waterflow.ts)
waterflow.trans<-BoxCox(waterflow.ts, waterflow.lambda)
ggtsdisplay(diff(waterflow.trans), points = FALSE, main = "Transformed WaterFlow Differenced Box-Cox")

Our time series looks stationary, but we do see spikes in the ACF and PACF at k=1, which means non-seasonal AR(1) and MA(1) components.

waterflow.fit<-Arima(waterflow.ts, order = c(1, 1, 1), lambda = waterflow.lambda)

Box.test(resid(waterflow.fit), type = "L")
## 
##  Box-Ljung test
## 
## data:  resid(waterflow.fit)
## X-squared = 2.0096e-05, df = 1, p-value = 0.9964
  • The Ljung-Box test returns a p value >0.05
  • The residuals appear to be normally distributed.
  • The model is acceptable and will be used for forecasting.
ggtsdisplay(resid(waterflow.fit), points = FALSE, plot.type = "histogram", main = "WaterFlow Residuals for ARIMA(1,1,1)")

Forecasting using ARIMA(1,1,1):

waterflow.forecast<-forecast(waterflow.fit, 168, level = 95)

autoplot(waterflow.forecast) + 
    labs(title = "WaterFlow Forecasted", subtitle = "Using ARIMA(1,1,1)", x = "Day", y = "Total WaterFlow") +
    theme(legend.position = "none")

Seems the modal forecast lacks seasonality.

Let’s write the forecast to a file

data_frame(DateTime = max(water.data$DateTime) + 
  hours(1:168), WaterFlow = waterflow.forecast$mean) %>% 
  write_csv("Forcast_Project1_Abdelmalek_WaterFlow_PartC_Output.csv")