This project consists of 3 parts - two required and one bonus and is worth 15% of your grade. The project is due at 11:59 PM on Sunday March 31. I will accept late submissions with a penalty until the meetup after that when we review some projects.

library(tidyverse)
library(scales)
library(lubridate)
library(knitr)
library(fpp2)
library(readxl)
library(forecast)

Part A – ATM Forecast

In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars, other than that it is straight forward. I am being somewhat ambiguous on purpose to make this have a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an excel file, please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual.rmd file Also please submit the forecast which you will put in an Excel readable file.

atm.df<-read_excel("ATM624Data.xlsx")
kable(head(atm.df))

DATE	ATM	Cash
39934	ATM1	96
39934	ATM2	107
39935	ATM1	82
39935	ATM2	89
39936	ATM1	85
39936	ATM2	90

kable(summary(atm.df))

DATE	ATM	Cash
Min. :39934	Length:1474	Min. : 0.0
1st Qu.:40026	Class :character	1st Qu.: 0.5
Median :40118	Mode :character	Median : 73.0
Mean :40118	NA	Mean : 155.6
3rd Qu.:40210	NA	3rd Qu.: 114.0
Max. :40312	NA	Max. :10919.8
NA	NA	NA’s :19

#check Date type
class(atm.df$DATE)

## [1] "numeric"

Clearly, NAs do exist, and Date is numeric. So let’s drop the NAs and do the clean up while we are at it, and not forgetting to convert Date column to Date type that time series can understand.

tidy.atm.df<-atm.df %>%
  drop_na() %>%
  spread(ATM, Cash) %>% 
  mutate(DATE = as.Date(DATE, origin = "1899-12-30")) 

kable(head(tidy.atm.df, 10))

DATE	ATM1	ATM2	ATM4
2009-05-01	96	107	776.99342
2009-05-02	82	89	524.41796
2009-05-03	85	90	792.81136
2009-05-04	90	55	908.23846
2009-05-05	99	79	52.83210
2009-05-06	88	19	52.20845
2009-05-07	8	2	55.47361
2009-05-08	104	103	558.50325
2009-05-09	87	107	904.34136
2009-05-10	93	118	879.49359

#Check Date type now:
class(tidy.atm.df$DATE)

## [1] "Date"

We see all data for all ATMs. Now, let’s create our Time Series from this data. While at it, let’s produce the summary as well.

atm.ts<-ts(tidy.atm.df %>% 
             select(-DATE))

head(atm.ts)

## Time Series:
## Start = 1 
## End = 6 
## Frequency = 1 
##   ATM1 ATM2 ATM3      ATM4
## 1   96  107    0 776.99342
## 2   82   89    0 524.41796
## 3   85   90    0 792.81136
## 4   90   55    0 908.23846
## 5   99   79    0  52.83210
## 6   88   19    0  52.20845

Plotting our Time Series for each ATM:

autoplot(atm.ts) +
  labs(title = "Cash Withdrawn From the 4 ATMS", x = "Day") +
  scale_y_continuous("Cash withdrawn in hundreds", labels = dollar) +
  scale_color_discrete(NULL) +
  theme(legend.position = c(0.1, 0.8))

Let’s bring back the Date and tidy up the data more:

tidy.atm.df %>% gather(atm, Cash, -DATE) %>% 
  ggplot(aes(x = DATE, y = Cash, col = atm)) +
  geom_line(show.legend = FALSE) +
  facet_wrap(~ atm, ncol = 1, scales = "free_y") +
  labs(title = "Cash Withdrawn From the 4 ATMs", x = "Date") +
  scale_y_continuous("Cash withdrawn in hundreds", labels = dollar)

Clearly, While ATM1, ATM2 and ATM3 ha ve a far amount of transactions, ATM3 has the least number of withdrawals.

autoplot(ts(atm.ts[1:61, ])) +
  labs(title = "Cash Withdrawn From 4 ATMs", x = "Day") +
  scale_y_continuous("Cash Withdrawn in hundreds", labels = dollar)

From above, clearly weekly seasonality is present in data.
Therefore we set our frequency to 7.

ATM1 Time Series

Set the frequency to 7 and plot the ACF/PACF

atm1<-atm.ts[, "ATM1"]
atm1.ts<-ts(atm1, frequency = 7)

ggtsdisplay(atm1.ts, points = FALSE, main = "Withdrawals from ATM1", xlab = "Week", ylab = "Cash in hundreds")

From the above plot, we notice:

weekly seasonality is clearly presented in The ACF and spectrum plots
Large spikes in the ACF lags 7, 14, and 21
Large spikes in the spectrum plot at frequencies 1, 2, and 3
Seasonal ARIMA model is suggested by the plot

let's display the differencing with lag 7 for autocorrelation

ggtsdisplay(diff(atm1.ts,7), points = FALSE,  main = "Differenced (lag-7) withdrawals from ATM1", xlab = "Week", ylab = "Cash in hundreds")

Clearly, there is no non-seasonal differencing.

Let’s create model with Box-Cox, specify ARIMA parameters and extract AIC

atm1.lambda<-BoxCox.lambda(atm1.ts)
atm1.lambda

## [1] 0.2081476

\(\lambda\) = 0.2081476

function to create models & return AIC values for time series

atm.aic<-function(p, d, q, P, D, Q) {
  AIC(Arima(atm1.ts, order = c(p, d, q), seasonal = c(P, D, Q), lambda = atm1.lambda))
}

Different combinations of p, q, P, Q:

expand.grid(p = 0:1, q = 0:1, P = 0:1, Q = 0:1) %>%
  filter(p > 0 | q > 0 | P > 0 | Q > 0) %>% 
  mutate(aic = pmap_dbl(list(p, 0, q, P, 1, Q), atm.aic)) %>% 
  slice(which.min(aic))

##   p q P Q      aic
## 1 1 1 0 1 1096.583

The minimum AIC value is for non-seasonal AR(1) & MA(1) and seasonal AR(0) & MA(1).

The model used is ARIMA(1,0,1)(0,1,1)

atm1.fit<-Arima(atm1.ts, order = c(1, 0, 1), seasonal = c(0, 1, 1), lambda = atm1.lambda)
atm1.fit

## Series: atm1.ts 
## ARIMA(1,0,1)(0,1,1)[7] 
## Box Cox transformation: lambda= 0.2081476 
## 
## Coefficients:
##           ar1     ma1     sma1
##       -0.5040  0.6201  -0.6438
## s.e.   0.2417  0.2183   0.0427
## 
## sigma^2 estimated as 1.252:  log likelihood=-544.29
## AIC=1096.58   AICc=1096.7   BIC=1112.11

Let’s check on residuals using the Ljung-Box test

Box.test(resid(atm1.fit), type = "L", fitdf = 3, lag = 7)

## 
##  Box-Ljung test
## 
## data:  resid(atm1.fit)
## X-squared = 8.0185, df = 4, p-value = 0.0909

ggtsdisplay(resid(atm1.fit), points = FALSE, plot.type = "histogram", main = "Residuals for ARIMA(1,0,1)(0,1,1) fit of ATM1 withdrawals", xlab = "Week", ylab = "Residual")

p-value (=0.09) is greater than 0.05, therefore the residuals may have white noise.

The residuals are normally distrubuted around a mean of 0 (zero)

Therefore we accept this model and can be used in forecasting.

ATM2 Time Series

Let’s do the same oparations for ATM2

atm2<-atm.ts[, "ATM2"]
atm2.ts<-ts(atm2, frequency = 7)
atm2.ts[which(is.na(atm2.ts))]<-median(atm2.ts, na.rm = TRUE)

ggtsdisplay(atm2.ts, points = FALSE, main = "Withdrawals from ATM2", xlab = "Week", ylab = "Cash in hundreds")

We also notice presence of seasonality in ATM2 which is differenced with lag = 7

ggtsdisplay(diff(atm2.ts, 7), points = FALSE, main = "Differenced (lag-7) withdrawals from ATM2", xlab = "Week", ylab = "Cash in hundreds")

The spikes in ACF & PACF in the non-differenced series at k=2 & k=5
Large spike at k=7 suggests D=1, while the stationary nature of the timeseries suggests d=0

Will do the same thing as before; that is create model with Box-Cox, specify ARIMA parameters and extract AIC

atm2.lambda<-BoxCox.lambda(atm2.ts)
atm2.lambda

## [1] 0.7278107

atm2.aic<-function(p, d, q, P, D, Q) {
  AIC(Arima(atm2.ts, order = c(p, d, q), seasonal = c(P, D, Q), lambda = atm2.lambda))
}

expand.grid(p = c(0, 2, 5), q = c(0, 2, 5), P = 0:1, Q = 0:1) %>%
  filter(p > 0 | q > 0 | P > 0 | Q > 0) %>% 
  mutate(aic = pmap_dbl(list(p, 0, q, P, 1, Q), atm2.aic)) %>% 
  slice(which.min(aic))

##   p q P Q      aic
## 1 5 5 0 1 2544.791

The minimum AIC value is for non-seasonal AR(5) & MA(5) and seasonal AR(0) & MA(1).
ARIMA(5,0,5)(0,1,1) is the model used

atm2.fit<-Arima(atm2.ts, order = c(5, 0, 5), seasonal = c(0, 1, 1), lambda = atm2.lambda)
Box.test(resid(atm2.fit), type = "L", fitdf = 11, lag = 14)

## 
##  Box-Ljung test
## 
## data:  resid(atm2.fit)
## X-squared = 4.696, df = 3, p-value = 0.1955

ggtsdisplay(resid(atm2.fit), points = FALSE, plot.type = "histogram", main = "Residuals for ARIMA(5,0,5)(0,1,1) of ATM 2 withdrawals", xlab = "Week", ylab = "Residual")

p-value is greater than 0.05, therefore the residuals may be white noise as suggested by Ljung-Box test.

The residuals are normally distrubuted around a mean of 0 (zero)

Therefore we accept this model and can be used in forecasting.

ATM3 Time Series

again, let’s do the same operations as before.

atm3<-atm.ts[(nrow(atm.ts) - 2):nrow(atm.ts), "ATM3"]
atm3.ts<-ts(atm3, start = 363)
atm3.ts<-atm.ts[, "ATM3"]
atm3.ts[which(atm3.ts == 0)]<-NA
atm3.ts

## Time Series:
## Start = 1 
## End = 365 
## Frequency = 1 
##   [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
##  [26] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
##  [51] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
##  [76] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [101] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [126] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [151] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [176] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [201] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [226] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [251] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [276] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [301] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [326] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [351] NA NA NA NA NA NA NA NA NA NA NA NA 96 82 85

#let's summarize to see the mean
summary(atm3.ts)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   82.00   83.50   85.00   87.67   90.50   96.00     362

Most of the values are NA. ATM 3 has only 3 transactions and we’ll use simple mean function to forecast for ATM 3

ATM4 Time Series

again, let’s do the same operations.

atm4<-atm.ts[, "ATM4"]
atm4[which.max(atm4)]<-median(atm4, na.rm = TRUE)
atm4.ts<-ts(atm4, frequency = 7)

summary(atm4.ts)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    1.563  124.334  403.839  445.233  704.192 1712.075

ggtsdisplay(atm4.ts, points = FALSE, main = "Withdrawals from ATM4", xlab = "Week", ylab = "Cash in hundreds")

Notice the weekly seasonality and and differenced with lag=7

ggtsdisplay(diff(atm4.ts, 7), points = FALSE, main = "Differenced (lag-7) withdrawals from ATM4", xlab = "Week", ylab = "Cash in hundreds")

we notice stationary time series with a large spike at k=7 suggesting D=1 and d=0

Will do the same thing as before; that is create model with Box-Cox, specify ARIMA parameters and extract AIC

atm4.lambda<-BoxCox.lambda(atm4.ts)
atm4.lambda

## [1] 0.4525697

atm4.aic<-function(p, d, q, P, D, Q) {
  AIC(Arima(atm4.ts, order = c(p, d, q), seasonal = c(P, D, Q), lambda = atm4.lambda))
}

#creating possible combinations of p, q, P, Q except all zeros
expand.grid(p = c(0, 2, 5), q = c(0, 2, 5), P = 0:1, Q = 0:1) %>%
  filter(p > 0 | q > 0 | P > 0 | Q > 0) %>% 
  mutate(aic = pmap_dbl(list(p, 0, q, P, 1, Q), atm4.aic)) %>% 
  slice(which.min(aic))

##   p q P Q      aic
## 1 5 2 0 1 2863.061

The minimum AIC value is for non-seasonal AR(0) & MA(2) and seasonal AR(0) & MA(1).
ARIMA(0,0,2)(0,1,1) will be the model used here

atm4.fit<-Arima(atm4, order = c(0, 0, 2), seasonal = c(0, 1, 1), lambda = atm4.lambda)

Box.test(resid(atm4.fit), type = "L", fitdf = 3, lag = 7)

## 
##  Box-Ljung test
## 
## data:  resid(atm4.fit)
## X-squared = 27.674, df = 4, p-value = 1.452e-05

ggtsdisplay(resid(atm4.fit), points = FALSE, plot.type = "histogram", main = "Residuals for ARIMA(0,0,2)(0,1,1) of ATM4 withdrawals", xlab = "Week", ylab = "Residual")

p-value is greater than 0.05, therefore the residuals may be white noise as suggested by Ljung-Box test.

The residuals are normally distrubuted around a mean of 0 (zero)

Therefore we accept this model and can be used in forecasting.

ATM Forcasting

atm1_forecast<-forecast(atm1.fit, 31, level = 95)
atm2_forecast<-forecast(atm2.fit, 31, level = 95)
atm3_forecast<-meanf(atm3.ts, 31, level = 95)
atm4_forecast<-forecast(atm4.fit, 31, level = 95)

gridExtra::grid.arrange(
  autoplot(atm1_forecast) + 
    labs(title = "ATM1: ARIMA(1,0,1)(0,1,1)", x = "Week", y = NULL) +
    theme(legend.position = "none"),
  autoplot(atm2_forecast) + 
    labs(title = "ATM2: ARIMA(5,0,5)(0,1,1)", x = "Week", y = NULL) +
    theme(legend.position = "none"),
  autoplot(atm3_forecast) + 
    labs(title = "ATM3: mean", x = "Day", y = NULL) +
    theme(legend.position = "none"),
  autoplot(atm4_forecast) + 
    labs(title = "ATM4: ARIMA(0,0,2)(0,1,1)", x = "Week", y = NULL) +
    theme(legend.position = "none"),
  top = grid::textGrob("Forecasted ATM withdrawals (in hundreds of dollars) for May 2010\n")
)

The forecast shows ATM 3 has a single mean value forecast
The forecast shows ATMs 1, 2, and 4 have seasonality

The forcasted values are saved in a sv file like so:

data_frame(DATE = rep(max(tidy.atm.df$DATE) + 1:31, 4), ATM = rep(names(tidy.atm.df)[-1], each = 31), Cash = c(atm1_forecast$mean, atm2_forecast$mean, atm3_forecast$mean, atm4_forecast$mean)) %>%
  write_csv("Forecast_Project1_Abdelmalek_ATM624Data_PartA_Output.csv")

## Warning: `data_frame()` is deprecated as of tibble 1.1.0.
## Please use `tibble()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.

Part B – Forecasting Power

Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in Kilowatt hours, the rest is straight forward. Add this to your existing files above.

power.df<-read_excel("ResidentialCustomerForecastLoad-624.xlsx")

kable(head(power.df))

CaseSequence	YYYY-MMM	KWH
733	1998-Jan	6862583
734	1998-Feb	5838198
735	1998-Mar	5420658
736	1998-Apr	5010364
737	1998-May	4665377
738	1998-Jun	6467147

summary(power.df)

##   CaseSequence     YYYY-MMM              KWH          
##  Min.   :733.0   Length:192         Min.   :  770523  
##  1st Qu.:780.8   Class :character   1st Qu.: 5429912  
##  Median :828.5   Mode  :character   Median : 6283324  
##  Mean   :828.5                      Mean   : 6502475  
##  3rd Qu.:876.2                      3rd Qu.: 7620524  
##  Max.   :924.0                      Max.   :10655730  
##                                     NA's   :1

according to the summary above, there is only 1 missing value. So I’ll just remove that row.

power.df <- power.df %>% filter(!is.na(KWH))
dim(power.df)

## [1] 191   3

At this point, the data is good and we can start creating the time series

power.ts<-ts(power.df[, "KWH"], start = c(1998, 1), frequency = 12)
head(power.ts)

##          Jan     Feb     Mar     Apr     May     Jun
## 1998 6862583 5838198 5420658 5010364 4665377 6467147

summary(power.ts)

##       KWH          
##  Min.   :  770523  
##  1st Qu.: 5429912  
##  Median : 6283324  
##  Mean   : 6502475  
##  3rd Qu.: 7620524  
##  Max.   :10655730

autoplot(power.ts, main = "Monthly KWH Readings") + xlab("Month") + ylab("Kilowatts used per hour (KWH)")

We notice:

Presence of seasonality in the time series data
July 2010 shows an unusual very low reading; probably due to power outage during summer time.I will assume this is a correct reading and I will not chnage the value for that month.
A slight upward trend after July 2010

Let’s use ggseasonal to explore any hidden details

ggseasonplot(power.ts) + ggtitle("Killowatt hour usage by year/month") + xlab("Month") + ylab("Kilowatts used per hour (KWH)")

It looks like we almost have a sin (sinusoidal) function with a period of 1 year, revealing the pattern of the following years.

ndiffs(power.ts)

## [1] 1

ggtsdisplay(power.ts, points = FALSE, plot.type = "histogram")

Clearly seasonality and histogram are not normal. Therefore we may need to use Box-Cox for transformation.

power.ts.lambda <- BoxCox.lambda(power.ts)
power.ts.lambda

## [1] 0.06361748

First, let’s see using Box-Cox to normalize the the histogram, what will happen

power.ts.box <- power.ts %>% BoxCox(lambda = 0.1)  
ggtsdisplay(power.ts.box, points = FALSE, plot.type = "histogram")

Yes the histogram is slightly normalized, but it is not quite significant. Therefore, Box-Cox transformation may not be appropriate here.

power.ts.fit <- auto.arima(power.ts)
checkresiduals(power.ts.fit)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,0,1)(0,1,2)[12] with drift
## Q* = 17.494, df = 20, p-value = 0.6207
## 
## Model df: 4.   Total lags used: 24

Well Well Well!! It looks like the Residual histogram of the non-Box-Coxed series is more normal. So we will not consider Box-Box. We just discard it.

Instead, we will use auti.arima for our prediction.

power.ts.forecast <- forecast(power.ts.fit, 12, level = 95)
autoplot(power.ts.forecast) + labs(title = "KWH:  ARIMA(0, 0, 1)(0, 1, 2)", x = "Week", y = NULL) + theme(legend.position = "none")

Let’s write the monthly forecasts of 2014 to excel

mSeq <- max(power.df$CaseSequence)
power.dates <- format(seq(ymd("2014-01-01"), ymd("2014-12-31"), by = "1 month"), "%Y-%b")
power.ts.forecast.df <- data.frame("CaseSequence" = c(mSeq + 1, mSeq + 2, mSeq + 3, mSeq + 4, mSeq + 5, mSeq + 6, mSeq + 7, mSeq + 8, mSeq + 9, mSeq + 10, mSeq + 11, mSeq + 12),`YYYY-MMM` = power.dates, "KWH" = c(round(power.ts.forecast$mean)))
write_csv(power.ts.forecast.df, "Forecast_Project1_Abdelmalek_ResidentialCustomerData_PartB_Output.csv")

Part C – Waterflow

Part C consists of two data sets. These are simple 2 columns sets, however they have different time stamps. Your optional assignment is to time-base sequence the data and aggregate based on hour (example of what this looks like, follows). Note for multiple recordings within an hour, take the mean. Then to determine if the data is stationary and can it be forecast. If so, provide a week forward forecast and present results via Rpubs and .rmd and the forecast in an Excel readable file.

waterflow1 <- read_excel('Waterflow_Pipe1.xlsx', col_types =c("date", "numeric"))
waterflow2 <- read_excel('Waterflow_Pipe2.xlsx', col_types =c("date", "numeric"))

kable(head(waterflow1))

Date Time	WaterFlow
2015-10-23 00:24:06	23.369599
2015-10-23 00:40:02	28.002881
2015-10-23 00:53:51	23.065895
2015-10-23 00:55:40	29.972809
2015-10-23 01:19:17	5.997953
2015-10-23 01:23:58	15.935223

kable(head(waterflow2))

Date Time	WaterFlow
2015-10-23 01:00:00	18.81079
2015-10-23 02:00:00	43.08703
2015-10-23 03:00:00	37.98770
2015-10-23 04:00:00	36.12038
2015-10-23 05:00:00	31.85126
2015-10-23 06:00:00	28.23809

Let’s fix the column names for consistency

colnames(waterflow1)= c("DateTime","WaterFlow")
colnames(waterflow2)= c("DateTime","WaterFlow")

Looking at the 2 readings, we notice that the first pipe is reading more than once within the hour, while the second pipe is reading once every hour. We need to normalize both pipes by converting the first pipe into Hourly.

wpc1 <-waterflow1 %>% 
  mutate(Date = date(DateTime),
         Hour = hour(DateTime) + 1) %>% 
  group_by(Date, Hour) %>% 
  summarize(WaterFlow = mean(WaterFlow)) %>% 
  ungroup() %>%
  mutate(DateTime = ymd_h(paste(Date, Hour))) %>% 
  select(DateTime, WaterFlow)

kable(head(wpc1))

DateTime	WaterFlow
2015-10-23 01:00:00	26.10280
2015-10-23 02:00:00	18.85202
2015-10-23 03:00:00	15.15857
2015-10-23 04:00:00	23.07886
2015-10-23 05:00:00	15.48219
2015-10-23 06:00:00	22.72539

We now use full join by adding waterflows through both of the pipes, so each data in each pipe is an hour of waterflow.

water.data<-full_join(wpc1, waterflow2, by = "DateTime", suffix = c("_1", "_2")) %>% 
  mutate(WaterFlow_1 = ifelse(is.na(WaterFlow_1), 0, WaterFlow_1)) %>% 
  mutate(WaterFlow = WaterFlow_1 + WaterFlow_2) %>% 
  select(DateTime, WaterFlow)

kable(head(water.data, 10))

DateTime	WaterFlow
2015-10-23 01:00:00	44.91359
2015-10-23 02:00:00	61.93904
2015-10-23 03:00:00	53.14628
2015-10-23 04:00:00	59.19923
2015-10-23 05:00:00	47.33345
2015-10-23 06:00:00	50.96348
2015-10-23 07:00:00	30.45346
2015-10-23 08:00:00	45.04664
2015-10-23 09:00:00	76.56548
2015-10-23 10:00:00	75.37564

Let’s create our time series

waterflow.ts<-ts(water.data$WaterFlow, frequency = 24)
summary(waterflow.ts)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.885  31.253  43.558  44.250  56.950 106.499

dim(water.data)

## [1] 1000    2

autoplot(waterflow.ts) +
  labs(title = "Two Pipelines Hourly Water Flow", x = "Day", y = "Total Waterflow")

We see variability over the period of time. So because of Variance, we will be using BoxCox.

waterflow.lambda<-BoxCox.lambda(waterflow.ts)
waterflow.trans<-BoxCox(waterflow.ts, waterflow.lambda)
ggtsdisplay(diff(waterflow.trans), points = FALSE, main = "Transformed WaterFlow Differenced Box-Cox")

Our time series looks stationary, but we do see spikes in the ACF and PACF at k=1, which means non-seasonal AR(1) and MA(1) components.

waterflow.fit<-Arima(waterflow.ts, order = c(1, 1, 1), lambda = waterflow.lambda)

Box.test(resid(waterflow.fit), type = "L")

## 
##  Box-Ljung test
## 
## data:  resid(waterflow.fit)
## X-squared = 2.0096e-05, df = 1, p-value = 0.9964

The Ljung-Box test returns a p value >0.05
The residuals appear to be normally distributed.
The model is acceptable and will be used for forecasting.

ggtsdisplay(resid(waterflow.fit), points = FALSE, plot.type = "histogram", main = "WaterFlow Residuals for ARIMA(1,1,1)")

Forecasting using ARIMA(1,1,1):

waterflow.forecast<-forecast(waterflow.fit, 168, level = 95)

autoplot(waterflow.forecast) + 
    labs(title = "WaterFlow Forecasted", subtitle = "Using ARIMA(1,1,1)", x = "Day", y = "Total WaterFlow") +
    theme(legend.position = "none")

Seems the modal forecast lacks seasonality.

Let’s write the forecast to a file

data_frame(DateTime = max(water.data$DateTime) + 
  hours(1:168), WaterFlow = waterflow.forecast$mean) %>% 
  write_csv("Forcast_Project1_Abdelmalek_WaterFlow_PartC_Output.csv")

Data624 - Project1

Abdelmalek Hajjam

4/10/2021

This project consists of 3 parts - two required and one bonus and is worth 15% of your grade. The project is due at 11:59 PM on Sunday March 31. I will accept late submissions with a penalty until the meetup after that when we review some projects.

Part A – ATM Forecast

ATM1 Time Series

ATM2 Time Series

ATM3 Time Series

ATM4 Time Series

ATM Forcasting

Part B – Forecasting Power

Part C – Waterflow