Data624 Project 1

suppressMessages(suppressWarnings(library(fpp2)))
suppressMessages(suppressWarnings(library(readxl)))
suppressMessages(suppressWarnings(library(seasonal)))
suppressMessages(suppressWarnings(library(rdatamarket)))
suppressMessages(suppressWarnings(library(tseries)))
suppressMessages(suppressWarnings(library(tidyverse)))
suppressMessages(suppressWarnings(library(scales)))
suppressMessages(suppressWarnings(library(forecast)))
suppressMessages(suppressWarnings(library(lubridate)))

This project consists of 3 parts (two required and one bonus) and is worth 15% of your grade. The project is due at 11:59 PM on Sunday, March 31. I will accept late submissions with a penalty until the meetup after that, when we review some projects.

Part A - ATM Forecast, ATM624Data.xlsx
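The code that reads ATM624Data.xlsx and builds the df_atm and atm_ts objects used below is not shown in the write-up; a minimal sketch of what it might look like follows, assuming the workbook sits in the working directory and holds long-format DATE, ATM, and Cash columns:

# assumed loading step (not shown in the original write-up): read the
# long-format workbook, drop placeholder rows with a missing ATM label,
# spread to one column per ATM, and build a multivariate daily time series
df_atm <- readxl::read_excel("ATM624Data.xlsx") %>%   # file path assumed
mutate(DATE = as_date(DATE)) %>%
drop_na(ATM) %>%
spread(ATM, Cash)
atm_ts <- ts(df_atm %>% select(-DATE))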

Let's plot this time series.

df_atm %>% gather(ATM, Cash, -DATE) %>% 
ggplot(aes(x = DATE, y = Cash, col = ATM)) +
geom_line(show.legend = FALSE) +
facet_wrap(~ ATM, ncol = 1, scales = "free_y") +
labs(title = "Cash withdrawn from ATMs", x = "Date") +
scale_y_continuous("Cash withdrawn (hundreds)", labels = dollar)

From the plot we can see that ATM1 & ATM2 have variations between $0 and $15,000, with a few exceptions. ATM3 shows zero withdrawals for most of the year until the final 3 days, with observations in the area of $10,000. ATM4 shows a similar pattern to ATM1 & ATM2, with the exception of one day showing withdrawals over $100,000.

ATM1 & ATM2 exhibit similar patterns throughout the time window, so the entire time series will be used for each. ATM3 was mostly inactive. ATM4 has a single extreme spike, which cannot be taken at face value for forecasting and will be treated as an outlier.

Creating separate time series objects.

# ATM1 & ATM2

atm1 <- atm_ts[, "ATM1"]
atm2 <- atm_ts[, "ATM2"]

# ATM3 has nonzero activity only on its final 3 days; keep the full series and treat the zero days as missing

atm3 <- atm_ts[, "ATM3"]
atm3[which(atm3 == 0)] <- NA

# ATM4 & impute spike with median

atm4 <- atm_ts[, "ATM4"]
atm4[which.max(atm4)] <- median(atm4, na.rm = TRUE)
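As an optional cross-check on the spike handling above (not part of the original workflow), forecast::tsoutliers() can flag outlying observations and suggest replacement values:

# optional cross-check: indices of outliers in the raw ATM4 series and the
# replacement values the forecast package would suggest
forecast::tsoutliers(atm_ts[, "ATM4"])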

Fitting

Let's look at the first 61 observations to identify the seasonality of ATM1, ATM2, and ATM4.

autoplot(ts(atm_ts[1:61, ])) +
labs(title = "Cash withdrawn from ATMs", x = "Day") +
scale_y_continuous("Cash withdrawn (hundreds)", labels = dollar) +
scale_color_discrete(NULL)

From the plot we can see a modest repeating pattern roughly every 7 days. A daily time series with frequency 1 cannot be decomposed to capture this seasonality, so we capture the weekly seasonal behavior by setting frequency = 7.

atm1 <- ts(atm1, frequency = 7)
atm2 <- ts(atm2, frequency = 7)
# impute NA value 
atm1[which(is.na(atm1))] <- median(atm1, na.rm = TRUE)
atm2[which(is.na(atm2))] <- median(atm2, na.rm = TRUE)

atm4 <- ts(atm4, frequency = 7)

ATM 1

The ATM 1 time series is displayed below with its ACF & spectrum plots:

ggtsdisplay(atm1, points = FALSE, plot.type = "spectrum",
main = "Withdrawals from atm1", xlab = "Week", ylab = "Cash (hundreds)")

The ACF and spectrum plots show weekly seasonality: there are large spikes in the ACF at lags 7, 14, and 21 and in the spectrum at frequencies 1, 2, and 3. This points to a seasonal ARIMA model. To address the seasonal autocorrelation, the time series is differenced with a lag of 7:

ggtsdisplay(diff(atm1, 7), points = FALSE,
main = "Dif lag7 withdrawals from atm1",
xlab = "Week", ylab = "Cash (hundreds)")

The differenced time series is now stationary. The spikes in the ACF and PACF at \(k = 1\) suggest non-seasonal AR(1) and/or MA(1) components, while the spikes at \(k = 7\) and the decreasing spikes at \(k = 14\) and \(k = 21\) suggest seasonal AR(1) and/or seasonal MA(1) components. This gives fifteen candidate models: ARIMA(p, 0, q)(P, 1, Q) for \(p, q, P, Q \in \{0, 1\}\), excluding the case where all of \(p, q, P, Q\) are zero.

The models are calculated and their AIC values returned:

# lambda for Box-cox transformation
atm1_lambda <- BoxCox.lambda(atm1)
# define function to create models & return AIC values for timeseries
atm_aic <- function(p, d, q, P, D, Q) {
# create model with Box-Cox and specified ARIMA parameters; extract AIC
  AIC(Arima(atm1, order = c(p, d, q), seasonal = c(P, D, Q), lambda = atm1_lambda))
}
# create possible combinations of p, q, P, Q except all zero
expand.grid(p = 0:1, q = 0:1, P = 0:1, Q = 0:1) %>%
filter(p > 0 | q > 0 | P > 0 | Q > 0) %>% 
# calc AIC for models
mutate(aic = pmap_dbl(list(p, 0, q, P, 1, Q), atm_aic)) %>% 
# return best AIC
slice(which.min(aic))
##   p q P Q      aic
## 1 1 1 0 1 1188.862

The minimum AIC value is for non-seasonal AR(1) & MA(1) and seasonal AR(0) & MA(1). The model used is ARIMA(1,0,1)(0,1,1):

atm1_fit <- Arima(atm1, order = c(1, 0, 1), seasonal = c(0, 1, 1), lambda = atm1_lambda)

Ljung-Box test and diagnostic plotting:

Box.test(resid(atm1_fit), type = "L", fitdf = 3, lag = 7)
## 
##  Box-Ljung test
## 
## data:  resid(atm1_fit)
## X-squared = 8.2634, df = 4, p-value = 0.08239
ggtsdisplay(resid(atm1_fit), points = FALSE, plot.type = "histogram",
main = "Residuals for ARIMA(1,0,1)(0,1,1) fit of atm1 withdrawals",
xlab = "Week", ylab = "Residual")

The Ljung-Box test p-value is greater than 0.05, so we fail to reject the null hypothesis that the residuals are white noise. The residuals appear approximately normally distributed with a mean near zero and show no significant autocorrelation. This model will be used for forecasting.
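The same diagnostics can also be produced in a single call with forecast::checkresiduals(), which plots the residuals, their ACF, and a histogram and runs a Ljung-Box test; this is shown as an equivalent convenience rather than as part of the original workflow:

# one-call residual diagnostics for the fitted ARIMA model
checkresiduals(atm1_fit, lag = 7)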

ATM 2

Let's perform the same procedure for ATM 2:

ggtsdisplay(atm2, points = FALSE,
main = "Withdrawals from atm2", xlab = "Week", ylab = "Cash (hundreds)")

We can see the same weekly seasonality as for atm1, so this series is also differenced with lag = 7:

ggtsdisplay(diff(atm2, 7), points = FALSE,
main = "Dif lag7 withdrawals from atm2",
xlab = "Week", ylab = "Cash (hundreds)")

We can see spikes at \(k = 7\), so \(D = 1\), and the stationary nature of the differenced time series suggests \(d = 0\). The spikes in the ACF & PACF of the non-differenced series at \(k = 2\) and \(k = 5\) suggest \(p, q \in \{0, 2, 5\}\).

#lambda for Box-cox transformation
atm2_lambda <- BoxCox.lambda(atm2)
# repurpose above function for atm2
atm_aic <- function(p, d, q, P, D, Q) {
# create model with Box-Cox and specified ARIMA parameters; extract AIC
AIC(Arima(atm2, order = c(p, d, q), seasonal = c(P, D, Q), lambda = atm2_lambda))
}
# create possible combinations of p, q, P, Q except all zero
expand.grid(p = c(0, 2, 5), q = c(0, 2, 5), P = 0:1, Q = 0:1) %>%
filter(p > 0 | q > 0 | P > 0 | Q > 0) %>% 
# calc AIC for models
mutate(aic = pmap_dbl(list(p, 0, q, P, 1, Q), atm_aic)) %>% 
# return best AIC
slice(which.min(aic))
##   p q P Q      aic
## 1 5 5 0 1 2544.791

The minimum AIC value is for non-seasonal AR(5) & MA(5) and seasonal AR(0) & MA(1). The model used is ARIMA(5,0,5)(0,1,1):

atm2_fit <- Arima(atm2, order = c(5, 0, 5), seasonal = c(0, 1, 1), lambda = atm2_lambda)

Ljung-Box test and diagnostic plotting:

Box.test(resid(atm2_fit), type = "L", fitdf = 11, lag = 14)
## 
##  Box-Ljung test
## 
## data:  resid(atm2_fit)
## X-squared = 4.696, df = 3, p-value = 0.1955
ggtsdisplay(resid(atm2_fit), points = FALSE, plot.type = "histogram",
main = "Residuals for ARIMA(5,0,5)(0,1,1) of atm2 withdrawals",
xlab = "Week", ylab = "Residual")

The Ljung-Box test p-value is greater than 0.05, so the residuals are consistent with white noise. They appear approximately normally distributed with a mean around zero. This model will be used for forecasting.

ATM 3

For ATM 3 we will use a simple mean forecast, as there are only 3 non-NA values.
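As a quick supplementary check (not in the original), the three observed values and their mean, which should match the point forecast produced by meanf() below, can be printed:

# the three non-missing ATM3 observations and their simple mean
atm3[!is.na(atm3)]
mean(atm3, na.rm = TRUE)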

ATM 4

We will use the same procedure for ATM 4 as for ATM 1 and ATM 2:

ggtsdisplay(atm4, points = FALSE,
main = "Withdrawals from atm4", xlab = "Week", ylab = "Cash (hundreds)")

The same weekly seasonality is seen as for atm1 & atm2 and it is also differenced with lag = 7:

ggtsdisplay(diff(atm4, 7), points = FALSE,
main = "Dif lag7 withdrawals from atm4",
xlab = "Week", ylab = "Cash (hundreds)")

The differenced series is stationary with a spike at \(k = 7\), so \(D = 1\) and \(d = 0\). Spikes in the ACF & PACF of both the original and differenced time series, as with atm2, suggest \(p, q \in \{0, 2, 5\}\) and \(P, Q \in \{0, 1\}\).

# lambda for Box-cox transformation
atm4_lambda <- BoxCox.lambda(atm4)
# repurpose above function for atm4
atm_aic <- function(p, d, q, P, D, Q) {
# create model with Box-Cox and specified ARIMA parameters; extract AIC
AIC(Arima(atm4, order = c(p, d, q), seasonal = c(P, D, Q), lambda = atm4_lambda))
}
# create possible combinations of p, q, P, Q except all zero
expand.grid(p = c(0, 2, 5), q = c(0, 2, 5), P = 0:1, Q = 0:1) %>%
filter(p > 0 | q > 0 | P > 0 | Q > 0) %>% 
# calc AIC for models
mutate(aic = pmap_dbl(list(p, 0, q, P, 1, Q), atm_aic)) %>% 
# return best AIC
slice(which.min(aic))
##   p q P Q      aic
## 1 5 2 0 1 2863.061

The grid search reports its lowest AIC for ARIMA(5,0,2)(0,1,1), i.e. non-seasonal AR(5) & MA(2) with seasonal MA(1). The more parsimonious ARIMA(0,0,2)(0,1,1), which drops the non-seasonal AR terms and keeps the non-seasonal MA(2) and seasonal MA(1), is fit and used below:

atm4_fit <- Arima(atm4, order = c(0, 0, 2), seasonal = c(0, 1, 1), lambda = atm4_lambda)

Ljung-Box test and diagnostic plotting:

Box.test(resid(atm4_fit), type = "L", fitdf = 3, lag = 7)
## 
##  Box-Ljung test
## 
## data:  resid(atm4_fit)
## X-squared = 8.6819, df = 4, p-value = 0.06956
ggtsdisplay(resid(atm4_fit), points = FALSE, plot.type = "histogram",
main = "Residuals for ARIMA(0,0,2)(0,1,1) of atm4 withdrawals",
xlab = "Week", ylab = "Residual")

The Ljung-Box test p-value is greater than 0.05, and the residuals are approximately normally distributed with a mean around zero. This model will be used for forecasting.

Forecasting

atm1_forecast <- forecast(atm1_fit, 31, level = 95)
atm2_forecast <- forecast(atm2_fit, 31, level = 95)
atm3_forecast <- meanf(atm3, 31, level = 95)
atm4_forecast <- forecast(atm4_fit, 31, level = 95)

Forecasts are plotted below:

gridExtra::grid.arrange(
autoplot(atm1_forecast) + 
labs(title = "atm1: ARIMA(1,0,1)(0,1,1)", x = "Week", y = NULL) +
theme(legend.position = "none"),

autoplot(atm2_forecast) + 
labs(title = "atm2: ARIMA(5,0,5)(0,1,1)", x = "Week", y = NULL) +
theme(legend.position = "none"),

autoplot(atm3_forecast) + 
labs(title = "atm3: mean", x = "Day", y = NULL) +
theme(legend.position = "none"),

autoplot(atm4_forecast) + 
labs(title = "atm4: ARIMA(0,0,2)(0,1,1)", x = "Week", y = NULL) +
theme(legend.position = "none"),

top = grid::textGrob("Forecasted atm withdrawals")
)
## Warning: Removed 362 rows containing missing values (geom_path).

The forecasts for atms 1, 2, and 4 show weekly seasonality, while atm3 receives a single mean-value forecast.

The forecasts are written to a csv file:

data_frame(DATE = rep(max(df_atm$DATE) + 1:31, 4),
atm = rep(names(df_atm)[-1], each = 31),
Cash = c(atm1_forecast$mean, atm2_forecast$mean,
atm3_forecast$mean, atm4_forecast$mean)) %>% 
write_csv("C:/Users/rites/Documents/GitHub/Data624-Project1/project1_atm.csv")

Part B - Forecasting Power, ResidentialCustomerForecastLoad-624.xlsx

Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and produce a monthly forecast for 2014. The data is given in a single file. The variable 'KWH' is power consumption in kilowatt hours; the rest is straightforward. Add this to your existing files above.

# read the data
pu_data <- readxl::read_excel("C:/Users/rites/Documents/GitHub/Data624-Project1/ResidentialCustomerForecastLoad-624.xlsx")

# convert KWH to a monthly time series
pu_data <- ts(pu_data[, "KWH"], start = c(1998, 1), frequency = 12)

Data exploration

autoplot(pu_data) +
labs(title = "Monthly residential power usage")

We can see annual seasonality in the data.
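The annual pattern can also be viewed month by month with a season plot; this is a supplementary view, not part of the original write-up, and the series is first coerced to a plain univariate ts:

# season plot (one line per year, months on the x-axis); supplementary view
pu_kwh <- ts(as.numeric(pu_data), start = c(1998, 1), frequency = 12)
ggseasonplot(pu_kwh, year.labels = TRUE) +
labs(title = "Seasonal plot: residential power usage", y = "KWH")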

# Box-Cox parameter
pu_lambda <- BoxCox.lambda(pu_data)
pu_trans <- BoxCox(pu_data, pu_lambda)

Fitting

ggtsdisplay(diff(pu_trans, 12), points = FALSE,
main = "Dif lag12 Box-Cox transformed residential power usage")

The seasonally differenced series is stationary, so no non-seasonal differencing is needed. The decaying seasonal spikes in the PACF suggest a seasonal AR(1) component, while the very quickly decaying seasonal spikes in the ACF suggest the possibility of a seasonal MA(1) component. Spikes in the PACF and ACF at \(k = 1\) and \(k = 4\) suggest non-seasonal AR(1) or AR(4) components, and non-seasonal MA(1) or MA(4) components. The function used in Part A to select the model with the lowest AIC is redefined for use on the KWH time series with \(D = 1\) and \(d = 0\):

# redefine function
pu_aic <- function(p, q, P, Q) {
# create model with Box-Cox and specified ARIMA parameters; extract AIC
AIC(Arima(pu_data, order = c(p, 0, q), seasonal = c(P, 1, Q), lambda = pu_lambda))
}
# create possible combinations except all zero & p = q = 4; P = Q = 1 (returns error)
expand.grid(p = c(0, 1, 4), q = c(0, 1, 4), P = 0:1, Q = 0:1) %>%
filter(p > 0 | q > 0 | P > 0 | Q > 0, p < 4 | q < 4 | P < 1 | Q < 1) %>%
# calc AIC for models
mutate(aic = pmap_dbl(list(p, q, P, Q), pu_aic)) %>% 
# return best AIC
slice(which.min(aic))
##   p q P Q      aic
## 1 1 0 0 1 575.2659

The minimum AIC value returned is for the ARIMA(1,0,0)(0,1,1) model; this is used:

pu_fit <- Arima(pu_data, order = c(1, 0, 0), seasonal = c(0, 1, 1), lambda = pu_lambda)

Ljung-Box test and diagnostic plotting:

Box.test(resid(pu_fit), type = "L", fitdf = 3, lag = 12)
## 
##  Box-Ljung test
## 
## data:  resid(pu_fit)
## X-squared = 6.4543, df = 9, p-value = 0.6937
ggtsdisplay(resid(pu_fit), points = FALSE,
main = "Residuals for ARIMA(1,0,0)(0,1,1) of residential power usage")

The Ljung-Box test p-value is greater than 0.05, but the spikes in the ACF & PACF at \(k = 3\) and \(k = 4\) suggest the possibility of AR(3) or MA(3) components (the spike at \(k = 4\) was already addressed above). Investigating these does not yield any AIC values lower than that of the model identified above:

expand.grid(p = c(1, 3), q = c(1, 3)) %>%
mutate(aic = pmap_dbl(list(p, q, 0, 1), pu_aic))
##   p q      aic
## 1 1 1 575.7090
## 2 3 1 579.7385
## 3 1 3 579.9391
## 4 3 3 581.4511

The residuals appear to be normally distributed around zero without any significant autocorrelation:

ggtsdisplay(resid(pu_fit), points = FALSE, plot.type = "histogram",
main = "Residuals for ARIMA(1,0,0)(0,1,1) of residential power usage")
## Warning: Removed 1 rows containing non-finite values (stat_bin).

Forecasting

Using the ARIMA(1,0,0)(0,1,1) model, the next year (12 months) is forecast, and this forecast is plotted:

pu_forecast <- forecast(pu_fit, 12, level = 95)
autoplot(pu_forecast) + 
labs(title = "Forecasted residential enery use",
subtitle = "Using ARIMA(1,0,0)(0,1,1) model", x = "Month", y = "kWh") +
theme(legend.position = "none")

The forecast shows annual seasonality.
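In-sample (training-set) accuracy measures for the fitted model can also be reported as a supplementary summary:

# training-set error measures (ME, RMSE, MAE, MAPE, ...) for the ARIMA fit
accuracy(pu_fit)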

data_frame(`YYYY-MMM` = paste0(2014, "-", month.abb),
PU = pu_forecast$mean) %>% 
write_csv("C:/Users/rites/Documents/GitHub/Data624-Project1/project1_kWh.csv")

Part C - BONUS, optional (part or all), Waterflow_Pipe1.xlsx and Waterflow_Pipe2.xlsx

Part C consists of two data sets. These are simple 2-column sets; however, they have different time stamps. Your optional assignment is to time-base sequence the data and aggregate based on hour (an example of what this looks like follows). Note: for multiple recordings within an hour, take the mean. Then determine whether the data is stationary and whether it can be forecast. If so, provide a week-forward forecast and present results via RPubs and .rmd, with the forecast in an Excel-readable file.

# read the data
water1 <- readxl::read_excel("C:/Users/rites/Documents/GitHub/Data624-Project1/Waterflow_Pipe1.xlsx")
water2 <- readxl::read_excel("C:/Users/rites/Documents/GitHub/Data624-Project1/Waterflow_Pipe2.xlsx")

In order to use the two series together, the readings for pipeline 1 must be converted to hourly:

water1 <- water1 %>% 
# separate date & hour components of readings
mutate(Date = date(DateTime),
# convert hour to hour-ending to match pipeline 2
Hour = hour(DateTime) + 1) %>% 
# get average reading for each date & hour
group_by(Date, Hour) %>% 
summarize(WaterFlow = mean(WaterFlow)) %>% 
# convert back to DateTime and drop separate date/hour columns
ungroup() %>%
mutate(DateTime = ymd_h(paste(Date, Hour))) %>% 
select(DateTime, WaterFlow)

The two datasets are joined and a total waterflow is created, then converted to a time series:

# create df with both observations for each hour
water_df <- full_join(water1, water2, by = "DateTime", suffix = c("_1", "_2")) %>% 
# convert missing pipeline 1 readings to zero
mutate(WaterFlow_1 = ifelse(is.na(WaterFlow_1), 0, WaterFlow_1)) %>% 
# get total waterflow by hour
mutate(WaterFlow = WaterFlow_1 + WaterFlow_2) %>% 
# drop individual numbers
select(DateTime, WaterFlow)
# create hourly timeseries object
water_ts <- ts(water_df$WaterFlow, frequency = 24)

The time series is plotted to inspect its features:

autoplot(water_ts) +
labs(title = "Hourly water flow through two pipelines", x = "Day", y = "Total waterflow")

There is variability across the whole range, with an initial downward trend before day 10 followed by a roughly flat period through the end of the time window.

Fitting

The variance appears roughly constant, but a Box-Cox transformation is still applied. Due to the apparent non-stationarity in the mean, a lag-1 difference is taken:

# Box-Cox parameter & transform
water_lambda <- BoxCox.lambda(water_ts)
water_trans <- BoxCox(water_ts, water_lambda)
# plot differenced transformed series
ggtsdisplay(diff(water_trans), points = FALSE,
main = "Differenced Box-Cox transformed water flow")

The differenced time series is stationary, but shows significant spikes in the ACF and PACF at \(k = 1\), strongly suggesting non-seasonal AR(1) and MA(1) components. There is no seasonal behavior. Thus, an ARIMA(1,1,1) model is used:

water_fit <- Arima(water_ts, order = c(1, 1, 1), lambda = water_lambda)
Box.test(resid(water_fit), type = "L")
## 
##  Box-Ljung test
## 
## data:  resid(water_fit)
## X-squared = 2.0096e-05, df = 1, p-value = 0.9964
ggtsdisplay(resid(water_fit), points = FALSE, plot.type = "histogram",
main = "Residuals for ARIMA(1,1,1) of water flow")

The Ljung-Box test returns a p-value of almost 1; the residuals appear roughly normally distributed around 0 without significant autocorrelation. The model is acceptable and will be used for forecasting.

Forecasting

Using the ARIMA(1,1,1) model, one week (168 hours) is forecast, and the forecast plotted:

water_forecast <- forecast(water_fit, 168, level = 95)
autoplot(water_forecast) + 
labs(title = "Forecasted water flow",
x = "Day", y = "Total flow") +
theme(legend.position = "none")

Because the model has no seasonal component, the forecast converges to a single flat value. The forecast values are output to a .csv file:

data_frame(DateTime = max(water_df$DateTime) + hours(1:168),
WaterFlow = water_forecast$mean) %>% 
write_csv("C:/Users/rites/Documents/GitHub/Data624-Project1/project1_water.csv")