Data 624 Project 1
1 Project #1
This project consists of 3 parts, two required and one bonus, and is worth 15% of your grade.
Part A – ATM Forecast, ATM624Data.xlsx
In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable Cash is provided in hundreds of dollars; other than that it is straightforward. I am being somewhat ambiguous on purpose to make this have a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an Excel file; please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual .rmd file. Also please submit the forecast, which you will put in an Excel-readable file.
Part B – Forecasting Power, ResidentialCustomerForecastLoad-624.xlsx
Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and produce a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in kilowatt hours; the rest is straightforward. Add this to your existing files above.
Part C – BONUS, optional (part or all), Waterflow_Pipe1.xlsx and Waterflow_Pipe2.xlsx
Part C consists of two datasets. These are simple 2-column sets; however, they have different time stamps. Your optional assignment is to time-base sequence the data and aggregate based on hour (an example of what this looks like follows). Note: for multiple recordings within an hour, take the mean. Then determine if the data is stationary and whether it can be forecast. If so, provide a week-forward forecast and present the results via RPubs and .rmd, with the forecast in an Excel-readable file.
2 Libraries

library(tidyverse)
library(fpp2)
library(urca)
library(rio)
library(gridExtra)
library(forecast)
library(lubridate)
3 Part A
Part A – ATM Forecast, ATM624Data.xlsx
3.1 Load Data
First, load the Excel data, clean it by dropping the NA values, and rearrange it into a wider format with one column per ATM.
atm <- import("https://raw.githubusercontent.com/shirley-wong/Data-624/main/Project1/ATM624Data.xlsx", col_types = c("date", "text", "numeric"))
atm # original data

atm_daily <- atm %>% drop_na() %>% spread(ATM, Cash)
atm_daily # daily totals after dropping NA values

From the table above, we can see that the cash withdrawals from ATM4 are generally larger than those of the other 3 ATMs, and that ATM3's are mostly zero. Let's look more closely at their summary statistics and boxplots.
There are 3 NA values in ATM1 and 2 in ATM2.
There are only 3 non-zero datapoints in ATM3.
There are many outliers in ATM1 and one extremely high outlier in ATM4.
par(mfrow=c(4,1))
for (i in 2:5) {
  print(summary(atm_daily[i]))
  boxplot(atm_daily[i], horizontal = TRUE)
}
## ATM1
## Min. : 1.00
## 1st Qu.: 73.00
## Median : 91.00
## Mean : 83.89
## 3rd Qu.:108.00
## Max. :180.00
## NA's :3
## ATM2
## Min. : 0.00
## 1st Qu.: 25.50
## Median : 67.00
## Mean : 62.58
## 3rd Qu.: 93.00
## Max. :147.00
## NA's :2
## ATM3
## Min. : 0.0000
## 1st Qu.: 0.0000
## Median : 0.0000
## Mean : 0.7206
## 3rd Qu.: 0.0000
## Max. :96.0000
## ATM4
## Min. : 1.563
## 1st Qu.: 124.334
## Median : 403.839
## Mean : 474.043
## 3rd Qu.: 704.507
## Max. :10919.762
3.2 Timeseries
Next, convert the data into a timeseries and plot it.
As the spike from ATM4 in the first plot makes it hard to see the details in ATM1, ATM2, and ATM3, individual plots will be shown below.
atm_ts <- ts(atm_daily %>% select(-DATE))
autoplot(atm_ts) +
  ggtitle("Daily Cash Withdrawal from 4 ATM Machines") +
  xlab("Day") +
  ylab("Hundreds of Dollars ($100)")

To handle the data better, the extremely high outlier in ATM4 will be replaced by the ATM4 median for better forecasting.
atm[1394,3] <- 403.839 # replace the extreme ATM4 value with the ATM4 median
atm %>% drop_na() %>% ggplot(aes(x=DATE, y=Cash, col=ATM)) +
  geom_line(show.legend = FALSE) +
  facet_wrap(~ATM, ncol=1, scales="free_y") +
  labs(title="Daily Cash Withdrawal from 4 ATM Machines (May 2009 to April 2010)") +
  xlab("Date") + ylab("Hundreds of Dollars ($100)")

From the above plots:
Seasonality is seen in ATM1 and ATM2; a Box-Cox transformation will be applied.
ATM3 has only 3 non-zero datapoints, with all values before them equal to 0, so we may not have enough information for prediction. This could happen if ATM3 launched only 3 days before the end of April 2010.
Replacing the outlier with the median makes ATM4 easier to forecast.
3.3 ARIMA Models
3.3.1 ATM1
From the timeseries plot, ATM1 clearly shows weekly seasonality.
The ACF and PACF plots have significant spikes at lags 7, 14, and 21.
Thus, this is a non-stationary timeseries.
atm1 <- atm_daily[2]
atm1 <- ts(atm1, frequency=7) # weekly seasonality
atm2 <- atm_daily[3]
atm2 <- ts(atm2, frequency=7)
atm3 <- atm_daily[4]
atm3 <- ts(atm3, start=363) # only the last 3 observations are non-zero
atm3[which(atm3==0)] <- NA # treat the leading zeros as missing
atm4 <- atm_daily[5]
atm4[285,] <- 403.839 # outlier replaced by the ATM4 median
atm4 <- ts(atm4, frequency=7)

atm1_bc <- BoxCox(atm1, lambda = BoxCox.lambda(atm1))
ggtsdisplay(atm1_bc, main="ATM1 with BoxCox Transformation")

After applying Box-Cox transformation, we still see the weekly seasonality in the timeseries. Differencing is needed.
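The chunk that produced the KPSS output below is hidden in the knitted report. A minimal sketch of what it likely contains, assuming one seasonal difference at lag 7 and urca's ur.kpss() test:

atm1_diff <- diff(atm1_bc, lag = 7) # one seasonal difference (D = 1)
ggtsdisplay(atm1_diff, main = "ATM1: Box-Cox transformed and seasonally differenced")
atm1_diff %>% ur.kpss() %>% summary() # prints the KPSS table below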
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 5 lags.
##
## Value of test-statistic is: 0.016
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
The timeseries now appears to be stationary.
To recall, an ARIMA(p,d,q)(P,D,Q)[m] model has p/P as the order of the autoregressive part, d/D as the degree of differencing involved, q/Q as the order of the moving average part, and m as the number of observations per seasonal cycle (here m=7 for weekly seasonality).
The significant non-seasonal spike at lag 1 in the ACF and PACF suggests non-seasonal p=q=1.
The seasonal spike at lag 7 suggests seasonal AR(1) and/or MA(1) components; as the seasonal ACF decays gradually, this suggests seasonal AR(0) and MA(1), i.e. P=0, Q=1.
Thus, the model I suggest is ARIMA(1,0,1)(0,1,1)[7].
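In backshift notation, with \(w_t\) denoting the Box-Cox transformed series, this model can be written as
\[(1 - \phi_1 B)(1 - B^7)\, w_t = (1 + \theta_1 B)(1 + \Theta_1 B^7)\, \varepsilon_t\]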
atm1_arima <- Arima(atm1, order=c(1,0,1), seasonal=c(0,1,1), lambda = BoxCox.lambda(atm1))
summary(atm1_arima)
## Series: atm1
## ARIMA(1,0,1)(0,1,1)[7]
## Box Cox transformation: lambda= 0.2081476
##
## Coefficients:
## ar1 ma1 sma1
## -0.5040 0.6201 -0.6438
## s.e. 0.2417 0.2183 0.0427
##
## sigma^2 estimated as 1.252: log likelihood=-544.29
## AIC=1096.58 AICc=1096.7 BIC=1112.11
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 2.514774 25.09103 15.83582 -89.65246 108.6821 0.8958869
## ACF1
## Training set -0.007347018
##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,0,1)(0,1,1)[7]
## Q* = 9.4438, df = 11, p-value = 0.581
##
## Model df: 3. Total lags used: 14
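The auto.arima fit whose summary and residual check follow is not echoed in the knit; it was presumably produced by a chunk like the one below (the approximation = FALSE argument is an assumption, mirroring the Part B and Part C chunks).

atm1_auto <- auto.arima(atm1, approximation = FALSE, lambda = BoxCox.lambda(atm1))
summary(atm1_auto) # prints the ARIMA(0,0,2)(0,1,1)[7] summary below
checkresiduals(atm1_auto) # residual plots and the Ljung-Box test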
## Series: atm1
## ARIMA(0,0,2)(0,1,1)[7]
## Box Cox transformation: lambda= 0.2081476
##
## Coefficients:
## ma1 ma2 sma1
## 0.0989 -0.1107 -0.6482
## s.e. 0.0532 0.0527 0.0425
##
## sigma^2 estimated as 1.247: log likelihood=-543.68
## AIC=1095.37 AICc=1095.48 BIC=1110.89
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 2.493867 25.19797 16.12913 -88.12706 107.3975 0.9124807
## ACF1
## Training set 0.01895515
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,0,2)(0,1,1)[7]
## Q* = 8.5811, df = 11, p-value = 0.6605
##
## Model df: 3. Total lags used: 14
The ARIMA model found by auto.arima is ARIMA(0,0,2)(0,1,1)[7].
The ARIMA model I suggested is ARIMA(1,0,1)(0,1,1)[7].
According to the error measures and the residual plots, both models represent the data well, with similar AIC values, similar error measures, and Ljung-Box p-values well above 0.05.
Both models work well, but ARIMA(0,0,2)(0,1,1)[7] has the smaller AICc, so it is preferred over the model I suggested.
3.3.2 ATM2
From the timeseries plot, ATM2 also clearly shows weekly seasonality.
The ACF and PACF plots have significant spikes at lags 7, 14, and 21.
Thus, this is a non-stationary timeseries.
atm2_bc <- BoxCox(atm2, lambda = BoxCox.lambda(atm2))
ggtsdisplay(atm2_bc, main="ATM2 with BoxCox Transformation")

After applying Box-Cox transformation, we still see the weekly seasonality in the timeseries. Differencing is needed.
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 5 lags.
##
## Value of test-statistic is: 0.0161
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
The timeseries now appears to be stationary.
One seasonal difference was applied, so D=1, while the non-seasonal part needs no differencing, so d=0.
The seasonal spikes in the ACF drop off abruptly while those in the PACF decrease gradually, suggesting seasonal AR(0) and MA(1), so P=0, Q=1.
The non-differenced ACF and PACF plots have spikes at lags 2 and 5, suggesting p=2 and q=2.
Thus, the model I suggest is ARIMA(2,0,2)(0,1,1)[7].
atm2_arima <- Arima(atm2, order=c(2,0,2), seasonal=c(0,1,1), lambda = BoxCox.lambda(atm2))
summary(atm2_arima)
## Series: atm2
## ARIMA(2,0,2)(0,1,1)[7]
## Box Cox transformation: lambda= 0.7164431
##
## Coefficients:
## ar1 ar2 ma1 ma2 sma1
## -0.4188 -0.9019 0.4587 0.7886 -0.7187
## s.e. 0.0587 0.0471 0.0884 0.0619 0.0414
##
## sigma^2 estimated as 62.19: log likelihood=-1240.08
## AIC=2492.16 AICc=2492.4 BIC=2515.45
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 0.2631418 24.22403 16.90092 -Inf Inf 0.8250038 -0.01464052
##
## Ljung-Box test
##
## data: Residuals from ARIMA(2,0,2)(0,1,1)[7]
## Q* = 9.8051, df = 9, p-value = 0.3665
##
## Model df: 5. Total lags used: 14

The hidden auto.arima chunk arrives at the same model, so its summary below is identical:
## Series: atm2
## ARIMA(2,0,2)(0,1,1)[7]
## Box Cox transformation: lambda= 0.7164431
##
## Coefficients:
## ar1 ar2 ma1 ma2 sma1
## -0.4188 -0.9019 0.4587 0.7886 -0.7187
## s.e. 0.0587 0.0471 0.0884 0.0619 0.0414
##
## sigma^2 estimated as 62.19: log likelihood=-1240.08
## AIC=2492.16 AICc=2492.4 BIC=2515.45
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 0.2631418 24.22403 16.90092 -Inf Inf 0.8250038 -0.01464052
##
## Ljung-Box test
##
## data: Residuals from ARIMA(2,0,2)(0,1,1)[7]
## Q* = 9.8051, df = 9, p-value = 0.3665
##
## Model df: 5. Total lags used: 14
The ARIMA model found by auto.arima is ARIMA(2,0,2)(0,1,1)[7].
The ARIMA model I suggested is ARIMA(2,0,2)(0,1,1)[7].
The two models are the same.
According to the error measures and the residual plots, the model represents the data well: the Ljung-Box p-value (0.3665) is well above 0.05, so the residuals resemble white noise.
3.3.3 ATM3
ATM3 has only 3 non-zero datapoints, and all values before them are 0. This could happen if ATM3 launched only 3 days before the end of April 2010.
Given only 3 datapoints, there is not enough information to fit a timeseries model.
Therefore, I will use a simple mean forecast.
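Since the leading zeros in atm3 were recoded as NA in the data preparation above, the mean is taken over just the three remaining observations; meanf() appears to average only the non-missing values, so this is the flat point forecast used later.

mean(atm3, na.rm = TRUE) # should match the meanf() point forecast repeated for all 31 days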
3.3.4 ATM4
Let’s repeat the steps from ATM1 and ATM2 on ATM4, which also has weekly seasonality.
The ACF and PACF plots have significant spikes at lags 7, 14, and 21.
Thus, this is a non-stationary timeseries.
atm4_bc <- BoxCox(atm4, lambda = BoxCox.lambda(atm4))
ggtsdisplay(atm4_bc, main="ATM4 with BoxCox Transformation")

After applying Box-Cox transformation, we still see the weekly seasonality in the timeseries. Differencing is needed.
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 5 lags.
##
## Value of test-statistic is: 0.0124
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
The timeseries now appears to be stationary.
One seasonal difference was applied, so D=1, while the non-seasonal part needs no differencing, so d=0.
The seasonal spikes in the ACF drop off abruptly while those in the PACF decrease gradually, suggesting seasonal AR(0) and MA(1), so P=0, Q=1.
There is a single non-seasonal spike (at lag 3) in the ACF and PACF plots, suggesting low-order terms; I will try p=q=1.
Thus, I will try ARIMA(1,0,1)(0,1,1)[7].
atm4_arima <- Arima(atm4, order=c(1,0,1), seasonal=c(0,1,1), lambda = BoxCox.lambda(atm4))
summary(atm4_arima)
## Series: atm4
## ARIMA(1,0,1)(0,1,1)[7]
## Box Cox transformation: lambda= 0.4525697
##
## Coefficients:
## ar1 ma1 sma1
## 0.6851 -0.6059 -0.8677
## s.e. 0.3040 0.3315 0.0315
##
## sigma^2 estimated as 171.4: log likelihood=-1432.08
## AIC=2872.16 AICc=2872.27 BIC=2887.68
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 61.63608 337.5467 254.8181 -360.8151 404.2179 0.7355835
## ACF1
## Training set 0.0185852
##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,0,1)(0,1,1)[7]
## Q* = 16.519, df = 11, p-value = 0.1229
##
## Model df: 3. Total lags used: 14

For comparison, auto.arima (run in a hidden chunk) selects a different model:
## Series: atm4
## ARIMA(1,0,0)(2,0,0)[7] with non-zero mean
## Box Cox transformation: lambda= 0.4525697
##
## Coefficients:
## ar1 sar1 sar2 mean
## 0.0771 0.2090 0.1996 29.0082
## s.e. 0.0526 0.0516 0.0525 1.2627
##
## sigma^2 estimated as 182.2: log likelihood=-1466.34
## AIC=2942.69 AICc=2942.85 BIC=2962.19
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 84.38613 351.7695 274.2406 -370.6751 415.1343 0.7916504
## ACF1
## Training set 0.01960666
##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,0,0)(2,0,0)[7] with non-zero mean
## Q* = 15.92, df = 10, p-value = 0.1019
##
## Model df: 4. Total lags used: 14
The ARIMA model found by auto.arima is ARIMA(1,0,0)(2,0,0)[7] with non-zero mean.
The ARIMA model I suggested is ARIMA(1,0,1)(0,1,1)[7].
The model I suggested has the smaller AICc and RMSE.
Based on the error measures and the residual plots, I will stick with the model I suggested.
3.4 Forecast
# The model objects are defined in a hidden chunk in the original knit:
atm1_model <- Arima(atm1, order=c(0,0,2), seasonal=c(0,1,1), lambda = BoxCox.lambda(atm1)) # auto.arima's pick
atm2_model <- atm2_arima
atm4_model <- atm4_arima
atm1_f <- forecast(atm1_model, 31, level=95)
atm2_f <- forecast(atm2_model, 31, level=95)
atm3_f <- meanf(atm3, 31, level=95)
atm4_f <- forecast(atm4_model, 31, level=95)
gridExtra::grid.arrange(
  autoplot(atm1_f) +
    labs(title="ATM1: ARIMA(0,0,2)(0,1,1)[7]", x="Day", y="Hundreds of Dollars($100)"),
  autoplot(atm2_f) +
    labs(title="ATM2: ARIMA(2,0,2)(0,1,1)[7]", x="Day", y="Hundreds of Dollars($100)"),
  autoplot(atm3_f) +
    labs(title="ATM3: meanf", x="Day", y="Hundreds of Dollars($100)"),
  autoplot(atm4_f) +
    labs(title="ATM4: ARIMA(1,0,1)(0,1,1)[7]", x="Day", y="Hundreds of Dollars($100)"),
  top = grid::textGrob("Forecast on Cash Withdrawal for May 2010")
)

export <- rbind(atm1_f$mean, atm2_f$mean, atm3_f$mean, atm4_f$mean)
write.csv(export, "ATM624Forecast.csv")
data.frame(export) %>% cbind(ATM = c('ATM1', 'ATM2', 'ATM3', 'ATM4')) %>%
  select(ATM, everything())

4 Part B
Part B – Forecasting Power, ResidentialCustomerForecastLoad-624.xlsx
4.1 Load Data
First, load the Excel data into R and clean it using the tsclean() function, which handles outliers and NA values.
kwh <- import("https://raw.githubusercontent.com/shirley-wong/Data-624/main/Project1/ResidentialCustomerForecastLoad-624.xlsx")
kwh # original data

kwh_ts <- ts(kwh[,"KWH"], start=c(1998,1), frequency=12) %>%
  tsclean() # handle the outliers and NA value
kwh_ts # cleaned data
## Jan Feb Mar Apr May Jun Jul
## 1998 6862583 5838198 5420658 5010364 4665377 6467147 8914755
## 1999 7183759 5759262 4847656 5306592 4426794 5500901 7444416
## 2000 7068296 5876083 4807961 4873080 5050891 7092865 6862662
## 2001 7538529 6602448 5779180 4835210 4787904 6283324 7855129
## 2002 7099063 6413429 5839514 5371604 5439166 5850383 7039702
## 2003 7256079 6190517 6120626 4885643 5296096 6051571 6900676
## 2004 7584596 6560742 6526586 4831688 4878262 6421614 7307931
## 2005 8225477 6564338 5581725 5563071 4453983 5900212 8337998
## 2006 7793358 5914945 5819734 5255988 4740588 7052275 7945564
## 2007 8031295 7928337 6443170 4841979 4862847 5022647 6426220
## 2008 7964293 7597060 6085644 5352359 4608528 6548439 7643987
## 2009 8072330 6976800 5691452 5531616 5264439 5804433 7713260
## 2010 9397357 8390677 7347915 5776131 4919289 6696292 7696306
## 2011 8394747 8898062 6356903 5685227 5506308 8037779 10093343
## 2012 8991267 7952204 6356961 5569828 5783598 7926956 8886851
## 2013 10655730 7681798 6517514 6105359 5940475 7920627 8415321
## Aug Sep Oct Nov Dec
## 1998 8607428 6989888 6345620 4640410 4693479
## 1999 7564391 7899368 5358314 4436269 4419229
## 2000 7517830 8912169 5844352 5041769 6220334
## 2001 8450717 7112069 5242535 4461979 5240995
## 2002 8058748 8245227 5865014 4908979 5779958
## 2003 8476499 7791791 5344613 4913707 5756193
## 2004 7309774 6690366 5444948 4824940 5791208
## 2005 7786659 7057213 6694523 4313019 6181548
## 2006 8241110 7296355 5104799 4458429 6226214
## 2007 7447146 7666970 5785964 4907057 6047292
## 2008 8037137 7381789 5101803 4555602 6442746
## 2009 8350517 7583146 5566075 5339890 7089880
## 2010 7922701 7819472 5875917 4800733 6152583
## 2011 10308076 8943599 5603920 6154138 8273142
## 2012 9612423 7559148 5576852 5731899 6609694
## 2013 9080226 7968220 5759367 5769083 7028762
Summary statistics before cleaning (note the single NA and the implausibly low minimum of 770523) and after cleaning:
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 770523 5429912 6283324 6502475 7620524 10655730 1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4313019 5443502 6351262 6529701 7608792 10655730
4.2 Timeseries
Let’s study the cleaned timeseries below.
Seasonality is present, with a peak roughly every 6 months; the underlying cycle is likely annual, driven by high usage during winter and summer.
autoplot(kwh_ts) +
  ggtitle("Monthly Residential Power Usage for Jan 1998 to Dec 2013 (KWH)") +
  xlab("Month") +
  ylab("Kilowatt hours (KWH)")

4.3 ARIMA Model
From the plots below, we see annual seasonality.
A Box-Cox transformation was tried on the timeseries but made little visible difference; differencing is still needed.
kwh_bc <- BoxCox(kwh_ts, lambda = BoxCox.lambda(kwh_ts))
ggtsdisplay(kwh_bc, main="kwh_ts with BoxCox Transformation")

After one seasonal difference, the timeseries appears to be stationary.
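As in Part A, the differencing and KPSS chunk is hidden; a sketch, assuming one seasonal difference at the annual lag:

kwh_diff <- diff(kwh_bc, lag = 12) # one seasonal difference (D = 1)
kwh_diff %>% ur.kpss() %>% summary() # prints the KPSS table below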
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 4 lags.
##
## Value of test-statistic is: 0.1168
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
To recall, an ARIMA(p,d,q)(P,D,Q)[m] model has p/P as the order of the autoregressive part, d/D as the degree of differencing involved, q/Q as the order of the moving average part, and m as the number of observations per seasonal cycle (here m=12 for monthly data).
The significant non-seasonal spike at lag 1 in the ACF and PACF suggests non-seasonal p=q=1.
The seasonal spike at lag 12 in the ACF and the spikes at lags 12 and 24 in the PACF suggest seasonal AR and/or MA components; as the ACF decays gradually, this suggests seasonal AR(0) and MA(2), i.e. P=0, Q=2.
Thus, the model I suggest is ARIMA(1,0,1)(0,1,2)[12].
kwh_arima <- Arima(kwh_ts, order=c(1,0,1), seasonal=c(0,1,2), lambda = BoxCox.lambda(kwh_ts))
summary(kwh_arima)
## Series: kwh_ts
## ARIMA(1,0,1)(0,1,2)[12]
## Box Cox transformation: lambda= -0.1442665
##
## Coefficients:
## ar1 ma1 sma1 sma2
## 0.9480 -0.8020 -0.8027 0.1754
## s.e. 0.0822 0.1772 0.0928 0.0910
##
## sigma^2 estimated as 9.097e-05: log likelihood=582.73
## AIC=-1155.47 AICc=-1155.12 BIC=-1139.5
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 85650.36 597401.9 463229.3 0.8034675 6.999746 0.7766959
## ACF1
## Training set 0.1519546
##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,0,1)(0,1,2)[12]
## Q* = 24.449, df = 20, p-value = 0.2233
##
## Model df: 4. Total lags used: 24
kwh_auto <- auto.arima(kwh_ts, approximation = FALSE, lambda=BoxCox.lambda(kwh_ts))
checkresiduals(kwh_auto)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,0,0)(0,1,1)[12] with drift
## Q* = 25.496, df = 21, p-value = 0.2263
##
## Model df: 3. Total lags used: 24
## Series: kwh_ts
## ARIMA(1,0,0)(0,1,1)[12] with drift
## Box Cox transformation: lambda= -0.1442665
##
## Coefficients:
## ar1 sma1 drift
## 0.2903 -0.7349 1e-04
## s.e. 0.0724 0.0698 1e-04
##
## sigma^2 estimated as 8.731e-05: log likelihood=585.27
## AIC=-1162.55 AICc=-1162.32 BIC=-1149.78
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 20692.11 580573.3 445696.4 -0.2292905 6.807673 0.7472985
## ACF1
## Training set 0.05192474
The ARIMA model found by auto.arima is ARIMA(1,0,0)(0,1,1)[12] with drift.
The ARIMA model I suggested is ARIMA(1,0,1)(0,1,2)[12].
According to the error measures and the residual plots, the auto model represents the data better, with a smaller AICc and smaller error measures.
Therefore ARIMA(1,0,0)(0,1,1)[12] is preferred over the model I suggested.
4.4 Forecast
kwh_model <- Arima(kwh_ts, order=c(1,0,0), seasonal=c(0,1,1), lambda = BoxCox.lambda(kwh_ts)) # auto.arima's choice (the tiny drift term is omitted)
kwh_f <- forecast(kwh_model, h=12, level=95)
kwh_f
## Point Forecast Lo 95 Hi 95
## Jan 2014 9417011 7779842 11460551
## Feb 2014 7893327 6480120 9670462
## Mar 2014 6436805 5307451 7849709
## Apr 2014 5756614 4760318 6998820
## May 2014 5570785 4610642 6766694
## Jun 2014 7479484 6140427 9163354
## Jul 2014 8522770 6971107 10482526
## Aug 2014 9080593 7413869 11190200
## Sep 2014 7909745 6483433 9706676
## Oct 2014 5661778 4683894 6880415
## Nov 2014 5568654 4608914 6764051
## Dec 2014 6919608 5693132 8457963
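The export step for Part B is not shown in the knit; a sketch that writes the 2014 point forecasts to an Excel-readable CSV (the layout and file name are assumptions):

export_kwh <- data.frame(Month = month.abb, Year = 2014, KWH = as.numeric(kwh_f$mean)) # hypothetical layout
write.csv(export_kwh, "KWH624Forecast.csv", row.names = FALSE) # hypothetical file name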
5 Part C
Part C – BONUS, optional (part or all), Waterflow_Pipe1.xlsx and Waterflow_Pipe2.xlsx
5.1 Load Data
wfp1 <- import("https://raw.githubusercontent.com/shirley-wong/Data-624/main/Project1/Waterflow_Pipe1.xlsx", col_types = c("date", "numeric"))
wfp2 <- import("https://raw.githubusercontent.com/shirley-wong/Data-624/main/Project1/Waterflow_Pipe2.xlsx", col_types = c("date", "numeric"))
colnames(wfp1) <- c("DateTime", "WaterFlow")
colnames(wfp2) <- c("DateTime", "WaterFlow")

wfp1 <- wfp1 %>% mutate(Date = as.Date(DateTime), Time = hour(DateTime)+1) %>% # match the hour convention of wfp2
  group_by(Date, Time) %>%
  summarise(Water=mean(WaterFlow)) %>% # take the mean of multiple recordings within an hour
  ungroup() %>%
  mutate(DateTime=ymd_h(paste(Date,Time))) %>%
  select(DateTime,Water)
wfp1

wfp2 <- wfp2 %>% mutate(Date = as.Date(DateTime), Time = hour(DateTime)) %>%
  group_by(Date, Time) %>%
  summarise(Water=mean(WaterFlow)) %>%
  ungroup() %>%
  mutate(DateTime=ymd_h(paste(Date,Time))) %>%
  select(DateTime, Water)
wfp2

5.2 Timeseries
Combining the two waterflows into one:
water <- full_join(wfp1, wfp2, by="DateTime", suffix=c("_1", "_2")) %>%
  mutate(Water_1=ifelse(is.na(Water_1), 0, Water_1)) %>%
  mutate(Water_2=ifelse(is.na(Water_2), 0, Water_2)) %>%
  mutate(Water = Water_1 + Water_2) %>%
  select(DateTime, Water)
water

water_ts <- ts(water$Water, frequency=24)
ggseasonplot(water_ts) + theme(legend.title = element_blank())

5.3 ARIMA Model
We cannot see significant seasonality in water_ts; however, there is a slight decreasing trend.
It is a non-stationary timeseries.
water_bc <- BoxCox(water_ts, lambda = BoxCox.lambda(water_ts))
ggtsdisplay(water_bc, main="water_ts with BoxCox Transformation")

Trend differencing is needed.
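The two bare numbers printed below are presumably the outputs of ndiffs() and nsdiffs(); a sketch of the hidden chunk, including the KPSS test on the differenced series:

ndiffs(water_bc) # expected output 1: one first difference needed
nsdiffs(water_bc) # expected output 0: no seasonal difference needed
water_bc %>% diff() %>% ur.kpss() %>% summary() # prints the KPSS table below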
## [1] 1
## [1] 0
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 7 lags.
##
## Value of test-statistic is: 0.0091
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
The timeseries now appears to be stationary.
No seasonal difference was needed, so D=0, while one first difference was applied, so d=1.
There is one significant seasonal spike in the ACF, suggesting Q=1.
There is one non-seasonal spike at lag 1 in the ACF, suggesting q=1.
Thus, I will try ARIMA(0,1,1)(0,0,1)[24].
water_arima <- Arima(water_ts, order=c(0,1,1), seasonal=c(0,0,1), lambda = BoxCox.lambda(water_ts))
summary(water_arima)
## Series: water_ts
## ARIMA(0,1,1)(0,0,1)[24]
## Box Cox transformation: lambda= 0.8713037
##
## Coefficients:
## ma1 sma1
## -0.9578 0.0712
## s.e. 0.0101 0.0319
##
## sigma^2 estimated as 103.3: log likelihood=-3734.4
## AIC=7474.8 AICc=7474.83 BIC=7489.52
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 0.1591549 16.33716 13.34589 -28.20783 50.26558 0.7507364
## ACF1
## Training set -0.04444759
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,1,1)(0,0,1)[24]
## Q* = 57.881, df = 46, p-value = 0.1124
##
## Model df: 2. Total lags used: 48
water_auto <- auto.arima(water_ts, approximation = FALSE, lambda=BoxCox.lambda(water_ts))
summary(water_auto)
## Series: water_ts
## ARIMA(0,1,1)(1,0,0)[24]
## Box Cox transformation: lambda= 0.8713037
##
## Coefficients:
## ma1 sar1
## -0.9578 0.0714
## s.e. 0.0101 0.0322
##
## sigma^2 estimated as 103.3: log likelihood=-3734.4
## AIC=7474.8 AICc=7474.82 BIC=7489.52
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 0.1613506 16.33729 13.34564 -28.19083 50.25213 0.750722
## ACF1
## Training set -0.04431537
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,1,1)(1,0,0)[24]
## Q* = 57.858, df = 46, p-value = 0.1128
##
## Model df: 2. Total lags used: 48
The ARIMA model found by auto.arima is ARIMA(0,1,1)(1,0,0)[24].
The ARIMA model I suggested is ARIMA(0,1,1)(0,0,1)[24].
The two models have nearly identical statistics, with an AICc difference of only about 0.01.
I will choose the auto.arima model for this timeseries.
5.4 Forecast
Forecast one week ahead of water usage, which is \(7 \times 24 = 168\) hours.
The timeseries lacks strong seasonality, so the long-horizon point forecast quickly flattens to a nearly constant level.
water_model <- water_auto # the chosen ARIMA(0,1,1)(1,0,0)[24] model
water_f <- forecast(water_model, h=7*24, level=95)
autoplot(water_f) +
  labs(title="Water Usage Forecast: ARIMA(0,1,1)(1,0,0)[24]", x="Day")

export <- water_f$mean
write.csv(export, "Waterflow624Forecast.csv")
df <- data.frame(water_f) %>% select(Point.Forecast)
rownames(df) <- NULL
df