DATA 624, Project 1

Bikash Bhowmik

March 30, 2025


Part A – ATM Forecast, ATM624Data.xlsx

In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars; other than that it is straightforward. I am being somewhat ambiguous on purpose to make this have a little more business feeling.
Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an Excel file; please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual .rmd file. Also please submit the forecast in an Excel readable file.

Import and Clean Data

First I need to import the data from Excel. By opening the file in Excel, formatting the DATE column as a date (5/1/2009), removing the time component, and saving the file, I am able to import the dates correctly.

atm <- read_excel("c:/Users/Bikash_Bhowmik/downloads/ATM624Data.xlsx")
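Alternatively, if the DATE column comes in as Excel serial numbers, it can be converted in R without editing the file by hand. A minimal sketch, assuming the serial numbers use the Windows Excel origin:

# A sketch, assuming DATE was read as Excel serial numbers; Windows
# Excel counts days from 1899-12-30
atm$DATE <- as.Date(as.numeric(atm$DATE), origin = "1899-12-30")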

We have 4 ATMs and need to move each ATM to a separate column. I see a few issues in the data that need to be dealt with:

- there are NA’s in the ATM field that need to be removed
- ATM3 has almost no values and is mostly zeros
- there is an outlier in ATM4’s data that is far above the norm
- there are 3 NA’s in the Cash field for ATM1 and 2 for ATM2

# Format DATE column 
atm$DATE <- lubridate::ymd(atm$DATE)

# Remove empty rows with no data
atm <- atm[complete.cases(atm), ]

# Plot data
ggplot(atm, aes(DATE, Cash)) + geom_line() + facet_grid(ATM~., scales="free") +
  labs(title="ATM Withdrawals", y="Hundreds of dollars", x="") +
  theme(panel.background = element_blank())

# Pivot each ATM machine into a separate column
atm <- atm %>% pivot_wider(names_from = ATM, values_from = Cash)
atm %>% select(-DATE) %>% summary()
      ATM1             ATM2             ATM3              ATM4          
 Min.   :  1.00   Min.   :  0.00   Min.   : 0.0000   Min.   :    1.563  
 1st Qu.: 73.00   1st Qu.: 25.50   1st Qu.: 0.0000   1st Qu.:  124.334  
 Median : 91.00   Median : 67.00   Median : 0.0000   Median :  403.839  
 Mean   : 83.89   Mean   : 62.58   Mean   : 0.7206   Mean   :  474.043  
 3rd Qu.:108.00   3rd Qu.: 93.00   3rd Qu.: 0.0000   3rd Qu.:  704.507  
 Max.   :180.00   Max.   :147.00   Max.   :96.0000   Max.   :10919.762  
 NA's   :3        NA's   :2                                             
# Replace NA's and Outlier
atm$ATM1 <- atm$ATM1 %>% replace(is.na(.), median(atm$ATM1, na.rm = TRUE))
atm$ATM2 <- atm$ATM2 %>% replace(is.na(.), median(atm$ATM2, na.rm = TRUE))
atm$ATM4[atm$ATM4==max(atm$ATM4)] <- median(atm$ATM4, na.rm = TRUE)

The NA’s in the ATM field turned out to be the dates that will be forecast with no data in the other columns, so they were removed.

The NA’s in ATM1 and ATM2’s data were imputed using the median since the number of missing values was so small.

ATM4’s outlier was also replaced with the median since it was so far above the norm, and the only abnormal data point in the entire ATM4 series, so it seemed likely to be an error.
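As an alternative to fixing these by hand, the tsclean() function from the forecast package (used again in Part B) imputes missing values and replaces outliers in one step. A minimal sketch for a single column, assuming the forecast package is loaded:

# A sketch: tsclean() imputes NA's and replaces outliers using
# STL-based interpolation; shown for ATM1 only
atm$ATM1 <- as.numeric(tsclean(ts(atm$ATM1, frequency = 7)))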

Cleaned up data

# Plot and summarize again after cleaning up 
atm2 <- atm %>% pivot_longer(cols = starts_with("ATM"), 
                             names_to = "ATM", values_to = "Cash") 
ggplot(atm2, aes(DATE, Cash)) + geom_line() + facet_grid(ATM~., scales="free") +
  labs(title="ATM Withdrawals", y="Hundreds of dollars", x="") +
  theme(panel.background = element_blank())

atm %>% select(-DATE) %>% summary()
      ATM1             ATM2            ATM3              ATM4         
 Min.   :  1.00   Min.   :  0.0   Min.   : 0.0000   Min.   :   1.563  
 1st Qu.: 73.00   1st Qu.: 26.0   1st Qu.: 0.0000   1st Qu.: 124.334  
 Median : 91.00   Median : 67.0   Median : 0.0000   Median : 403.839  
 Mean   : 83.95   Mean   : 62.6   Mean   : 0.7206   Mean   : 445.233  
 3rd Qu.:108.00   3rd Qu.: 93.0   3rd Qu.: 0.0000   3rd Qu.: 704.192  
 Max.   :180.00   Max.   :147.0   Max.   :96.0000   Max.   :1712.075  

ATMs 1 and 2 appear to have some weekly seasonality. ATM3, with only 3 non-zero observations, does not have enough data for a proper forecast, so a simple mean will be used. ATM4 appears to be white noise, but there may still be some seasonality that we will investigate further. There is no apparent trend in any of the series.

Convert to Time Series

Time series objects were created for each ATM since there seemed to be some weekly and possibly monthly seasonality in the plots of the data.

atm1.ts <- atm2 %>% filter(ATM=="ATM1") %>% select(Cash) %>%
  ts(start = 1, frequency = 7)
atm2.ts <- atm2 %>% filter(ATM=="ATM2") %>% select(Cash) %>%
  ts(start = 1, frequency = 7)
atm3.ts <- atm2 %>% filter(ATM=="ATM3") %>% select(Cash) %>%
  ts(start = 1, frequency = 2) # nominal frequency; ATM3 only gets a mean forecast
atm4.ts <- atm2 %>% filter(ATM=="ATM4") %>% select(Cash) %>%
  ts(start = 1, frequency = 7)

ATM1

For each of ATM1, ATM2, and ATM4, I want to try at least a Holt-Winters additive model with damped trend, since the seasonal variations are roughly constant through the series, plus ETS and ARIMA models to see if they perform better than the Holt-Winters model.

Before running any models I will check the ACF and PACF plots, and the ndiffs, nsdiffs, and BoxCox.lambda functions to see what they recommend for differencing and what type of model they suggest might be most appropriate.

ggtsdisplay(atm1.ts) 

ndiffs(atm1.ts)
[1] 0
nsdiffs(atm1.ts)
[1] 1
atm1.lambda <- BoxCox.lambda(atm1.ts)
atm1.lambda
[1] 0.2446101

For ATM1 no first-order differencing is recommended, only a first-order seasonal difference and a Box-Cox transformation with λ = 0.2446. Let’s plot the data again after these transformations are performed to see what impact they have.

 atm1.ts %>% BoxCox(atm1.lambda) %>% diff(lag=7) %>% ggtsdisplay()

The plot above shows that most of the seasonality has been eliminated and we are left with an almost stationary time series, although there are still spikes in the ACF plot at lag 7 and in the PACF plot at lags 7, 14, and 21. Adding a first-order difference could remove these as well.
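To see this, the extra first-order difference can be applied as a quick check (not used in the models fitted below):

# Seasonal difference plus an extra first-order difference; a quick
# check only, not part of the fitted models
atm1.ts %>% BoxCox(atm1.lambda) %>% diff(lag = 7) %>% diff() %>% ggtsdisplay()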

Holt-Winters

atm1.fit1 <- atm1.ts %>% hw(h=31, seasonal="additive", 
                           damped=TRUE, lambda = atm1.lambda)
autoplot(atm1.fit1) + theme(panel.background = element_blank())

kable(data.frame(accuracy(atm1.fit1)))
ME RMSE MAE MPE MAPE MASE ACF1
Training set 2.100449 25.21299 16.10613 -92.92647 111.5487 0.9096064 0.1149874
checkresiduals(atm1.fit1)


    Ljung-Box test

data:  Residuals from Damped Holt-Winters' additive method
Q* = 19.083, df = 14, p-value = 0.1618

Model df: 0.   Total lags used: 14

The residuals plot looks reasonable, and the Ljung-Box p-value of 0.16 means we cannot reject the null hypothesis that the residuals are white noise, although the ACF of the transformed data suggested some autocorrelation may remain. The forecast plot looks reasonable as well, although the prediction intervals extend well past what we have seen historically in the data.

ETS

atm1.fit2 <- atm1.ts %>% ets(model="ZZZ", lambda = atm1.lambda)

autoplot(atm1.fit2) + theme(panel.background = element_blank())

autoplot(forecast(atm1.fit2, h=31)) + theme(panel.background = element_blank())

kable(data.frame(accuracy(atm1.fit2)))
ME RMSE MAE MPE MAPE MASE ACF1
Training set 2.337381 25.04745 16.1103 -91.70147 110.6442 0.9098418 0.1213482
checkresiduals(atm1.fit2)


    Ljung-Box test

data:  Residuals from ETS(A,N,A)
Q* = 19.696, df = 14, p-value = 0.14

Model df: 0.   Total lags used: 14

The ETS model gave almost exactly the same results, with a slightly better RMSE but a slightly lower Ljung-Box p-value.

ARIMA

atm1.fit3 <- auto.arima(atm1.ts)
# as.character(atm1.fit3)
autoplot(forecast(atm1.fit3, h=31)) + theme(panel.background = element_blank())

kable(data.frame(accuracy(atm1.fit3)))
ME RMSE MAE MPE MAPE MASE ACF1
Training set -0.0736394 23.3332 14.60281 -102.7359 117.6027 0.8247052 -0.0081341
checkresiduals(atm1.fit3)


    Ljung-Box test

data:  Residuals from ARIMA(0,0,1)(0,1,2)[7]
Q* = 16.789, df = 11, p-value = 0.1143

Model df: 3.   Total lags used: 14

The ARIMA model resulted in the best fit, with the lowest RMSE and a Ljung-Box p-value that means we cannot reject the null hypothesis that the residuals are white noise. The plot of the forecast also looks like a more reasonable estimate of what we can expect based on the historical data.

results <- data.frame(rbind(accuracy(atm1.fit1), accuracy(atm1.fit2), accuracy(atm1.fit3)))
rownames(results) <- c("Holt-Winters", "ETS", "ARIMA(0,0,1)(0,1,2)[7]")

kable(results)
                               ME     RMSE      MAE        MPE     MAPE      MASE       ACF1
Holt-Winters            2.1004493 25.21299 16.10613  -92.92647 111.5487 0.9096064  0.1149874
ETS                     2.3373807 25.04745 16.11030  -91.70147 110.6442 0.9098418  0.1213482
ARIMA(0,0,1)(0,1,2)[7] -0.0736394 23.33320 14.60281 -102.73591 117.6027 0.8247052 -0.0081341
atm1.fc <- data.frame(forecast(atm1.fit3, h=31)) %>% remove_rownames()
write_csv(atm1.fc, "ATM1_Forecast.csv")

ATM2

ggtsdisplay(atm2.ts) 

ndiffs(atm2.ts)
[1] 1
nsdiffs(atm2.ts)
[1] 1
atm2.lambda <- BoxCox.lambda(atm2.ts)
atm2.lambda
[1] 0.7278107

For ATM2 both first-order differencing and seasonal differencing are recommended, along with a Box-Cox transformation with λ = 0.7278. Let’s plot the data again after these transformations are performed to see what impact they have.

I can see clear seasonality in the ACF and PACF plots before transformation (above) and after (below). Note the first-order difference was not applied because it resulted in a PACF with many spikes outside the critical values.

atm2.ts %>% BoxCox(atm2.lambda) %>% diff(lag=7) %>% ggtsdisplay()
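For reference, the rejected variant with the extra first-order difference can be displayed the same way:

# The rejected variant: seasonal plus first-order differencing; the
# PACF here showed many spikes outside the critical values
atm2.ts %>% BoxCox(atm2.lambda) %>% diff(lag = 7) %>% diff() %>% ggtsdisplay()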

Holt-Winters

atm2.fit1 <- atm2.ts %>% hw(h=31, seasonal="additive", damped=TRUE, 
                            lambda = atm2.lambda)
autoplot(atm2.fit1) + theme(panel.background = element_blank())

kable(data.frame(accuracy(atm2.fit1)))
ME RMSE MAE MPE MAPE MASE ACF1
Training set 0.6284115 25.40858 17.92711 -Inf Inf 0.8614639 0.0199519
checkresiduals(atm2.fit1)


    Ljung-Box test

data:  Residuals from Damped Holt-Winters' additive method
Q* = 33.37, df = 14, p-value = 0.002547

Model df: 0.   Total lags used: 14

ETS

atm2.fit2 <- atm2.ts %>% ets(model="ZZZ", lambda =atm2.lambda)

autoplot(atm2.fit2) + theme(panel.background = element_blank())

autoplot(forecast(atm2.fit2, h=31)) + theme(panel.background = element_blank())

kable(data.frame(accuracy(atm2.fit2)))
ME RMSE MAE MPE MAPE MASE ACF1
Training set 0.5166014 25.36411 17.84184 -Inf Inf 0.8573662 0.019386
checkresiduals(atm2.fit2)


    Ljung-Box test

data:  Residuals from ETS(A,N,A)
Q* = 33.355, df = 14, p-value = 0.00256

Model df: 0.   Total lags used: 14

ARIMA

atm2.fit3 <- auto.arima(atm2.ts, lambda = atm2.lambda)
autoplot(forecast(atm2.fit3, h=31)) + theme(panel.background = element_blank())

kable(data.frame(accuracy(atm2.fit3)))
ME RMSE MAE MPE MAPE MASE ACF1
Training set 1.44981 24.25163 17.12271 -Inf Inf 0.8228094 0.0058678
checkresiduals(atm2.fit3)


    Ljung-Box test

data:  Residuals from ARIMA(3,0,3)(0,1,1)[7] with drift
Q* = 8.859, df = 7, p-value = 0.2629

Model df: 7.   Total lags used: 14

ARIMA with first order differencing

Since the ndiffs function recommended first-order differencing but auto.arima did not include it in the model, let’s add it manually to see if we can improve the model.

# d = 1 in the order adds the first-order difference; differencing the
# series beforehand as well would difference it twice
atm2.fit4 <- Arima(atm2.ts, order=c(3,1,3), seasonal=c(0,1,1), lambda = atm2.lambda)
autoplot(forecast(atm2.fit4, h=31)) + theme(panel.background = element_blank())

kable(data.frame(accuracy(atm2.fit4)))
ME RMSE MAE MPE MAPE MASE ACF1
Training set 2.332289 28.9457 20.52568 NaN Inf 0.7141976 -0.0447085
checkresiduals(atm2.fit4)


    Ljung-Box test

data:  Residuals from ARIMA(3,1,3)(0,1,1)[7]
Q* = 24.274, df = 7, p-value = 0.001019

Model df: 7.   Total lags used: 14
results <- data.frame(rbind(accuracy(atm2.fit1), accuracy(atm2.fit2),
                            accuracy(atm2.fit3),accuracy(atm2.fit4)))
rownames(results) <- c("Holt-Winters", "ETS", "ARIMA(3,0,3)(0,1,1)[7]",
                       "ARIMA(3,1,3)(0,1,1)[7]")
kable(results)
                               ME     RMSE      MAE  MPE MAPE      MASE       ACF1
Holt-Winters            0.6284115 25.40858 17.92711 -Inf  Inf 0.8614639  0.0199519
ETS                     0.5166014 25.36411 17.84184 -Inf  Inf 0.8573662  0.0193860
ARIMA(3,0,3)(0,1,1)[7]  1.4498098 24.25163 17.12271 -Inf  Inf 0.8228094  0.0058678
ARIMA(3,1,3)(0,1,1)[7]  2.3322890 28.94570 20.52568  NaN  Inf 0.7141976 -0.0447085

The auto.arima model gave the lowest RMSE and is the only model here whose residuals pass the Ljung-Box test, so it will be used for the predictions.

atm2.fc <- data.frame(forecast(atm2.fit3, h=31)) %>% remove_rownames()
write_csv(atm2.fc, "ATM2_Forecast.csv")

ATM3

As mentioned earlier, there is not enough data to make a good forecast for ATM3, so a simple mean will be used until more data can be collected.

atm3.ts <- window(atm3.ts, start=182) # keep only the final non-zero observations
autoplot(atm3.ts) 

atm3.fit <- meanf(atm3.ts, h=31)
autoplot(atm3.fit)

atm3.fc <- data.frame(forecast(atm3.fit, h=31)) %>% remove_rownames()
write_csv(atm3.fc, "ATM3_Forecast.csv")

ATM4

The same procedure used for ATM1 and ATM2 is applied to ATM4.

ggtsdisplay(atm4.ts) 

ndiffs(atm4.ts)
[1] 0
nsdiffs(atm4.ts)
[1] 0
atm4.lambda <- BoxCox.lambda(atm4.ts)
atm4.lambda
[1] 0.4525697
atm4.ts %>% BoxCox(atm4.lambda) %>% ggtsdisplay()

Here no first-order or seasonal differencing was recommended, but a Box-Cox transformation with λ = 0.4526 was. Looking at the plots, however, it is not clear that the Box-Cox transformation improved the stationarity of the data; seasonal spikes are still apparent.
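A unit-root test can confirm that the series is stationary despite the visual ambiguity. A quick check, assuming the urca package (used again in Part B) is loaded:

# KPSS test on the Box-Cox transformed series; a test statistic below
# the 5% critical value is consistent with stationarity
atm4.ts %>% BoxCox(atm4.lambda) %>% ur.kpss() %>% summary()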

Holt-Winters

atm4.fit1 <- atm4.ts %>% hw(h=31, seasonal="additive", damped=TRUE, 
                           lambda = atm4.lambda)
autoplot(atm4.fit1) + theme(panel.background = element_blank())

kable(data.frame(accuracy(atm4.fit1)))
ME RMSE MAE MPE MAPE MASE ACF1
Training set 72.46451 340.8815 261.8729 -369.0055 413.0543 0.7559486 0.1006656
checkresiduals(atm4.fit1)


    Ljung-Box test

data:  Residuals from Damped Holt-Winters' additive method
Q* = 18.947, df = 14, p-value = 0.167

Model df: 0.   Total lags used: 14

ETS

atm4.fit2 <- atm4.ts %>% ets(model="ZZZ", lambda = atm4.lambda)
autoplot(atm4.fit2) + theme(panel.background = element_blank())

autoplot(forecast(atm4.fit2, h=31)) + theme(panel.background = element_blank())

kable(data.frame(accuracy(atm4.fit2)))
ME RMSE MAE MPE MAPE MASE ACF1
Training set 67.23932 338.2241 259.7873 -376.9491 420.7907 0.7499281 0.0973636
checkresiduals(atm4.fit2)


    Ljung-Box test

data:  Residuals from ETS(A,N,A)
Q* = 19.488, df = 14, p-value = 0.1471

Model df: 0.   Total lags used: 14

ARIMA

atm4.fit3 <- auto.arima(atm4.ts, seasonal = TRUE, lambda = atm4.lambda)
autoplot(forecast(atm4.fit3, h=31)) + theme(panel.background = element_blank())

kable(data.frame(accuracy(atm4.fit3)))
ME RMSE MAE MPE MAPE MASE ACF1
Training set 84.4142 351.8346 274.3411 -370.6976 415.1683 0.7919408 0.019962
checkresiduals(atm4.fit3)


    Ljung-Box test

data:  Residuals from ARIMA(0,0,1)(2,0,0)[7] with non-zero mean
Q* = 15.872, df = 11, p-value = 0.1459

Model df: 3.   Total lags used: 14
Since some seasonal autocorrelation may remain, let's also try a manually specified model with a seasonal difference and a high-order seasonal AR term.

atm4.fit4 <- Arima(atm4.ts, order=c(0,0,1), seasonal=c(14,1,0), 
                   lambda = atm4.lambda)
autoplot(forecast(atm4.fit4, h=31)) + theme(panel.background = element_blank())

kable(data.frame(accuracy(atm4.fit4)))
ME RMSE MAE MPE MAPE MASE ACF1
Training set 58.30159 334.008 251.7076 -339.9261 382.9995 0.7266044 0.0143578
checkresiduals(atm4.fit4)


    Ljung-Box test

data:  Residuals from ARIMA(0,0,1)(14,1,0)[7]
Q* = 16.453, df = 3, p-value = 0.0009157

Model df: 15.   Total lags used: 18
results <- data.frame(rbind(accuracy(atm4.fit1), accuracy(atm4.fit2),
                            accuracy(atm4.fit3),accuracy(atm4.fit4)))
rownames(results) <- c("Holt-Winters", "ETS", "ARIMA(0,0,1)(2,0,0)[7]",
                       "ARIMA(0,0,1)(14,1,0)[7]")
kable(results)
                                ME     RMSE      MAE       MPE     MAPE      MASE      ACF1
Holt-Winters              72.46451 340.8815 261.8729 -369.0055 413.0543 0.7559486 0.1006656
ETS                       67.23932 338.2241 259.7873 -376.9491 420.7907 0.7499281 0.0973636
ARIMA(0,0,1)(2,0,0)[7]    84.41420 351.8346 274.3411 -370.6976 415.1683 0.7919408 0.0199620
ARIMA(0,0,1)(14,1,0)[7]   58.30159 334.0080 251.7076 -339.9261 382.9995 0.7266044 0.0143578

Although the high-order ARIMA model eventually reached slightly better training accuracy, its residuals fail the Ljung-Box test (p ≈ 0.001) and it is far more complex, so the ETS model is preferred.
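Training accuracy can flatter complex models, so time-series cross-validation gives a fairer out-of-sample comparison. A sketch comparing the ETS specification against the simpler auto.arima specification (one-step errors; refitting the 14-term seasonal AR at every origin would be very slow):

# A sketch: one-step time-series cross-validation for the two candidate
# specifications; this refits a model at every forecast origin, so it
# can take a while to run
fets   <- function(x, h) forecast(ets(x, model = "ANA", lambda = atm4.lambda), h = h)
farima <- function(x, h) forecast(Arima(x, order = c(0, 0, 1),
                                        seasonal = c(2, 0, 0),
                                        lambda = atm4.lambda), h = h)
e1 <- tsCV(atm4.ts, fets, h = 1)
e2 <- tsCV(atm4.ts, farima, h = 1)
c(ETS = sqrt(mean(e1^2, na.rm = TRUE)), ARIMA = sqrt(mean(e2^2, na.rm = TRUE)))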

atm4.fc <- data.frame(forecast(atm4.fit2, h=31)) %>% remove_rownames()
write_csv(atm4.fc, "ATM4_Forecast.csv")
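If a single Excel workbook is preferred over the four CSV files, the forecasts can be combined into one file. A sketch, assuming the writexl package is installed:

# A sketch: one Excel-readable workbook with a sheet per ATM
library(writexl)
write_xlsx(list(ATM1 = atm1.fc, ATM2 = atm2.fc,
                ATM3 = atm3.fc, ATM4 = atm4.fc),
           "ATM_Forecasts.xlsx")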

Part B – Forecasting Power, ResidentialCustomerForecastLoad-624.xlsx

Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and create a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in kilowatt hours; the rest is straightforward.

Download and Load Excel files

Load the Excel file into R and clean it with the tsclean() function, which handles outliers and NA values:
kwh <- import("c:/Users/Bikash_Bhowmik/downloads/ResidentialCustomerForecastLoad-624.xlsx")

head(kwh)
  CaseSequence YYYY-MMM     KWH
1          733 1998-Jan 6862583
2          734 1998-Feb 5838198
3          735 1998-Mar 5420658
4          736 1998-Apr 5010364
5          737 1998-May 4665377
6          738 1998-Jun 6467147
kwh_ts <- ts(kwh[,"KWH"], start=c(1998,1), frequency=12) %>% 
  tsclean()
kwh_ts
          Jan      Feb      Mar      Apr      May      Jun      Jul      Aug
1998  6862583  5838198  5420658  5010364  4665377  6467147  8914755  8607428
1999  7183759  5759262  4847656  5306592  4426794  5500901  7444416  7564391
2000  7068296  5876083  4807961  4873080  5050891  7092865  6862662  7517830
2001  7538529  6602448  5779180  4835210  4787904  6283324  7855129  8450717
2002  7099063  6413429  5839514  5371604  5439166  5850383  7039702  8058748
2003  7256079  6190517  6120626  4885643  5296096  6051571  6900676  8476499
2004  7584596  6560742  6526586  4831688  4878262  6421614  7307931  7309774
2005  8225477  6564338  5581725  5563071  4453983  5900212  8337998  7786659
2006  7793358  5914945  5819734  5255988  4740588  7052275  7945564  8241110
2007  8031295  7928337  6443170  4841979  4862847  5022647  6426220  7447146
2008  7964293  7597060  6085644  5352359  4608528  6548439  7643987  8037137
2009  8072330  6976800  5691452  5531616  5264439  5804433  7713260  8350517
2010  9397357  8390677  7347915  5776131  4919289  6696292  7686742  7922701
2011  8394747  8898062  6356903  5685227  5506308  8037779 10093343 10308076
2012  8991267  7952204  6356961  5569828  5783598  7926956  8886851  9612423
2013 10655730  7681798  6517514  6105359  5940475  7920627  8415321  9080226
          Sep      Oct      Nov      Dec
1998  6989888  6345620  4640410  4693479
1999  7899368  5358314  4436269  4419229
2000  8912169  5844352  5041769  6220334
2001  7112069  5242535  4461979  5240995
2002  8245227  5865014  4908979  5779958
2003  7791791  5344613  4913707  5756193
2004  6690366  5444948  4824940  5791208
2005  7057213  6694523  4313019  6181548
2006  7296355  5104799  4458429  6226214
2007  7666970  5785964  4907057  6047292
2008  7436335  5101803  4555602  6442746
2009  7583146  5566075  5339890  7089880
2010  7819472  5875917  4800733  6152583
2011  8943599  5603920  6154138  8273142
2012  7559148  5576852  5731899  6609694
2013  7968220  5759367  5769083  6987874
summary(kwh[,"KWH"]) #before
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
  770523  5429912  6283324  6502475  7620524 10655730        1 
summary(kwh_ts)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
 4313019  5443502  6351262  6529723  7608792 10655730 

Timeseries

Plotting the time series, along with season and subseries plots:
autoplot(kwh_ts) +
  ggtitle("Monthly Power Usage", subtitle = "From Jan 1998 to Dec 2013 (KWH)") +
  xlab("Month") +
  ylab("Kilowatt hours (KWH)")

ggseasonplot(kwh_ts)

ggsubseriesplot(kwh_ts)

Seasonality is evident in this time series, with peaks roughly every 6 months. The seasonality is likely annual, driven by high power usage during both winter and summer.
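An STL decomposition makes the annual pattern easier to see. A quick sketch:

# STL decomposition separating trend and annual seasonality;
# s.window = "periodic" assumes a stable seasonal pattern
kwh_ts %>% stl(s.window = "periodic") %>% autoplot()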

ARIMA MODEL

The ACF and PACF below confirm the annual seasonality:
ggtsdisplay(kwh_ts, points = FALSE, main="Monthly Residential Power Usage ( From Jan 1998 to Dec 2013) - (KWH)", xlab = "Years", ylab = "Power Usage (kwh)")

Apply the Box-Cox transformation:

kwh_bc <- BoxCox(kwh_ts, lambda = BoxCox.lambda(kwh_ts))
ggtsdisplay(kwh_bc, main="kwh_ts with BoxCox Transformation")

The Box-Cox transformation makes no huge difference to the series. Next, apply a seasonal (lag-12) difference:
ggtsdisplay(diff(kwh_ts,12), points=FALSE)

kwh_ts %>% diff(.,12) %>% ur.kpss() %>% summary()

####################### 
# KPSS Unit Root Test # 
####################### 

Test is of type: mu with 4 lags. 

Value of test-statistic is: 0.1159 

Critical value for a significance level of: 
                10pct  5pct 2.5pct  1pct
critical values 0.347 0.463  0.574 0.739

The time series now appears to be stationary. In an ARIMA(p,d,q)(P,D,Q)[m] model, p/P are the orders of the autoregressive parts, d/D the degrees of differencing, q/Q the orders of the moving-average parts, and m the number of observations per seasonal period (here m = 12 for monthly data). The significant non-seasonal spike at lag 1 in both the ACF and PACF suggests non-seasonal p = q = 1.

The seasonal spike at lag 12 in the ACF and the spikes at lags 12 and 24 in the PACF suggest a seasonal MA(2) component, so P = 0 and Q = 2. I will try ARIMA(1,0,1)(0,1,2)[12].

kwh_arima <- Arima(kwh_ts, order=c(1,0,1), seasonal=c(0,1,2), lambda = BoxCox.lambda(kwh_ts))
summary(kwh_arima)
Series: kwh_ts 
ARIMA(1,0,1)(0,1,2)[12] 
Box Cox transformation: lambda= -0.1454175 

Coefficients:
         ar1      ma1     sma1    sma2
      0.9475  -0.8014  -0.8016  0.1738
s.e.  0.0838   0.1802   0.0933  0.0912

sigma^2 = 8.786e-05:  log likelihood = 585.93
AIC=-1161.86   AICc=-1161.51   BIC=-1145.89

Training set error measures:
                   ME     RMSE      MAE       MPE    MAPE      MASE      ACF1
Training set 86012.58 597771.6 463632.2 0.8088745 7.00649 0.7783202 0.1518374
checkresiduals(kwh_arima)


    Ljung-Box test

data:  Residuals from ARIMA(1,0,1)(0,1,2)[12]
Q* = 24.322, df = 20, p-value = 0.2286

Model df: 4.   Total lags used: 24

Let’s use auto.arima to find out which is the best ARIMA model:

kwh_auto <- auto.arima(kwh_ts, approximation = FALSE, lambda=BoxCox.lambda(kwh_ts))
checkresiduals(kwh_auto)


    Ljung-Box test

data:  Residuals from ARIMA(1,0,0)(0,1,1)[12] with drift
Q* = 25.402, df = 22, p-value = 0.2782

Model df: 2.   Total lags used: 24
summary(kwh_auto)
Series: kwh_ts 
ARIMA(1,0,0)(0,1,1)[12] with drift 
Box Cox transformation: lambda= -0.1454175 

Coefficients:
         ar1     sma1  drift
      0.2895  -0.7353  1e-04
s.e.  0.0724   0.0700  1e-04

sigma^2 = 8.429e-05:  log likelihood = 588.5
AIC=-1169   AICc=-1168.77   BIC=-1156.23

Training set error measures:
                   ME     RMSE      MAE        MPE     MAPE      MASE
Training set 20961.31 580774.8 445952.2 -0.2249688 6.812064 0.7486401
                   ACF1
Training set 0.05258332

The ARIMA model found by auto.arima is ARIMA(1,0,0)(0,1,1)[12]; the model I suggested is ARIMA(1,0,1)(0,1,2)[12]. According to the error measures and the residual plots, both models represent the data well, with similar AIC values, similar error measures, and similar p-values. Either could be applied, but the better ARIMA model is ARIMA(1,0,0)(0,1,1)[12] because it has the smaller AICc.
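The AICc values can be pulled straight from the fitted objects for a side-by-side check; both models were fit with the same Box-Cox lambda, so the values are comparable:

# Direct AICc comparison of the two candidates
c(manual = kwh_arima$aicc, auto = kwh_auto$aicc)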

Forecast

kwh_model <- Arima(kwh_ts, order=c(1,0,0), seasonal=c(0,1,1), lambda = BoxCox.lambda(kwh_ts))
kwh_f <- forecast(kwh_model, h=12, level=95)
kwh_f
         Point Forecast   Lo 95    Hi 95
Jan 2014        9396883 7762540 11437614
Feb 2014        7887475 6475039  9664189
Mar 2014        6435175 5306104  7848082
Apr 2014        5756155 4760036  6998419
May 2014        5570691 4610683  6766688
Jun 2014        7479517 6140217  9164205
Jul 2014        8521791 6969852 10482565
Aug 2014        9080655 7413299 11191835
Sep 2014        7911892 6484811  9710387
Oct 2014        5661786 4684004  6880566
Nov 2014        5568723 4609087  6764249
Dec 2014        6904666 5681062  8439739
ggtsdisplay(resid(kwh_model), points = FALSE, plot.type = "histogram", main = "Residuals for ARIMA(1,0,0)(0,1,1)[12] of power consumption", xlab = "Year", ylab = "Residual")

autoplot(kwh_f) + 
    labs(title = "Energy Forecast for 2014", x = "Month", y = "kWh") +
    theme(legend.position = "none")

export <- kwh_f$mean
write.csv(export,"kwh_Forecast.csv")

The forecast data is in the file ‘kwh_Forecast.csv’.

The forecast looks reasonable: it captures the seasonality of increased demand in the summer and winter while falling between the peaks and troughs of more recent years.

Part C – BONUS, optional (part or all), Waterflow_Pipe1.xlsx and Waterflow_Pipe2.xlsx

Part C consists of two data sets. These are simple 2-column sets; however, they have different time stamps. Your optional assignment is to time-base sequence the data and aggregate based on hour (an example of what this looks like follows). Note: for multiple recordings within an hour, take the mean. Then determine if the data is stationary and whether it can be forecast. If so, provide a week-forward forecast and present results via RPubs and .rmd, with the forecast in an Excel readable file.

Load the Excel files:

wfp1 <- import("c:/Users/Bikash_Bhowmik/downloads/Waterflow_Pipe1.xlsx", col_types = c("date", "numeric"))
wfp2 <- import("c:/Users/Bikash_Bhowmik/downloads/Waterflow_Pipe2.xlsx", col_types = c("date", "numeric"))
colnames(wfp1) <- c("DateTime", "WaterFlow") 
colnames(wfp2) <- c("DateTime", "WaterFlow") 

wfp1 <- wfp1 %>% mutate(Date = as.Date(DateTime), Time = hour(DateTime)+1) %>% 
                  group_by(Date, Time) %>%
                  summarise(Water=mean(WaterFlow)) %>%
                  ungroup() %>%
                  mutate(DateTime=ymd_h(paste(Date,Time))) %>%
                  select(DateTime,Water)
wfp1
# A tibble: 236 × 2
   DateTime            Water
   <dttm>              <dbl>
 1 2015-10-23 01:00:00  26.1
 2 2015-10-23 02:00:00  18.9
 3 2015-10-23 03:00:00  15.2
 4 2015-10-23 04:00:00  23.1
 5 2015-10-23 05:00:00  15.5
 6 2015-10-23 06:00:00  22.7
 7 2015-10-23 07:00:00  20.6
 8 2015-10-23 08:00:00  18.4
 9 2015-10-23 09:00:00  20.8
10 2015-10-23 10:00:00  21.2
# ℹ 226 more rows
wfp2 <- wfp2 %>% mutate(Date = as.Date(DateTime), Time = hour(DateTime)) %>%
                  group_by(Date, Time) %>%
                  summarise(Water=mean(WaterFlow)) %>%
                  ungroup() %>%
                  mutate(DateTime=ymd_h(paste(Date,Time))) %>%
                  select(DateTime, Water)
wfp2
# A tibble: 1,000 × 2
   DateTime            Water
   <dttm>              <dbl>
 1 2015-10-23 01:00:00 18.8 
 2 2015-10-23 02:00:00 43.1 
 3 2015-10-23 03:00:00 38.0 
 4 2015-10-23 04:00:00 36.1 
 5 2015-10-23 05:00:00 31.9 
 6 2015-10-23 06:00:00 28.2 
 7 2015-10-23 07:00:00  9.86
 8 2015-10-23 08:00:00 26.7 
 9 2015-10-23 09:00:00 55.8 
10 2015-10-23 10:00:00 54.2 
# ℹ 990 more rows

Timeseries

Combining the two waterflows into one:

water <- full_join(wfp1, wfp2, by="DateTime", suffix=c("_1", "_2")) %>%
  mutate(Water_1=ifelse(is.na(Water_1), 0, Water_1)) %>%
  mutate(Water_2=ifelse(is.na(Water_2), 0, Water_2)) %>%
  mutate(Water = Water_1 + Water_2) %>%
  select(DateTime, Water)
water
# A tibble: 1,000 × 2
   DateTime            Water
   <dttm>              <dbl>
 1 2015-10-23 01:00:00  44.9
 2 2015-10-23 02:00:00  61.9
 3 2015-10-23 03:00:00  53.1
 4 2015-10-23 04:00:00  59.2
 5 2015-10-23 05:00:00  47.3
 6 2015-10-23 06:00:00  51.0
 7 2015-10-23 07:00:00  30.5
 8 2015-10-23 08:00:00  45.0
 9 2015-10-23 09:00:00  76.6
10 2015-10-23 10:00:00  75.4
# ℹ 990 more rows
water_ts <- ts(water$Water, frequency=24)
ggseasonplot(water_ts) + theme(legend.title = element_blank())

ggsubseriesplot(water_ts)

We cannot see significant seasonality in ‘water_ts’; however, there is a slightly decreasing trend. It is a non-stationary time series.

ggtsdisplay(water_ts, main="Daily Waterflow")

ARIMA Model

Box-Cox transformation:

water_bc <- BoxCox(water_ts, lambda = BoxCox.lambda(water_ts))
ggtsdisplay(water_bc, main="Water_ts with BoxCox Transformation")

Trend differencing is needed.

ndiffs(water_ts)
[1] 1
nsdiffs(water_ts)
[1] 0
ggtsdisplay(diff(water_bc), points=FALSE, main="Difference water_ts with BoxCox Transformation")

water_bc %>% diff() %>% ur.kpss() %>% summary()

####################### 
# KPSS Unit Root Test # 
####################### 

Test is of type: mu with 7 lags. 

Value of test-statistic is: 0.0091 

Critical value for a significance level of: 
                10pct  5pct 2.5pct  1pct
critical values 0.347 0.463  0.574 0.739

The time series now appears to be stationary. No seasonal differencing was needed, so D = 0, while ndiffs suggests d = 1. The single seasonal spike in the ACF suggests Q = 1, and the single non-seasonal spike at lag 1 in the ACF suggests q = 1. I will try ARIMA(0,1,1)(0,0,1)[24].

water_arima <- Arima(water_ts, order=c(0,1,1), seasonal=c(0,0,1), lambda = BoxCox.lambda(water_ts))
summary(water_arima)
Series: water_ts 
ARIMA(0,1,1)(0,0,1)[24] 
Box Cox transformation: lambda= 0.8713037 

Coefficients:
          ma1    sma1
      -0.9578  0.0712
s.e.   0.0101  0.0319

sigma^2 = 103.3:  log likelihood = -3734.4
AIC=7474.8   AICc=7474.83   BIC=7489.52

Training set error measures:
                    ME     RMSE      MAE       MPE     MAPE      MASE
Training set 0.1591549 16.33716 13.34589 -28.20783 50.26558 0.7507364
                    ACF1
Training set -0.04444759
checkresiduals(water_arima)


    Ljung-Box test

data:  Residuals from ARIMA(0,1,1)(0,0,1)[24]
Q* = 57.881, df = 46, p-value = 0.1124

Model df: 2.   Total lags used: 48

Now, let’s use auto.arima to find out which is the best Arima model:

water_auto <- auto.arima(water_ts, approximation = FALSE, lambda=BoxCox.lambda(water_ts))
summary(water_auto)
Series: water_ts 
ARIMA(0,1,1)(1,0,0)[24] 
Box Cox transformation: lambda= 0.8713037 

Coefficients:
          ma1    sar1
      -0.9578  0.0714
s.e.   0.0101  0.0322

sigma^2 = 103.3:  log likelihood = -3734.4
AIC=7474.8   AICc=7474.82   BIC=7489.52

Training set error measures:
                    ME     RMSE      MAE       MPE     MAPE     MASE
Training set 0.1613506 16.33729 13.34564 -28.19083 50.25213 0.750722
                    ACF1
Training set -0.04431537
checkresiduals(water_auto)


    Ljung-Box test

data:  Residuals from ARIMA(0,1,1)(1,0,0)[24]
Q* = 57.858, df = 46, p-value = 0.1128

Model df: 2.   Total lags used: 48
water_model <- Arima(water_ts, order=c(0,1,1), seasonal=c(1,0,0), lambda = BoxCox.lambda(water_ts))

Forecast

The forecast covers one week of hourly water use: 7 days × 24 hours = 168 periods.

water_f <- forecast(water_model, h=7*24, level=95)
autoplot(water_f) +
  labs(title="Water Usage Forecast", subtitle = "ARIMA(0,1,1)(1,0,0)[24]", x="Day")

export <- water_f$mean
write.csv(export, "Waterflow_Forecast.csv")
df <- data.frame(water_f) %>% select(Point.Forecast)
rownames(df) <- NULL
df
    Point.Forecast
1         44.13862
2         45.16903
3         46.07068
4         44.27144
5         45.44000
6         44.89509
7         42.95309
8         44.67901
9         46.80433
10        44.92962
11        46.40268
12        45.53528
13        45.54120
14        43.38716
15        44.16368
16        42.46275
17        43.26947
18        44.53708
19        46.52640
20        43.60654
21        46.10964
22        44.44325
23        43.74894
24        46.10043
25        44.52937
26        44.60297
27        44.66721
28        44.53887
29        44.62229
30        44.58342
31        44.44443
32        44.56799
33        44.71937
34        44.58589
35        44.69083
36        44.62908
37        44.62950
38        44.47556
39        44.53116
40        44.40922
41        44.46713
42        44.55785
43        44.69962
44        44.49128
45        44.66998
46        44.55115
47        44.50148
48        44.66933
49        44.55730
50        44.56256
51        44.56715
52        44.55798
53        44.56394
54        44.56117
55        44.55123
56        44.56006
57        44.57088
58        44.56134
59        44.56884
60        44.56443
61        44.56446
62        44.55346
63        44.55743
64        44.54872
65        44.55286
66        44.55934
67        44.56947
68        44.55458
69        44.56735
70        44.55886
71        44.55531
72        44.56730
73        44.55930
74        44.55967
75        44.56000
76        44.55935
77        44.55977
78        44.55958
79        44.55887
80        44.55950
81        44.56027
82        44.55959
83        44.56012
84        44.55981
85        44.55981
86        44.55902
87        44.55931
88        44.55869
89        44.55898
90        44.55944
91        44.56017
92        44.55910
93        44.56002
94        44.55941
95        44.55916
96        44.56001
97        44.55944
98        44.55947
99        44.55949
100       44.55945
101       44.55948
102       44.55946
103       44.55941
104       44.55946
105       44.55951
106       44.55946
107       44.55950
108       44.55948
109       44.55948
110       44.55942
111       44.55944
112       44.55940
113       44.55942
114       44.55945
115       44.55950
116       44.55943
117       44.55949
118       44.55945
119       44.55943
120       44.55949
121       44.55945
122       44.55945
123       44.55946
124       44.55945
125       44.55945
126       44.55945
127       44.55945
128       44.55945
129       44.55946
130       44.55945
131       44.55946
132       44.55945
133       44.55945
134       44.55945
135       44.55945
136       44.55945
137       44.55945
138       44.55945
139       44.55946
140       44.55945
141       44.55946
142       44.55945
143       44.55945
144       44.55946
145       44.55945
146       44.55945
147       44.55945
148       44.55945
149       44.55945
150       44.55945
151       44.55945
152       44.55945
153       44.55945
154       44.55945
155       44.55945
156       44.55945
157       44.55945
158       44.55945
159       44.55945
160       44.55945
161       44.55945
162       44.55945
163       44.55945
164       44.55945
165       44.55945
166       44.55945
167       44.55945
168       44.55945

The forecast data is in the file ‘Waterflow_Forecast.csv’.
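Since ‘water_f$mean’ drops the time stamps, a variant that writes the forecast with its hourly timestamps may be more useful. A sketch, assuming the last observed hour in ‘water’ is the forecast origin:

# A sketch: attach hourly timestamps to the 168 point forecasts so the
# exported file is self-describing; hours() is from lubridate
stamps <- max(water$DateTime) + hours(1:168)
write.csv(data.frame(DateTime = stamps, Forecast = as.numeric(water_f$mean)),
          "Waterflow_Forecast_timestamped.csv", row.names = FALSE)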