In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars; other than that it is straightforward. I am being somewhat ambiguous on purpose, to give this more of a business feel. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an Excel file. Please provide your written report on your findings, visuals, discussion, and your R code via an RPubs link along with the actual .rmd file. Also please submit the forecast in an Excel-readable file.
As noted in Part B, after looking at the data for both parts, I decided to do Part B first, as its data seemed more straightforward. That let me work out a ‘process’. Now circling back to Part A, I will use the same workflow, with extra steps to accommodate the data challenges here.
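(A note on setup: the code throughout assumes the following packages are loaded. This list is reconstructed from the functions used in this report rather than copied from an original setup chunk, so adjust as needed.)
library(readxl)    # read_excel()
library(fpp3)      # tsibble, fable, feasts, plus dplyr, ggplot2, lubridate
library(mice)      # mice(), complete()
library(ggpubr)    # ggarrange()
library(writexl)   # write_xlsx()
library(seasonal)  # backend needed for X_13ARIMA_SEATS() in Part B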
ATM624Data <- read_excel("ATM624Data.xlsx")
View(ATM624Data)
At first glance, there is an anomaly in ATM4 and some NAs in the ATM field. The rest of the data is hard to see because ATM4 dominates the scale.
atm <- ATM624Data
# DATE was read in as an Excel serial number; convert using Excel's origin
atm$DATE <- as.Date(atm$DATE, origin = "1899-12-30")
atm <- as_tsibble(atm, key = ATM)
suppressMessages(autoplot(atm, Cash))
Next, each ATM is plotted separately for a better view, using the same colors as above for consistency. ATM1 and ATM2 look OK, and ATM4 has the anomaly noted above. It is now also visible that ATM3 is 0 on most days.
Note that for these plots the axis titles and x-axis text were removed: they are the same across panels, as shown in the first co-mingled plot, and they distracted from the data.
a1 <- atm |>
  filter(ATM == 'ATM1') |>
  autoplot(Cash) +
  geom_line(color = "#F8766D") +
  theme(axis.text.x = element_blank(),
        axis.title = element_blank()) +
  ggtitle("ATM1")
a2 <- atm |>
  filter(ATM == 'ATM2') |>
  autoplot(Cash) +
  geom_line(color = "#7CAE00") +
  theme(axis.text.x = element_blank(),
        axis.title = element_blank()) +
  ggtitle("ATM2")
a3 <- atm |>
  filter(ATM == 'ATM3') |>
  autoplot(Cash) +
  geom_line(color = "#00BFC4") +
  theme(axis.text.x = element_blank(),
        axis.title = element_blank()) +
  ggtitle("ATM3")
a4 <- atm |>
  filter(ATM == 'ATM4') |>
  autoplot(Cash) +
  geom_line(color = "#CA82FF") +
  theme(axis.text.x = element_blank(),
        axis.title = element_blank()) +
  ggtitle("ATM4")
ggarrange(a1, a2, a3, a4)
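As an aside, the four near-identical blocks above (and their two repeats later) could be collapsed into a small helper. A minimal sketch; plot_atm is an illustrative name, not something from the original code:
plot_atm <- function(data, atm_id, color) {
  data |>
    filter(ATM == atm_id) |>
    autoplot(Cash) +
    geom_line(color = color) +
    theme(axis.text.x = element_blank(),
          axis.title = element_blank()) +
    ggtitle(atm_id)
}
cols <- c(ATM1 = "#F8766D", ATM2 = "#7CAE00",
          ATM3 = "#00BFC4", ATM4 = "#CA82FF")
ggarrange(plotlist = lapply(names(cols),
                            function(a) plot_atm(atm, a, cols[[a]])))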
I wondered whether there was a relationship between the ATM=NA rows and the ATM3 rows where Cash=0, but with only 14 ATM=NA rows that is unlikely. The dates on those 14 rows fall within the month we are meant to forecast, and all of them have NA for Cash. Since each of the other ATMs has a full 365 rows, nothing is missing from the actual series, so it is safe to simply remove the ATM=NA rows.
rows<- as.data.frame(atm)
rows |>
group_by(ATM) |>
summarise(Count= n(),.groups = "drop")
## # A tibble: 5 × 2
## ATM Count
## <chr> <int>
## 1 ATM1 365
## 2 ATM2 365
## 3 ATM3 365
## 4 ATM4 365
## 5 <NA> 14
atm <- atm |>
filter(!is.na(ATM))
rows<- as.data.frame(atm)
rows |>
group_by(ATM) |>
summarise(Count= n(),.groups = "drop")
## # A tibble: 4 × 2
## ATM Count
## <chr> <int>
## 1 ATM1 365
## 2 ATM2 365
## 3 ATM3 365
## 4 ATM4 365
Using the same method as in Part B, the anomalous value will be set to NA and then imputed with mice. Conveniently, the mice() imputation step will also address the native Cash=NA rows (5 in total, where ATM is ATM1 or ATM2). However, the first attempt at this introduced an anomaly into ATM1: mice() does not separate out the 4 ATMs, and with ATM4 on a significantly different scale, ATM4-sized values were donated to ATM1. So the extreme ATM4 value (any Cash of 3000 or more) is set to NA first, and then mice is used to replace the NAs. (The scale issue itself is dealt with in the step after this one.)
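As an aside, another way to guard against this cross-ATM contamination would be to run mice separately per ATM, so that series on very different scales cannot donate values to each other. A rough sketch under that assumption (impute_by_atm is an illustrative name; the date is converted to numeric so pmm can use it as a predictor):
impute_by_atm <- function(df) {
  df |>
    as.data.frame() |>
    group_split(ATM) |>
    lapply(function(g) {
      # impute each ATM's Cash using only that ATM's own values
      tmp <- data.frame(t = as.numeric(g$DATE), Cash = g$Cash)
      imp <- mice(tmp, m = 5, method = 'pmm', seed = 888, printFlag = FALSE)
      g$Cash <- complete(imp)$Cash
      g
    }) |>
    bind_rows()
}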
atm <- atm |>
  mutate(Cash = ifelse(Cash < 3000, Cash, NA))  # NA out the extreme ATM4 value
tmpdf2 <- as.data.frame(atm)
impute2 <- mice(tmpdf2, m = 5, method = 'pmm', seed = 888)
##
## iter imp variable
## 1 1 Cash
## 1 2 Cash
## 1 3 Cash
## 1 4 Cash
## 1 5 Cash
## 2 1 Cash
## 2 2 Cash
## 2 3 Cash
## 2 4 Cash
## 2 5 Cash
## 3 1 Cash
## 3 2 Cash
## 3 3 Cash
## 3 4 Cash
## 3 5 Cash
## 4 1 Cash
## 4 2 Cash
## 4 3 Cash
## 4 4 Cash
## 4 5 Cash
## 5 1 Cash
## 5 2 Cash
## 5 3 Cash
## 5 4 Cash
## 5 5 Cash
cmplt2 <- complete(impute2)
atm2 <- as_tsibble(cmplt2, index = DATE, key=ATM)
a5 <- atm2 |>
  filter(ATM == 'ATM1') |>
  autoplot(Cash) +
  geom_line(color = "#F8766D") +
  theme(axis.text.x = element_blank(),
        axis.title = element_blank()) +
  ggtitle("ATM1")
a6 <- atm2 |>
  filter(ATM == 'ATM2') |>
  autoplot(Cash) +
  geom_line(color = "#7CAE00") +
  theme(axis.text.x = element_blank(),
        axis.title = element_blank()) +
  ggtitle("ATM2")
a7 <- atm2 |>
  filter(ATM == 'ATM3') |>
  autoplot(Cash) +
  geom_line(color = "#00BFC4") +
  theme(axis.text.x = element_blank(),
        axis.title = element_blank()) +
  ggtitle("ATM3")
a8 <- atm2 |>
  filter(ATM == 'ATM4') |>
  autoplot(Cash) +
  geom_line(color = "#CA82FF") +
  theme(axis.text.x = element_blank(),
        axis.title = element_blank()) +
  ggtitle("ATM4")
ggarrange(a5, a6, a7, a8)
Noting that ATM4’s Cash ranges from 0 to almost 1750, where the others go from 0 to almost 175. This looks like a scale error, especially since Cash is meant to be in hundreds of dollars: that would put ATM4’s upper boundary around $175,000, which seems far too high. ATM4 is also the only ATM with decimal values. In real life, one would go back to the source and validate any assumption of a data error, but here the assumption is made that ATM4’s values should be divided by 10 and truncated to whole numbers.
So first, divide all ATM4 Cash values by 10 and truncate to 0 decimals (i.e., make them integers). Then use mice to impute, and plot the results. After this, ATM1, ATM2 and ATM4 all look good.
# Note: the ATM4 anomaly was already set to NA by this code (from v1 above),
# so it is left commented out here:
# atm <- atm |>
#   mutate(Cash = ifelse(Cash < 3000, Cash, NA))
atm <- atm |>
  mutate(Cash = ifelse(ATM == 'ATM4',
                       as.integer(Cash / 10), Cash))
tmpdf3 <- as.data.frame(atm)
impute3 <- mice(tmpdf3, m = 5, method = 'pmm', seed = 888)
##
## iter imp variable
## 1 1 Cash
## 1 2 Cash
## 1 3 Cash
## 1 4 Cash
## 1 5 Cash
## 2 1 Cash
## 2 2 Cash
## 2 3 Cash
## 2 4 Cash
## 2 5 Cash
## 3 1 Cash
## 3 2 Cash
## 3 3 Cash
## 3 4 Cash
## 3 5 Cash
## 4 1 Cash
## 4 2 Cash
## 4 3 Cash
## 4 4 Cash
## 4 5 Cash
## 5 1 Cash
## 5 2 Cash
## 5 3 Cash
## 5 4 Cash
## 5 5 Cash
cmplt3 <- complete(impute3)
atm3 <- as_tsibble(cmplt3, index = DATE, key=ATM)
a9 <- atm3 |>
  filter(ATM == 'ATM1') |>
  autoplot(Cash) +
  geom_line(color = "#F8766D") +
  theme(axis.text.x = element_blank(),
        axis.title = element_blank()) +
  ggtitle("ATM1")
a10 <- atm3 |>
  filter(ATM == 'ATM2') |>
  autoplot(Cash) +
  geom_line(color = "#7CAE00") +
  theme(axis.text.x = element_blank(),
        axis.title = element_blank()) +
  ggtitle("ATM2")
a11 <- atm3 |>
  filter(ATM == 'ATM3') |>
  autoplot(Cash) +
  geom_line(color = "#00BFC4") +
  theme(axis.text.x = element_blank(),
        axis.title = element_blank()) +
  ggtitle("ATM3")
a12 <- atm3 |>
  filter(ATM == 'ATM4') |>
  autoplot(Cash) +
  geom_line(color = "#CA82FF") +
  theme(axis.text.x = element_blank(),
        axis.title = element_blank()) +
  ggtitle("ATM4")
ggarrange(a9, a10, a11, a12)
ATM3 has only 3 non-zero values (the last 3 days of the year provided). It may have been out of service most of the year, not loaded with cash, or newly installed. Whatever the cause, there is no value in forecasting a month’s worth of data from 3 days, so ATM3 will be removed from the data.
atm3 <- atm3 |>
filter(ATM != 'ATM3')
Note that the legend was removed so the plot could be seen more clearly: ATM1 = red, ATM2 = green, ATM4 = blue.
The trend is pretty flat, except that ATM2 (green) looks more erratic over the last two months. The remainder also looks different during that same period for both ATM1 (red) and ATM2 (green).
The weekly seasonality is very consistent for all 3 ATMs.
Since the seasonality is extremely regular and the trend is flat (except for ATM2), the X-11 decomposition will be skipped.
a <- atm3 |>
  model(
    STL(Cash ~ trend(window = 7) +       # 7 for weekly
          season(window = "periodic"),
        robust = TRUE)
  ) |>
  components()
autoplot(a) +
  theme(legend.position = "none")
Now compare seasonal ETS and ARIMA models. This is seasonal data, only ATM2 shows any trend, and the variance is stable over time, so no Box-Cox transformation is needed.
For all 3 ATMs, an ETS(A,N,A) model is chosen.
fit_ets_a <- atm3 |>
model(ETS(Cash))
fit_ets_a |>
filter (ATM =='ATM1')|>
report()
## Series: Cash
## Model: ETS(A,N,A)
## Smoothing parameters:
## alpha = 0.01106574
## gamma = 0.297658
##
## Initial states:
## l[0] s[0] s[-1] s[-2] s[-3] s[-4] s[-5] s[-6]
## 78.86062 -65.44887 1.723305 10.89606 -0.7946538 21.58535 15.52393 16.51488
##
## sigma^2: 647.9359
##
## AIC AICc BIC
## 4527.284 4527.905 4566.283
fit_ets_a |>
filter (ATM =='ATM2')|>
report()
## Series: Cash
## Model: ETS(A,N,A)
## Smoothing parameters:
## alpha = 0.0001000487
## gamma = 0.3552542
##
## Initial states:
## l[0] s[0] s[-1] s[-2] s[-3] s[-4] s[-5] s[-6]
## 74.18287 -35.7908 -21.83354 20.22 -7.333659 2.716864 17.17553 24.84561
##
## sigma^2: 656.3285
##
## AIC AICc BIC
## 4531.981 4532.603 4570.980
fit_ets_a |>
filter (ATM =='ATM4')|>
report()
## Series: Cash
## Model: ETS(A,N,A)
## Smoothing parameters:
## alpha = 0.0001000888
## gamma = 0.0001000098
##
## Initial states:
## l[0] s[0] s[-1] s[-2] s[-3] s[-4] s[-5] s[-6]
## 44.36583 -26.77297 -3.552716 2.258881 4.470791 8.900272 3.900033 10.79571
##
## sigma^2: 1109.365
##
## AIC AICc BIC
## 4723.563 4724.184 4762.562
Residual diagnostics for each ETS fit (ATM1, ATM2, ATM4):
fit_ets_a |>
filter (ATM =='ATM1')|>
gg_tsresiduals() +
labs(title = "ETS: ATM1")
fit_ets_a |>
filter (ATM =='ATM2')|>
gg_tsresiduals() +
labs(title = "ETS: ATM2")
fit_ets_a |>
filter (ATM =='ATM4')|>
gg_tsresiduals() +
labs(title = "ETS: ATM4")
The Ljung-Box p-value varies by ATM: ATM1 and ATM4 are above 0.05 (residuals consistent with white noise), but ATM2 is well below 0.05, so its ETS residuals still contain autocorrelation:
augment(fit_ets_a) |>
features(.innov, ljung_box, lag = 24) |>
select( ATM, lb_pvalue)
## # A tibble: 3 × 2
## ATM lb_pvalue
## <chr> <dbl>
## 1 ATM1 0.163
## 2 ATM2 0.000512
## 3 ATM4 0.521
Different models were chosen for each of the ATMs:
ATM1: ARIMA(0,0,1)(0,1,2)[7]
ATM2: ARIMA(2,0,2)(0,1,1)[7]
ATM4: ARIMA(3,0,2)(1,0,0)[7] w/ mean
fit_arima_a <- atm3 |>
model(ARIMA(Cash))
fit_arima_a |>
filter (ATM =='ATM1')|>
report()
## Series: Cash
## Model: ARIMA(0,0,1)(0,1,2)[7]
##
## Coefficients:
## ma1 sma1 sma2
## 0.1188 -0.6332 -0.0723
## s.e. 0.0550 0.0516 0.0509
##
## sigma^2 estimated as 636.5: log likelihood=-1664.34
## AIC=3336.68 AICc=3336.79 BIC=3352.2
fit_arima_a |>
filter (ATM =='ATM2')|>
report()
## Series: Cash
## Model: ARIMA(2,0,2)(0,1,1)[7]
##
## Coefficients:
## ar1 ar2 ma1 ma2 sma1
## -0.4315 -0.9183 0.4711 0.8153 -0.7552
## s.e. 0.0547 0.0395 0.0857 0.0543 0.0385
##
## sigma^2 estimated as 613.4: log likelihood=-1656.88
## AIC=3325.75 AICc=3325.99 BIC=3349.04
fit_arima_a |>
filter (ATM =='ATM4')|>
report()
## Series: Cash
## Model: ARIMA(3,0,2)(1,0,0)[7] w/ mean
##
## Coefficients:
## ar1 ar2 ar3 ma1 ma2 sar1 constant
## -0.9107 -0.6704 0.1577 0.9891 0.7808 0.2462 80.4593
## s.e. 0.1146 0.1299 0.0564 0.1068 0.1229 0.0553 4.9023
##
## sigma^2 estimated as 1181: log likelihood=-1805.66
## AIC=3627.32 AICc=3627.72 BIC=3658.52
Residual diagnostics for each ARIMA fit (ATM1, ATM2, ATM4):
fit_arima_a |>
filter (ATM =='ATM1')|>
gg_tsresiduals() +
labs(title = "ARIMA: ATM1")
fit_arima_a |>
filter (ATM =='ATM2')|>
gg_tsresiduals() +
labs(title = "ARIMA: ATM2")
fit_arima_a |>
filter (ATM =='ATM4')|>
gg_tsresiduals() +
labs(title = "ARIMA: ATM4")
The Ljung-Box p-values all exceed 0.05, so all three ARIMA models meet the standard for a good model: ATM1 at 0.383, ATM2 at 0.475, and ATM4 at 0.942.
augment(fit_arima_a) |>
features(.innov, ljung_box, lag = 24) |>
select( ATM, lb_pvalue)
## # A tibble: 3 × 2
## ATM lb_pvalue
## <chr> <dbl>
## 1 ATM1 0.383
## 2 ATM2 0.475
## 3 ATM4 0.942
ATM1 and ATM2 look good here, but the ATM4 ARIMA forecast does not seem to capture any of the seasonality.
fit_arima_a |>
  forecast(h = 31) |>
  autoplot(atm3)   # plot over the cleaned data (atm3), not the raw atm
The ETS model for ATM4 looks much better.
p1 <- fit_ets_a |>
  filter(ATM == 'ATM4') |>
  forecast(h = 31) |>
  autoplot(atm3) +
  labs(title = "ETS: ATM4") +
  theme(axis.text.x = element_blank(),
        axis.title = element_blank(),
        legend.position = "none")
p1
ATM1 & ATM2 will use the ARIMA and ATM4 will use the ETS.
p2 <- fit_arima_a |>
  filter(ATM == 'ATM1') |>
  forecast(h = 31) |>
  autoplot(atm3) +
  labs(title = "ARIMA: ATM1") +
  theme(axis.text.x = element_blank(),
        axis.title = element_blank(),
        legend.position = "none")
p3 <- fit_arima_a |>
  filter(ATM == 'ATM2') |>
  forecast(h = 31) |>
  autoplot(atm3) +
  labs(title = "ARIMA: ATM2") +
  theme(axis.text.x = element_blank(),
        axis.title = element_blank(),
        legend.position = "none")
ggarrange(p2,p3,p1)
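To make the choice less purely visual, the in-sample errors of the two model classes could also be put side by side. A quick sketch using fabletools::accuracy() on the fitted models; this comparison was not part of the original analysis:
bind_rows(accuracy(fit_ets_a), accuracy(fit_arima_a)) |>
  select(ATM, .model, RMSE, MAE) |>
  arrange(ATM, RMSE)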
Exporting the forecast data to Excel. Note that the ATM4 forecast must come from the ETS fit, matching the model choice above.
outputA1 <- fit_arima_a |>
  forecast(h = 31) |>
  filter(ATM != 'ATM4')
outputA1 <- as_tibble(outputA1) |>
  select(ATM, DATE, .mean) |>
  rename(Cash = .mean)
outputA2 <- fit_ets_a |>    # ETS for ATM4, per the model choice above
  forecast(h = 31) |>
  filter(ATM == 'ATM4')
outputA2 <- as_tibble(outputA2) |>
  select(ATM, DATE, .mean) |>
  rename(Cash = .mean)
outputA <- bind_rows(outputA1, outputA2)
write_xlsx(outputA, path = "outputA.xlsx")
Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and provide a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in kilowatt hours; the rest is straightforward. Add this to your existing files above.
After looking at the data, I decided to work on Part B first, as it seemed more straightforward. This would allow me to work out my ‘process’ and then apply and adjust it to the more complex data in Part A.
power <- read_excel("ResidentialCustomerForecastLoad-624.xlsx")
Read the data in, then do a quick autoplot for a first look:
pwr <- power |>
  mutate(Month = yearmonth(`YYYY-MMM`),
         KWH = KWH / 1000000) |>  # scaling for graphs
  as_tsibble(index = Month) |>
  select(-`YYYY-MMM`)             # this is now Month
# without KWH specified, autoplot picks CaseSequence
autoplot(pwr, KWH)
There is one implausibly low value that will be treated as erroneous. First set it to NA and take another quick look.
# one weird low value, below 3M (CaseSequence 883)
pwr <- pwr |>
  mutate(KWH = ifelse(KWH > 3, KWH, NA))  # 3M is 3 because of the /1e6 scaling
autoplot(pwr, KWH)
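Hardcoding the 3M cutoff works here because the outlier is so obvious. A more general rule would flag anything far outside the interquartile range; a sketch that was not part of the original analysis, with iqr_bounds as an illustrative helper applied to the raw power data before any filtering:
iqr_bounds <- function(x, k = 3) {
  q <- quantile(x, c(.25, .75), na.rm = TRUE)
  c(q[1] - k * IQR(x, na.rm = TRUE), q[2] + k * IQR(x, na.rm = TRUE))
}
b <- iqr_bounds(power$KWH)
power |> filter(KWH < b[1] | KWH > b[2])  # rows to inspect, not auto-delete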
That looks better. But to use STL decomposition, the NA just introduced, as well as the one that was ‘native’ to the data, must be addressed.
First, convert to a data frame, then use mice to impute values for the NAs, and convert back to a tsibble. One more autoplot confirms that all looks clean.
tmpdf <- as.data.frame(pwr)
impute <- mice(tmpdf, m = 5, method = 'pmm', seed = 888)
##
## iter imp variable
## 1 1 KWH
## 1 2 KWH
## 1 3 KWH
## 1 4 KWH
## 1 5 KWH
## 2 1 KWH
## 2 2 KWH
## 2 3 KWH
## 2 4 KWH
## 2 5 KWH
## 3 1 KWH
## 3 2 KWH
## 3 3 KWH
## 3 4 KWH
## 3 5 KWH
## 4 1 KWH
## 4 2 KWH
## 4 3 KWH
## 4 4 KWH
## 4 5 KWH
## 5 1 KWH
## 5 2 KWH
## 5 3 KWH
## 5 4 KWH
## 5 5 KWH
cmplt <- complete(impute)
pwr2 <- as_tsibble(cmplt, index = Month)
autoplot(pwr2, KWH)
The decomposition shows a small but unmistakable upward trend. The seasonality is there: looking closely, a rounded peak is followed by a more pointed peak, then another rounded peak, and so on. The remainder looks mostly like noise.
pwr2 |>
model(
STL(KWH ~ trend(window = 21) + #21 for monthly
season(window = "periodic"),
robust = TRUE)) |>
components() |>
autoplot()
With X-11 decomposition, the seasonality is allowed to vary slowly. Here the pattern described above (rounded peak → pointed peak → rounded peak…) holds until about 2005, where it changes slightly, so that by the end it is all pointed peaks. Also notice that the curves in the trend are more sharply defined.
Note that the default here is multiplicative vs the additive in the STL.
p <- pwr2 |>
model(x11 = X_13ARIMA_SEATS(KWH ~ x11())) |>
components()
autoplot(p)
In the plot below, the original data, trend, and seasonally adjusted data are plotted together. It is evident that the variability of the data increases over time, with the distance between the highs and lows growing.
p |>
  ggplot(aes(x = Month)) +
  geom_line(aes(y = KWH, color = "Data")) +
  geom_line(aes(y = season_adjust,
                color = "Seasonally adjusted")) +
  geom_line(aes(y = trend, color = "Trend")) +
  labs(y = "KWH (in M)",
       title = "Power Usage") +
  scale_color_manual(
    values = c("gray", "#0072B2", "#D55E00"),
    breaks = c("Data", "Seasonally adjusted", "Trend")
  ) +
  theme(legend.text = element_text(size = 4),
        legend.title = element_text(size = 4),
        legend.key.height = unit(.25, 'cm'),
        legend.key.width = unit(.25, 'cm'))
Just for kicks and curiosity, let’s start with a simple SNAIVE() model as a baseline. It does not look too bad.
#try SNAIVE()
pwr_m1 <- pwr2 |>
model(SNAIVE(KWH))
pwr_fc1 <- pwr_m1 |> forecast(h=12)
autoplot(pwr_fc1) +
autolayer(pwr2,KWH, color='black')
pwr_m1 |>
gg_tsresiduals()
pwr_m1 |>
accuracy() |>
select('.model', 'RMSE')
## # A tibble: 1 × 2
## .model RMSE
## <chr> <dbl>
## 1 SNAIVE(KWH) 0.812
Now compare seasonal ETS and ARIMA models. This is seasonal data with a trend, and the variance changes over time.
Start by addressing the variance: find the lambda for a Box-Cox transformation.
p_lmbda <- pwr2|>
features(KWH, features= guerrero) |>
pull(lambda_guerrero)
p_lmbda
## [1] -0.2244283
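Before fitting, it can be worth eyeballing what the transformation does. A quick sketch (not in the original analysis) plotting the Box-Cox-ed series:
pwr2 |>
  autoplot(box_cox(KWH, p_lmbda)) +
  labs(y = "Box-Cox(KWH)", title = "Transformed power usage")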
Now fit an ETS model on the Box-Cox-ed KWH; an ETS(A,N,A) model is selected.
fit_ets_p <- pwr2 |>
model(ETS(box_cox(KWH, p_lmbda)))
report(fit_ets_p)
## Series: KWH
## Model: ETS(A,N,A)
## Transformation: box_cox(KWH, p_lmbda)
## Smoothing parameters:
## alpha = 0.1838587
## gamma = 0.000100099
##
## Initial states:
## l[0] s[0] s[-1] s[-2] s[-3] s[-4] s[-5]
## 1.472661 -0.02962359 -0.1763272 -0.08124159 0.1150382 0.1643275 0.1237383
## s[-6] s[-7] s[-8] s[-9] s[-10] s[-11]
## 0.01279731 -0.1622506 -0.1249864 -0.0479503 0.05503094 0.1514475
##
## sigma^2: 0.0038
##
## AIC AICc BIC
## -45.739597 -43.012324 3.122834
fit_ets_p |> gg_tsresiduals()
The Ljung-Box p-value is < 0.05, so the ETS residuals are not white noise and the model does not meet the standard for a good model.
augment(fit_ets_p) |>
features(.innov, ljung_box, lag = 22) |>
select( .model, lb_pvalue)
## # A tibble: 1 × 2
## .model lb_pvalue
## <chr> <dbl>
## 1 ETS(box_cox(KWH, p_lmbda)) 0.0171
Now fit an ARIMA model on the Box-Cox-ed KWH; an ARIMA(1,0,0)(2,1,0)[12] w/ drift model is selected.
fit_arima_p <- pwr2 |>
model(ARIMA(box_cox(KWH, p_lmbda)))
report(fit_arima_p)
## Series: KWH
## Model: ARIMA(1,0,0)(2,1,0)[12] w/ drift
## Transformation: box_cox(KWH, p_lmbda)
##
## Coefficients:
## ar1 sar1 sar2 constant
## 0.2473 -0.7069 -0.3484 0.0158
## s.e. 0.0759 0.0749 0.0776 0.0047
##
## sigma^2 estimated as 0.003721: log likelihood=246.55
## AIC=-483.09 AICc=-482.75 BIC=-467.13
fit_arima_p |> gg_tsresiduals()
The p-value is > 0.05, so that does meet the standard for a good model.
augment(fit_arima_p) |>
features(.innov, ljung_box, lag = 22) |>
select( .model, lb_pvalue)
## # A tibble: 1 × 2
## .model lb_pvalue
## <chr> <dbl>
## 1 ARIMA(box_cox(KWH, p_lmbda)) 0.255
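For completeness, the in-sample errors of all three Part B models can be lined up against the SNAIVE benchmark. A quick sketch that was not part of the original analysis (accuracy() reports on the scaled, back-transformed KWH):
bind_rows(accuracy(pwr_m1), accuracy(fit_ets_p), accuracy(fit_arima_p)) |>
  select(.model, RMSE, MAE) |>
  arrange(RMSE)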
The ARIMA model best meets the evaluation criteria, so it is chosen and plotted. The forecast looks reasonable, capturing both the trend and the seasonality.
plt7<-fit_arima_p |>
forecast(h=12) |>
autoplot(pwr2)
plt7
Exporting the forecast data to Excel, remembering to ‘un-scale’ the KWH values.
outputB <- fit_arima_p |>
  forecast(h = 12) |>
  mutate(date = as.character(Month))
outputB <- as_tibble(outputB) |>
  select(date, .mean) |>
  rename(KWH = .mean) |>
  mutate(KWH = KWH * 1000000)   # un-scale back to raw KWH
write_xlsx(outputB, path = "outputB.xlsx")
Part C consists of two data sets. These are simple 2-column sets; however, they have different time stamps. Your optional assignment is to time-base sequence the data and aggregate based on hour (an example of what this looks like follows). Note: for multiple recordings within an hour, take the mean. Then determine whether the data is stationary and whether it can be forecast. If so, provide a one-week-forward forecast and present the results via RPubs and the .rmd file, with the forecast in an Excel-readable file.
Waterflow_Pipe1 <- read_excel("Waterflow_Pipe1.xlsx")
Waterflow_Pipe2 <- read_excel("Waterflow_Pipe2.xlsx")
wf <- bind_rows(Waterflow_Pipe1, Waterflow_Pipe2)
# as_datetime(wf$`Date Time`) # does not handle the Excel serial format here
# `Date Time` is an Excel serial date (days since 1899-12-30); serial 25569
# is 1970-01-01, so subtract it and convert days to seconds
wf$DateTime <- as.POSIXct((wf$`Date Time` - 25569) * 86400,
                          origin = "1970-01-01", tz = "UTC")
wf <- wf |>
  select(-`Date Time`)
wf2 <- wf |>
  mutate(date = as.Date(DateTime), hour = hour(DateTime)) |>
  group_by(date, hour) |>
  summarise(mean_wf = mean(WaterFlow), .groups = "drop") |>
  mutate(DateTime = make_datetime(year(date), month(date), day(date), hour)) |>
  select(DateTime, mean_wf)
wf2 <- as_tsibble(wf2)
## Using `DateTime` as index variable.
wf2|> autoplot(mean_wf)
#fill gaps with NA
wf3 <- wf2 |>
fill_gaps(mean_wf = NA)
#calc mean
mean_value <- mean(wf3$mean_wf, na.rm = TRUE)
#replace NAs with mean
wf3 <- wf3 |>
mutate(mean_wf = ifelse(is.na(mean_wf), mean_value, mean_wf))
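Replacing the gaps with the overall mean is crude; since only scattered hours are missing, linear interpolation between neighboring hours might respect the local level better. A sketch of that alternative, assuming the zoo package is available (not used in the original analysis):
library(zoo)  # assumption: zoo is installed, for na.approx()
wf3_interp <- wf2 |>
  fill_gaps() |>
  mutate(mean_wf = na.approx(mean_wf, na.rm = FALSE))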
#STL
w<-wf3 |>
model(
STL(mean_wf ~ trend(window=7)+
season(period=24),
robust=TRUE)
) |>
components()
autoplot(w)
Check stationarity with a KPSS test:
wf3 |>
  features(mean_wf, unitroot_kpss)
## # A tibble: 1 × 2
## kpss_stat kpss_pvalue
## <dbl> <dbl>
## 1 4.88 0.01
The p-value of 0.01 rejects stationarity, so take a first difference and inspect the ACF/PACF:
wf3 |>
  gg_tsdisplay(difference(mean_wf), plot_type = 'partial')
wf3 |>
features(difference(mean_wf), unitroot_kpss)
## # A tibble: 1 × 2
## kpss_stat kpss_pvalue
## <dbl> <dbl>
## 1 0.0115 0.1
After differencing, the KPSS p-value is 0.1, so the differenced series can be treated as stationary and ARIMA models can be fitted. Fit two hand-picked candidates plus the automatic stepwise and full-search selections:
wf_fit <- wf3 |>
  model(arima310 = ARIMA(mean_wf ~ pdq(3,1,0)),
        arima011 = ARIMA(mean_wf ~ pdq(0,1,1)),
        stepwise = ARIMA(mean_wf),
        search   = ARIMA(mean_wf, stepwise = FALSE))
glance(wf_fit) |> arrange(AICc) |> select(.model:BIC)
## # A tibble: 4 × 6
## .model sigma2 log_lik AIC AICc BIC
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 stepwise 112. -3777. 7564. 7564. 7588.
## 2 search 112. -3777. 7564. 7564. 7588.
## 3 arima011 112. -3779. 7566. 7566. 7585.
## 4 arima310 133. -3861. 7735. 7735. 7764.
Stepwise and search have the same AICc, and both select the same model, ARIMA(0,1,2)(0,0,2)[24]; the search fit is used below.
wf_fit |>
select(search) |>
gg_tsresiduals()
wf_fit |>
forecast(h=168) |> #1 week of hours 7*24
filter(.model=='search') |>
autoplot(wf3)
Well, the forecast does not look great: the confidence intervals are roughly in the right range, but that’s about it. I also looked at the candidate models I specified, but they were no better. Having run out of time, I will stop here and write these not-so-great values to Excel.
outputC <- wf_fit |>
  forecast(h = 168) |>
  filter(.model == 'search')   # export only the chosen model's forecast
outputC <- as_tibble(outputC) |>
  select(DateTime, .mean) |>
  rename(WaterFlow = .mean)
write_xlsx(outputC, path = "outputC.xlsx")