In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars; other than that it is straightforward. I am being somewhat ambiguous on purpose, to give this more of a business feel. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an Excel file. Please provide your written report on your findings, visuals, discussion, and your R code via an RPubs link along with the actual .rmd file. Also please submit the forecast in an Excel-readable file.
As noted in Part B, after looking at the data for both parts, I decided to do Part B first, as its data seemed more straightforward. That let me work out a ‘process’. Now circling back to Part A, I will use the same workflow, with extra steps to accommodate the data challenges here.
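(A note on setup: the code throughout assumes the following packages are loaded. This list is reconstructed from the functions used in this report rather than copied from an original setup chunk, so adjust as needed.)
library(readxl)    # read_excel()
library(fpp3)      # tsibble, fable, feasts, plus dplyr, ggplot2, lubridate
library(mice)      # mice(), complete()
library(ggpubr)    # ggarrange()
library(writexl)   # write_xlsx()
library(seasonal)  # backend needed for X_13ARIMA_SEATS() in Part B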
ATM624Data <- read_excel("ATM624Data.xlsx")
View(ATM624Data)
At first glance, there is an anomaly in ATM4 and some NAs in the ATM field. The rest of the data is hard to see because ATM4 dominates the scale.
atm <- ATM624Data
# DATE was read in as an Excel serial number; convert using Excel's origin
atm$DATE <- as.Date(atm$DATE, origin = "1899-12-30")
atm <- as_tsibble(atm, key = ATM)
suppressMessages(autoplot(atm, Cash))
Next, each ATM is plotted separately for a better view, using the same colors as above for consistency. ATM1 and ATM2 look OK, and ATM4 has the anomaly noted above. It is now also visible that ATM3 is 0 on most days.
Note that for these plots the axis titles and x-axis text were removed: they are the same across panels, as shown in the first co-mingled plot, and they distracted from the data.
a1 <- atm |>
  filter(ATM == 'ATM1') |>
  autoplot(Cash) +
  geom_line(color = "#F8766D") +
  theme(axis.text.x = element_blank(),
        axis.title = element_blank()) +
  ggtitle("ATM1")
a2 <- atm |>
  filter(ATM == 'ATM2') |>
  autoplot(Cash) +
  geom_line(color = "#7CAE00") +
  theme(axis.text.x = element_blank(),
        axis.title = element_blank()) +
  ggtitle("ATM2")
a3 <- atm |>
  filter(ATM == 'ATM3') |>
  autoplot(Cash) +
  geom_line(color = "#00BFC4") +
  theme(axis.text.x = element_blank(),
        axis.title = element_blank()) +
  ggtitle("ATM3")
a4 <- atm |>
  filter(ATM == 'ATM4') |>
  autoplot(Cash) +
  geom_line(color = "#CA82FF") +
  theme(axis.text.x = element_blank(),
        axis.title = element_blank()) +
  ggtitle("ATM4")
ggarrange(a1, a2, a3, a4)
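As an aside, the four near-identical blocks above (and their two repeats later) could be collapsed into a small helper. A minimal sketch; plot_atm is an illustrative name, not something from the original code:
plot_atm <- function(data, atm_id, color) {
  data |>
    filter(ATM == atm_id) |>
    autoplot(Cash) +
    geom_line(color = color) +
    theme(axis.text.x = element_blank(),
          axis.title = element_blank()) +
    ggtitle(atm_id)
}
cols <- c(ATM1 = "#F8766D", ATM2 = "#7CAE00",
          ATM3 = "#00BFC4", ATM4 = "#CA82FF")
ggarrange(plotlist = lapply(names(cols),
                            function(a) plot_atm(atm, a, cols[[a]])))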
I wondered whether there was a relationship between the ATM=NA rows and the ATM3 rows where Cash=0, but with only 14 ATM=NA rows that is unlikely. The dates on those 14 rows fall within the month we are meant to forecast, and all of them have NA for Cash. Since each of the other ATMs has a full 365 rows, nothing is missing from the actual series, so it is safe to simply remove the ATM=NA rows.
rows<- as.data.frame(atm)
rows |>
group_by(ATM) |>
summarise(Count= n(),.groups = "drop")
## # A tibble: 5 × 2
## ATM Count
## <chr> <int>
## 1 ATM1 365
## 2 ATM2 365
## 3 ATM3 365
## 4 ATM4 365
## 5 <NA> 14
atm <- atm |>
filter(!is.na(ATM))
rows<- as.data.frame(atm)
rows |>
group_by(ATM) |>
summarise(Count= n(),.groups = "drop")
## # A tibble: 4 × 2
## ATM Count
## <chr> <int>
## 1 ATM1 365
## 2 ATM2 365
## 3 ATM3 365
## 4 ATM4 365
Using the same method as in Part B, the anomalous value will be set to NA and then imputed with mice. Conveniently, the mice() imputation step will also address the native Cash=NA rows (5 in total, where ATM is ATM1 or ATM2). However, the first attempt at this introduced an anomaly into ATM1: mice() does not separate out the 4 ATMs, and with ATM4 on a significantly different scale, ATM4-sized values were donated to ATM1. So the extreme ATM4 value (any Cash of 3000 or more) is set to NA first, and then mice is used to replace the NAs. (The scale issue itself is dealt with in the step after this one.)
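As an aside, another way to guard against this cross-ATM contamination would be to run mice separately per ATM, so that series on very different scales cannot donate values to each other. A rough sketch under that assumption (impute_by_atm is an illustrative name; the date is converted to numeric so pmm can use it as a predictor):
impute_by_atm <- function(df) {
  df |>
    as.data.frame() |>
    group_split(ATM) |>
    lapply(function(g) {
      # impute each ATM's Cash using only that ATM's own values
      tmp <- data.frame(t = as.numeric(g$DATE), Cash = g$Cash)
      imp <- mice(tmp, m = 5, method = 'pmm', seed = 888, printFlag = FALSE)
      g$Cash <- complete(imp)$Cash
      g
    }) |>
    bind_rows()
}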
atm <- atm |>
  mutate(Cash = ifelse(Cash < 3000, Cash, NA))  # NA out the extreme ATM4 value
tmpdf2 <- as.data.frame(atm)
impute2 <- mice(tmpdf2, m = 5, method = 'pmm', seed = 888)
##
## iter imp variable
## 1 1 Cash
## 1 2 Cash
## 1 3 Cash
## 1 4 Cash
## 1 5 Cash
## 2 1 Cash
## 2 2 Cash
## 2 3 Cash
## 2 4 Cash
## 2 5 Cash
## 3 1 Cash
## 3 2 Cash
## 3 3 Cash
## 3 4 Cash
## 3 5 Cash
## 4 1 Cash
## 4 2 Cash
## 4 3 Cash
## 4 4 Cash
## 4 5 Cash
## 5 1 Cash
## 5 2 Cash
## 5 3 Cash
## 5 4 Cash
## 5 5 Cash
cmplt2 <- complete(impute2)
atm2 <- as_tsibble(cmplt2, index = DATE, key=ATM)
a5 <- atm2 |>
  filter(ATM == 'ATM1') |>
  autoplot(Cash) +
  geom_line(color = "#F8766D") +
  theme(axis.text.x = element_blank(),
        axis.title = element_blank()) +
  ggtitle("ATM1")
a6 <- atm2 |>
  filter(ATM == 'ATM2') |>
  autoplot(Cash) +
  geom_line(color = "#7CAE00") +
  theme(axis.text.x = element_blank(),
        axis.title = element_blank()) +
  ggtitle("ATM2")
a7 <- atm2 |>
  filter(ATM == 'ATM3') |>
  autoplot(Cash) +
  geom_line(color = "#00BFC4") +
  theme(axis.text.x = element_blank(),
        axis.title = element_blank()) +
  ggtitle("ATM3")
a8 <- atm2 |>
  filter(ATM == 'ATM4') |>
  autoplot(Cash) +
  geom_line(color = "#CA82FF") +
  theme(axis.text.x = element_blank(),
        axis.title = element_blank()) +
  ggtitle("ATM4")
ggarrange(a5, a6, a7, a8)
Noting that ATM4’s Cash ranges from 0 to almost 1750, where the others go from 0 to almost 175. This looks like a scale error, especially since Cash is meant to be in hundreds of dollars: that would put ATM4’s upper boundary around $175,000, which seems far too high. ATM4 is also the only ATM with decimal values. In real life, one would go back to the source and validate any assumption of a data error, but here the assumption is made that ATM4’s values should be divided by 10 and truncated to whole numbers.
So first, divide all ATM4 Cash values by 10 and truncate to 0 decimals (i.e., make them integers). Then use mice to impute, and plot the results. After this, ATM1, ATM2 and ATM4 all look good.
# Note: the ATM4 anomaly was already set to NA by this code (from v1 above),
# so it is left commented out here:
# atm <- atm |>
#   mutate(Cash = ifelse(Cash < 3000, Cash, NA))
atm <- atm |>
  mutate(Cash = ifelse(ATM == 'ATM4',
                       as.integer(Cash / 10), Cash))
tmpdf3 <- as.data.frame(atm)
impute3 <- mice(tmpdf3, m = 5, method = 'pmm', seed = 888)
##
## iter imp variable
## 1 1 Cash
## 1 2 Cash
## 1 3 Cash
## 1 4 Cash
## 1 5 Cash
## 2 1 Cash
## 2 2 Cash
## 2 3 Cash
## 2 4 Cash
## 2 5 Cash
## 3 1 Cash
## 3 2 Cash
## 3 3 Cash
## 3 4 Cash
## 3 5 Cash
## 4 1 Cash
## 4 2 Cash
## 4 3 Cash
## 4 4 Cash
## 4 5 Cash
## 5 1 Cash
## 5 2 Cash
## 5 3 Cash
## 5 4 Cash
## 5 5 Cash
cmplt3 <- complete(impute3)
atm3 <- as_tsibble(cmplt3, index = DATE, key=ATM)
a9 <- atm3 |>
  filter(ATM == 'ATM1') |>
  autoplot(Cash) +
  geom_line(color = "#F8766D") +
  theme(axis.text.x = element_blank(),
        axis.title = element_blank()) +
  ggtitle("ATM1")
a10 <- atm3 |>
  filter(ATM == 'ATM2') |>
  autoplot(Cash) +
  geom_line(color = "#7CAE00") +
  theme(axis.text.x = element_blank(),
        axis.title = element_blank()) +
  ggtitle("ATM2")
a11 <- atm3 |>
  filter(ATM == 'ATM3') |>
  autoplot(Cash) +
  geom_line(color = "#00BFC4") +
  theme(axis.text.x = element_blank(),
        axis.title = element_blank()) +
  ggtitle("ATM3")
a12 <- atm3 |>
  filter(ATM == 'ATM4') |>
  autoplot(Cash) +
  geom_line(color = "#CA82FF") +
  theme(axis.text.x = element_blank(),
        axis.title = element_blank()) +
  ggtitle("ATM4")
ggarrange(a9, a10, a11, a12)
ATM3 has only 3 non-zero values (the last 3 days of the year provided). It may have been out of service most of the year, not loaded with cash, or newly installed. Whatever the cause, there is no value in forecasting a month’s worth of data from 3 days, so ATM3 will be removed from the data.
atm3 <- atm3 |>
filter(ATM != 'ATM3')
Note that the legend was removed so the plot could be seen more clearly: ATM1 = red, ATM2 = green, ATM4 = blue.
The trend is pretty flat, except that ATM2 (green) looks more erratic over the last two months. The remainder also looks different during that same period for both ATM1 (red) and ATM2 (green).
The weekly seasonality is very consistent for all 3 ATMs.
Since the seasonality is extremely regular and the trend is flat (except for ATM2), the X-11 decomposition will be skipped.
a <- atm3 |>
  model(
    STL(Cash ~ trend(window = 7) +       # 7 for weekly
          season(window = "periodic"),
        robust = TRUE)
  ) |>
  components()
autoplot(a) +
  theme(legend.position = "none")
Now compare seasonal ETS and ARIMA models. This is seasonal data, only ATM2 shows any trend, and the variance is stable over time, so no Box-Cox transformation is needed.
For all 3 ATMs, an ETS(A,N,A) model is chosen.
fit_ets_a <- atm3 |>
model(ETS(Cash))
fit_ets_a |>
filter (ATM =='ATM1')|>
report()
## Series: Cash
## Model: ETS(A,N,A)
## Smoothing parameters:
## alpha = 0.01106574
## gamma = 0.297658
##
## Initial states:
## l[0] s[0] s[-1] s[-2] s[-3] s[-4] s[-5] s[-6]
## 78.86062 -65.44887 1.723305 10.89606 -0.7946538 21.58535 15.52393 16.51488
##
## sigma^2: 647.9359
##
## AIC AICc BIC
## 4527.284 4527.905 4566.283
fit_ets_a |>
filter (ATM =='ATM2')|>
report()
## Series: Cash
## Model: ETS(A,N,A)
## Smoothing parameters:
## alpha = 0.0001000487
## gamma = 0.3552542
##
## Initial states:
## l[0] s[0] s[-1] s[-2] s[-3] s[-4] s[-5] s[-6]
## 74.18287 -35.7908 -21.83354 20.22 -7.333659 2.716864 17.17553 24.84561
##
## sigma^2: 656.3285
##
## AIC AICc BIC
## 4531.981 4532.603 4570.980
fit_ets_a |>
filter (ATM =='ATM4')|>
report()
## Series: Cash
## Model: ETS(A,N,A)
## Smoothing parameters:
## alpha = 0.0001000888
## gamma = 0.0001000098
##
## Initial states:
## l[0] s[0] s[-1] s[-2] s[-3] s[-4] s[-5] s[-6]
## 44.36583 -26.77297 -3.552716 2.258881 4.470791 8.900272 3.900033 10.79571
##
## sigma^2: 1109.365
##
## AIC AICc BIC
## 4723.563 4724.184 4762.562
Residual diagnostics for each ETS fit (ATM1, ATM2, ATM4):
fit_ets_a |>
filter (ATM =='ATM1')|>
gg_tsresiduals() +
labs(title = "ETS: ATM1")
fit_ets_a |>
filter (ATM =='ATM2')|>
gg_tsresiduals() +
labs(title = "ETS: ATM2")
fit_ets_a |>
filter (ATM =='ATM4')|>
gg_tsresiduals() +
labs(title = "ETS: ATM4")
The Ljung-Box p-value varies by ATM: ATM1 and ATM4 are above 0.05 (residuals consistent with white noise), but ATM2 is well below 0.05, so its ETS residuals still contain autocorrelation:
augment(fit_ets_a) |>
features(.innov, ljung_box, lag = 24) |>
select( ATM, lb_pvalue)
## # A tibble: 3 × 2
## ATM lb_pvalue
## <chr> <dbl>
## 1 ATM1 0.163
## 2 ATM2 0.000512
## 3 ATM4 0.521
Different models were chosen for each of the ATMs:
ATM1: ARIMA(0,0,1)(0,1,2)[7]
ATM2: ARIMA(2,0,2)(0,1,1)[7]
ATM4: ARIMA(3,0,2)(1,0,0)[7] w/ mean
fit_arima_a <- atm3 |>
model(ARIMA(Cash))
fit_arima_a |>
filter (ATM =='ATM1')|>
report()
## Series: Cash
## Model: ARIMA(0,0,1)(0,1,2)[7]
##
## Coefficients:
## ma1 sma1 sma2
## 0.1188 -0.6332 -0.0723
## s.e. 0.0550 0.0516 0.0509
##
## sigma^2 estimated as 636.5: log likelihood=-1664.34
## AIC=3336.68 AICc=3336.79 BIC=3352.2
fit_arima_a |>
filter (ATM =='ATM2')|>
report()
## Series: Cash
## Model: ARIMA(2,0,2)(0,1,1)[7]
##
## Coefficients:
## ar1 ar2 ma1 ma2 sma1
## -0.4315 -0.9183 0.4711 0.8153 -0.7552
## s.e. 0.0547 0.0395 0.0857 0.0543 0.0385
##
## sigma^2 estimated as 613.4: log likelihood=-1656.88
## AIC=3325.75 AICc=3325.99 BIC=3349.04
fit_arima_a |>
filter (ATM =='ATM4')|>
report()
## Series: Cash
## Model: ARIMA(3,0,2)(1,0,0)[7] w/ mean
##
## Coefficients:
## ar1 ar2 ar3 ma1 ma2 sar1 constant
## -0.9107 -0.6704 0.1577 0.9891 0.7808 0.2462 80.4593
## s.e. 0.1146 0.1299 0.0564 0.1068 0.1229 0.0553 4.9023
##
## sigma^2 estimated as 1181: log likelihood=-1805.66
## AIC=3627.32 AICc=3627.72 BIC=3658.52
Residual diagnostics for each ARIMA fit (ATM1, ATM2, ATM4):
fit_arima_a |>
filter (ATM =='ATM1')|>
gg_tsresiduals() +
labs(title = "ARIMA: ATM1")
fit_arima_a |>
filter (ATM =='ATM2')|>
gg_tsresiduals() +
labs(title = "ARIMA: ATM2")
fit_arima_a |>
filter (ATM =='ATM4')|>
gg_tsresiduals() +
labs(title = "ARIMA: ATM4")
The Ljung-Box p-values all exceed 0.05, so all three ARIMA models meet the standard for a good model: ATM1 at 0.383, ATM2 at 0.475, and ATM4 at 0.942.
augment(fit_arima_a) |>
features(.innov, ljung_box, lag = 24) |>
select( ATM, lb_pvalue)
## # A tibble: 3 × 2
## ATM lb_pvalue
## <chr> <dbl>
## 1 ATM1 0.383
## 2 ATM2 0.475
## 3 ATM4 0.942
ATM1 and ATM2 look good here, but the ATM4 ARIMA forecast does not seem to capture any of the seasonality.
fit_arima_a |>
  forecast(h = 31) |>
  autoplot(atm3)   # plot over the cleaned data (atm3), not the raw atm
The ETS model for ATM4 looks much better.
p1 <- fit_ets_a |>
  filter(ATM == 'ATM4') |>
  forecast(h = 31) |>
  autoplot(atm3) +
  labs(title = "ETS: ATM4") +
  theme(axis.text.x = element_blank(),
        axis.title = element_blank(),
        legend.position = "none")
p1
ATM1 & ATM2 will use the ARIMA and ATM4 will use the ETS.
p2 <- fit_arima_a |>
  filter(ATM == 'ATM1') |>
  forecast(h = 31) |>
  autoplot(atm3) +
  labs(title = "ARIMA: ATM1") +
  theme(axis.text.x = element_blank(),
        axis.title = element_blank(),
        legend.position = "none")
p3 <- fit_arima_a |>
  filter(ATM == 'ATM2') |>
  forecast(h = 31) |>
  autoplot(atm3) +
  labs(title = "ARIMA: ATM2") +
  theme(axis.text.x = element_blank(),
        axis.title = element_blank(),
        legend.position = "none")
ggarrange(p2,p3,p1)
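To make the choice less purely visual, the in-sample errors of the two model classes could also be put side by side. A quick sketch using fabletools::accuracy() on the fitted models; this comparison was not part of the original analysis:
bind_rows(accuracy(fit_ets_a), accuracy(fit_arima_a)) |>
  select(ATM, .model, RMSE, MAE) |>
  arrange(ATM, RMSE)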
Exporting the forecast data to Excel. Note that the ATM4 forecast must come from the ETS fit, matching the model choice above.
outputA1 <- fit_arima_a |>
  forecast(h = 31) |>
  filter(ATM != 'ATM4')
outputA1 <- as_tibble(outputA1) |>
  select(ATM, DATE, .mean) |>
  rename(Cash = .mean)
outputA2 <- fit_ets_a |>    # ETS for ATM4, per the model choice above
  forecast(h = 31) |>
  filter(ATM == 'ATM4')
outputA2 <- as_tibble(outputA2) |>
  select(ATM, DATE, .mean) |>
  rename(Cash = .mean)
outputA <- bind_rows(outputA1, outputA2)
write_xlsx(outputA, path = "outputA.xlsx")
Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and provide a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in kilowatt hours; the rest is straightforward. Add this to your existing files above.
After looking at the data, I decided to work on Part B first, as it seemed more straightforward. This would allow me to work out my ‘process’ and then apply and adjust it to the more complex data in Part A.
power <- read_excel("ResidentialCustomerForecastLoad-624.xlsx")
Read the data in, then do a quick autoplot for a first look:
pwr <- power |>
  mutate(Month = yearmonth(`YYYY-MMM`),
         KWH = KWH / 1000000) |>  # scaling for graphs
  as_tsibble(index = Month) |>
  select(-`YYYY-MMM`)             # this is now Month
# without KWH specified, autoplot picks CaseSequence
autoplot(pwr, KWH)
There is one implausibly low value that will be treated as erroneous. First set it to NA and take another quick look.
# one weird low value, below 3M (CaseSequence 883)
pwr <- pwr |>
  mutate(KWH = ifelse(KWH > 3, KWH, NA))  # 3M is 3 because of the /1e6 scaling
autoplot(pwr, KWH)
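Hardcoding the 3M cutoff works here because the outlier is so obvious. A more general rule would flag anything far outside the interquartile range; a sketch that was not part of the original analysis, with iqr_bounds as an illustrative helper applied to the raw power data before any filtering:
iqr_bounds <- function(x, k = 3) {
  q <- quantile(x, c(.25, .75), na.rm = TRUE)
  c(q[1] - k * IQR(x, na.rm = TRUE), q[2] + k * IQR(x, na.rm = TRUE))
}
b <- iqr_bounds(power$KWH)
power |> filter(KWH < b[1] | KWH > b[2])  # rows to inspect, not auto-delete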
That looks better. But to use STL decomposition, the NA just introduced, as well as the one that was ‘native’ to the data, must be addressed.
First, convert to a data frame, then use mice to impute values for the NAs, and convert back to a tsibble. One more autoplot confirms that all looks clean.
tmpdf <- as.data.frame(pwr)
impute <- mice(tmpdf, m = 5, method = 'pmm', seed = 888)
##
## iter imp variable
## 1 1 KWH
## 1 2 KWH
## 1 3 KWH
## 1 4 KWH
## 1 5 KWH
## 2 1 KWH
## 2 2 KWH
## 2 3 KWH
## 2 4 KWH
## 2 5 KWH
## 3 1 KWH
## 3 2 KWH
## 3 3 KWH
## 3 4 KWH
## 3 5 KWH
## 4 1 KWH
## 4 2 KWH
## 4 3 KWH
## 4 4 KWH
## 4 5 KWH
## 5 1 KWH
## 5 2 KWH
## 5 3 KWH
## 5 4 KWH
## 5 5 KWH
cmplt <- complete(impute)
pwr2 <- as_tsibble(cmplt, index = Month)
autoplot(pwr2, KWH)
The decomposition shows a small but unmistakable upward trend. The seasonality is there: looking closely, a rounded peak is followed by a more pointed peak, then another rounded peak, and so on. The remainder looks mostly like noise.
pwr2 |>
model(
STL(KWH ~ trend(window = 21) + #21 for monthly
season(window = "periodic"),
robust = TRUE)) |>
components() |>
autoplot()
With X-11 decomposition, the seasonality is allowed to vary slowly. Here the pattern described above (rounded peak → pointed peak → rounded peak…) holds until about 2005, where it changes slightly, so that by the end it is all pointed peaks. Also notice that the curves in the trend are more sharply defined.
Note that the default here is multiplicative vs the additive in the STL.
p <- pwr2 |>
model(x11 = X_13ARIMA_SEATS(KWH ~ x11())) |>
components()
autoplot(p)
In the plot below, the original data, trend, and seasonally adjusted data are plotted together. It is evident that the variability of the data increases over time, with the distance between the highs and lows growing.
p |>
  ggplot(aes(x = Month)) +
  geom_line(aes(y = KWH, color = "Data")) +
  geom_line(aes(y = season_adjust,
                color = "Seasonally adjusted")) +
  geom_line(aes(y = trend, color = "Trend")) +
  labs(y = "KWH (in M)",
       title = "Power Usage") +
  scale_color_manual(
    values = c("gray", "#0072B2", "#D55E00"),
    breaks = c("Data", "Seasonally adjusted", "Trend")
  ) +
  theme(legend.text = element_text(size = 4),
        legend.title = element_text(size = 4),
        legend.key.height = unit(.25, 'cm'),
        legend.key.width = unit(.25, 'cm'))
Just for kicks and curiosity, let’s start with a simple SNAIVE() model as a baseline. It does not look too bad.
#try SNAIVE()
pwr_m1 <- pwr2 |>
model(SNAIVE(KWH))
pwr_fc1 <- pwr_m1 |> forecast(h=12)
autoplot(pwr_fc1) +
autolayer(pwr2,KWH, color='black')
pwr_m1 |>
gg_tsresiduals()
pwr_m1 |>
accuracy() |>
select('.model', 'RMSE')
## # A tibble: 1 × 2
## .model RMSE
## <chr> <dbl>
## 1 SNAIVE(KWH) 0.812
Now compare seasonal ETS and ARIMA models. This is seasonal data with a trend, and the variance changes over time.
Start by addressing the variance: find the lambda for a Box-Cox transformation.
p_lmbda <- pwr2|>
features(KWH, features= guerrero) |>
pull(lambda_guerrero)
p_lmbda
## [1] -0.2244283
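Before fitting, it can be worth eyeballing what the transformation does. A quick sketch (not in the original analysis) plotting the Box-Cox-ed series:
pwr2 |>
  autoplot(box_cox(KWH, p_lmbda)) +
  labs(y = "Box-Cox(KWH)", title = "Transformed power usage")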
Now fit an ETS model on the Box-Cox-ed KWH; an ETS(A,N,A) model is selected.
fit_ets_p <- pwr2 |>
model(ETS(box_cox(KWH, p_lmbda)))
report(fit_ets_p)
## Series: KWH
## Model: ETS(A,N,A)
## Transformation: box_cox(KWH, p_lmbda)
## Smoothing parameters:
## alpha = 0.1838587
## gamma = 0.000100099
##
## Initial states:
## l[0] s[0] s[-1] s[-2] s[-3] s[-4] s[-5]
## 1.472661 -0.02962359 -0.1763272 -0.08124159 0.1150382 0.1643275 0.1237383
## s[-6] s[-7] s[-8] s[-9] s[-10] s[-11]
## 0.01279731 -0.1622506 -0.1249864 -0.0479503 0.05503094 0.1514475
##
## sigma^2: 0.0038
##
## AIC AICc BIC
## -45.739597 -43.012324 3.122834
fit_ets_p |> gg_tsresiduals()
The Ljung-Box p-value is < 0.05, so the ETS residuals are not white noise and the model does not meet the standard for a good model.
augment(fit_ets_p) |>
features(.innov, ljung_box, lag = 22) |>
select( .model, lb_pvalue)
## # A tibble: 1 × 2
## .model lb_pvalue
## <chr> <dbl>
## 1 ETS(box_cox(KWH, p_lmbda)) 0.0171
Now fit an ARIMA model on the Box-Cox-ed KWH; an ARIMA(1,0,0)(2,1,0)[12] w/ drift model is selected.
fit_arima_p <- pwr2 |>
model(ARIMA(box_cox(KWH, p_lmbda)))
report(fit_arima_p)
## Series: KWH
## Model: ARIMA(1,0,0)(2,1,0)[12] w/ drift
## Transformation: box_cox(KWH, p_lmbda)
##
## Coefficients:
## ar1 sar1 sar2 constant
## 0.2473 -0.7069 -0.3484 0.0158
## s.e. 0.0759 0.0749 0.0776 0.0047
##
## sigma^2 estimated as 0.003721: log likelihood=246.55
## AIC=-483.09 AICc=-482.75 BIC=-467.13
fit_arima_p |> gg_tsresiduals()
The p-value is > 0.05, so that does meet the standard for a good model.
augment(fit_arima_p) |>
features(.innov, ljung_box, lag = 22) |>
select( .model, lb_pvalue)
## # A tibble: 1 × 2
## .model lb_pvalue
## <chr> <dbl>
## 1 ARIMA(box_cox(KWH, p_lmbda)) 0.255
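For completeness, the in-sample errors of all three Part B models can be lined up against the SNAIVE benchmark. A quick sketch that was not part of the original analysis (accuracy() reports on the scaled, back-transformed KWH):
bind_rows(accuracy(pwr_m1), accuracy(fit_ets_p), accuracy(fit_arima_p)) |>
  select(.model, RMSE, MAE) |>
  arrange(RMSE)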
The ARIMA model best meets the evaluation criteria, so it is chosen and plotted. The forecast looks reasonable, capturing both the trend and the seasonality.
plt7<-fit_arima_p |>
forecast(h=12) |>
autoplot(pwr2)
plt7
Exporting the forecast data to Excel, remembering to ‘un-scale’ the KWH values.
outputB <- fit_arima_p |>
  forecast(h = 12) |>
  mutate(date = as.character(Month))
outputB <- as_tibble(outputB) |>
  select(date, .mean) |>
  rename(KWH = .mean) |>
  mutate(KWH = KWH * 1000000)   # un-scale back to raw KWH
write_xlsx(outputB, path = "outputB.xlsx")
Part C consists of two data sets. These are simple 2-column sets; however, they have different time stamps. Your optional assignment is to time-base sequence the data and aggregate based on hour (an example of what this looks like follows). Note: for multiple recordings within an hour, take the mean. Then determine whether the data is stationary and whether it can be forecast. If so, provide a one-week-forward forecast and present the results via RPubs and the .rmd file, with the forecast in an Excel-readable file.
Waterflow_Pipe1 <- read_excel("Waterflow_Pipe1.xlsx")
Waterflow_Pipe2 <- read_excel("Waterflow_Pipe2.xlsx")
wf <- bind_rows(Waterflow_Pipe1, Waterflow_Pipe2)
# as_datetime(wf$`Date Time`) # does not handle the Excel serial format here
# `Date Time` is an Excel serial date (days since 1899-12-30); serial 25569
# is 1970-01-01, so subtract it and convert days to seconds
wf$DateTime <- as.POSIXct((wf$`Date Time` - 25569) * 86400,
                          origin = "1970-01-01", tz = "UTC")
wf <- wf |>
  select(-`Date Time`)
wf2 <- wf |>
  mutate(date = as.Date(DateTime), hour = hour(DateTime)) |>
  group_by(date, hour) |>
  summarise(mean_wf = mean(WaterFlow), .groups = "drop") |>
  mutate(DateTime = make_datetime(year(date), month(date), day(date), hour)) |>
  select(DateTime, mean_wf)
wf2 <- as_tsibble(wf2)
## Using `DateTime` as index variable.
wf2|> autoplot(mean_wf)
#fill gaps with NA
wf3 <- wf2 |>
fill_gaps(mean_wf = NA)
#calc mean
mean_value <- mean(wf3$mean_wf, na.rm = TRUE)
#replace NAs with mean
wf3 <- wf3 |>
mutate(mean_wf = ifelse(is.na(mean_wf), mean_value, mean_wf))
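Replacing the gaps with the overall mean is crude; since only scattered hours are missing, linear interpolation between neighboring hours might respect the local level better. A sketch of that alternative, assuming the zoo package is available (not used in the original analysis):
library(zoo)  # assumption: zoo is installed, for na.approx()
wf3_interp <- wf2 |>
  fill_gaps() |>
  mutate(mean_wf = na.approx(mean_wf, na.rm = FALSE))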
#STL
w<-wf3 |>
model(
STL(mean_wf ~ trend(window=7)+
season(period=24),
robust=TRUE)
) |>
components()
autoplot(w)
Check stationarity with a KPSS test:
wf3 |>
  features(mean_wf, unitroot_kpss)
## # A tibble: 1 × 2
## kpss_stat kpss_pvalue
## <dbl> <dbl>
## 1 4.88 0.01
The p-value of 0.01 rejects stationarity, so take a first difference and inspect the ACF/PACF:
wf3 |>
  gg_tsdisplay(difference(mean_wf), plot_type = 'partial')
wf3 |>
features(difference(mean_wf), unitroot_kpss)
## # A tibble: 1 × 2
## kpss_stat kpss_pvalue
## <dbl> <dbl>
## 1 0.0115 0.1
After differencing, the KPSS p-value is 0.1, so the differenced series can be treated as stationary and ARIMA models can be fitted. Fit two hand-picked candidates plus the automatic stepwise and full-search selections:
wf_fit <- wf3 |>
  model(arima310 = ARIMA(mean_wf ~ pdq(3,1,0)),
        arima011 = ARIMA(mean_wf ~ pdq(0,1,1)),
        stepwise = ARIMA(mean_wf),
        search   = ARIMA(mean_wf, stepwise = FALSE))
glance(wf_fit) |> arrange(AICc) |> select(.model:BIC)
## # A tibble: 4 × 6
## .model sigma2 log_lik AIC AICc BIC
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 stepwise 112. -3777. 7564. 7564. 7588.
## 2 search 112. -3777. 7564. 7564. 7588.
## 3 arima011 112. -3779. 7566. 7566. 7585.
## 4 arima310 133. -3861. 7735. 7735. 7764.
Stepwise and search have the same AICc, and both select the same model, ARIMA(0,1,2)(0,0,2)[24]; the search fit is used below.
wf_fit |>
select(search) |>
gg_tsresiduals()
wf_fit |>
forecast(h=168) |> #1 week of hours 7*24
filter(.model=='search') |>
autoplot(wf3)
Well, the forecast does not look great: the confidence intervals are roughly in the right range, but that’s about it. I also looked at the candidate models I specified, but they were no better. Having run out of time, I will stop here and write these not-so-great values to Excel.
outputC <- wf_fit |>
  forecast(h = 168) |>
  filter(.model == 'search')   # export only the chosen model's forecast
outputC <- as_tibble(outputC) |>
  select(DateTime, .mean) |>
  rename(WaterFlow = .mean)
write_xlsx(outputC, path = "outputC.xlsx")