DATA624 Project 1

In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars, other than that it is straight forward. I am being somewhat ambiguous on purpose to make this have a little more business feeling.
Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an excel file, please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual.rmd file Also please submit the forecast which you will put in an Excel readable file.

Part A

#Load library
library(fpp3)
library(ggplot2)
library(feasts)
library(readxl)
library(openxlsx)
library(imputeTS)

#Read data
ATM<-read_xlsx('ATM624Data.xlsx', col_types=c('date', 'text', 'guess'))|>
  mutate(DATE=as_date(DATE))|>
  filter(!is.na(ATM))|>
  as_tsibble(index=DATE, key=ATM)

#Creat a summary 
summary(ATM)

##       DATE                ATM                 Cash        
##  Min.   :2009-05-01   Length:1460        Min.   :    0.0  
##  1st Qu.:2009-07-31   Class :character   1st Qu.:    0.5  
##  Median :2009-10-30   Mode  :character   Median :   73.0  
##  Mean   :2009-10-30                      Mean   :  155.6  
##  3rd Qu.:2010-01-29                      3rd Qu.:  114.0  
##  Max.   :2010-04-30                      Max.   :10919.8  
##                                          NA's   :5

#Plot data by ATM
#Histogram
ATM|>ggplot(aes(x=Cash))+
  geom_histogram()+
  facet_wrap(ATM ~ ., scales = 'free')

#Time series
ATM|>autoplot(Cash)+
  facet_wrap(ATM~., scales='free', nrow=4)+
  labs(title='Cash in hundred taken out of 4 different ATM machines')

Findings on data 1. There were NAs in ATM column and was filtered. 2. None of the data is normally distributed. 3. Since ATM3 has minimum number of value, it will be removed from the analysis and forecast. 4. The count of withdrawal and cash amount in ATM 4 is much higher than the others. 5. ATM4 has an outliner at a value of 10919.76

#Filled out the NAs in Cash column
ATM<-ATM|>na_interpolation('Cash', option='linear')

#Find the order of differencing
ATM|>features(Cash, unitroot_ndiffs)

ATM|>features(Cash, unitroot_nsdiffs)

ATM1<-ATM|>filter(ATM=='ATM1')
ATM2<-ATM|>filter(ATM=='ATM2')
ATM3<-ATM|>filter(ATM=='ATM3')
ATM4<-ATM|>filter(ATM=='ATM4')

ATM1

#Ensemble of time series displays
gg_tsdisplay(ATM1, Cash)

#Find the model
fit_ATM1<-ATM1|> model(
        stepwise = ARIMA(Cash),
        search = ARIMA(Cash, stepwise=FALSE),
        SN = SNAIVE(Cash),
        ETS_A = ETS(Cash ~ error("A") + trend("N") + season("A")),
        ETS_M = ETS(Cash ~ error("M") + trend("N") + season("M"))
        )

fit_ATM1 |> pivot_longer(!ATM, names_to = "Model name",
                         values_to = "Orders")

accuracy(fit_ATM1)

report(fit_ATM1)

Model evaluation: From the accuracy function, Seasonal Naive model has the highest RMSE whilst both ARIMA models have the lowest one In the report of fit, both ARIMA models have a lower AICc than ETS additive and multiplicative models. As such, the ARIMA (ARIMA(0,0,1)(0,1,2)[7]) has the best model among all.

#forecast and save for output
ATM1_model<-ATM1|>
  model(ARIMA(Cash))|>
  forecast(h=30)
  
#Plot forecast  
ATM1_model|>
  autoplot(ATM)+
  labs(title='ATM1 Cash Withdrawal')

ATM2

#Ensemble of time series displays
gg_tsdisplay(ATM2, Cash)

#Find the model
fit_ATM2<-ATM2|> model(
        stepwise = ARIMA(Cash),
        search = ARIMA(Cash, stepwise=FALSE),
        SN = SNAIVE(Cash),
        ETS_A = ETS(Cash ~ error("A") + trend("N") + season("A")),
        ETS_M = ETS(Cash ~ error("M") + trend("N") + season("M"))
        )

fit_ATM2 |> pivot_longer(!ATM, names_to = "Model name",
                         values_to = "Orders")

accuracy(fit_ATM2)

report(fit_ATM2)

augment(fit_ATM2)|>
  filter(.model=='search'|.model=='stepwise')|>
  features(.innov, ljung_box)

Model evaluation: From the accuracy function, Seasonal Naive model has the highest RMSE whilst both ARIMA models have the lowest ones. The RMSE of ETS (additive) is closer to ARIMA models. In the report of fit, both ARIMA models have a lower AICc than ETS additive and multiplicative models. ARIMA stepwise model is slightly better than the search model. It is interesting to find out that ARIMA(2,0,2)(0,1,1)[7] is better than ARIMA(5,0,0)(0,1,1)[7]. Both p-value confirmed that they are white noise. The difference is on the non-seasonal components and I select the model ARIMA(2,0,2)(0,1,1)[7] instead. There is also an order constraint of p + q + P + Q <= 6 in ARIMA function. I need to find out the reason for that too. As such, the ARIMA (ARIMA(2,0,2)(0,1,1)[7]) has the best model among all.

#forecast and save for output
ATM2_model<-ATM2|>
  model(ARIMA(Cash~pdq(2,0,2)+PDQ(0,1,1)))|>
  forecast(h=30)
  
#Plot forecast  
ATM2_model|>
  autoplot(ATM)+
  labs(title='ATM2 Cash Withdrawal')

Something is quite strange as the cash withdrawal should not be less than zero.

ATM3

#Ensemble of time series displays
gg_tsdisplay(ATM3, Cash)

#Find the model
fit_ATM3<-ATM3|> model(
        N = NAIVE(Cash),
        Drift = RW(Cash~drift()),
        ETS_A = ETS(Cash ~ error("A") + trend("N") + season("A"))
        )

fit_ATM3 |> pivot_longer(!ATM, names_to = "Model name",
                         values_to = "Orders")

accuracy(fit_ATM3)

report(fit_ATM3)

Since there are only 3 data points in ATM3, there is no reason to have an ARIMA model. The drift model has the lowest RMSE, which considers the average change seen in the historical data.

#forecast and save for output
ATM3_model<-ATM3|>
  model(RW(Cash~drift()))|>
  forecast(h=30)
  
#Plot forecast  
ATM3_model|>
  autoplot(ATM)+
  labs(title='ATM3 Cash Withdrawal')

Second thought, the ATM3 can be a newly installed ATM which has only been in operation since 2010-04-28. I think a NAIVE or even Mean model may work too.

ATM4

#There is one outliner at 10919.762 and will be replaced by an estimate mean of 470
ATM4["Cash"][ATM4["Cash"]>=10000]<-470.0

#Ensemble of time series displays
gg_tsdisplay(ATM4, Cash)

#Find the model
fit_ATM4<-ATM4|> model(
        stepwise = ARIMA(Cash),
        search = ARIMA(Cash, stepwise=FALSE),
        SN = SNAIVE(Cash),
        ETS_A = ETS(Cash ~ error("A") + trend("N") + season("A")),
        ETS_M = ETS(Cash ~ error("M") + trend("N") + season("M"))
        )

fit_ATM4 |> pivot_longer(!ATM, names_to = "Model name",
                         values_to = "Orders")

accuracy(fit_ATM4)

report(fit_ATM4)

#forecast and save for output
ATM4_model<-ATM4|>
  model(ARIMA(Cash~pdq(3,0,2)+PDQ(1,0,0)))|>
  forecast(h=30)
  
#Plot forecast  
ATM4_model|>
  autoplot(ATM4)+
  labs(title='ATM4 Cash Withdrawal')

Again, the cash withdrawal can’t be less than zero as the model doesn’t predict cash deposit.

#Combine all dataframe
ATM_model<-bind_rows(ATM1_model, ATM2_model, ATM3_model, ATM4_model)

#Creat a new Excel workbook
wb <- createWorkbook()

# add the data frames to separate worksheets in the workbook
addWorksheet(wb, "ATM_model")
writeData(wb, sheet = "ATM_model", x = ATM_model, startRow = 1, startCol = 1)

Part B

Part B – Forecasting Power, ResidentialCustomerForecastLoad-624.xlsx

Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in Kilowatt hours, the rest is straight forward. Add this to your existing files above.

#Read data to create a tsibble object
RCFL<-read_xlsx('ResidentialCustomerForecastLoad-624.xlsx', col_types=c('numeric', 'text', 'numeric'))|>
  rename("YearMonth"="YYYY-MMM")|>
  mutate(YearMonth=yearmonth(YearMonth), .keep="unused")|>
  #filter(!is.na(KWH))|>
  as_tsibble(index=YearMonth)

#Take a look at the Tsibble
head(RCFL)

summary(RCFL$KWH)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
##   770523  5429912  6283324  6502475  7620524 10655730        1

#Interpolate the NA in KWH
RCFL<-RCFL|>na_interpolation('KWH', option='linear')
summary(RCFL$KWH)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##   770523  5434539  6314472  6502824  7608792 10655730

There is one N/A in KWH column which is fixed.

#Time series
RCFL|>autoplot(KWH)+
  labs(title='Residential power usage for January 1998 until December 2013')

#Time series by replacing the outliner with the mean
RCFL["KWH"][RCFL["KWH"]<=3000000]<-6502824

RCFL|>autoplot(KWH)+
  labs(title='Residential power usage for January 1998 until December 2013 witout outliner')

#Ensemble of time series displays
gg_tsdisplay(RCFL, KWH)

#Find the model
fit_RCFL<-RCFL|> model(
        stepwise = ARIMA(KWH),
        search = ARIMA(KWH, stepwise=FALSE),
        SN = SNAIVE(KWH),
        ETS_A = ETS(KWH ~ error("A") + trend("N") + season("A")),
        ETS_M = ETS(KWH ~ error("M") + trend("N") + season("M"))
        )

fit_RCFL|>pivot_longer(everything(), names_to = "Model name", values_to = "Orders")

accuracy(fit_RCFL)

report(fit_RCFL)

Model evaluation: From the accuracy function, Seasonal Naive model has the highest RMSE whilst both ARIMA models have the lowest ones. ARIMA search model has the lowest RSME and AICc. As such, the ARIMA (ARIMA(0,0,4)(2,1,0)[12]) has the best model among all.

fit_RCFL|>select(search)|>gg_tsresiduals()

#forecast and save for output
RCFL_output<-RCFL|>
  model(ARIMA(KWH~pdq(0,0,4)+PDQ(2,1,0)))|>
  forecast(h=24)

#Plot forecast
RCFL_output|>autoplot(RCFL)+
  labs(title='Residential power usage with Forecast')

addWorksheet(wb, "RCFL")
writeData(wb, sheet = "RCFL", x = RCFL_output, startRow = 1, startCol = 1)


# save the workbook to a file
saveWorkbook(wb, "Data624_project1.xlsx")

The residual looks fine and looks almost like a white noise. The ACF plot shows all values are within the limit and the residual has a normal distribution.

Part C

Part C – BONUS, optional (part or all), Waterflow_Pipe1.xlsx and Waterflow_Pipe2.xlsx

Part C consists of two data sets. These are simple 2 columns sets, however they have different time stamps. Your optional assignment is to time-base sequence the data and aggregate based on hour (example of what this looks like, follows). Note for multiple recordings within an hour, take the mean. Then to determine if the data is stationary and can it be forecast. If so, provide a week forward forecast and present results via Rpubs and .rmd and the forecast in an Excel readable file.

#Read data to create a tsibble object
waterflow_Pipe1<-read_xlsx('Waterflow_Pipe1.xlsx', col_types=c('date', 'numeric'))|>
  rename(DateTime="Date Time")|>
  #add one hour to align with waterflow_Pipe2
  mutate(Date = as.Date(DateTime), Time = hour(DateTime)+1)|> 
  group_by(Date, Time)|>
  summarise(Flow1=mean(WaterFlow))|>
  ungroup()|>
  mutate(DateTime=ymd_h(paste0(Date,Time)))|>
  select(DateTime, Flow1)|>
  as_tsibble(index=DateTime)

waterflow_Pipe2<-read_xlsx('Waterflow_Pipe2.xlsx', col_types=c('date', 'numeric'))|>
  rename(DateTime="Date Time", Flow2=WaterFlow)|>
  mutate(Date = as.Date(DateTime), Time = hour(DateTime))|>
  select(DateTime, Flow2)|>
  as_tsibble(index=DateTime)

summary(waterflow_Pipe1)

##     DateTime                         Flow1       
##  Min.   :2015-10-23 01:00:00.0   Min.   : 8.923  
##  1st Qu.:2015-10-25 11:45:00.0   1st Qu.:17.033  
##  Median :2015-10-27 23:30:00.0   Median :19.784  
##  Mean   :2015-10-27 23:38:53.9   Mean   :19.893  
##  3rd Qu.:2015-10-30 11:15:00.0   3rd Qu.:22.789  
##  Max.   :2015-11-02 00:00:00.0   Max.   :31.730

summary(waterflow_Pipe2)

##     DateTime                       Flow2       
##  Min.   :2015-10-23 01:00:00   Min.   : 1.885  
##  1st Qu.:2015-11-02 10:45:00   1st Qu.:28.140  
##  Median :2015-11-12 20:30:00   Median :39.682  
##  Mean   :2015-11-12 20:30:00   Mean   :39.556  
##  3rd Qu.:2015-11-23 06:15:00   3rd Qu.:50.782  
##  Max.   :2015-12-03 16:00:00   Max.   :78.303

Both series don’t have any NAs.

#Aggregate based on hour by combining two time series
waterflow<-merge(waterflow_Pipe1, waterflow_Pipe2, by="DateTime", all=TRUE)

waterflow<-waterflow|>
  mutate_at("Flow1", ~replace_na(.,0))|>
  mutate(Flow=Flow1+Flow2)|>
  select(DateTime, Flow)|>
  as_tsibble(index=DateTime)

summary(waterflow)

##     DateTime                        Flow        
##  Min.   :2015-10-23 01:00:00   Min.   :  1.885  
##  1st Qu.:2015-11-02 10:45:00   1st Qu.: 31.253  
##  Median :2015-11-12 20:30:00   Median : 43.558  
##  Mean   :2015-11-12 20:30:00   Mean   : 44.250  
##  3rd Qu.:2015-11-23 06:15:00   3rd Qu.: 56.950  
##  Max.   :2015-12-03 16:00:00   Max.   :106.499

waterflow|>gg_tsdisplay(Flow)

waterflow|>
  gg_lag(Flow)

waterflow|>gg_tsdisplay(difference(Flow), plot_type = 'partial')

## Warning: Removed 1 row containing missing values (`geom_line()`).

## Warning: Removed 1 rows containing missing values (`geom_point()`).

The time plot is almost in white noise pattern. The variance looks stable. The ACF shows that autocorrelation exists.

A first differnce of the data is then made to address non-stationarity. It is stationary and shows white noise pattern.

The ACF of difference shows large spike in multiple lag including lag1, 15, 24, 27.

Let’s try the ARIMA algorithm to find out the best model and then forecast.

#Find the model
fit_flow<-waterflow|> model(
        stepwise = ARIMA(Flow),
        search = ARIMA(Flow, stepwise=FALSE),
        SN = SNAIVE(Flow),
        ETS_A = ETS(Flow ~ error("A") + trend("N") + season("A")),
        ETS_M = ETS(Flow ~ error("M") + trend("N") + season("M"))
        )

fit_flow|>pivot_longer(everything(), names_to = "Model name", values_to = "Orders")

accuracy(fit_flow)

report(fit_flow)

The best model is ARIMA(0,1,2)(1,0,1)[24]

#forecast and save for output
fit_flow_o<-waterflow|>
  model(ARIMA(Flow~pdq(0,1,2)+PDQ(1,0,1)))|>
  forecast(h=168)
  
#Plot forecast  
fit_flow_o|>
  autoplot(waterflow)+
  labs(title='Waterflow from two pipe')

#Out to an Excel file
write.xlsx(fit_flow_o, file = "waterflow.xlsx")