DATA624 Project1

library(fpp3)
library(tidyverse)
library(fable)
library(latex2exp)

Part A - ATM Forecast

Pre-processing the data

Firstly, after importing the data, it must be pre-processed before running it through any model so the data was transformed into a tsibble format for easy viewing. The tsibble was assigned DATE (transformed into a date format) as the index and the atm assignment ATM as the key with the only other column Cash left as the data for the tsibble atm. After graphing the data and viewing it there was a bit of data exploration to do, namely, there were some rows missing the atm assignment in the ATM column. These rows were also missing data for the Cash columns and the dates seemed to be at the end of the series but after the end date for the data of each atm so these rows were removed. Then there were a handful of rows missing Cash data for ATM1 and ATM2 so these were given the average Cash value for their respective machines. The average was chosen since it will not likely cause adverse effects on the prediction models while also not trying to guess the actual value. There was one outlier visible for ATM4, which is for the date 2010-02-09 with a Cash value of 10920. Since the data for these machines is in hundreds of dollars that means, according to the data, this machine dispensed $1,092,000. After a preliminary google search most ATM’s have a limited capacity of $800,000 this means that the ATM would have to be refilled at some point throughout the day. While this is not impossible it seems unlikely since February 9th is not a major holiday in the USA and it was a Tuesday. Given these and the fact that the data point ends in a zero it is most likely a data error and was corrected to be 10920 which falls in line with the rest of the data. Now that the data has been pre-processed it was ready for testing of forecasting models.

#reading in ATM data
atm <- read.csv('ATM624Data.csv')
colnames(atm)[1] <- "DATE"

#transforming raw data into a tsibble
atm <- atm |>
  mutate(DATE = as.Date(as_datetime(DATE, format = "%m/%d/%Y %H:%M"))) |>
  as_tsibble(index = DATE, key = ATM)

#viewing the data
atm |> autoplot() +
  facet_wrap(~ATM, scale = "free_y") +
  ggtitle("ATM Data")

#checking to see number of entries with missing atm value in ATM column
atm |>
  filter(ATM == "")

## # A tsibble: 14 x 3 [1D]
## # Key:       ATM [1]
##    DATE       ATM    Cash
##    <date>     <chr> <int>
##  1 2010-05-01 ""       NA
##  2 2010-05-02 ""       NA
##  3 2010-05-03 ""       NA
##  4 2010-05-04 ""       NA
##  5 2010-05-05 ""       NA
##  6 2010-05-06 ""       NA
##  7 2010-05-07 ""       NA
##  8 2010-05-08 ""       NA
##  9 2010-05-09 ""       NA
## 10 2010-05-10 ""       NA
## 11 2010-05-11 ""       NA
## 12 2010-05-12 ""       NA
## 13 2010-05-13 ""       NA
## 14 2010-05-14 ""       NA

#removing rows without ATM value
atm <- atm |>
  filter(ATM != "")

#viewing rows without cash value
atm |>
  filter(is.na(Cash)) |>
  print()

## # A tsibble: 5 x 3 [1D]
## # Key:       ATM [2]
##   DATE       ATM    Cash
##   <date>     <chr> <int>
## 1 2009-06-13 ATM1     NA
## 2 2009-06-16 ATM1     NA
## 3 2009-06-22 ATM1     NA
## 4 2009-06-18 ATM2     NA
## 5 2009-06-24 ATM2     NA

#viewing the top 10 highest cash days over all the atms
atm |>
  arrange(desc(Cash)) |>
  head(10) |>
  print()

## # A tsibble: 10 x 3 [1D]
## # Key:       ATM [1]
##    DATE       ATM    Cash
##    <date>     <chr> <int>
##  1 2010-02-09 ATM4  10920
##  2 2009-09-22 ATM4   1712
##  3 2010-01-29 ATM4   1575
##  4 2009-09-04 ATM4   1495
##  5 2009-06-09 ATM4   1484
##  6 2009-09-05 ATM4   1301
##  7 2010-02-28 ATM4   1276
##  8 2009-08-23 ATM4   1246
##  9 2009-06-14 ATM4   1221
## 10 2009-09-29 ATM4   1195

#finding the mean value for each atm
means <- atm |>
  as_tibble() |>
  select(ATM, Cash) |>
  group_by(ATM) |>
  dplyr::summarize(mean = mean(Cash, na.rm = TRUE)) |>
  print()

## # A tibble: 4 × 2
##   ATM      mean
##   <chr>   <dbl>
## 1 ATM1   83.9  
## 2 ATM2   62.6  
## 3 ATM3    0.721
## 4 ATM4  474.

#data clean-up for missing values using mean for the atm and outliers
#with data errors being corrected
atm <- atm |>
  mutate(Cash = case_when(
    is.na(Cash) & ATM == "ATM1" ~ means$mean[means$ATM == "ATM1"], 
    is.na(Cash) & ATM == "ATM2" ~ means$mean[means$ATM == "ATM2"], 
    Cash == 10920 ~ 1092,
    .default = Cash)
  )

#viewing all the data now adjusted
atm |> autoplot() +
  facet_wrap(~ATM, scale = "free_y") +
  ggtitle("ATM Adjusted Data")

Testing for ARIMA

To see if this data was viable for ARIMA it was tested to see if it could be transformed to be stationary. The first check, the function unitroot_kpss, returned that only ATM4 was stationary. After a box-cox lambda (using the guerrero method) produced values not appreciably close to 1 it seemed necessary to attempt a box-cox transform for each atm. Then tested again with unitroot_kpss, while the values were closer to the pass value of 0.1 differencing was still needed. After differencing all the values once the unitroot_kpss values were all 0.1 or close enough to justify the use of ARIMA.

#Testing initial data
atm |>
  features(Cash, unitroot_kpss) |>
  print()

## # A tibble: 4 × 3
##   ATM   kpss_stat kpss_pvalue
##   <chr>     <dbl>       <dbl>
## 1 ATM1      0.452      0.0547
## 2 ATM2      2.15       0.01  
## 3 ATM3      0.389      0.0818
## 4 ATM4      0.105      0.1

#calculating lambda for all the atm data
lambda <- atm |>
  features(Cash, features = guerrero) |>
  pull(lambda_guerrero) |>
  print()

## [1] 0.2365721 0.7240522 1.3654211 0.4638517

#applying the Box-Cox transformation to all the atm data
bc_atm <- atm |>
  mutate(Cash = case_when(
    ATM == "ATM1" ~ box_cox(atm$Cash, lambda[1]),
    ATM == "ATM2" ~ box_cox(atm$Cash, lambda[2]),
    ATM == "ATM3" ~ box_cox(atm$Cash, lambda[3]),
    ATM == "ATM4" ~ box_cox(atm$Cash, lambda[4]))
  ) 

#Testing Box-Cox transformed data
bc_atm |>
  features(Cash, unitroot_kpss) |>
  print()

## # A tibble: 4 × 3
##   ATM   kpss_stat kpss_pvalue
##   <chr>     <dbl>       <dbl>
## 1 ATM1     0.409       0.0733
## 2 ATM2     2.14        0.01  
## 3 ATM3     0.389       0.0817
## 4 ATM4     0.0737      0.1

#Finding suggested difference
bc_atm |>
  features(Cash, unitroot_ndiffs) |>
  print()

## # A tibble: 4 × 2
##   ATM   ndiffs
##   <chr>  <int>
## 1 ATM1       0
## 2 ATM2       1
## 3 ATM3       0
## 4 ATM4       0

#applying difference of one found above
diff_bc_atm <- bc_atm |>
  mutate(Cash = difference(Cash))

#Testing Box-Cox Differenced data
diff_bc_atm |>
  features(Cash, unitroot_kpss) |>
  print()

## # A tibble: 4 × 3
##   ATM   kpss_stat kpss_pvalue
##   <chr>     <dbl>       <dbl>
## 1 ATM1    0.00876      0.1   
## 2 ATM2    0.0129       0.1   
## 3 ATM3    0.348        0.0997
## 4 ATM4    0.260        0.1

Testing models on training data

Now to test the models the training data is limited to the first 75% of the data. STL is applied to the data to see the components, seasonality, trend, and remainder. Then the ARIMA and ETS functions run wide open give a model for each set of ATMs. Then these models suggested for each ATM is applied to the whole set (this is under the assumption that the same model has to be used for all the ATMs) to ensure it’s the same for all the data. In addition SNAIVE is also used as a basic fail-safe.

#selecting about 75% of the data for training
train <- atm |>
  filter_index("2009-05-01" ~ "2010-01-31")

#STL decomp to see each of the components
train |>
  model(stl = STL(Cash)) |>
  components() |>
  autoplot() +
  ggtitle("ATM Training Data STL Decomp")

#testing ARIMA and ETS with the training data
train |>
  model(
    ARIMA = ARIMA(Cash),
    ETS = ETS(Cash)
  ) |>
  print()

## # A mable: 4 x 3
## # Key:     ATM [4]
##   ATM                               ARIMA          ETS
##   <chr>                           <model>      <model>
## 1 ATM1           <ARIMA(0,0,3)(0,1,1)[7]> <ETS(M,N,M)>
## 2 ATM2  <ARIMA(0,0,0)(1,1,1)[7] w/ drift> <ETS(A,N,A)>
## 3 ATM3           <ARIMA(0,0,0)(0,0,1)[7]> <ETS(A,N,N)>
## 4 ATM4   <ARIMA(0,0,0)(1,0,0)[7] w/ mean> <ETS(M,N,M)>

#testing a bunch of models with the test data
train_atm <- train |>
  model(
    SNAIVE = SNAIVE(Cash, type = "additive"),
    `ETS(A,N,A)` = ETS(Cash ~ error("A") + trend("N") + season("A")),
    `ETS(A,A,A)` = ETS(Cash ~ error("A") + trend("A") + season("A")),
    `ETS(M,N,M)` = ETS(Cash ~ error("M") + trend("N") + season("M")),
    ARIMA1 = ARIMA(Cash ~ pdq(0,0,3) + PDQ(0,1,1)),
    ARIMA2 = ARIMA(Cash ~ 1 + pdq(0,0,0) + PDQ(1,1,1)),
    ARIMA3 = ARIMA(Cash ~ pdq(0,0,0) + PDQ(1,0,0))
  ) |>
  print() |>
  forecast(h = 90)

## # A mable: 4 x 8
## # Key:     ATM [4]
##   ATM     SNAIVE `ETS(A,N,A)` `ETS(A,A,A)` `ETS(M,N,M)`
##   <chr>  <model>      <model>      <model>      <model>
## 1 ATM1  <SNAIVE> <ETS(A,N,A)> <ETS(A,A,A)> <ETS(M,N,M)>
## 2 ATM2  <SNAIVE> <ETS(A,N,A)> <ETS(A,A,A)> <ETS(M,N,M)>
## 3 ATM3  <SNAIVE> <ETS(A,N,A)> <ETS(A,A,A)> <NULL model>
## 4 ATM4  <SNAIVE> <ETS(A,N,A)> <ETS(A,A,A)> <ETS(M,N,M)>
## # ℹ 3 more variables: ARIMA1 <model>, ARIMA2 <model>, ARIMA3 <model>

After running all the selected models on the training data some returned as NULL models, mostly for ATM3 since most of that data is Cash=0. After filtering out these models the accuracy function is used to view all the error calculations to see how well the forecasts match up to the actual rest of the data. While each does best for a different ATM ETS(A,N,A) seems to do the best overall when taking all the ATMs into account across all error measurements.

# averaging RMSE per model
train_atm |>
  accuracy(atm) |>
  select(.model, ATM, RMSE) |>
  filter(.model != "ETS(M,N,M)" & .model != "ARIMA1" & .model != "ARIMA2" 
         & .model != "ARIMA3") |>
  arrange(ATM) |>
  arrange(RMSE) |>
  print()

## # A tibble: 12 × 3
##    .model     ATM    RMSE
##    <chr>      <chr> <dbl>
##  1 ETS(A,A,A) ATM3   16.1
##  2 ETS(A,N,A) ATM3   16.1
##  3 SNAIVE     ATM3   16.1
##  4 SNAIVE     ATM1   49.1
##  5 ETS(A,N,A) ATM1   49.9
##  6 ETS(A,A,A) ATM1   50.2
##  7 SNAIVE     ATM2   58.6
##  8 ETS(A,A,A) ATM2   59.5
##  9 ETS(A,N,A) ATM2   59.6
## 10 ETS(A,N,A) ATM4  377. 
## 11 ETS(A,A,A) ATM4  377. 
## 12 SNAIVE     ATM4  605.

# averaging the error per model
train_atm |>
  accuracy(atm) |>
  select(!ATM & !.type) |>
  filter(.model != "ETS(M,N,M)" & .model != "ARIMA1" & .model != "ARIMA2" 
         & .model != "ARIMA3") |>
  group_by(.model) |>
  summarize(across(everything(), mean)) |>
  arrange(RMSE) |>
  print()

## # A tibble: 3 × 9
##   .model         ME  RMSE   MAE   MPE  MAPE  MASE RMSSE   ACF1
##   <chr>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>
## 1 ETS(A,N,A)  -6.12  126.  100.  -Inf   Inf   Inf   Inf 0.188 
## 2 ETS(A,A,A)  -9.78  126.  101.  -Inf   Inf   Inf   Inf 0.193 
## 3 SNAIVE     -48.6   182.  149.  -Inf   Inf   Inf   Inf 0.0572

While the forecast doesn’t match exactly for most of the ATMs it approximates pretty closely.

#plotting the training model ETS(A,N,A) vs the actual ATM data
train_atm |>
  filter(.model == "ETS(A,N,A)") |>
  autoplot(atm) +
  ggtitle("ATM actual vs ETS(A,N,A)")

Forecasting Data

Now that the model has been selected it is just a matter of applying it to the whole data set and forecasting out 31 days to get the month of May in 2010. Then writing the forecast to a csv.

#produces forecast for May 2010 using ETS(A,N,A)
atm_forecast <- atm |>
  model(ETS = ETS(Cash ~ error("A") + trend("N") + season("A"))) |>
  forecast(h = 31)

#graphs forecast above
atm_forecast |>
  autoplot(atm) +
  ggtitle("ATM ETS(A,N,A) Forecast May 2010")

#saves forecast to separate file
atm_forecast |>
  as_tibble() |>
  select(DATE, ATM, .mean) |>
  rename(Cash = .mean) |>
  print() |>
  write.csv("atmforecast.csv")

## # A tibble: 124 × 3
##    DATE       ATM     Cash
##    <date>     <chr>  <dbl>
##  1 2010-05-01 ATM1   87.1 
##  2 2010-05-02 ATM1  101.  
##  3 2010-05-03 ATM1   73.1 
##  4 2010-05-04 ATM1    5.84
##  5 2010-05-05 ATM1  100.  
##  6 2010-05-06 ATM1   79.5 
##  7 2010-05-07 ATM1   85.7 
##  8 2010-05-08 ATM1   87.1 
##  9 2010-05-09 ATM1  101.  
## 10 2010-05-10 ATM1   73.1 
## # ℹ 114 more rows

Part B - Forecasting Power

Pre-processing the data

Same as before the data must be pre-processed before being fed into the forecasting model. So it was first put into a tsibble and autoplot() and there is very clearly a missing data point and an outlier in this data. After filtering to the missing data point there is only one so it was replaced by the mean for the data. The outlier is view and it looks to be very similar to the atm data in that the outlier appeared to have an extra digit, this data seemed to be missing a digit. Especially, since it is power data, which is extremely periodic and most likely wouldn’t have such a major dip unless there was a long-term outage. Since the data is residential and in the millions of KWH it has to be a number of residences so unless everyone went on vacation at the same time and didn’t use as much electricity or there was a natural disaster it is unlikely to be an accurate data point. After the data was mutated the outlier fit much better into the seasonality and the data seemed to be ready for testing.

#reading in data
power <- read.csv('ResidentialCustomerForecastLoad-624.csv')

#transforming data into a tsibble
power <- power |>
  mutate(DATE = yearmonth(YYYY.MMM)) |>
  select(DATE, KWH) |>
  as_tsibble(index = DATE)

#plotting data for first look
power |> autoplot()

#viewing all missing data
power |>
  filter(is.na(KWH)) |>
  print()

## # A tsibble: 1 x 2 [1M]
##       DATE   KWH
##      <mth> <int>
## 1 2008 Sep    NA

#looking at the outliers for the data
power |>
  ggplot() +
  geom_boxplot(aes(KWH))

#mutating so missing data is replaced with average and outlier is corrected
power <- power |>
  mutate(KWH = case_when(
    is.na(KWH) ~ mean(power$KWH, na.rm = TRUE),
    KWH == 770523 ~ 7705230,
    .default = KWH)
  )

#plotting data after final transformations
power |> autoplot()

Testing for ARIMA

The same process used for the ATM data was used here with the same results. After testing the initial data with unitroot_kpss it needed transformations to ensure ARIMA could be used. The lambda found was very far from 1 making it a good candidate for Box-Cox transformation. After the Box-Cox was applied the function unitroot_ndiffs indicated that one difference function application would ensure that the data would be a good candidate for ARIMA. After applying one difference function to KWH the data passed unitroot_kpss, returning a value of 0.1.

#Testing initial data
power |>
  features(KWH, unitroot_kpss) |>
  print()

## # A tibble: 1 × 2
##   kpss_stat kpss_pvalue
##       <dbl>       <dbl>
## 1      1.65        0.01

#calculating lambda for all the power data
lambda <- power |>
  features(KWH, features = guerrero) |>
  pull(lambda_guerrero) |>
  print()

## [1] -0.2094801

#applying the Box-Cox transformation to all the power data
bc_power <- power |>
  mutate(KWH = box_cox(KWH, lambda)) 

#Testing Box-Cox transformed data
bc_power |>
  features(KWH, unitroot_kpss) |>
  print()

## # A tibble: 1 × 2
##   kpss_stat kpss_pvalue
##       <dbl>       <dbl>
## 1      1.64        0.01

#Finding suggested difference
bc_power |>
  features(KWH, unitroot_ndiffs) |>
  print()

## # A tibble: 1 × 1
##   ndiffs
##    <int>
## 1      1

#applying difference of one found above
diff_bc_power<- bc_power |>
  mutate(KWH = difference(KWH))

#Testing Box-Cox Differenced data
diff_bc_power |>
  features(KWH, unitroot_kpss) |>
  print()

## # A tibble: 1 × 2
##   kpss_stat kpss_pvalue
##       <dbl>       <dbl>
## 1    0.0412         0.1

Testing models on training data

A subset of the data was selected to test some models with some training data to see the viability of each model. The models ARIMA (with a box-cox transformation found above applied), ETS and SNAIVE were chosen and run open so they could test and select the best parameters. Then a forecast for the rest of the actual data was created.

#selecting about 75% of the data for training
train <- power |>
  filter_index("1998-Jan" ~ "2009-Dec")

#STL decomp to see each of the components
train |>
  model(stl = STL(KWH)) |>
  components() |>
  autoplot() +
  ggtitle("power Training Data STL Decomp")

#testing a bunch of models with the test data and forecasting them
train_power <- train |>
  model(
    SNAIVE = SNAIVE(KWH, type = "additive"),
    ETS = ETS(KWH),
    ARIMA = ARIMA(box_cox(KWH, lambda))
  ) |>
  print() |>
  forecast(h = 48)

## # A mable: 1 x 3
##     SNAIVE          ETS                              ARIMA
##    <model>      <model>                            <model>
## 1 <SNAIVE> <ETS(M,N,A)> <ARIMA(0,0,1)(0,1,1)[12] w/ drift>

The training data forecast was then tested against the actual data using the accuracy function to see which model had the lowest error. After looking ARIMA did the best by far, having the lowest error for every type of error across the board.

#arranging data by RMSE to see which is the best model according to training
train_power |>
  accuracy(power) |>
  arrange(RMSE) |>
  print()

## # A tibble: 3 × 10
##   .model .type      ME     RMSE      MAE   MPE  MAPE  MASE RMSSE  ACF1
##   <chr>  <chr>   <dbl>    <dbl>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ARIMA  Test  738027. 1065875.  792335.  9.00  9.90  1.41  1.49 0.263
## 2 SNAIVE Test  772956. 1144846.  888024.  9.25 11.2   1.58  1.60 0.254
## 3 ETS    Test  972863. 1304574. 1000636. 12.0  12.5   1.78  1.83 0.297

Just to confirm that ARIMA was the right fit the residuals for the model were plotted to ensure they were normally distributed and that they looked to be like white noise which they did. The forecast matches extremely well to the actual data and produces a lot of confidence for the model.

#plotting residuals for arima function to see if it is valid for this dataset
power |>
  model(ARIMA(box_cox(KWH, lambda) ~ pdq(0,0,3) + PDQ(0,1,1))) |>
  gg_tsresiduals()

#plotting the training model ARIMA vs the actual power data
train_power |>
  filter(.model == "ARIMA") |>
  autoplot(power) +
  ggtitle("power actual vs ARIMA")

Forecasting Data

After ARIMA was selected it was used to forecast the actual data with the parameters produced by the function running on the training data. It forecasted the monthly data for 2014 and then exported into a csv.

#produces forecast for 2014 using ARIMA
power_forecast <- power |>
  model(ARIMA(box_cox(KWH, lambda) ~ pdq(0,0,3) + PDQ(0,1,1))) |>
  print() |>
  forecast(h = 12)

## # A mable: 1 x 1
##   `ARIMA(box_cox(KWH, lambda) ~ pdq(0, 0, 3) + PDQ(0, 1, 1))`
##                                                       <model>
## 1                          <ARIMA(0,0,3)(0,1,1)[12] w/ drift>

#graphs forecast above
power_forecast |>
  autoplot(power) +
  ggtitle("power ARIMA Forecast May 2010")

#saves forecast to separate file
power_forecast |>
  as_tibble() |>
  select(DATE, .mean) |>
  rename(KWH = .mean) |>
  print() |>
  write.csv("powerforecast.csv")

## # A tibble: 12 × 2
##        DATE       KWH
##       <mth>     <dbl>
##  1 2014 Jan 10655013.
##  2 2014 Feb  8549197.
##  3 2014 Mar  7136705.
##  4 2014 Apr  5996602.
##  5 2014 May  5756337.
##  6 2014 Jun  7696568.
##  7 2014 Jul  8922429.
##  8 2014 Aug  9497618.
##  9 2014 Sep  8297594.
## 10 2014 Oct  5974682.
## 11 2014 Nov  5753939.
## 12 2014 Dec  7891127.

Part C - BONUS Water

water1 <- read.csv('Waterflow_Pipe1.csv')
water2 <- read.csv('Waterflow_Pipe2.csv')

Transforming the data

tswater1 <- water1 |>
  mutate(Date.Time =
           floor_date(
             as_datetime(Date.Time, format = "%m/%d/%y %H:%M %Op"),
             unit = "hour")
         ) |>
  rbind(c(as_datetime("10/27/15 5:00 PM", format = "%m/%d/%y %H:%M %Op"), 0)) |>
  rbind(c(as_datetime("10/28/15 12:00 AM", format = "%m/%d/%y %H:%M %Op"), 0)) |>
  rbind(c(as_datetime("11/01/15 5:00 AM", format = "%m/%d/%y %H:%M %Op"), 0)) |>
  rbind(c(as_datetime("11/01/15 9:00 AM", format = "%m/%d/%y %H:%M %Op"), 0)) |>
  group_by(Date.Time) |>
  summarize(WaterFlow = mean(WaterFlow, rm.na = TRUE)) |>
  as_tsibble(index = Date.Time)

tswater1 |>
  autoplot()

tswater2 <- water2 |>
  mutate(Date.Time =
           floor_date(
             as_datetime(Date.Time, format = "%m/%d/%y %H:%M %Op"),
             unit = "hour")
         ) |>
  as_tsibble(index = Date.Time)

tswater2 |>
  autoplot()

Stationary?

There is an easy method to test whether a data set is stationary and that is to use the unitroot_kpss function to get the kpss_pvalue. Given that the value here is 0.1 for both data sets it can be assumed that they are both stationary.

#Testing if data is stationary
tswater1 |>
  features(WaterFlow, unitroot_kpss) |>
  print()

## # A tibble: 1 × 2
##   kpss_stat kpss_pvalue
##       <dbl>       <dbl>
## 1    0.0915         0.1

#Testing if data is stationary
tswater2 |>
  features(WaterFlow, unitroot_kpss) |>
  print()

## # A tibble: 1 × 2
##   kpss_stat kpss_pvalue
##       <dbl>       <dbl>
## 1     0.105         0.1

Forecasting with ARIMA

Since these data sets are stationary the best bet is to use ARIMA to determine the forecasts. Unfortunately here ARIMA doesn’t do well and I think it’s because the data here is too much like white noise. It is hard to forecast white noise so the best ARIMA model it is able to produce is pdq = 0,0,0.

arimawater1 <- tswater1 |>
  model(ARIMA(WaterFlow ~ PDQ(0,0,0))) |>
  print()

## # A mable: 1 x 1
##   `ARIMA(WaterFlow ~ PDQ(0, 0, 0))`
##                             <model>
## 1            <ARIMA(0,0,0) w/ mean>

arimawater2 <- tswater2 |>
  model(ARIMA(WaterFlow ~ PDQ(0,0,0))) |>
  print()

## # A mable: 1 x 1
##   `ARIMA(WaterFlow ~ PDQ(0, 0, 0))`
##                             <model>
## 1            <ARIMA(0,0,0) w/ mean>

arimawater1 |>
  gg_tsresiduals()

arimawater2 |>
  gg_tsresiduals()

waterforecast1 <- arimawater1 |>
  forecast(h = 24*7)

waterforecast2 <- arimawater2 |>
  forecast(h = 24*7)

waterforecast1 |>
  autoplot(tswater1)

waterforecast2 |>
  autoplot(tswater2)

waterforecast1 |>
  as_tibble() |>
  select(Date.Time, .mean) |>
  rename(WaterFlow = .mean) |>
  write.csv("water1.csv")

waterforecast2 |>
  as_tibble() |>
  select(Date.Time, .mean) |>
  rename(WaterFlow = .mean) |>
  write.csv("water2.csv")

P.S. I don’t know if the forecast produced by the ARIMA function for the water flow is what you were looking for but I do remember very early on in the class you saying that there are some data sets out there where the MEAN function is the best forecast out there so I’m sticking with what I got even if what I got doesn’t look like much of a forecast.

DATA624 Project1

Cameron Gray

2025-03-26

Part A - ATM Forecast

Pre-processing the data

Testing for ARIMA

Testing models on training data

Forecasting Data

Part B - Forecasting Power

Pre-processing the data

Testing for ARIMA

Testing models on training data

Forecasting Data

Part C - BONUS Water

Transforming the data

Stationary?

Forecasting with ARIMA