In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars; other than that it is straightforward. I am being somewhat ambiguous on purpose to make this have a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an Excel file. Please provide your written report on your findings, visuals, discussion, and your R code via an RPubs link along with the actual .rmd file. Also, please submit the forecast in an Excel-readable file.
We first load the packages used throughout this report, pull the data in, and convert the Excel serial date into a proper Date.
# Packages used throughout (presumably loaded once in a setup chunk)
library(tidyverse) # dplyr, ggplot2, readr
library(readxl)    # read_excel
library(tsibble)   # as_tsibble, yearmonth
library(forecast)  # Part C: ets, auto.arima, ggtsdisplay
library(fable)     # ETS and ARIMA models
library(feasts)    # gg_tsresiduals, gg_season
library(lubridate) # floor_date, wday, month

atm <- read_excel("ATM624Data.xlsx")
atm <- atm %>%
  mutate(DATE = as.Date(DATE, origin = "1899-12-30"))
head(atm)
## # A tibble: 6 × 3
## DATE ATM Cash
## <date> <chr> <dbl>
## 1 2009-05-01 ATM1 96
## 2 2009-05-01 ATM2 107
## 3 2009-05-02 ATM1 82
## 4 2009-05-02 ATM2 89
## 5 2009-05-03 ATM1 85
## 6 2009-05-03 ATM2 90
summary(atm)
## DATE ATM Cash
## Min. :2009-05-01 Length:1474 Min. : 0.0
## 1st Qu.:2009-08-01 Class :character 1st Qu.: 0.5
## Median :2009-11-01 Mode :character Median : 73.0
## Mean :2009-10-31 Mean : 155.6
## 3rd Qu.:2010-02-01 3rd Qu.: 114.0
## Max. :2010-05-14 Max. :10919.8
## NA's :19
As we can see, there are NA values within the data set. First we remove the rows where DATE or ATM is missing, as data points missing either field are useless for forecasting.
atm <- atm %>% filter(!is.na(DATE), !is.na(ATM))
summary(atm)
## DATE ATM Cash
## Min. :2009-05-01 Length:1460 Min. : 0.0
## 1st Qu.:2009-07-31 Class :character 1st Qu.: 0.5
## Median :2009-10-30 Mode :character Median : 73.0
## Mean :2009-10-30 Mean : 155.6
## 3rd Qu.:2010-01-29 3rd Qu.: 114.0
## Max. :2010-04-30 Max. :10919.8
## NA's :5
With just this cleanup we went from 19 NA values for Cash down to 5. We can see where these five NAs occur:
atm_na <- atm %>%
filter(is.na(Cash)) %>%
arrange(ATM, DATE)
head(atm_na)
## # A tibble: 5 × 3
## DATE ATM Cash
## <date> <chr> <dbl>
## 1 2009-06-13 ATM1 NA
## 2 2009-06-16 ATM1 NA
## 3 2009-06-22 ATM1 NA
## 4 2009-06-18 ATM2 NA
## 5 2009-06-24 ATM2 NA
Since the missing values are isolated, we can use linear interpolation to fill them in from neighboring observations and preserve continuity:
atm <- atm %>%
group_by(ATM) %>%
arrange(DATE) %>%
mutate(Cash = zoo::na.approx(Cash, na.rm = FALSE)) %>%
ungroup()
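A quick check confirms no Cash NAs remain after interpolation:
# Should return 0 if interpolation filled every gap
sum(is.na(atm$Cash))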
Now we convert the data into a tsibble, indexed by DATE and keyed by ATM, and plot each series:
atm_ts <- atm %>%
as_tsibble(index = DATE, key = ATM)
ggplot(atm_ts, aes(x = DATE, y = Cash)) +
geom_line() +
geom_point() +
facet_wrap(~ATM, scales = "free_y", ncol = 1)
For ATM1 we see clear seasonality but no particular trend. We can fit both ETS and ARIMA models to see how the forecasts perform.
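The seasonal pattern can also be confirmed directly with a seasonal plot (a sketch, assuming the seasonality is weekly, as is typical for daily ATM data; gg_season() comes from feasts):
# Sketch: overlay ATM1 weeks to confirm a weekly withdrawal pattern
atm_ts %>%
  filter(ATM == "ATM1") %>%
  gg_season(Cash, period = "week")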
atm1_ts <- atm_ts %>% filter(ATM == "ATM1")
atm1_fit_ets <- atm1_ts %>% model(ETS(Cash))
atm1_fit_arima <- atm1_ts %>% model(ARIMA(Cash))
atm1_forecast_ets <- atm1_fit_ets %>%
forecast(h = "31 days")
atm1_forecast_arima <- atm1_fit_arima %>%
forecast(h = "31 days")
accuracy(atm1_fit_ets)
## # A tibble: 1 × 11
## ATM .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ATM1 ETS(Cash) Training -0.0482 23.8 15.1 -107. 122. 0.852 0.857 0.144
accuracy(atm1_fit_arima)
## # A tibble: 1 × 11
## ATM .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ATM1 ARIMA(Cash) Traini… -0.0742 23.3 14.5 -102. 117. 0.822 0.838 -0.00872
We see the ARIMA model doing slightly better than ETS, so we use it to forecast ATM1:
autoplot(atm1_forecast_arima, atm1_ts)
atm1_fit_arima %>%
gg_tsresiduals()
The residuals look roughly normally distributed, and the ACF plot shows no significant correlation at any lag, meaning the residuals are approximately white noise and this is a reasonably good model for the forecast.
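The white-noise claim can be formalized with a Ljung–Box test on the innovation residuals (a sketch; lag = 14, two weeks of daily data, is my choice rather than part of the original analysis):
# Sketch: a large Ljung-Box p-value is consistent with white-noise residuals
atm1_fit_arima %>%
  augment() %>%
  features(.innov, ljung_box, lag = 14)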
For ATM2 we again come across seasonal data, this time with a slightly decreasing trend:
atm2_ts <- atm_ts %>% filter(ATM == "ATM2")
atm2_ets <- atm2_ts %>%
model(ETS(Cash))
atm2_arima <- atm2_ts %>%
model(ARIMA(Cash))
accuracy(atm2_ets)
## # A tibble: 1 × 11
## ATM .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ATM2 ETS(Cash) Training -0.689 25.1 17.9 -Inf Inf 0.859 0.833 0.0199
accuracy(atm2_arima)
## # A tibble: 1 × 11
## ATM .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ATM2 ARIMA(Cash) Training -0.891 24.1 17.0 -Inf Inf 0.820 0.799 -0.00428
Here again we see the ARIMA model doing better.
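Before forecasting we can also peek at which specification ARIMA() selected (a quick, optional check):
# Optional: show the selected ARIMA order and coefficients for ATM2
report(atm2_arima)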
atm2_forecast <- atm2_arima %>%
forecast(h = 31)
autoplot(atm2_forecast, atm2_ts)
atm2_arima %>% gg_tsresiduals()
For ATM3 we have only a handful of reported (non-zero) values, so fitting a time series model is not meaningful. A simple point forecast, either naive or mean, would suffice, assuming this is a new ATM that began operating toward the end of the data period.
atm3_values <- atm_ts %>%
filter(ATM == "ATM3", Cash > 0) %>%
pull(Cash)
atm3_forecast <- mean(atm3_values)
atm3_forecast
## [1] 87.66667
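Listing the non-zero observations supports the new-ATM interpretation, since they should cluster at the very end of the sample:
# Quick check: when do the non-zero ATM3 values occur?
atm_ts %>%
  filter(ATM == "ATM3", Cash > 0)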
For ATM4 we need to deal with an extreme outlier first:
atm4_ts <- atm_ts %>% filter(ATM == "ATM4")
summary(atm4_ts$Cash)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.563 124.334 403.839 474.043 704.507 10919.762
mean_cash <- mean(atm4_ts$Cash, na.rm = TRUE)
sd_cash <- sd(atm4_ts$Cash, na.rm = TRUE)
threshold <- mean_cash + 2 * sd_cash
atm4_outliers <- atm4_ts %>% filter(Cash > threshold)
atm4_outliers
## # A tsibble: 1 x 3 [1D]
## # Key: ATM [1]
## DATE ATM Cash
## <date> <chr> <dbl>
## 1 2010-02-09 ATM4 10920.
nrow(atm4_outliers)
## [1] 1
It is a single outlier and it looks unnatural enough to most likely be a data entry error. We can replace it with the series median to smooth it out:
median_cash <- median(atm4_ts$Cash, na.rm = TRUE)
atm4_ts <- atm4_ts %>%
mutate(Cash = ifelse(Cash > 10000, median_cash, Cash))
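Given the strong weekly pattern in ATM withdrawals, a same-weekday median would be a slightly more tailored replacement. A sketch of that alternative (not what is used below), applied to the raw ATM4 series:
# Alternative (not used): replace the outlier with the same-weekday median
atm4_alt <- atm_ts %>%
  filter(ATM == "ATM4") %>%
  mutate(Weekday = wday(DATE)) %>%
  group_by(Weekday) %>%
  mutate(Cash = ifelse(Cash > 10000, median(Cash[Cash <= 10000]), Cash)) %>%
  ungroup() %>%
  select(-Weekday)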
Now we can start to work with the time series
atm4_ets <- atm4_ts %>%
model(ETS(Cash))
atm4_arima <- atm4_ts %>%
model(ARIMA(Cash))
accuracy(atm4_ets)
## # A tibble: 1 × 11
## ATM .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ATM4 ETS(Cash) Training -8.29 329. 265. -523. 552. 0.765 0.726 0.101
accuracy(atm4_arima)
We generate and plot both forecasts and check their residuals; the final export uses the ARIMA forecast, consistent with the other ATMs.
atm4_ets_forecast <- atm4_ets %>%
forecast(h = "31 days")
autoplot(atm4_ets_forecast, atm4_ts)
atm4_ets %>% gg_tsresiduals()
atm4_forecast_arima <- atm4_arima %>%
forecast(h = 31)
autoplot(atm4_forecast_arima, atm4_ts)
atm4_arima %>% gg_tsresiduals()
## Converting the forecasts to an Excel-readable file
# Extract DATE and forecast for each and save as data frame
atm1_df <- atm1_forecast_arima %>%
as_tibble() %>%
select(DATE, ATM1_Forecast = .mean)
atm2_df <- atm2_forecast %>%
as_tibble() %>%
select(DATE, ATM2_Forecast = .mean)
atm4_df <- atm4_forecast_arima %>%
as_tibble() %>%
select(DATE, ATM4_Forecast = .mean)
# Full Join by the date
forecast_wide <- atm1_df %>%
full_join(atm2_df, by = "DATE") %>%
full_join(atm4_df, by = "DATE") %>%
arrange(DATE)
# Add ATM3's constant point forecast
forecast_wide <- forecast_wide %>%
mutate(ATM3_Forecast = atm3_forecast)
forecast_wide <- forecast_wide %>%
  mutate(across(starts_with("ATM"), \(x) round(x, 2)))
write_csv(forecast_wide, "DATA624_ATMForecasts_May2010.csv")
# Preview first few rows
head(forecast_wide)
## # A tibble: 6 × 5
## DATE ATM1_Forecast ATM2_Forecast ATM4_Forecast ATM3_Forecast
## <date> <dbl> <dbl> <dbl> <dbl>
## 1 2010-05-01 87.2 66.4 381. 87.7
## 2 2010-05-02 104. 72.7 382. 87.7
## 3 2010-05-03 73.2 13.7 498. 87.7
## 4 2010-05-04 8.03 5.3 347. 87.7
## 5 2010-05-05 100. 97.6 354. 87.7
## 6 2010-05-06 80.9 88.3 407. 87.7
Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and provide a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in kilowatt hours; the rest is straightforward. Add this to your existing files above.
residential <- read_excel("ResidentialCustomerForecastLoad-624.xlsx")
head(residential)
## # A tibble: 6 × 3
## CaseSequence `YYYY-MMM` KWH
## <dbl> <chr> <dbl>
## 1 733 1998-Jan 6862583
## 2 734 1998-Feb 5838198
## 3 735 1998-Mar 5420658
## 4 736 1998-Apr 5010364
## 5 737 1998-May 4665377
## 6 738 1998-Jun 6467147
summary(residential)
## CaseSequence YYYY-MMM KWH
## Min. :733.0 Length:192 Min. : 770523
## 1st Qu.:780.8 Class :character 1st Qu.: 5429912
## Median :828.5 Mode :character Median : 6283324
## Mean :828.5 Mean : 6502475
## 3rd Qu.:876.2 3rd Qu.: 7620524
## Max. :924.0 Max. :10655730
## NA's :1
For this data set we first create a proper Date column, impute the single missing KWH value with the series mean, and then plot the series:
residential <- residential %>%
mutate(
Date = yearmonth(`YYYY-MMM`)
)
residential <- residential %>%
mutate(KWH = ifelse(is.na(KWH), mean(KWH, na.rm = TRUE), KWH))
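Residential power usage is typically strongly seasonal, so an alternative would be to impute using the same-calendar-month mean rather than the overall mean (a sketch of that option, not what is used below):
# Alternative (not used): impute the missing KWH with the same-month mean
residential_alt <- residential %>%
  group_by(Month = month(as.Date(Date))) %>%
  mutate(KWH = ifelse(is.na(KWH), mean(KWH, na.rm = TRUE), KWH)) %>%
  ungroup() %>%
  select(-Month)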
residential_ts <- residential %>%
as_tsibble(index = Date)
autoplot(residential_ts, KWH)
We then deal with outliers, flagged using the 1.5 × IQR rule:
outlier_row <- residential %>%
filter(KWH > quantile(KWH, 0.75, na.rm = TRUE) +
1.5 * IQR(KWH, na.rm = TRUE) |
KWH < quantile(KWH, 0.25, na.rm = TRUE) -
1.5 * IQR(KWH, na.rm = TRUE))
print(outlier_row)
## # A tibble: 1 × 4
## CaseSequence `YYYY-MMM` KWH Date
## <dbl> <chr> <dbl> <mth>
## 1 883 2010-Jul 770523 2010 Jul
residential <- residential %>%
mutate(KWH = ifelse(Date %in% outlier_row$Date,
mean(KWH[Date %in% (outlier_row$Date + c(-1, 1))], na.rm = TRUE),
KWH))
residential_ts <- residential %>%
as_tsibble(index = Date)
autoplot(residential_ts, KWH)
Now we can start testing our models
residential_ets <- residential_ts %>% model(ETS = ETS(KWH))
residential_arima <- residential_ts %>% model(ARIMA = ARIMA(KWH))
residential_accuracy_ets <- accuracy(residential_ets) %>%
select(.model, RMSE, MAE, MAPE)
residential_accuracy_arima <- accuracy(residential_arima) %>%
select(.model, RMSE, MAE, MAPE)
accuracy_comparison <- bind_rows(residential_accuracy_ets, residential_accuracy_arima)
accuracy_comparison
## # A tibble: 2 × 4
## .model RMSE MAE MAPE
## <chr> <dbl> <dbl> <dbl>
## 1 ETS 631598. 482729. 7.28
## 2 ARIMA 591059. 430960. 6.54
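Note these are training-set accuracies; a fairer comparison would hold out the final year and score true out-of-sample forecasts (a sketch of that check, not part of the original model selection):
# Sketch: refit both models on data through 2012 and score the 2013 forecasts
residential_train <- residential_ts %>% filter(Date < yearmonth("2013 Jan"))
residential_train %>%
  model(ETS = ETS(KWH), ARIMA = ARIMA(KWH)) %>%
  forecast(h = "12 months") %>%
  accuracy(residential_ts)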
ARIMA fits better here, so we inspect its specification and use it for the 2014 forecast.
report(residential_arima)
## Series: KWH
## Model: ARIMA(0,0,4)(2,1,0)[12] w/ drift
##
## Coefficients:
## ma1 ma2 ma3 ma4 sar1 sar2 constant
## 0.3459 0.0679 0.2105 -0.0772 -0.7379 -0.4343 233942.35
## s.e. 0.0772 0.0838 0.0737 0.0821 0.0762 0.0783 74009.17
##
## sigma^2 estimated as 3.877e+11: log likelihood=-2657.92
## AIC=5331.84 AICc=5332.68 BIC=5357.38
residential_forecast_arima <- residential_arima %>% forecast(h = "12 months")
autoplot(residential_forecast_arima, residential_ts)
residential_arima %>% gg_tsresiduals()
residential_forecast_df <- residential_forecast_arima %>%
  as_tibble() %>%
  select(Date, KWH_Forecast = .mean) # keep Date and the point forecast so the CSV is clean
write_csv(residential_forecast_df, "DATA624_ResidentialForecast_2014.csv")
Part C consists of two data sets. These are simple two-column sets; however, they have different time stamps. Your optional assignment is to time-base sequence the data and aggregate based on hour; for multiple recordings within an hour, take the mean. Then determine whether the data is stationary and whether it can be forecast. If so, provide a week-forward forecast and present the results via RPubs and the .rmd, with the forecast in an Excel-readable file.
water1 <- read_excel("Waterflow_Pipe1.xlsx")
water2 <- read_excel("Waterflow_Pipe2.xlsx")
head(water1)
## # A tibble: 6 × 2
## `Date Time` WaterFlow
## <dbl> <dbl>
## 1 42300. 23.4
## 2 42300. 28.0
## 3 42300. 23.1
## 4 42300. 30.0
## 5 42300. 6.00
## 6 42300. 15.9
head(water2)
## # A tibble: 6 × 2
## `Date Time` WaterFlow
## <dbl> <dbl>
## 1 42300. 18.8
## 2 42300. 43.1
## 3 42300. 38.0
## 4 42300. 36.1
## 5 42300. 31.9
## 6 42300. 28.2
summary(water1)
## Date Time WaterFlow
## Min. :42300 Min. : 1.067
## 1st Qu.:42302 1st Qu.:13.683
## Median :42305 Median :19.880
## Mean :42305 Mean :19.897
## 3rd Qu.:42307 3rd Qu.:26.159
## Max. :42310 Max. :38.913
summary(water2)
## Date Time WaterFlow
## Min. :42300 Min. : 1.885
## 1st Qu.:42310 1st Qu.:28.140
## Median :42321 Median :39.682
## Mean :42321 Mean :39.556
## 3rd Qu.:42331 3rd Qu.:50.782
## Max. :42342 Max. :78.303
water1 <- water1 %>%
mutate(Timestamp = as.POSIXct(`Date Time` * 86400, origin = "1899-12-30", tz = "UTC"))
water2 <- water2 %>%
mutate(Timestamp = as.POSIXct(`Date Time` * 86400, origin = "1899-12-30", tz = "UTC"))
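As a quick sanity check on the serial-date conversion (42300 is simply the first serial value visible in the previews above):
# Sanity check: Excel serial day 42300 should map to 2015-10-23 UTC
as.POSIXct(42300 * 86400, origin = "1899-12-30", tz = "UTC")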
water1_hourly <- water1 %>%
mutate(Hour = floor_date(Timestamp, "hour")) %>%
group_by(Hour) %>%
summarise(WaterFlow = mean(WaterFlow, na.rm = TRUE)) %>%
ungroup()
water2_hourly <- water2 %>%
mutate(Hour = floor_date(Timestamp, "hour")) %>%
group_by(Hour) %>%
summarise(WaterFlow = mean(WaterFlow, na.rm = TRUE)) %>%
ungroup()
We then convert each aggregated series into a ts object with a daily cycle (frequency = 24):
water1_ts <- ts(water1_hourly$WaterFlow, frequency = 24)
water2_ts <- ts(water2_hourly$WaterFlow, frequency = 24)
ggtsdisplay(water1_ts, main = "Water1")
ggtsdisplay(water2_ts, main = "Water2")
For the first pipe we see almost no significant autocorrelation; the series is roughly stationary around its seasonal pattern, so forecasting is feasible. Pipe 2 shows some significant autocorrelation compared to the first pipe, but the data still looks largely stationary and could benefit from some differencing before forecasting.
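The visual read can be backed up with a formal unit-root test (a sketch, assuming the tseries package is installed; a small ADF p-value rejects a unit root, supporting stationarity):
# Sketch: Augmented Dickey-Fuller tests for both hourly series
library(tseries)
adf.test(water1_ts)
adf.test(water2_ts)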
fit_ets_water1 <- ets(water1_ts)
forecast_ets_water1 <- forecast(fit_ets_water1, h = 168)
autoplot(forecast_ets_water1)
fit_arima_water1 <- auto.arima(water1_ts, seasonal = TRUE)
forecast_arima_water1 <- forecast(fit_arima_water1, h = 168)
autoplot(forecast_arima_water1)
There is not enough data to accurately forecast seasonality or trend, so the model quickly collapses to a point forecast at the mean. I don't think this is particularly useful without additional data points.
fit_arima_water2 <- auto.arima(water2_ts, seasonal = TRUE)
forecast_arima_water2 <- forecast(fit_arima_water2, h = 168)
autoplot(forecast_arima_water2)
Here we get a similar situation to Pipe 1, but with a bit more variability in the forecasts, still returning to the mean fairly quickly.
# Combine the two 168-hour-ahead forecasts; the pipes end at different
# timestamps, so rows align by forecast horizon rather than by clock time
combined_forecast <- data.frame(
  Horizon_Hour = 1:168,
  Water1_Forecast = as.numeric(forecast_arima_water1$mean),
  Water2_Forecast = as.numeric(forecast_arima_water2$mean)
)
# Save to CSV
write_csv(combined_forecast, "DATA624_Water1Water2_1week.csv")
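If a native .xlsx file is preferred over CSV, the writexl package offers a drop-in alternative (an assumption on my part; writexl is not used elsewhere in this report):
# Optional: write a native Excel file instead of a CSV (assumes writexl is installed)
writexl::write_xlsx(combined_forecast, "DATA624_Water1Water2_1week.xlsx")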