In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars; other than that it is straightforward. I am being somewhat ambiguous on purpose to make this have a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an Excel file. Please provide your written report on your findings, visuals, discussion, and your R code via an RPubs link along with the actual .rmd file. Also, please submit the forecast in an Excel-readable file.
We first load the packages used throughout this report, pull the data in, and convert the Excel serial date into a proper Date.
# Packages used throughout (presumably loaded once in a setup chunk)
library(tidyverse) # dplyr, ggplot2, readr
library(readxl)    # read_excel
library(tsibble)   # as_tsibble, yearmonth
library(forecast)  # Part C: ets, auto.arima, ggtsdisplay
library(fable)     # ETS and ARIMA models
library(feasts)    # gg_tsresiduals, gg_season
library(lubridate) # floor_date, wday, month

atm <- read_excel("ATM624Data.xlsx")
atm <- atm %>%
  mutate(DATE = as.Date(DATE, origin = "1899-12-30"))
head(atm)
## # A tibble: 6 × 3
## DATE ATM Cash
## <date> <chr> <dbl>
## 1 2009-05-01 ATM1 96
## 2 2009-05-01 ATM2 107
## 3 2009-05-02 ATM1 82
## 4 2009-05-02 ATM2 89
## 5 2009-05-03 ATM1 85
## 6 2009-05-03 ATM2 90
summary(atm)
## DATE ATM Cash
## Min. :2009-05-01 Length:1474 Min. : 0.0
## 1st Qu.:2009-08-01 Class :character 1st Qu.: 0.5
## Median :2009-11-01 Mode :character Median : 73.0
## Mean :2009-10-31 Mean : 155.6
## 3rd Qu.:2010-02-01 3rd Qu.: 114.0
## Max. :2010-05-14 Max. :10919.8
## NA's :19
As we can see, there are NA values within the data set. First we remove the rows where DATE or ATM is missing, as data points missing either field are useless for forecasting.
atm <- atm %>% filter(!is.na(DATE), !is.na(ATM))
summary(atm)
## DATE ATM Cash
## Min. :2009-05-01 Length:1460 Min. : 0.0
## 1st Qu.:2009-07-31 Class :character 1st Qu.: 0.5
## Median :2009-10-30 Mode :character Median : 73.0
## Mean :2009-10-30 Mean : 155.6
## 3rd Qu.:2010-01-29 3rd Qu.: 114.0
## Max. :2010-04-30 Max. :10919.8
## NA's :5
With just this cleanup we went from 19 NA values for Cash down to 5. We can see where these five NAs occur:
atm_na <- atm %>%
filter(is.na(Cash)) %>%
arrange(ATM, DATE)
head(atm_na)
## # A tibble: 5 × 3
## DATE ATM Cash
## <date> <chr> <dbl>
## 1 2009-06-13 ATM1 NA
## 2 2009-06-16 ATM1 NA
## 3 2009-06-22 ATM1 NA
## 4 2009-06-18 ATM2 NA
## 5 2009-06-24 ATM2 NA
Since the missing values are isolated, we can use linear interpolation to fill them in from neighboring observations and preserve continuity:
atm <- atm %>%
group_by(ATM) %>%
arrange(DATE) %>%
mutate(Cash = zoo::na.approx(Cash, na.rm = FALSE)) %>%
ungroup()
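A quick check confirms no Cash NAs remain after interpolation:
# Should return 0 if interpolation filled every gap
sum(is.na(atm$Cash))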
Now we convert the data into a tsibble, indexed by DATE and keyed by ATM, and plot each series:
atm_ts <- atm %>%
as_tsibble(index = DATE, key = ATM)
ggplot(atm_ts, aes(x = DATE, y = Cash)) +
geom_line() +
geom_point() +
facet_wrap(~ATM, scales = "free_y", ncol = 1)
For ATM1 we see clear seasonality but no particular trend. We can fit both ETS and ARIMA models to see how the forecasts perform.
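The seasonal pattern can also be confirmed directly with a seasonal plot (a sketch, assuming the seasonality is weekly, as is typical for daily ATM data; gg_season() comes from feasts):
# Sketch: overlay ATM1 weeks to confirm a weekly withdrawal pattern
atm_ts %>%
  filter(ATM == "ATM1") %>%
  gg_season(Cash, period = "week")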
atm1_ts <- atm_ts %>% filter(ATM == "ATM1")
atm1_fit_ets <- atm1_ts %>% model(ETS(Cash))
atm1_fit_arima <- atm1_ts %>% model(ARIMA(Cash))
atm1_forecast_ets <- atm1_fit_ets %>%
forecast(h = "31 days")
atm1_forecast_arima <- atm1_fit_arima %>%
forecast(h = "31 days")
accuracy(atm1_fit_ets)
## # A tibble: 1 × 11
## ATM .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ATM1 ETS(Cash) Training -0.0482 23.8 15.1 -107. 122. 0.852 0.857 0.144
accuracy(atm1_fit_arima)
## # A tibble: 1 × 11
## ATM .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ATM1 ARIMA(Cash) Traini… -0.0742 23.3 14.5 -102. 117. 0.822 0.838 -0.00872
We see the ARIMA model doing slightly better than ETS, so we use it to forecast ATM1:
autoplot(atm1_forecast_arima, atm1_ts)
atm1_fit_arima %>%
gg_tsresiduals()
The residuals look roughly normally distributed, and the ACF plot shows no significant correlation at any lag, meaning the residuals are approximately white noise and this is a reasonably good model for the forecast.
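The white-noise claim can be formalized with a Ljung–Box test on the innovation residuals (a sketch; lag = 14, two weeks of daily data, is my choice rather than part of the original analysis):
# Sketch: a large Ljung-Box p-value is consistent with white-noise residuals
atm1_fit_arima %>%
  augment() %>%
  features(.innov, ljung_box, lag = 14)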
For ATM2 we again come across seasonal data, this time with a slightly decreasing trend:
atm2_ts <- atm_ts %>% filter(ATM == "ATM2")
atm2_ets <- atm2_ts %>%
model(ETS(Cash))
atm2_arima <- atm2_ts %>%
model(ARIMA(Cash))
accuracy(atm2_ets)
## # A tibble: 1 × 11
## ATM .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ATM2 ETS(Cash) Training -0.689 25.1 17.9 -Inf Inf 0.859 0.833 0.0199
accuracy(atm2_arima)
## # A tibble: 1 × 11
## ATM .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ATM2 ARIMA(Cash) Training -0.891 24.1 17.0 -Inf Inf 0.820 0.799 -0.00428
Here again we see the ARIMA model doing better.
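Before forecasting we can also peek at which specification ARIMA() selected (a quick, optional check):
# Optional: show the selected ARIMA order and coefficients for ATM2
report(atm2_arima)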
atm2_forecast <- atm2_arima %>%
forecast(h = 31)
autoplot(atm2_forecast, atm2_ts)
atm2_arima %>% gg_tsresiduals()
For ATM3 we have only a handful of reported (non-zero) values, so fitting a time series model is not meaningful. A simple point forecast, either naive or mean, would suffice, assuming this is a new ATM that began operating toward the end of the data period.
atm3_values <- atm_ts %>%
filter(ATM == "ATM3", Cash > 0) %>%
pull(Cash)
atm3_forecast <- mean(atm3_values)
atm3_forecast
## [1] 87.66667
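Listing the non-zero observations supports the new-ATM interpretation, since they should cluster at the very end of the sample:
# Quick check: when do the non-zero ATM3 values occur?
atm_ts %>%
  filter(ATM == "ATM3", Cash > 0)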
For ATM4 we need to deal with an extreme outlier first:
atm4_ts <- atm_ts %>% filter(ATM == "ATM4")
summary(atm4_ts$Cash)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.563 124.334 403.839 474.043 704.507 10919.762
mean_cash <- mean(atm4_ts$Cash, na.rm = TRUE)
sd_cash <- sd(atm4_ts$Cash, na.rm = TRUE)
threshold <- mean_cash + 2 * sd_cash
atm4_outliers <- atm4_ts %>% filter(Cash > threshold)
atm4_outliers
## # A tsibble: 1 x 3 [1D]
## # Key: ATM [1]
## DATE ATM Cash
## <date> <chr> <dbl>
## 1 2010-02-09 ATM4 10920.
nrow(atm4_outliers)
## [1] 1
It is a single outlier and it looks unnatural enough to most likely be a data entry error. We can replace it with the series median to smooth it out:
median_cash <- median(atm4_ts$Cash, na.rm = TRUE)
atm4_ts <- atm4_ts %>%
mutate(Cash = ifelse(Cash > 10000, median_cash, Cash))
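Given the strong weekly pattern in ATM withdrawals, a same-weekday median would be a slightly more tailored replacement. A sketch of that alternative (not what is used below), applied to the raw ATM4 series:
# Alternative (not used): replace the outlier with the same-weekday median
atm4_alt <- atm_ts %>%
  filter(ATM == "ATM4") %>%
  mutate(Weekday = wday(DATE)) %>%
  group_by(Weekday) %>%
  mutate(Cash = ifelse(Cash > 10000, median(Cash[Cash <= 10000]), Cash)) %>%
  ungroup() %>%
  select(-Weekday)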
Now we can start to work with the time series
atm4_ets <- atm4_ts %>%
model(ETS(Cash))
atm4_arima <- atm4_ts %>%
model(ARIMA(Cash))
accuracy(atm4_ets)
## # A tibble: 1 × 11
## ATM .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ATM4 ETS(Cash) Training -8.29 329. 265. -523. 552. 0.765 0.726 0.101
accuracy(atm4_arima)
We generate and plot both forecasts and check their residuals; the final export uses the ARIMA forecast, consistent with the other ATMs.
atm4_ets_forecast <- atm4_ets %>%
forecast(h = "31 days")
autoplot(atm4_ets_forecast, atm4_ts)
atm4_ets %>% gg_tsresiduals()
atm4_forecast_arima <- atm4_arima %>%
forecast(h = 31)
autoplot(atm4_forecast_arima, atm4_ts)
atm4_arima %>% gg_tsresiduals()
## Converting the forecasts to an Excel-readable file
# Extract DATE and forecast for each and save as data frame
atm1_df <- atm1_forecast_arima %>%
as_tibble() %>%
select(DATE, ATM1_Forecast = .mean)
atm2_df <- atm2_forecast %>%
as_tibble() %>%
select(DATE, ATM2_Forecast = .mean)
atm4_df <- atm4_forecast_arima %>%
as_tibble() %>%
select(DATE, ATM4_Forecast = .mean)
# Full Join by the date
forecast_wide <- atm1_df %>%
full_join(atm2_df, by = "DATE") %>%
full_join(atm4_df, by = "DATE") %>%
arrange(DATE)
# Add ATM3's constant point forecast
forecast_wide <- forecast_wide %>%
mutate(ATM3_Forecast = atm3_forecast)
forecast_wide <- forecast_wide %>%
  mutate(across(starts_with("ATM"), \(x) round(x, 2)))
write_csv(forecast_wide, "DATA624_ATMForecasts_May2010.csv")
# Preview first few rows
head(forecast_wide)
## # A tibble: 6 × 5
## DATE ATM1_Forecast ATM2_Forecast ATM4_Forecast ATM3_Forecast
## <date> <dbl> <dbl> <dbl> <dbl>
## 1 2010-05-01 87.2 66.4 381. 87.7
## 2 2010-05-02 104. 72.7 382. 87.7
## 3 2010-05-03 73.2 13.7 498. 87.7
## 4 2010-05-04 8.03 5.3 347. 87.7
## 5 2010-05-05 100. 97.6 354. 87.7
## 6 2010-05-06 80.9 88.3 407. 87.7
Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and provide a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in kilowatt hours; the rest is straightforward. Add this to your existing files above.
residential <- read_excel("ResidentialCustomerForecastLoad-624.xlsx")
head(residential)
## # A tibble: 6 × 3
## CaseSequence `YYYY-MMM` KWH
## <dbl> <chr> <dbl>
## 1 733 1998-Jan 6862583
## 2 734 1998-Feb 5838198
## 3 735 1998-Mar 5420658
## 4 736 1998-Apr 5010364
## 5 737 1998-May 4665377
## 6 738 1998-Jun 6467147
summary(residential)
## CaseSequence YYYY-MMM KWH
## Min. :733.0 Length:192 Min. : 770523
## 1st Qu.:780.8 Class :character 1st Qu.: 5429912
## Median :828.5 Mode :character Median : 6283324
## Mean :828.5 Mean : 6502475
## 3rd Qu.:876.2 3rd Qu.: 7620524
## Max. :924.0 Max. :10655730
## NA's :1
For this data set we first create a proper Date column, impute the single missing KWH value with the series mean, and then plot the series:
residential <- residential %>%
mutate(
Date = yearmonth(`YYYY-MMM`)
)
residential <- residential %>%
mutate(KWH = ifelse(is.na(KWH), mean(KWH, na.rm = TRUE), KWH))
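Residential power usage is typically strongly seasonal, so an alternative would be to impute using the same-calendar-month mean rather than the overall mean (a sketch of that option, not what is used below):
# Alternative (not used): impute the missing KWH with the same-month mean
residential_alt <- residential %>%
  group_by(Month = month(as.Date(Date))) %>%
  mutate(KWH = ifelse(is.na(KWH), mean(KWH, na.rm = TRUE), KWH)) %>%
  ungroup() %>%
  select(-Month)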
residential_ts <- residential %>%
as_tsibble(index = Date)
autoplot(residential_ts, KWH)
We then deal with outliers, flagged using the 1.5 × IQR rule:
outlier_row <- residential %>%
filter(KWH > quantile(KWH, 0.75, na.rm = TRUE) +
1.5 * IQR(KWH, na.rm = TRUE) |
KWH < quantile(KWH, 0.25, na.rm = TRUE) -
1.5 * IQR(KWH, na.rm = TRUE))
print(outlier_row)
## # A tibble: 1 × 4
## CaseSequence `YYYY-MMM` KWH Date
## <dbl> <chr> <dbl> <mth>
## 1 883 2010-Jul 770523 2010 Jul
residential <- residential %>%
mutate(KWH = ifelse(Date %in% outlier_row$Date,
mean(KWH[Date %in% (outlier_row$Date + c(-1, 1))], na.rm = TRUE),
KWH))
residential_ts <- residential %>%
as_tsibble(index = Date)
autoplot(residential_ts, KWH)
Now we can start testing our models
residential_ets <- residential_ts %>% model(ETS = ETS(KWH))
residential_arima <- residential_ts %>% model(ARIMA = ARIMA(KWH))
residential_accuracy_ets <- accuracy(residential_ets) %>%
select(.model, RMSE, MAE, MAPE)
residential_accuracy_arima <- accuracy(residential_arima) %>%
select(.model, RMSE, MAE, MAPE)
accuracy_comparison <- bind_rows(residential_accuracy_ets, residential_accuracy_arima)
accuracy_comparison
## # A tibble: 2 × 4
## .model RMSE MAE MAPE
## <chr> <dbl> <dbl> <dbl>
## 1 ETS 631598. 482729. 7.28
## 2 ARIMA 591059. 430960. 6.54
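Note these are training-set accuracies; a fairer comparison would hold out the final year and score true out-of-sample forecasts (a sketch of that check, not part of the original model selection):
# Sketch: refit both models on data through 2012 and score the 2013 forecasts
residential_train <- residential_ts %>% filter(Date < yearmonth("2013 Jan"))
residential_train %>%
  model(ETS = ETS(KWH), ARIMA = ARIMA(KWH)) %>%
  forecast(h = "12 months") %>%
  accuracy(residential_ts)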
ARIMA fits better here, so we inspect its specification and use it for the 2014 forecast.
report(residential_arima)
## Series: KWH
## Model: ARIMA(0,0,4)(2,1,0)[12] w/ drift
##
## Coefficients:
## ma1 ma2 ma3 ma4 sar1 sar2 constant
## 0.3459 0.0679 0.2105 -0.0772 -0.7379 -0.4343 233942.35
## s.e. 0.0772 0.0838 0.0737 0.0821 0.0762 0.0783 74009.17
##
## sigma^2 estimated as 3.877e+11: log likelihood=-2657.92
## AIC=5331.84 AICc=5332.68 BIC=5357.38
residential_forecast_arima <- residential_arima %>% forecast(h = "12 months")
autoplot(residential_forecast_arima, residential_ts)
residential_arima %>% gg_tsresiduals()
residential_forecast_df <- residential_forecast_arima %>%
  as_tibble() %>%
  select(Date, KWH_Forecast = .mean) # keep Date and the point forecast so the CSV is clean
write_csv(residential_forecast_df, "DATA624_ResidentialForecast_2014.csv")
Part C consists of two data sets. These are simple two-column sets; however, they have different time stamps. Your optional assignment is to time-base sequence the data and aggregate based on hour; for multiple recordings within an hour, take the mean. Then determine whether the data is stationary and whether it can be forecast. If so, provide a week-forward forecast and present the results via RPubs and the .rmd, with the forecast in an Excel-readable file.
water1 <- read_excel("Waterflow_Pipe1.xlsx")
water2 <- read_excel("Waterflow_Pipe2.xlsx")
head(water1)
## # A tibble: 6 × 2
## `Date Time` WaterFlow
## <dbl> <dbl>
## 1 42300. 23.4
## 2 42300. 28.0
## 3 42300. 23.1
## 4 42300. 30.0
## 5 42300. 6.00
## 6 42300. 15.9
head(water2)
## # A tibble: 6 × 2
## `Date Time` WaterFlow
## <dbl> <dbl>
## 1 42300. 18.8
## 2 42300. 43.1
## 3 42300. 38.0
## 4 42300. 36.1
## 5 42300. 31.9
## 6 42300. 28.2
summary(water1)
## Date Time WaterFlow
## Min. :42300 Min. : 1.067
## 1st Qu.:42302 1st Qu.:13.683
## Median :42305 Median :19.880
## Mean :42305 Mean :19.897
## 3rd Qu.:42307 3rd Qu.:26.159
## Max. :42310 Max. :38.913
summary(water2)
## Date Time WaterFlow
## Min. :42300 Min. : 1.885
## 1st Qu.:42310 1st Qu.:28.140
## Median :42321 Median :39.682
## Mean :42321 Mean :39.556
## 3rd Qu.:42331 3rd Qu.:50.782
## Max. :42342 Max. :78.303
water1 <- water1 %>%
mutate(Timestamp = as.POSIXct(`Date Time` * 86400, origin = "1899-12-30", tz = "UTC"))
water2 <- water2 %>%
mutate(Timestamp = as.POSIXct(`Date Time` * 86400, origin = "1899-12-30", tz = "UTC"))
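As a quick sanity check on the serial-date conversion (42300 is simply the first serial value visible in the previews above):
# Sanity check: Excel serial day 42300 should map to 2015-10-23 UTC
as.POSIXct(42300 * 86400, origin = "1899-12-30", tz = "UTC")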
water1_hourly <- water1 %>%
mutate(Hour = floor_date(Timestamp, "hour")) %>%
group_by(Hour) %>%
summarise(WaterFlow = mean(WaterFlow, na.rm = TRUE)) %>%
ungroup()
water2_hourly <- water2 %>%
mutate(Hour = floor_date(Timestamp, "hour")) %>%
group_by(Hour) %>%
summarise(WaterFlow = mean(WaterFlow, na.rm = TRUE)) %>%
ungroup()
We then convert each aggregated series into a ts object with a daily cycle (frequency = 24):
water1_ts <- ts(water1_hourly$WaterFlow, frequency = 24)
water2_ts <- ts(water2_hourly$WaterFlow, frequency = 24)
ggtsdisplay(water1_ts, main = "Water1")
ggtsdisplay(water2_ts, main = "Water2")
For the first pipe we see almost no significant autocorrelation; the series is roughly stationary around its seasonal pattern, so forecasting is feasible. Pipe 2 shows some significant autocorrelation compared to the first pipe, but the data still looks largely stationary and could benefit from some differencing before forecasting.
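The visual read can be backed up with a formal unit-root test (a sketch, assuming the tseries package is installed; a small ADF p-value rejects a unit root, supporting stationarity):
# Sketch: Augmented Dickey-Fuller tests for both hourly series
library(tseries)
adf.test(water1_ts)
adf.test(water2_ts)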
fit_ets_water1 <- ets(water1_ts)
forecast_ets_water1 <- forecast(fit_ets_water1, h = 168)
autoplot(forecast_ets_water1)
fit_arima_water1 <- auto.arima(water1_ts, seasonal = TRUE)
forecast_arima_water1 <- forecast(fit_arima_water1, h = 168)
autoplot(forecast_arima_water1)
There is not enough data to accurately forecast seasonality or trend, so the model quickly collapses to a point forecast at the mean. I don't think this is particularly useful without additional data points.
fit_arima_water2 <- auto.arima(water2_ts, seasonal = TRUE)
forecast_arima_water2 <- forecast(fit_arima_water2, h = 168)
autoplot(forecast_arima_water2)
Here we get a similar situation to Pipe 1, but with a bit more variability in the forecasts, still returning to the mean fairly quickly.
# Combine the two 168-hour-ahead forecasts; the pipes end at different
# timestamps, so rows align by forecast horizon rather than by clock time
combined_forecast <- data.frame(
  Horizon_Hour = 1:168,
  Water1_Forecast = as.numeric(forecast_arima_water1$mean),
  Water2_Forecast = as.numeric(forecast_arima_water2$mean)
)
# Save to CSV
write_csv(combined_forecast, "DATA624_Water1Water2_1week.csv")
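If a native .xlsx file is preferred over CSV, the writexl package offers a drop-in alternative (an assumption on my part; writexl is not used elsewhere in this report):
# Optional: write a native Excel file instead of a CSV (assumes writexl is installed)
writexl::write_xlsx(combined_forecast, "DATA624_Water1Water2_1week.xlsx")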