In part A, I will be forecasting the cash withdrawals from 4 different ATM machines for May 2010.
A few things that stand out is that the ATM and Date columns are in the wrong format. I also observed that the Cash column contains 19 nulls.
summary(ATM)
## DATE ATM Cash
## Min. :39934 Length:1474 Min. : 0.0
## 1st Qu.:40026 Class :character 1st Qu.: 0.5
## Median :40118 Mode :character Median : 73.0
## Mean :40118 Mean : 155.6
## 3rd Qu.:40210 3rd Qu.: 114.0
## Max. :40312 Max. :10919.8
## NA's :19
glimpse(ATM)
## Rows: 1,474
## Columns: 3
## $ DATE <dbl> 39934, 39934, 39935, 39935, 39936, 39936, 39937, 39937, 39938, 39…
## $ ATM <chr> "ATM1", "ATM2", "ATM1", "ATM2", "ATM1", "ATM2", "ATM1", "ATM2", "…
## $ Cash <dbl> 96, 107, 82, 89, 85, 90, 90, 55, 99, 79, 88, 19, 8, 2, 104, 103, …
I start by transforming the Date and ATM columns to their appropriate data type. The Date column is turned to a proper Date format. The ATM column is turned from a character type to a factor type.
To address the nulls, I first wanted to see which ATM(s) had missing values.
ATM <- ATM %>%
mutate(DATE = as.Date(DATE, origin = "1899-12-30")) %>%
mutate(ATM = as.factor(ATM)) %>%
filter(!is.na(ATM))
Upon further inspection, we can see that there are a few dates where there were Null cash withdrawls for ATM 1 and 2. Because of how few missing values there are, I’m going to assume that no withdrawls took place. I will replace the Nulls with zeros.
ATM %>%
filter(is.na(Cash)) %>%
select(ATM, DATE, Cash)
## # A tibble: 5 × 3
## ATM DATE Cash
## <fct> <date> <dbl>
## 1 ATM1 2009-06-13 NA
## 2 ATM1 2009-06-16 NA
## 3 ATM2 2009-06-18 NA
## 4 ATM1 2009-06-22 NA
## 5 ATM2 2009-06-24 NA
ATM <- ATM %>%
mutate(Cash = replace_na(Cash, 0))
Next I transform the ATM dataframe into a tsibble dataset so that I can perform time series analysis.
ATM_ts <- ATM %>%
as_tsibble(index = DATE, key = ATM)
Now that the data is prepared, we can look at the withdrawal patterns for each ATM Machine.
ATM 1 doesn’t display any clear trend however, we can see there are seasonal patterns. There are some peaks and falls that stick out.
ATM_ts %>%
filter(ATM == "ATM1") %>%
autoplot()
## Plot variable not specified, automatically selected `.vars = Cash`
In ATM 2, we can also see an obscence of a clear trend and seasonal patterns. The peaks and falls are very consistent and fall within the same range throughout the plot.
ATM_ts %>%
filter(ATM == "ATM2") %>%
autoplot()
## Plot variable not specified, automatically selected `.vars = Cash`
In ATM 3, it’s difficult to identify any patterns because there are very little withdrawls.
ATM_ts %>%
filter(ATM == "ATM3") %>%
autoplot()
## Plot variable not specified, automatically selected `.vars = Cash`
ATM 4 has a very stable plot with the exception of an outlier high around 2010. There is no clear trend, and we can observe seasonality.
ATM_ts %>%
filter(ATM == "ATM4") %>%
autoplot()
## Plot variable not specified, automatically selected `.vars = Cash`
To create our models, I’m going to separate the data for all 4 ATMs into their own tsibble object. Then I will examine the ACF and PACF plots to determine which forecast model to utilize.
ATM1 <- ATM_ts %>%
filter(ATM == 'ATM1')
ATM2 <- ATM_ts %>%
filter(ATM == 'ATM2')
ATM3 <- ATM_ts %>%
filter(ATM == 'ATM3')
ATM4 <- ATM_ts %>%
filter(ATM == 'ATM4')
When looking at the ACF and PACF plots for ATM1 we can observe seasonality. Given this observation, a SARIMA model can accurately capture these type of patterns.
Before creating our models, we will first perform differencing and attempt to make the data stationary - an important step before creating our SARIMA model.
For the SARIMA models, I will create two different models. One by using the ARIMA function which will automatically determine the best parameters and a second model where I will manually select the parameters. Then I will compare metrics and pick the best forecasting model for each ATM dataset.
ATM_ts %>%
filter(ATM == 'ATM1') %>%
gg_tsdisplay(Cash, plot_type = 'partial', lag = 180)
# perform differencing
ATM1 <- ATM1 %>%
mutate(Cash_diff = difference(Cash, lag = 7))
We can see that the data is not fully stationary. The ARIMA function will capture the patterns that remain.
# double check to see if data is stationary
ATM1 %>%
gg_tsdisplay(Cash_diff, plot_type = 'partial', lag = 180)
## Warning: Removed 7 rows containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 7 rows containing missing values or values outside the scale range
## (`geom_point()`).
The custom Arima1 with the parameters I set is the better model. It has lower ME, RMSE, and MAE. I will select this as our forecasting model.
model1 <- ATM1 %>%
model(
arima = ARIMA(Cash),
arima1 = ARIMA(Cash ~ pdq(1,0,1) + PDQ(1,1,1, period = 7))
)
accuracy(model1)
## # A tibble: 2 × 11
## ATM .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <fct> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ATM1 arima Training -0.103 25.9 16.3 -Inf Inf 0.847 0.831 0.0650
## 2 ATM1 arima1 Training -0.0600 25.7 16.1 -Inf Inf 0.837 0.824 -0.0153
forecasts1 <- model1 %>%
forecast(h = 31)
autoplot(forecasts1, ATM1)
We will take the same approach for ATM1. We’ll try to address the seasonality through differencing and then create two SARIMA models: One created through the ARIMA function and another with parameters that I set.
# acf and pacf prior to differencing
ATM_ts %>%
filter(ATM == 'ATM2') %>%
gg_tsdisplay(Cash, plot_type = 'partial', lag = 180)
ATM2 <- ATM2 %>%
mutate(Cash_diff = difference(Cash, lag = 7))
# acf and pacf after differencing
ATM2 %>%
gg_tsdisplay(Cash_diff, plot_type = 'partial', lag = 180)
## Warning: Removed 7 rows containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 7 rows containing missing values or values outside the scale range
## (`geom_point()`).
The model named arima has a lower RMSE, MAE and RMSE. I will choose this as our forecasting model for ATM2.
model2 <- ATM2 %>%
model(
arima = ARIMA(Cash),
arima1 = ARIMA(Cash ~ pdq(1,0,1) + PDQ(1,1,1, period = 7))
)
accuracy(model2)
## # A tibble: 2 × 11
## ATM .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <fct> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ATM2 arima Training -0.854 23.9 16.8 -Inf Inf 0.822 0.802 0.000718
## 2 ATM2 arima1 Training -0.349 24.6 17.1 -Inf Inf 0.835 0.828 -0.0247
forecasts2 <- model2 %>%
forecast(h = 31)
autoplot(forecasts2, ATM2)
Given the lack of data for ATM3, and lack of clear seasonality or trend, I will utilize a Naive model. The Naive model assumes that the most recent observation is the best predictor of the following value.
ATM_ts %>%
filter(ATM == 'ATM3') %>%
gg_tsdisplay(Cash, plot_type = 'partial', lag = 180)
model3 <- ATM3 %>%
model(Naive = NAIVE(Cash))
accuracy(model3)
## # A tibble: 1 × 11
## ATM .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <fct> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ATM3 Naive Training 0.234 5.09 0.310 28.8 40.2 0.423 0.632 -0.149
forecasts3 <- model3 %>%
forecast(h = 31)
autoplot(forecasts3, ATM3)
The Arima1 model outperforms the Arima model in the RMSE, MAPE and MASE metrics. Given that it has less bias and is less error prone, I will select this as our forecasting model.
ATM_ts %>%
filter(ATM == 'ATM4') %>%
gg_tsdisplay(Cash, plot_type = 'partial', lag = 180)
model4 <- ATM4 %>%
model(
arima = ARIMA(Cash),
arima1 = ARIMA(Cash ~ pdq(1,0,1) + PDQ(1,1,1, period = 7)+ 0)
)
## Warning in sqrt(diag(best$var.coef)): NaNs produced
accuracy(model4)
## # A tibble: 2 × 11
## ATM .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <fct> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ATM4 arima Training -1.51e-10 650. 324. -617. 647. 0.805 0.725 -0.00936
## 2 ATM4 arima1 Training 1.59e+ 1 634. 288. -542. 576. 0.716 0.708 -0.00774
forecasts4 <- model4 %>%
forecast(h = 31)
autoplot(forecasts4, ATM4)
Lastly, I will combine the forecasted values into one dataframe to
export.
combined_forecasts <- bind_rows(
forecasts1 %>% as_tibble() %>% select(DATE, ATM, .mean),
forecasts2 %>% as_tibble() %>% select(DATE, ATM, .mean),
forecasts3 %>% as_tibble() %>% select(DATE, ATM, .mean),
forecasts4 %>% as_tibble() %>% select(DATE, ATM, .mean)
)
Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. I will model this data and give a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in Kilowatt hours.
At first glance, I can see that the date column labeled ‘YYYY-MMM’is a charactrer type and needs to be changed to a date type. I also noticed that on one of the days there is a KWH reading of NA. We will need to look at the missing value and transformed the ’YYYY-MMM’ column.
summary(Power)
## CaseSequence YYYY-MMM KWH
## Min. :733.0 Length:192 Min. : 770523
## 1st Qu.:780.8 Class :character 1st Qu.: 5429912
## Median :828.5 Mode :character Median : 6283324
## Mean :828.5 Mean : 6502475
## 3rd Qu.:876.2 3rd Qu.: 7620524
## Max. :924.0 Max. :10655730
## NA's :1
glimpse(Power)
## Rows: 192
## Columns: 3
## $ CaseSequence <dbl> 733, 734, 735, 736, 737, 738, 739, 740, 741, 742, 743, 74…
## $ `YYYY-MMM` <chr> "1998-Jan", "1998-Feb", "1998-Mar", "1998-Apr", "1998-May…
## $ KWH <dbl> 6862583, 5838198, 5420658, 5010364, 4665377, 6467147, 891…
There is a missing recording of KWH for September 2008. Since it’s only one missing value and given the large data we have going back to 1998, we can use the averages for the month of September as a placeholder for September 2008.
Power %>%
filter(is.na(KWH))
## # A tibble: 1 × 3
## CaseSequence `YYYY-MMM` KWH
## <dbl> <chr> <dbl>
## 1 861 2008-Sep NA
In this section, we will convert the ‘YYYY-MMM’ column into a date column and turn ‘Power’ into a tsibble object. I will also get the average KWH value for September and replace the missing value from September 2008 with this average.
Power <- Power %>%
mutate(Date = yearmonth(`YYYY-MMM`)) %>%
select(-`YYYY-MMM`) %>%
as_tsibble()
## Using `Date` as index variable.
# look at what values are like historically for Sept
Sept <- Power %>%
filter(month(Date) == 9 & !is.na(KWH))
autoplot(Sept, KWH)
september_avg <- Power %>%
filter(month(Date) == 9 & !is.na(KWH)) %>%
summarize(mean_kwh = mean(KWH, na.rm = TRUE)) %>%
pull(mean_kwh)
Power$KWH[Power$Date == yearmonth("2008-Sep") & is.na(Power$KWH)] <- september_avg
## Warning in Power$KWH[Power$Date == yearmonth("2008-Sep") & is.na(Power$KWH)] <-
## september_avg: number of items to replace is not a multiple of replacement
## length
When looking at the overall timeseries data for Power, we can see that there is no visible trend. We can see a consistent seasonal pattern. There is also a noticeable dip near January 2010.
I will address the seasonality by performing differencing. I will also utilize a SARIMA model to properly capture the seasonal pattern. Similarly like in Part A, I will use a baseline Sarima model using the ARIMA function and a second model that I will set the parameters.
autoplot(Power, KWH)
Power %>%
gg_tsdisplay(KWH, plot_type = 'partial', lag = 24)
Power <- Power %>%
mutate(KWH_diff = difference(KWH, lag = 12))
Power %>%
gg_tsdisplay(KWH_diff, plot_type = 'partial', lag = 24)
## Warning: Removed 12 rows containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 12 rows containing missing values or values outside the scale range
## (`geom_point()`).
The baseline arima model performs better across metrics like RMSE, MAE and MAPE.
model5 <- Power %>%
model(
arima = ARIMA(KWH_diff),
arima1 = ARIMA(KWH_diff ~ pdq(1,0,1) + PDQ(1,0,1, period = 12) + 0)
)
accuracy(model5)
## # A tibble: 2 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 arima Training -11277. 871906. 539503. -2459. 2672. 0.439 0.418 -0.00475
## 2 arima1 Training 139963. 874915. 549053. -2657. 2983. 0.447 0.420 0.00403
forecasts5 <- model5 %>%
forecast(h = 12)
autoplot(forecasts5, Power)
## Warning: Removed 12 rows containing missing values or values outside the scale range
## (`geom_line()`).