In this part, we’ll be attempting to forecast how much cash is taken out of four different ATMs for May 2010. Our dataset is provided for us and is available on my GitHub.
Our DATE column looks to be stored as the number of days since Excel’s origin date (December 30, 1899), so we’ll convert it to a proper Date on read-in.
# Read in our ATM data
atm <- read_excel("../data/ATM624Data.xlsx")
# Store ATM data in a tsibble object
atm <- atm %>%
  mutate(DATE = as.Date(DATE, origin = "1899-12-30")) %>%
  as_tsibble(index = DATE, key = ATM) %>%
  fill_gaps()
To start off with, let’s plot our atm tsibble to get a
sense of the data. We see that we have data for 4 ATMs, with some
NA values for the amount of cash withdrawn on certain
dates. We also have some outlier points that we’ll need to deal with.
atm %>% autoplot() +
  labs(x = "Date",
       y = "Cash Withdrawn (Hundreds of Dollars)",
       title = "Cash withdrawn from ATMs 1-4: May 2009 - May 2010")
In our case, the NA values for the ATM
variable all occur in May 2010, which is the month we are trying to
predict. We should be able to drop these rows for now, with the
intention of “re-adding” them later when we forecast.
# Filter null ATM rows for later forecasting replacement
atm <- atm %>% filter(!is.na(ATM))
atm %>% autoplot() +
labs(x = "Date",
y="Cash Withdrawn (Hundred of Dollars)",
title="Cash withdrawn from ATMs 1-4: May 2009 - May 2010")
As a result of our fill_gaps call above, we’re left with
some NA values in our time series. A quick dplyr
aggregation shows that the mean cash withdrawn varies considerably
between machines. These per-ATM means can serve us later for imputation.
mean_cash_by_atm <- atm %>%
filter(!is.na(Cash)) %>%
as_tibble() %>%
group_by(ATM) %>%
summarise(Cash = mean(Cash))
mean_cash_by_atm
## # A tibble: 4 × 2
## ATM Cash
## <chr> <dbl>
## 1 ATM1 83.9
## 2 ATM2 62.6
## 3 ATM3 0.721
## 4 ATM4 474.
As such, it makes more sense to impute the amount of cash withdrawn
based on the mean per ATM, rather than taking the blanket mean of
our Cash variable. This will better handle atypical machines
like ATM3, which sees very little activity.
# Impute missing Cash values with the mean per ATM (hence the group_by)
atm <- atm %>% group_by(ATM) %>%
mutate(Cash = ifelse(is.na(Cash),
mean(Cash, na.rm=TRUE),
Cash)) %>%
ungroup()
atm %>% autoplot()
## Plot variable not specified, automatically selected `.vars = Cash`
We see a strong outlier point around February 2010, in which it appears a value of \(\$1,092,000\) was withdrawn. There is a chance that this outlier is due to an internal error within the data collection process, so we’ll have to handle that as well. For our purposes, we can replace that outlier value with a simple mean of the time series.
atm_with_outlier <- atm # Save TS with outlier for later comparison
outlier_limit <- 3000
atm_outlier <- atm %>%
filter(Cash > outlier_limit) %>%
mutate(Cash = mean(atm$Cash))
# Remove outlier row and add in "imputed" value
atm <- atm %>% filter(Cash < outlier_limit)
atm <- bind_rows(atm, atm_outlier)
# Plot
atm %>% autoplot()
## Plot variable not specified, automatically selected `.vars = Cash`
Finally, let’s plot our time series without
ATM4, which tends to drown out the other ATMs’ withdrawal
values in terms of scale.
atm %>% filter(ATM != "ATM4") %>% autoplot(Cash)
These time series do not appear to have a strong overall upward or downward trend. However, there does appear to be a seasonal component. In our case, the seasonal variation does not change with the level of the series, so an additive decomposition may be better than a multiplicative one:
atm %>% model(
classical_decomposition(Cash, type = "additive")
) %>%
components() %>%
autoplot() +
labs(title = "Classical Decomposition fo ATM data")
Now we can actually build a model to predict the amount of
cash taken out of each ATM in May 2010. One consideration when
building an ARIMA model is whether our time series is
stationary. We can check this via the Augmented
Dickey-Fuller test, with a null hypothesis that our data is
non-stationary and a significance level of \(\alpha=0.05\). We can leverage the
adf.test function from the R tseries
library.
adf.test(atm$Cash)
## Warning in adf.test(atm$Cash): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: atm$Cash
## Dickey-Fuller = -4.4473, Lag order = 11, p-value = 0.01
## alternative hypothesis: stationary
In this case, we see a p-value of 0.01, so we can reject the null hypothesis and assume our time series is stationary. This means we won’t have to apply differencing to our data, since it’s already stationary.
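As a quick cross-check on the ADF result, we can also ask feasts how many first differences each ATM’s series would need via the unitroot_ndiffs feature. This is a KPSS-based check that can occasionally disagree with the ADF test, and running it per ATM (rather than on the pooled series used above) is my own addition:
# Cross-check: estimated number of first differences needed per ATM
# (unitroot_ndiffs is KPSS-based; a value of 0 suggests no differencing is needed)
atm %>% features(Cash, unitroot_ndiffs)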
Let’s also take a look at the ACF plot of our data. We see significant autocorrelation at weekly lags (multiples of 7 days). In simpler terms, ATM users tend to withdraw similar amounts depending on the day of the week. This makes sense, as people likely have certain days of the week (e.g., Fridays) when they tend to withdraw cash.
# Plot the ACF of our ATM time series
atm %>% ACF(Cash) %>% autoplot()
We can also look at a lag plot for our data. Here we see a similar
pattern from our ACF plot, with a strong autocorrelation at a lag of 7
(weekly).
atm %>% filter(ATM == "ATM1") %>% gg_lag(geom="point")
## Plot variable not specified, automatically selected `y = Cash`
Now we can fit an ARIMA model to the ATM dataset. When using ARIMA in fable, the default behavior is that model parameters are automatically selected. We’ll leverage this behavior to determine our ARIMA parameters \(p, d, q\).
# Auto-select ARIMA parameters
atm_fit <- atm %>%
model(ARIMA(Cash))
atm_fit
## # A mable: 4 x 2
## # Key: ATM [4]
## ATM `ARIMA(Cash)`
## <chr> <model>
## 1 ATM1 <ARIMA(0,0,1)(0,1,2)[7]>
## 2 ATM2 <ARIMA(2,0,2)(0,1,1)[7]>
## 3 ATM3 <ARIMA(0,0,2)>
## 4 ATM4 <ARIMA(3,0,2)(1,0,0)[7] w/ mean>
Our mable for the ATM dataset contains one ARIMA model
per series (in this case, per ATM). Each ARIMA model has
different values for the \(p, d, q\) parameters.
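If we want more detail than the mable summary above, we can pull out per-model fit statistics and coefficient estimates via the glance() and tidy() generics. A quick inspection sketch, not required for forecasting:
# Fit statistics (AIC, AICc, BIC, etc.) for each ATM's model
atm_fit %>% glance()
# Estimated ARIMA coefficients for each ATM's model
atm_fit %>% tidy()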
# Forecast and plot
atm_fc <- atm_fit %>%
  forecast(h = 31)  # 31 days covers all of May 2010
# Plot forecast per ATM
atm_fc %>% autoplot(atm) +
labs(x="Date",
y="Cash Withdrawn",
title="ARIMA models applied to ATM Withdrawal Datasets")
We can create residual plots for each ATM to visually inspect our errors.
We hope for normally-distributed residuals and no significant
autocorrelation in our errors.
# Plot residuals per ATM
# (use a loop variable name that doesn't clobber the `atm` tsibble)
atms <- c("ATM1", "ATM2", "ATM3", "ATM4")
for (atm_name in atms) {
  plot <- atm_fit %>%
    filter(ATM == atm_name) %>%
    gg_tsresiduals() +
    labs(title = atm_name)
  print(plot)
}
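To supplement the visual inspection, a Ljung-Box test on the innovation residuals gives a formal check for leftover autocorrelation. A brief sketch; the choice of lag = 14 (two weekly seasonal periods) is a judgment call on my part:
# Ljung-Box test per ATM model: large p-values suggest the residuals
# behave like white noise
atm_fit %>%
  augment() %>%
  features(.innov, ljung_box, lag = 14)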
Finally, we can write our output dataset to a CSV, which can be read into Excel. The output can be found on my GitHub here.
# Write predictions to CSV
atm_fc %>% dplyr::select("ATM", "DATE", "Cash", ".mean") %>%
dplyr::rename( "prediction" =".mean") %>%
write_excel_csv(file="../data/atm-predictions.csv")
In this part, we’ll be looking at residential power usage between 1998 and 2013. First, we’ll read in the dataset and perform some basic wrangling to get it into a shape suitable for time series forecasting.
# Read in our Power Usage data
usage <- read_excel("../data/ResidentialCustomerForecastLoad-624.xlsx")
# Store power usage data in a tsibble object
usage <- usage %>%
rename(Date = `YYYY-MMM`) %>%
mutate(Date = yearmonth(Date)) %>%
as_tsibble(index=Date)
First, we’ll plot power usage over time to get a visual sense of the data.
usage %>% autoplot(KWH) +
labs(x="Date",
y="Power Usage (kilowatt-hours)",
title="Power Usage Over Time")
Visually, we see an outlier point around July 2010 that we’ll need to deal with. We also see some missing values in addition to our seasonal pattern.
We need to deal with two data quality issues before we can build a predictive model and forecast:
First, we’ll impute missing values with the mean value of our time series.
# Naive mean imputation
usage$KWH <- ifelse(is.na(usage$KWH), mean(usage$KWH, na.rm=TRUE), usage$KWH)
# Should be empty
usage %>% filter(is.na(KWH))
## # A tsibble: 0 x 3 [?]
## # ℹ 3 variables: CaseSequence <dbl>, Date <mth>, KWH <dbl>
Next, we’ll replace our outlier point with the series mean as well. We’ll then apply a Box-Cox transformation, with lambda selected via the Guerrero method, to stabilize the variance of the series.
# First, impute the outlier point to the mean
limit <- 3000000
usage$KWH <- ifelse(usage$KWH < limit, mean(usage$KWH, na.rm=TRUE), usage$KWH)
# Select the optimal Box-Cox lambda via the Guerrero method
lambda <- usage %>%
  features(KWH, features = guerrero) %>%
  pull(lambda_guerrero)
# print(lambda)  # optionally print out the selected lambda value
usage$KWH_transformed <- box_cox(usage$KWH, lambda)
usage %>% autoplot(KWH_transformed) +
labs(x="Month", y="Kilowatt-Hours (Box-Cox)", title="Power Usage: Box-Cox Transformed")
Let’s check again for stationarity in our dataset via the Augmented Dickey-Fuller test, as we did above.
adf.test(usage$KWH_transformed)
## Warning in adf.test(usage$KWH_transformed): p-value smaller than printed
## p-value
##
## Augmented Dickey-Fuller Test
##
## data: usage$KWH_transformed
## Dickey-Fuller = -4.5238, Lag order = 5, p-value = 0.01
## alternative hypothesis: stationary
Again, we see a p-value less than 0.05, so we can reject the null hypothesis and assume our time series is stationary. The Box-Cox transformation above has already stabilized the variance of our seasonal fluctuations, so no further transformation should be needed.
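As a sanity check on the differencing decision, we can also ask feasts how many first and seasonal differences the transformed series would call for. These are KPSS-based features, so they may disagree somewhat with the ADF result; this check is my own addition:
# Estimated number of first and seasonal differences for the transformed series
usage %>% features(KWH_transformed, unitroot_ndiffs)
usage %>% features(KWH_transformed, unitroot_nsdiffs)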
Let’s also take a look at the autocorrelation and lag plots. We see some significant peaks in the PACF plot at lags 2 and 12, indicating we’ll want both non-seasonal and seasonal terms in our ARIMA model (see Hyndman for a similar example).
usage %>%
gg_tsdisplay(difference(KWH_transformed, 12) %>% difference(),
plot_type='partial', lag=36) +
labs(title="Seasonally differenced", y="")
The ACF plot still shows some strong seasonality, which is to be expected as we’re dealing with highly seasonal data.
First, let’s create some models aligning with the basic benchmark methods enumerated in Hyndman: the mean, naïve, and seasonal naïve methods.
usage_fit <- usage %>%
model(
Mean = MEAN(KWH_transformed),
`Naïve` = NAIVE(KWH_transformed),
`Seasonal naive` = SNAIVE(KWH_transformed)
)
usage_fit %>% forecast(h=5) %>% autoplot(usage)
We can look at the residuals of our basic prediction methods (the seasonal naïve method, in particular).
usage_fit %>% select("Seasonal naive") %>% gg_tsresiduals()
We see roughly normally-distributed residuals, but significant
autocorrelation at lag 12 in our seasonal naïve residuals.
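For a quick numeric baseline, we can also compare the benchmark methods on training-set accuracy. This is only an in-sample comparison (it says nothing about out-of-sample performance), and the comparison itself is my own addition:
# Training-set accuracy for the benchmark methods, best RMSE first
usage_fit %>%
  accuracy() %>%
  arrange(RMSE)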
We could also apply an ARIMA model here. In this case, we’ll want a seasonal ARIMA model, as we’re handling a time series with strong seasonal components. The significant lags at multiples of 12 in the plots above point to including seasonal terms in the model.
# Train an ARIMA model on usage data
usage_arima <- usage %>% model(
auto=ARIMA(KWH_transformed),
arima=ARIMA(KWH_transformed ~ pdq(2,1,0) + PDQ(0,1,1))
)
usage_fc <- usage_arima %>%
forecast(h=12)
usage_fc %>% autoplot(usage)
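Since we now have two candidate specifications, we can compare their information criteria via glance(). Bear in mind that AICc values are only directly comparable when the models share the same differencing orders, so treat this as a rough check rather than a definitive ranking:
# Compare the two ARIMA candidates by information criteria
usage_arima %>%
  glance() %>%
  select(.model, AIC, AICc, BIC)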
Let’s take a look at the residual plots of our auto-selected model.
usage_arima %>% select(auto) %>% gg_tsresiduals()
We see fairly large autocorrelations at lags 3 and 6, but our residuals do appear to be roughly normally distributed in this case.
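A Ljung-Box test puts a number on that visual impression. As before, the choice of lag = 24 (two yearly seasonal periods for monthly data) is a judgment call on my part:
# Ljung-Box test on the auto-selected model's innovation residuals
usage_arima %>%
  select(auto) %>%
  augment() %>%
  features(.innov, ljung_box, lag = 24)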
Lastly, we’ll write out our forecasted values for 2014. This file is available as a CSV on my GitHub here.
# Write forecasts to CSV: posted on GitHub
usage_fc %>% write_csv(file="../data/power-usage-predictions.csv")