      DATE                           ATM                 Cash
 Min.   :2009-05-01 00:00:00.00   Length:1474        Min.   :    0.0
 1st Qu.:2009-08-01 00:00:00.00   Class :character   1st Qu.:    0.5
 Median :2009-11-01 00:00:00.00   Mode  :character   Median :   73.0
 Mean   :2009-10-31 19:11:48.27                      Mean   :  155.6
 3rd Qu.:2010-02-01 00:00:00.00                      3rd Qu.:  114.0
 Max.   :2010-05-14 00:00:00.00                      Max.   :10919.8
                                                     NA's   :19
I will sort the ATM data in time order for ease of analysis:
atm_data <- atm_data |> arrange(DATE)
Next, I take a look at where the NAs in the Cash variable are coming from:
atm_data |> filter(is.na(Cash))
# A tibble: 19 × 3
DATE ATM Cash
<dttm> <chr> <dbl>
1 2009-06-13 00:00:00 ATM1 NA
2 2009-06-16 00:00:00 ATM1 NA
3 2009-06-18 00:00:00 ATM2 NA
4 2009-06-22 00:00:00 ATM1 NA
5 2009-06-24 00:00:00 ATM2 NA
6 2010-05-01 00:00:00 <NA> NA
7 2010-05-02 00:00:00 <NA> NA
8 2010-05-03 00:00:00 <NA> NA
9 2010-05-04 00:00:00 <NA> NA
10 2010-05-05 00:00:00 <NA> NA
11 2010-05-06 00:00:00 <NA> NA
12 2010-05-07 00:00:00 <NA> NA
13 2010-05-08 00:00:00 <NA> NA
14 2010-05-09 00:00:00 <NA> NA
15 2010-05-10 00:00:00 <NA> NA
16 2010-05-11 00:00:00 <NA> NA
17 2010-05-12 00:00:00 <NA> NA
18 2010-05-13 00:00:00 <NA> NA
19 2010-05-14 00:00:00 <NA> NA
From this, I can see that there are NAs for a few ATMs in June 2009, and then several NAs in May 2010 that have neither Cash nor ATM information. Since we are tasked with forecasting withdrawals for May 2010, I will drop the May 2010 rows that lack both ATM and Cash information. I will determine what to do with the June 2009 NAs later.
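A sketch of that filtering step (assuming the data frame is still named atm_data):
atm_data <- atm_data |>
  filter(!(is.na(ATM) & is.na(Cash)))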
The plots reveal interesting features of the data. ATM1 and ATM2 seem similar in nature, while ATM3 is unusual in that its withdrawals were consistently zero until recently. It is possible that this is a new ATM that was not in service until recently and that the data were entered as zeros rather than NAs. Lastly, ATM4 has one massive outlier, without which it might look more similar to ATM1 and ATM2.
Given that the time series contains daily data over a period of one year, building a forecast model will be difficult, as there are multiple seasonal periods involved (e.g., weekly, monthly, quarterly). Correcting for this would involve advanced methods, so for now I will work with the data as is and use ARIMA and ETS models to generate forecasts. An alternative would be to aggregate the data to weekly or monthly values; however, based on the prompt, I suspect that daily values are of more interest.
As seen in the plot above and discussed earlier, ATM1 has a few observations that are NAs:
atm_one |> filter(is.na(Cash))
# A tsibble: 3 x 3 [1D]
# Key: ATM [1]
ATM Cash Date
<chr> <dbl> <date>
1 ATM1 NA 2009-06-13
2 ATM1 NA 2009-06-16
3 ATM1 NA 2009-06-22
There are three missing values for ATM1, and based on the pattern of the time series, I will interpolate them with an average of the surrounding values, using the na_ma function from the imputeTS package:
library(imputeTS)
atm_one <- na_ma(atm_one, k = 1)
atm_one |> filter(is.na(Cash))
# A tsibble: 0 x 3 [?]
# Key: ATM [0]
# ℹ 3 variables: ATM <chr>, Cash <dbl>, Date <date>
Next, I will check whether a Box-Cox transformation can make the variance of the time series more stable, using the Guerrero method to select an appropriate lambda:
lambda_atm_one <- atm_one |>
  features(Cash, features = guerrero) |>
  pull(lambda_guerrero)
print(lambda_atm_one)
[1] 0.261571
The lambda is close to 0.25, so for simplicity I will transform the data by taking the fourth root.
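As a sketch, this is a one-line mutate; storing the transformed series back in the Cash column is my assumption, and box_cox(Cash, lambda_atm_one) would be the exact alternative:
# Approximate Box-Cox (lambda ≈ 0.25) with a fourth root
atm_one <- atm_one |> mutate(Cash = Cash^0.25)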
In order to create a forecast, I will split the data into a train and test set using an 80/20 split. I will try fitting an ETS model, an ARIMA model, and, for comparison purposes, a seasonal naive (SNAIVE) model to see which one performs best. I suspect that the Holt-Winters method with additive seasonality could work, since the seasonal fluctuations appear roughly constant in magnitude, as could a seasonal ARIMA model. The SNAIVE model will be used to evaluate whether the ETS or ARIMA models improve on a simple approach.
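The split, fits, and accuracy comparison look roughly like the following (a sketch; it assumes the fpp3 packages are loaded, that Cash already holds the transformed values, and the object names are mine):
# Hold out the last 20% of observations as a test set
n_train <- floor(0.8 * nrow(atm_one))
atm_one_train <- atm_one |> slice(1:n_train)

# Fit the three candidate models on the training data
fits_atm_one <- atm_one_train |>
  model(
    ETS    = ETS(Cash),
    ARIMA  = ARIMA(Cash),
    SNAIVE = SNAIVE(Cash)
  )

# Forecast over the test window and compare accuracy
fits_atm_one |>
  forecast(h = nrow(atm_one) - n_train) |>
  accuracy(atm_one)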
# A tibble: 3 × 8
.model ATM .type RMSE MAE MAPE MASE RMSSE
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 ARIMA ATM1 Test 0.814 0.540 27.7 2.74 2.15
2 ETS ATM1 Test 0.816 0.526 27.2 2.67 2.16
3 SNAIVE ATM1 Test 0.865 0.654 31.2 3.32 2.28
The ETS model has the lowest MAPE, so I will use it for forecasting. However, it should be noted that because we have daily data over a long stretch of time, there are multiple seasonal patterns involved (weekly, monthly, quarterly), which makes forecasting difficult and may explain why the MAPE of 27% is somewhat high. Correcting for this would require more advanced methods than we have covered, so for now I will go with the ETS model.
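A sketch of the final step, refitting ETS on the full (transformed) series and forecasting the 31 days of May 2010:
fc_atm1 <- atm_one |>
  model(ETS = ETS(Cash)) |>
  forecast(h = 31)

Turning to ATM2, it likewise has two missing values: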
# A tsibble: 2 x 3 [1D]
# Key: ATM [1]
ATM Cash Date
<chr> <dbl> <date>
1 ATM2 NA 2009-06-18
2 ATM2 NA 2009-06-24
I will use the same method as ATM1 to impute these values:
atm_two <- na_ma(atm_two, k = 1)
atm_two |> filter(is.na(Cash))
# A tsibble: 0 x 3 [?]
# Key: ATM [0]
# ℹ 3 variables: ATM <chr>, Cash <dbl>, Date <date>
Next, I will again check whether a Box-Cox transformation can stabilize the variance, using the Guerrero method to select a lambda:
lambda_atm_two <- atm_two |>
  features(Cash, features = guerrero) |>
  pull(lambda_guerrero)
print(lambda_atm_two)
[1] 0.7242767
The lambda is close to 0.75, so for simplicity I will transform the data by raising it to the power of three-fourths.
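As with ATM1, a one-line sketch of the transformation (same storage assumption as before):
# Approximate Box-Cox (lambda ≈ 0.75) with a three-fourths power
atm_two <- atm_two |> mutate(Cash = Cash^0.75)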
To create a forecast, I will again use an 80/20 train/test split and compare ETS, ARIMA, and SNAIVE models to see which one performs best.
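Since this repeats the ATM1 procedure, a small helper keeps things tidy (a sketch; compare_models is my own hypothetical function, and it assumes the series column is named Cash):
compare_models <- function(ts_data) {
  # 80/20 split, fit ETS/ARIMA/SNAIVE, and score on the held-out test set
  n_train <- floor(0.8 * nrow(ts_data))
  ts_data |>
    slice(1:n_train) |>
    model(
      ETS    = ETS(Cash),
      ARIMA  = ARIMA(Cash),
      SNAIVE = SNAIVE(Cash)
    ) |>
    forecast(h = nrow(ts_data) - n_train) |>
    accuracy(ts_data)
}

compare_models(atm_two)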
# A tibble: 3 × 8
.model ATM .type RMSE MAE MAPE MASE RMSSE
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 ARIMA ATM2 Test 19.1 17.3 Inf 3.02 2.29
2 ETS ATM2 Test 18.8 16.6 Inf 2.89 2.27
3 SNAIVE ATM2 Test 18.1 16.1 Inf 2.80 2.17
The MAPE is infinite for all three models because the test set contains days with zero withdrawals, so percentage errors are undefined; RMSE and MAE are more reliable here. Interestingly, both the ARIMA and ETS models performed worse than the SNAIVE model on RMSE and MAE, so I will use the SNAIVE model to forecast May.
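The corresponding forecast step, sketched as before:
fc_atm2 <- atm_two |>
  model(SNAIVE = SNAIVE(Cash)) |>
  forecast(h = 31)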
As discussed, ATM3 is unique in that something about it seems to have changed. For example, it is possible that the location does not usually receive much foot traffic but had an unusual day when someone made a withdrawal. It is also possible that the ATM moved from a low-traffic location to a higher-traffic one. This makes it difficult to forecast. I will generate a forecast for the purposes of the exercise, but more information is needed about the outlier before determining an appropriate forecast method.
There are no missing values in the dataset:
atm_three |> filter(is.na(Cash))
# A tsibble: 0 x 3 [?]
# Key: ATM [0]
# ℹ 3 variables: ATM <chr>, Cash <dbl>, Date <date>
I will forgo training and test sets since they would mostly contain zeros. Below are the fitted models (I used a NAIVE rather than a SNAIVE model as the benchmark).
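A sketch of those fits on the full ATM3 series (NAIVE instead of SNAIVE, since nearly all historical values are zero):
fits_atm3 <- atm_three |>
  model(
    ETS   = ETS(Cash),
    ARIMA = ARIMA(Cash),
    NAIVE = NAIVE(Cash)
  )
fits_atm3 |> forecast(h = 31)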
Turning to ATM4: this dataset has a large outlier, which seems to be due to some kind of exogenous event that is not normally present. I will keep the outlier as is for modeling purposes, assuming it is a valid measurement; however, it would be helpful to know what caused it.
Next, I check for NAs:
atm_four |> filter(is.na(Cash))
# A tsibble: 0 x 3 [?]
# Key: ATM [0]
# ℹ 3 variables: ATM <chr>, Cash <dbl>, Date <date>
Next, I will check whether a Box-Cox transformation can stabilize the variance, once again using the Guerrero method to select a lambda:
lambda_atm_four <- atm_four |>
  features(Cash, features = guerrero) |>
  pull(lambda_guerrero)
print(lambda_atm_four)
[1] -0.0737252
The lambda is close to 0, so for simplicity I will transform the data by taking the log of the data.
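A sketch of that step (this assumes ATM4's values are strictly positive; if any zeros were present, log() would produce -Inf and a Box-Cox or log1p variant would be needed instead):
atm_four <- atm_four |> mutate(Cash = log(Cash))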
As before, I will use an 80/20 train/test split and compare ETS, ARIMA, and SNAIVE models to see which one performs best.
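With the hypothetical helper sketched earlier, the comparison is one call:
compare_models(atm_four)

Next, I turn to the monthly power consumption data: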
# A tsibble: 6 x 2 [1M]
KWH Month
<dbl> <mth>
1 6862583 1998 Jan
2 5838198 1998 Feb
3 5420658 1998 Mar
4 5010364 1998 Apr
5 4665377 1998 May
6 6467147 1998 Jun
Next, I check to see if there are any missing values:
power_data |> filter(is.na(KWH))
# A tsibble: 1 x 2 [1M]
KWH Month
<dbl> <mth>
1 NA 2008 Sep
There is one missing value. I next plot the data to see what it looks like:
power_data |> autoplot(KWH)
Based on this graph, I will impute the missing value as the average of its two surrounding values, as this will be consistent with the pattern in the graph.
power_data <- na_ma(power_data, k = 1)
power_data |> filter(is.na(KWH))
# A tsibble: 0 x 2 [?]
# ℹ 2 variables: KWH <dbl>, Month <mth>
The variance of the time series does not seem constant, so I will use the Guerrero method to find an appropriate lambda for a Box-Cox transformation:
lambda <- power_data |>
  features(KWH, features = guerrero) |>
  pull(lambda_guerrero)
print(lambda)
[1] 0.1073943
The lambda is close enough to zero that, for simplicity, I will go with a log transformation.
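A sketch of the transformation (the KWH values shown above are in the millions, so the log is well defined):
power_data <- power_data |> mutate(KWH = log(KWH))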
There is a significant outlier in early 2010. Since there is a recorded value, it seems like a legitimate measurement, likely caused by some kind of exogenous shock (perhaps a blackout or something similar). Assuming the outlier is a legitimate measurement, I will leave it as is so that the models can capture potential shocks in the data.
Because the time series is relatively long, I will create a training and test set on which to evaluate the models before deciding which one to use for the 2014 forecast. I will again use an 80/20 split.
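The split and comparison follow the same pattern as the ATM series (a sketch; the chosen model would then be refit on the full series and forecast with h = 12 to cover 2014):
n_train <- floor(0.8 * nrow(power_data))
power_data |>
  slice(1:n_train) |>
  model(
    ETS    = ETS(KWH),
    ARIMA  = ARIMA(KWH),
    SNAIVE = SNAIVE(KWH)
  ) |>
  forecast(h = nrow(power_data) - n_train) |>
  accuracy(power_data)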
We are required to export the forecasts to an Excel file.
Starting with the ATM data, I create an output data frame for each ATM containing the point forecasts and prediction intervals on the original scale.
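One possible way to build and write those outputs, using the writexl package (an assumption; openxlsx would work too). Here fc_atm1 is the ETS fable sketched earlier, and raising values to the fourth power undoes the fourth-root transform:
library(writexl)

atm1_out <- fc_atm1 |>
  hilo(level = 95) |>
  as_tibble() |>
  mutate(
    point_forecast = .mean^4,        # back-transform the point forecast
    lower_95       = `95%`$lower^4,  # back-transform the interval bounds
    upper_95       = `95%`$upper^4
  ) |>
  select(Date, point_forecast, lower_95, upper_95)

write_xlsx(list(ATM1 = atm1_out), "atm_forecasts.xlsx")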