Part A

In this part, we’ll be attempting to forecast how much cash is taken out of 4 different ATMs during May 2010. Our dataset is provided for us, and available on my GitHub.

Data Wrangling and Visualization

Our DATE column looks to be an Excel serial date (the number of days since Excel’s day zero, 1899-12-30), so we’ll fix that on read-in.

# Read in our ATM data
atm <- read_excel("../data/ATM624Data.xlsx")

# Store ATM data in a tsibble object
atm <- atm %>%  
  mutate(DATE = as.Date(DATE, origin = "1899-12-30")) %>% 
  as_tsibble(index = DATE, key = ATM) %>% 
  fill_gaps()
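As a quick sanity check on that 1899-12-30 origin (Excel’s day zero), we can convert a single serial number by hand; 40000 here is just an arbitrary value that should land in the middle of our date range:

# Excel serial 40000 should convert to a date in mid-2009 if the origin is right
as.Date(40000, origin = "1899-12-30")  # "2009-07-06"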

To start off with, let’s plot our atm tsibble to get a sense of the data. We see that we have data for 4 ATMs, with some NA values for the amount of cash withdrawn on certain dates, plus a few outlier points that we’ll need to deal with.

atm %>% autoplot() + labs(x = "Date", y="Cash Withdrawn (Hundreds of Dollars)", title="Cash withdrawn from ATMs 1-4: May 2009 - May 2010")

Missing ATM Labels

The NA values for the ATM variable all occur in May 2010, which is the month we are trying to predict. We should be able to drop these rows for now, with the intention of “re-adding” them later when we forecast.

# Filter null ATM rows for later forecasting replacement
atm <- atm %>% filter(!is.na(ATM))
atm %>% autoplot() + 
  labs(x = "Date",
       y="Cash Withdrawn (Hundred of Dollars)",
       title="Cash withdrawn from ATMs 1-4: May 2009 - May 2010")

Imputation

As a result of our fill_gaps call above, we’re left with some NA values in our time series. A quick dplyr aggregation shows that the mean cash withdrawn varies quite a bit between the machines; these per-ATM means will serve us later for imputation.

mean_cash_by_atm <- atm %>% 
     filter(!is.na(Cash)) %>%
     as_tibble() %>% 
     group_by(ATM) %>% 
     summarise(Cash = mean(Cash))
mean_cash_by_atm
## # A tibble: 4 × 2
##   ATM      Cash
##   <chr>   <dbl>
## 1 ATM1   83.9  
## 2 ATM2   62.6  
## 3 ATM3    0.721
## 4 ATM4  474.

As such, it makes more sense to impute the amount of cash withdrawn based on the per-ATM mean, rather than taking the blanket mean of our Cash variable. This will help us handle machines with very different usage, like ATM3, better.

# Impute with mean value per ATM (hence the group_by)
atm <- atm %>% group_by(ATM) %>%
  mutate(Cash = ifelse(is.na(Cash), 
                       mean(Cash, na.rm=TRUE), 
                       Cash)) %>%
  ungroup()

atm %>% autoplot()
## Plot variable not specified, automatically selected `.vars = Cash`

Outliers

We see a strong outlier point around February 2010, in which it appears a value of \(\$1,092,000\) was withdrawn. There is a chance that this outlier is due to an error in the data collection process, so we’ll have to handle it as well. For our purposes, we can replace that outlier value with a simple mean of the time series.

atm_with_outlier <- atm # Save TS with outlier for later comparison
outlier_limit <- 3000
atm_outlier <- atm %>% 
  filter(Cash > outlier_limit) %>%
  mutate(Cash = mean(atm$Cash[atm$Cash <= outlier_limit])) # exclude the outlier itself from the mean

# Remove outlier row and add in "imputed" value
atm <- atm %>% filter(Cash < outlier_limit)

atm <- bind_rows(atm, atm_outlier)

# Plot 
atm %>% autoplot()
## Plot variable not specified, automatically selected `.vars = Cash`

Finally, let’s plot our time series without ATM4, whose much larger withdrawals drown out the other ATMs in terms of scale.

atm %>% filter(ATM != "ATM4") %>% autoplot(Cash)

Decomposition

These time series do not appear to have a strong overall trend upwards or downwards. However, there does appear to be a seasonal component. In our case, the seasonal variation does not change with time, so an additive decomposition may be better than a multiplicative one:

atm  %>% model(
    classical_decomposition(Cash, type = "additive")
  ) %>%
  components() %>%
  autoplot() +
  labs(title = "Classical Decomposition fo ATM data")

Modeling

Now we need to actually build a model to predict the amount of cash taken out of each ATM for May 2010. One consideration, if we build an ARIMA model, is whether our time series is stationary. We can check this via the Augmented Dickey-Fuller test, with a null hypothesis that our data is non-stationary and a significance level \(\alpha=0.05\). We can leverage the adf.test function from the R tseries library.

adf.test(atm$Cash)
## Warning in adf.test(atm$Cash): p-value smaller than printed p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  atm$Cash
## Dickey-Fuller = -4.4473, Lag order = 11, p-value = 0.01
## alternative hypothesis: stationary

In this case, we see a p-value of 0.01, so we can reject the null hypothesis and treat our time series as stationary. This means we won’t have to apply differencing to our data.
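As a cross-check within the fable ecosystem (a sketch; note that feasts’ unitroot_ndiffs feature is based on the KPSS test rather than the ADF test), we can also ask how many first differences each ATM series appears to need:

# Suggested number of first differences per ATM (KPSS-based)
atm %>% features(Cash, unitroot_ndiffs)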

Let’s also take a look at the ACF plot of our data. We see significant autocorrelation at weekly lags (7 days). In simpler terms, ATM users tend to withdraw similar amounts depending on the day of the week. This makes sense, as people likely have certain days of the week (e.g., Fridays) where they tend to withdraw cash.

# Plot the ACF of our ATM time series
atm %>% ACF(Cash) %>% autoplot()
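One way to see the day-of-week effect directly is a weekly seasonal plot; this is a sketch using feasts’ gg_season with a weekly period, shown for ATM1 only:

# Overlay each week of ATM1 withdrawals to expose the day-of-week pattern
atm %>%
  filter(ATM == "ATM1") %>%
  gg_season(Cash, period = "week") +
  labs(y = "Cash Withdrawn (Hundreds of Dollars)", title = "ATM1: day-of-week pattern")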

We can also look at a lag plot for our data. Here we see a similar pattern from our ACF plot, with a strong autocorrelation at a lag of 7 (weekly).

atm %>% filter(ATM == "ATM1") %>% gg_lag(geom="point")
## Plot variable not specified, automatically selected `y = Cash`

Now we can fit an ARIMA model to the ATM dataset. When using ARIMA in fable, the default behavior is that model parameters are automatically selected. We’ll leverage this behavior in determining our ARIMA parameters \(p, d, q\).

# Auto-select ARIMA parameters
atm_fit <- atm %>% 
  model(ARIMA(Cash))

atm_fit
## # A mable: 4 x 2
## # Key:     ATM [4]
##   ATM                      `ARIMA(Cash)`
##   <chr>                          <model>
## 1 ATM1          <ARIMA(0,0,1)(0,1,2)[7]>
## 2 ATM2          <ARIMA(2,0,2)(0,1,1)[7]>
## 3 ATM3                    <ARIMA(0,0,2)>
## 4 ATM4  <ARIMA(3,0,2)(1,0,0)[7] w/ mean>

Our mable for the ATM dataset contains one ARIMA model per key (in this case, per ATM series). Each fitted model has different values for its \(p, d, q\) (and seasonal) parameters.
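If we wanted fable to consider a wider set of candidate models, we could disable the default stepwise search and likelihood approximation; this is a sketch (atm_fit_full is a hypothetical name) and is noticeably slower:

# A more exhaustive (and slower) ARIMA search; results may or may not differ from the stepwise fit
atm_fit_full <- atm %>%
  model(ARIMA(Cash, stepwise = FALSE, approximation = FALSE))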

# Forecast and plot
atm_fc <- atm_fit %>% 
  forecast(h=31) # the 31 days of May 2010

# Plot forecast per ATM
atm_fc %>% autoplot(atm) + 
  labs(x="Date",
       y="Cash Withdrawn",
       title="ARIMA models applied to ATM Withdrawal Datasets")

We can create residual plots for each ATM to visually inspect our errors. We hope for normally-distributed residuals and no significant autocorrelations in our errors.

# Plot residuals per ATM
atms <- c("ATM1", "ATM2", "ATM3", "ATM4")
for (atm in atms){
  plot <- atm_fit %>%
    filter(ATM == atm) %>%
    gg_tsresiduals() +
    labs(title=atm)
  print(plot)
}
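We can complement the visual residual checks with a Ljung-Box test per ATM; a sketch, using lag = 14 to cover two weekly seasonal periods (the degrees-of-freedom adjustment, which depends on each model’s number of coefficients, is omitted here):

# Ljung-Box test on the innovation residuals of each fitted ARIMA model
atm_fit %>%
  augment() %>%
  features(.innov, ljung_box, lag = 14)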

Finally, we can write our output dataset to a CSV which can be read into Excel. The output can be found on my GitHub here.

# Write predictions to CSV
atm_fc %>% 
  as_tibble() %>% # drop the fable structure (and its distribution column) before writing to a flat file
  dplyr::select(ATM, DATE, .mean) %>%
  dplyr::rename(prediction = .mean) %>%
  write_excel_csv(file="../data/atm-predictions.csv")

Part B

In this part, we’ll be looking at residential power usage between 1998 and 2013. First, we’ll read in the dataset and perform some basic wrangling to get it into a shape suitable for time series forecasting.

# Read in our Power Usage data
usage <- read_excel("../data/ResidentialCustomerForecastLoad-624.xlsx")

# Store power usage data in a tsibble object
usage <- usage %>% 
     rename(Date = `YYYY-MMM`) %>% 
     mutate(Date = yearmonth(Date)) %>%
     as_tsibble(index=Date)

To start, we’ll plot our power usage over time to get a visual sense of the data.

usage %>% autoplot(KWH) + 
  labs(x="Date",
       y="Power Usage (kilowatt-hours)",
       title="Power Usage Over Time")

Visually, we see an outlier point around July 2010 that we’ll likely need to deal with. We also see some missing values in addition to our seasonal pattern.

Data Wrangling

We need to deal with 2 data quality issues before we can build a predictive model and forecast:

  1. Missing values
  2. Outliers

First, we’ll impute missing values with the mean value of our time series.

# Naive mean imputation
usage$KWH <- ifelse(is.na(usage$KWH), mean(usage$KWH, na.rm=TRUE), usage$KWH)

# Should be empty
usage %>% filter(is.na(KWH))
## # A tsibble: 0 x 3 [?]
## # ℹ 3 variables: CaseSequence <dbl>, Date <mth>, KWH <dbl>
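Had we wanted something fancier than mean imputation, a seasonality-aware alternative is to fit a model to the series while it still has gaps and let it interpolate the missing months; a sketch (not used here) with fable’s interpolate(), where usage_interp is a hypothetical name:

# Seasonality-aware imputation sketch: fit an ARIMA to the gappy series, then fill the gaps
# (this would need to run *before* the mean imputation above, while NAs are still present)
usage_interp <- usage %>%
  model(ARIMA(KWH)) %>%
  interpolate(usage)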

Next we’ll deal with our outlier point as well, replacing it with the series mean. We’ll also apply a Box-Cox transformation, using the Guerrero method to select \(\lambda\), to stabilize the seasonal variance before modeling.

# Replace the (unusually low) outlier point with the mean
limit <- 3000000
usage$KWH <- ifelse(usage$KWH < limit, mean(usage$KWH, na.rm=TRUE), usage$KWH)


# Use the Guerrero method to select an optimal Box-Cox lambda
lambda <- usage %>% 
  features(KWH, features = guerrero) %>%
  pull(lambda_guerrero)

usage$KWH_transformed <- box_cox(usage$KWH, lambda)

usage %>% autoplot(KWH_transformed) + 
  labs(x="Month", y="Kilowatt-Hours (Box-Cox)", title="Power Usage: Box-Cox Transformed")

Modeling

Let’s check again for stationarity in our transformed dataset via the Augmented Dickey-Fuller test, as above.

adf.test(usage$KWH_transformed)
## Warning in adf.test(usage$KWH_transformed): p-value smaller than printed
## p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  usage$KWH_transformed
## Dickey-Fuller = -4.5238, Lag order = 5, p-value = 0.01
## alternative hypothesis: stationary

Again, we see a p-value less than 0.05, so we can reject the null hypothesis and treat our transformed time series as stationary. Note that we already stabilized the seasonal variance with the Box-Cox transformation above, so no further variance-stabilizing transformation is needed.
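As a quick cross-check (a sketch; these features are KPSS-based rather than ADF-based), we can also ask how many seasonal and first differences the transformed series appears to need:

# Suggested seasonal and first differences for the transformed series
usage %>%
  features(KWH_transformed, list(unitroot_nsdiffs, unitroot_ndiffs))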

Let’s also take a look at the autocorrelation and lag plots. We see some significant spikes in the PACF plot at lags of 2 and 12, indicating we’ll want both seasonal and non-seasonal terms in our ARIMA model (see Hyndman for a similar example).

usage %>%
  gg_tsdisplay(difference(KWH_transformed, 12) %>% difference(),
               plot_type='partial', lag=36) +
  labs(title="Seasonally differenced", y="")

The ACF plot shows strong seasonality, which is to be expected for monthly residential power usage.

First, let’s create some baseline models aligning with the basic methods enumerated in Hyndman: the mean, naïve, and seasonal naïve methods.

usage_fit <- usage %>%
  model(
    Mean = MEAN(KWH_transformed),
    `Naïve` = NAIVE(KWH_transformed),
    `Seasonal naive` = SNAIVE(KWH_transformed)
  )

usage_fit %>% forecast(h=5) %>% autoplot(usage)

We can look at the residuals of our basic prediction methods (the seasonal naïve method, in particular).

usage_fit %>% select("Seasonal naive") %>% gg_tsresiduals()

We see roughly normally-distributed residuals, but a significant autocorrelation at lag 12 in our seasonal naïve residuals.
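We can also compare the baseline fits numerically; a sketch using in-sample accuracy measures (training-set errors only, so treat this as a rough comparison rather than a true out-of-sample evaluation):

# In-sample accuracy of the baseline models, ordered by RMSE
usage_fit %>%
  accuracy() %>%
  select(.model, RMSE, MAE, MAPE) %>%
  arrange(RMSE)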

We could also apply an ARIMA model here. In this case, we’ll want a seasonal ARIMA model, as we’re handling a time series with strong seasonal components. The significant lags at multiples of 12 point to a seasonal period of 12 months.

# Train an ARIMA model on usage data
usage_arima <- usage %>% model(
  auto=ARIMA(KWH_transformed),
  arima=ARIMA(KWH_transformed ~ pdq(2,1,0) + PDQ(0,1,1))
  )
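We can peek at information criteria for the two ARIMA candidates; a sketch, with the caveat that AICc values are only directly comparable between models with the same differencing orders:

# Compare the auto-selected and hand-specified ARIMA models by information criteria
usage_arima %>%
  glance() %>%
  select(.model, AIC, AICc, BIC)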

usage_fc <- usage_arima %>% 
  forecast(h=12)

usage_fc %>% autoplot(usage)

Let’s take a look at the residual plots of our auto-selected model.

usage_arima %>% select(auto) %>% gg_tsresiduals()

We see fairly large residual autocorrelations at lags of 3 and 6, but our residuals do appear to be roughly normally distributed in this case.
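A Ljung-Box test gives a numerical complement to that visual check; a sketch using lag = 24 (two seasonal periods), again omitting the degrees-of-freedom adjustment for the fitted coefficients:

# Ljung-Box test on the auto-selected model's innovation residuals
usage_arima %>%
  select(auto) %>%
  augment() %>%
  features(.innov, ljung_box, lag = 24)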

Lastly, we’ll write our forecasted values for 2014. This file is available as a CSV on my GitHub here.

# Write predictions to CSV: posted on GitHub
usage_fc %>% 
  as_tibble() %>% # drop the fable structure (and its distribution column) before writing to a flat file
  select(.model, Date, .mean) %>%
  write_csv(file="../data/power-usage-predictions.csv")